CN111339407A - Implementation method of information extraction cloud platform - Google Patents
- Publication number
- CN111339407A CN202010100115.3A
- Authority
- CN
- China
- Prior art keywords
- relation
- extraction
- vector
- sequence
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention provides an implementation method for an information extraction cloud platform using methods from the field of natural language processing. Through three steps (data acquisition; design of the relation extraction method; and establishment of the relation extraction model with calculation and output of the result), it connects every stage of distant-supervision-based information extraction. It adopts a hybrid model in which template-based and machine-learning-based extraction complement each other, and it schedules network and computing resources to acquire the training text and train the information extraction model, thereby making up for the shortcomings of traditional information extraction methods.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to an implementation method of an information extraction cloud platform.
Background
With the development of the internet and the continuous growth of online information, more and more information can be retrieved through search engines. Search results are massive in volume, diverse in form, and broad in coverage. This raises the chance that the answer a user wants is somewhere in the results, but it also makes it hard for the user to locate the required information quickly and accurately. Obtaining useful information quickly and accurately from massive data is an urgent need of the information age, and this need has made information extraction a research hotspot in natural language processing. Information extraction techniques extract structured information from semi-structured or unstructured information. The focus of this patent is extracting relation triples, each consisting of entity 1, a relation, and entity 2, from plain-text sentences. Information extraction can semi-automatically extract key information from massive data to build a knowledge graph and help people make better use of big data to solve problems. Its efficiency and convenience have made it a widely followed research topic in urgent need of development.
Currently, there are two main methods for relationship and entity extraction:
1. Template-based methods rely mainly on rules that can identify structured information, such as grammar rules written by engineers for extracting information. This approach is laborious to implement, cannot cover all cases, and generalizes poorly, so it often yields only suboptimal results. When extraction targets a specific sub-domain, however, it can effectively complement the results of a machine-learning-based method.
2. Machine-learning-based methods train a deep learning model on a large amount of existing labeled data and extract relations and entities with the trained model. For example, entities are extracted with a model such as a dilated convolutional neural network, and a relation extraction model is obtained by training a bidirectional GRU with an attention network on the extracted entities.
Although information extraction has made significant breakthroughs in recent years along with the rapid progress of deep learning, several problems in this field still need to be solved urgently. First, compared with English, building a Chinese information extraction model is more complicated and difficult. Chinese sentences have complex syntactic structure, character-level and word-level ambiguity is hard to resolve, and semantic expression is flexible and varied, all of which add uncertainty to training. Labeled Chinese training text is also scarce, which increases the probability that the deep learning model under-fits. When training data is lacking, automatically obtaining a large amount of labeled Chinese text is therefore of great importance. Second, although many researchers have proposed improved methods for Chinese entity extraction and relation extraction, few have merged these processes into an automated tool that others can conveniently call. In the fast-paced information age, how to quickly read data, automatically schedule computing resources for training, and obtain an information extraction model is a pressing pain point. Finally, conventional information extraction methods rely mainly on a specific data set and relation list, and the trained method applies only to that data set. Once the relation list needs to be extended, it is difficult to train an information extraction method suited to the new list from the existing data set. For a specific field, neither a purely template-based extraction method nor a purely machine-learning-based one can solve the problem in a real scenario.
A practical hybrid extraction model is therefore also a sound engineering strategy in the field of information extraction.
There is thus still a need for a cloud platform implementation method that makes full use of network resources, automatically extracts labeled Chinese text from the network, trains a model on that text, ensures both the accuracy and the breadth of information extraction, and integrates the whole information extraction process.
In recent years a wave of deep learning has swept the world and, supported by massive resources and strong computing power, has influenced every area of natural language processing. Knowledge graphs organized as entities and relations are widely used in search engines and question-answering systems. Because the scale of knowledge is large and manual annotation is expensive, new knowledge cannot be added by manual annotation alone. To add richer world knowledge to knowledge graphs as promptly and accurately as possible, researchers have worked on methods for acquiring world knowledge efficiently and automatically, namely information extraction techniques. Information extraction has become a research focus in recent years, but traditional information extraction methods and tools share the problems pointed out above.
Disclosure of Invention
To achieve the above object, the present invention designs a complete model construction process for information extraction in a vertical domain, comprising distant-supervision training data acquisition, entity recognition, data labeling, a distant-supervision relation extraction algorithm, and a rule-based relation extraction algorithm. The implementation method of the information extraction cloud platform is built on these algorithms.
The method is divided into three steps.
Step one: data acquisition. First, the user inputs a selected field and an initial relation set, from which a knowledge base containing the entities and relations in the data is acquired. Then, a text library is acquired through distant supervision using a trained distant-supervision acquisition method. Finally, a named entity recognition method is applied and the knowledge base is used for data annotation.
Step two: design of the relation extraction method, namely a sentence-level attention relation extraction method. Instead of using a word vector as the representation of each word in the sentence, the input sequence is represented by a bidirectional gated recurrent unit (GRU). The method can be divided into five parts: an input layer, i.e. the input plain-text sentence; a low-dimensional embedding layer, which maps each word in the input sentence to a low-dimensional vector, for example a pre-trained low-dimensional word vector; a high-dimensional embedding layer, which obtains a high-dimensional embedding from the low-dimensional embedding through the bidirectional GRU; a sentence-level attention layer, which generates a weight vector and multiplies the word-level high-dimensional embeddings at each time step by it, combining them into a sentence-level feature vector; and a classification layer, which completes softmax relation classification using the resulting sentence-level feature vector.
Step three: establishment of the relation extraction model and calculation of the output result. This relation extraction model is a rule-based information extraction method that relies on dependency syntax analysis and extracts triples from the dependency parsing result. It is a second relation extraction method that compensates for the limitation that the deep-learning-based method in step two can only extract relations from a limited relation list. Used together, the two extraction methods extract relation triples more accurately and comprehensively.
The training process of the trained distant-supervision acquisition method is as follows:
First, according to the field selected by the user and the given initial relation set, further relations of the vertical domain and the entity pairs under each relation are acquired from a Chinese knowledge graph by a data crawling script. The crawling script is built on the Newspaper3k library and the Beautiful Soup library, both open source on GitHub. The Beautiful Soup library is used to loop over the triple information for each relation, and a search URL is constructed for each entity pair to obtain the corresponding news. The web page text extraction function of the Newspaper3k library then extracts the relevant sentences as data for the given relation. This achieves the goal of acquiring from the internet sentences that mention both entities of a triple as data for the corresponding relation.
The named entity recognition method is as follows:
A dilated convolutional neural network is trained to extract the features of the sequence, and a conditional random field method completes the sequence labeling. The structure consists mainly of a dilated convolutional neural network module, which is trained on the input sequence features using the embedding vector of each character, and a conditional random field module, which takes the sequence features output by the dilated convolutional neural network module. The conditional random field module finally completes the sequence labeling, i.e. named entity recognition.
The conditional random field module is a linear-chain conditional random field model that uses the Viterbi algorithm to decode the best label sequence from the state-transition scores. The sequence labeling objective is defined in terms of the following quantities: a local factor; a pairwise factor that scores consecutive labels; the partition function $Z_x$; and $F(x)$, the features of the input sequence $x$. To avoid overfitting, the pairwise factor in the conditional random field is independent of both the time step $t$ and the input sequence $x$. Prediction with the conditional random field is performed by a global search using the Viterbi algorithm, where $x$ is the input sequence and $y$ is the output sequence.
A dilated convolutional neural network computes a feature representation for each element $x_t$ of the input sequence $x$ and outputs $c_t$. In the corresponding formula, a concatenation operator denotes the full concatenation of vectors, and $W_c$ is a convolution kernel, i.e. the convolution matrix applied over a sliding window of width $r$ in the input sequence. The distinguishing feature of a dilated convolutional neural network is the dilation width $\delta$ applied to the sliding window.
The specific implementation manner of the relationship extraction method is as follows:
A bidirectional gated recurrent unit (GRU) and a word-level attention mechanism are used to obtain a representation of the input sequence. The input sequence $x$ first passes through GRUs in the forward and backward directions to obtain a sequence representation. That is, for position $t$ in the input sequence, the method obtains the sequence vector represented by the bidirectional GRU through the following formulas:
$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$
$$u_t = \sigma(W_u x_t + U_u h_{t-1})$$
where $\sigma$ denotes the logistic sigmoid function, $x_t$ is the input, and $h_{t-1}$ is the previous hidden state. $h_t$, $r_t$ and $u_t$ denote the hidden state, the reset gate state and the update gate state at position $t$, and $W_r$, $W_u$, $W_c$, $U_r$, $U_u$, $U_c$ are the weight matrices of the unit. The vector at position $t$ obtained by the bidirectional GRU is the vector $h_t$ formed by concatenating the forward hidden state, computed by a GRU running from front to back, with the backward hidden state, computed by a GRU running from back to front. After the combined matrix $H = [h_1, \ldots, h_T]$ is obtained, sentence-level attention is used for relation classification. The sentence-level attention mechanism is as follows:
$$M = \tanh(H)$$
$$\alpha = \mathrm{softmax}(w^{T} M)$$
$$r = H \alpha^{T}$$
where $w$ is a weight vector, $\alpha$ is the attention vector, and $r$ is the sentence feature vector used in the final classification training of the relation extraction method.
The attention mechanism was originally proposed by DeepMind for image classification; it allows "the neural network to focus more on relevant parts of the input and less on irrelevant parts when performing the prediction task". In natural language processing, attention mechanisms were first used in machine translation. A schematic of a general attention mechanism is shown in FIG. 1. The core of the attention mechanism is to map a sequence $K$, the keys, to attention weights $a$ of dimension $d_k$, where $K$ may be the word vectors or character vectors of the text. In most cases another input element $q$, called the query, is used as a reference when computing the attention distribution. If a query is defined, the attention mechanism emphasizes the input elements that are relevant to the task according to the query; if no query is defined, it emphasizes the input elements that are inherently relevant to the task.
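The three sentence-level attention formulas (M = tanh(H), α = softmax(wᵀM), r = Hαᵀ) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function name `sentence_attention` and the representation of H as a list of T hidden vectors are assumptions made for the example, and in a trained model `w` would be a learned parameter.

```python
import math

def sentence_attention(H, w):
    """Sentence-level attention over hidden vectors.

    H : list of T hidden vectors h_t, each of length d (the columns of H)
    w : weight vector of length d
    Returns (alpha, r): the attention weights and the sentence feature vector.
    """
    # M = tanh(H), then one score per position: w^T m_t
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h)) for h in H]
    # alpha = softmax(scores); subtracting the max keeps exp() stable
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # r = H alpha^T: attention-weighted sum of the hidden vectors
    d = len(H[0])
    r = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(d)]
    return alpha, r
```

The resulting vector `r` has the same dimension as a single hidden vector, so sentences of different lengths yield a fixed-size input for the softmax relation classifier.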
The specific implementation of extracting triples from the result of dependency syntax analysis in the relation extraction method is as follows: triple extraction rules are set based on the attributive relation, the postposed-attributive relation, and the subject-predicate-object relation.
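As an illustration of one such rule, the sketch below extracts subject-predicate-object triples from an already parsed sentence. It is a simplified example under stated assumptions: the parser itself is not shown, the function name `extract_svo` is hypothetical, and the labels "SBV" (subject-verb) and "VOB" (verb-object) are assumed here in the style of common Chinese dependency tag sets; the labels of the parser actually used should be substituted. The attributive rules are not covered.

```python
def extract_svo(tokens):
    """Extract (subject, predicate, object) triples from a dependency parse.

    tokens : list of (word, head_index, relation) tuples, where head_index
             is 1-based and 0 marks the root of the sentence.
    "SBV" and "VOB" are assumed label names, not taken from the patent.
    """
    subjects, objects = {}, {}
    for word, head, rel in tokens:
        if rel == "SBV":
            subjects.setdefault(head, []).append(word)
        elif rel == "VOB":
            objects.setdefault(head, []).append(word)
    triples = []
    # A verb that governs both a subject and an object yields a triple.
    for head, subs in subjects.items():
        verb = tokens[head - 1][0]
        for s in subs:
            for o in objects.get(head, []):
                triples.append((s, verb, o))
    return triples
```

Because the predicate is taken directly from the sentence, a rule like this is not tied to a fixed relation list, which is the property the rule-based extractor contributes to the hybrid model.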
The technical scheme of this application addresses problems that urgently need solving in natural language processing, relation extraction, and entity recognition. For the problem of insufficient labeled data, distant supervision is adopted: a large amount of training data containing entities is obtained from big data on the internet, and the training data is further screened by distant-supervision noise reduction to ensure its quality. For the shared need for information extraction across different fields, an information extraction cloud platform method is proposed and implemented. It connects every stage of distant-supervision-based information extraction, adopts a hybrid model in which template-based and machine-learning-based extraction complement each other, and schedules network and computing resources to acquire the training text and train the information extraction model. Users can thus conveniently perform information extraction for a specific field and enrich that field's knowledge graph to support the applications built on it.
Drawings
FIG. 1 is a schematic diagram of the attention mechanism;
FIG. 2 is the system flow diagram;
FIG. 3 is the data acquisition flow diagram;
FIG. 4 is the request address for Sina News;
FIG. 5 is the request address for Baidu News;
FIG. 6 is the overall architecture of the named entity recognition method;
FIG. 7 is the overall structure of the relation extraction method;
FIG. 8 is an example of a dependency syntax analysis result;
FIG. 9 shows the dependency syntax relation types.
Detailed Description
The following is a preferred embodiment of the present invention and is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
To achieve the above object, the present invention designs a complete model construction process for information extraction in a vertical domain, comprising distant-supervision training data acquisition, entity recognition, data annotation, a distant-supervision relation extraction algorithm, and a rule-based relation extraction algorithm. The information extraction system flow is shown in FIG. 2.
Data acquisition and data labeling use distant supervision, which to some extent alleviates both the lack of entity labels in the named entity recognition task and the shortage of relation data sets in the relation extraction task. Relation triple extraction is then completed with a deep-learning-based relation extraction algorithm, so that relation triples matching the existing relation list are obtained more accurately. The rule-based extraction method can to some extent mitigate the error propagation between the entity recognition and relation extraction models, and it improves extraction accuracy.
Distant-supervision training data acquisition:
First, according to the field selected by the user and the given initial relation set, richer relations of the vertical domain and the entity pairs under each relation are acquired from a Chinese knowledge graph. The data acquisition flow is shown in FIG. 3.
The person-relation field is selected as the case field. It contains entities of many kinds, such as people, places, schools, and film and television works, and relations of many kinds, such as nationality, height, and date of birth.
To crawl data of sufficient quantity and quality, a detailed data crawling script is designed to keep the crawler's output reliable. For the implementation, the system uses the Newspaper3k library and the Beautiful Soup library, both open source on GitHub, and builds a complete crawler that goes from constructing a news search request to crawling the corresponding news sentences.
The first task is to construct access requests for Baidu News and Sina News. Visiting the search bar of the news website shows that the source address of Sina News is "http:// search. site. com. cn", and access requests with the two entities as keywords can be constructed as the addresses shown in FIG. 4 and FIG. 5.
The entity pair of each triple can then be used as keywords to search for news.
The Beautiful Soup library is used to loop over the triple information for each relation, and a search URL is constructed for each entity pair to obtain the corresponding news. The web page text extraction function of the Newspaper3k library then extracts the relevant sentences as data for the given relation, which achieves the goal of acquiring from the internet sentences that mention both entities of a triple as data for the corresponding relation.
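The two pure-logic parts of this crawler, building a search URL from an entity pair and keeping the sentences that mention both entities, can be sketched as follows. This is a simplified illustration under stated assumptions: `SEARCH_BASE` is a placeholder endpoint (the real request addresses are those of FIG. 4 and FIG. 5), the query parameter name `q` and all function names are hypothetical, and page download and text extraction, which the source performs with Beautiful Soup and Newspaper3k, are left out so that the example needs only the standard library.

```python
import re
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real Baidu or Sina request address.
SEARCH_BASE = "https://news.example.com/search"

def build_search_url(head, tail, base=SEARCH_BASE):
    """Build a news-search URL that uses both entities as keywords."""
    return base + "?" + urlencode({"q": f"{head} {tail}"})

def split_sentences(text):
    """Split text after Chinese or Western sentence-ending punctuation."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]

def label_sentences(text, triples):
    """Distant supervision: a sentence that mentions both entities of a
    triple is kept as a (possibly noisy) example of that relation."""
    examples = []
    for sentence in split_sentences(text):
        for head, relation, tail in triples:
            if head in sentence and tail in sentence:
                examples.append((sentence, head, relation, tail))
    return examples
```

For instance, `label_sentences(article_text, [("姚明", "出生地", "上海")])` keeps only the sentences of `article_text` that contain both "姚明" and "上海". Sentences matched this way may not actually express the relation, which is why the noise-reduction step matters.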
To extract as many entities as possible from plain-text data, the named entity recognition method is an information extraction method that labels sequences with a conditional random field on top of a dilated convolutional neural network.
The core idea is similar to the traditional method of named entity recognition with a long short-term memory network. The conventional method uses a long short-term memory network (LSTM) or a gated recurrent unit (GRU) to turn text into feature information, i.e. each character or word is represented by a vector of the same dimension. A model such as a hidden Markov model or a conditional random field then computes the state-transition probabilities for each character or word in order to label it. This approach is common, but LSTMs and GRUs are sequence models, which use hardware, multiple GPUs in particular, less efficiently than a simple convolutional neural network. Their advantage is that their accuracy is generally better than that of a convolutional neural network.
Here a dilated convolutional neural network replaces the LSTM or GRU for feature extraction. This improves resource utilization and speed while achieving results similar to, or even better than, those of an LSTM. The overall architecture of the named entity recognition method is shown in FIG. 6.
As shown in FIG. 6, the method uses a dilated convolutional neural network to learn the features of the sequence and then uses a conditional random field method to complete the sequence labeling. The structure consists mainly of a dilated convolutional neural network module, which is trained on the input sequence features using the embedding vector of each character, and a conditional random field module, which takes the sequence features output by the dilated convolutional neural network module. The conditional random field module finally completes the sequence labeling, i.e. named entity recognition.
The key part of the whole method is the dilated convolutional neural network at the front. The latter part is simply a linear-chain conditional random field model that uses the Viterbi algorithm to decode the label sequence from the state-transition scores.
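Viterbi decoding for a linear-chain model can be sketched in a few lines of plain Python. This is a generic textbook version offered for illustration; the function name and the score-table layout are assumptions, and the per-position scores would in practice come from the dilated convolutional network.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence as a list of label indices.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
                        (shared across positions)
    """
    n_labels = len(transitions)
    score = list(emissions[0])      # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With labels 0 = O, 1 = B, 2 = I and a strongly negative score for the transition O to I, the decoder prefers a B-I path even when O has the highest local score at the first position. Per-position greedy labeling cannot make that correction.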
For the conditional random field part, the input sequence is $x = [x_1, \ldots, x_T]$ and the output label sequence, with one label per element, is $y = [y_1, \ldots, y_T]$. Given the features $F(x)$ of the input sequence produced by the dilated convolutional neural network, the sequence labeling objective designed here is given by formula (1).
On this basis, the sequence labeling with a conditional random field designed by the invention is given by formula (2).
Formula (2) contains a local factor and a pairwise factor that scores consecutive labels; $Z_x$ is the partition function and $F(x)$ is the feature representation of the input sequence $x$. To avoid overfitting, the pairwise factor is independent of both the time step $t$ and the input sequence $x$, and prediction with the conditional random field is performed by a global search using the Viterbi algorithm.
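Formulas (1) and (2) themselves are not reproduced in this text. A form consistent with the definitions above and with the standard linear-chain conditional random field is sketched below. The symbol names $\psi_t$ for the local factor and $\psi_p$ for the pairwise factor are assumptions, and the exact notation of the original formulas may differ.

```latex
% Assumed reconstruction: independent per-position labeling given F(x)
P(y \mid x) = \prod_{t=1}^{T} P\bigl(y_t \mid F(x)\bigr) \tag{1}

% Assumed reconstruction: linear-chain CRF with local factor \psi_t and
% pairwise factor \psi_p, where \psi_p does not depend on t or on x
P(y \mid x) = \frac{1}{Z_x} \prod_{t=1}^{T}
    \psi_t\bigl(y_t \mid F(x)\bigr)\, \psi_p(y_t, y_{t-1}) \tag{2}
```

In this form the partition function $Z_x$ sums the product of factors over all possible label sequences, so the probabilities of all sequences for a given input add up to one.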
The present invention uses a dilated convolutional neural network to represent sequence features. The network computes a feature representation for each element $x_t$ of the input sequence $x = [x_1, \ldots, x_T]$ and outputs $c_t$, as given by formula (3).
For convenience, formula (3) uses a concatenation operator to denote the full concatenation of vectors. $W_c$ is a convolution kernel in the conventional sense, i.e. the convolution matrix applied over a sliding window of width $r$ in the input sequence. The distinguishing feature of the dilated convolutional neural network is the dilation width $\delta$ applied to the sliding window. The dilation lets the convolutional network cover a wider span of the input sequence, so that its input features come as close as possible to those available to a recurrent neural network.
The value of the dilation width $\delta$ clearly has a significant effect on the convolutional neural network. When $\delta = 1$, the network is an ordinary convolutional neural network. As the width grows, more input features are captured and, in theory, the recognition effect improves.
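The effect of the dilation width can be seen in a small plain-Python sketch of one dilated convolution layer. It assumes, for illustration only, a window of three positions, so that each output is $W_c$ applied to the concatenation of $x_{t-\delta}$, $x_t$ and $x_{t+\delta}$, with zero padding at the sequence ends. The function name and argument layout are hypothetical.

```python
def dilated_conv1d(xs, kernel, dilation):
    """Apply one dilated 1-D convolution over a sequence of vectors.

    xs       : list of T input vectors, each of length d
    kernel   : list of output rows; each row holds 3 * d weights applied to
               the concatenation [x[t - dilation], x[t], x[t + dilation]]
    dilation : the dilation width (delta); 1 gives an ordinary convolution
    """
    d = len(xs[0])
    zero = [0.0] * d
    outputs = []
    for t in range(len(xs)):
        window = []
        for offset in (-dilation, 0, dilation):
            i = t + offset
            # Positions outside the sequence are zero-padded.
            window.extend(xs[i] if 0 <= i < len(xs) else zero)
        outputs.append([sum(w * v for w, v in zip(row, window)) for row in kernel])
    return outputs
```

With `dilation=1` each output sees its immediate neighbours; with `dilation=2` it sees positions two steps away while using the same number of weights. Stacking layers with growing dilation widens the context quickly without adding parameters per layer.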
The relation extraction method comprises the following steps:
A sentence-level attention relation extraction method is employed. Instead of using a word vector as the representation of each word in the sentence, the input sequence is represented by a bidirectional gated recurrent unit (GRU). The structure of the method is shown in FIG. 7.
The relation extraction method can be divided into 5 parts:
1. The input layer, i.e. the input plain-text sentence.
2. The low-dimensional embedding layer, which maps each word in the input sentence to a low-dimensional vector, for example a pre-trained low-dimensional word vector.
3. The high-dimensional embedding layer, which obtains a high-dimensional embedding from the low-dimensional embedding through the bidirectional GRU.
4. The sentence-level attention layer, which generates a weight vector and multiplies the word-level high-dimensional embeddings at each time step by it, combining them into a sentence-level feature vector.
5. The classification layer, which completes softmax relation classification using the resulting sentence-level feature vector.
In particular, for an attention mechanism using bi-directional gated loop units and word levels to obtain a representation portion of an input sequence, the input sequence x ═ x1,...,xt]A sequence representation will first be acquired by a gated round-robin unit based on forward and backward directions. That is, for the t position in the input sequence, the method will obtain the sequence vector represented by the bi-directional gated cyclic unit through equations (4), (5), (6) and (7).
$r_t = \sigma(W_r x_t + U_r h_{t-1})$ (4)
$u_t = \sigma(W_u x_t + U_u h_{t-1})$ (5)
where σ denotes the logistic sigmoid function, $x_t$ is the input and $h_{t-1}$ is the previous hidden state. $h_t$, $r_t$ and $u_t$ denote the hidden layer state, the reset gate state and the forget gate state at position t, respectively, and $W_r$, $W_u$, $W_c$, $U_r$, $U_u$, $U_c$ are the weight matrices of the gated recurrent unit. The vector at position t produced by the bidirectional gated recurrent unit is the vector $h_t$ obtained by concatenating the forward hidden state vector, computed by a gated recurrent unit running from front to back, with the backward hidden state vector, computed by a gated recurrent unit running from back to front. Once the vector $H = [h_1, ..., h_T]$ given by the bidirectional gated recurrent unit has been obtained, the method uses sentence-level attention for relation classification; the sentence-level attention mechanism is given by equations (8), (9) and (10).
$M = \tanh(H)$ (8)
$\alpha = \mathrm{softmax}(w^T M)$ (9)
$r = H \alpha^T$ (10)
Here w is a weight vector, α is the attention vector and r is the feature vector of the sentence. The final softmax layer of the method is trained for relation classification on the sentence vectors given by sentence-level attention and a one-hot encoded list of relations.
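The pipeline from word vectors to relation probabilities can be sketched in NumPy as follows. This is an illustrative sketch, not the patented implementation: the candidate state and hidden-state update inside `gru_pass` follow the standard gated recurrent unit and are an assumption, because equations (6) and (7) are not reproduced in the text; the names `gru_pass`, `attention_pool`, `W_out`, the toy sizes and the random parameters are likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_pass(xs, p, reverse=False):
    """Run one GRU direction over xs (T, d); return hidden states (T, n)."""
    n = p["U_r"].shape[0]
    h = np.zeros(n)
    states = np.zeros((len(xs), n))
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for t in order:
        r = sigmoid(p["W_r"] @ xs[t] + p["U_r"] @ h)   # equation (4)
        u = sigmoid(p["W_u"] @ xs[t] + p["U_u"] @ h)   # equation (5)
        # Assumed standard GRU candidate and update; equations (6) and (7)
        # are not given in the text.
        c = np.tanh(p["W_c"] @ xs[t] + p["U_c"] @ (r * h))
        h = (1.0 - u) * h + u * c
        states[t] = h
    return states

def attention_pool(H, w):
    """Equations (8)-(10) for H of shape (dim, T); returns (r, alpha)."""
    M = np.tanh(H)                              # (8)
    scores = w @ M                              # w^T M, one score per position
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                        # (9) softmax over positions
    return H @ alpha, alpha                     # (10) weighted sum of columns

rng = np.random.default_rng(0)
T, d, n, num_relations = 5, 4, 3, 6             # toy sizes (hypothetical)

def init_params():
    # W_* map inputs (d) to hidden units (n); U_* map hidden to hidden.
    return {k: rng.normal(scale=0.1, size=(n, d if k[0] == "W" else n))
            for k in ("W_r", "W_u", "W_c", "U_r", "U_u", "U_c")}

xs = rng.normal(size=(T, d))                    # stand-in for word vectors
forward = gru_pass(xs, init_params())
backward = gru_pass(xs, init_params(), reverse=True)
H = np.concatenate([forward, backward], axis=1).T   # (2n, T), columns are h_t

w = rng.normal(size=2 * n)                      # attention weight vector
r_vec, alpha = attention_pool(H, w)

W_out = rng.normal(size=(num_relations, 2 * n)) # hypothetical output layer
logits = W_out @ r_vec
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over relations
```

The sketch omits training; in practice the parameters would be learned by minimising the cross-entropy between `probs` and the one-hot relation label.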
Rule-based relation extraction algorithm:
The rule-based information extraction method relies primarily on dependency parsing (DP). Dependency parsing analyses the grammatical structures that may be present in a sentence and can be extracted by grammatical rules, such as "subject-predicate-object" and "adverbial-complement" structures, in order to obtain the connections and relations between the components of the sentence. For the example sentence '"Delusional" is a Cantonese song by the Hong Kong male singer Zhangoriang', the analysis result is shown in FIG. 8.
In total, the method can obtain 15 dependency relations, as shown in FIG. 9. For this sentence, the triples that can be extracted from the dependency parsing result include the triple "Hong Kong singer Zhangori".
Finally, the rule-based information extraction method defines 10 triple extraction rules, including the attributive (modifier-head) relation, the postposed attributive relation and the subject-predicate-object relation. The rule templates have been rigorously tested and tuned, so the method can deliver satisfactory triple extraction results and is applicable to text expressing most kinds of relations. Its main disadvantages are that the rules are written by hand, that it is difficult to cover every case, and that the relation in an extracted triple may not be accurate enough; the method is therefore used only as a supplement to the deep learning information extraction method.
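One of the rule types named above, the subject-predicate-object rule, can be sketched as a pattern over dependency arcs. This is a hedged illustration only: the `Token` structure, the function `extract_svo`, the hand-written toy parse and the label names `SBV` (subject), `VOB` (object) and `HED` (root) are assumptions made for the example; the patent's own 15 relations are those of FIG. 9 and its 10 rules are not spelled out in the text.

```python
from typing import List, NamedTuple, Tuple

class Token(NamedTuple):
    text: str
    head: int    # index of the governing token, -1 for the root
    label: str   # dependency label of the arc from the head to this token

def extract_svo(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (subject, predicate, object) triples found in a parse."""
    triples = []
    for i, tok in enumerate(tokens):
        # A predicate is any token that governs both a subject and an object.
        subjects = [t.text for t in tokens if t.head == i and t.label == "SBV"]
        objects = [t.text for t in tokens if t.head == i and t.label == "VOB"]
        for s in subjects:
            for o in objects:
                triples.append((s, tok.text, o))
    return triples

# Hand-written parse of the toy sentence "Alice wrote songs" (hypothetical).
parse = [
    Token("Alice", 1, "SBV"),
    Token("wrote", -1, "HED"),
    Token("songs", 1, "VOB"),
]
triples = extract_svo(parse)
```

A real system would obtain the arcs from a dependency parser and add further rules, for example for attributive constructions, alongside this one.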
Claims (5)
1. An implementation method of an information extraction cloud platform, characterized in that:
step one: data acquisition, with the following specific process: first, a field selected by the user and an initial relation set are input, and a knowledge base containing the entities and relations in the data is acquired from them; then, a text library is acquired through remote (distant) supervision using a trained remote supervision acquisition method; finally, a named entity recognition method is applied, and the knowledge base is used for data annotation;
step two: designing a relation extraction method, namely a sentence-level attention relation extraction method in which the input sequence is represented by a bidirectional gated recurrent unit instead of representing each word in the sentence by a word vector, the method being divided into five parts: an input layer, i.e. the input plain-text sentence; a low-dimensional embedding layer, which maps each word in the input sentence to a low-dimensional vector, e.g. a pre-trained low-dimensional word vector; a high-dimensional embedding layer, which obtains a high-dimensional embedding from the low-dimensional embedding through the bidirectional gated recurrent unit; a sentence-level attention layer, which generates a weight for each time step and merges the word-level high-dimensional embeddings into a sentence-level feature vector by multiplying them by these weights; and an output layer, which performs softmax relation classification on the resulting sentence-level feature vector;
step three: establishing a relation extraction model and computing the output result, wherein the relation extraction model is a rule-based information extraction method that relies on dependency syntax analysis, extracts triples from the result of the dependency syntax analysis according to rules, and supplements the deep-learning-based method of step two, so that the two extraction methods are used together.
2. The method for implementing the information extraction cloud platform according to claim 1, wherein the training process of the trained remote supervision acquisition method is as follows: first, according to the field selected by the user and the given initial relation set, the other relations of the vertical field and the entity pairs under each relation are acquired from a Chinese knowledge graph by a data crawling script; the crawling script is implemented on the basis of the newspaper3k library and the Beautiful Soup library, both open source on GitHub; the Beautiful Soup library is used to loop over the triple information corresponding to each relation, search URLs for the corresponding entity pairs are constructed one after another to retrieve the corresponding news, and the web-page text extraction function of the newspaper3k library is used to extract the relevant sentences as data for the given relation, thereby acquiring from the Internet the sentences that relate to both entities of a triple as data for the corresponding relation.
3. The method for implementing the information extraction cloud platform according to claim 2, wherein the named entity recognition method is as follows:
the features of a sequence are learned with a dilated convolutional neural network, and sequence labeling is completed with a conditional random field method; the structure mainly comprises a dilated convolutional neural network module, which is trained on the vector of each character to produce the features of the input sequence, and a conditional random field module, which uses the sequence features output by the dilated convolutional neural network module; the conditional random field module finally completes sequence labeling, i.e. named entity recognition;
the conditional random field module is a linear-chain conditional random field model that uses the Viterbi algorithm to compute state-transition probabilities, and the sequence labeling objective is:
wherein the objective contains a local factor and a pairwise factor that scores consecutive labels, $Z_x$ is the partition function, and $F(x)$ is the feature of the input sequence x; to avoid overfitting, the pairwise factor of the conditional random field is independent of both the time step t and the input sequence x; the prediction of the conditional random field is completed by a global search using the Viterbi algorithm, x being the input sequence and y being the output sequence;
the dilated convolutional neural network computes a feature representation for each element $x_t$ of the input sequence x and outputs $c_t$:
4. The method for implementing the information extraction cloud platform according to claim 3, wherein the relation extraction method is specifically implemented as follows:
a bidirectional gated recurrent unit and a word-level attention mechanism are used to obtain the representation of the input sequence; the input sequence x is first passed through gated recurrent units in the forward and backward directions, i.e. for position t in the input sequence, the sequence vector of the bidirectional gated recurrent unit is obtained by the following formulas:
$r_t = \sigma(W_r x_t + U_r h_{t-1})$
$u_t = \sigma(W_u x_t + U_u h_{t-1})$
where σ denotes the logistic sigmoid function, $x_t$ is the input and $h_{t-1}$ is the previous hidden state; $h_t$, $r_t$ and $u_t$ denote the hidden layer state, the reset gate state and the forget gate state at position t, respectively, and $W_r$, $W_u$, $W_c$, $U_r$, $U_u$, $U_c$ are the weight matrices of the gated recurrent unit; the vector at position t produced by the bidirectional gated recurrent unit is the vector $h_t$ obtained by concatenating the forward hidden state vector, computed by a gated recurrent unit running from front to back, with the backward hidden state vector, computed by a gated recurrent unit running from back to front; once the combined vector $H = [h_1, ..., h_T]$ has been obtained, sentence-level attention is used for relation classification, the sentence-level attention mechanism being:
$M = \tanh(H)$
$\alpha = \mathrm{softmax}(w^T M)$
$r = H \alpha^T$
where w is a weight vector, α is an attention vector, and r is a feature vector of a sentence used in the final classification training of the relationship extraction method.
5. The method for implementing the information extraction cloud platform according to claim 4, wherein the extraction of triples from the result of the dependency syntax analysis in the relation extraction method is specifically implemented as follows:
triple extraction rules are set based on the attributive (modifier-head) relation, the postposed attributive relation and the subject-predicate-object relation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010100115.3A CN111339407B (en) | 2020-02-18 | 2020-02-18 | Implementation method of information extraction cloud platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111339407A (en) | 2020-06-26 |
CN111339407B (en) | 2023-12-05 |
Family
ID=71183592
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010100115.3A Active CN111339407B (en) | 2020-02-18 | 2020-02-18 | Implementation method of information extraction cloud platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111339407B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN111339407B (en) | 2023-12-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||