CN109460434B - Data extraction model establishing method and device

Info

Publication number
CN109460434B
Authority
CN
China
Prior art keywords
training
entity
sentence
sentences
word
Prior art date
Legal status
Active
Application number
CN201811251141.5A
Other languages
Chinese (zh)
Other versions
CN109460434A (en)
Inventor
岳永鹏
邹晶
Current Assignee
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd filed Critical Beijing Knownsec Information Technology Co Ltd
Priority to CN201811251141.5A priority Critical patent/CN109460434B/en
Publication of CN109460434A publication Critical patent/CN109460434A/en
Application granted granted Critical
Publication of CN109460434B publication Critical patent/CN109460434B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The application provides a method and a device for establishing a data extraction model. The method includes: establishing a data extraction model comprising a CRF and a CNN, where the CRF is used to identify entities and the CNN is used to determine the relationship between the two entities of an entity pair; acquiring a training sample set comprising a plurality of training sentences and preprocessing the training sentences in the set; and training the CRF and the CNN of the data extraction model in parallel on the preprocessed training sample set. With this design, entity recognition and relationship extraction for entity pairs can be performed simultaneously.

Description

Data extraction model establishing method and device
Technical Field
The application relates to the technical field of machine learning, in particular to a method and a device for establishing a data extraction model.
Background
Traditional named entity recognition and relationship extraction models are usually built by first training an entity extraction model and then constructing a relationship extraction model on top of it. This approach ignores the correlation between the entity extraction model and the relationship extraction model and easily leads to error propagation.
Disclosure of Invention
In view of the above, the present application provides a method and an apparatus for building a data extraction model to at least partially address the problem described above.
In a first aspect, an embodiment of the present application provides a data extraction model establishing method, where the method includes:
establishing a data extraction model, wherein the data extraction model comprises a conditional random field (CRF) for identifying an entity and a convolutional neural network (CNN) for determining a relationship category of two entities in an entity pair;
acquiring a training sample set comprising a plurality of training sentences, and preprocessing the training sentences in the training sample set;
and performing parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set.
Optionally, preprocessing the training sentences in the training sample set includes:
aiming at training sentences which comprise a plurality of relation categories in the training sample set, determining the number of the relation categories which are included in the training sentences, and copying the training sentences according to the number to obtain a plurality of training sentences so as to enable the plurality of training sentences to be in one-to-one correspondence with the plurality of relation categories;
analyzing the training sentences, counting the number of the training sentences with the relation category aiming at each relation category in the training sentences, and adjusting the training sentences in the training sample set according to the counting result so as to balance the relation categories of the training sentences in the training sample set;
aiming at a training sentence of which the training sample set comprises an entity, if the entity comprises a plurality of words, adding a first label to a first word in the plurality of words, adding a second label to a last word in the plurality of words, and adding a third label to a word between the first word and the last word; if the entity comprises a word, adding a fourth label to the word;
processing the relation labels of the entity pairs into triples aiming at the training sentences with the entity pairs in the training sample set, wherein the triples comprise the position information of the two entities in the entity pairs and the relation classes among the entities;
processing each training sentence in the training sample set into a target training sentence in an integer identification form through a class dictionary;
and aiming at each target training sentence, expanding the target training sentence to obtain a first sentence with the first number of words and a second sentence with the second number of characters.
Optionally, training the CRF and the CNN in the data extraction model in parallel through the preprocessed training sample set includes:
extracting a preset number of target training sentences from a plurality of target training sentences obtained by preprocessing, and retrieving word vector information of a first sentence corresponding to the target training sentence from a preset word vector library for each extracted target training sentence so as to convert the first sentence into a word vector; training a second sentence corresponding to the target training sentence through a BiLSTM network to convert the second sentence into a character vector; splicing the word vector and the character vector obtained by conversion into a mixed feature vector;
coding the mixed feature vector through a BiLSTM model, and outputting corresponding coding information;
and inputting the coding information into the CRF to identify the entity in the training sentence, and adding an entity label to the identified entity to obtain an entity labeling sequence of the training sentence.
Optionally, the training the CRF and the CNN in the data extraction model in parallel by the preprocessed training sample set further includes:
for each entity pair in the target training statement, acquiring the feature vector of each of the two entities in the entity pair from the coding information according to the triplet of the entity pair as a first feature vector, and acquiring the statement feature vector between the two entities in the entity pair from the mixed feature vector of the target training statement as a second feature vector;
acquiring entity labels of two entities in the entity pair from the entity labeling sequence, and performing random vectorization representation on the acquired entity labels to obtain a third feature vector;
splicing the first feature vector, the second feature vector and the third feature vector to obtain a target spliced vector;
and inputting the target splicing vector into the CNN to obtain the relationship category of the two entities in the entity pair.
Optionally, the training the CRF and the CNN in the data extraction model in parallel by the preprocessed training sample set further includes:
stopping training when the sum of the loss function of the CRF and the loss function of the CNN converges, or when the iteration number of training reaches a preset maximum value.
In a second aspect, an embodiment of the present application further provides a data extraction model building apparatus, where the apparatus includes:
the model establishing module is used for establishing a data extraction model, wherein the data extraction model comprises a conditional random field (CRF) for identifying the entity and a convolutional neural network (CNN) for determining the relationship category of two entities in the entity pair;
the system comprises a preprocessing module, a training module and a training module, wherein the preprocessing module is used for acquiring a training sample set comprising a plurality of training sentences and preprocessing the training sentences in the training sample set;
and the training module is used for performing parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set.
Optionally, the preprocessing module comprises:
the relation processing submodule is used for determining the number of relation classes included in the training sentences aiming at the training sentences including a plurality of relation classes in the training sample set, and copying the training sentences according to the number to obtain a plurality of training sentences so as to enable the plurality of training sentences to correspond to the plurality of relation classes one by one;
the balancing submodule is used for analyzing the training sentences, counting the number of the training sentences with the relation category aiming at each relation category in the training sentences, and adjusting the training sentences in the training sample set according to the counting result so as to balance the relation categories of the training sentences in the training sample set;
a first label processing submodule, configured to add, to a training sentence in which the training sample set includes an entity, a first label to a first word of the multiple words, add a second label to a last word of the multiple words, and add a third label to a word between the first word and the last word, if the entity includes multiple words; if the entity comprises a word, adding a fourth label to the word;
a second label processing sub-module, configured to, for a training statement in which an entity pair exists in the training sample set, process a relationship label of the entity pair into a triple, where the triple includes respective position information of two entities in the entity pair and a relationship category between the two entities;
the first conversion submodule is used for processing each training sentence in the training sample set into a target training sentence in an integer identification form through a class dictionary;
and the second conversion submodule is used for expanding the target training sentences to obtain first sentences of which the word number is the first number and second sentences of which the character number is the second number.
Optionally, the training module comprises:
the first splicing submodule is used for extracting a preset number of target training sentences from a plurality of target training sentences obtained through preprocessing, and retrieving word vector information of a first sentence corresponding to the target training sentence from a preset word vector library aiming at each extracted target training sentence so as to convert the first sentence into a word vector; training a second sentence corresponding to the target training sentence through a BiLSTM network to convert the second sentence into a character vector; splicing the word vector and the character vector obtained by conversion into a mixed feature vector;
the coding submodule is used for coding the mixed feature vector through a BiLSTM model and outputting corresponding coding information;
and the recognition submodule is used for inputting the coding information into the CRF to identify the entity in the training sentence, and adding an entity label to the identified entity to obtain an entity labeling sequence of the training sentence.
Optionally, the training module further comprises:
a feature obtaining sub-module, configured to, for each entity pair in the target training sentence, obtain, according to the triplet of the entity pair, a feature vector of each of two entities in the entity pair from the coding information as a first feature vector, and obtain, from the mixed feature vector of the target training sentence, a sentence feature vector located between the two entities in the entity pair as a second feature vector; acquiring entity labels of two entities in the entity pair from the entity labeling sequence, and performing random vectorization representation on the acquired entity labels to obtain a third feature vector;
the second splicing submodule is used for splicing the first feature vector, the second feature vector and the third feature vector to obtain a target splicing vector;
and the relationship extraction submodule is used for inputting the target splicing vector into the CNN to obtain the relationship category of the two entities in the entity pair.
Optionally, the training module further comprises:
and the stopping submodule is used for stopping training when the sum of the loss function of the CRF and the loss function of the CNN converges or when the iteration number of the training reaches a preset maximum value.
Compared with the prior art, the embodiment of the application has the following beneficial effects:
the embodiment of the application provides a method and a device for establishing a data extraction model, wherein the method comprises the following steps: establishing a data extraction model comprising CRF and CNN, wherein the CRF is used for identifying entities, and the CNN is used for determining the relationship between two entities in an entity pair; acquiring a training sample set comprising a plurality of training sentences, and preprocessing the training sentences in the training sample set; and performing parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set. Through the design, the identification of the entity and the relation extraction of the entity pair can be realized simultaneously.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments will be briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data extraction model establishing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the sub-steps of step S23 shown in FIG. 2;
fig. 4 is a schematic block diagram of a data extraction model building apparatus according to an embodiment of the present application;
FIG. 5 is a sub-block diagram of the pre-processing module shown in FIG. 4;
FIG. 6 is a sub-block diagram of the training module shown in FIG. 4.
Icon: 100 - data processing device; 110 - data extraction model building device; 111 - model building module; 112 - preprocessing module; 1121 - relationship processing submodule; 1122 - balancing submodule; 1123 - first label processing submodule; 1124 - second label processing submodule; 1125 - first conversion submodule; 1126 - second conversion submodule; 113 - training module; 1131 - first splicing submodule; 1132 - encoding submodule; 1133 - recognition submodule; 1134 - feature acquisition submodule; 1135 - second splicing submodule; 1136 - relationship extraction submodule; 1137 - stopping submodule; 120 - processor; 130 - machine readable storage medium.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the embodiments of the present application, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings or orientations or positional relationships that the products of the application usually place when used, and are only used for convenience of description and simplicity of description, but do not indicate or imply that the devices or elements being referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the application. Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
For convenience of understanding, technical terms related to the embodiments of the present application are explained below:
An entity refers to a noun with a specific meaning in a text, such as a person's name, a place name, an organization name, or a proper noun. Accordingly, Named Entity Recognition (NER) refers to recognizing entities with a particular meaning in text. Entity relation extraction refers to identifying a specific relationship that exists between two entities in text.
Fig. 1 is a schematic block diagram of a data processing apparatus 100 according to an embodiment of the present application. The data processing apparatus 100 includes a data extraction model building device 110, a processor 120, and a machine-readable storage medium 130.
The processor 120 and the machine-readable storage medium 130 are electrically connected to each other, directly or indirectly, to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The data extraction model building device 110 includes at least one software function module which can be stored in the form of software or firmware in the machine-readable storage medium 130 or solidified in the Operating System (OS) of the data processing apparatus 100. The processor 120 is used for executing executable modules in the machine-readable storage medium 130, such as the software function modules and computer programs included in the data extraction model building device 110.
The machine-readable storage medium 130 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
It should be understood that the structure described above is merely an example; the data processing apparatus 100 may have more, fewer, or different components than those shown in fig. 1, and may, for example, also include a communication unit. In addition, each component shown in fig. 1 may be implemented by software, hardware, or a combination thereof, and the embodiment of the present application does not limit this.
Referring to fig. 2, fig. 2 is a flowchart of a data extraction model building method applied to the data processing apparatus 100 shown in fig. 1. The steps of the method are described in detail below.
Step S21, establishing a data extraction model, where the data extraction model includes a CRF (Conditional Random Field) for identifying an entity and a CNN (Convolutional Neural Network) for determining the relationship category of two entities in an entity pair.
Wherein the data extraction model is a hybrid model composed of the CRF and the CNN, which may be stored in a machine-readable storage medium 130 of the data processing apparatus 100.
Step S22, a training sample set comprising a plurality of training sentences is obtained, and the training sentences in the training sample set are preprocessed.
In this embodiment, the training sentences in the training sample set are training sentences with entity labels and relationship labels between entities. In practice, the training sentences in the training sample set may be preprocessed through the following process, which will be described in detail below.
First, for each training sentence in the training sample set that includes a plurality of relationship categories, the number of relationship categories included in the training sentence is determined, and the training sentence is copied according to that number to obtain a plurality of training sentences, so that the resulting training sentences correspond one-to-one to the relationship categories.
In other words, in the embodiment of the present application, each training sentence is decomposed into a form that includes only one relationship category. For example, when two relationship categories exist in a training sentence, the sentence may be copied once to obtain two training sentences, each corresponding to one relationship category.
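The following is a minimal sketch of this duplication step in Python; the sample structure and field names are illustrative assumptions, not part of the patent text.

```python
# A minimal sketch: copy a sentence once per relation category so that every
# resulting training sentence corresponds to exactly one relation category.
def split_by_relation(samples):
    expanded = []
    for sample in samples:
        for relation in sample["relations"]:
            expanded.append({"tokens": sample["tokens"], "relation": relation})
    return expanded

samples = [{"tokens": ["Alice", "works", "at", "Acme", "in", "Paris"],
            "relations": [([0, 0], [3, 3], "works_for"),
                          ([3, 3], [5, 5], "located_in")]}]
print(len(split_by_relation(samples)))  # 2: one copy per relation category
```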
Secondly, the training sentences are analyzed: for each relationship category appearing in the training sentences, the number of training sentences with that relationship category is counted, and the training sentences in the training sample set are adjusted according to the statistics so that the relationship categories of the training sentences in the training sample set are balanced.
The adjustment of the training sentences in the training sample set may be balance down-sampling, so that the data distribution in the training sample set is relatively balanced.
Thirdly, for each training sentence in the training sample set that includes an entity: if the entity includes a plurality of words, a first label is added to the first word, a second label is added to the last word, and a third label is added to each word between the first and last words; if the entity includes a single word, a fourth label is added to that word.
If a word is not part of any entity, it may be labeled with a specific label different from the first, second, third, and fourth labels described above.
In other words, if a training sentence in the training sample set includes entities, its entity labels can be assigned using the BIOES sequence labeling scheme. In this case, "B" serves as the first label, "E" as the second label, "I" as the third label, "S" as the fourth label, and "O" as the specific label.
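A small sketch of the BIOES labeling described above, assuming entity spans are given as (start, end) token indices; the example sentence is hypothetical.

```python
# BIOES tagging sketch: B = first label, E = second, I = third, S = fourth,
# O = the specific label for non-entity words.
def bioes_tags(tokens, entity_spans):
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        if start == end:
            tags[start] = "S"                 # single-word entity
        else:
            tags[start] = "B"                 # first word of the entity
            tags[end] = "E"                   # last word of the entity
            for i in range(start + 1, end):
                tags[i] = "I"                 # words in between
    return tags

print(bioes_tags(["New", "York", "is", "big"], [(0, 1)]))  # ['B', 'E', 'O', 'O']
```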
Fourthly, for each training sentence in the training sample set that has an entity pair, the relationship label of the entity pair is processed into a triple, where the triple includes the position information of each of the two entities in the entity pair and the relationship category between them.
The position information of each of the two entities refers to the position of that entity in the training sentence in which it appears. Alternatively, the triple may be represented in the form:
[[e1_loc1, e1_loc2], [e2_loc1, e2_loc2], r],
where ei_locj denotes a position of the i-th entity in the training sentence (j = 1 denotes the position where the entity's words start and j = 2 the position where they end), and r denotes the relationship category.
And fifthly, processing each training sentence in the training sample set into a target training sentence in an integer identification form through a class dictionary.
Specifically, class dictionaries may be obtained by separately collecting the words (word), characters (char), entity categories (ner), and relationship categories (r) appearing in the training sentences and storing the results as dictionaries, where each key is the corresponding word (or character, or category) and each value is the corresponding integer identifier.
It should be noted that the value 0 is reserved: in the word and char dictionaries, 0 indicates that no corresponding word or character was retrieved; in the ner dictionary, 0 represents a non-entity (i.e., the label "O"); and in the r dictionary, 0 indicates that there is no relationship between the entity pair. Through this process, all the text information in the training sample set can be converted into integer identifiers.
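A minimal sketch of how such class dictionaries might be built, with 0 reserved as described above; the vocabularies shown are purely illustrative.

```python
# Each dictionary maps an item (word, character, entity category or relation
# category) to an integer identifier; 0 is reserved, so assignment starts at 1.
def build_dict(items):
    mapping = {}
    for item in items:
        if item not in mapping:
            mapping[item] = len(mapping) + 1
    return mapping

word2id = build_dict(["Alice", "works", "at", "Acme"])
ner2id = {"O": 0, "B": 1, "I": 2, "E": 3, "S": 4}
rel2id = {"no_relation": 0, "works_for": 1, "located_in": 2}

sentence = ["Alice", "works", "at", "Acme"]
print([word2id.get(w, 0) for w in sentence])  # [1, 2, 3, 4]
```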
And sixthly, aiming at each target training sentence, expanding the target training sentence to obtain a first sentence with the first number of words and a second sentence with the second number of characters.
In this embodiment, the data obtained through the conversion in the fifth step may be subjected to dimension expansion, so that all target training sentences have a consistent length, which facilitates subsequent processing. In practice, words may be filled in to expand the target training sentences to a consistent number of words, for example so that each includes a first number of words. The first number may be selected as follows: determine the number of words in the target training sentence with the most words, and let the first number be any number greater than or equal to that count.
In addition, characters may be filled in to expand a plurality of target training sentences into target training sentences having a uniform number of characters, for example, into target training sentences each including a second number of characters. Wherein the second number may be selected by: the number of characters included in the target training sentence having the largest number of characters in the plurality of target training sentences is determined, and the second number may be any number greater than or equal to the number.
Alternatively, in this embodiment, the dictionary entry whose value is 0 may be used as the padding word or character.
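A padding sketch under the convention above (identifier 0 as the filler); the lengths are examples only.

```python
# Expand every sentence to a fixed length by appending the padding id 0.
def pad(ids, length, pad_id=0):
    return ids[:length] + [pad_id] * max(0, length - len(ids))

word_id_sentences = [[5, 9, 3], [7, 2]]
first_number = max(len(s) for s in word_id_sentences)  # at least the longest sentence
padded = [pad(s, first_number) for s in word_id_sentences]
print(padded)  # [[5, 9, 3], [7, 2, 0]]
```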
Step S23, performing parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set.
By training the CRF and the CNN in parallel, the two can be optimized as a whole, and the error accumulation that arises when they are trained in series is reduced, so that entity recognition and the extraction of relationships between entities can be performed simultaneously.
Optionally, in the embodiment of the present application, step S23 may include the sub-steps as shown in fig. 3.
Step S31, extracting a preset number of target training sentences from the plurality of target training sentences obtained by preprocessing, and, for each extracted target training sentence, retrieving word vector information of the first sentence corresponding to the target training sentence from a preset word vector library so as to convert the first sentence into a word vector; converting the second sentence corresponding to the target training sentence into a character vector through a BiLSTM (Bidirectional Long Short-Term Memory) network; and splicing the word vector and the character vector obtained by conversion into a mixed feature vector.
The preset number may be determined according to actual requirements and is usually set as the batch size (e.g., the variable batch_size). The preset word vector library may be a GloVe word vector library, or another commonly used word vector library such as Word2Vec, which is not limited in this embodiment.
For an extracted target training sentence, there are sentences in two representations: a word (word) identifier form and a character (char) identifier form. In implementation, the sentence in word identifier form can be converted into word vectors and the sentence in char identifier form into character vectors, and the resulting word vectors and character vectors are spliced into a mixed feature vector for subsequent processing.
In this embodiment, character vectors and word vectors jointly describe the original features of the training sentences. Compared with the related art, in which only word vectors describe the original features, this reduces the influence of the word vector library and the word segmentation tool, and limits the loss of accuracy of the data extraction model when a test sentence contains many words not seen during training or when its word segmentation differs from that of the training sentences.
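A sketch of the mixed word-plus-character features, assuming PyTorch; the embedding sizes, the use of the final BiLSTM states as the character vector, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixedFeatures(nn.Module):
    """Concatenate word embeddings with character-level BiLSTM features
    to form the mixed feature vector."""
    def __init__(self, n_words, n_chars, word_dim=100, char_dim=30, char_hidden=25):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)  # e.g. GloVe-initialised
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: [batch, words]; char_ids: [batch, words, chars]
        w = self.word_emb(word_ids)
        b, t, c = char_ids.shape
        ch = self.char_emb(char_ids.view(b * t, c))
        _, (h, _) = self.char_lstm(ch)                     # final hidden states, both directions
        ch = torch.cat([h[0], h[1]], dim=-1).view(b, t, -1)
        return torch.cat([w, ch], dim=-1)                  # mixed feature vector

mixed = MixedFeatures(1000, 100)(torch.zeros(2, 6, dtype=torch.long),
                                 torch.zeros(2, 6, 8, dtype=torch.long))
print(mixed.shape)  # torch.Size([2, 6, 150])
```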
Step S32, coding the mixed feature vector through a BiLSTM model, and outputting the corresponding coding information.
In this embodiment, a BiLSTM model is used to construct the coding layer, and the mixed feature vector obtained in step S31 is input into the coding layer to obtain the corresponding coding information, which is the hidden-layer information.
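A self-contained sketch of the coding layer, assuming PyTorch; the input size of 150 (matching the mixed-feature sketch above) and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=150, hidden_size=64, bidirectional=True, batch_first=True)
mixed = torch.randn(2, 6, 150)      # [batch, words, mixed-feature dimension]
encoded, _ = encoder(mixed)         # hidden-layer "coding information"
print(encoded.shape)                # torch.Size([2, 6, 128])
```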
Step S33, inputting the coding information into the CRF to identify the entity in the training sentence, and adding an entity label to the identified entity to obtain an entity tagging sequence of the training sentence.
In this embodiment, the CRF has a loss function, denoted for example as loss_entity.
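A sketch of the CRF tagging branch, assuming the third-party pytorch-crf package and illustrative sizes; using the negative log-likelihood as loss_entity is an assumption about the sign convention, not a statement of the patent's implementation.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: pip install pytorch-crf

num_tags = 5                                    # B, I, E, S, O
to_tags = nn.Linear(128, num_tags)              # project encoder outputs to tag scores
crf = CRF(num_tags, batch_first=True)

encoded = torch.randn(2, 6, 128)                # coding information from the BiLSTM
gold = torch.zeros(2, 6, dtype=torch.long)      # gold entity tags (illustrative)
emissions = to_tags(encoded)
loss_entity = -crf(emissions, gold)             # CRF negative log-likelihood
tag_sequences = crf.decode(emissions)           # predicted entity labeling sequences
print(loss_entity.item(), tag_sequences[0])
```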
Optionally, referring to fig. 3 again, the step S23 may further include the following steps.
Step S34, for each entity pair in the target training sentence, obtaining, from the encoded information according to the triplet of the entity pair, a feature vector of each of the two entities in the entity pair as a first feature vector, and obtaining, from the mixed feature vector of the target training sentence, a sentence feature vector located between the two entities in the entity pair as a second feature vector.
For example, assume that the triple of an entity pair in the target training sentence is [[e1_loc1, e1_loc2], [e2_loc1, e2_loc2], r]. Then, according to [e1_loc1, e1_loc2] and [e2_loc1, e2_loc2], the positions of the two entities in the pair (say entity 1 and entity 2) in the target training sentence can be determined, so that the feature vector of entity 1 and the feature vector of entity 2 are obtained from the coding information corresponding to the target training sentence, yielding two first feature vectors.
After determining the positions of the entity 1 and the entity 2, a clause between the entity 1 and the entity 2 may be determined, so that a feature vector of the clause may be obtained from the mixed feature vector of the target training sentence, resulting in the second feature vector described above.
Step S35, obtaining the entity labels of the two entities in the entity pair from the entity tagging sequence, and performing random vectorization representation on the obtained entity labels to obtain a third feature vector.
In this embodiment, the entity tagging sequence output by the CRF includes entity tags of the identified entities. Still taking the example that the entity pair includes the entity 1 and the entity 2, correspondingly, the entity tag 1 of the entity 1 and the entity tag 2 of the entity 2 may be obtained from the entity tagging sequence, and both the entity tag 1 and the entity tag 2 are randomly vectorized to obtain two random vectors (i.e., the third eigenvector).
And step S36, splicing the first feature vector, the second feature vector and the third feature vector to obtain a target spliced vector.
In implementation, for a target training sentence, two first feature vectors, one second feature vector and two third feature vectors may be obtained, and the target concatenation vector may be obtained by concatenating all of them.
And step S37, inputting the target splicing vector into the CNN to obtain the relationship category of the two entities in the entity pair.
In implementation, the target splicing vector obtained in step S36 is used as the input of the CNN, so as to obtain the relationship category of entity 1 and entity 2 in the entity pair. The CNN has a loss function, denoted for example as loss_relation.
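A sketch of the relation-classification branch, assuming PyTorch; pooling the spliced features with a 1-D convolution is one plausible realisation of the CNN described here, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class RelationCNN(nn.Module):
    """Score relation categories from the target spliced feature sequence."""
    def __init__(self, feat_dim=150, n_relations=3, channels=64):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, channels, kernel_size=3, padding=1)
        self.out = nn.Linear(channels, n_relations)

    def forward(self, spliced):                  # [batch, length, feat_dim]
        x = self.conv(spliced.transpose(1, 2))   # convolve along the sequence
        x = torch.relu(x).max(dim=2).values      # max-pool over positions
        return self.out(x)                       # relation-category scores

scores = RelationCNN()(torch.randn(2, 10, 150))
loss_relation = nn.CrossEntropyLoss()(scores, torch.tensor([1, 2]))
print(scores.shape, loss_relation.item())
```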
In this embodiment, the sum of the loss function loss_entity and the loss function loss_relation is used as the loss function of the entire data extraction model, denoted loss, i.e., loss = loss_entity + loss_relation. An adaptive moment estimation gradient descent algorithm (e.g., Adam) may be employed to adjust the parameters of the data extraction model according to loss so as to optimize the model.
Optionally, step S23 may further include the steps of:
stopping training when the sum of the loss function of the CRF and the loss function of the CNN converges, or when the iteration number of training reaches a preset maximum value.
The sum of the loss function of the CRF and the loss function of the CNN is the loss function described above. In practice, after training on the extracted preset number of target training sentences is completed, if the number of training iterations has not reached the maximum value and loss has not converged, steps S31-S37 described above may be executed again.
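A training-loop sketch showing the joint loss and the two stopping conditions; `compute_losses` stands in for one forward pass of the CRF and CNN branches and is purely hypothetical plumbing.

```python
import torch

def train(parameters, compute_losses, max_iters=10000, tol=1e-4):
    """Optimise loss = loss_entity + loss_relation with Adam until the sum
    converges or the preset maximum number of iterations is reached."""
    optimizer = torch.optim.Adam(parameters, lr=1e-3)
    previous = float("inf")
    for _ in range(max_iters):                   # stop at the preset maximum
        loss_entity, loss_relation = compute_losses()
        loss = loss_entity + loss_relation       # overall model loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if abs(previous - loss.item()) < tol:    # stop when the sum converges
            break
        previous = loss.item()
    return previous
```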
After training is completed, a test sentence without labels can be converted into a target test sentence in integer identifier form through the word dictionary and the char dictionary. Dimension expansion is performed on the target test sentence to obtain a first test sentence whose number of words is the first number and a second test sentence whose number of characters is the second number. The first test sentence is converted into word vectors through the word vector library, the second test sentence is converted into character vectors through the BiLSTM, and the two are spliced to obtain a mixed test vector.
The mixed test vector is input into the trained data extraction model: the CRF of the data extraction model outputs a predicted entity tagging sequence; the model then extracts one entity pair (two entities) at a time from the entity tagging sequence, splices the relationship-extraction features, and inputs the resulting spliced features into the CNN of the data extraction model to obtain the relationship category of the two entities in the pair. This is repeated until the relationship categories of all entity pairs in the entity tagging sequence have been predicted, and the entity-relationship results are output.
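A small helper sketch for the prediction stage above: recovering entity spans from a predicted BIOES labeling sequence and forming the entity pairs whose relations are then classified; the tags and spans shown are hypothetical.

```python
# Recover (start, end) entity spans from a BIOES tag sequence.
def entity_spans(tags):
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i))
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i))
            start = None
    return spans

tags = ["B", "E", "O", "S", "O"]
spans = entity_spans(tags)
pairs = [(a, b) for i, a in enumerate(spans) for b in spans[i + 1:]]
print(spans, pairs)  # [(0, 1), (3, 3)] [((0, 1), (3, 3))]
```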
Fig. 4 is a functional block diagram of a data extraction model building device 110 applied to the data processing apparatus 100 shown in fig. 1 according to an embodiment of the present application. The data extraction model establishing device 110 comprises a model establishing module 111, a preprocessing module 112 and a training module 113.
The model building module 111 is used to build a data extraction model, wherein the data extraction model includes a conditional random field, CRF, for identifying entities and a CNN for determining a relationship class of two entities in a pair of entities.
In this embodiment, the model building module 111 may be configured to execute step S21 shown in fig. 2, and for the description of the model building module 111, reference may be specifically made to the detailed description of step S21.
The preprocessing module 112 is configured to obtain a training sample set including a plurality of training sentences, and preprocess the training sentences in the training sample set.
In this embodiment, the preprocessing module 112 may be configured to execute step S22 shown in fig. 2, and the detailed description of step S22 may be referred to for the description of the preprocessing module 112.
The training module 113 is configured to perform parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set.
In this embodiment, the training module 113 may be configured to execute step S23 shown in fig. 2, and for the description of the training module 113, reference may be specifically made to the detailed description of step S23.
Optionally, as shown in fig. 5, the preprocessing module 112 may include a relationship processing sub-module 1121, a balancing sub-module 1122, a first label processing sub-module 1123, a second label processing sub-module 1124, a first conversion sub-module 1125, and a second conversion sub-module 1126.
The relationship processing submodule 1121 is configured to determine, for a training sentence in which the training sample set includes a plurality of relationship categories, the number of the relationship categories included in the training sentence, and copy the training sentence according to the number to obtain a plurality of training sentences, so that the plurality of training sentences correspond to the plurality of relationship categories one to one.
The balancing sub-module 1122 is configured to analyze the training sentences, count the number of training sentences having a relationship category for each relationship category in the training sentences, and adjust the training sentences in the training sample set according to the statistical result to balance the relationship categories of the training sentences in the training sample set.
The first label processing submodule 1123 is configured to, for a training sentence in which the training sample set includes an entity: if the entity includes a plurality of words, add a first label to the first word, a second label to the last word, and a third label to each word between the first and last words; if the entity includes a single word, add a fourth label to that word.
The second tag processing sub-module 1124 is configured to, for a training statement having an entity pair in the training sample set, process the relationship tag of the entity pair into a triple, where the triple includes respective location information of two entities in the entity pair and a relationship category therebetween.
The first conversion sub-module 1125 is configured to process each training sentence in the training sample set into a target training sentence in integer identifier form through the class dictionaries.
The second conversion sub-module 1126 is configured to, for each target training sentence, expand the target training sentence to obtain a first sentence with the first number of words and a second sentence with the second number of characters.
Optionally, referring to fig. 6, the training module 113 may include a first splicing sub-module 1131, an encoding sub-module 1132, and a recognition sub-module 1133.
The first splicing sub-module 1131 is configured to, for each target training sentence, retrieve word vector information of the first sentence corresponding to the target training sentence from a preset word vector library to convert the first sentence into a word vector; convert the second sentence corresponding to the target training sentence into a character vector through a BiLSTM network; and splice the word vector and the character vector obtained by conversion into a mixed feature vector.
The encoding submodule 1132 is configured to encode the mixed feature vector through a BiLSTM model and output the corresponding coding information.
The identifying submodule 1133 is configured to input the coding information into the CRF to identify an entity in the training sentence, and add an entity tag to the identified entity to obtain an entity tagging sequence of the training sentence.
Optionally, referring to fig. 6 again, the training module 113 may further include a feature obtaining sub-module 1134, a second splicing sub-module 1135, and a relationship extracting sub-module 1136.
The feature obtaining sub-module 1134 is configured to, for each entity pair in the target training sentence, obtain, according to the triplet of the entity pair, a feature vector of each of two entities in the entity pair from the coding information as a first feature vector, and obtain, from the mixed feature vector of the target training sentence, a sentence feature vector located between the two entities in the entity pair as a second feature vector; and acquiring the entity labels of the two entities in the entity pair from the entity labeling sequence, and performing random vectorization representation on the acquired entity labels to obtain a third feature vector.
The second splicing sub-module 1135 is configured to splice the first eigenvector, the second eigenvector, and the third eigenvector to obtain a target splicing vector.
The relationship extraction sub-module 1136 is configured to input the target splicing vector into the CNN to obtain the relationship category of the two entities in the entity pair.
Optionally, the training module 113 may further include a stop sub-module 1137.
The stopping sub-module 1137 is configured to stop the training when the sum of the CRF loss function and the CNN loss function converges, or when the number of training iterations reaches a preset maximum value.
To sum up, the embodiments of the present application provide a method and an apparatus for establishing a data extraction model. The method includes: establishing a data extraction model comprising a CRF and a CNN, wherein the CRF is used to identify entities and the CNN is used to determine the relationship between the two entities of an entity pair; acquiring a training sample set comprising a plurality of training sentences and preprocessing the training sentences in the set; and training the CRF and the CNN of the data extraction model in parallel on the preprocessed training sample set. With this design, entity recognition and relationship extraction for entity pairs can be performed simultaneously.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (4)

1. A method for establishing a data extraction model, the method comprising:
establishing a data extraction model, wherein the data extraction model comprises a conditional random field CRF for identifying an entity and a convolutional neural network CNN for determining a relationship class of two entities in a pair of entities;
acquiring a training sample set comprising a plurality of training sentences, and preprocessing the training sentences in the training sample set;
performing parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set;
preprocessing the training sentences in the training sample set, wherein the preprocessing comprises the following steps:
aiming at training sentences which comprise a plurality of relation categories in the training sample set, determining the number of the relation categories which are included in the training sentences, and copying the training sentences according to the number to obtain a plurality of training sentences so as to enable the plurality of training sentences to be in one-to-one correspondence with the plurality of relation categories;
analyzing the training sentences, counting the number of the training sentences with the relation category aiming at each relation category in the training sentences, and adjusting the training sentences in the training sample set according to the counting result so as to balance the relation categories of the training sentences in the training sample set;
aiming at a training sentence of which the training sample set comprises an entity, if the entity comprises a plurality of words, adding a first label to a first word in the plurality of words, adding a second label to a last word in the plurality of words, and adding a third label to a word between the first word and the last word; if the entity comprises a word, adding a fourth label to the word;
processing the relation labels of the entity pairs into triples aiming at the training sentences with the entity pairs in the training sample set, wherein the triples comprise the position information of the two entities in the entity pairs and the relation classes among the entities;
processing each training sentence in the training sample set into a target training sentence in an integer identification form through a class dictionary;
aiming at each target training sentence, expanding the target training sentence to obtain a first sentence with the first number of words and a second sentence with the second number of characters;
wherein the parallel training of the CRF and the CNN in the data extraction model by the preprocessed training sample set comprises:
extracting a preset number of target training sentences from a plurality of target training sentences obtained through preprocessing, and retrieving word vector information of a first sentence corresponding to the target training sentence from a preset word vector library for each extracted target training sentence so as to convert the first sentence into a word vector; training a second sentence corresponding to the target training sentence through a BiLSTM network to convert the second sentence into a character vector; splicing the word vector and the character vector obtained by conversion into a mixed feature vector;
coding the mixed feature vector through a BiLSTM model, and outputting corresponding coding information;
inputting the coding information into the CRF to identify the entity in the training sentence, and adding an entity label to the identified entity to obtain an entity labeling sequence of the training sentence;
training the CRF and the CNN in the data extraction model in parallel through the preprocessed training sample set, and further comprising:
for each entity pair in the target training statement, acquiring the feature vector of each of the two entities in the entity pair from the coding information according to the triplet of the entity pair as a first feature vector, and acquiring the statement feature vector between the two entities in the entity pair from the mixed feature vector of the target training statement as a second feature vector;
acquiring entity labels of two entities in the entity pair from the entity labeling sequence, and performing random vectorization representation on the acquired entity labels to obtain a third feature vector;
splicing the first feature vector, the second feature vector and the third feature vector to obtain a target spliced vector;
and inputting the target splicing vector into the CNN to obtain the relationship category of the two entities in the entity pair.
2. The method of claim 1, wherein training the CRF and the CNN in the data extraction model in parallel through the pre-processed training sample set, further comprises:
stopping training when the sum of the loss function of the CRF and the loss function of the CNN converges, or when the iteration number of training reaches a preset maximum value.
3. A data extraction model creation apparatus, the apparatus comprising:
the model establishing module is used for establishing a data extraction model, wherein the data extraction model comprises a conditional random field (CRF) for identifying an entity and a convolutional neural network (CNN) for determining a relationship category of two entities in an entity pair;
the system comprises a preprocessing module, a training module and a training module, wherein the preprocessing module is used for acquiring a training sample set comprising a plurality of training sentences and preprocessing the training sentences in the training sample set;
the training module is used for performing parallel training on the CRF and the CNN in the data extraction model through the preprocessed training sample set;
wherein the preprocessing module comprises:
the relation processing submodule is used for determining the number of relation classes included in the training sentences aiming at the training sentences including a plurality of relation classes in the training sample set, and copying the training sentences according to the number to obtain a plurality of training sentences so as to enable the plurality of training sentences to correspond to the plurality of relation classes one by one;
the balancing submodule is used for analyzing the training sentences, counting the number of the training sentences with the relation category aiming at each relation category in the training sentences, and adjusting the training sentences in the training sample set according to the counting result so as to balance the relation categories of the training sentences in the training sample set;
a first label processing submodule, configured to add, to a training sentence in which the training sample set includes an entity, a first label to a first word of the multiple words, add a second label to a last word of the multiple words, and add a third label to a word between the first word and the last word, if the entity includes multiple words; if the entity comprises a word, adding a fourth label to the word;
a second label processing sub-module, configured to, for a training statement in which an entity pair exists in the training sample set, process a relationship label of the entity pair into a triple, where the triple includes respective position information of two entities in the entity pair and a relationship category between the two entities;
the first conversion submodule is used for processing each training sentence in the training sample set into a target training sentence in an integer identification form through a class dictionary;
the second conversion submodule is used for expanding the target training sentences to obtain first sentences the word number of which is the first number and second sentences the character number of which is the second number aiming at each target training sentence;
the training module comprises:
the first splicing submodule is used for extracting a preset number of target training sentences from the plurality of target training sentences obtained through preprocessing, and, for each extracted target training sentence, retrieving word vector information of the first sentence corresponding to the target training sentence from a preset word vector library so as to convert the first sentence into a word vector; training the second sentence corresponding to the target training sentence through a BiLSTM network so as to convert the second sentence into a character vector; and splicing the converted word vector and character vector into a mixed feature vector;
the coding submodule is used for coding the mixed feature vector through a BiLSTM model and outputting corresponding coding information;
the recognition submodule is used for inputting the coding information into the CRF so as to recognize the entities in the training sentence, and adding entity labels to the recognized entities so as to obtain an entity labeling sequence of the training sentence;
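The first group of training submodules (word-vector lookup, character-level BiLSTM, splicing into a mixed feature vector, BiLSTM encoding, CRF tagging) can be pictured with the sketch below. The layer sizes, the per-word grouping of characters, and the module name are assumptions; the CRF layer itself (for example the third-party pytorch-crf package) is left out and would consume the returned emission scores to produce the entity labeling sequence.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Illustrative encoder: pretrained word vectors plus a character BiLSTM are spliced
    into a mixed feature vector, encoded by a sentence-level BiLSTM, and projected to
    per-token emission scores for a CRF. All dimensions are assumed for the sketch."""
    def __init__(self, word_vectors, n_chars, char_dim=30, char_hidden=25,
                 enc_hidden=128, n_tags=5):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)  # preset word vector library
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True, bidirectional=True)
        mixed_dim = word_vectors.size(1) + 2 * char_hidden
        self.encoder = nn.LSTM(mixed_dim, enc_hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * enc_hidden, n_tags)    # scores a CRF layer would decode

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T)  char_ids: (B, T, C) -- characters of each word, padded
        B, T, C = char_ids.shape
        words = self.word_emb(word_ids)                              # (B, T, word_dim)
        chars = self.char_emb(char_ids.view(B * T, C))               # (B*T, C, char_dim)
        _, (h, _) = self.char_lstm(chars)                            # h: (2, B*T, char_hidden)
        char_vec = torch.cat([h[0], h[1]], dim=-1).view(B, T, -1)    # character vector per word
        mixed = torch.cat([words, char_vec], dim=-1)                 # mixed feature vector
        coding, _ = self.encoder(mixed)                              # coding information
        return coding, self.emissions(coding)
```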
the training module further comprises:
a feature obtaining submodule, configured to, for each entity pair in the target training sentence, obtain, according to the triple of the entity pair, the feature vector of each of the two entities in the entity pair from the coding information as a first feature vector, and obtain, from the mixed feature vector of the target training sentence, the sentence feature vector located between the two entities in the entity pair as a second feature vector; and acquire the entity labels of the two entities in the entity pair from the entity labeling sequence and randomly vectorize the acquired entity labels to obtain a third feature vector;
the second splicing submodule is used for splicing the first feature vector, the second feature vector and the third feature vector to obtain a target spliced vector;
and the relationship extraction submodule is used for inputting the target spliced vector into the CNN to obtain the relationship category of the two entities in the entity pair.
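For completeness, the feature obtaining submodule can be read as slicing per-pair features out of the coding information and the mixed feature vector by means of the triple built during preprocessing. The hypothetical helper below shows one such slicing for a single sentence; its outputs, together with the randomly initialized entity-label embeddings, would feed a relation classifier like the one sketched after claim 1.

```python
import torch

def pair_features(coding, mixed, triple):
    """coding, mixed: (T, D) tensors for one sentence; triple: (head_pos, tail_pos, relation).
    Returns the first feature vector (entity features) and the second feature vector
    (token features between the two entities). Purely illustrative."""
    head_pos, tail_pos, _ = triple
    first = torch.cat([coding[head_pos], coding[tail_pos]], dim=-1)   # entity features
    lo, hi = sorted((head_pos, tail_pos))
    second = mixed[lo + 1:hi]            # may be empty if the two entities are adjacent
    return first, second
```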
4. The apparatus of claim 3, wherein the training module further comprises:
and the stopping submodule is used for stopping the training when the sum of the loss function of the CRF and the loss function of the CNN converges, or when the number of training iterations reaches a preset maximum value.
CN201811251141.5A 2018-10-25 2018-10-25 Data extraction model establishing method and device Active CN109460434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811251141.5A CN109460434B (en) 2018-10-25 2018-10-25 Data extraction model establishing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811251141.5A CN109460434B (en) 2018-10-25 2018-10-25 Data extraction model establishing method and device

Publications (2)

Publication Number Publication Date
CN109460434A CN109460434A (en) 2019-03-12
CN109460434B true CN109460434B (en) 2020-11-03

Family

ID=65608455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811251141.5A Active CN109460434B (en) 2018-10-25 2018-10-25 Data extraction model establishing method and device

Country Status (1)

Country Link
CN (1) CN109460434B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162792A (en) * 2019-05-24 2019-08-23 国家电网有限公司 Electric network data management method and device
CN111027325B (en) * 2019-12-09 2023-11-28 北京知道创宇信息技术股份有限公司 Model generation method, entity identification device and electronic equipment
CN111274412A (en) * 2020-01-22 2020-06-12 腾讯科技(深圳)有限公司 Information extraction method, information extraction model training device and storage medium
CN111737383B (en) * 2020-05-21 2021-11-23 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN113342974B (en) * 2021-06-10 2022-02-08 国网电子商务有限公司 Method, device and equipment for identifying overlapping relationship of network security entities

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9239828B2 (en) * 2013-12-05 2016-01-19 Microsoft Technology Licensing, Llc Recurrent conditional random fields
CN106203485A (en) * 2016-07-01 2016-12-07 北京邮电大学 A kind of parallel training method and device of support vector machine
CN106383816B (en) * 2016-09-26 2018-11-30 大连民族大学 The recognition methods of Chinese minority area place name based on deep learning
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN106886516A (en) * 2017-02-27 2017-06-23 竹间智能科技(上海)有限公司 The method and device of automatic identification statement relationship and entity
CN108304468B (en) * 2017-12-27 2021-12-07 中国银联股份有限公司 Text classification method and text classification device
CN108536679B (en) * 2018-04-13 2022-05-20 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
CN106528609A (en) * 2016-09-28 2017-03-22 厦门理工学院 Vector constraint embedded transformation knowledge graph inference method
CN108256065A (en) * 2018-01-16 2018-07-06 智言科技(深圳)有限公司 Knowledge mapping inference method based on relationship detection and intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme; Suncong Zheng et al.; Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics; 2017-01-07; pp. 1227–1236 *

Also Published As

Publication number Publication date
CN109460434A (en) 2019-03-12

Similar Documents

Publication Publication Date Title
CN109460434B (en) Data extraction model establishing method and device
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
CN111666402B (en) Text abstract generation method, device, computer equipment and readable storage medium
CN111695052A (en) Label classification method, data processing device and readable storage medium
CN107291775B (en) Method and device for generating repairing linguistic data of error sample
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN107451106A (en) Text method and device for correcting, electronic equipment
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN111611346A (en) Text matching method and device based on dynamic semantic coding and double attention
CN111814482B (en) Text key data extraction method and system and computer equipment
CN111695053A (en) Sequence labeling method, data processing device and readable storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN113553847A (en) Method, device, system and storage medium for parsing address text
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN115099233A (en) Semantic analysis model construction method and device, electronic equipment and storage medium
CN112487813B (en) Named entity recognition method and system, electronic equipment and storage medium
CN111126056B (en) Method and device for identifying trigger words
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115454423A (en) Static webpage generation method and device, electronic equipment and storage medium
CN112131363A (en) Automatic question answering method, device, equipment and storage medium
CN111626059A (en) Information processing method and device
CN116821691B (en) Method and device for training emotion recognition model based on task fusion
CN111476022B (en) Character embedding and mixed LSTM entity identification method, system and medium for entity characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant