CN107908642B - Industry text entity extraction method based on distributed platform - Google Patents


Info

Publication number
CN107908642B
CN107908642B (application CN201710902720.0A)
Authority
CN
China
Prior art keywords
entity
text
model
texts
data
Prior art date
Legal status
Active
Application number
CN201710902720.0A
Other languages
Chinese (zh)
Other versions
CN107908642A (en)
Inventor
武克杰
周书勇
Current Assignee
Jiangsu Huatong Data Cloud Technology Co ltd
Original Assignee
Jiangsu Huatong Data Cloud Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Huatong Data Cloud Technology Co ltd filed Critical Jiangsu Huatong Data Cloud Technology Co ltd
Priority to CN201710902720.0A priority Critical patent/CN107908642B/en
Publication of CN107908642A publication Critical patent/CN107908642A/en
Application granted granted Critical
Publication of CN107908642B publication Critical patent/CN107908642B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses an industry text entity extraction method based on a distributed platform, which comprises the following steps: training a text data set with a deep learning neural network to obtain a relation feature model; generating a plurality of resilient distributed datasets (RDDs) from the extracted relation features; extracting class features through a class feature model obtained by training the datasets in the RDDs with an improved nonlinear SVM classification algorithm; finding the entity model corresponding to each context according to the extracted class features, and extracting entity data from the texts of the corresponding class through the trained entity model; and judging whether the number of texts for a context exceeds a set threshold: if so, retraining that context's entity model and extracting entity data from the texts of the corresponding class with the retrained model; otherwise, storing the texts' entity features and text data. The method can process text feature entities under different contexts and effectively improves both entity extraction efficiency and entity extraction accuracy.

Description

Industry text entity extraction method based on distributed platform
Technical Field
The invention relates to a text entity extraction method, in particular to an industry text entity extraction method based on a distributed platform.
Background
Traditional text extraction methods include pattern-matching relation extraction, dictionary-driven relation extraction, machine-learning-based relation extraction, and the like. Most of these methods first extract high-frequency words from the text as candidate entities via word segmentation. They suit scenarios where the entities in the text are relatively uniform, but they cannot effectively distinguish entities in different contexts and may wrongly split or merge entities that should not be split or merged.
Meanwhile, it is difficult for these traditional methods to extract, through word segmentation, words that have not appeared in earlier texts.
Recently, many entity extraction methods based on deep learning have appeared. Their extraction algorithms fall into two classes of models: those with good computational performance but low extraction accuracy, and those with high extraction accuracy but slow computation. For example, fast linear entity extraction models and convolutional neural networks are fast models, while nonlinear entity extraction models and deep neural network models are the more accurate ones.
Chinese patent document CN2017100036859 discloses an online traditional Chinese medicine text named entity recognition method based on deep learning. That method enriches the text training sample set with a crawler and extracts text features with a neural network, which improves the accuracy of entity extraction to a certain extent; however, as the training samples grow, the entity model grows with them, the training time gradually increases, and the feature extraction time increases accordingly.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide an industry text entity extraction method based on a distributed platform, which processes text feature entities under different contexts with multiple resilient distributed entity extraction models on the Spark platform, effectively improving entity extraction efficiency as well as entity extraction accuracy. Meanwhile, the weights in the support vector machine classification algorithm are improved, which enhances the generalization ability of the model and further improves accuracy on text.
The technical scheme of the invention is as follows:
An industry text entity extraction method based on a distributed platform comprises the following steps:
S01: training a text data set with a deep learning neural network to obtain a relation feature model, and extracting the relation features in a target text through the relation feature model;
S02: generating a plurality of resilient distributed datasets (RDDs) from the extracted relation features;
S03: extracting class features through a class feature model obtained by training the datasets in the RDDs with an improved nonlinear SVM classification algorithm;
S04: finding the entity model corresponding to each context according to the extracted class features, and extracting the entity data in the texts of the corresponding class through the trained entity model;
S05: judging whether the number of texts corresponding to the context exceeds a set threshold T; if so, retraining that context's entity model and extracting the entity data in the texts of the corresponding class with the retrained entity model; otherwise, storing the texts' entity features and text data.
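The per-context dispatch of steps S04 and S05 can be sketched in Python. This is a minimal illustration, not the patent's implementation: the helper names, the threshold value, and the stand-in for word2vec training are all hypothetical.

```python
# Sketch of steps S04-S05: route a classified text to its context's entity
# model and retrain when enough texts have accumulated. train_entity_model()
# and extract_entities() are illustrative stand-ins, not from the patent.
T = 3  # retraining threshold on the number of texts per context (illustrative)

def train_entity_model(texts):
    # stand-in for word2vec training on the accumulated context texts
    return {"vocab": set(w for t in texts for w in t.split())}

def extract_entities(model, text):
    # stand-in: report which words of the text the model knows
    return [w for w in text.split() if w in model["vocab"]]

def dispatch(context_texts, entity_models, context, text, T=T):
    """Accumulate one text under its context; retrain and extract once
    the context's text count exceeds T, otherwise just store (S05)."""
    context_texts.setdefault(context, []).append(text)
    if len(context_texts[context]) > T:
        entity_models[context] = train_entity_model(context_texts[context])
        return extract_entities(entity_models[context], text)
    return None  # below threshold: text is stored for later retraining
```

With T = 3, the first three texts of a context are only stored; the fourth triggers retraining and entity extraction.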
Preferably, the step S01 specifically includes:
S11: segmenting the text with the ansj open-source segmenter, counting the frequency of each word over all texts and within the current text, removing common auxiliary words, stop words, and overly frequent words, extracting N words from all the texts according to the relationship between the current-text word frequency and the all-text word frequency, and placing each category in its own folder;
S12: randomly initializing each of the N words as an A-dimensional feature, so that each text forms N×A-dimensional data;
S13: taking each word feature as an input neuron of the deep learning neural network, performing convolution in the first hidden layer, subsampling and local averaging in the second hidden layer, a second convolution in the third hidden layer, and a second subsampling and local averaging in the fourth hidden layer, followed by a fully connected layer that converts a text into B-dimensional data; the accuracy is tuned over multiple tests to obtain the relation feature model.
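The keyword selection of S11 can be sketched as follows. The patent does not define the exact "relationship between the word frequency of the current text and the word frequency of all the texts"; this sketch assumes it is the ratio of the two (a TF-IDF-like score), replaces ansj segmentation with whitespace splitting, and assumes the current text is part of the corpus.

```python
# Illustrative sketch of S11: pick the N words of the current text whose
# local frequency is high relative to their corpus-wide frequency.
from collections import Counter

def top_n_words(current_text, all_texts, stop_words, n):
    """Score each non-stop word by (freq in current text) / (freq over
    all texts) and return the n highest-scoring words."""
    corpus_freq = Counter(w for t in all_texts for w in t.split())
    local_freq = Counter(w for w in current_text.split()
                         if w not in stop_words)
    scores = {w: c / corpus_freq[w] for w, c in local_freq.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Words that are frequent everywhere (like auxiliaries) score low, while words characteristic of the current text score high.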
Preferably, the step S03 specifically includes:
S31: adjusting the weights and offsets in the nonlinear SVM classification algorithm so that the error between the input relation features and the features of the labeled samples falls within a set range, and storing the text class feature model;
S32: the selected classification model is an improved nonlinear SVM classification algorithm whose training model has the classification objective function

min J(w, ε) = (1/2)‖w‖² + (1/2) C Σ_{i=1}^{n} s_i ε_i²,

subject to the classification constraint y_i = w′φ(x_i) + b + ε_i, from which the discriminant function

f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b

is obtained, where the weight w = Σ_{i=1}^{n} α_i φ(x_i), C is a penalty factor (an empirical parameter), i is the RDD number, s_i is the Euclidean distance between positive and negative samples in the relation features (used as the weighting coefficient of the penalty factor), b is the threshold during classification, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping;
S33: gradually adjusting the penalty factor C and selecting the optimal C by testing, where the nonlinear kernel is the intersection kernel K(x, x_s) = Σ_i min(x(i), x_s(i)), with x(i), x_s(i) being the feature vectors extracted from any two text relation feature samples; the label of each class of relation feature samples is the corresponding class number, and the α_i and b of the discriminant function are obtained through repeated offline training, the discriminant function f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b being the corresponding class feature model.
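The min() kernel named in S33 is a histogram intersection kernel: K(x, x_s) = Σ_j min(x(j), x_s(j)). A numpy sketch of the corresponding Gram matrix is below; the SVM training itself (which would consume this matrix as a precomputed kernel) is omitted, and the input values are illustrative.

```python
# Intersection ("min") kernel Gram matrix between rows of X and rows of Y.
import numpy as np

def min_kernel_gram(X, Y):
    """K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    # broadcast to (len(X), len(Y), dim), take elementwise min, sum features
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)
```

The resulting matrix is symmetric when X and Y coincide, and each diagonal entry equals the sum of that sample's features, as expected for an intersection kernel on non-negative feature vectors.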
Preferably, in step S03, sample texts that are extracted poorly or with obvious errors are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal.
Compared with the prior art, the invention has the advantages that:
the invention improves a classification algorithm model, wherein a weighting coefficient of a punishment factor is mainly added in a classification target function of a training model, the generalization capability of the training classification model is enhanced, and a nonlinear kernel function min (x (i), x) is adopteds(i) To enable the corresponding category of text to be accurately found. Meanwhile, the text extraction entity model is divided into the text extraction entity models of a plurality of scenes through the distributed spark platform, the problems of large training and calculation load of the traditional text extraction entity are solved, the entities in each text can be extracted quickly, and the accuracy can be improvedAnd extracting the text entity.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of an industry text entity extraction method based on a distributed platform.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
Example (b):
as shown in fig. 1, the industry text entity extraction method based on the distributed platform includes the following steps:
(1) Text acquisition: the text data of each industry is acquired through the Akka communication module of the open-source Spark platform, and the text data from which entities are to be extracted, collected by monitoring equipment, is transmitted to the distributed Spark platform.
(2) A Spark platform cluster is constructed, with one server as the management node and four servers as service nodes. The management node records the dependencies between data streams and is responsible for task scheduling and for generating new RDDs. The service nodes run the analysis algorithms and store the data.
(3) An existing text data set is trained with a deep learning neural network to obtain a relation feature model, and the relation features in a new text are extracted with that model.
the generation of the relationship characteristic model specifically comprises the following steps:
S21: the text is segmented with the ansj open-source segmenter; the frequency of each word over all texts and within the current text is then counted; common auxiliary words, stop words, and overly frequent words are removed; N important words are extracted according to the relationship between the current-text word frequency and the all-text word frequency; and each class is placed in its own folder.
S22: each word is then randomly assigned a 200-dimensional feature vector, so that each text sample forms N×200-dimensional data.
S23: the relation features of each word serve as the input neurons of the deep learning neural network; convolution is performed in the first hidden layer, subsampling and local averaging in the second, a second convolution in the third, and a second subsampling and local averaging in the fourth, followed by a fully connected layer that converts the N×200-dimensional data into 1000-dimensional data. 70% of the data is used for training and 30% for testing. The trained deep network model is adjusted step by step over multiple accuracy tests, and the optimal network model finally obtained is the relation model for generating text features.
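The "subsampling and local averaging" of the hidden layers in S23 corresponds to average pooling. A minimal numpy sketch follows; the 2×2 window and the input values are illustrative, since the patent does not specify them.

```python
# Average pooling over non-overlapping 2x2 blocks of a 2D feature map,
# i.e. the "subsampling and local averaging" step of the hidden layers.
import numpy as np

def avg_pool_2x2(x):
    """Halve each spatial dimension by averaging 2x2 blocks."""
    h, w = x.shape
    # trim odd edges, split each axis into (blocks, within-block), average
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

Each output value is the local average of a 2×2 neighborhood, so a 4×4 map subsamples to 2×2.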
(4) The extracted relation feature text data is converted into resilient distributed (RDD) relation feature text data, which is then divided into multiple RDDs according to the text context feature stream for sharded processing.
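The sharding in step (4) amounts to grouping relation-feature records by context so that each group can become one RDD partition. A pure-Python sketch (a dict stands in for Spark's RDDs, and the record format is assumed, not specified by the patent):

```python
# Group (context, feature_vector) records by context, as a stand-in for
# splitting the feature stream into per-context RDDs on the Spark platform.
from collections import defaultdict

def shard_by_context(records):
    """records: iterable of (context, feature_vector) pairs."""
    shards = defaultdict(list)
    for context, features in records:
        shards[context].append(features)
    return dict(shards)
```

In Spark this grouping would typically be expressed with `groupByKey` or `partitionBy` on a pair RDD.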
(5) The converted RDD feature text data is turned into class features by the class feature model trained with the improved nonlinear SVM classification algorithm; the training data set is an existing, already-classified industry text data set. Since the distributed Spark platform computes quickly, the industry text data set can be rapidly retrained for correction to obtain a new class feature model.
Sample texts that extract poorly or with obvious errors are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal. New text sets can form different categories, and thanks to the distributed nature of the Spark platform, the corresponding entities can be extracted quickly from all samples by the entity models of the corresponding types. As the number of categories grows, the corresponding multi-category entity models become more robust and entity extraction becomes more accurate.
The class feature model trained with the improved nonlinear SVM classification algorithm is obtained as follows:
An improved support vector machine model is selected as the training classification model; its training classification objective function is

min J(w, ε) = (1/2)‖w‖² + (1/2) C Σ_{i=1}^{n} s_i ε_i²,

with the corresponding constraint y_i = w′φ(x_i) + b + ε_i. From the objective function and the constraint, the discriminant function

f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b

is derived, where the weight w = Σ_{i=1}^{n} α_i φ(x_i), C is a penalty factor (an adjustable parameter), i numbers the 1 to n training text samples, s_i is the Euclidean distance between positive and negative samples and serves as the weighting coefficient of the penalty factor in the objective function, b is the threshold, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping.
The penalty factor C is set between 1 and 100; the features of the positive and negative samples prepared in advance are extracted, and the corresponding kernel K(x, x_s) = Σ_i min(x(i), x_s(i)) is computed, where x(i), x_s(i) are the feature vectors extracted from any two positive and negative samples. The label of a positive sample is 1 and that of a negative sample is -1; the α_i and b of the discriminant function are obtained by offline training, and the discriminant function f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b is the corresponding nonlinear SVM detection model.
The category of the text context is output according to the value of the detection model result y_i.
(6) Finding a corresponding context entity model according to the category characteristics of the text, and extracting and selecting entity data in the text of a corresponding type through the trained entity model, wherein the context entity model is the entity model of the text obtained by training an open source word2vec tool on an existing industry text data set;
(7) when the number of texts of a certain scene exceeds a threshold value T, the scene entity model is retrained by using a word2vec tool, and when the number of texts of the certain scene exceeds the threshold value T, data can be stored on a distributed platform firstly, wherein T generally exceeds the number of 1 ten thousand samples.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (3)

1. An industry text entity extraction method based on a distributed platform is characterized by comprising the following steps:
S01: training a text data set with a deep learning neural network to obtain a relation feature model, and extracting the relation features in a target text through the relation feature model;
S02: generating a plurality of resilient distributed datasets (RDDs) from the extracted relation features;
S03: extracting class features through a class feature model obtained by training the datasets in the RDDs with an improved nonlinear SVM classification algorithm; the step S03 specifically includes:
S31: adjusting the weights and offsets in the nonlinear SVM classification algorithm so that the error between the input relation features and the features of the labeled samples falls within a set range, and storing the text class feature model;
S32: the selected classification model is an improved nonlinear SVM classification algorithm whose training model has the classification objective function

min J(w, ε) = (1/2)‖w‖² + (1/2) C Σ_{i=1}^{n} s_i ε_i²,

where the weighting coefficient s_i is of the form sin(·), subject to the classification constraint y_i = w′φ(x_i) + b + ε_i, from which the discriminant function

f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b

is obtained, where the weight w = Σ_{i=1}^{n} α_i φ(x_i), C is a penalty factor (an empirical parameter), i is the RDD number, s_i is the Euclidean distance between the positive and negative samples in the relation features, b is the threshold during classification, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping;
S33: gradually adjusting the penalty factor C and selecting the optimal C by testing, where the nonlinear kernel is K(x, x_s) = Σ_i min(x(i), x_s(i)), with x(i), x_s(i) being the feature vectors extracted from any two text relation feature samples; the label of each class of relation feature samples is the corresponding class number, and the α_i and b of the discriminant function are obtained through repeated offline training, the discriminant function f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b being the corresponding class feature model;
S04: finding the entity model corresponding to each context according to the extracted class features, and extracting the entity data in the texts of the corresponding class through the trained entity model;
S05: judging whether the number of texts corresponding to the context exceeds a set threshold T; if so, retraining that context's entity model and extracting the entity data in the texts of the corresponding class with the retrained entity model; otherwise, storing the texts' entity features and text data.
2. The distributed platform-based industry textual entity extraction method according to claim 1, wherein the step S01 specifically comprises:
S11: segmenting the text with the ansj open-source segmenter, counting the frequency of each word over all texts and within the current text, removing common auxiliary words, stop words, and overly frequent words, extracting N words from all the texts according to the relationship between the current-text word frequency and the all-text word frequency, and placing each category in its own folder;
S12: randomly initializing each of the N words as an A-dimensional feature, so that each text forms N×A-dimensional data;
S13: taking each word feature as an input neuron of the deep learning neural network, performing convolution in the first hidden layer, subsampling and local averaging in the second hidden layer, a second convolution in the third hidden layer, and a second subsampling and local averaging in the fourth hidden layer, followed by a fully connected layer that converts a text into B-dimensional data; the accuracy is tuned over multiple tests to obtain the relation feature model.
3. The distributed-platform-based industry text entity extraction method according to claim 1, wherein in step S03, sample texts that are extracted poorly or with obvious errors are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal.
CN201710902720.0A 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform Active CN107908642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710902720.0A CN107908642B (en) 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710902720.0A CN107908642B (en) 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform

Publications (2)

Publication Number Publication Date
CN107908642A CN107908642A (en) 2018-04-13
CN107908642B true CN107908642B (en) 2021-11-12

Family

ID=61840291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710902720.0A Active CN107908642B (en) 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform

Country Status (1)

Country Link
CN (1) CN107908642B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508757B (en) * 2018-10-30 2020-10-09 北京陌上花科技有限公司 Data processing method and device for character recognition
CN111274348B (en) * 2018-12-04 2023-05-12 北京嘀嘀无限科技发展有限公司 Service feature data extraction method and device and electronic equipment
CN109754014B (en) * 2018-12-29 2021-04-27 北京航天数据股份有限公司 Industrial model training method, device, equipment and medium
CN111950279B (en) * 2019-05-17 2023-06-23 百度在线网络技术(北京)有限公司 Entity relationship processing method, device, equipment and computer readable storage medium
CN112052646B (en) * 2020-08-27 2024-03-29 安徽聚戎科技信息咨询有限公司 Text data labeling method
CN114756385B (en) * 2022-06-16 2022-09-02 合肥中科类脑智能技术有限公司 Elastic distributed training method under deep learning scene

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7284191B2 (en) * 2001-08-13 2007-10-16 Xerox Corporation Meta-document management system with document identifiers
US20170124181A1 (en) * 2015-10-30 2017-05-04 Oracle International Corporation Automatic fuzzy matching of entities in context
CN105389378A (en) * 2015-11-19 2016-03-09 广州精标信息科技有限公司 System for integrating separate data
US9940384B2 (en) * 2015-12-15 2018-04-10 International Business Machines Corporation Statistical clustering inferred from natural language to drive relevant analysis and conversation with users
CN106599032B (en) * 2016-10-27 2020-01-14 浙江大学 Text event extraction method combining sparse coding and structure sensing machine

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于深度学习的医疗命名实体识别" ("Medical Named Entity Recognition Based on Deep Learning"); 张帆 et al.; 《计算技术与自动化》 (Computing Technology and Automation); 2017-03-31; Vol. 36, No. 1, pp. 123-127 *

Also Published As

Publication number Publication date
CN107908642A (en) 2018-04-13

Similar Documents

Publication Publication Date Title
CN107908642B (en) Industry text entity extraction method based on distributed platform
CN110826630B (en) Radar interference signal feature level fusion identification method based on deep convolutional neural network
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN108710651B (en) Automatic classification method for large-scale customer complaint data
US20160267359A1 (en) Image object category recognition method and device
CN110046634B (en) Interpretation method and device of clustering result
CN104615676B (en) One kind being based on the matched picture retrieval method of maximum similarity
CN107480688B (en) Fine-grained image identification method based on zero sample learning
CN106971180B (en) A kind of micro- expression recognition method based on the sparse transfer learning of voice dictionary
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN107357895B (en) Text representation processing method based on bag-of-words model
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN111507350A (en) Text recognition method and device
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
Delgado et al. Fast single-and cross-show speaker diarization using binary key speaker modeling
CN107527058B (en) Image retrieval method based on weighted local feature aggregation descriptor
CN112990371B (en) Unsupervised night image classification method based on feature amplification
Ghayoumi et al. Local sensitive hashing (LSH) and convolutional neural networks (CNNs) for object recognition
Hao et al. Improvement of word bag model based on image classification
CN104008095A (en) Object recognition method based on semantic feature extraction and matching
US20150332173A1 (en) Learning method, information conversion device, and recording medium
Dileep et al. Speaker recognition using pyramid match kernel based support vector machines
Rothacker et al. Robust output modeling in bag-of-features HMMs for handwriting recognition
CN106202562B (en) method for reducing false judgment rate of sensitive information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant