CN107908642B - Industry text entity extraction method based on distributed platform - Google Patents


Info

Publication number
CN107908642B
CN107908642B (application CN201710902720.0A)
Authority
CN
China
Prior art keywords
entity
text
model
texts
data
Prior art date
Legal status
Active
Application number
CN201710902720.0A
Other languages
Chinese (zh)
Other versions
CN107908642A (en)
Inventor
武克杰
周书勇
Current Assignee
Jiangsu Huatong Data Cloud Technology Co ltd
Original Assignee
Jiangsu Huatong Data Cloud Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Huatong Data Cloud Technology Co ltd filed Critical Jiangsu Huatong Data Cloud Technology Co ltd
Priority to CN201710902720.0A priority Critical patent/CN107908642B/en
Publication of CN107908642A publication Critical patent/CN107908642A/en
Application granted granted Critical
Publication of CN107908642B publication Critical patent/CN107908642B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses an industry text entity extraction method based on a distributed platform, which comprises the following steps: training a text data set with a deep learning neural network to obtain a relation feature model; generating a plurality of resilient distributed datasets (RDDs) from the extracted relation features; extracting class features through a class feature model obtained by training the datasets in the RDDs with an improved nonlinear SVM classification algorithm; finding the entity model corresponding to each context according to the extracted class features, and extracting entity data from the texts of the corresponding class through the trained entity model; and judging whether the number of texts for a context exceeds a set threshold: if so, retraining that context's entity model and extracting entity data from the texts of the corresponding class with the retrained model; otherwise, storing the texts' entity features and text data. The method can process text feature entities under different contexts and effectively improves both entity extraction efficiency and entity extraction accuracy.

Description

Industry text entity extraction method based on distributed platform
Technical Field
The invention relates to a text entity extraction method, in particular to an industry text entity extraction method based on a distributed platform.
Background
Traditional text extraction methods include pattern-matching relation extraction, dictionary-driven relation extraction, machine-learning-based relation extraction, and the like. Most of these methods first extract high-frequency words from the text as candidate entities via word segmentation. They suit scenarios where the entities in the text are relatively uniform, but they cannot effectively distinguish entities in different contexts and may wrongly split or merge entities that should not be split or merged.
Meanwhile, it is difficult for these traditional methods to extract, through word segmentation, words that have not appeared in earlier texts.
Recently, many entity extraction methods based on deep learning have appeared. Their extraction algorithms fall into two classes of models: those with good computational performance but low extraction accuracy, and those with high extraction accuracy but slow computation. For example, fast linear entity extraction models and convolutional neural networks are fast models, while nonlinear entity extraction models and deep neural network models are the more accurate ones.
Chinese patent document CN2017100036859 discloses an online traditional Chinese medicine text named entity recognition method based on deep learning. That method enriches the text training sample set with a crawler and extracts text features with a neural network, which improves the accuracy of entity extraction to a certain extent; however, as the training samples grow, the entity model grows with them, the training time gradually increases, and the feature extraction time increases accordingly.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide an industry text entity extraction method based on a distributed platform, which processes text feature entities under different contexts with multiple resilient distributed entity extraction models on the Spark platform, effectively improving entity extraction efficiency as well as entity extraction accuracy. Meanwhile, the weights in the support vector machine classification algorithm are improved, which enhances the generalization ability of the model and further improves accuracy on text.
The technical scheme of the invention is as follows:
An industry text entity extraction method based on a distributed platform comprises the following steps:
S01: training a text data set with a deep learning neural network to obtain a relation feature model, and extracting the relation features in a target text through the relation feature model;
S02: generating a plurality of resilient distributed datasets (RDDs) from the extracted relation features;
S03: extracting class features through a class feature model obtained by training the datasets in the RDDs with an improved nonlinear SVM classification algorithm;
S04: finding the entity model corresponding to each context according to the extracted class features, and extracting the entity data in the texts of the corresponding class through the trained entity model;
S05: judging whether the number of texts corresponding to the context exceeds a set threshold T; if so, retraining that context's entity model and extracting the entity data in the texts of the corresponding class with the retrained entity model; otherwise, storing the texts' entity features and text data.
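The per-context dispatch of steps S04 and S05 can be sketched in Python. This is a minimal illustration, not the patent's implementation: the helper names, the threshold value, and the stand-in for word2vec training are all hypothetical.

```python
# Sketch of steps S04-S05: route a classified text to its context's entity
# model and retrain when enough texts have accumulated. train_entity_model()
# and extract_entities() are illustrative stand-ins, not from the patent.
T = 3  # retraining threshold on the number of texts per context (illustrative)

def train_entity_model(texts):
    # stand-in for word2vec training on the accumulated context texts
    return {"vocab": set(w for t in texts for w in t.split())}

def extract_entities(model, text):
    # stand-in: report which words of the text the model knows
    return [w for w in text.split() if w in model["vocab"]]

def dispatch(context_texts, entity_models, context, text, T=T):
    """Accumulate one text under its context; retrain and extract once
    the context's text count exceeds T, otherwise just store (S05)."""
    context_texts.setdefault(context, []).append(text)
    if len(context_texts[context]) > T:
        entity_models[context] = train_entity_model(context_texts[context])
        return extract_entities(entity_models[context], text)
    return None  # below threshold: text is stored for later retraining
```

With T = 3, the first three texts of a context are only stored; the fourth triggers retraining and entity extraction.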
Preferably, the step S01 specifically includes:
S11: segmenting the text with the ansj open-source segmenter, counting the frequency of each word over all texts and within the current text, removing common auxiliary words, stop words, and overly frequent words, extracting N words from all the texts according to the relationship between the current-text word frequency and the all-text word frequency, and placing each category in its own folder;
S12: randomly initializing each of the N words as an A-dimensional feature, so that each text forms N×A-dimensional data;
S13: taking each word feature as an input neuron of the deep learning neural network, performing convolution in the first hidden layer, subsampling and local averaging in the second hidden layer, a second convolution in the third hidden layer, and a second subsampling and local averaging in the fourth hidden layer, followed by a fully connected layer that converts a text into B-dimensional data; the accuracy is tuned over multiple tests to obtain the relation feature model.
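The keyword selection of S11 can be sketched as follows. The patent does not define the exact "relationship between the word frequency of the current text and the word frequency of all the texts"; this sketch assumes it is the ratio of the two (a TF-IDF-like score), replaces ansj segmentation with whitespace splitting, and assumes the current text is part of the corpus.

```python
# Illustrative sketch of S11: pick the N words of the current text whose
# local frequency is high relative to their corpus-wide frequency.
from collections import Counter

def top_n_words(current_text, all_texts, stop_words, n):
    """Score each non-stop word by (freq in current text) / (freq over
    all texts) and return the n highest-scoring words."""
    corpus_freq = Counter(w for t in all_texts for w in t.split())
    local_freq = Counter(w for w in current_text.split()
                         if w not in stop_words)
    scores = {w: c / corpus_freq[w] for w, c in local_freq.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Words that are frequent everywhere (like auxiliaries) score low, while words characteristic of the current text score high.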
Preferably, the step S03 specifically includes:
S31: adjusting the weights and offsets in the nonlinear SVM classification algorithm so that the error between the input relation features and the features of the labeled samples falls within a set range, and storing the text class feature model;
S32: the selected classification model is an improved nonlinear SVM classification algorithm whose training model has the classification objective function

min J(w, ε) = (1/2)‖w‖² + (1/2) C Σ_{i=1}^{n} s_i ε_i²,

subject to the classification constraint y_i = w′φ(x_i) + b + ε_i, from which the discriminant function

f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b

is obtained, where the weight w = Σ_{i=1}^{n} α_i φ(x_i), C is a penalty factor (an empirical parameter), i is the RDD number, s_i is the Euclidean distance between positive and negative samples in the relation features (used as the weighting coefficient of the penalty factor), b is the threshold during classification, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping;
S33: gradually adjusting the penalty factor C and selecting the optimal C by testing, where the nonlinear kernel is the intersection kernel K(x, x_s) = Σ_i min(x(i), x_s(i)), with x(i), x_s(i) being the feature vectors extracted from any two text relation feature samples; the label of each class of relation feature samples is the corresponding class number, and the α_i and b of the discriminant function are obtained through repeated offline training, the discriminant function f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b being the corresponding class feature model.
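The min() kernel named in S33 is a histogram intersection kernel: K(x, x_s) = Σ_j min(x(j), x_s(j)). A numpy sketch of the corresponding Gram matrix is below; the SVM training itself (which would consume this matrix as a precomputed kernel) is omitted, and the input values are illustrative.

```python
# Intersection ("min") kernel Gram matrix between rows of X and rows of Y.
import numpy as np

def min_kernel_gram(X, Y):
    """K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    # broadcast to (len(X), len(Y), dim), take elementwise min, sum features
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)
```

The resulting matrix is symmetric when X and Y coincide, and each diagonal entry equals the sum of that sample's features, as expected for an intersection kernel on non-negative feature vectors.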
Preferably, in step S03, sample texts that are extracted poorly or with obvious errors are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal.
Compared with the prior art, the invention has the advantages that:
the invention improves a classification algorithm model, wherein a weighting coefficient of a punishment factor is mainly added in a classification target function of a training model, the generalization capability of the training classification model is enhanced, and a nonlinear kernel function min (x (i), x) is adopteds(i) To enable the corresponding category of text to be accurately found. Meanwhile, the text extraction entity model is divided into the text extraction entity models of a plurality of scenes through the distributed spark platform, the problems of large training and calculation load of the traditional text extraction entity are solved, the entities in each text can be extracted quickly, and the accuracy can be improvedAnd extracting the text entity.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of an industry text entity extraction method based on a distributed platform.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
Example (b):
as shown in fig. 1, the industry text entity extraction method based on the distributed platform includes the following steps:
(1) Text acquisition: the text data of each industry is acquired through the Akka communication module of the open-source Spark platform, and the text data from which entities are to be extracted, collected by monitoring equipment, is transmitted to the distributed Spark platform.
(2) A Spark platform cluster is constructed, with one server as the management node and four servers as service nodes. The management node records the dependencies between data streams and is responsible for task scheduling and for generating new RDDs. The service nodes run the analysis algorithms and store the data.
(3) An existing text data set is trained with a deep learning neural network to obtain a relation feature model, and the relation features in a new text are extracted with that model.
the generation of the relationship characteristic model specifically comprises the following steps:
S21: the text is segmented with the ansj open-source segmenter; the frequency of each word over all texts and within the current text is then counted; common auxiliary words, stop words, and overly frequent words are removed; N important words are extracted according to the relationship between the current-text word frequency and the all-text word frequency; and each class is placed in its own folder.
S22: each word is then randomly assigned a 200-dimensional feature vector, so that each text sample forms N×200-dimensional data.
S23: the relation features of each word serve as the input neurons of the deep learning neural network; convolution is performed in the first hidden layer, subsampling and local averaging in the second, a second convolution in the third, and a second subsampling and local averaging in the fourth, followed by a fully connected layer that converts the N×200-dimensional data into 1000-dimensional data. 70% of the data is used for training and 30% for testing. The trained deep network model is adjusted step by step over multiple accuracy tests, and the optimal network model finally obtained is the relation model for generating text features.
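The "subsampling and local averaging" of the hidden layers in S23 corresponds to average pooling. A minimal numpy sketch follows; the 2×2 window and the input values are illustrative, since the patent does not specify them.

```python
# Average pooling over non-overlapping 2x2 blocks of a 2D feature map,
# i.e. the "subsampling and local averaging" step of the hidden layers.
import numpy as np

def avg_pool_2x2(x):
    """Halve each spatial dimension by averaging 2x2 blocks."""
    h, w = x.shape
    # trim odd edges, split each axis into (blocks, within-block), average
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

Each output value is the local average of a 2×2 neighborhood, so a 4×4 map subsamples to 2×2.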
(4) The extracted relation feature text data is converted into resilient distributed (RDD) relation feature text data, which is then divided into multiple RDDs according to the text context feature stream for sharded processing.
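The sharding in step (4) amounts to grouping relation-feature records by context so that each group can become one RDD partition. A pure-Python sketch (a dict stands in for Spark's RDDs, and the record format is assumed, not specified by the patent):

```python
# Group (context, feature_vector) records by context, as a stand-in for
# splitting the feature stream into per-context RDDs on the Spark platform.
from collections import defaultdict

def shard_by_context(records):
    """records: iterable of (context, feature_vector) pairs."""
    shards = defaultdict(list)
    for context, features in records:
        shards[context].append(features)
    return dict(shards)
```

In Spark this grouping would typically be expressed with `groupByKey` or `partitionBy` on a pair RDD.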
(5) The converted RDD feature text data is turned into class features by the class feature model trained with the improved nonlinear SVM classification algorithm; the training data set is an existing, already-classified industry text data set. Since the distributed Spark platform computes quickly, the industry text data set can be rapidly retrained for correction to obtain a new class feature model.
Sample texts that extract poorly or with obvious errors are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal. New text sets can form different categories, and thanks to the distributed nature of the Spark platform, the corresponding entities can be extracted quickly from all samples by the entity models of the corresponding types. As the number of categories grows, the corresponding multi-category entity models become more robust and entity extraction becomes more accurate.
The class feature model trained with the improved nonlinear SVM classification algorithm is obtained as follows:
An improved support vector machine model is selected as the training classification model; its training classification objective function is

min J(w, ε) = (1/2)‖w‖² + (1/2) C Σ_{i=1}^{n} s_i ε_i²,

with the corresponding constraint y_i = w′φ(x_i) + b + ε_i. From the objective function and the constraint, the discriminant function

f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b

is derived, where the weight w = Σ_{i=1}^{n} α_i φ(x_i), C is a penalty factor (an adjustable parameter), i numbers the 1 to n training text samples, s_i is the Euclidean distance between positive and negative samples and serves as the weighting coefficient of the penalty factor in the objective function, b is the threshold, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping.
The penalty factor C is set between 1 and 100; the features of the positive and negative samples prepared in advance are extracted, and the corresponding kernel K(x, x_s) = Σ_i min(x(i), x_s(i)) is computed, where x(i), x_s(i) are the feature vectors extracted from any two positive and negative samples. The label of a positive sample is 1 and that of a negative sample is -1; the α_i and b of the discriminant function are obtained by offline training, and the discriminant function f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b is the corresponding nonlinear SVM detection model.
The category of the text context is output according to the value of the detection model result y_i.
(6) Finding a corresponding context entity model according to the category characteristics of the text, and extracting and selecting entity data in the text of a corresponding type through the trained entity model, wherein the context entity model is the entity model of the text obtained by training an open source word2vec tool on an existing industry text data set;
(7) when the number of texts of a certain scene exceeds a threshold value T, the scene entity model is retrained by using a word2vec tool, and when the number of texts of the certain scene exceeds the threshold value T, data can be stored on a distributed platform firstly, wherein T generally exceeds the number of 1 ten thousand samples.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (3)

1. An industry text entity extraction method based on a distributed platform is characterized by comprising the following steps:
S01: training a text data set with a deep learning neural network to obtain a relation feature model, and extracting the relation features in a target text through the relation feature model;
S02: generating a plurality of resilient distributed datasets (RDDs) from the extracted relation features;
S03: extracting class features through a class feature model obtained by training the datasets in the RDDs with an improved nonlinear SVM classification algorithm; the step S03 specifically includes:
S31: adjusting the weights and offsets in the nonlinear SVM classification algorithm so that the error between the input relation features and the features of the labeled samples falls within a set range, and storing the text class feature model;
S32: the selected classification model is an improved nonlinear SVM classification algorithm whose training model has the classification objective function

min J(w, ε) = (1/2)‖w‖² + (1/2) C Σ_{i=1}^{n} s_i ε_i²,

where the weighting coefficient s_i is of the form sin(·), subject to the classification constraint y_i = w′φ(x_i) + b + ε_i, from which the discriminant function

f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b

is obtained, where the weight w = Σ_{i=1}^{n} α_i φ(x_i), C is a penalty factor (an empirical parameter), i is the RDD number, s_i is the Euclidean distance between the positive and negative samples in the relation features, b is the threshold during classification, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping;
S33: gradually adjusting the penalty factor C and selecting the optimal C by testing, where the nonlinear kernel is K(x, x_s) = Σ_i min(x(i), x_s(i)), with x(i), x_s(i) being the feature vectors extracted from any two text relation feature samples; the label of each class of relation feature samples is the corresponding class number, and the α_i and b of the discriminant function are obtained through repeated offline training, the discriminant function f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b being the corresponding class feature model;
S04: finding the entity model corresponding to each context according to the extracted class features, and extracting the entity data in the texts of the corresponding class through the trained entity model;
S05: judging whether the number of texts corresponding to the context exceeds a set threshold T; if so, retraining that context's entity model and extracting the entity data in the texts of the corresponding class with the retrained entity model; otherwise, storing the texts' entity features and text data.
2. The distributed platform-based industry textual entity extraction method according to claim 1, wherein the step S01 specifically comprises:
S11: segmenting the text with the ansj open-source segmenter, counting the frequency of each word over all texts and within the current text, removing common auxiliary words, stop words, and overly frequent words, extracting N words from all the texts according to the relationship between the current-text word frequency and the all-text word frequency, and placing each category in its own folder;
S12: randomly initializing each of the N words as an A-dimensional feature, so that each text forms N×A-dimensional data;
S13: taking each word feature as an input neuron of the deep learning neural network, performing convolution in the first hidden layer, subsampling and local averaging in the second hidden layer, a second convolution in the third hidden layer, and a second subsampling and local averaging in the fourth hidden layer, followed by a fully connected layer that converts a text into B-dimensional data; the accuracy is tuned over multiple tests to obtain the relation feature model.
3. The distributed-platform-based industry text entity extraction method according to claim 1, wherein in step S03, sample texts that are extracted poorly or with obvious errors are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal.
CN201710902720.0A 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform Active CN107908642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710902720.0A CN107908642B (en) 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710902720.0A CN107908642B (en) 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform

Publications (2)

Publication Number Publication Date
CN107908642A CN107908642A (en) 2018-04-13
CN107908642B true CN107908642B (en) 2021-11-12

Family

ID=61840291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710902720.0A Active CN107908642B (en) 2017-09-29 2017-09-29 Industry text entity extraction method based on distributed platform

Country Status (1)

Country Link
CN (1) CN107908642B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508757B (en) * 2018-10-30 2020-10-09 北京陌上花科技有限公司 Data processing method and device for character recognition
CN111274348B (en) * 2018-12-04 2023-05-12 北京嘀嘀无限科技发展有限公司 Service feature data extraction method and device and electronic equipment
CN109754014B (en) * 2018-12-29 2021-04-27 北京航天数据股份有限公司 Industrial model training method, device, equipment and medium
CN111950279B (en) * 2019-05-17 2023-06-23 百度在线网络技术(北京)有限公司 Entity relationship processing method, device, equipment and computer readable storage medium
CN112052646B (en) * 2020-08-27 2024-03-29 安徽聚戎科技信息咨询有限公司 Text data labeling method
CN114756385B (en) * 2022-06-16 2022-09-02 合肥中科类脑智能技术有限公司 Elastic distributed training method under deep learning scene

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7284191B2 (en) * 2001-08-13 2007-10-16 Xerox Corporation Meta-document management system with document identifiers
US20170124181A1 (en) * 2015-10-30 2017-05-04 Oracle International Corporation Automatic fuzzy matching of entities in context
CN105389378A (en) * 2015-11-19 2016-03-09 广州精标信息科技有限公司 System for integrating separate data
US9940384B2 (en) * 2015-12-15 2018-04-10 International Business Machines Corporation Statistical clustering inferred from natural language to drive relevant analysis and conversation with users
CN106599032B (en) * 2016-10-27 2020-01-14 浙江大学 Text event extraction method combining sparse coding and structure sensing machine

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于深度学习的医疗命名实体识别" ("Medical Named Entity Recognition Based on Deep Learning"); 张帆 et al.; 《计算技术与自动化》 (Computing Technology and Automation); 2017-03-31; Vol. 36, No. 1, pp. 123-127 *

Also Published As

Publication number Publication date
CN107908642A (en) 2018-04-13

Similar Documents

Publication Publication Date Title
CN107908642B (en) Industry text entity extraction method based on distributed platform
CN110826630B (en) Radar interference signal feature level fusion identification method based on deep convolutional neural network
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN108710651B (en) Automatic classification method for large-scale customer complaint data
US20160267359A1 (en) Image object category recognition method and device
CN110046634B (en) Interpretation method and device of clustering result
CN104615676B (en) One kind being based on the matched picture retrieval method of maximum similarity
CN107480688B (en) Fine-grained image identification method based on zero sample learning
CN106971180B (en) A kind of micro- expression recognition method based on the sparse transfer learning of voice dictionary
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN107357895B (en) Text representation processing method based on bag-of-words model
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN111507350A (en) Text recognition method and device
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
Delgado et al. Fast single-and cross-show speaker diarization using binary key speaker modeling
CN107527058B (en) Image retrieval method based on weighted local feature aggregation descriptor
CN112990371B (en) Unsupervised night image classification method based on feature amplification
Ghayoumi et al. Local sensitive hashing (LSH) and convolutional neural networks (CNNs) for object recognition
Hao et al. Improvement of word bag model based on image classification
CN104008095A (en) Object recognition method based on semantic feature extraction and matching
US20150332173A1 (en) Learning method, information conversion device, and recording medium
Dileep et al. Speaker recognition using pyramid match kernel based support vector machines
Rothacker et al. Robust output modeling in bag-of-features HMMs for handwriting recognition
CN106202562B (en) method for reducing false judgment rate of sensitive information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant