CN107908642B - Industry text entity extraction method based on distributed platform - Google Patents
- Publication number: CN107908642B (application CN201710902720.0A)
- Authority: CN (China)
- Prior art keywords: entity, text, model, texts, data
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses an industry text entity extraction method based on a distributed platform, which comprises the following steps: training a text data set with a deep-learning neural network to obtain a relation feature model; generating multiple resilient distributed datasets (RDDs) from the extracted relation features; extracting category features from the RDD data through a category feature model trained with an improved nonlinear SVM classification algorithm; finding the entity model corresponding to each context according to the extracted category features, and extracting the entity data in the texts of that category through the trained entity model; and judging whether the number of texts for a context exceeds a set threshold: if so, retraining that context's entity model and extracting the entity data of the corresponding category with the retrained model; otherwise, storing the texts' entity features and text data. The method can process text feature entities under different contexts and effectively improves both entity extraction efficiency and accuracy.
Description
Technical Field
The invention relates to a text entity extraction method, in particular to an industry text entity extraction method based on a distributed platform.
Background
Traditional text extraction methods include pattern-matching relation extraction, dictionary-driven relation extraction, machine-learning-based relation extraction, and the like. Most of these methods first segment the text and take its high-frequency words as candidate entities. They suit scenarios where the entities in a text are relatively homogeneous, but they cannot effectively distinguish entities in different contexts and may wrongly split or merge entities that should not be split or merged.
Moreover, traditional detection methods struggle to extract, by word segmentation, words that have not appeared in earlier texts.
Recently, many deep-learning-based entity extraction methods have appeared. Their algorithms fall into two families: models with good computational performance but limited extraction accuracy, and models with high extraction accuracy but slow computation. For example, fast linear entity extraction models and convolutional neural networks belong to the fast family, while nonlinear entity extraction models and deep neural network models achieve better accuracy.
Chinese patent document CN2017100036859 discloses an online traditional Chinese medicine text named-entity recognition method based on deep learning. That method enriches the text training sample set with a crawler and extracts text features with a neural network, which improves the accuracy of entity extraction to some extent; however, as the training samples grow, the entity model grows with them, the training time gradually increases, and the feature extraction time increases accordingly.
Disclosure of Invention
In view of these technical problems, the invention aims to provide an industry text entity extraction method based on a distributed platform, in which multiple resilient distributed entity extraction models on the Spark platform process text feature entities under different contexts, effectively improving both entity extraction efficiency and accuracy. Meanwhile, the weights in the support vector machine classification algorithm are improved, strengthening the classifier's generalization over texts and further improving accuracy.
The technical scheme of the invention is as follows:
an industry text entity extraction method based on a distributed platform comprises the following steps:
S01: training a text data set with a deep-learning neural network to obtain a relation feature model, and extracting the relation features of a target text through the relation feature model;
S02: generating multiple resilient distributed datasets (RDDs) from the extracted relation features;
S03: extracting category features from the RDD data through a category feature model trained with an improved nonlinear SVM classification algorithm;
S04: finding the entity model corresponding to each context according to the extracted category features, and extracting the entity data in the texts of that category through the trained entity model;
S05: judging whether the number of texts for a context exceeds a set threshold T; if so, retraining that context's entity model and using the retrained model to extract entity data from texts of the corresponding category; otherwise, storing the texts' entity features and text data.
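The control flow of steps S01 to S05 can be sketched as follows; all model callables and the `retrain` stub are hypothetical stand-ins for the trained models described above, and the threshold is shrunk for illustration (the embodiment suggests T is on the order of 10,000):

```python
SAMPLE_THRESHOLD_T = 3  # illustrative; the embodiment suggests roughly 10,000

def retrain(context, samples):
    # Stand-in for the word2vec retraining of a context entity model (S05).
    return lambda text: [w for w in text.split() if w.istitle()]

def extract_entities(texts, relation_model, class_model, entity_models, pending):
    """Sketch of the S01-S05 control flow for a stream of texts."""
    results = []
    for text in texts:
        rel_feats = relation_model(text)        # S01: relation features
        context = class_model(rel_feats)        # S03: category / context label
        pending.setdefault(context, []).append(text)
        if len(pending[context]) > SAMPLE_THRESHOLD_T:  # S05: threshold check
            entity_models[context] = retrain(context, pending[context])
            pending[context] = []
        results.append(entity_models[context](text))    # S04: entity extraction
    return results
```

Keeping one entity model per context is what lets the Spark cluster route each shard of texts to a small, specialised model instead of one ever-growing global model.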
Preferably, the step S01 specifically includes:
S11: segmenting the text with the ansj open-source segmenter, counting each word's frequency in all texts and in the current text, removing common auxiliary words, stop words, and overly frequent words, extracting N words from all texts according to the relationship between the current-text frequency and the corpus-wide frequency, and placing the texts of each category in the same folder;
S12: randomly initializing each of the N words as an A-dimensional data feature, so that each text forms N×A-dimensional data;
S13: taking each word feature as an input neuron of the deep-learning neural network, applying convolution in the first hidden layer, subsampling and local averaging in the second hidden layer, a second convolution in the third hidden layer, a second subsampling and local averaging in the fourth hidden layer, then a fully connected layer that converts the text into B-dimensional data, and adjusting the accuracy through repeated tests to obtain the relation feature model.
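A minimal sketch of the S11 word-selection idea, assuming whitespace tokenization in place of the ansj segmenter and assuming a frequency-ratio score (the text only says the current-text and corpus-wide frequencies are related, without giving the exact formula):

```python
from collections import Counter

def select_keywords(texts, stopwords, n):
    """Pick the n most salient words: frequent in some text, rare corpus-wide.
    The ratio score below is an assumption; the patent does not specify it."""
    corpus_counts = Counter(w for t in texts for w in t.split())
    total = sum(corpus_counts.values())
    scores = {}
    for text in texts:
        local = Counter(text.split())
        local_total = sum(local.values())
        for w, c in local.items():
            if w in stopwords:
                continue
            # high score = frequent in this text relative to the whole corpus
            score = (c / local_total) / (corpus_counts[w] / total)
            scores[w] = max(scores.get(w, 0.0), score)
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n]]
```

This mirrors the TF-IDF intuition behind S11: stop words and uniformly common words score near 1 and fall out of the top N.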
Preferably, the step S03 specifically includes:
S31: adjusting the weights and offsets in the nonlinear SVM classification algorithm so that the error between the input relation features and the features of the labelled samples falls within a set range, and storing the text category feature model;
S32: the selected classification model is an improved nonlinear SVM classification algorithm whose training classification objective function is min (1/2)||w||^2 + (C_i/2) Σ_i ε_i^2 with weight C_i = C·s_i, under the classification constraint y_i = w'φ(x_i) + b + ε_i, which yields the discriminant function f(x) = Σ_i α_i K(x, x_i) + b, where C is a penalty factor (an empirical parameter), i is the RDD number, w is the weight vector, s_i is the Euclidean distance between positive and negative samples in the relation features, b is the threshold during classification, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping;
S33: gradually adjusting the penalty factor C and selecting the optimal C by testing, the nonlinear kernel being K(x, x_s) = Σ_i min(x(i), x_s(i)), where x(i) and x_s(i) are feature vectors extracted from any two text relation feature samples; the label of each class of relation feature samples is the corresponding class number, and the α_i and b of the discriminant function are obtained through repeated offline training, the discriminant function f(x) = Σ_i α_i K(x, x_i) + b being the corresponding category feature model.
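The min(x(i), x_s(i)) kernel and the discriminant can be sketched in a few lines of plain Python; the support vectors, the α_i, and b are assumed to come from the offline training described above:

```python
def intersection_kernel(x, z):
    """The patent's nonlinear kernel: the sum over dimensions of
    min(x(i), x_s(i)) (a histogram-intersection-style kernel)."""
    return sum(min(a, b) for a, b in zip(x, z))

def discriminant(x, support_vectors, alphas, b):
    """f(x) = sum_i alpha_i * K(x, x_i) + b, with alpha_i and b assumed to be
    the coefficients obtained by the repeated offline training of S33."""
    return sum(a * intersection_kernel(x, sv)
               for a, sv in zip(alphas, support_vectors)) + b
```

For example, `discriminant([1.0, 1.0, 1.0], [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]], [0.5, -0.25], 0.1)` evaluates the kernel against each stored sample and sums the weighted responses.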
Preferably, in step S03, sample texts that are poorly extracted or clearly wrong are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal.
Compared with the prior art, the invention has the advantages that:
the invention improves a classification algorithm model, wherein a weighting coefficient of a punishment factor is mainly added in a classification target function of a training model, the generalization capability of the training classification model is enhanced, and a nonlinear kernel function min (x (i), x) is adopteds(i) To enable the corresponding category of text to be accurately found. Meanwhile, the text extraction entity model is divided into the text extraction entity models of a plurality of scenes through the distributed spark platform, the problems of large training and calculation load of the traditional text extraction entity are solved, the entities in each text can be extracted quickly, and the accuracy can be improvedAnd extracting the text entity.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of an industry text entity extraction method based on a distributed platform.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
Example:
as shown in fig. 1, the industry text entity extraction method based on the distributed platform includes the following steps:
(1) Text acquisition: the text data of each industry is collected through the akka communication module of the open-source Spark platform, and the text data gathered by monitoring equipment, from which entities need to be extracted, is transmitted to the distributed Spark platform.
(2) A Spark platform cluster is built, with one server as the management node and four servers as service nodes. The management node records the dependency relationships between data streams, schedules tasks, and generates new RDDs. The service nodes run the analysis algorithms and store the data.
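A hedged pyspark sketch of such a cluster setup: the master URL, application name, and executor count are illustrative assumptions, and the snippet requires an actual Spark installation to run.

```python
from pyspark.sql import SparkSession

# One management (master) node and 4 service (worker) nodes, as in step (2).
# "spark://manager-node:7077" is a hypothetical master URL.
spark = (SparkSession.builder
         .appName("industry-entity-extraction")
         .master("spark://manager-node:7077")
         .config("spark.executor.instances", "4")
         .getOrCreate())
sc = spark.sparkContext  # the driver records RDD lineage and schedules tasks
```

The driver's lineage tracking is what the text calls "recording the dependency relationship between data streams": each RDD remembers the transformations that produced it.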
(3) An existing text data set is trained with the deep-learning neural network method to obtain a relation feature model, and the relation features of new texts are extracted with this model.
The relation feature model is generated as follows:
S21: the text is segmented with the ansj open-source segmenter; each word's frequency in all texts and in the current text is counted; common auxiliary words, stop words, and overly frequent words are removed; N important words are then extracted according to the relationship between the current-text frequency and the corpus-wide frequency; and the texts of each class are placed in the same folder.
S22: 200-dimensional data features are randomly initialized for each word, so each text sample forms N×200-dimensional data.
S23: the relation features of each word are taken as input neurons of the deep-learning neural network; convolution is applied in the first hidden layer, subsampling and local averaging in the second, a second convolution in the third, a second subsampling and local averaging in the fourth, and a fully connected layer converts the N×200-dimensional data into 1000-dimensional data. 70% of the data is used for training and 30% for testing. The trained deep network model is adjusted step by step through repeated accuracy tests, and the best network model finally obtained is the relation feature model of the text.
(4) The extracted relation feature text data is converted into resilient distributed (RDD) relation feature text data, which is then split into multiple RDDs according to the text context feature stream for sharded processing.
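A pure-Python stand-in for the sharding in step (4); with pyspark this role is played by keyed partitioning of the RDD, and the context labels used here are hypothetical:

```python
def partition_by_context(records, num_shards):
    """Group (context, features) records into shards the way step (4)
    splits the relation feature data into per-context RDDs."""
    shards = [[] for _ in range(num_shards)]
    for context, features in records:
        # All records of one context land in the same shard, so each shard
        # can be processed by the entity model of that context.
        shards[hash(context) % num_shards].append((context, features))
    return shards
```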
(5) The converted RDD feature text data is turned into category features through the category feature model trained with the improved nonlinear SVM classification algorithm; the training set is an existing, already-classified industry text data set. Because the Spark distributed platform computes quickly, the industry text data set can be retrained rapidly to obtain a corrected category feature model.
Sample texts that are poorly extracted or clearly wrong are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal. New text sets can form different categories, and thanks to the distributed nature of the Spark platform, the entities of all samples can be extracted quickly by the entity model of the corresponding type. As the categories increase, the corresponding multi-category entity models become more robust and entity extraction becomes more accurate.
The class feature model trained by the improved nonlinear SVM classification algorithm comprises the following steps:
An improved support vector machine is selected as the training classification model. Its training classification objective function is min (1/2)||w||^2 + (C_i/2) Σ_i ε_i^2, with the constraint y_i = w'φ(x_i) + b + ε_i; from the objective function and the constraint the discriminant function f(x) = Σ_i α_i K(x, x_i) + b is derived, where the weight C_i = C·s_i, C is a penalty factor (an adjustable parameter), i numbers the 1 to n training text samples, w is the weight vector, s_i is the Euclidean distance between positive and negative samples and serves as the weighting coefficient of the penalty factor in the objective function, b is the threshold, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping.
The penalty factor C is set between 1 and 100; the features of the positive and negative text samples prepared in advance are extracted, and the corresponding kernel function is K(x, x_s) = Σ_i min(x(i), x_s(i)), where x(i) and x_s(i) are feature vectors extracted from any pair of positive and negative samples. Positive samples are labelled 1 and negative samples -1; the α_i and b of the discriminant function are obtained by offline training, and the discriminant function f(x) = Σ_i α_i K(x, x_i) + b is the corresponding nonlinear SVM detection model.
The category of the text's context is output according to the value of the detection model's result y_i.
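How the detection results map to a single context label is not spelled out; one natural reading is a one-vs-rest argmax over per-context detection models (an assumption, not stated in the source):

```python
def classify_context(x, context_models):
    """Evaluate each context's detection model on x and return the context
    with the strongest response (one-vs-rest argmax; an assumed reading of
    'outputting the category corresponding to the model result')."""
    return max(context_models, key=lambda ctx: context_models[ctx](x))
```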
(6) The corresponding context entity model is found according to the category features of the text, and the entity data in texts of the corresponding type is extracted through the trained entity model; the context entity model is the text entity model obtained by training the open-source word2vec tool on an existing industry text data set.
(7) When the number of texts of a scene exceeds the threshold T, that scene's entity model is retrained with the word2vec tool; before the threshold is reached, the data is first stored on the distributed platform. T generally exceeds 10,000 samples.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.
Claims (3)
1. An industry text entity extraction method based on a distributed platform, characterized by comprising the following steps:
S01: training a text data set with a deep-learning neural network to obtain a relation feature model, and extracting the relation features of a target text through the relation feature model;
S02: generating multiple resilient distributed datasets (RDDs) from the extracted relation features;
S03: extracting category features from the RDD data through a category feature model trained with an improved nonlinear SVM classification algorithm; step S03 specifically comprises:
S31: adjusting the weights and offsets in the nonlinear SVM classification algorithm so that the error between the input relation features and the features of the labelled samples falls within a set range, and storing the text category feature model;
S32: the selected classification model is an improved nonlinear SVM classification algorithm whose training classification objective function is min (1/2)||w||^2 + (C_i/2) Σ_i ε_i^2 with weight C_i = C·s_i, under the classification constraint y_i = w'φ(x_i) + b + ε_i, which yields the discriminant function f(x) = Σ_i α_i K(x, x_i) + b, where C is a penalty factor (an empirical parameter), i is the RDD number, w is the weight vector, s_i is the Euclidean distance between positive and negative samples in the relation features, b is the threshold during classification, ε_i is the error, and φ(x_i) is the nonlinear kernel mapping;
S33: gradually adjusting the penalty factor C and selecting the optimal C by testing, the nonlinear kernel being K(x, x_s) = Σ_i min(x(i), x_s(i)), where x(i) and x_s(i) are feature vectors extracted from any two text relation feature samples; the label of each class of relation feature samples is the corresponding class number, and the α_i and b of the discriminant function are obtained through repeated offline training, the discriminant function f(x) = Σ_i α_i K(x, x_i) + b being the corresponding category feature model;
S04: finding the entity model corresponding to each context according to the extracted category features, and extracting the entity data in the texts of that category through the trained entity model;
S05: judging whether the number of texts for a context exceeds a set threshold T; if so, retraining that context's entity model and using the retrained model to extract entity data from texts of the corresponding category; otherwise, storing the texts' entity features and text data.
2. The distributed-platform-based industry text entity extraction method according to claim 1, characterized in that step S01 specifically comprises:
S11: segmenting the text with the ansj open-source segmenter, counting each word's frequency in all texts and in the current text, removing common auxiliary words, stop words, and overly frequent words, extracting N words from all texts according to the relationship between the current-text frequency and the corpus-wide frequency, and placing the texts of each category in the same folder;
S12: randomly initializing each of the N words as an A-dimensional data feature, so that each text forms N×A-dimensional data;
S13: taking each word feature as an input neuron of the deep-learning neural network, applying convolution in the first hidden layer, subsampling and local averaging in the second hidden layer, a second convolution in the third hidden layer, a second subsampling and local averaging in the fourth hidden layer, then a fully connected layer that converts the text into B-dimensional data, and adjusting the accuracy through repeated tests to obtain the relation feature model.
3. The distributed-platform-based industry text entity extraction method according to claim 1, characterized in that in step S03, sample texts that are poorly extracted or clearly wrong are placed into a new class, and the test samples are adjusted step by step until the test sample classes are optimal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710902720.0A CN107908642B (en) | 2017-09-29 | 2017-09-29 | Industry text entity extraction method based on distributed platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107908642A CN107908642A (en) | 2018-04-13 |
CN107908642B true CN107908642B (en) | 2021-11-12 |
Family
ID=61840291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710902720.0A Active CN107908642B (en) | 2017-09-29 | 2017-09-29 | Industry text entity extraction method based on distributed platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107908642B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109508757B (en) * | 2018-10-30 | 2020-10-09 | 北京陌上花科技有限公司 | Data processing method and device for character recognition |
CN111274348B (en) * | 2018-12-04 | 2023-05-12 | 北京嘀嘀无限科技发展有限公司 | Service feature data extraction method and device and electronic equipment |
CN109754014B (en) * | 2018-12-29 | 2021-04-27 | 北京航天数据股份有限公司 | Industrial model training method, device, equipment and medium |
CN111950279B (en) * | 2019-05-17 | 2023-06-23 | 百度在线网络技术(北京)有限公司 | Entity relationship processing method, device, equipment and computer readable storage medium |
CN112052646B (en) * | 2020-08-27 | 2024-03-29 | 安徽聚戎科技信息咨询有限公司 | Text data labeling method |
CN114756385B (en) * | 2022-06-16 | 2022-09-02 | 合肥中科类脑智能技术有限公司 | Elastic distributed training method under deep learning scene |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933164A (en) * | 2015-06-26 | 2015-09-23 | 华南理工大学 | Method for extracting relations among named entities in Internet massive data and system thereof |
CN106168965A (en) * | 2016-07-01 | 2016-11-30 | 竹间智能科技(上海)有限公司 | Knowledge mapping constructing system |
CN106599041A (en) * | 2016-11-07 | 2017-04-26 | 中国电子科技集团公司第三十二研究所 | Text processing and retrieval system based on big data platform |
CN106682220A (en) * | 2017-01-04 | 2017-05-17 | 华南理工大学 | Online traditional Chinese medicine text named entity identifying method based on deep learning |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7284191B2 (en) * | 2001-08-13 | 2007-10-16 | Xerox Corporation | Meta-document management system with document identifiers |
US20170124181A1 (en) * | 2015-10-30 | 2017-05-04 | Oracle International Corporation | Automatic fuzzy matching of entities in context |
CN105389378A (en) * | 2015-11-19 | 2016-03-09 | 广州精标信息科技有限公司 | System for integrating separate data |
US9940384B2 (en) * | 2015-12-15 | 2018-04-10 | International Business Machines Corporation | Statistical clustering inferred from natural language to drive relevant analysis and conversation with users |
CN106599032B (en) * | 2016-10-27 | 2020-01-14 | 浙江大学 | Text event extraction method combining sparse coding and structure sensing machine |
Non-Patent Citations (1)
Title |
---|
"Medical Named Entity Recognition Based on Deep Learning"; Zhang Fan et al.; Computing Technology and Automation (计算技术与自动化); 2017-03-31; Vol. 36, No. 1; pp. 123-127 *
Also Published As
Publication number | Publication date |
---|---|
CN107908642A (en) | 2018-04-13 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |