CN108595429A - Method for text feature extraction based on deep convolutional neural networks - Google Patents

Method for text feature extraction based on deep convolutional neural networks

Info

Publication number
CN108595429A
CN108595429A CN201810379548.XA
Authority
CN
China
Prior art keywords
convolutional neural networks
deep convolution
text feature
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810379548.XA
Other languages
Chinese (zh)
Inventor
张黎
邹开红
宗旭
肖增辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Flash Press Information Polytron Technologies Inc
Original Assignee
Hangzhou Flash Press Information Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Flash Press Information Polytron Technologies Inc filed Critical Hangzhou Flash Press Information Polytron Technologies Inc
Priority to CN201810379548.XA priority Critical patent/CN108595429A/en
Publication of CN108595429A publication Critical patent/CN108595429A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The present invention provides a method for text feature extraction based on deep convolutional neural networks, belonging to the technical field of text feature extraction. The method comprises the following steps: S1: converting the words in a sentence sample into word vectors; S2: scanning the word vectors with a deep convolutional neural network to obtain scanning features; S3: generating deep features by sampling the scanning features; S4: inputting the deep features into a classification layer to obtain classification results. In the present invention, the words in a sentence sample are converted into word vectors, the word vectors are scanned by a deep convolutional neural network to obtain scanning features, deep features are generated by sampling the scanning features, and the deep features are input into a classification layer to obtain classification results, thereby completing feature extraction from the text. This solves the problem of sentences varying in length, improves the accuracy and performance of text feature extraction, consumes fewer resources, and is more efficient.

Description

Method for text feature extraction based on deep convolutional neural networks
Technical field
The invention belongs to the technical field of text feature extraction and relates to a method for text feature extraction based on deep convolutional neural networks.
Background art
With the rapid development of the Internet, the Internet has become the main channel through which people obtain information, and the amount of text data on the Internet is growing exponentially. Text data on the Internet contains a wealth of information that is highly useful for building knowledge bases or knowledge graphs; however, the workload of extracting such knowledge manually is excessive. If computers could understand the text and extract the useful information, a great deal of manpower could be saved. Text data on the Internet, however, exists in the form of natural language, that is, in unstructured form, which computers cannot process directly. To solve this problem, information extraction technology emerged; it extracts structured data from unstructured text data. Text mining technology can help people obtain key information from massive data quickly and effectively, and text feature extraction is the key step of text mining.
Summary of the invention
In view of the above problems in the prior art, the present invention provides a method for text feature extraction based on deep convolutional neural networks. The technical problem to be solved by the present invention is how to extract the features in text by means of a deep convolutional neural network.
The object of the invention can be achieved by the following technical scheme:
A method for text feature extraction based on deep convolutional neural networks comprises the following steps:
S1: converting the words in a sentence sample into word vectors;
S2: scanning the word vectors with a deep convolutional neural network to obtain scanning features;
S3: generating deep features by sampling the scanning features;
S4: inputting the deep features into a classification layer to obtain classification results.
Preferably, in step S1 the sentence sample is separated into words according to a dictionary.
Preferably, in step S1 the words are converted into word vectors by an embedding.
Preferably, step S2 specifically includes:
S21: computing scores for the word vectors to obtain a feature matrix;
S22: scanning the feature matrix with the filters of the deep convolutional neural network to obtain scanning features.
Preferably, step S3 specifically includes:
S31: sampling the scanning features by max-pooling to obtain sampled features;
S32: selecting deep features from the sampled features.
Preferably, in step S32 the maximum value is selected from the sampled features as the deep feature.
Preferably, the distance moved at each step when the filters scan the feature matrix is equal.
Preferably, in step S4 the classification layer fully connects the deep features to generate connected features, the connected features are input into the classification layer, and the connected features are compared with a class library to generate classification results.
Preferably, the classification layer is a softmax classification layer.
Preferably, the width of the filters is equal to the width of the feature matrix.
In the present invention, the words in a sentence sample are converted into word vectors, and the word vectors are scanned by a deep convolutional neural network to obtain scanning features with high accuracy; deep features are generated by sampling the scanning features, which prevents overfitting and facilitates optimization; the deep features are input into a classification layer to obtain classification results, thereby completing feature extraction from the text. This solves the problem of sentences varying in length, improves the accuracy and performance of text feature extraction, consumes fewer resources, and is more efficient.
Description of the drawings
Fig. 1 is a flow diagram of the present invention.
Specific embodiments
The technical scheme of the present invention is further described below with reference to the accompanying drawings and specific embodiments; however, the present invention is not limited to these examples.
Referring to Fig. 1, the method for text feature extraction based on deep convolutional neural networks in this embodiment may include the following steps:
S1: converting the words in a sentence sample into word vectors;
S2: scanning the word vectors with a deep convolutional neural network to obtain scanning features;
S3: generating deep features by sampling the scanning features;
S4: inputting the deep features into a classification layer to obtain classification results.
In step S1, the sentence sample can be separated into words according to a dictionary. In this way the sentence sample is divided into individual words that carry semantics, reducing the risk that incorrect segmentation of the sentence sample impairs the extraction of the deep features, which would in turn affect the classification results of the classification layer, make the extracted text features incorrect, and compromise the result for the entire sentence sample. In step S1, the sentence sample is pre-processed before the words in it are converted into word vectors; the pre-processing includes replacing the emoticons in the sentence sample with their corresponding words and deleting the duplicated words in the sentence sample. This avoids failures when the words are converted into word vectors, which would leave the extracted deep features, and ultimately the classification results and the extracted text features, incomplete.
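A minimal sketch of the segmentation and pre-processing just described, assuming the open-source jieba segmenter as a stand-in for the unspecified dictionary; the emoticon table and example sentence are purely illustrative and not taken from the patent:
```python
# Illustrative sketch only: dictionary-based segmentation plus simple pre-processing.
import jieba

EMOTICON_MAP = {":)": "微笑", ":(": "难过"}  # hypothetical emoticon-to-word table

def preprocess(sentence):
    # Replace emoticons with their corresponding words.
    for emo, word in EMOTICON_MAP.items():
        sentence = sentence.replace(emo, word)
    # Segment the sentence into words according to a dictionary.
    words = jieba.lcut(sentence)
    # Remove immediately repeated words (one simple reading of "delete duplicated words").
    return [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]

print(preprocess("今天天气:)很好很好"))
```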
In step S1, the words can be mapped into an embedding layer and converted into word vectors by the embedding. Embedding refers to word embedding, which converts text and words into numerical vectors that a machine can accept. Word embedding represents each word with a low-dimensional, dense, real-valued word vector, which gives the words rich semantic meaning and makes it possible to compute the relatedness between words. Taking the simplest case as an example, if words are represented by two-dimensional vectors, each word can be regarded as a point in a plane whose position, i.e. its horizontal and vertical coordinates, is determined by the corresponding two-dimensional vector and can be arbitrary and continuous. If the position of the point is to carry the semantics of the word, then points that are adjacent in the plane should have related or similar semantics. In mathematical terms, if two words are semantically related or similar, the distance between their word vectors is small; the distance between vectors can be measured by the classical Euclidean distance, cosine similarity, and so on.
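For illustration, a small PyTorch sketch of the embedding step and the two relatedness measures mentioned above (cosine similarity and Euclidean distance); the vocabulary and dimensions are assumptions, not values from the patent:
```python
import torch
import torch.nn.functional as F

vocab = {"<pad>": 0, "深度": 1, "卷积": 2, "神经": 3, "网络": 4}
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)

ids = torch.tensor([vocab["卷积"], vocab["神经"]])
vectors = embedding(ids)                      # shape: (2, 50), one word vector per word

# Semantic relatedness between two word vectors can be measured by
# cosine similarity or Euclidean distance, as the description notes.
cos = F.cosine_similarity(vectors[0], vectors[1], dim=0)
dist = torch.dist(vectors[0], vectors[1])     # Euclidean distance
print(cos.item(), dist.item())
```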
Step S2 may specifically include:
S21: computing scores for the word vectors to obtain a feature matrix; the scores are computed from the probability of occurrence of the word vectors, yielding the feature matrix;
S22: scanning the feature matrix with the filters of the deep convolutional neural network to obtain scanning features; using the filters of the deep convolutional neural network to scan the feature matrix gives higher accuracy and higher efficiency.
Here, the weights with which each neuron connects to the data window in the feature matrix can be fixed, and each neuron attends to only one characteristic. A neuron can be a filter, and each filter has its own text feature of interest; all the neurons together form the set of feature extractors for the entire sentence sample. The scanning features can be passed through a non-linear mapping; ReLU (rectified linear unit) may be used as the activation function of the deep convolutional neural network, as it converges quickly and its gradient is simple to compute. The number of columns of the scanning features can be 1.
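The scanning in step S2 can be sketched in PyTorch as a convolution whose kernel width equals the word-vector length and whose stride is fixed, with ReLU as the activation; all shapes, filter counts, and the window size below are illustrative assumptions rather than values specified by the patent:
```python
import torch
import torch.nn.functional as F

sentence_len, embed_dim, num_filters, window = 20, 50, 64, 3

feature_matrix = torch.randn(1, 1, sentence_len, embed_dim)   # (batch, channel, length, dim)

# Kernel width = embed_dim, so each filter spans whole word vectors;
# the stride is the same for every move, as the description requires.
conv = torch.nn.Conv2d(in_channels=1, out_channels=num_filters,
                       kernel_size=(window, embed_dim), stride=1)

scanning_features = F.relu(conv(feature_matrix))   # (1, 64, 18, 1): one column per filter
print(scanning_features.shape)
```
The last dimension of the output is 1, matching the statement that the scanning features have a single column.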
Step S3 may specifically include:
S31: sampling the scanning features by max-pooling to obtain sampled features;
S32: selecting deep features from the sampled features.
In step S32, the maximum value can be selected from the sampled features as the deep feature: samples are obtained by max-pooling, and the maximum value among the samples serves as the deep feature, which prevents overfitting and facilitates optimization. This achieves dimensionality reduction of the sampled features, so that the output of the max-pooling is the maximum value of each feature map, i.e. a one-dimensional vector, and a one-dimensional deep feature can thus be obtained.
Here, the sampled features can be obtained by sampling the scanning features with max-pooling, and the maximum value among the obtained sampled features can serve as the deep feature. Max-pooling can be used to compress the amount of data and parameters and to reduce dimensionality, preventing overfitting and making optimization easier. Max-pooling retains the most important features in the text and removes unimportant information; such redundant, repeated, or low-value information is discarded and the most important features are extracted. The output of the max-pooling is the maximum value of each feature map, i.e. a one-dimensional vector, and the deep feature can be a one-dimensional vector.
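Step S3 can likewise be sketched as max-over-time pooling in PyTorch, continuing the tensor shapes assumed in the previous sketch; the result is the one-dimensional deep-feature vector described above:
```python
import torch
import torch.nn.functional as F

scanning_features = torch.randn(1, 64, 18, 1)           # (batch, filters, positions, 1)

# Pool over all positions so only the largest response of each filter survives.
pooled = F.max_pool2d(scanning_features,
                      kernel_size=(scanning_features.size(2), 1))
deep_features = pooled.view(1, -1)                       # (1, 64): one value per feature map
print(deep_features.shape)
```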
The distance moved at each step when the filters scan the feature matrix can be equal. This improves classification efficiency and prevents the filters from skipping word vectors while scanning the feature matrix, which would make the generated scanning features incomplete, affect subsequent operations, render the final classification results inaccurate, and lower classification efficiency.
The width of the filters can be equal to the width of the feature matrix. The width of the feature matrix can be equal to the length of the word vectors, so the width of the filters equals the length of the word vectors; this ensures that the filters scan all of the word vectors, guaranteeing the accuracy of the scanning results and of the scanning features.
In step S4, the classification layer can fully connect the deep features to generate connected features, and the connected features are input into the classification layer and compared with a class library to generate classification results. The deep features are input into a classifier and compared with an existing class library to classify the text, thereby realizing text feature extraction. The deep features are fed into the classification layer by way of full connection: each input corresponds to one output, which realizes the full connection. The full connection can use the Dropout technique. Dropout means that during model training the weights of certain hidden-layer nodes are randomly disabled; the disabled nodes are temporarily not regarded as part of the network structure, but their weights must be retained, because they may be active again when the next sample is input. An L2 regularization constraint is imposed on the weight parameters of the deep features; the advantage of doing so is that it prevents the hidden-layer units from becoming co-adapted (or symmetric), thereby mitigating overfitting.
The classification layer can be a softmax classification layer. A softmax classification layer can improve the accuracy of the classification label sequence and ensure highly accurate classification results, so that text feature extraction is highly accurate, more efficient, and consumes fewer resources.
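A sketch of step S4 as described above: full connection with Dropout followed by a softmax classification layer, with L2 regularization applied to the weights through weight decay; the class count, dimensions, and weight-decay value are illustrative assumptions:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNClassifier(nn.Module):
    def __init__(self, num_filters=64, num_classes=5, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)          # randomly disable hidden units (Dropout)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, deep_features):
        connected = self.dropout(deep_features)     # full connection with Dropout
        logits = self.fc(connected)
        return F.softmax(logits, dim=-1)            # softmax classification layer

model = TextCNNClassifier()
# L2 regularization on the weight parameters via weight decay.
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)
probs = model(torch.randn(1, 64))
print(probs)
```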
The specific embodiments described herein are merely illustrative of the spirit of the present invention. Those skilled in the art to which the present invention belongs can make various modifications or additions to the described embodiments or replace them in a similar manner without departing from the spirit of the present invention or exceeding the scope of the appended claims.

Claims (10)

1. A method for text feature extraction based on deep convolutional neural networks, characterized by comprising the following steps:
S1: converting the words in a sentence sample into word vectors;
S2: scanning the word vectors with a deep convolutional neural network to obtain scanning features;
S3: generating deep features by sampling the scanning features;
S4: inputting the deep features into a classification layer to obtain classification results.
2. The method for text feature extraction based on deep convolutional neural networks according to claim 1, characterized in that: in step S1, the sentence sample is separated into words according to a dictionary.
3. The method for text feature extraction based on deep convolutional neural networks according to claim 1 or 2, characterized in that: in step S1, the words are converted into word vectors by an embedding.
4. The method for text feature extraction based on deep convolutional neural networks according to claim 3, characterized in that step S2 specifically includes:
S21: computing scores for the word vectors to obtain a feature matrix;
S22: scanning the feature matrix with the filters of the deep convolutional neural network to obtain scanning features.
5. The method for text feature extraction based on deep convolutional neural networks according to claim 4, characterized in that step S3 specifically includes:
S31: sampling the scanning features by max-pooling to obtain sampled features;
S32: selecting deep features from the sampled features.
6. The method for text feature extraction based on deep convolutional neural networks according to claim 4, characterized in that: in step S32, the maximum value is selected from the sampled features as the deep feature.
7. The method for text feature extraction based on deep convolutional neural networks according to claim 4, characterized in that: the distance moved at each step when the filters scan the feature matrix is equal.
8. The method for text feature extraction based on deep convolutional neural networks according to claim 1 or 2, characterized in that: in step S4, the classification layer fully connects the deep features to generate connected features, and the connected features are input into the classification layer and compared with a class library to generate classification results.
9. The method for text feature extraction based on deep convolutional neural networks according to claim 8, characterized in that: the classification layer is a softmax classification layer.
10. The method for text feature extraction based on deep convolutional neural networks according to claim 4, characterized in that: the width of the filters is equal to the width of the feature matrix.
CN201810379548.XA 2018-04-25 2018-04-25 Method for text feature extraction based on deep convolutional neural networks Pending CN108595429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810379548.XA CN108595429A (en) 2018-04-25 2018-04-25 Method for text feature extraction based on deep convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810379548.XA CN108595429A (en) 2018-04-25 2018-04-25 Method for text feature extraction based on deep convolutional neural networks

Publications (1)

Publication Number Publication Date
CN108595429A true CN108595429A (en) 2018-09-28

Family

ID=63609681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810379548.XA Pending CN108595429A (en) Method for text feature extraction based on deep convolutional neural networks

Country Status (1)

Country Link
CN (1) CN108595429A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
US20170150235A1 (en) * 2015-11-20 2017-05-25 Microsoft Technology Licensing, Llc Jointly Modeling Embedding and Translation to Bridge Video and Language
CN105740349A (en) * 2016-01-25 2016-07-06 重庆邮电大学 Sentiment classification method capable of combining Doc2vce with convolutional neural network
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network
CN106855853A (en) * 2016-12-28 2017-06-16 成都数联铭品科技有限公司 Entity relation extraction system based on deep neural network
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN107169035A (en) * 2017-04-19 2017-09-15 华南理工大学 A kind of file classification method for mixing shot and long term memory network and convolutional neural networks
CN107562784A (en) * 2017-07-25 2018-01-09 同济大学 Short text classification method based on ResLCNN models

Similar Documents

Publication Publication Date Title
CN106570148B (en) A kind of attribute extraction method based on convolutional neural networks
CN107766371A (en) A kind of text message sorting technique and its device
CN107992764B (en) Sensitive webpage identification and detection method and device
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN109582794A (en) Long article classification method based on deep learning
CN112650923A (en) Public opinion processing method and device for news events, storage medium and computer equipment
CN110413319A (en) A kind of code function taste detection method based on deep semantic
CN103279478A (en) Method for extracting features based on distributed mutual information documents
CN109918649B (en) Suicide risk identification method based on microblog text
CN110134788B (en) Microblog release optimization method and system based on text mining
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN108920446A (en) A kind of processing method of Engineering document
CN109815485A (en) A kind of method, apparatus and storage medium of the identification of microblogging short text feeling polarities
CN108829810A (en) File classification method towards healthy public sentiment
CN108319518A (en) File fragmentation sorting technique based on Recognition with Recurrent Neural Network and device
CN113806548A (en) Petition factor extraction method and system based on deep learning model
CN110910175A (en) Tourist ticket product portrait generation method
CN112287240A (en) Case microblog evaluation object extraction method and device based on double-embedded multilayer convolutional neural network
CN115062727A (en) Graph node classification method and system based on multi-order hypergraph convolutional network
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
Sagcan et al. Toponym recognition in social media for estimating the location of events
CN111008285B (en) Author disambiguation method based on thesis key attribute network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 612, Building 5, No. 998 Wenyi West Road, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Applicant after: HANGZHOU SECSMART INFORMATION TECHNOLOGY CO.,LTD.

Address before: Room 612, Building 5, No. 998 Wenyi West Road, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Applicant before: HANGZHOU SECSMART INFORMATION TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information

Address after: 310000 Room 608, Building No. 998 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: HANGZHOU SECSMART INFORMATION TECHNOLOGY CO.,LTD.

Address before: Room 612, Building 5, No. 998 Wenyi West Road, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Applicant before: HANGZHOU SECSMART INFORMATION TECHNOLOGY CO.,LTD.

CB02 Change of applicant information

Address after: 310000 Room 608, building 5, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: Flash it Co.,Ltd.

Address before: 310000 Room 608, building 5, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant before: HANGZHOU SECSMART INFORMATION TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180928
