CN109684637A - A kind of integrated use method of text feature - Google Patents

A kind of integrated use method of text feature

Info

Publication number
CN109684637A
CN109684637A CN201811571221.9A
Authority
CN
China
Prior art keywords
word
vector
model
text
word2vec
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811571221.9A
Other languages
Chinese (zh)
Inventor
段强
李锐
高明
于治楼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Hi Tech Investment and Development Co Ltd
Original Assignee
Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Hi Tech Investment and Development Co Ltd filed Critical Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority to CN201811571221.9A priority Critical patent/CN109684637A/en
Publication of CN109684637A publication Critical patent/CN109684637A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an integrated use method of text features, belonging to the field of artificial intelligence. The method processes a corpus with an identical text-preprocessing procedure, then trains a TFIDF feature-engineering model and a Word2vec feature-engineering model separately, obtaining two different vector-matrix representations of the same corpus. The two resulting vector matrices are then simply concatenated into a single higher-dimensional vector matrix, which is used to train a classification-task model. By combining the respective advantages and characteristics of TFIDF and word2vec in a complementary way, the method describes both a word's salience within a document and its association with context more comprehensively and accurately, improving the accuracy of subsequently trained classification models.

Description

A kind of integrated use method of text feature
Technical field
The present invention relates to the field of artificial intelligence, and specifically to an integrated use method of text features.
Background technique
In the practice of intelligent medical care, large amounts of data are generated all the time, such as patients' health status, prescriptions, doctors' orders, progress notes, and consultation notes. In today's flourishing intelligent medical care, collecting and storing these data and then classifying them is of profound significance: it not only helps manage the data better for later analysis and retrieval, but also reveals the distribution and inherent regularities of the data through classification. Facing rapidly accumulating data, manual classification can guarantee relatively high accuracy, but compared with machine-learning methods it is far less efficient. Therefore, in today's era of ever-developing big data, classifying medical-industry data with machine learning is imperative.
Summary of the invention
The technical task of the present invention is, in view of the above deficiencies, to provide an integrated use method of text features that can describe a word's salience within a document and its association with context more comprehensively and accurately, thereby improving the accuracy of subsequently trained classification models.
The technical solution adopted by the present invention to solve the technical problem is as follows:
In an integrated use method of text features, the corpus is processed with an identical text-preprocessing procedure; a TFIDF feature-engineering model and a Word2vec feature-engineering model are then trained separately, yielding two different vector-matrix representations of the same corpus. The two vector matrices each have a different focus, such as the salience of the vocabulary or the correlation of the context;
The two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, thereby increasing the comprehensiveness and accuracy of the word-vector description; training the classification-task model with this vector matrix improves the learning effect of subsequent supervised training.
Here, TFIDF is used to calculate word frequency, covering the raw term-frequency algorithm and the inverse-document-frequency value;
Word2vec is used, on the basis of TFIDF, to capture the association of words within their context.
TFIDF considers both a word's frequency within a document and its frequency across the entire corpus, so it performs well and stably on most tasks. The TF-IDF algorithm is simple and fast, and its results accord well with reality, but measuring a word's importance by term frequency alone is clearly not comprehensive enough: sometimes an important word occurs only rarely, and the algorithm cannot reflect the positional information of words. Combining it with the Word2Vec algorithm effectively captures the association of words within their context.
Texts generated in the medical field contain not only salient, indicative medical vocabulary but also strong causal relations, that is, close contextual connections. Therefore, the features generated by TFIDF and word2vec are combined into a new feature that simultaneously considers the particularity of an individual word and its contextual connections. The merged feature can be used to optimize subsequent model training.
Preferably, the text-preprocessing method includes word segmentation and stop-word removal.
Preferably, TFIDF and Word2vec are trained separately to obtain two different vector-matrix representations of the same corpus, and the two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, as follows:
A document vector matrix of dimension N*K is obtained with TFIDF; a document vector matrix of dimension N*L is obtained with Word2vec; the two vector matrices are concatenated into a higher-dimensional vector matrix of dimension N*(K+L).
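The concatenation step can be sketched as follows; this is a minimal illustration in which plain Python lists stand in for the actual TFIDF and Word2vec matrices, and the function name and toy values are assumptions for the example, not part of the patented method.

```python
def concat_feature_matrices(tfidf_matrix, w2v_matrix):
    """Row-wise concatenation of two document-feature matrices.

    Both inputs must describe the same N documents in the same order;
    row i of the result is document i's TFIDF vector (length K)
    followed by its Word2vec vector (length L), giving length K + L.
    """
    assert len(tfidf_matrix) == len(w2v_matrix), "matrices must cover the same documents"
    return [t_row + w_row for t_row, w_row in zip(tfidf_matrix, w2v_matrix)]

# Toy example: N=2 documents, K=3 TFIDF dimensions, L=2 Word2vec dimensions.
tfidf = [[0.1, 0.0, 0.5], [0.2, 0.3, 0.0]]
w2v = [[0.7, -0.1], [0.4, 0.9]]
combined = concat_feature_matrices(tfidf, w2v)  # 2 rows of length 3 + 2 = 5
```

The resulting N*(K+L) matrix is what the method then feeds to the classification-task model.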
Specifically, the method is implemented in the following steps:
1) Text preprocessing, including word segmentation and stop-word removal. Many tools can be used, such as jieba and Thulac; the stop-word dictionary can be built by oneself or taken from the open-source stop-word dictionaries of various institutions, and either can be manually expanded as needed according to the characteristics of medical text;
2) Train two feature-engineering models with TFIDF and Word2Vec respectively, and save them as vector matrices;
3) Concatenate the two feature vectors along the dimension direction (i.e., extending the number of columns) into a new feature vector that contains both the TFIDF and the Word2Vec description of the same corpus;
4) Use the synthesized vector to train the text classification model.
Specifically, the classification model includes a linear SVM classifier, logistic regression, or a Bayes classifier, etc.
Specifically, the TFIDF model is trained as follows:
First compute the TF value of a word using the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document;
Then compute the IDF value of the word using the formula idf_i = log(D / d_i), i.e., take the logarithm of the total number of documents (D) divided by the number of documents containing the word (d_i);
Finally compute the product of TF and IDF to obtain the TFIDF weight of the word.
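The three steps above can be transcribed directly into code. The following is an illustrative sketch, not the patent's implementation; the function names and the toy numbers are assumptions, while the variable names n_i, N, D, and d_i follow the formulas.

```python
import math

def tf(n_i, N):
    """Raw term frequency: occurrences of the word over total words in the document."""
    return n_i / N

def idf(total_docs, docs_with_word):
    """Inverse document frequency: log of total documents over documents containing the word."""
    return math.log(total_docs / docs_with_word)

def tfidf_weight(n_i, N, total_docs, docs_with_word):
    """TFIDF weight: the product of TF and IDF."""
    return tf(n_i, N) * idf(total_docs, docs_with_word)

# A word occurring 3 times in a 100-word document, present in 10 of 1000 documents:
w = tfidf_weight(3, 100, 1000, 10)  # 0.03 * log(100)
```

A word that is frequent in one document but rare across the collection gets a high weight, matching the description below.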
The TFIDF (term frequency-inverse document frequency) algorithm is a common weighting technique in information retrieval and data mining, and it plays an important role in assessing how important a word is within a document. Intuitively, the high-frequency words in an article can represent the features of that document; this is the raw term-frequency (TF) algorithm. But certain words, such as "we" or "everybody", occur with very high frequency in most articles and cannot represent the features of any particular document, so the inverse document frequency (IDF) of a word must also be considered. The two combined form the TF-IDF algorithm.
Here, the TF value of a word is first computed with the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document. The IDF value is obtained with the formula idf_i = log(D / d_i), i.e., the logarithm of the total number of documents divided by the number of documents containing the word. The product tf_i * idf_i then gives the TF-IDF weight of the word. A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole document set, produces a high-weight TF-IDF value.
Therefore, the TF-IDF algorithm can filter out frequently occurring everyday words while retaining important, representative words.
The word2vec algorithm converts words into vector form. The traditional word-vector method represents a word as a vector of length N with a single component equal to 1 and all others 0, also called a sparse (one-hot) vector. In large-scale text applications, sparse vectors cause the curse of dimensionality and cannot express the associations between words. The word2vec algorithm, by contrast, maps each word through training to a vector of fixed length, and document vectors can be formed from these word vectors by summing and averaging. Each vector can be regarded as a point in space; the vector length can be chosen freely, is independent of article size, and can express the associations between words.
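The summing-and-averaging of word vectors into a document vector described above can be sketched as follows. This is an illustrative sketch: the function name, the plain-dictionary stand-in for a trained word2vec model's lookup table, and the toy vectors are all assumptions for the example.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the embeddings of the tokens found in `word_vectors`.

    `word_vectors` stands in for a trained word2vec lookup table;
    unknown tokens are skipped, and an all-zero vector of length `dim`
    is returned if no token is known.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

vecs = {"fever": [1.0, 0.0], "cough": [0.0, 1.0]}
doc = document_vector(["fever", "cough", "mild"], vecs, dim=2)  # "mild" is unknown and skipped
```

Note the document vector's length stays fixed at `dim` regardless of how long the document is, which is the property the paragraph above emphasizes.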
Specifically, the Word2vec model training includes a CBOW model and a Skip-gram model: the CBOW model predicts a word from the n-1 surrounding words given as input, while the Skip-gram model predicts the context from the word itself.
Word2vec is based on skip-gram or CBOW, so an individual word and its context set can be considered jointly.
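As an illustration of how the two models pair a word with its context (a sketch with assumed names and a toy sentence, not code from the patent): skip-gram emits (center word, context word) pairs, while CBOW emits (context list, center word) examples.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word is used to predict each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context list, center) pairs: the surrounding words are used to predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
        examples.append((ctx, center))
    return examples

sent = ["patient", "has", "high", "fever"]
sg = skipgram_pairs(sent, window=1)
cb = cbow_examples(sent, window=1)
```

With a window of 1, "has" predicts (and is predicted by) its immediate neighbours "patient" and "high", showing how the word and its context set are considered jointly.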
Further, the CBOW model and the Skip-gram model are both three-layer neural networks, i.e., an input layer, a middle layer, and an output layer.
The input layer of the CBOW model consists of the word vectors of the words in the window around the current word; the middle layer sums (or averages) the context-word vectors to obtain an intermediate vector; the output layer is a Huffman tree whose leaf nodes represent all the words in the corpus, each leaf node having a global code such as "01001"; each node of the middle layer is related to the non-leaf nodes of the Huffman tree, and each non-leaf node is in effect a two-class softmax. The intermediate vector is routed through the subtrees of the Huffman tree, and the parameters of the non-leaf nodes are updated by continuous iteration, yielding the vector representation of each word;
The input layer of the Skip-gram model is the current word; since the input is no longer multiple vectors, no vector summation is needed in the hidden layer. The output layer is likewise a Huffman tree; based on the codes of the context words and of the current word, the parameters of the non-leaf nodes on the path and the current word vector are updated by gradient descent until convergence, yielding the vector representation of each word.
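The global leaf codes such as "01001" mentioned above are Huffman codes: frequent words receive shorter codes, and each bit of a code corresponds to one two-class decision at a non-leaf node on the root-to-leaf path. A small sketch of how such codes can be built (an illustration with assumed names and toy word frequencies, not the patent's code):

```python
import heapq
import itertools

def huffman_codes(word_freq):
    """Build a Huffman tree over word frequencies and return each word's binary code."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(counter), {w: ""}) for w, f in word_freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}   # ...are merged: left branch = "0"
        merged.update({w: "1" + c for w, c in right.items()})  # right branch = "1"
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

codes = huffman_codes({"fever": 10, "cough": 6, "mild": 2, "rare": 1})
```

In a real word2vec hierarchical-softmax output layer, each such bit is predicted by the two-class classifier at the corresponding non-leaf node; frequent words sit near the root, so they need fewer decisions per update.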
The associations between word vectors can then express the associations between the words themselves.
Further, the middle layer is a hidden layer.
Compared with the prior art, the integrated use method of text features of the present invention has the following advantages:
Splitting a passage into individual words and representing them as vectors helps a machine better understand and learn the features of a text, and the effect of document classification increases accordingly. TFIDF considers both a word's frequency within a document and its frequency across the entire corpus, and it performs well and stably on most tasks; word2vec, based on skip-gram or CBOW, allows an individual word and its context set to be considered jointly. The features generated by TFIDF and word2vec are combined into a new feature that simultaneously considers the particularity of an individual word and its contextual connections, and the merged feature can be used to optimize subsequent model training. The method can be applied to the medical field, bringing big data to the classification and analysis of medical-industry data and improving the classification efficiency of machine learning.
The method combines the respective advantages and characteristics of TFIDF and word2vec in a complementary way, establishing a more comprehensive and accurate word vector that describes both a word's salience within a document and its association with context, which in turn improves the accuracy of subsequently trained classification models. It improves on the current practice of vectorizing a corpus with a single feature-engineering method: by synthesizing two feature vectors it gives a more comprehensive description of the corpus, which helps improve the training of subsequent medical-text classification models, optimizes hospital workflows, and reduces the workload doctors spend on organizing and classifying data.
Intelligent medical care is one of the current fields of artificial-intelligence application, and classifying medical text is an important link in it. Using currently common open-source natural-language-processing toolkits and mainstream programming languages, this method helps optimize hospital workflows, reduce the workload doctors spend on organizing and classifying data, and improve classification efficiency.
Detailed description of the invention
Fig. 1 is a flow chart of the integrated use method of text features of the invention;
Fig. 2 is a flow chart of building the comprehensive feature-engineering model of the invention.
Specific embodiment
The present invention is further described below in conjunction with specific embodiments.
In an integrated use method of text features, the corpus is processed with an identical text-preprocessing procedure; a TFIDF feature-engineering model and a Word2vec feature-engineering model are then trained separately, yielding two different vector-matrix representations of the same corpus. The two vector matrices each have a different focus, such as the salience of the vocabulary or the correlation of the context;
Here, TFIDF is used to calculate word frequency, covering the raw term-frequency algorithm and the inverse-document-frequency value; Word2vec is used, on the basis of TFIDF, to capture the association of words within their context.
In one embodiment of the invention, the text-preprocessing method includes word segmentation and stop-word removal.
The two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, thereby increasing the comprehensiveness and accuracy of the word-vector description; training the classification-task model with this vector matrix improves the learning effect of subsequent supervised training.
TFIDF and Word2vec are trained separately to obtain two different vector-matrix representations of the same corpus, and the two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, as follows:
A document vector matrix of dimension N*K is obtained with TFIDF; a document vector matrix of dimension N*L is obtained with Word2vec; the two vector matrices are concatenated into a higher-dimensional vector matrix of dimension N*(K+L).
TFIDF considers both a word's frequency within a document and its frequency across the entire corpus, so it performs well and stably on most tasks. The TF-IDF algorithm is simple and fast, and its results accord well with reality, but measuring a word's importance by term frequency alone is clearly not comprehensive enough: sometimes an important word occurs only rarely, and the algorithm cannot reflect the positional information of words. Combining it with the Word2Vec algorithm effectively captures the association of words within their context.
Texts generated in the medical field contain not only salient, indicative medical vocabulary but also strong causal relations, that is, close contextual connections. Therefore, the features generated by TFIDF and word2vec are combined into a new feature that simultaneously considers the particularity of an individual word and its contextual connections. The merged feature can be used to optimize subsequent model training.
In one embodiment of the invention, the integrated use method of text features is implemented in the following steps:
1) Text preprocessing, including word segmentation and stop-word removal. Many tools can be used, such as jieba and Thulac; the stop-word dictionary can be built by oneself or taken from the open-source stop-word dictionaries of various institutions, and either can be manually expanded as needed according to the characteristics of medical text;
2) Train two feature-engineering models with TFIDF and Word2Vec respectively, and save them as vector matrices;
3) Concatenate the two feature vectors along the dimension direction (i.e., extending the number of columns) into a new feature vector that contains both the TFIDF and the Word2Vec description of the same corpus;
4) Use the synthesized vector to train the text classification model, which may be a linear SVM classifier, logistic regression, or a Bayes classifier, etc.
The TFIDF model is trained as follows:
First compute the TF value of a word using the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document;
Then compute the IDF value of the word using the formula idf_i = log(D / d_i), i.e., take the logarithm of the total number of documents divided by the number of documents containing the word;
Finally compute the product tf_i * idf_i to obtain the TFIDF weight of the word.
The Word2vec model training includes a CBOW model and a Skip-gram model: the CBOW model predicts a word from the n-1 surrounding words given as input, and the Skip-gram model predicts the context from the word itself. Word2vec is based on skip-gram or CBOW, so an individual word and its context set can be considered jointly.
The CBOW model and the Skip-gram model are both three-layer neural networks, i.e., an input layer, a middle layer (hidden layer), and an output layer.
The input layer of the CBOW model consists of the word vectors of the words in the window around the current word; the middle layer sums (or averages) the context-word vectors to obtain an intermediate vector; the output layer is a Huffman tree whose leaf nodes represent all the words in the corpus, each leaf node having a global code such as "01001"; each node of the middle layer is related to the non-leaf nodes of the Huffman tree, and each non-leaf node is in effect a two-class softmax. The intermediate vector is routed through the subtrees of the Huffman tree, and the parameters of the non-leaf nodes are updated by continuous iteration, yielding the vector representation of each word.
The input layer of the Skip-gram model is the current word; since the input is no longer multiple vectors, no vector summation is needed in the hidden layer. The output layer is likewise a Huffman tree; based on the codes of the context words and of the current word, the parameters of the non-leaf nodes on the path and the current word vector are updated by gradient descent until convergence, yielding the vector representation of each word.
The associations between word vectors can then express the associations between the words themselves.
Finally, the synthesized vector is used for training the text classification model, optimizing subsequent model training.
Those skilled in the art can readily realize the present invention from the above specific embodiments. It should be understood, however, that the present invention is not limited to the above specific embodiments; on the basis of the disclosed embodiments, those skilled in the art can combine different technical features at will to realize different technical solutions.
Apart from the technical features described in the specification, everything else is technology known to those skilled in the art.

Claims (9)

1. An integrated use method of text features, characterized in that a corpus is processed with an identical text-preprocessing procedure, a TFIDF feature-engineering model and a Word2vec feature-engineering model are then trained separately, and two different vector-matrix representations of the same corpus are obtained;
the two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, and the classification-task model is trained with this vector matrix.
2. The integrated use method of text features according to claim 1, characterized in that the text-preprocessing method includes word segmentation and stop-word removal.
3. The integrated use method of text features according to claim 1 or 2, characterized in that a document vector matrix of dimension N*K is obtained with TFIDF; a document vector matrix of dimension N*L is obtained with Word2vec; and the two vector matrices are concatenated into a higher-dimensional vector matrix of dimension N*(K+L).
4. The integrated use method of text features according to claim 3, characterized in that the method is implemented in the following steps:
1) text preprocessing, including word segmentation and stop-word removal;
2) training two feature-engineering models with TFIDF and Word2Vec respectively, and saving them as vector matrices;
3) concatenating the two feature vectors along the dimension direction into a new feature vector that contains both the TFIDF and the Word2Vec description of the same corpus;
4) using the synthesized vector to train the text classification model.
5. The integrated use method of text features according to claim 4, characterized in that the classification model includes a linear SVM classifier, logistic regression, or a Bayes classifier.
6. The integrated use method of text features according to claim 4, characterized in that the TFIDF model is trained as follows:
first compute the TF value of a word using the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document;
then compute the IDF value of the word using the formula idf_i = log(D / d_i), i.e., take the logarithm of the total number of documents divided by the number of documents containing the word;
finally compute the product of TF and IDF to obtain the TFIDF weight of the word.
7. The integrated use method of text features according to claim 4, characterized in that the Word2vec model training includes a CBOW model and a Skip-gram model, the CBOW model predicting a word from the n-1 surrounding words given as input, and the Skip-gram model predicting the context from the word itself.
8. The integrated use method of text features according to claim 7, characterized in that the CBOW model and the Skip-gram model are both three-layer neural networks, i.e., an input layer, a middle layer, and an output layer;
the input layer of the CBOW model consists of the word vectors of the words in the window around the current word; the middle layer sums the context-word vectors to obtain an intermediate vector; the output layer is a Huffman tree whose leaf nodes represent all the words in the corpus, each leaf node having a global code; each node of the middle layer is related to the non-leaf nodes of the Huffman tree; the intermediate vector is routed through the subtrees of the Huffman tree, and the parameters of the non-leaf nodes are updated by continuous iteration, yielding the vector representation of each word;
the input layer of the Skip-gram model is the current word; the output layer is a Huffman tree; based on the codes of the context words and of the current word, the parameters of the non-leaf nodes on the path and the current word vector are updated by gradient descent until convergence, yielding the vector representation of each word.
9. The integrated use method of text features according to claim 8, characterized in that the middle layer is a hidden layer.
CN201811571221.9A 2018-12-21 2018-12-21 A kind of integrated use method of text feature Pending CN109684637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571221.9A CN109684637A (en) 2018-12-21 2018-12-21 A kind of integrated use method of text feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571221.9A CN109684637A (en) 2018-12-21 2018-12-21 A kind of integrated use method of text feature

Publications (1)

Publication Number Publication Date
CN109684637A true CN109684637A (en) 2019-04-26

Family

ID=66188180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571221.9A Pending CN109684637A (en) 2018-12-21 2018-12-21 A kind of integrated use method of text feature

Country Status (1)

Country Link
CN (1) CN109684637A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
US9984682B1 (en) * 2016-03-30 2018-05-29 Educational Testing Service Computer-implemented systems and methods for automatically generating an assessment of oral recitations of assessment items
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system


Similar Documents

Publication Publication Date Title
Singh et al. Vectorization of text documents for identifying unifiable news articles
CN109933789B (en) Neural network-based judicial domain relation extraction method and system
US20210034813A1 (en) Neural network model with evidence extraction
CN112115700B (en) Aspect-level emotion analysis method based on dependency syntax tree and deep learning
CN110019770A (en) The method and apparatus of train classification models
CN105930368B (en) A kind of sensibility classification method and system
CN107967255A (en) A kind of method and system for judging text similarity
Cui et al. Sliding selector network with dynamic memory for extractive summarization of long documents
CN113239186A (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
JP2022088319A (en) Analysis of natural language text in document
Ettaouil et al. Architecture optimization model for the multilayer perceptron and clustering.
CN109271516A (en) Entity type classification method and system in a kind of knowledge mapping
Elayidom et al. A generalized data mining framework for placement chance prediction problems
CN112463989A (en) Knowledge graph-based information acquisition method and system
CN110705279A (en) Vocabulary selection method and device and computer readable storage medium
CN110888944B (en) Attention convolutional neural network entity relation extraction method based on multi-convolutional window size
Wu et al. TW-TGNN: Two windows graph-based model for text classification
CN108122613A (en) Health forecast method and apparatus based on health forecast model
CN110020015A (en) A kind of conversational system answers generation method and system
CN109684637A (en) A kind of integrated use method of text feature
CN112686306B (en) ICD operation classification automatic matching method and system based on graph neural network
US20170011309A1 (en) System and method for layered, vector cluster pattern with trim
Gao et al. Compressing lstm networks by matrix product operators
Keegan Using first-order stochastic based optimizers in solving regression models
Mandayam et al. Intelligent conversational model for mental health wellness

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190426