CN109684637A - A kind of integrated use method of text feature - Google Patents

A kind of integrated use method of text feature

Info

Publication number
CN109684637A
CN109684637A CN201811571221.9A
Authority
CN
China
Prior art keywords
word
vector
model
text
word2vec
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811571221.9A
Other languages
Chinese (zh)
Inventor
段强
李锐
高明
于治楼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Hi Tech Investment and Development Co Ltd
Original Assignee
Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Hi Tech Investment and Development Co Ltd filed Critical Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority to CN201811571221.9A priority Critical patent/CN109684637A/en
Publication of CN109684637A publication Critical patent/CN109684637A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an integrated use method of text features, belonging to the field of artificial intelligence. The method processes a corpus with an identical text-preprocessing procedure, then trains a TFIDF feature-engineering model and a Word2vec feature-engineering model separately, obtaining two different vector-matrix representations of the same corpus. The two resulting vector matrices are then simply concatenated into a single higher-dimensional vector matrix, which is used to train a classification-task model. By combining the respective advantages and characteristics of TFIDF and word2vec in a complementary way, the method describes both a word's salience within a document and its association with context more comprehensively and accurately, improving the accuracy of subsequently trained classification models.

Description

A kind of integrated use method of text feature
Technical field
The present invention relates to the field of artificial intelligence, and specifically to an integrated use method of text features.
Background technique
In the practice of intelligent medical care, large amounts of data are generated all the time, such as patients' health status, prescriptions, doctors' orders, progress notes, and consultation notes. In today's flourishing intelligent medical care, collecting and storing these data and then classifying them is of profound significance: it not only helps manage the data better for later analysis and retrieval, but also reveals the distribution and inherent regularities of the data through classification. Facing rapidly accumulating data, manual classification can guarantee relatively high accuracy, but compared with machine-learning methods it is far less efficient. Therefore, in today's era of ever-developing big data, classifying medical-industry data with machine learning is imperative.
Summary of the invention
The technical task of the present invention is, in view of the above deficiencies, to provide an integrated use method of text features that can describe a word's salience within a document and its association with context more comprehensively and accurately, thereby improving the accuracy of subsequently trained classification models.
The technical solution adopted by the present invention to solve the technical problem is as follows:
In an integrated use method of text features, the corpus is processed with an identical text-preprocessing procedure; a TFIDF feature-engineering model and a Word2vec feature-engineering model are then trained separately, yielding two different vector-matrix representations of the same corpus. The two vector matrices each have a different focus, such as the salience of the vocabulary or the correlation of the context;
The two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, thereby increasing the comprehensiveness and accuracy of the word-vector description; training the classification-task model with this vector matrix improves the learning effect of subsequent supervised training.
Here, TFIDF is used to calculate word frequency, covering the raw term-frequency algorithm and the inverse-document-frequency value;
Word2vec is used, on the basis of TFIDF, to capture the association of words within their context.
TFIDF considers both a word's frequency within a document and its frequency across the entire corpus, so it performs well and stably on most tasks. The TF-IDF algorithm is simple and fast, and its results accord well with reality, but measuring a word's importance by term frequency alone is clearly not comprehensive enough: sometimes an important word occurs only rarely, and the algorithm cannot reflect the positional information of words. Combining it with the Word2Vec algorithm effectively captures the association of words within their context.
Texts generated in the medical field contain not only salient, indicative medical vocabulary but also strong causal relations, that is, close contextual connections. Therefore, the features generated by TFIDF and word2vec are combined into a new feature that simultaneously considers the particularity of an individual word and its contextual connections. The merged feature can be used to optimize subsequent model training.
Preferably, the text-preprocessing method includes word segmentation and stop-word removal.
Preferably, TFIDF and Word2vec are trained separately to obtain two different vector-matrix representations of the same corpus, and the two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, as follows:
A document vector matrix of dimension N*K is obtained with TFIDF; a document vector matrix of dimension N*L is obtained with Word2vec; the two vector matrices are concatenated into a higher-dimensional vector matrix of dimension N*(K+L).
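The concatenation step can be sketched as follows; this is a minimal illustration in which plain Python lists stand in for the actual TFIDF and Word2vec matrices, and the function name and toy values are assumptions for the example, not part of the patented method.

```python
def concat_feature_matrices(tfidf_matrix, w2v_matrix):
    """Row-wise concatenation of two document-feature matrices.

    Both inputs must describe the same N documents in the same order;
    row i of the result is document i's TFIDF vector (length K)
    followed by its Word2vec vector (length L), giving length K + L.
    """
    assert len(tfidf_matrix) == len(w2v_matrix), "matrices must cover the same documents"
    return [t_row + w_row for t_row, w_row in zip(tfidf_matrix, w2v_matrix)]

# Toy example: N=2 documents, K=3 TFIDF dimensions, L=2 Word2vec dimensions.
tfidf = [[0.1, 0.0, 0.5], [0.2, 0.3, 0.0]]
w2v = [[0.7, -0.1], [0.4, 0.9]]
combined = concat_feature_matrices(tfidf, w2v)  # 2 rows of length 3 + 2 = 5
```

The resulting N*(K+L) matrix is what the method then feeds to the classification-task model.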
Specifically, the method is implemented in the following steps:
1) Text preprocessing, including word segmentation and stop-word removal. Many tools can be used, such as jieba and Thulac; the stop-word dictionary can be built by oneself or taken from the open-source stop-word dictionaries of various institutions, and either can be manually expanded as needed according to the characteristics of medical text;
2) Train two feature-engineering models with TFIDF and Word2Vec respectively, and save them as vector matrices;
3) Concatenate the two feature vectors along the dimension direction (i.e., extending the number of columns) into a new feature vector that contains both the TFIDF and the Word2Vec description of the same corpus;
4) Use the synthesized vector to train the text classification model.
Specifically, the classification model includes a linear SVM classifier, logistic regression, or a Bayes classifier, etc.
Specifically, the TFIDF model is trained as follows:
First compute the TF value of a word using the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document;
Then compute the IDF value of the word using the formula idf_i = log(D / d_i), i.e., take the logarithm of the total number of documents (D) divided by the number of documents containing the word (d_i);
Finally compute the product of TF and IDF to obtain the TFIDF weight of the word.
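The three steps above can be transcribed directly into code. The following is an illustrative sketch, not the patent's implementation; the function names and the toy numbers are assumptions, while the variable names n_i, N, D, and d_i follow the formulas.

```python
import math

def tf(n_i, N):
    """Raw term frequency: occurrences of the word over total words in the document."""
    return n_i / N

def idf(total_docs, docs_with_word):
    """Inverse document frequency: log of total documents over documents containing the word."""
    return math.log(total_docs / docs_with_word)

def tfidf_weight(n_i, N, total_docs, docs_with_word):
    """TFIDF weight: the product of TF and IDF."""
    return tf(n_i, N) * idf(total_docs, docs_with_word)

# A word occurring 3 times in a 100-word document, present in 10 of 1000 documents:
w = tfidf_weight(3, 100, 1000, 10)  # 0.03 * log(100)
```

A word that is frequent in one document but rare across the collection gets a high weight, matching the description below.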
The TFIDF (term frequency-inverse document frequency) algorithm is a common weighting technique in information retrieval and data mining, and it plays an important role in assessing how important a word is within a document. Intuitively, the high-frequency words in an article can represent the features of that document; this is the raw term-frequency (TF) algorithm. But certain words, such as "we" or "everybody", occur with very high frequency in most articles and cannot represent the features of any particular document, so the inverse document frequency (IDF) of a word must also be considered. The two combined form the TF-IDF algorithm.
Here, the TF value of a word is first computed with the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document. The IDF value is obtained with the formula idf_i = log(D / d_i), i.e., the logarithm of the total number of documents divided by the number of documents containing the word. The product tf_i * idf_i then gives the TF-IDF weight of the word. A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole document set, produces a high-weight TF-IDF value.
Therefore, the TF-IDF algorithm can filter out frequently occurring everyday words while retaining important, representative words.
The word2vec algorithm converts words into vector form. The traditional word-vector method represents a word as a vector of length N with a single component equal to 1 and all others 0, also called a sparse (one-hot) vector. In large-scale text applications, sparse vectors cause the curse of dimensionality and cannot express the associations between words. The word2vec algorithm, by contrast, maps each word through training to a vector of fixed length, and document vectors can be formed from these word vectors by summing and averaging. Each vector can be regarded as a point in space; the vector length can be chosen freely, is independent of article size, and can express the associations between words.
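The summing-and-averaging of word vectors into a document vector described above can be sketched as follows. This is an illustrative sketch: the function name, the plain-dictionary stand-in for a trained word2vec model's lookup table, and the toy vectors are all assumptions for the example.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the embeddings of the tokens found in `word_vectors`.

    `word_vectors` stands in for a trained word2vec lookup table;
    unknown tokens are skipped, and an all-zero vector of length `dim`
    is returned if no token is known.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

vecs = {"fever": [1.0, 0.0], "cough": [0.0, 1.0]}
doc = document_vector(["fever", "cough", "mild"], vecs, dim=2)  # "mild" is unknown and skipped
```

Note the document vector's length stays fixed at `dim` regardless of how long the document is, which is the property the paragraph above emphasizes.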
Specifically, the Word2vec model training includes a CBOW model and a Skip-gram model: the CBOW model predicts a word from the n-1 surrounding words given as input, while the Skip-gram model predicts the context from the word itself.
Word2vec is based on skip-gram or CBOW, so an individual word and its context set can be considered jointly.
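As an illustration of how the two models pair a word with its context (a sketch with assumed names and a toy sentence, not code from the patent): skip-gram emits (center word, context word) pairs, while CBOW emits (context list, center word) examples.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word is used to predict each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context list, center) pairs: the surrounding words are used to predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
        examples.append((ctx, center))
    return examples

sent = ["patient", "has", "high", "fever"]
sg = skipgram_pairs(sent, window=1)
cb = cbow_examples(sent, window=1)
```

With a window of 1, "has" predicts (and is predicted by) its immediate neighbours "patient" and "high", showing how the word and its context set are considered jointly.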
Further, the CBOW model and the Skip-gram model are both three-layer neural networks, i.e., an input layer, a middle layer, and an output layer.
The input layer of the CBOW model consists of the word vectors of the words in the window around the current word; the middle layer sums (or averages) the context-word vectors to obtain an intermediate vector; the output layer is a Huffman tree whose leaf nodes represent all the words in the corpus, each leaf node having a global code such as "01001"; each node of the middle layer is related to the non-leaf nodes of the Huffman tree, and each non-leaf node is in effect a two-class softmax. The intermediate vector is routed through the subtrees of the Huffman tree, and the parameters of the non-leaf nodes are updated by continuous iteration, yielding the vector representation of each word;
The input layer of the Skip-gram model is the current word; since the input is no longer multiple vectors, no vector summation is needed in the hidden layer. The output layer is likewise a Huffman tree; based on the codes of the context words and of the current word, the parameters of the non-leaf nodes on the path and the current word vector are updated by gradient descent until convergence, yielding the vector representation of each word.
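The global leaf codes such as "01001" mentioned above are Huffman codes: frequent words receive shorter codes, and each bit of a code corresponds to one two-class decision at a non-leaf node on the root-to-leaf path. A small sketch of how such codes can be built (an illustration with assumed names and toy word frequencies, not the patent's code):

```python
import heapq
import itertools

def huffman_codes(word_freq):
    """Build a Huffman tree over word frequencies and return each word's binary code."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(counter), {w: ""}) for w, f in word_freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}   # ...are merged: left branch = "0"
        merged.update({w: "1" + c for w, c in right.items()})  # right branch = "1"
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

codes = huffman_codes({"fever": 10, "cough": 6, "mild": 2, "rare": 1})
```

In a real word2vec hierarchical-softmax output layer, each such bit is predicted by the two-class classifier at the corresponding non-leaf node; frequent words sit near the root, so they need fewer decisions per update.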
The associations between word vectors can then express the associations between the words themselves.
Further, the middle layer is a hidden layer.
Compared with the prior art, the integrated use method of text features of the present invention has the following advantages:
Splitting a passage into individual words and representing them as vectors helps a machine better understand and learn the features of a text, and the effect of document classification increases accordingly. TFIDF considers both a word's frequency within a document and its frequency across the entire corpus, and it performs well and stably on most tasks; word2vec, based on skip-gram or CBOW, allows an individual word and its context set to be considered jointly. The features generated by TFIDF and word2vec are combined into a new feature that simultaneously considers the particularity of an individual word and its contextual connections, and the merged feature can be used to optimize subsequent model training. The method can be applied to the medical field, bringing big data to the classification and analysis of medical-industry data and improving the classification efficiency of machine learning.
The method combines the respective advantages and characteristics of TFIDF and word2vec in a complementary way, establishing a more comprehensive and accurate word vector that describes both a word's salience within a document and its association with context, which in turn improves the accuracy of subsequently trained classification models. It improves on the current practice of vectorizing a corpus with a single feature-engineering method: by synthesizing two feature vectors it gives a more comprehensive description of the corpus, which helps improve the training of subsequent medical-text classification models, optimizes hospital workflows, and reduces the workload doctors spend on organizing and classifying data.
Intelligent medical care is one of the current fields of artificial-intelligence application, and classifying medical text is an important link in it. Using currently common open-source natural-language-processing toolkits and mainstream programming languages, this method helps optimize hospital workflows, reduce the workload doctors spend on organizing and classifying data, and improve classification efficiency.
Detailed description of the invention
Fig. 1 is a flow chart of the integrated use method of text features of the invention;
Fig. 2 is a flow chart of building the comprehensive feature-engineering model of the invention.
Specific embodiment
The present invention is further described below in conjunction with specific embodiments.
In an integrated use method of text features, the corpus is processed with an identical text-preprocessing procedure; a TFIDF feature-engineering model and a Word2vec feature-engineering model are then trained separately, yielding two different vector-matrix representations of the same corpus. The two vector matrices each have a different focus, such as the salience of the vocabulary or the correlation of the context;
Here, TFIDF is used to calculate word frequency, covering the raw term-frequency algorithm and the inverse-document-frequency value; Word2vec is used, on the basis of TFIDF, to capture the association of words within their context.
In one embodiment of the invention, the text-preprocessing method includes word segmentation and stop-word removal.
The two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, thereby increasing the comprehensiveness and accuracy of the word-vector description; training the classification-task model with this vector matrix improves the learning effect of subsequent supervised training.
TFIDF and Word2vec are trained separately to obtain two different vector-matrix representations of the same corpus, and the two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, as follows:
A document vector matrix of dimension N*K is obtained with TFIDF; a document vector matrix of dimension N*L is obtained with Word2vec; the two vector matrices are concatenated into a higher-dimensional vector matrix of dimension N*(K+L).
TFIDF considers both a word's frequency within a document and its frequency across the entire corpus, so it performs well and stably on most tasks. The TF-IDF algorithm is simple and fast, and its results accord well with reality, but measuring a word's importance by term frequency alone is clearly not comprehensive enough: sometimes an important word occurs only rarely, and the algorithm cannot reflect the positional information of words. Combining it with the Word2Vec algorithm effectively captures the association of words within their context.
Texts generated in the medical field contain not only salient, indicative medical vocabulary but also strong causal relations, that is, close contextual connections. Therefore, the features generated by TFIDF and word2vec are combined into a new feature that simultaneously considers the particularity of an individual word and its contextual connections. The merged feature can be used to optimize subsequent model training.
In one embodiment of the invention, the integrated use method of text features is implemented in the following steps:
1) Text preprocessing, including word segmentation and stop-word removal. Many tools can be used, such as jieba and Thulac; the stop-word dictionary can be built by oneself or taken from the open-source stop-word dictionaries of various institutions, and either can be manually expanded as needed according to the characteristics of medical text;
2) Train two feature-engineering models with TFIDF and Word2Vec respectively, and save them as vector matrices;
3) Concatenate the two feature vectors along the dimension direction (i.e., extending the number of columns) into a new feature vector that contains both the TFIDF and the Word2Vec description of the same corpus;
4) Use the synthesized vector to train the text classification model, which may be a linear SVM classifier, logistic regression, or a Bayes classifier, etc.
The TFIDF model is trained as follows:
First compute the TF value of a word using the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document;
Then compute the IDF value of the word using the formula idf_i = log(D / d_i), i.e., take the logarithm of the total number of documents divided by the number of documents containing the word;
Finally compute the product tf_i * idf_i to obtain the TFIDF weight of the word.
The Word2vec model training includes a CBOW model and a Skip-gram model: the CBOW model predicts a word from the n-1 surrounding words given as input, and the Skip-gram model predicts the context from the word itself. Word2vec is based on skip-gram or CBOW, so an individual word and its context set can be considered jointly.
The CBOW model and the Skip-gram model are both three-layer neural networks, i.e., an input layer, a middle layer (hidden layer), and an output layer.
The input layer of the CBOW model consists of the word vectors of the words in the window around the current word; the middle layer sums (or averages) the context-word vectors to obtain an intermediate vector; the output layer is a Huffman tree whose leaf nodes represent all the words in the corpus, each leaf node having a global code such as "01001"; each node of the middle layer is related to the non-leaf nodes of the Huffman tree, and each non-leaf node is in effect a two-class softmax. The intermediate vector is routed through the subtrees of the Huffman tree, and the parameters of the non-leaf nodes are updated by continuous iteration, yielding the vector representation of each word.
The input layer of the Skip-gram model is the current word; since the input is no longer multiple vectors, no vector summation is needed in the hidden layer. The output layer is likewise a Huffman tree; based on the codes of the context words and of the current word, the parameters of the non-leaf nodes on the path and the current word vector are updated by gradient descent until convergence, yielding the vector representation of each word.
The associations between word vectors can then express the associations between the words themselves.
Finally, the synthesized vector is used for training the text classification model, optimizing subsequent model training.
Those skilled in the art can readily realize the present invention from the above specific embodiments. It should be understood, however, that the present invention is not limited to the above specific embodiments; on the basis of the disclosed embodiments, those skilled in the art can combine different technical features at will to realize different technical solutions.
Apart from the technical features described in the specification, everything else is technology known to those skilled in the art.

Claims (9)

1. An integrated use method of text features, characterized in that a corpus is processed with an identical text-preprocessing procedure, a TFIDF feature-engineering model and a Word2vec feature-engineering model are then trained separately, and two different vector-matrix representations of the same corpus are obtained;
the two resulting vector matrices are then simply concatenated into a higher-dimensional vector matrix, and the classification-task model is trained with this vector matrix.
2. The integrated use method of text features according to claim 1, characterized in that the text-preprocessing method includes word segmentation and stop-word removal.
3. The integrated use method of text features according to claim 1 or 2, characterized in that a document vector matrix of dimension N*K is obtained with TFIDF; a document vector matrix of dimension N*L is obtained with Word2vec; and the two vector matrices are concatenated into a higher-dimensional vector matrix of dimension N*(K+L).
4. The integrated use method of text features according to claim 3, characterized in that the method is implemented in the following steps:
1) text preprocessing, including word segmentation and stop-word removal;
2) training two feature-engineering models with TFIDF and Word2Vec respectively, and saving them as vector matrices;
3) concatenating the two feature vectors along the dimension direction into a new feature vector that contains both the TFIDF and the Word2Vec description of the same corpus;
4) using the synthesized vector to train the text classification model.
5. The integrated use method of text features according to claim 4, characterized in that the classification model includes a linear SVM classifier, logistic regression, or a Bayes classifier.
6. The integrated use method of text features according to claim 4, characterized in that the TFIDF model is trained as follows:
first compute the TF value of a word using the formula tf_i = n_i / N, where n_i is the number of times the word occurs in a given document and N is the total number of words in that document;
then compute the IDF value of the word using the formula idf_i = log(D / d_i), i.e., take the logarithm of the total number of documents divided by the number of documents containing the word;
finally compute the product of TF and IDF to obtain the TFIDF weight of the word.
7. The integrated use method of text features according to claim 4, characterized in that the Word2vec model training includes a CBOW model and a Skip-gram model, the CBOW model predicting a word from the n-1 surrounding words given as input, and the Skip-gram model predicting the context from the word itself.
8. The integrated use method of text features according to claim 7, characterized in that the CBOW model and the Skip-gram model are both three-layer neural networks, i.e., an input layer, a middle layer, and an output layer;
the input layer of the CBOW model consists of the word vectors of the words in the window around the current word; the middle layer sums the context-word vectors to obtain an intermediate vector; the output layer is a Huffman tree whose leaf nodes represent all the words in the corpus, each leaf node having a global code; each node of the middle layer is related to the non-leaf nodes of the Huffman tree; the intermediate vector is routed through the subtrees of the Huffman tree, and the parameters of the non-leaf nodes are updated by continuous iteration, yielding the vector representation of each word;
the input layer of the Skip-gram model is the current word; the output layer is a Huffman tree; based on the codes of the context words and of the current word, the parameters of the non-leaf nodes on the path and the current word vector are updated by gradient descent until convergence, yielding the vector representation of each word.
9. The integrated use method of text features according to claim 8, characterized in that the middle layer is a hidden layer.
CN201811571221.9A 2018-12-21 2018-12-21 A kind of integrated use method of text feature Pending CN109684637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571221.9A CN109684637A (en) 2018-12-21 2018-12-21 A kind of integrated use method of text feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571221.9A CN109684637A (en) 2018-12-21 2018-12-21 A kind of integrated use method of text feature

Publications (1)

Publication Number Publication Date
CN109684637A true CN109684637A (en) 2019-04-26

Family

ID=66188180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571221.9A Pending CN109684637A (en) 2018-12-21 2018-12-21 A kind of integrated use method of text feature

Country Status (1)

Country Link
CN (1) CN109684637A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
US9984682B1 (en) * 2016-03-30 2018-05-29 Educational Testing Service Computer-implemented systems and methods for automatically generating an assessment of oral recitations of assessment items
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system


Similar Documents

Publication Publication Date Title
Singh et al. Vectorization of text documents for identifying unifiable news articles
CN109933789B (en) Neural network-based judicial domain relation extraction method and system
US20210034813A1 (en) Neural network model with evidence extraction
CN112115700B (en) Aspect-level emotion analysis method based on dependency syntax tree and deep learning
CN110019770A (en) The method and apparatus of train classification models
CN105930368B (en) A kind of sensibility classification method and system
CN107967255A (en) A kind of method and system for judging text similarity
Cui et al. Sliding selector network with dynamic memory for extractive summarization of long documents
CN113239186A (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
JP2022088319A (en) Analysis of natural language text in document
Ettaouil et al. Architecture optimization model for the multilayer perceptron and clustering.
CN109271516A (en) Entity type classification method and system in a kind of knowledge mapping
Elayidom et al. A generalized data mining framework for placement chance prediction problems
CN112463989A (en) Knowledge graph-based information acquisition method and system
CN110705279A (en) Vocabulary selection method and device and computer readable storage medium
CN110888944B (en) Attention convolutional neural network entity relation extraction method based on multi-convolutional window size
Wu et al. TW-TGNN: Two windows graph-based model for text classification
CN108122613A (en) Health forecast method and apparatus based on health forecast model
CN110020015A (en) A kind of conversational system answers generation method and system
CN109684637A (en) A kind of integrated use method of text feature
CN112686306B (en) ICD operation classification automatic matching method and system based on graph neural network
US20170011309A1 (en) System and method for layered, vector cluster pattern with trim
Gao et al. Compressing lstm networks by matrix product operators
Keegan Using first-order stochastic based optimizers in solving regression models
Mandayam et al. Intelligent conversational model for mental health wellness

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190426