CN101377769A - Multi-granularity representation method for text information

Info

Publication number
CN101377769A
CN101377769A
Authority
CN
China
Prior art keywords
text
feature
model
representation
integrated
Prior art date
2007-08-29
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007101210789A
Other languages
Chinese (zh)
Inventor
戴汝为
朱远平
王春恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CNA2007101210789A
Publication of CN101377769A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a text information representation method based on multi-granularity text features. Training a multi-granularity text representation model produces text models at several granularities, which are then integrated to form a multi-granularity integrated representation of the text. The method solves the problem of integrating multi-granularity text features, using either global weights or local text features. The representation is highly robust and stable with respect to corpus scale and sparse data. By mapping the text into semantic spaces of multiple granularities, the semantic structures implicit in the text can be captured accurately and completely, and the correlations between features of different granularities allow the advantages of fine-grained and coarse-grained representation to be combined. For training corpora of different scales, the method's representation performance is better than that of single-granularity text representation methods. Although the multi-granularity model has a multilayer structure, the relations between the layers are clear, and the method avoids the complex parameter-tuning operations of many text representation methods.

Description

A multi-granularity representation method for text information
Technical field
The present invention relates to the technical fields of intelligent information processing, information retrieval, and natural language processing, and in particular to a text information representation method used to represent text in information retrieval and other text processing applications, so that the text can be processed and analyzed by a computer; it shows superior performance in information retrieval, text classification, and other text processing uses.
Background art
Text processing is an important technology within information processing and holds a core position in fields such as information retrieval, text information analysis, and natural language processing. The first step of text processing is to express the text information in a form that a computer program can analyze, and the quality of the representation method directly affects the effectiveness and efficiency of text processing. Especially in applications such as text retrieval, text classification and clustering, and text content analysis, the performance of the text representation often plays a decisive role.
The classical text representation approach, based on the bag-of-words (BOW) model and the vector space model (VSM), takes the distinct words of a text as its representation features. According to the statistical information of these words in the text (such as occurrence frequency), methods such as TF-IDF and information gain (IG) are used for feature selection and weight calculation, forming a vector representation of the text. A word used to generate the vector is called a feature item (term). A representation whose features are single words has the smallest granularity and is easily affected by word ambiguity. Some researchers have proposed coarser-grained representation methods, such as phrase-based and n-gram methods, to improve the semantic accuracy of the feature words; research also shows that the judicious use of coarse-grained features helps improve representation performance. Other researchers use the LSI model to map the word feature space into a latent semantic space, extracting latent semantic structure to improve representation performance; this is in essence a process of feature selection and transformation.
However, all of the above are single-granularity text representation methods. Text representations of different granularities correspond to mappings of the text into semantic spaces of different granularities, and each such space has its own special significance for text content analysis. At the same time, the optimal granularity is not uniform across the local parts of a text. A single-granularity text representation therefore cannot be the optimal representation of the text.
Summary of the invention
To solve the problem that the single-granularity text representations of the prior art cannot represent a text optimally, the object of the present invention is to study multi-granularity text representation methods and improve the performance of text information representation, thereby advancing the intelligent processing of text information in information retrieval, text classification, text clustering, and text content analysis. To this end, the invention provides a text information representation method based on multi-granularity text features.
In order to achieve this object, the multi-granularity representation method for text information provided by the invention comprises the following steps:
Step S1: training a multi-granularity text representation model to generate text models of multiple granularities;
Step S2: integrating the multi-granularity text feature models to form a multi-granularity integrated representation of the text information.
The multi-granularity text representation model is composed of a plurality of text models of different granularities, forming a model of multilayer structure in which the feature units of the granularity models at each layer are interrelated; this model is used for the multi-granularity representation of text information.
The learning process of the multi-granularity text representation model comprises the following steps:
Step S11: training the text models of the different granularities separately;
Step S12: analyzing the relations between the feature units of the different-granularity text models, and combining the models into a text representation model of multilayer structure.
In the integration of multi-granularity text features based on global weights, each different-granularity text model is assigned its own weight, and the text feature spaces of the models are fused by weighting to obtain a new text feature space, in which the multi-granularity text features are thereby merged.
The integration of multi-granularity text features based on local text features comprises the following steps:
Step S21: obtaining the text feature representations of the text models of the several granularities;
Step S22: computing the weights of the different-granularity features at each local part of the text;
Step S23: integrating the multi-granularity features to represent the text.
To integrate the multi-granularity features to represent the text, the inclusion relations between the local text features of different granularities are exploited: according to the probability distribution parameters of the feature units of each granularity, the weights of each granularity's features are computed analytically for each local part of the text, and on this basis the multi-granularity features of the text are integrated.
The integration of multi-granularity text features adopts either integration based on global weights or integration based on local text features.
The text features used are provided by a plurality of text models of different granularities, and the learning of the text models is carried out with the support of a corpus; the text representation models of the different granularities may be text representations of the same type, or of different types.
If the integration of multi-granularity text features in step S2 uses integration based on local text features, then the learning of each granularity text model in step S1 must also learn the probability distribution parameters of the feature units of each granularity.
In the multi-granularity feature integration based on global weights, the global weight assigned to each different-granularity text model can be adjusted during representation; if only one text model has a nonzero global weight, the multi-granularity representation method degenerates into a single-granularity representation.
In the multi-granularity feature integration based on local text features, the global weights assigned to the different-granularity text models can likewise be adjusted during representation; if only one text model has a nonzero global weight, the multi-granularity representation degenerates into a single-granularity representation.
Advantageous effects of the invention: the invention offers superior performance, with high robustness and stability with respect to corpus scale and sparse data. By obtaining the mapping of the text into semantic spaces of multiple granularities, the multi-granularity representation method can portray the semantic structures contained in the text more accurately and completely. Moreover, based on the correlations between multi-granularity text features, it can combine the advantages of fine-grained and coarse-grained text representation. For training corpora of different scales, its representation performance is better than that of single-granularity methods in every case. The method is also relatively simple to implement: although the multi-granularity representation model has a multilayer structure, the relations between the layers are clear, and the complex parameter-tuning operations of many text representation methods are avoided. The invention is therefore particularly suited to text information processing fields involving large-scale text computation, such as information retrieval, text classification, text clustering, and text content analysis.
The principle of the invention is as follows. Text features of different granularities represent mappings of the text into semantic spaces of different granularities and each has its own meaning in representation; yet, as whole-text representations, they approximate one another semantically. At the same time, representations of different granularities differ in their dependence on the corpus and in their robustness to sparse data. Fine-grained representations can portray subtle semantic differences and are more robust to data sparsity, but they lack accuracy in portraying the semantics of a text and are easily affected by ambiguity; they suit applications on small training corpora. Coarse-grained representations are usually higher-level semantic abstractions: more accurate in semantic expression, less affected by ambiguity, and better at portraying text semantics, but sensitive to data sparsity and in need of large-scale corpus support. Smoothing techniques can improve robustness to sparsity to some extent, but their parameter tuning is complex and their effect is not always satisfactory. Combining granularities according to these complementary properties, and exploiting the advantages of each, yields better performance. Simply adding the text features of different granularities together fuses their feature spaces into a multi-granularity representation, but this is not the optimal method. Coarse-grained features are superior in the determinacy of semantic expression, but viewed from the local parts of the text, no single granularity is suitable everywhere; the locally optimal granularity varies. The distribution probability of a feature in the corpus reflects its reliability: the higher the probability, the more reliable the feature is for representing the text, and the higher the weight it should be given. If the probability of a feature at some granularity is low, that feature is less reliable, and the representation should rely more on lower-order features. Therefore, exploiting the inclusion relations that hold between text features of different granularities, weights are assigned to the features of each granularity by analyzing the relative probabilities of the local feature units of the text. In other words, features are integrated by allocating weights to the different granularities according to their reliability, combining the advantages of coarse-grained and fine-grained features in text representation, so that a better semantic representation is obtained while robustness to the training samples is improved, yielding better overall representation performance.
Description of drawings
Fig. 1 is a block diagram of the text processing flow using the multi-granularity text representation method.
Fig. 2 is a flowchart of the text model learning process in the multi-granularity text representation method.
Fig. 3 is a flowchart of text representation using the multi-granularity text model.
Fig. 4 is a schematic diagram of the structure of the multi-granularity text representation model.
Fig. 5 is a schematic diagram of the association relations between multi-granularity text features.
Embodiment
The preferred embodiments of the present invention are described below. This section only illustrates the invention and does not limit the invention or its application or uses; other embodiments derived from the invention likewise fall within the scope of its technical innovation. Any parameter settings that appear are example values only.
As shown in the text processing block diagram of Fig. 1, the invention is a text information representation method based on multi-granularity text features. First, text models of different granularities are trained from a corpus, the relations between the feature units of the different-granularity models are analyzed, and the models are combined into a multi-granularity text representation model. Then these models are used to generate feature representations of the target text at each granularity, and a multi-granularity feature integration method forms the multi-granularity representation of the text. To this end, the invention proposes two integration methods for multi-granularity text features, one based on global weights and one based on local text features, solving the problem of integrating multi-granularity text features.
The multi-granularity representation method for text information is described below.
A feature word in a text representation may be a single word or a phrase, and the granularity of a representation refers to the length of the feature words it uses; a representation whose feature items are single words therefore has the smallest granularity. A multi-granularity text representation method integrates representations of different granularities into a unified representation of the text, obtaining better representation performance.
The method divides into two main stages: first, the learning and training of the multi-granularity text model, in which text models of different granularities are built from a text corpus and combined into the multi-granularity model; second, the use of these models and their parameters to produce the integrated multi-granularity feature representation of a text.
Step S1: learning the multi-granularity text model.
Taking the corpus as the object, the text representation models of each granularity are learned, and feature selection and weight calculation are then carried out on the representation features of each granularity. The learning process of each granularity model differs little from that of an ordinary text model, except that the distribution probability parameters of the text feature units (also called feature items) in the corpus must be learned during this process, for use in the multi-granularity integration stage. The feature unit probability parameters comprise the occurrence probability information of the feature units in the training text and the generation probabilities between feature items of different granularities. As shown in the flowchart of the text model learning process in Fig. 2, the processing steps are:
Step S11: text preprocessing. The text is preprocessed as required, including stop-word removal, stemming, and word segmentation.
Step S12: multi-granularity text model learning. Text models of different granularities are trained on the corpus, including learning the distribution parameters of the feature items in each model, computing feature item weights, and selecting features. The text models of the several granularities are then combined into the multi-granularity text representation model.
The structure of the multi-granularity text representation model is shown in Fig. 4. The models of different granularities form one larger model with a multilayer structure, the layers being linked by the mutual inclusion relations of the feature units of different granularities (indicated by arrow connections in Fig. 4). Fig. 5 illustrates the association relations between multi-granularity text features: for a concrete text sequence (a string of words), a coarse-grained text feature contains the fine-grained text features within it, and this relation is used to describe the association between features of different granularities.
Step S2: multi-granularity integrated representation of the text.
The multi-granularity integrated representation is the core of the invention: by analyzing the constraint relations that exist among text features of different granularities at the local parts of the text, a multi-granularity integrated representation of the target text is generated. As shown in the flowchart of text representation with the multi-granularity text model in Fig. 3, the basic steps are:
Step S21: text preprocessing. The text is preprocessed, including stop-word removal, stemming, and word segmentation, in the same way as in step S11 of the model training process.
Step S22: compute the text representation of the target text under each different-granularity text model, according to each model's own representation method.
Step S23: integrate the multi-granularity text representations. The multi-granularity integrated representation of the text is computed by one of the integration methods for multi-granularity text features.
In essence, integrating multi-granularity text features means merging the feature spaces of the different granularities in some way, so that the integrated multi-granularity features represent the text. Two methods can be adopted: an integration method based on global weights, and an integration method based on local text features.
The integration method based on global weights is relatively simple to implement: each different-granularity text model is assigned its own weight, and every feature unit inside a given granularity model receives that model's global weight. In representation, the text representations of the different granularity features are weighted on the basis of these global weights and their union is taken, fusing the multi-granularity feature spaces into the final text representation.
The integration method based on local text features treats the target text as a text stream or word sequence. The weight of each granularity feature unit is computed by analyzing the association relations between the feature units of different granularities at each local part of the text, and the representations of the different granularities are integrated on this basis. The association between the feature units of the different-granularity models mainly exploits the structural inclusion relation between coarse-grained and fine-grained feature units, and portrays their correlation through the probabilistic relations that exist between them. In general, a high probability for a text feature indicates that the feature of that granularity is more reliable for representing the text and deserves a higher weight; if the probability is low, the parameters of that granularity feature are less reliable, and the representation should rely more on lower-order features. Through the relative probabilities between the feature units of different granularities, different weights are computed and assigned to the feature units of each granularity at each local part of the text, and the different-granularity features are integrated locally with these weights. Generally a coarse-granularity-priority rule should be followed in this process, tilting the weight calculation toward coarse granularities so as to obtain better semantic determinacy.
Global weights are used in the integration method based on local text features as well, in the same way as in the method based on global weights. Thus, in the multi-granularity integration of a text, by adjusting the global weights we can choose to tilt the center of gravity of the representation toward the coarse-grained representation, obtaining a semantically determinate result, or toward the fine-grained representation models, preserving their advantage in portraying subtle semantics and their robustness. Moreover, with either integration method, if only one text model has a nonzero global weight, the multi-granularity method degenerates into a single-granularity representation method; the method of the invention therefore in fact subsumes single-granularity representation methods.
Example:
This example takes the text representation used in a text classification application, implementing the multi-granularity representation of the text with the n-gram method, different values of n representing different granularities. In the multi-granularity learning stage, the different n-gram text models must be learned and then integrated, forming the multilayer-structure model. The present embodiment describes the implementation with a three-layer n-gram model consisting of Unigram, Bigram, and Trigram text models; the text features used are the language units (grams) of each order. The details are as follows:
1. Learning the multi-granularity text model
Unigram, Bigram, and Trigram are text representation models of granularities 1 to 3 respectively. n-gram learning is carried out on each separately, learning the probability data of its own language units; the probabilities needed by the n-gram text models here are the occurrence probabilities of the n-word strings.
1.1 Text preprocessing
The preprocessing of text must distinguish Chinese and English and handle them separately: Chinese usually requires word segmentation, while English requires stemming. Stop-word removal applies to both, rejecting semantically empty words such as modal particles and function words from the text. Since the present embodiment adopts n-gram models, word segmentation may be omitted in the preprocessing of Chinese.
1.2 Unigram text model learning
Let w_i denote a word occurring in the corpus, let M be the total number of distinct words, and let c(w_i) denote the occurrence count of word w_i in the corpus. The maximum-likelihood estimate of the distribution probability of w_i is computed as:

P(w_i) = \frac{c(w_i)}{\sum_{j=1}^{M} c(w_j)}    (1)
1.3 Bigram text model learning
The occurrence count of the two-word string (w_{i1} w_{i2}) in the corpus is denoted c(w_{i1} w_{i2}); its distribution probability P(w_{i1} w_{i2}) is computed as:

P(w_{i1} w_{i2}) = \frac{c(w_{i1} w_{i2})}{c(w_{i1})} \, P(w_{i1})    (2)
1.4 Trigram text model learning
The occurrence count of the three-word string (w_{i1} w_{i2} w_{i3}) in the corpus is denoted c(w_{i1} w_{i2} w_{i3}); its distribution probability P(w_{i1} w_{i2} w_{i3}) is computed as:

P(w_{i1} w_{i2} w_{i3}) = \frac{c(w_{i1} w_{i2} w_{i3})}{c(w_{i1} w_{i2})} \, P(w_{i1} w_{i2})    (3)
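To make this learning step concrete, the following is a minimal Python sketch of formulas (1) to (3); it is an illustration under my own assumptions (pre-tokenized input, plain maximum-likelihood counts, no smoothing), not the patent's prescribed implementation:

```python
from collections import Counter

def learn_ngram_models(corpus_texts, max_n=3):
    """Estimate n-gram occurrence probabilities, as in formulas (1)-(3).

    corpus_texts: iterable of preprocessed texts, each a list of words.
    Returns (counts, probs); probs[n] maps an n-word tuple to its
    maximum-likelihood distribution probability.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for words in corpus_texts:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1

    probs = {}
    total_words = sum(counts[1].values())
    # Formula (1): P(w_i) = c(w_i) / sum_j c(w_j)
    probs[1] = {g: c / total_words for g, c in counts[1].items()}
    # Formulas (2)-(3): P(w_1..w_n) = c(w_1..w_n) / c(w_1..w_{n-1})
    #                                 * P(w_1..w_{n-1})
    for n in range(2, max_n + 1):
        probs[n] = {
            g: counts[n][g] / counts[n - 1][g[:-1]] * probs[n - 1][g[:-1]]
            for g in counts[n]
        }
    return counts, probs
```

For a toy corpus, learn_ngram_models([["a", "b", "a", "b"]]) gives P(("a",)) = 0.5 and P(("a", "b")) = 2/2 · 0.5 = 0.5, consistent with formula (2).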
1.5 Feature selection and feature item weight calculation
Feature selection is used, on the one hand, to reduce the number of text feature items and improve computational efficiency; on the other hand, to select the feature items helpful for text classification and filter out those that are unhelpful or even harmful. In text classification, excessively high-frequency or low-frequency items not only increase the amount of computation but also have a negative effect on classification performance. A word-frequency method can be used to filter these feature items; for example, the 10 highest-frequency feature items learned from the training corpus and all feature items with frequency below 3 may be filtered out.
In the text vector generated by a text model, each vector component corresponds to the weight of a feature item. This embodiment computes the weight of each feature item within a document by the TF-IDF method; the vector component under the TF-IDF method is

a_{jd} = f_{jd} \cdot \log(N / n_j)    (4)

where f_{jd} is the frequency of the j-th feature item in text d, N is the number of texts in the training corpus, and n_j is the number of texts in which the j-th feature appears at least once. The weight of the j-th feature item itself is its IDF value:

tw(j) = \log(N / n_j)    (5)
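As an illustration of this feature selection and weighting step, the following sketch filters features by frequency and computes the TF-IDF weights of formulas (4) and (5); the thresholds, the data layout (one term-frequency Counter per document), and the helper names are my assumptions:

```python
import math
from collections import Counter

def select_features(doc_term_freqs, top_k=10, min_freq=3):
    """Word-frequency feature selection: filter out the top_k most
    frequent feature items and all items occurring fewer than
    min_freq times in the training corpus."""
    total = Counter()
    for tf in doc_term_freqs:
        total.update(tf)
    too_frequent = {t for t, _ in total.most_common(top_k)}
    return {t for t, c in total.items()
            if c >= min_freq and t not in too_frequent}

def tfidf_vectors(doc_term_freqs, features):
    """Formula (4): a_jd = f_jd * log(N / n_j); the IDF factor
    log(N / n_j) is the feature-item weight tw(j) of formula (5)."""
    N = len(doc_term_freqs)
    n_j = Counter()
    for tf in doc_term_freqs:
        n_j.update(t for t in tf if t in features)
    tw = {t: math.log(N / n_j[t]) for t in features if n_j[t] > 0}
    return [{t: f * tw[t] for t, f in tf.items() if t in tw}
            for tf in doc_term_freqs]
```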
2. Integrating the multi-granularity text representation
First, the text is preprocessed, in the same way as in the learning stage of the multi-granularity model. Then the learned models of the different granularities are used to represent the text, yielding its feature vectors. Let \Phi_k denote the feature space of the granularity-k text model and L_k the size of that space. For a text sequence S = (w_1, w_2, \ldots, w_T), its text vector at granularity k is denoted V_k, and the text feature vector of granularity k is expressed by formula (6):

V_k = \{V_{k1}, V_{k2}, \ldots, V_{kL_k}\}    (6)

where V_{k1}, \ldots denote the components of the text vector.
Integrating the multi-granularity representation is in fact integrating the text vectors of the different granularity features; the key is to merge the multi-granularity feature spaces and to integrate the representation process. Two methods are mainly available: integration based on global weights, and integration based on local text features. An embodiment of each follows:
1) Integration method based on global weights
Each granularity text model is assigned its own global weight r_k, and the feature spaces of the different granularities are merged into a new feature space; the process can be expressed by relational expression (7):

\Phi = \bigcup_{k=1}^{N} r_k \Phi_k    (7)

For a concrete text representation, this is implemented as a weighted fusion of the text vectors of the different granularity features, each granularity's features receiving the global weight r_k of its model; the processing of the multi-granularity representation of a text sequence can be expressed by relational expression (8):

V = r_1 V_1 \oplus r_2 V_2 \oplus \cdots \oplus r_N V_N = \{r_1 V_{11}, r_1 V_{12}, \ldots, r_1 V_{1L_1}, r_2 V_{21}, r_2 V_{22}, \ldots, r_2 V_{2L_2}, \ldots, r_N V_{N1}, r_N V_{N2}, \ldots, r_N V_{NL_N}\}    (8)
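A minimal sketch of relational expression (8) follows; representing each V_k as a plain Python list and the fused vector as their weighted concatenation is my assumption:

```python
def fuse_global(vectors, weights):
    """Relational expression (8): V = r_1 V_1 (+) ... (+) r_N V_N,
    i.e. the weighted concatenation (direct sum) of the
    per-granularity text vectors.

    vectors: [V_1, ..., V_N], each a list of feature weights
    weights: [r_1, ..., r_N], the global weights of the models
    """
    fused = []
    for r_k, v_k in zip(weights, vectors):
        fused.extend(r_k * component for component in v_k)
    return fused

# For example, fuse_global([[1.0, 2.0], [3.0]], [0.5, 1.0])
# gives [0.5, 1.0, 3.0]; weights such as [0, 1] reproduce the
# degenerate single-granularity case discussed below.
```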
2) Integration method based on local text features
The integration method based on local text features is more complex, but can obtain better results.
First, the language units of each layer in the N-layer n-gram representation model of the text sequence S are expressed as

L_k = \{(w_{i-k+1} \cdots w_i) \mid i = 1, \ldots, T\}, \quad k = 1, \ldots, N    (9)

and the distribution probability of language unit i of layer k is

p_k(i) = P(w_{i-k+1} \cdots w_i)    (10)
Weights are computed and set according to the relative distribution probabilities of the language-unit feature items of the n-gram models of each order: the higher the probability of a granularity feature, the larger the weight it obtains. To tilt toward coarse-grained features, the weight of every language-unit feature of the highest-order granularity model is set to 1.0, while the weight of a language-unit feature of any other granularity model is set to the residue obtained by subtracting from its own distribution probability the probability of the higher-order granularity features covering it; a recursive process thus assigns each granularity feature its weight q_k(i) at each local part of the text. The process can be expressed in two steps.

First, the weights of all feature items in the highest-order granularity model are set to 1.0:

q_k(i) = 1.0, \quad k = N    (11)

Second, the weights of the feature items in each layer of granularity with k < N are obtained recursively in descending order of k; the process is expressed by relational expression (12):

q_k(i) = \frac{p_k(i) - \left( p_{k+1}(i) + p_{k+1}(i-1) - p_{k+2}(i-1) \right)}{p_k(i)}, \quad k = 1, \ldots, N-1

with p_{k+2}(\cdot) = 0 when k = N-1, and q_k(i) set to 0 if the result is negative.    (12)
Let tn_k(i) denote the index of the k-gram feature item corresponding to the i-th language unit of the layer-k model in the text sequence S, and let tw_k(j) denote the attribute weight of the j-th feature item of the layer-k model. The k-gram feature item weight sequence of the text sequence can then be expressed as

S' = (q_k(1) \cdot tw_k(tn_k(1)), \; q_k(2) \cdot tw_k(tn_k(2)), \; \ldots, \; q_k(T) \cdot tw_k(tn_k(T)))    (13)
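The following sketch implements the recursion of formulas (11) and (12); the indexing convention (p maps a pair (k, i) to p_k(i), with missing or invalid units treated as probability 0.0) is my assumption:

```python
def local_weights(p, N, T):
    """Compute the local feature weights q_k(i) of formulas (11)-(12).

    p: dict mapping (k, i) -> p_k(i), the distribution probability of
       the k-gram ending at position i (missing entries, e.g. i < k,
       are treated as probability 0.0); i runs from 1 to T.
    N: number of granularity layers; T: length of the text sequence.
    """
    q = {}
    for i in range(1, T + 1):
        q[(N, i)] = 1.0                        # formula (11): top layer
    for k in range(N - 1, 0, -1):              # descending k, formula (12)
        for i in range(1, T + 1):
            p_k = p.get((k, i), 0.0)
            if p_k <= 0.0:
                q[(k, i)] = 0.0
                continue
            covered = p.get((k + 1, i), 0.0) + p.get((k + 1, i - 1), 0.0)
            if k < N - 1:                      # p_{k+2} vanishes at k = N-1
                covered -= p.get((k + 2, i - 1), 0.0)
            q[(k, i)] = max((p_k - covered) / p_k, 0.0)  # clamp negatives
    return q
```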
On this basis, the text vector corresponding to each text model is obtained, denoted V'_k. As in the integration method based on global weights, each granularity text model is given its own global weight r_k; the text vector resulting from the multi-granularity representation of the text sequence can be expressed by relational expression (14):

V' = r_1 V'_1 \oplus r_2 V'_2 \oplus \cdots \oplus r_N V'_N = \{r_1 V'_{11}, r_1 V'_{12}, \ldots, r_1 V'_{1L_1}, r_2 V'_{21}, r_2 V'_{22}, \ldots, r_2 V'_{2L_2}, \ldots, r_N V'_{N1}, r_N V'_{N2}, \ldots, r_N V'_{NL_N}\}    (14)

By assigning different global weights r_k, we can set the importance of each granularity text model in the multi-granularity representation; by default all weights may be set to 1.0. If one layer is given weight 1 and all other layers weight 0, the multilayer n-gram model degenerates into an ordinary single n-gram model, so multilayer n-gram models and ordinary n-gram models can in fact be unified. For example, with r_m = 1 and r_k = 0 for k ≠ m, the multilayer n-gram model degenerates into an ordinary m-order n-gram. This realizes a unified framework for multi-granularity text representation.
Once the vector representation of a text has been obtained, it can be used in text information processing applications. In text classification, the text vectors can be used to train classifiers of various kinds, and the classifiers then applied to classify target texts. Tests on the Reuters-21578 dataset, widely used in text classification research, show that with the same feature extraction method and classifier, the inventive method improves classification accuracy by 2 to 3 percentage points over single-granularity methods.
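As a closing usage illustration, a fused text vector can be fed to any standard classifier; the patent does not prescribe one, so scikit-learn's logistic regression here is purely my choice, with toy data:

```python
from sklearn.linear_model import LogisticRegression

# Toy fused multi-granularity vectors (e.g. the output of fuse_global
# above) for two training texts, with class labels 0 and 1.
X_train = [[0.5, 0.0, 1.2, 0.3],
           [0.0, 0.9, 0.1, 1.1]]
y_train = [0, 1]

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                   # train on the fused vectors
print(clf.predict([[0.4, 0.1, 1.0, 0.2]]))  # classify a target text
```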

Claims (11)

1. A multi-granularity representation method for text information, characterized by comprising the following steps:
Step S1: training a multi-granularity text representation model to generate text models of multiple granularities;
Step S2: integrating the multi-granularity text feature models to form a multi-granularity integrated representation of the text information.
2. The multi-granularity representation method for text information according to claim 1, characterized in that: the multi-granularity text representation model is composed of a plurality of text models of different granularities, forming a model of multilayer structure in which the feature units of the granularity models at each layer are interrelated; the model is used for the multi-granularity representation of text information.
3. The multi-granularity representation method for text information according to claim 1, characterized in that the learning process of the multi-granularity text representation model comprises the following steps:
Step S11: training the text models of the different granularities separately;
Step S12: analyzing the relations between the feature units of the different-granularity text models, and combining the models into a text representation model of multilayer structure.
4. The multi-granularity representation method for text information according to claim 1, characterized in that: in the integration of multi-granularity text features based on global weights, each different-granularity text model is assigned its own weight, and the text feature spaces of the models are fused by weighting to obtain a new text feature space, in which the multi-granularity text features are thereby merged.
5. The multi-granularity representation method for text information according to claim 1, characterized in that the integration of multi-granularity text features based on local text features comprises the following steps:
Step S21: obtaining the text feature representations of the text models of the several granularities;
Step S22: computing the weights of the different-granularity features at each local part of the text;
Step S23: integrating the multi-granularity features to represent the text.
6. The multi-granularity representation method for text information according to claim 5, characterized in that integrating the multi-granularity features to represent the text means exploiting the inclusion relations between the local text features of different granularities: according to the probability distribution parameters of the feature units of each granularity, the weights of each granularity's features are computed analytically for each local part of the text, and on this basis the multi-granularity features of the text are integrated.
7. The multi-granularity representation method for text information according to claim 4 or 5, characterized in that: the integration of multi-granularity text features adopts either integration based on global weights or integration based on local text features.
8. The multi-granularity representation method for text information according to claim 1, characterized in that: the text features used are provided by a plurality of text models of different granularities, and the learning of the text models is carried out with the support of a corpus; the text representation models of the different granularities may be text representations of the same type, or of different types.
9. The multi-granularity representation method for text information according to claim 1 or 5, characterized in that: if the integration of multi-granularity text features in step S2 uses integration based on local text features, then the learning of each granularity text model in step S1 must also learn the probability distribution parameters of the feature units of each granularity.
10. The multi-granularity representation method for text information according to claim 4, characterized in that: in the integration of multi-granularity text features based on global weights, the global weight assigned to each different-granularity text model can be adjusted during representation; if only one text model has a nonzero global weight, the multi-granularity representation method degenerates into a single-granularity representation.
11. The multi-granularity representation method for text information according to claim 5, characterized in that: in the integration of multi-granularity text features based on local text features, the global weights assigned to the different-granularity text models can likewise be adjusted during representation; if only one text model has a nonzero global weight, the multi-granularity representation degenerates into a single-granularity representation.
CNA2007101210789A 2007-08-29 2007-08-29 Multi-granularity representation method for text information Pending CN101377769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101210789A CN101377769A (en) 2007-08-29 2007-08-29 Multi-granularity representation method for text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007101210789A CN101377769A (en) 2007-08-29 2007-08-29 Multi-granularity representation method for text information

Publications (1)

Publication Number Publication Date
CN101377769A true CN101377769A (en) 2009-03-04

Family

ID=40421317

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007101210789A Pending CN101377769A (en) Multi-granularity representation method for text information

Country Status (1)

Country Link
CN (1) CN101377769A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103518187B (en) * 2011-03-10 2015-07-01 特克斯特怀茨有限责任公司 Method and system for information modeling and applications thereof
CN103518187A (en) * 2011-03-10 2014-01-15 特克斯特怀茨有限责任公司 Method and system for information modeling and applications thereof
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104408153B (en) * 2014-12-03 2018-07-31 中国科学院自动化研究所 A kind of short text Hash learning method based on more granularity topic models
CN104462408A (en) * 2014-12-12 2015-03-25 浙江大学 Topic modeling based multi-granularity sentiment analysis method
CN104462408B (en) * 2014-12-12 2017-09-01 浙江大学 A kind of many granularity sentiment analysis methods modeled based on theme
CN107169086B (en) * 2017-05-12 2020-10-27 北京化工大学 Text classification method
CN107169086A (en) * 2017-05-12 2017-09-15 北京化工大学 A kind of file classification method
CN107797985B (en) * 2017-09-27 2022-02-25 百度在线网络技术(北京)有限公司 Method and device for establishing synonymous identification model and identifying synonymous text
CN107797985A (en) * 2017-09-27 2018-03-13 百度在线网络技术(北京)有限公司 Establish synonymous discriminating model and differentiate the method, apparatus of synonymous text
WO2019080863A1 (en) * 2017-10-26 2019-05-02 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
CN108763510A (en) * 2018-05-30 2018-11-06 北京五八信息技术有限公司 Intension recognizing method, device, equipment and storage medium
CN110276640A (en) * 2019-06-10 2019-09-24 北京云莱坞文化传媒有限公司 More granularities of copyright are split and its method for digging of commercial value
CN111046179A (en) * 2019-12-03 2020-04-21 哈尔滨工程大学 Text classification method for open network question in specific field
CN111046179B (en) * 2019-12-03 2022-07-15 哈尔滨工程大学 Text classification method for open network question in specific field
CN112163404A (en) * 2020-08-25 2021-01-01 北京邮电大学 Text generation method and device, electronic equipment and storage medium
US11373041B2 (en) 2020-09-18 2022-06-28 International Business Machines Corporation Text classification using models with complementary granularity and accuracy
CN113011176A (en) * 2021-03-10 2021-06-22 云从科技集团股份有限公司 Language model training and language reasoning method, device and computer storage medium thereof
CN114254158A (en) * 2022-02-25 2022-03-29 北京百度网讯科技有限公司 Video generation method and device, and neural network training method and device
CN114254158B (en) * 2022-02-25 2022-06-10 北京百度网讯科技有限公司 Video generation method and device, and neural network training method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090304