CN101377769A - Method for representing multiple graininess of text message - Google Patents
- Publication number
- CN101377769A (application CN200710121078A / CNA2007101210789A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to a text information representation method based on multi-granularity text features. A multi-granularity text representation model is trained and its component models are integrated, yielding an integrated multi-granularity representation of text information. The method solves the problem of integrating multi-granularity text features, both on the basis of global weights and on the basis of local text features. The method is robust and stable with respect to corpus scale and data sparseness. By mapping text into semantic spaces of several granularities, the semantic structures implicit in the text can be captured accurately and fully, and the correlation between multi-granularity features allows the advantages of fine-grained and coarse-grained representation to be combined. On training corpora of different scales, the representation performance of the method is better than that of single-granularity text representation. Although the multi-granularity model has a multilayer structure, the relations between the layers are clear, and the complex parameter tuning required by many text representation methods is avoided.
Description
Technical field
The present invention relates to the fields of intelligent information processing, information retrieval and natural language processing, and in particular to a text information representation method. It is used to represent text in information retrieval and other text processing applications so that the text can be processed and analyzed by computer, and it shows superior performance in information retrieval, text classification and other text processing tasks.
Background technology
Text processing is a key technology in information processing and holds a central position in fields such as information retrieval, text analysis and natural language processing. The first step of text processing is to express text information in a form that computer programs can analyze, and the quality of the text representation directly affects the effectiveness and efficiency of subsequent processing. Especially in applications such as text retrieval, text classification and clustering, and text content analysis, the performance of the text representation is often decisive.
Classical text representation methods based on the bag-of-words (BOW) model and the vector space model (VSM) take the distinct words of a text as its features. Using statistical information about these words (such as occurrence frequency), methods such as TF-IDF and information gain (IG) perform feature selection and weight calculation to construct a vector representation of the text. The words used to generate the vector are called feature items (terms). Representations whose features are single words have the smallest granularity and are easily affected by word ambiguity. Some researchers therefore adopt coarser-grained representations, such as phrase-based and n-gram-based methods, to improve the semantic accuracy of the feature words; research also shows that reasonable use of coarse-grained features helps improve representation performance. Other researchers use the LSI model to map the word feature space into a latent semantic space, extracting latent semantic structure to improve representation performance, which is in effect a process of feature selection and transformation.
All of the above, however, are single-granularity representation methods. Text representations of different granularities correspond to mappings of the text into semantic spaces of different granularities, each of which has its own significance for text content analysis. Moreover, the optimal granularity is not uniform across the local parts of a text. A single-granularity representation therefore cannot be the optimal representation of a text.
Summary of the invention
To solve the problem that the single-granularity text representation of the prior art cannot optimally represent a text, the object of the invention is to study multi-granularity text representation and improve the performance of text information representation, thereby advancing the intelligent processing of text in information retrieval, text classification, text clustering, text content analysis and similar applications. To this end, the invention provides a text information representation method based on multi-granularity text features.
To achieve this object, the method for multi-granularity representation of text information provided by the invention comprises the following steps:
Step S1: through learning, train a multi-granularity text representation model to generate text models of several granularities;
Step S2: integrate the multi-granularity text feature models to form an integrated multi-granularity representation of the text information.
The multi-granularity text representation model is composed of several text models of different granularities, forming a multilayer structure in which the feature units of the text models at the different granularity layers are interrelated; this model is used for the multi-granularity representation of text information.
The learning process of the multi-granularity text representation model comprises the following steps:
Step S11: learn the text models of each granularity separately;
Step S12: analyze the relations between the feature units of the text models of different granularities, and combine the models into a text representation model with a multilayer structure.
In multi-granularity feature integration based on global weights, each text model of a different granularity is given its own weight, and the feature spaces of the models are fused by weighting to obtain a new feature space, so that the multi-granularity text features are merged in the new space.
Multi-granularity feature integration based on local text features comprises the steps of:
Step S21: obtain the text feature representation of each granularity text model;
Step S22: calculate the weights of the local features of each granularity in the text;
Step S23: integrate the multi-granularity features to represent the text.
Integrating the multi-granularity features to represent the text means exploiting the inclusion relations between the local text features of different granularities: according to the probability distribution parameters of the feature units of each granularity, the local weight of each granularity feature is calculated analytically, and on this basis the multi-granularity features of the text are integrated.
The multi-granularity feature integration adopts either integration based on global weights or integration based on local text features.
The text features used are provided by several text models of different granularities, and the learning of the text models is carried out with the support of a corpus; the representation models of different granularities may be text representations of the same type, or of different types.
If integration based on local text features is used in step S2, then in the learning process of each granularity text model in step S1 the probability distribution parameters of the feature units of each granularity must also be learned.
In the multi-granularity feature integration based on global weights, the global weight given to each granularity text model can be adjusted during representation; if only one text model has a non-zero global weight, the multi-granularity representation degenerates into a single-granularity representation.
The same holds in the multi-granularity feature integration based on local text features: by adjusting the global weights given to the models, and in particular when only one text model has a non-zero global weight, the multi-granularity representation degenerates into a single-granularity representation.
Beneficial effects of the invention: the method offers superior performance and high robustness and stability with respect to corpus scale and data sparseness. By mapping text into semantic spaces of several granularities, the multi-granularity representation portrays the semantic structures contained in the text more accurately and completely, and the correlation between multi-granularity features makes full use of the advantages of both fine-grained and coarse-grained representation. On training corpora of different scales, its representation performance is better than that of single-granularity methods. The method is also relatively simple to implement: although the multi-granularity model has a multilayer structure, the relations between the layers are clear, avoiding the complex parameter tuning of many representation methods. The invention is therefore particularly suitable for text information processing fields involving large-scale text computation, such as information retrieval, text classification, text clustering and text content analysis.
The principle of the invention is as follows. Text features of different granularities correspond to mappings of the text into semantic spaces of different granularities and each have their own significance in representation, while as representations of the whole text they are semantically approximate to one another. At the same time, representations of different granularities differ in their dependence on the corpus and in their robustness to sparse data. Fine-grained representation can portray subtler semantic differences and is more robust to data sparseness, but it is less accurate in depicting the semantics of a text and is easily affected by ambiguity; it suits applications on small training corpora. Coarse-grained representation is usually a higher-level semantic abstraction: it is more accurate semantically and less affected by ambiguity, so it can portray text semantics more exactly, but it is sensitive to data sparseness and needs large-scale corpus support. Smoothing techniques can improve robustness to sparse data to some extent, but their parameter tuning is complex and the effect is not always satisfactory. Combining granularities on the basis of these differing properties, so that each contributes its advantages, yields better performance. Multi-granularity integration could be realized simply by adding the feature spaces of different granularities together, but this is not optimal: coarse-grained features are superior in semantic determinacy, yet viewed from the local parts of the text no single granularity is suitable everywhere, and the locally optimal granularity varies across the text. The distribution probability of a feature of a given granularity in the corpus indicates its reliability: the higher the probability, the more reliable the feature is for representing the text, and the higher the weight it should receive in representation; if the probability is low, the feature is less reliable, and the representation should depend more on lower-order features. Therefore, using the inclusion relations that hold between text features of different granularities, the relative probability parameters of the local feature units of each granularity are analyzed, and different weights are assigned to features of different granularities. In other words, feature integration assigns weights according to feature reliability, combining the advantages of coarse-grained and fine-grained features in representation, so as to obtain a better semantic representation while improving robustness to the training samples and hence better overall representation performance.
Description of drawings
Fig. 1 is a flow block diagram of text processing using the multi-granularity text representation method
Fig. 2 is a flowchart of the text model learning process in the multi-granularity text representation method
Fig. 3 is a processing flowchart of text representation using the multi-granularity text model
Fig. 4 is a schematic diagram of the structure of the multi-granularity text representation model
Fig. 5 is a schematic diagram of the association relations between multi-granularity text features
Embodiment
The preferred embodiments of the invention are introduced below. This section only illustrates the invention and does not limit the invention or its applications and uses; other embodiments derived from the invention likewise fall within its scope of technical innovation. The parameter settings that appear in the scheme are example values only.
As shown in the flow block diagram of Fig. 1, the invention is a text information representation method based on multi-granularity text features. First, text models of different granularities are trained on a corpus, the relations between the feature units of the different models are analyzed, and the models are combined into a multi-granularity text representation model. Then these models are used to generate feature representations of the target text at each granularity, and a multi-granularity feature integration method is applied to form the multi-granularity feature representation of the text. To this end, the invention proposes two multi-granularity feature integration methods, one based on global weights and one based on local text features, solving the integration problem of multi-granularity text features.
The method for multi-granularity representation of text information is described below.
A feature word in text representation can be a single word or a phrase, and the granularity of a representation refers to the length of the feature words it uses. A representation whose feature items are single words therefore has the smallest granularity. A multi-granularity representation method integrates representations of different granularities into a unified representation of the text and thereby obtains better representation performance.
The method is divided into two main steps. The first is the learning and training of the multi-granularity text model: text models of different granularities are built by learning from a text corpus and are combined into a multi-granularity model. The second is using these models and their parameters to compute the integrated multi-granularity feature representation of a text.
Step S1: learning of the multi-granularity text model.
Taking the corpus as the object, the text representation models of each granularity are learned, and feature selection and weight calculation are then performed on the representation features of each granularity. The learning process of each granularity model differs little from ordinary text model learning, except that the distribution probability parameters of the feature units (also called feature items) in the corpus must be learned in this course, for use in the multi-granularity integration process. The feature unit probability parameters are the occurrence probabilities of the feature units in the training text and the generation probabilities between feature items of different granularities. As shown in the flowchart of the text model learning process in Fig. 2, the processing steps are:
Step S11: text preprocessing. The text is preprocessed as required, including stop-word removal, stemming, word segmentation and similar processing.
Step S12: multi-granularity text model learning. Text models of different granularities are trained on the corpus, including distribution parameter learning for the feature items, feature item weight calculation and feature selection for each model. The text models of the several granularities are then combined into the multi-granularity text representation model.
The structure of the multi-granularity text representation model is shown in Fig. 4. The models of different granularities form one larger model with a multilayer structure, and the layers are linked through the mutual inclusion relations of the text feature units of different granularities (indicated by connecting arrows in Fig. 4). Fig. 5 is a schematic diagram of the association between multi-granularity features: for a concrete text sequence (a string composed of words), a coarse-grained text feature contains the fine-grained text features within it, and this relation is used to describe the association between features of different granularities.
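As a sketch of this inclusion relation, the following Python snippet (function names are illustrative, not from the patent) enumerates the fine-grained features structurally contained in one coarse-grained feature:

```python
def ngrams(tokens, k):
    # All contiguous k-grams of a token sequence.
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def contained_features(gram):
    # Every lower-granularity feature included in one coarse feature,
    # mirroring the arrows between layers in Fig. 4 / Fig. 5.
    return [sub for k in range(1, len(gram)) for sub in ngrams(list(gram), k)]

tokens = ["text", "information", "retrieval"]
coarse = ngrams(tokens, 3)[0]       # one trigram feature
fine = contained_features(coarse)   # its unigrams and bigrams
print(fine)
```

For a trigram, this yields its three unigrams and two bigrams, the five fine-grained features it subsumes.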
Step S2: multi-granularity integrated representation of the text.
The multi-granularity integrated representation is the core of the invention: by analyzing the constraint relations that exist locally between text features of different granularities, the integrated multi-granularity representation of the target text is generated. As shown in the processing flowchart of text representation with the multi-granularity text model in Fig. 3, the basic steps are:
Step S21: text preprocessing. The text is preprocessed, including stop-word removal, stemming and word segmentation, in the same way as step S11 in the model training process.
Step S22: calculate the representation of the target text under each granularity text model. The representation of the text at each granularity is computed according to each text model's own representation method.
Step S23: integrate the multi-granularity representation. The integrated multi-granularity representation of the text is computed by the multi-granularity feature integration method.
In essence, the integration of multi-granularity text features merges the feature spaces of different granularities in some way, so that the multi-granularity features jointly represent the text. Two methods can be adopted: an integration method based on global weights, and an integration method based on local text features.
The integration method based on global weights is relatively simple to implement: each granularity text model is given its own weight, and every feature unit inside a granularity model receives the same global weight. In representation, the text representations of the different granularities are weighted by these global weights and their union is taken, realizing the fusion of the multi-granularity feature space and producing the final representation.
The integration method based on local text features treats the target text as a text stream or word sequence. It calculates the weight of each granularity feature unit by analyzing the local association relations between feature units of different granularities, and integrates the representations of the different granularities on that basis. The association between feature units mainly exploits the structural inclusion relation between coarse-grained and fine-grained feature units, and portrays their correlation through the probabilistic relations between them. In general, a high feature probability indicates that the granularity feature is more reliable for representing the text and deserves a higher weight; a low probability indicates that the feature is less reliable, and the representation should depend more on lower-order features. Through the relative probabilities between feature units of different granularities, a different weight is calculated and assigned to each granularity feature unit at each local position of the text, and the features of the different granularities are integrated locally with these weights. In general this process should follow a coarse-granularity priority rule, tilting the weight calculation toward coarse granularity so as to obtain better semantic determinacy.
Global weights are used in the local-feature-based method as well, in the same way as in the global-weight method. Thus, in the multi-granularity integration process, by adjusting the global weights we can tilt the emphasis of the representation toward the coarse-grained models, obtaining a semantically more determinate result, or toward the fine-grained models, preserving their advantages in portraying subtle semantics and in robustness. In either method, if only one text model has a non-zero global weight, the multi-granularity representation degenerates into a single-granularity representation; the method of the invention thus in fact subsumes single-granularity representation.
Example:
This example takes text representation in a text classification application, implementing the multi-granularity representation of the text with the n-gram method, where different values of n correspond to different granularities. In the multi-granularity learning phase, the different n-gram models must be learned and then integrated into a multilayer model. The embodiment uses a three-layer n-gram model consisting of Unigram, Bigram and Trigram text models to describe the implementation; the text features used are the language units (grams) of each order. The details are as follows:
1. Learning of the multi-granularity text model
Unigram, Bigram and Trigram are representation models of granularities 1 to 3 respectively. Each is learned as an n-gram model, estimating the probability data of its own language units; the probabilities needed by an n-gram text model here are the occurrence probabilities of n-word strings.
1.1 Text preprocessing
Preprocessing must treat Chinese and English separately: Chinese usually requires word segmentation, while English requires stemming. Stop-word removal applies to both, rejecting semantically irrelevant words such as modal particles and function words. Since the embodiment uses n-gram models, the Chinese preprocessing may omit word segmentation.
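A minimal preprocessing sketch for this step might look as follows; the stop-word list and the naive suffix-stripping stemmer are illustrative placeholders, not the patent's actual resources:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and"}  # illustrative only

def simple_stem(word):
    # Naive suffix stripping; a real system would use a proper stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenize, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The models are trained on the corpus"))
```

The crude stemmer over-strips some words; it only stands in for the stemming and stop-word steps named in the text.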
1.2 Unigram text model learning
Let w_i denote a word occurring in the corpus, let the total word count be M, and let c(w_i) denote the occurrence frequency of word w_i in the corpus. The maximum-likelihood estimate of the distribution probability of w_i is:
P(w_i) = c(w_i) / M (1)
1.3 Bigram text model learning
For the Bigram text model, the occurrence frequency of the two-word string (w_i1 w_i2) in the corpus is denoted c(w_i1 w_i2), and its distribution probability P(w_i1 w_i2) is computed as:
P(w_i1 w_i2) = c(w_i1 w_i2) / Σ_j c(w_j1 w_j2) (2)
where the sum runs over all two-word strings in the corpus.
1.4 Trigram text model learning
For the Trigram text model, the occurrence frequency of the three-word string (w_i1 w_i2 w_i3) in the corpus is denoted c(w_i1 w_i2 w_i3), and its distribution probability P(w_i1 w_i2 w_i3) is computed as:
P(w_i1 w_i2 w_i3) = c(w_i1 w_i2 w_i3) / Σ_j c(w_j1 w_j2 w_j3) (3)
where the sum runs over all three-word strings in the corpus.
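The maximum-likelihood estimates of formulas (1) to (3) can be sketched in a few lines; the function below is a generic estimator for any order k, under the assumption that each order is normalized by its own total n-gram count:

```python
from collections import Counter

def train_ngram(corpus_tokens, k):
    # Count all k-grams, then divide each count by the total k-gram count,
    # giving the maximum-likelihood distribution of formulas (1)-(3).
    counts = Counter(tuple(corpus_tokens[i:i + k])
                     for i in range(len(corpus_tokens) - k + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

tokens = ["a", "b", "a", "b", "c"]
unigram = train_ngram(tokens, 1)   # granularity 1
bigram = train_ngram(tokens, 2)    # granularity 2
print(unigram[("a",)], bigram[("a", "b")])
```

Calling `train_ngram` with k = 1, 2, 3 yields the Unigram, Bigram and Trigram models of this embodiment.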
1.5 Feature selection and feature item weight calculation
Feature selection serves two purposes: it reduces the number of text feature items, improving efficiency, and it retains the feature items helpful for text classification while filtering out those that are unhelpful or even harmful. In text classification, excessively high-frequency or low-frequency items not only increase the computation but also harm classification performance. These feature items can be filtered by word frequency: for example, the 10 highest-frequency feature items learned from the training corpus and the items with word frequency below 3 are filtered out.
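A sketch of this word-frequency filter, with the cut-offs exposed as parameters (the values 10 and 3 from the text are example values, not fixed constants):

```python
from collections import Counter

def select_features(term_counts, top_n=10, min_freq=3):
    # Drop the top_n highest-frequency items and any item whose
    # frequency falls below min_freq, as described in the text.
    ranked = [t for t, _ in term_counts.most_common()]
    too_frequent = set(ranked[:top_n])
    return {t for t, c in term_counts.items()
            if c >= min_freq and t not in too_frequent}

counts = Counter({"the": 100, "of": 90, "model": 5, "rare": 1})
print(select_features(counts, top_n=2, min_freq=3))
```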
In the text vector generated by a text model, the vector components correspond to the weights of the feature items. The embodiment uses the TF-IDF method to compute the weight of each feature item in a document; under TF-IDF, the vector component is:
a_jd = f_jd · log(N / n_j) (4)
where f_jd is the term frequency of the j-th feature item in text d, N is the total number of texts in the training corpus, and n_j is the number of texts in which the j-th feature appears at least once.
The weight of the j-th feature item itself is its IDF value, written:
tw(j) = log(N / n_j) (5)
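Formulas (4) and (5) can be sketched as follows; `tfidf_vectors` is a hypothetical helper name, and the computation uses the unsmoothed log(N / n_j) form given in the text:

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    # a_jd = f_jd * log(N / n_j): term frequency times inverse document
    # frequency, per formula (4); tw(j) = log(N / n_j) is formula (5).
    N = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(N / df[t]) if df[t] else 0.0
                        for t in vocab])
    return vectors

docs = [["a", "b", "a"], ["b", "c"]]
vecs = tfidf_vectors(docs, ["a", "b", "c"])
print(vecs)
```

Note that a term appearing in every document gets weight 0 under this unsmoothed form, which is consistent with filtering out the highest-frequency items beforehand.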
2. Integration of the multi-granularity text representation
First, the text is preprocessed in the same way as in the learning phase of the multi-granularity model. Then the learned models of each granularity are used to represent the text, yielding its feature representation vectors. The feature space of the text model of granularity k is denoted Φ_k, with L_k the size of the space. For a text sequence S = (w_1, w_2, ..., w_T), the text vector at granularity k is denoted V_k, and the feature vector of granularity k is written:
V_k = {V_k1, V_k2, ..., V_kLk} (6)
where V_k1, ... denote the components of the text vector.
Integrating the multi-granularity representation is in fact integrating the text vectors of the different granularity features; the key is merging the multi-granularity feature spaces and integrating the feature representations. Two methods can mainly be adopted: the integration method based on global weights and the integration method based on local text features. An embodiment of each is given below.
1) Integration method based on global weights
Each granularity text model is given its own global weight r_k, and the feature spaces of the different granularities are merged into a new feature space; the process can be expressed by relational expression (7):
Φ = r_1·Φ_1 ∪ r_2·Φ_2 ∪ ... ∪ r_N·Φ_N (7)
For a concrete text representation, this is implemented as a weighted fusion of the text vectors of the different granularities, each granularity feature receiving the global weight r_k of its model; the multi-granularity representation of the text sequence can be expressed by relational expression (8):
V = (r_1·V_1, r_2·V_2, ..., r_N·V_N) (8)
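A minimal sketch of this global-weight integration, under the reading that each granularity's vector is scaled by its global weight r_k and the scaled vectors are concatenated; setting one weight to zero illustrates the degeneration toward a single-granularity representation:

```python
def integrate_global(vectors, weights):
    # Scale each granularity's vector by its global weight r_k and
    # concatenate them into the fused feature space of formulas (7)-(8).
    fused = []
    for v_k, r_k in zip(vectors, weights):
        fused.extend(r_k * x for x in v_k)
    return fused

v1, v2 = [1.0, 2.0], [0.5]            # toy vectors for two granularities
fused = integrate_global([v1, v2], [1.0, 0.5])
single = integrate_global([v1, v2], [1.0, 0.0])  # degenerates to granularity 1
print(fused, single)
```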
2) Integration method based on local text features
The integration method based on local text features is more complex, but it can obtain better results.
First, in the N-layer n-gram representation model of a text sequence S, the language units of each layer are expressed as:
L_k = {(w_{i-k+1} ... w_i) | i = 1, ..., T}, k = 1, ..., N (9)
and the distribution probability of language unit i of layer k is:
p_k(i) = P(w_{i-k+1} ... w_i) (10)
Weights are computed and assigned to the language-unit feature items of the n-gram model of each order according to the correlation of their distribution-probability parameters: the higher the probability of a granularity feature, the larger the weight it obtains. To bias toward coarse-granularity features, the weight of each language-unit feature in the highest-order granularity model is set to 1.0, while the weight of a language-unit feature in any other granularity model can be set to the residual obtained by subtracting the distribution probability of the containing higher-order granularity feature from its own distribution probability; through a recursive process, each granularity feature at each local position of the text is thus assigned a weight q_k(i). This process can be expressed in the following two steps:
First, the weights of all feature items in the highest-order granularity model are set to 1.0, as shown below:

q_k(i) = 1.0, k = N (11)
Second, for k < N the weights of the feature items in each layer's granularity model are obtained recursively with k descending; the process is expressed by relational expression (12):

q_k(i) = p_k(i) − p_(k+1)(i), k = N−1, …, 1 (12)
Let tn_k(i) denote the sequence number of the k-gram feature item of the k-th layer model corresponding to language unit i of the text sequence S, and let tw_k(j) denote the attribute weight of feature item j of the k-th layer model. The k-gram feature-item weight sequence of the text sequence can then be expressed in the following form:

S′ = (q_k(1)·tw_k(tn_k(1)), q_k(2)·tw_k(tn_k(2)), …, q_k(T)·tw_k(tn_k(T))) (13)
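The two-step weight assignment of formulas (11) and (12) can be sketched as follows. Since the source does not reproduce the body of relational expression (12), the residual rule q_k(i) = p_k(i) − p_(k+1)(i) used here is only one reading of the "residual value" described in the text, and all names are illustrative:

```python
# Recursive local-weight assignment: the highest-order layer N gets weight 1.0
# (formula (11)); each lower layer k gets the probability residual against the
# containing higher-order unit, computed with k descending (formula (12)).

def local_weights(p, N, T):
    """p[k][i]: distribution probability of the k-gram ending at position i."""
    q = {N: {i: 1.0 for i in range(T)}}
    for k in range(N - 1, 0, -1):          # recurse with k descending
        q[k] = {i: max(p[k][i] - p[k + 1].get(i, 0.0), 0.0) for i in range(T)}
    return q

# Two layers, three positions (made-up probabilities, exact binary fractions).
p = {1: {0: 0.5, 1: 0.25, 2: 0.25},
     2: {0: 0.125, 1: 0.25, 2: 0.0}}
q = local_weights(p, N=2, T=3)
print(q[2])   # {0: 1.0, 1: 1.0, 2: 1.0}
print(q[1])   # {0: 0.375, 1: 0.0, 2: 0.25}
```

Each layer's weighted feature sequence of formula (13) is then obtained by multiplying q_k(i) by the attribute weight tw_k of the corresponding feature item.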
On this basis the text vector corresponding to each text model is obtained, denoted V_k(S). Similarly to the integration method based on global weights, each granularity text model is given a different global weight r_k, and the text vector obtained by the multi-granularity representation of the text sequence can be expressed by relational expression (14):

V(S) = (r_1·V_1(S), r_2·V_2(S), …, r_N·V_N(S)) (14)
By assigning different global weights r_k, we can set the importance of each granularity text model within the multi-granularity text representation. By default, all weights can be set to 1.0. If a weight of 1 is given to only one layer and all other layers are given 0, the multi-layer n-gram model degenerates into an ordinary single n-gram model; in this way the multi-layer n-gram model and the ordinary n-gram model can in fact be unified. For example, when r_m = 1 and r_k = 0 (k ≠ m), the multi-layer n-gram model degenerates into an ordinary m-order n-gram model. This realizes a unified framework for multi-granularity text representation.
After the vectorized representation of a text has been obtained, it can be used in text-information-processing applications. In text classification, for example, the text vectors can be used to train various classifiers, which are then applied to classify target texts. Tests on the Reuters-21578 data set, which is widely used in text-classification research, show that with the same feature-extraction method and classifier, the method of the invention improves classification accuracy by 2 to 3 percentage points over single-granularity methods.
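As a self-contained illustration of the classification use case, the sketch below builds fused unigram + bigram count vectors for toy documents and trains a simple nearest-centroid classifier on them. The toy data, the count features, and the classifier choice are all assumptions for illustration; the patent does not prescribe a particular classifier or the Reuters-21578 experimental setup:

```python
# Toy pipeline: multi-granularity (unigram + bigram) count vectors, fused with
# global weights, feeding a nearest-centroid classifier.
from collections import Counter
import math

def ngrams(tokens, k):
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def multigran_vector(tokens, vocab_per_layer, global_weights):
    """Concatenate per-layer n-gram count vectors, scaling layer k by r_k."""
    vec = []
    for k, vocab in enumerate(vocab_per_layer, start=1):
        counts = Counter(ngrams(tokens, k))
        r_k = global_weights[k - 1]
        vec.extend(r_k * counts[g] for g in vocab)
    return vec

def nearest_centroid(train, labels, query):
    """Classify query by Euclidean distance to each class's mean vector."""
    centroids = {}
    for lab in set(labels):
        rows = [v for v, l in zip(train, labels) if l == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda lab: math.dist(centroids[lab], query))

docs = [["good", "movie"], ["great", "movie"], ["bad", "film"], ["awful", "film"]]
labels = ["pos", "pos", "neg", "neg"]
vocab1 = sorted({g for d in docs for g in ngrams(d, 1)})
vocab2 = sorted({g for d in docs for g in ngrams(d, 2)})
r = [1.0, 1.0]                              # equal global weights by default
X = [multigran_vector(d, [vocab1, vocab2], r) for d in docs]
query = multigran_vector(["bad", "film"], [vocab1, vocab2], r)
print(nearest_centroid(X, labels, query))   # → neg
```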
Claims (11)
1. A method for multi-granularity representation of text information, characterized in that it comprises the following steps:
Step S1: using multi-granularity text representation model learning, training to generate text models of multiple granularities;
Step S2: based on integration of the multi-granularity text feature models, forming an integrated multi-granularity representation of the text information.
2. The method for multi-granularity representation of text information according to claim 1, characterized in that: the multi-granularity text representation model is composed of a plurality of text models of different granularities, forming a model of multi-layer construction, with the feature units of the granularity text models of the respective layers interrelated; this model is used for the multi-granularity representation of text information.
3. The method for multi-granularity representation of text information according to claim 1, characterized in that the learning process of the multi-granularity text representation model comprises the following steps:
Step S11: learning the text models of different granularities separately;
Step S12: analyzing the relations between the feature units of the different-granularity text models, and combining the different-granularity text models to constitute a text representation model of multi-layer construction.
4. The method for multi-granularity representation of text information according to claim 1, characterized in that: in said integration of multi-granularity text features, the integration based on global weights assigns each text model of a different granularity its own weight, and their text feature spaces are fused by weighting to obtain a new text feature space, whereby the multi-granularity text features are merged in the new feature space.
5. The method for multi-granularity representation of text information according to claim 1, characterized in that: in said integration of multi-granularity text features, the integration based on local text features comprises the steps of:
Step S21: obtaining the text feature representations of a plurality of granularity text models;
Step S22: calculating the weights of the different-granularity features at each local position of the text;
Step S23: integrating the multi-granularity features to represent the text.
6. The method for multi-granularity representation of text information according to claim 5, characterized in that said integrating the multi-granularity features to represent the text utilizes the inclusion relations between the different-granularity text features at local positions of the text, analyzes and calculates the weights of each granularity text feature at each local position according to the probability-distribution parameters of the feature units of the different granularities, and on this basis integrates the multi-granularity features of the text into a representation.
7. The method for multi-granularity representation of text information according to claim 4 or 5, characterized in that: the integration of the multi-granularity text features adopts either the integration based on global weights or the integration based on local text features.
8. The method for multi-granularity representation of text information according to claim 1, characterized in that: the text features used are provided by a plurality of text models of different granularities, and the learning of the text models is realized with the support of a corpus; the text representation models of different granularities may be text representations of the same type, or text representations of different types.
9. The method for multi-granularity representation of text information according to claim 1 or 5, characterized in that: if the integration of the multi-granularity text features in step S2 uses the integration based on local text features, then the learning process of each granularity text model in step S1 needs to learn the probability-distribution parameters of the feature units of the different granularities.
10. The method for multi-granularity representation of text information according to claim 4, characterized in that: the integration of the multi-granularity text features based on global weights consists in adjusting, in the text representation, the global weight given to each text model of a different granularity; if only one text model has a non-zero global weight, the multi-granularity representation of the text degenerates into a single-granularity representation.
11. The method for multi-granularity representation of text information according to claim 5, characterized in that: the integration of the multi-granularity text features based on local text features consists in adjusting, in the text representation, the global weight given to each text model of a different granularity; if only one text model has a non-zero global weight, the multi-granularity representation of the text degenerates into a single-granularity representation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2007101210789A CN101377769A (en) | 2007-08-29 | 2007-08-29 | Method for representing multiple graininess of text message |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101377769A true CN101377769A (en) | 2009-03-04 |
Family
ID=40421317
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2007101210789A Pending CN101377769A (en) | 2007-08-29 | 2007-08-29 | Method for representing multiple graininess of text message |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101377769A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103518187B (en) * | 2011-03-10 | 2015-07-01 | 特克斯特怀茨有限责任公司 | Method and system for information modeling and applications thereof |
CN103518187A (en) * | 2011-03-10 | 2014-01-15 | 特克斯特怀茨有限责任公司 | Method and system for information modeling and applications thereof |
CN104408153A (en) * | 2014-12-03 | 2015-03-11 | 中国科学院自动化研究所 | Short text hash learning method based on multi-granularity topic models |
CN104408153B (en) * | 2014-12-03 | 2018-07-31 | 中国科学院自动化研究所 | A kind of short text Hash learning method based on more granularity topic models |
CN104462408A (en) * | 2014-12-12 | 2015-03-25 | 浙江大学 | Topic modeling based multi-granularity sentiment analysis method |
CN104462408B (en) * | 2014-12-12 | 2017-09-01 | 浙江大学 | A kind of many granularity sentiment analysis methods modeled based on theme |
CN107169086B (en) * | 2017-05-12 | 2020-10-27 | 北京化工大学 | Text classification method |
CN107169086A (en) * | 2017-05-12 | 2017-09-15 | 北京化工大学 | A kind of file classification method |
CN107797985B (en) * | 2017-09-27 | 2022-02-25 | 百度在线网络技术(北京)有限公司 | Method and device for establishing synonymous identification model and identifying synonymous text |
CN107797985A (en) * | 2017-09-27 | 2018-03-13 | 百度在线网络技术(北京)有限公司 | Establish synonymous discriminating model and differentiate the method, apparatus of synonymous text |
WO2019080863A1 (en) * | 2017-10-26 | 2019-05-02 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN108763510A (en) * | 2018-05-30 | 2018-11-06 | 北京五八信息技术有限公司 | Intension recognizing method, device, equipment and storage medium |
CN110276640A (en) * | 2019-06-10 | 2019-09-24 | 北京云莱坞文化传媒有限公司 | More granularities of copyright are split and its method for digging of commercial value |
CN111046179A (en) * | 2019-12-03 | 2020-04-21 | 哈尔滨工程大学 | Text classification method for open network question in specific field |
CN111046179B (en) * | 2019-12-03 | 2022-07-15 | 哈尔滨工程大学 | Text classification method for open network question in specific field |
CN112163404A (en) * | 2020-08-25 | 2021-01-01 | 北京邮电大学 | Text generation method and device, electronic equipment and storage medium |
US11373041B2 (en) | 2020-09-18 | 2022-06-28 | International Business Machines Corporation | Text classification using models with complementary granularity and accuracy |
CN113011176A (en) * | 2021-03-10 | 2021-06-22 | 云从科技集团股份有限公司 | Language model training and language reasoning method, device and computer storage medium thereof |
CN114254158A (en) * | 2022-02-25 | 2022-03-29 | 北京百度网讯科技有限公司 | Video generation method and device, and neural network training method and device |
CN114254158B (en) * | 2022-02-25 | 2022-06-10 | 北京百度网讯科技有限公司 | Video generation method and device, and neural network training method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101377769A (en) | Method for representing multiple graininess of text message | |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer | |
CN105824922B (en) | A kind of sensibility classification method merging further feature and shallow-layer feature | |
Melo et al. | Automated geocoding of textual documents: A survey of current approaches | |
CN102567464B (en) | Based on the knowledge resource method for organizing of expansion thematic map | |
CN105955981B (en) | A kind of personalized traveling bag recommended method based on demand classification and subject analysis | |
US8583646B2 (en) | Information searching apparatus, information searching method, and computer product | |
CN107609052A (en) | A kind of generation method and device of the domain knowledge collection of illustrative plates based on semantic triangle | |
CN109241256B (en) | Dialogue processing method and device, computer equipment and readable storage medium | |
CA3007723A1 (en) | Systems and/or methods for automatically classifying and enriching data records imported from big data and/or other sources to help ensure data integrity and consistency | |
CN108090800A (en) | A kind of game item method for pushing and device based on player's consumption potentiality | |
CN104391942A (en) | Short text characteristic expanding method based on semantic atlas | |
WO2022156328A1 (en) | Restful-type web service clustering method fusing service cooperation relationships | |
CN108509982A (en) | A method of the uneven medical data of two classification of processing | |
CN106844632A (en) | Based on the product review sensibility classification method and device that improve SVMs | |
CN107704500B (en) | News classification method based on semantic analysis and multiple cosine theorem | |
CN108710663A (en) | A kind of data matching method and system based on ontology model | |
CN109871443A (en) | A kind of short text classification method and device based on book keeping operation scene | |
CN105279264A (en) | Semantic relevancy calculation method of document | |
CN111782797A (en) | Automatic matching method for scientific and technological project review experts and storage medium | |
CN104967558A (en) | Method and device for detecting junk mail | |
CN108090178A (en) | A kind of text data analysis method, device, server and storage medium | |
CN106708926A (en) | Realization method for analysis model supporting massive long text data classification | |
CN103514151A (en) | Dependency grammar analysis method and device and auxiliary classifier training method | |
CN116501875B (en) | Document processing method and system based on natural language and knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20090304 |