CN102867048B - Document storing method based on semantic compression - Google Patents

Document storing method based on semantic compression

Info

Publication number
CN102867048B
CN102867048B (application CN201210329421.XA)
Authority
CN
China
Prior art keywords
document
theme
word
matrix
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210329421.XA
Other languages
Chinese (zh)
Other versions
CN102867048A (en
Inventor
曾嘉
曹小琴
严建峰
刘晓升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201210329421.XA
Publication of CN102867048A
Application granted
Publication of CN102867048B
Expired - Fee Related
Anticipated expiration

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document storing method based on semantic compression. The method comprises the following steps: constructing a document information matrix, a matrix of the distribution of themes over the word list, and a matrix of the distribution of documents over themes; initializing the matrices; reassigning a theme to each word of each document; sorting all documents and all themes in descending order by their degree of change under semantic compression; and updating the theme distribution of the words in the first N documents of the descending-sorted document set, where the updated themes are the first M themes of the descending-sorted theme set, N = td × DD, M = tk × TT, DD is the total number of documents, TT is the total number of themes, and td and tk are preset values between 0.01 and 0.5. The descending sort and the update are repeated until the iteration termination condition is met, and the distribution of themes over the word list and the distribution of documents over themes are output. The method reduces storage space and increases retrieval speed, and it guarantees the accuracy of semantic compression while increasing its speed.

Description

Document storing method based on semantic compression
Technical field
The present invention relates to a computer document storage method, and in particular to a document storing method based on semantic compression.
Background technology
With the development of computer technology and the widespread use of network technology, the number of documents processed by computers has increased sharply. The storage, retrieval and transmission of information place ever higher demands on the capacity of storage devices, the memory of processing systems and the bandwidth of transmission networks. Effectively compressing the stored files therefore plays an important role in reducing storage requirements and speeding up data retrieval.
In the prior art, research on compressed document storage is essentially limited to lossless compression techniques: a computer document is compressed in binary form to reduce the redundant space it occupies, thereby reducing its storage requirement. This approach can reduce document storage space, but because the entire document information must be retained, the storage space is difficult to reduce further.
Semantic compression of a document describes the document by its themes: the themes reflect the document and determine its classification. The theme information is far smaller than the document information, yet it can adequately represent it. A document can therefore be characterized with a small amount of theme information, which realizes semantic compression of the document. Semantic compression is lossy, and its compression ratio is related to the number of themes chosen: the larger the number of themes, the higher the compression accuracy but the lower the compression ratio (the compression ratio is defined as the storage space before compression divided by the storage space after compression); conversely, the smaller the number of themes, the lower the accuracy but the higher the compression ratio. In practice, the number of themes is set according to actual requirements.
The principle of semantic compression of documents is as follows: a document is represented as a set of words, a theme is represented as a distribution over words, and a theme is characterized by the words that occur most frequently in it. Based on the relation between documents and words and between themes and words, a document is turned into a multinomial distribution over several themes; that is, a few themes constitute, and describe, a document.
The document information is represented by a W × D matrix, where W is the word list of the corpus and D is the set of documents. The W × D matrix records the number of times each word in the word list occurs in each document, as shown in Table 1. For example, if the element at (W0, D0) is 3, the word W0 (index 0 in the word list) occurs 3 times in document D0.
Table 1 W × D matrix
By iterative processing of the document information, the distribution phi of themes over the word list and the distribution theta of documents over themes are obtained, as shown in Tables 2 and 3. When the number of documents is large, e.g. tens of millions of documents, the W × D matrix is extremely large; decomposing it into the phi matrix and the theta matrix compresses the semantic information, makes storage convenient, and facilitates subsequent operations such as document analysis, data mining and information retrieval.
Table 2 Distribution matrix phi of themes over the word list
Table 3 Distribution matrix theta of documents over themes
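As an illustration (this sketch is not part of the original patent text), the decomposition can be written in C roughly as follows; the array names, the flat row-major layout and the normalization of phi and theta into probability distributions are assumptions made only for this example.

#include <stddef.h>

/* Approximate the count of word w in document d from the factorized
 * representation: count(w, d) ~= N_d * sum_t p(w | t) * p(t | d),
 * where phi holds p(w | t) (W rows x T columns, row-major) and
 * theta holds p(t | d) (D rows x T columns, row-major). */
double approx_count(const double *phi, const double *theta,
                    size_t T, size_t w, size_t d, double n_d)
{
    double p = 0.0;
    for (size_t t = 0; t < T; ++t)
        p += phi[w * T + t] * theta[d * T + t];
    return n_d * p;   /* scale the word probability by the document length */
}

Storing phi (W × T values) and theta (D × T values) instead of the full W × D count matrix gives a compression ratio on the order of (W × D) / (T × (W + D)), which is large when the number of themes T is much smaller than both W and D.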
At present, two semantic compression methods are commonly used: Gibbs sampling (GS) and variational Bayes (VB).
The Gibbs sampling method scans the word tokens of every document. A word token is one occurrence of a word index in a document; for example, if the word "family" occurs 10 times in a document, there are 10 "family" tokens. For every token scanned, Gibbs sampling infers the semantic distribution of the token over the themes and then randomly samples one theme for the token from this distribution. If a document contains many repeated tokens, e.g. "family" repeated 1000 times, the scanning time of Gibbs sampling grows greatly. At the same time, sampling a single theme from the token's theme distribution discards part of the information in the distribution, so the accuracy of the semantic compression is not high. By scanning the whole document collection many times (usually more than 500), Gibbs sampling infers the theme distribution parameters of every document and the distribution parameters of every theme over the word list, thereby achieving semantic compression. Take 10000 documents, each containing 100 word tokens, as an example, and suppose that scanning one token and compressing its semantics takes 0.00001 seconds. With J = 10 themes, the Gibbs method theoretically needs 0.00001 s × 10 themes × 100 tokens × 10000 documents × 500 iterations = 50000 seconds to complete the semantic compression.
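For comparison, the following is a minimal, self-contained C sketch of the per-token step of standard collapsed Gibbs sampling described above (not taken from the patent; the count arrays nwt/ndt/ntot, the hyperparameters alpha and beta, and the use of rand() are illustrative assumptions).

#include <stdlib.h>

/* Resample the theme of a single word token, as collapsed Gibbs sampling
 * does for every occurrence of every word; nwt[w*J+j], ndt[d*J+j] and
 * ntot[j] are the usual word-theme, document-theme and theme-total counts. */
int gibbs_resample_token(int w, int d, int old_j, int J, int W,
                         int *nwt, int *ndt, int *ntot,
                         double alpha, double beta, double *p /* length J */)
{
    /* remove the token's current assignment */
    nwt[w * J + old_j]--;  ndt[d * J + old_j]--;  ntot[old_j]--;

    /* build the (unnormalized) conditional distribution over themes */
    double sum = 0.0;
    for (int j = 0; j < J; ++j) {
        p[j] = (nwt[w * J + j] + beta) / (ntot[j] + W * beta)
             * (ndt[d * J + j] + alpha);
        sum += p[j];
    }

    /* draw one theme at random from that distribution (hard assignment) */
    double u = sum * ((double)rand() / ((double)RAND_MAX + 1.0));
    int new_j = 0;
    double acc = p[0];
    while (new_j < J - 1 && u >= acc) {
        ++new_j;
        acc += p[new_j];
    }

    /* add the token back under its new theme */
    nwt[w * J + new_j]++;  ndt[d * J + new_j]++;  ntot[new_j]++;
    return new_j;
}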
The variational Bayes method scans only the word indices of every document; for example, if "family" is repeated 1000 times in a document, variational Bayes only needs to scan the index of "family" in the word list once. Variational Bayes is therefore more efficient than Gibbs sampling at scanning the whole text collection. However, variational Bayes introduces the complicated digamma operation when inferring the semantic information of each word index, and in practice this operation costs 4-6 times as much as a normal operation. The digamma operation also introduces error into the semantic compression. For 10000 documents, each containing 100 word tokens but only 50 word indices, variational Bayes theoretically needs 0.00001 s × 10 themes × 50 word indices × 10000 documents × 500 iterations × 5 (for the digamma operations) = 125000 seconds.
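To illustrate the cost attributed to the digamma operation, the following sketch (again not from the patent) shows a standard digamma approximation and a schematic VB-style responsibility that calls it three times per word index and theme; the exact variational update equations are not reproduced here, so treat the second function as an assumption-laden illustration only.

#include <math.h>

/* Standard asymptotic approximation of the digamma function psi(x),
 * shown only to illustrate that each call costs several divisions,
 * multiplications and a log, i.e. far more than a simple count update. */
static double digamma_approx(double x)
{
    double r = 0.0;
    while (x < 6.0) { r -= 1.0 / x; x += 1.0; }   /* psi(x) = psi(x+1) - 1/x */
    double f = 1.0 / (x * x);
    return r + log(x) - 0.5 / x
             - f * (1.0 / 12.0 - f * (1.0 / 120.0 - f / 252.0));
}

/* Schematic VB-style responsibility of word index wi in document di for
 * theme j; the names mirror the patent's phi/theta/phitot arrays, but the
 * exact variational equations differ in detail, so this is only a sketch. */
static double vb_responsibility(const double *phi, const double *phitot,
                                const double *theta, int wi, int di, int j,
                                int J, double ALPHA, double BETA, double WBETA)
{
    return exp(digamma_approx(phi[wi * J + j] + BETA)
             - digamma_approx(phitot[j] + WBETA)
             + digamma_approx(theta[di * J + j] + ALPHA));
}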
It can be seen that both Gibbs sampling and variational Bayes suffer from slow speed and limited accuracy in semantic compression.
Summary of the invention
The object of the present invention is to provide a document storing method based on semantic compression that improves the speed and accuracy of semantic compression, thereby effectively improving the efficiency of semantic compression when massive numbers of documents are stored.
To achieve the above object, the technical solution adopted by the present invention is a document storing method based on semantic compression, for storing a document collection D, comprising the following steps:
(1) Read the document collection D into a computer and build from it the W × D matrix representing the document information, where W is the set of words occurring in the documents and each matrix element is the number of times a word occurs in a document;
(2) Build the phi matrix of the distribution of themes over the word list and the theta matrix of the distribution of documents over themes, where the phi matrix is the two-dimensional matrix formed by the word set W and the theme set T, with each element being the weight of a word on a theme; the theta matrix is the two-dimensional matrix formed by the document collection D and the theme set T, with each element being the weight of a document on a theme; the initial value of every element of the phi and theta matrices is 0;
(3) Initialize the phi and theta matrices: assign a random theme Ti to each word of each document in turn, increase the weight of the document on theme Ti by Cj, and increase the weight of the word on theme Ti in the word list by Cj, where i is the index of the randomly assigned theme and Cj is the number of times the word occurs in the document;
(4) Initialize the mu matrix of the distribution of document words over themes from the matrices obtained in step (3); the mu matrix is the two-dimensional matrix formed by the theme set T and the words Wdi, where Wdi denotes the i-th word of document d;
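Steps (1) to (4) can be pictured with the following minimal C sketch (not the patent's verbatim code): phi, theta and mu are stored as flat row-major arrays, and the random initialization of step (3) adds the word's count to the chosen theme's document weight and word-list weight, exactly as in the embodiment's pseudocode later in this description; the sparse per-document representation and all names are illustrative assumptions.

#include <stdlib.h>

/* One entry of the sparse W x D matrix for a single document:
 * word index wi and its count xi in that document. */
typedef struct { int wi; double xi; } DocEntry;

/* Step (3): random initialization of phi (W x J), phitot (J) and
 * theta (D x J), plus step (4): initialization of mu for this document.
 * All arrays are flat, row-major, and pre-allocated to zero. */
void init_document(const DocEntry *doc, int n_entries, int di, int J,
                   double *phi, double *phitot, double *theta, double *mu)
{
    for (int i = 0; i < n_entries; ++i) {
        int wi = doc[i].wi;
        double xi = doc[i].xi;           /* Cj: count of the word in the doc */
        int topic = rand() % J;          /* random theme for this word */

        phi[wi * J + topic]   += xi;     /* word weight on the theme */
        phitot[topic]         += xi;     /* theme total over the word list */
        theta[di * J + topic] += xi;     /* document weight on the theme */
        mu[i * J + topic]      = 1.0;    /* step (4): all mass on that theme */
    }
}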
(5) First iteration: reassign a theme to each word of each document as follows.
A. Cancel the current theme assignment of the word being processed and remove its effect on the word list and on the document it belongs to by modifying the phi and theta matrices.
Let wi be the index of the word in the word list; xi the number of times the word occurs in the document; J the total number of themes; j the theme currently being processed, taking values 0 to J-1; i the index of the word within the document; and di the document index. The phi and theta matrices are then modified as follows:
Subtract this word's contribution from the distribution of each theme in the phi matrix:
phi[wi×J+j]= phi[wi×J+j]-xi×mu[i×J+j]
Subtract this document's contribution from the distribution of each theme in the theta matrix:
theta[di×J+j]= theta[di×J+j]-xi×mu[i×J+j]
Subtract the contribution from the accumulated distribution of each theme over the word list:
phitot[j]= phitot[j]-xi×mu[i×J+j]
B. Update the theme distribution information of this word in the document from the theme information of the current document and the theme information of this word over the word list. The update formula is as follows (the formula itself is not reproduced in this text; an assumed form is sketched after sub-step C below):
In the formula, munew is the updated mu value; BETA, WBETA and ALPHA are preset constants, where BETA and ALPHA take values between 0 and 0.5 and WBETA is the number of words in the word list multiplied by BETA;
C. Update the phi and theta matrices with the new mu value by accumulating the distribution of each theme:
phi[wi×J+j]= phi[wi×J+j]+xi×mu[i×J+j]
phitot[j]= phitot[j]+xi×mu[i×J+j]
theta[di×J+j]= theta[di×J+j]+xi×mu[i×J+j]
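The whole of step (5) for a single word can be sketched in C as follows (not the patent's verbatim code). The update formula of sub-step B is not reproduced in this text, so the sketch assumes the natural form suggested by the parameters BETA, WBETA and ALPHA, namely munew[j] proportional to (phi[wi×J+j] + BETA) / (phitot[j] + WBETA) × (theta[di×J+j] + ALPHA), normalized over j; that line should therefore be read as an assumption rather than as the granted formula.

/* Step (5) for one word: A) remove its current soft assignment,
 * B) recompute munew[j] (the exact formula is assumed, see above),
 * C) add the new soft assignment back into phi, phitot and theta.
 * munew must have room for J values. */
void update_word(int wi, int di, int i, double xi, int J,
                 double *phi, double *phitot, double *theta, double *mu,
                 double ALPHA, double BETA, double WBETA, double *munew)
{
    /* A. cancel the existing distribution of this word */
    for (int j = 0; j < J; ++j) {
        phi[wi * J + j]   -= xi * mu[i * J + j];
        phitot[j]         -= xi * mu[i * J + j];
        theta[di * J + j] -= xi * mu[i * J + j];
    }

    /* B. recompute the theme distribution of this word (assumed form) */
    double mutot = 0.0;
    for (int j = 0; j < J; ++j) {
        munew[j] = (phi[wi * J + j] + BETA) / (phitot[j] + WBETA)
                 * (theta[di * J + j] + ALPHA);
        mutot += munew[j];
    }
    for (int j = 0; j < J; ++j)
        mu[i * J + j] = munew[j] / mutot;    /* normalized soft assignment */

    /* C. accumulate the new distribution back into phi, phitot and theta */
    for (int j = 0; j < J; ++j) {
        phi[wi * J + j]   += xi * mu[i * J + j];
        phitot[j]         += xi * mu[i * J + j];
        theta[di * J + j] += xi * mu[i * J + j];
    }
}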
(6) Sort all documents and themes in descending order by their degree of change under semantic compression. The sorting method is:
Scan the degree of change of the weight of each word on each theme:
change value = |theme weight before update − theme weight after update| × number of times the word occurs under this theme
Accumulate the change values of all words involved in a theme to obtain the change value of the theme;
Accumulate the change values of all words of a document to obtain the change value of the document;
Obtain the document list and the theme list sorted in descending order of change value;
For example, suppose document D0 has 4 words, W0 to W3, occurring 5, 6, 2 and 1 times respectively. The distribution of the words over the themes after initialization is shown in Table 5; after the themes are reassigned, it is shown in Table 6.
Table 5 Distribution of words over themes after initialization
Table 6 Distribution of words over themes after theme reassignment
The change values of the words on the three themes are then as shown in Table 7.
Table 7 Change values of words on themes
Thus, after the first iteration update of document D0, the change value on theme T0 is 2.8, on theme T1 it is 3.5 and on theme T2 it is 3.3, so the change value of the document is 2.8 + 3.5 + 3.3 = 9.6.
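A minimal C sketch of the bookkeeping of step (6) follows (illustrative, not the patent's verbatim code): while a word is updated, its weighted change is accumulated into the per-document-per-theme residual rk and the per-document residual rd, and the resulting values are then index-sorted in descending order; a simple insertion sort stands in for the dsort/insertionsort routines referenced in the embodiment.

#include <math.h>

/* Step (6) bookkeeping while a word is updated: accumulate the absolute
 * change of its soft assignment, weighted by its count xi, into the
 * per-document-per-theme residual rk and the per-document residual rd. */
void accumulate_change(int di, int i, double xi, int J,
                       const double *munew, const double *mu,
                       double *rk, double *rd)
{
    for (int j = 0; j < J; ++j) {
        double change = xi * fabs(munew[j] - mu[i * J + j]);
        rk[di * J + j] += change;    /* change of this document on theme j */
        rd[di]         += change;    /* total change of this document */
    }
}

/* Produce ind[0..n-1], the indices of val sorted by descending value. */
void argsort_desc(const double *val, int n, int *ind)
{
    for (int k = 0; k < n; ++k) ind[k] = k;
    for (int k = 1; k < n; ++k) {
        int cur = ind[k], m = k;
        while (m > 0 && val[ind[m - 1]] < val[cur]) { ind[m] = ind[m - 1]; --m; }
        ind[m] = cur;
    }
}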
(7) Carry out subsequent iterations:
In the descending-sorted document collection, update the theme distribution of the words in the first N documents; the themes updated are the first M themes in the descending-sorted theme set; the update is performed by the method of step (5), where N = td × DD, M = tk × TT, DD is the total number of documents, TT is the total number of themes, and td and tk are preset values between 0.01 and 0.5;
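The scheduling of step (7) can be sketched as follows (illustrative only; the callback name and signature are assumptions): only the first N = td × DD documents of the document order, and within each of them only its first M = tk × TT themes, are revisited, which is what bounds the cost of every subsequent iteration.

/* Step (7): one subsequent iteration restricted to the top documents and
 * top themes according to the descending change-value orders ind_rd and
 * ind_rk produced in step (6). update_word_on_themes is assumed to apply
 * the step (5) update to the words of one document, limited to the listed themes. */
void partial_iteration(int DD, int TT, double td, double tk,
                       const int *ind_rd, const int *ind_rk,
                       void (*update_word_on_themes)(int di, const int *themes, int M))
{
    int N = (int)(td * DD);          /* number of documents to revisit */
    int M = (int)(tk * TT);          /* number of themes to revisit per document */

    for (int ii = 0; ii < N; ++ii) {
        int di = ind_rd[ii];                       /* ii-th most-changed document */
        const int *themes = &ind_rk[di * TT];      /* its M most-changed themes */
        update_word_on_themes(di, themes, M);      /* step (5) update, restricted */
    }
}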
(8) Repeat steps (6) and (7) until the iteration termination condition is met;
(9) Output the distribution of themes over the word list and the distribution of documents over themes as the stored data of the document collection D, thereby realizing document storage based on semantic compression.
In the above technical scheme, the iteration termination condition may be a preset number of iterations: when the iterative operation reaches the preset number of iterations, the termination condition is met; the preset number of iterations is an integer between 100 and 1000.
Alternatively, in the above technical scheme, the iteration termination condition is a perplexity threshold. The perplexity is exp(−∑_d log P(W_d) / ∑_d N_d), where P(W_d) is the probability of all the words of document d, computed by taking the dot product of the relevant rows and columns of the theta and phi matrices and multiplying by the word counts before accumulating; ∑_d log P(W_d) is the accumulation of log P(W_d) over all documents; and ∑_d N_d is the total number of words in the document collection. The current perplexity is computed after each iteration; if the absolute value of the difference between the current perplexity and the perplexity of the previous cycle is smaller than the set threshold, the termination condition is met; the perplexity threshold is 1.
The calculation of the perplexity can be expressed as:
perplexity = exp(sum(-log(sum(theta(:,di).*phi(:,wi),1)).*ci')/sum(ci));
where ci is the vector of word counts.
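An equivalent C sketch of the perplexity computation follows (the MATLAB one-liner above is the document's own version; the parallel-array document representation and the assumption that phi and theta have been normalized into probability distributions are made only for this illustration).

#include <math.h>

/* Perplexity over the collection, assuming phi[wi*J+j] holds p(word wi | theme j)
 * and theta[di*J+j] holds p(theme j | document di); doc_wi[di][k] is the k-th
 * word index of document di and doc_ci[di][k] its count. */
double perplexity(const int *const *doc_wi, const double *const *doc_ci,
                  const int *doc_len, int D, int J,
                  const double *phi, const double *theta)
{
    double log_sum = 0.0, count_sum = 0.0;
    for (int di = 0; di < D; ++di) {
        for (int k = 0; k < doc_len[di]; ++k) {
            int wi = doc_wi[di][k];
            double ci = doc_ci[di][k];
            double p = 0.0;
            for (int j = 0; j < J; ++j)           /* dot product over the themes */
                p += theta[di * J + j] * phi[wi * J + j];
            log_sum   += -log(p) * ci;            /* count-weighted negative log-likelihood */
            count_sum += ci;
        }
    }
    return exp(log_sum / count_sum);
}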
In the above technical scheme, the preferred value of BETA and ALPHA is 0.01.
The above technical scheme is a storage method for a new document collection. Similarly, when a compressed storage file of an existing document collection is already available, a new document can be stored by using the distribution of themes over the word list and the distribution of documents over themes obtained by the above scheme as the initial values of the corresponding matrices, and then processing the pending document by the same method to obtain and store its compressed storage file. Because the initial values already exist, compression of the new document can be completed with fewer iterations, which further speeds up processing while guaranteeing accuracy.
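For the fold-in of a single new document described above, a minimal sketch follows (not from the patent; it assumes, as the simplest variant, that phi and phitot from the stored compressed file are held fixed and only the new document's theta row and mu values are iterated with the step (5)-style update, whose exact formula is again an assumption rather than a quotation). Because the theme-word statistics already exist, a handful of iterations is typically enough, which is the point made in the paragraph above.

/* Fold a single new document into an existing compressed store.
 * phi/phitot come from the stored model and are treated as fixed here
 * (an illustrative simplification); theta_d (length J) and mu (n x J)
 * belong to the new document and start at zero. */
void fold_in_document(const int *wi, const double *xi, int n, int J,
                      const double *phi, const double *phitot,
                      double *theta_d, double *mu,
                      double ALPHA, double BETA, double WBETA, int iters)
{
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i) {
            double mutot = 0.0;
            for (int j = 0; j < J; ++j) {
                /* remove this word's previous contribution to the document */
                theta_d[j] -= xi[i] * mu[i * J + j];
                /* assumed update form, as in the step (5) sketch */
                double munew = (phi[wi[i] * J + j] + BETA) / (phitot[j] + WBETA)
                             * (theta_d[j] + ALPHA);
                mu[i * J + j] = munew;
                mutot += munew;
            }
            for (int j = 0; j < J; ++j) {
                mu[i * J + j] /= mutot;                  /* normalize */
                theta_d[j] += xi[i] * mu[i * J + j];     /* add back */
            }
        }
    }
}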
Owing to the above technical scheme, the present invention has the following advantages over the prior art:
1. The present invention applies semantic compression to document storage, which greatly reduces the required storage space; for Internet data retrieval, the compressed documents can be stored and retrieved locally, which greatly increases retrieval speed.
2. By improving the way the document word indices are scanned and by avoiding the complicated digamma operation, the present invention greatly increases the speed of semantic compression. With the prior art, on a 10M document set with 100 themes and 500 cycles, the GS semantic compression time is 1580 seconds and VB needs 10670 seconds, whereas the present invention needs only 175 seconds.
3. While greatly increasing the semantic compression speed, the present invention maintains the accuracy of semantic compression. For example, on the 10M document set with 100 themes, the accuracy of the present invention is 8.9% higher than that of GS and 22.1% higher than that of VB.
Brief description of the drawings
Fig. 1 is a comparison of compression times on the NIPS data set in the embodiment;
Fig. 2 is a comparison of compression accuracies on the NIPS data set in the embodiment;
Fig. 3 is a comparison of compression times on the NYTIMES data set in the embodiment;
Fig. 4 is a comparison of compression accuracies on the NYTIMES data set in the embodiment;
Fig. 5 shows the change in relative perplexity as td is varied in the embodiment;
Fig. 6 shows the change in relative perplexity as tk is varied in the embodiment.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Embodiment one: a document storing method based on semantic compression. First obtain the document collection D and build the W × D matrix; the concrete storage method comprises:
(1) Read the W × D matrix information; set the number of iterations iter = 100 and the scale factors td = 0.2, tk = 0.2.
(2) initialization
Generate a random theme topic for each word:
random theme → topic
phi[wi*J+topic] = phi[wi*J+topic] + xi // xi is the number of times the word occurs in this document
phitot[topic]= phitot[topic]+xi
theta[di*J+topic]= theta[di*J+topic]+xi
1.0→mu[i*J + topic]
Here, the mu matrix is the distribution of document words over themes; its form is shown in Table 4, where Wdi denotes the i-th word of document d.
Table 4 Distribution matrix mu of document words over themes
(3) First iteration
1. Process each document
Reassign a theme to each word.
Cancel the word's distribution over all themes, updating phitot, phi and theta, and produce munew[j] for all themes.
phi[wi*J+j]= phi[wi*J+j]-xi*mu[i*J+j]
phitot[j]= phitot[j]-xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]-xi*mu[i*J+j]
mutot=mutot+munew[j]
Produce the new mu: munew[j] / mutot → mu[i*J+j]
Update phitot, phi and theta with the new mu.
phi[wi*J+j]= phi[wi*J+j]+xi*mu[i*J+j]
phitot[j]= phitot[j]+xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]+xi*mu[i*J+j]
Produce the data for sorting.
rk[di*J+j]+xi*fabs(munew[j] - mu[i*J + j])→ rk[di*J+j]
rd[di] +xi*fabs(munew[j] - mu[i*J + j])→ rd[di]
Sort according to rk.
dsort(J, rk + di*J, -1, ind_rk + di*J)
2. sort according to rd
dsort(D, rd, -1, ind_rd);
(4) Subsequent iterations
1. The scope of the subsequent iterations is td*D documents and tk*J themes per document.
Process td*D documents; the number of cycles is td*D, where td is a preset parameter.
The document index is stored in ind_rd[ii], where ii is the cycle index.
From the preparatory sorting work, ind_rk[jj] stores the themes that need to be processed first (the themes after sorting).
ind_rk[di*J+j] → k // for the given document, j loops over its tk*J themes
rd[di] - rk[di*J+k] → rd[di] // di is the document index
0→rk[di*J + k]
Reassign a theme to each word.
Cancel the word's distribution over the J*tk themes, updating phitot, phi and theta, and produce munew[j] for the J*tk themes.
phi[wi*J+j]= phi[wi*J+j]-xi*mu[i*J+j]
phitot[j]= phitot[j]-xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]-xi*mu[i*J+j]
mutot=mutot+munew[j]
Produce the new mu: munew[j] / mutot → mu[i*J+j]
Update phitot, phi and theta with the new mu.
phi[wi*J+j]= phi[wi*J+j]+xi*mu[i*J+j]
phitot[j]= phitot[j]+xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]+xi*mu[i*J+j]
Produce the data for sorting.
rk[di*J+j]+xi*fabs(munew[j] - mu[i*J + j])→ rk[di*J+j]
rd[di] +xi*fabs(munew[j] - mu[i*J + j])→ rd[di]
Sort according to rk.
insertionsort(J, rk + di*J, ind_rk + di*J)
2. sort according to rd
insertionsort(D, rd, ind_rd)
(5) Compute the perplexity every 10 iterations
0.0 → mutot
For all themes:
mutot + prob[j] → mutot
perp -= (log(mutot)*xi).
Below, the method of this embodiment (abbreviated BP), the Gibbs sampling method (GS) and the variational Bayes method (VB) are compared in terms of semantic compression speed and accuracy.
(1) compression speed
Take 10000 documents, each containing 100 word tokens and 50 word indices, as an example. The GS method is the slowest at semantic compression, needing 50000 seconds. The VB method is next, needing 25000 seconds. The BP method only needs 500 cyclic scans over part of the documents (e.g. td = 0.1) and part of the theme space (e.g. tk = 0.5), so in theory its time is 0.00001 s × (tk × 10) themes × 50 word indices × (td × 10000) documents × 500 cycles = 1250 seconds. The BP method is therefore the fastest. In practice, for the same number of compression cycles, the time required by BP semantic compression is usually 1/50 that of GS and 1/200 that of VB.
(2) compression accuracy
Every semantic compression step of the GS method goes through a sampling operation that assigns one definite theme to each word, whereas in reality a word often belongs to several themes, only with different probabilities in each. This hard assignment of the GS method loses information and lowers the semantic compression accuracy. The BP method uses the probability distribution of a word over the different themes, i.e. a soft assignment, which greatly improves the semantic compression accuracy.
The target problem modelled by the VB method cannot be solved directly; to solve it, approximate inference is performed on the target problem, which introduces the complicated digamma operation. This approximate inference and the complicated computation cause loss of semantic information and reduce accuracy. In contrast, the BP method uses neither sampling nor the digamma operation, so its semantic compression accuracy is the highest.
Furthermore, the BP method is compared with the GS and VB methods on different data sets, with the following results:
(1) NIPS data set
This data set consists of paper abstracts from the international conference Neural Information Processing Systems (NIPS). NIPS contains 1500 documents and its word list contains 12419 words. The data set occupies about 5 MB of disk space.
Parameters used for the comparison: number of themes J = {100, 300, 500, 700, 900}; ALPHA = 2/J; BETA = 0.01; number of cycles N = 500; tk = td = 0.2;
The compression times are shown in Fig. 1;
The compression accuracies are shown in Fig. 2.
(2) NYTIMES data set
This data set consists of articles from the New York Times. NYTIMES contains 15000 documents and its word list contains 84258 words. The data set occupies about 10 MB of disk space.
Parameters used for the comparison: J = {100, 300, 500, 700, 900}; ALPHA = 2/J; BETA = 0.01; number of cycles N = 500; tk = td = 0.2;
The compression times are shown in Fig. 3;
The compression accuracies are shown in Fig. 4.
Furthermore, the influence of the values of the scale factors td and tk on the BP method is studied:
Usually td <= 0.5 and tk <= 0.5; the smaller these values are, the faster BP compresses. The influence of the values of td and tk on compression accuracy is tested on the NIPS data set.
With tk fixed at 1, td is varied over {0.1, 0.2, 0.3, 0.4, 0.5}. The relative perplexity under J = {100, 300, 500, 700, 900} is shown in Fig. 5.
It can be seen that as the parameter td increases, the relative perplexity decreases, which means the compression accuracy improves, but the compression speed drops at the same time. The variation range of the relative perplexity between td = 0.1 and td = 0.5 is about 20, a relatively small loss of accuracy, so the smaller value td = 0.1 is recommended in order to obtain a higher compression speed.
With td fixed at 1, tk is varied over {0.1, 0.2, 0.3, 0.4, 0.5}. The relative perplexity under J = {100, 300, 500, 700, 900} is shown in Fig. 6.
It can be seen that the change of tk has very little influence on the relative perplexity; the variation range of the perplexity is usually within 10, a very small loss of accuracy in practice. Therefore, the smaller value tk = 0.1 is recommended in order to obtain a higher compression speed.

Claims (3)

1. A document storing method based on semantic compression, for storing a document collection D, characterized in that it comprises the following steps:
(1) Read the document collection D into a computer and build from it the W × D matrix representing the document information, where W is the set of words occurring in the documents and each matrix element is the number of times a word occurs in a document;
(2) Build the phi matrix of the distribution of themes over the word list and the theta matrix of the distribution of documents over themes, where the phi matrix is the two-dimensional matrix formed by the word set W and the theme set T, with each element being the weight of a word on a theme, the theta matrix is the two-dimensional matrix formed by the document collection D and the theme set T, with each element being the weight of a document on a theme, and the initial value of every element of the phi and theta matrices is 0;
(3) Initialize the phi and theta matrices: assign a random theme Ti to each word of each document in turn, increase the weight of the document on theme Ti by Cj, and increase the weight of the word on theme Ti in the word list by Cj, where i is the index of the randomly assigned theme and Cj is the number of times the word occurs in the document;
(4) Initialize the mu matrix of the distribution of document words over themes from the matrices obtained in step (3); the mu matrix is the two-dimensional matrix formed by the theme set T and the words Wdi, where Wdi denotes the i-th word of document d;
(5) First iteration: reassign a theme to each word of each document as follows.
A. Cancel the current theme assignment of the word being processed and remove its effect on the word list and on the document it belongs to by modifying the phi and theta matrices.
Let wi be the index of the word in the word list; xi the number of times the word occurs in the document; J the total number of themes; j the theme currently being processed, taking values 0 to J-1; i the index of the word within the document; and di the document index. The phi and theta matrices are then modified as follows:
Subtract this word's contribution from the distribution of each theme in the phi matrix:
phi[wi×J+j]= phi[wi×J+j]-xi×mu[i×J+j]
Subtract this document's contribution from the distribution of each theme in the theta matrix:
theta[di×J+j]= theta[di×J+j]-xi×mu[i×J+j]
Subtract the contribution from the accumulated distribution of each theme over the word list:
phitot[j]= phitot[j]-xi×mu[i×J+j]
B. Update the theme distribution information of this word in the document from the theme information of the current document and the theme information of this word over the word list; the update formula is as follows:
In the formula, munew is the updated mu value; BETA, WBETA and ALPHA are preset constants, where BETA and ALPHA take values between 0 and 0.5 and WBETA is the number of words in the word list multiplied by BETA;
C. Update the phi and theta matrices with the new mu value by accumulating the distribution of each theme:
phi[wi×J+j]= phi[wi×J+j]+xi×mu[i×J+j]
phitot[j]= phitot[j]+xi×mu[i×J+j]
theta[di×J+j]= theta[di×J+j]+xi×mu[i×J+j];
(6) Sort all documents and themes in descending order by their degree of change under semantic compression. The sorting method is:
Scan the degree of change of the weight of each word on each theme:
change value = |theme weight before update − theme weight after update| × number of times the word occurs under this theme
Accumulate the change values of all words involved in a theme to obtain the change value of the theme;
Accumulate the change values of all words of a document to obtain the change value of the document;
Obtain the document list and the theme list sorted in descending order of change value;
(7) Carry out subsequent iterations:
In the descending-sorted document collection, update the theme distribution of the words in the first N documents; the themes updated are the first M themes in the descending-sorted theme set; the update is performed by the method of step (5), where N = td × DD, M = tk × TT, DD is the total number of documents, TT is the total number of themes, and td and tk are preset values between 0.01 and 0.5;
(8) Repeat steps (6) and (7) until the iteration termination condition is met;
The iteration termination condition is selected from one of the following two:
1. A preset number of iterations is set; when the iterative operation reaches the preset number of iterations, the termination condition is met; the preset number of iterations is an integer between 100 and 1000;
2. A perplexity threshold is set. The perplexity is exp(−∑_d log P(W_d) / ∑_d N_d), where P(W_d) is the probability of all the words of document d, computed by taking the dot product of the relevant rows and columns of the theta and phi matrices and multiplying by the word counts before accumulating; ∑_d log P(W_d) is the accumulation of log P(W_d) over all documents; and ∑_d N_d is the total number of words in the document collection. The current perplexity is computed after each iteration; if the absolute value of the difference between the current perplexity and the perplexity of the previous cycle is smaller than the set threshold, the termination condition is met; the perplexity threshold is 1;
(9) Output the distribution of themes over the word list and the distribution of documents over themes as the stored data of the document collection D, thereby realizing document storage based on semantic compression.
2. The document storing method based on semantic compression according to claim 1, characterized in that the value of BETA and ALPHA is 0.01.
3. A document storing method based on semantic compression, for storing a document, characterized in that the distribution of themes over the word list obtained by claim 1 is used as the initial value of the phi matrix, the distribution of documents over themes is used as the initial value of the theta matrix, the pending document is processed by the method of claim 1, and the resulting compressed storage file of the document is stored.
CN201210329421.XA 2012-09-08 2012-09-08 Document storing method based on semantic compression Expired - Fee Related CN102867048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210329421.XA CN102867048B (en) 2012-09-08 2012-09-08 Document storing method based on semantic compression


Publications (2)

Publication Number Publication Date
CN102867048A CN102867048A (en) 2013-01-09
CN102867048B true CN102867048B (en) 2015-02-18

Family

ID=47445917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210329421.XA Expired - Fee Related CN102867048B (en) 2012-09-08 2012-09-08 Document storing method based on semantic compression

Country Status (1)

Country Link
CN (1) CN102867048B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9798732B2 (en) * 2011-01-06 2017-10-24 Micro Focus Software Inc. Semantic associations in data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079025A (en) * 2006-06-19 2007-11-28 Tencent Technology (Shenzhen) Co., Ltd. File correlation computing system and method
CN101079024A (en) * 2006-06-19 2007-11-28 Tencent Technology (Shenzhen) Co., Ltd. Special word list dynamic generation system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Zeng et al.; "Coauthor Network Topic Models with Application to Expert Finding"; Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on; 2010-09-03; Vol. 1; pp. 366-373 *

Also Published As

Publication number Publication date
CN102867048A (en) 2013-01-09


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150218

Termination date: 20170908