CN102867048B - Document storing method based on semantic compression - Google Patents

Document storing method based on semantic compression

Info

Publication number
CN102867048B
CN102867048B (application CN201210329421.XA)
Authority
CN
China
Prior art keywords
document
theme
word
matrix
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210329421.XA
Other languages
Chinese (zh)
Other versions
CN102867048A (en
Inventor
曾嘉
曹小琴
严建峰
刘晓升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201210329421.XA
Publication of CN102867048A
Application granted
Publication of CN102867048B
Expired - Fee Related
Anticipated expiration

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document storing method based on semantic compression. The method comprises the following steps: constructing a document information matrix, a matrix of the distribution of themes over the word list, and a matrix of the distribution of documents over themes; initializing the matrices; reassigning a theme to each word of each document; sorting all documents and all themes in descending order by their degree of change under semantic compression; and updating the theme distribution of the words in the first N documents of the descending-sorted document set, where the updated themes are the first M themes of the descending-sorted theme set, N = td × DD, M = tk × TT, DD is the total number of documents, TT is the total number of themes, and td and tk are preset values between 0.01 and 0.5. The descending sort and the update are repeated until the iteration termination condition is met, and the distribution of themes over the word list and the distribution of documents over themes are output. The method reduces storage space and increases retrieval speed, and it guarantees the accuracy of semantic compression while increasing its speed.

Description

Document storing method based on semantic compression
Technical field
The present invention relates to a computer document storage method, and in particular to a document storing method based on semantic compression.
Background technology
With the development of computer technology and the widespread use of network technology, the number of documents processed by computers has increased sharply. The storage, retrieval and transmission of information place ever higher demands on the capacity of storage devices, the memory of processing systems and the bandwidth of transmission networks. Effectively compressing the stored files therefore plays an important role in reducing storage requirements and speeding up data retrieval.
In the prior art, research on compressed document storage is essentially limited to lossless compression techniques: a computer document is compressed in binary form to reduce the redundant space it occupies, thereby reducing its storage requirement. This approach can reduce document storage space, but because the entire document information must be retained, the storage space is difficult to reduce further.
Semantic compression of a document describes the document by its themes: the themes reflect the document and determine its classification. The theme information is far smaller than the document information, yet it can adequately represent it. A document can therefore be characterized with a small amount of theme information, which realizes semantic compression of the document. Semantic compression is lossy, and its compression ratio is related to the number of themes chosen: the larger the number of themes, the higher the compression accuracy but the lower the compression ratio (the compression ratio is defined as the storage space before compression divided by the storage space after compression); conversely, the smaller the number of themes, the lower the accuracy but the higher the compression ratio. In practice, the number of themes is set according to actual requirements.
The principle of semantic compression of documents is as follows: a document is represented as a set of words, a theme is represented as a distribution over words, and a theme is characterized by the words that occur most frequently in it. Based on the relation between documents and words and between themes and words, a document is turned into a multinomial distribution over several themes; that is, a few themes constitute, and describe, a document.
The document information is represented by a W × D matrix, where W is the word list of the corpus and D is the set of documents. The W × D matrix records the number of times each word in the word list occurs in each document, as shown in Table 1. For example, if the element at (W0, D0) is 3, the word W0 (index 0 in the word list) occurs 3 times in document D0.
Table 1 W × D matrix
By iterative processing of the document information, the distribution phi of themes over the word list and the distribution theta of documents over themes are obtained, as shown in Tables 2 and 3. When the number of documents is large, e.g. tens of millions of documents, the W × D matrix is extremely large; decomposing it into the phi matrix and the theta matrix compresses the semantic information, makes storage convenient, and facilitates subsequent operations such as document analysis, data mining and information retrieval.
Table 2 Distribution matrix phi of themes over the word list
Table 3 Distribution matrix theta of documents over themes
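As an illustration (this sketch is not part of the original patent text), the decomposition can be written in C roughly as follows; the array names, the flat row-major layout and the normalization of phi and theta into probability distributions are assumptions made only for this example.

#include <stddef.h>

/* Approximate the count of word w in document d from the factorized
 * representation: count(w, d) ~= N_d * sum_t p(w | t) * p(t | d),
 * where phi holds p(w | t) (W rows x T columns, row-major) and
 * theta holds p(t | d) (D rows x T columns, row-major). */
double approx_count(const double *phi, const double *theta,
                    size_t T, size_t w, size_t d, double n_d)
{
    double p = 0.0;
    for (size_t t = 0; t < T; ++t)
        p += phi[w * T + t] * theta[d * T + t];
    return n_d * p;   /* scale the word probability by the document length */
}

Storing phi (W × T values) and theta (D × T values) instead of the full W × D count matrix gives a compression ratio on the order of (W × D) / (T × (W + D)), which is large when the number of themes T is much smaller than both W and D.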
At present, two semantic compression methods are commonly used: Gibbs sampling (GS) and variational Bayes (VB).
The Gibbs sampling method scans the word tokens of every document. A word token is one occurrence of a word index in a document; for example, if the word "family" occurs 10 times in a document, there are 10 "family" tokens. For every token scanned, Gibbs sampling infers the semantic distribution of the token over the themes and then randomly samples one theme for the token from this distribution. If a document contains many repeated tokens, e.g. "family" repeated 1000 times, the scanning time of Gibbs sampling grows greatly. At the same time, sampling a single theme from the token's theme distribution discards part of the information in the distribution, so the accuracy of the semantic compression is not high. By scanning the whole document collection many times (usually more than 500), Gibbs sampling infers the theme distribution parameters of every document and the distribution parameters of every theme over the word list, thereby achieving semantic compression. Take 10000 documents, each containing 100 word tokens, as an example, and suppose that scanning one token and compressing its semantics takes 0.00001 seconds. With J = 10 themes, the Gibbs method theoretically needs 0.00001 s × 10 themes × 100 tokens × 10000 documents × 500 iterations = 50000 seconds to complete the semantic compression.
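For comparison, the following is a minimal, self-contained C sketch of the per-token step of standard collapsed Gibbs sampling described above (not taken from the patent; the count arrays nwt/ndt/ntot, the hyperparameters alpha and beta, and the use of rand() are illustrative assumptions).

#include <stdlib.h>

/* Resample the theme of a single word token, as collapsed Gibbs sampling
 * does for every occurrence of every word; nwt[w*J+j], ndt[d*J+j] and
 * ntot[j] are the usual word-theme, document-theme and theme-total counts. */
int gibbs_resample_token(int w, int d, int old_j, int J, int W,
                         int *nwt, int *ndt, int *ntot,
                         double alpha, double beta, double *p /* length J */)
{
    /* remove the token's current assignment */
    nwt[w * J + old_j]--;  ndt[d * J + old_j]--;  ntot[old_j]--;

    /* build the (unnormalized) conditional distribution over themes */
    double sum = 0.0;
    for (int j = 0; j < J; ++j) {
        p[j] = (nwt[w * J + j] + beta) / (ntot[j] + W * beta)
             * (ndt[d * J + j] + alpha);
        sum += p[j];
    }

    /* draw one theme at random from that distribution (hard assignment) */
    double u = sum * ((double)rand() / ((double)RAND_MAX + 1.0));
    int new_j = 0;
    double acc = p[0];
    while (new_j < J - 1 && u >= acc) {
        ++new_j;
        acc += p[new_j];
    }

    /* add the token back under its new theme */
    nwt[w * J + new_j]++;  ndt[d * J + new_j]++;  ntot[new_j]++;
    return new_j;
}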
The variational Bayes method scans only the word indices of every document; for example, if "family" is repeated 1000 times in a document, variational Bayes only needs to scan the index of "family" in the word list once. Variational Bayes is therefore more efficient than Gibbs sampling at scanning the whole text collection. However, variational Bayes introduces the complicated digamma operation when inferring the semantic information of each word index, and in practice this operation costs 4-6 times as much as a normal operation. The digamma operation also introduces error into the semantic compression. For 10000 documents, each containing 100 word tokens but only 50 word indices, variational Bayes theoretically needs 0.00001 s × 10 themes × 50 word indices × 10000 documents × 500 iterations × 5 (for the digamma operations) = 125000 seconds.
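To illustrate the cost attributed to the digamma operation, the following sketch (again not from the patent) shows a standard digamma approximation and a schematic VB-style responsibility that calls it three times per word index and theme; the exact variational update equations are not reproduced here, so treat the second function as an assumption-laden illustration only.

#include <math.h>

/* Standard asymptotic approximation of the digamma function psi(x),
 * shown only to illustrate that each call costs several divisions,
 * multiplications and a log, i.e. far more than a simple count update. */
static double digamma_approx(double x)
{
    double r = 0.0;
    while (x < 6.0) { r -= 1.0 / x; x += 1.0; }   /* psi(x) = psi(x+1) - 1/x */
    double f = 1.0 / (x * x);
    return r + log(x) - 0.5 / x
             - f * (1.0 / 12.0 - f * (1.0 / 120.0 - f / 252.0));
}

/* Schematic VB-style responsibility of word index wi in document di for
 * theme j; the names mirror the patent's phi/theta/phitot arrays, but the
 * exact variational equations differ in detail, so this is only a sketch. */
static double vb_responsibility(const double *phi, const double *phitot,
                                const double *theta, int wi, int di, int j,
                                int J, double ALPHA, double BETA, double WBETA)
{
    return exp(digamma_approx(phi[wi * J + j] + BETA)
             - digamma_approx(phitot[j] + WBETA)
             + digamma_approx(theta[di * J + j] + ALPHA));
}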
It can be seen that both Gibbs sampling and variational Bayes suffer from slow speed and limited accuracy in semantic compression.
Summary of the invention
The object of the present invention is to provide a document storing method based on semantic compression that improves the speed and accuracy of semantic compression, thereby effectively improving the efficiency of semantic compression when massive numbers of documents are stored.
To achieve the above object, the technical solution adopted by the present invention is a document storing method based on semantic compression, for storing a document collection D, comprising the following steps:
(1) Read the document collection D into a computer and build from it the W × D matrix representing the document information, where W is the set of words occurring in the documents and each matrix element is the number of times a word occurs in a document;
(2) Build the phi matrix of the distribution of themes over the word list and the theta matrix of the distribution of documents over themes, where the phi matrix is the two-dimensional matrix formed by the word set W and the theme set T, with each element being the weight of a word on a theme; the theta matrix is the two-dimensional matrix formed by the document collection D and the theme set T, with each element being the weight of a document on a theme; the initial value of every element of the phi and theta matrices is 0;
(3) Initialize the phi and theta matrices: assign a random theme Ti to each word of each document in turn, increase the weight of the document on theme Ti by Cj, and increase the weight of the word on theme Ti in the word list by Cj, where i is the index of the randomly assigned theme and Cj is the number of times the word occurs in the document;
(4) Initialize the mu matrix of the distribution of document words over themes from the matrices obtained in step (3); the mu matrix is the two-dimensional matrix formed by the theme set T and the words Wdi, where Wdi denotes the i-th word of document d;
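Steps (1) to (4) can be pictured with the following minimal C sketch (not the patent's verbatim code): phi, theta and mu are stored as flat row-major arrays, and the random initialization of step (3) adds the word's count to the chosen theme's document weight and word-list weight, exactly as in the embodiment's pseudocode later in this description; the sparse per-document representation and all names are illustrative assumptions.

#include <stdlib.h>

/* One entry of the sparse W x D matrix for a single document:
 * word index wi and its count xi in that document. */
typedef struct { int wi; double xi; } DocEntry;

/* Step (3): random initialization of phi (W x J), phitot (J) and
 * theta (D x J), plus step (4): initialization of mu for this document.
 * All arrays are flat, row-major, and pre-allocated to zero. */
void init_document(const DocEntry *doc, int n_entries, int di, int J,
                   double *phi, double *phitot, double *theta, double *mu)
{
    for (int i = 0; i < n_entries; ++i) {
        int wi = doc[i].wi;
        double xi = doc[i].xi;           /* Cj: count of the word in the doc */
        int topic = rand() % J;          /* random theme for this word */

        phi[wi * J + topic]   += xi;     /* word weight on the theme */
        phitot[topic]         += xi;     /* theme total over the word list */
        theta[di * J + topic] += xi;     /* document weight on the theme */
        mu[i * J + topic]      = 1.0;    /* step (4): all mass on that theme */
    }
}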
(5) First iteration: reassign a theme to each word of each document as follows.
A. Cancel the current theme assignment of the word being processed and remove its effect on the word list and on the document it belongs to by modifying the phi and theta matrices.
Let wi be the index of the word in the word list; xi the number of times the word occurs in the document; J the total number of themes; j the theme currently being processed, taking values 0 to J-1; i the index of the word within the document; and di the document index. The phi and theta matrices are then modified as follows:
Subtract this word's contribution from the distribution of each theme in the phi matrix:
phi[wi×J+j]= phi[wi×J+j]-xi×mu[i×J+j]
Subtract this document's contribution from the distribution of each theme in the theta matrix:
theta[di×J+j]= theta[di×J+j]-xi×mu[i×J+j]
Subtract the contribution from the accumulated distribution of each theme over the word list:
phitot[j]= phitot[j]-xi×mu[i×J+j]
B. Update the theme distribution information of this word in the document from the theme information of the current document and the theme information of this word over the word list. The update formula is as follows (the formula itself is not reproduced in this text; an assumed form is sketched after sub-step C below):
In the formula, munew is the updated mu value; BETA, WBETA and ALPHA are preset constants, where BETA and ALPHA take values between 0 and 0.5 and WBETA is the number of words in the word list multiplied by BETA;
C. Update the phi and theta matrices with the new mu value by accumulating the distribution of each theme:
phi[wi×J+j]= phi[wi×J+j]+xi×mu[i×J+j]
phitot[j]= phitot[j]+xi×mu[i×J+j]
theta[di×J+j]= theta[di×J+j]+xi×mu[i×J+j]
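The whole of step (5) for a single word can be sketched in C as follows (not the patent's verbatim code). The update formula of sub-step B is not reproduced in this text, so the sketch assumes the natural form suggested by the parameters BETA, WBETA and ALPHA, namely munew[j] proportional to (phi[wi×J+j] + BETA) / (phitot[j] + WBETA) × (theta[di×J+j] + ALPHA), normalized over j; that line should therefore be read as an assumption rather than as the granted formula.

/* Step (5) for one word: A) remove its current soft assignment,
 * B) recompute munew[j] (the exact formula is assumed, see above),
 * C) add the new soft assignment back into phi, phitot and theta.
 * munew must have room for J values. */
void update_word(int wi, int di, int i, double xi, int J,
                 double *phi, double *phitot, double *theta, double *mu,
                 double ALPHA, double BETA, double WBETA, double *munew)
{
    /* A. cancel the existing distribution of this word */
    for (int j = 0; j < J; ++j) {
        phi[wi * J + j]   -= xi * mu[i * J + j];
        phitot[j]         -= xi * mu[i * J + j];
        theta[di * J + j] -= xi * mu[i * J + j];
    }

    /* B. recompute the theme distribution of this word (assumed form) */
    double mutot = 0.0;
    for (int j = 0; j < J; ++j) {
        munew[j] = (phi[wi * J + j] + BETA) / (phitot[j] + WBETA)
                 * (theta[di * J + j] + ALPHA);
        mutot += munew[j];
    }
    for (int j = 0; j < J; ++j)
        mu[i * J + j] = munew[j] / mutot;    /* normalized soft assignment */

    /* C. accumulate the new distribution back into phi, phitot and theta */
    for (int j = 0; j < J; ++j) {
        phi[wi * J + j]   += xi * mu[i * J + j];
        phitot[j]         += xi * mu[i * J + j];
        theta[di * J + j] += xi * mu[i * J + j];
    }
}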
(6) Sort all documents and themes in descending order by their degree of change under semantic compression. The sorting method is:
Scan the degree of change of the weight of each word on each theme:
change value = |theme weight before update − theme weight after update| × number of times the word occurs under this theme
Accumulate the change values of all words involved in a theme to obtain the change value of the theme;
Accumulate the change values of all words of a document to obtain the change value of the document;
Obtain the document list and the theme list sorted in descending order of change value;
For example, suppose document D0 has 4 words, W0 to W3, occurring 5, 6, 2 and 1 times respectively. The distribution of the words over the themes after initialization is shown in Table 5; after the themes are reassigned, it is shown in Table 6.
Table 5 Distribution of words over themes after initialization
Table 6 Distribution of words over themes after theme reassignment
The change values of the words on the three themes are then as shown in Table 7.
Table 7 Change values of words on themes
Thus, after the first iteration update of document D0, the change value on theme T0 is 2.8, on theme T1 it is 3.5 and on theme T2 it is 3.3, so the change value of the document is 2.8 + 3.5 + 3.3 = 9.6.
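A minimal C sketch of the bookkeeping of step (6) follows (illustrative, not the patent's verbatim code): while a word is updated, its weighted change is accumulated into the per-document-per-theme residual rk and the per-document residual rd, and the resulting values are then index-sorted in descending order; a simple insertion sort stands in for the dsort/insertionsort routines referenced in the embodiment.

#include <math.h>

/* Step (6) bookkeeping while a word is updated: accumulate the absolute
 * change of its soft assignment, weighted by its count xi, into the
 * per-document-per-theme residual rk and the per-document residual rd. */
void accumulate_change(int di, int i, double xi, int J,
                       const double *munew, const double *mu,
                       double *rk, double *rd)
{
    for (int j = 0; j < J; ++j) {
        double change = xi * fabs(munew[j] - mu[i * J + j]);
        rk[di * J + j] += change;    /* change of this document on theme j */
        rd[di]         += change;    /* total change of this document */
    }
}

/* Produce ind[0..n-1], the indices of val sorted by descending value. */
void argsort_desc(const double *val, int n, int *ind)
{
    for (int k = 0; k < n; ++k) ind[k] = k;
    for (int k = 1; k < n; ++k) {
        int cur = ind[k], m = k;
        while (m > 0 && val[ind[m - 1]] < val[cur]) { ind[m] = ind[m - 1]; --m; }
        ind[m] = cur;
    }
}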
(7) Carry out subsequent iterations:
In the descending-sorted document collection, update the theme distribution of the words in the first N documents; the themes updated are the first M themes in the descending-sorted theme set; the update is performed by the method of step (5), where N = td × DD, M = tk × TT, DD is the total number of documents, TT is the total number of themes, and td and tk are preset values between 0.01 and 0.5;
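The scheduling of step (7) can be sketched as follows (illustrative only; the callback name and signature are assumptions): only the first N = td × DD documents of the document order, and within each of them only its first M = tk × TT themes, are revisited, which is what bounds the cost of every subsequent iteration.

/* Step (7): one subsequent iteration restricted to the top documents and
 * top themes according to the descending change-value orders ind_rd and
 * ind_rk produced in step (6). update_word_on_themes is assumed to apply
 * the step (5) update to the words of one document, limited to the listed themes. */
void partial_iteration(int DD, int TT, double td, double tk,
                       const int *ind_rd, const int *ind_rk,
                       void (*update_word_on_themes)(int di, const int *themes, int M))
{
    int N = (int)(td * DD);          /* number of documents to revisit */
    int M = (int)(tk * TT);          /* number of themes to revisit per document */

    for (int ii = 0; ii < N; ++ii) {
        int di = ind_rd[ii];                       /* ii-th most-changed document */
        const int *themes = &ind_rk[di * TT];      /* its M most-changed themes */
        update_word_on_themes(di, themes, M);      /* step (5) update, restricted */
    }
}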
(8) Repeat steps (6) and (7) until the iteration termination condition is met;
(9) Output the distribution of themes over the word list and the distribution of documents over themes as the stored data of the document collection D, thereby realizing document storage based on semantic compression.
In the above technical scheme, the iteration termination condition may be a preset number of iterations: when the iterative operation reaches the preset number of iterations, the termination condition is met; the preset number of iterations is an integer between 100 and 1000.
Alternatively, in the above technical scheme, the iteration termination condition is a perplexity threshold. The perplexity is exp(−∑_d log P(W_d) / ∑_d N_d), where P(W_d) is the probability of all the words of document d, computed by taking the dot product of the relevant rows and columns of the theta and phi matrices and multiplying by the word counts before accumulating; ∑_d log P(W_d) is the accumulation of log P(W_d) over all documents; and ∑_d N_d is the total number of words in the document collection. The current perplexity is computed after each iteration; if the absolute value of the difference between the current perplexity and the perplexity of the previous cycle is smaller than the set threshold, the termination condition is met; the perplexity threshold is 1.
The calculation of the perplexity can be expressed as:
perplexity = exp(sum(-log(sum(theta(:,di).*phi(:,wi),1)).*ci')/sum(ci));
where ci is the vector of word counts.
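An equivalent C sketch of the perplexity computation follows (the MATLAB one-liner above is the document's own version; the parallel-array document representation and the assumption that phi and theta have been normalized into probability distributions are made only for this illustration).

#include <math.h>

/* Perplexity over the collection, assuming phi[wi*J+j] holds p(word wi | theme j)
 * and theta[di*J+j] holds p(theme j | document di); doc_wi[di][k] is the k-th
 * word index of document di and doc_ci[di][k] its count. */
double perplexity(const int *const *doc_wi, const double *const *doc_ci,
                  const int *doc_len, int D, int J,
                  const double *phi, const double *theta)
{
    double log_sum = 0.0, count_sum = 0.0;
    for (int di = 0; di < D; ++di) {
        for (int k = 0; k < doc_len[di]; ++k) {
            int wi = doc_wi[di][k];
            double ci = doc_ci[di][k];
            double p = 0.0;
            for (int j = 0; j < J; ++j)           /* dot product over the themes */
                p += theta[di * J + j] * phi[wi * J + j];
            log_sum   += -log(p) * ci;            /* count-weighted negative log-likelihood */
            count_sum += ci;
        }
    }
    return exp(log_sum / count_sum);
}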
In the above technical scheme, the preferred value of BETA and ALPHA is 0.01.
The above technical scheme is a storage method for a new document collection. Similarly, when a compressed storage file of an existing document collection is already available, a new document can be stored by using the distribution of themes over the word list and the distribution of documents over themes obtained by the above scheme as the initial values of the corresponding matrices, and then processing the pending document by the same method to obtain and store its compressed storage file. Because the initial values already exist, compression of the new document can be completed with fewer iterations, which further speeds up processing while guaranteeing accuracy.
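For the fold-in of a single new document described above, a minimal sketch follows (not from the patent; it assumes, as the simplest variant, that phi and phitot from the stored compressed file are held fixed and only the new document's theta row and mu values are iterated with the step (5)-style update, whose exact formula is again an assumption rather than a quotation). Because the theme-word statistics already exist, a handful of iterations is typically enough, which is the point made in the paragraph above.

/* Fold a single new document into an existing compressed store.
 * phi/phitot come from the stored model and are treated as fixed here
 * (an illustrative simplification); theta_d (length J) and mu (n x J)
 * belong to the new document and start at zero. */
void fold_in_document(const int *wi, const double *xi, int n, int J,
                      const double *phi, const double *phitot,
                      double *theta_d, double *mu,
                      double ALPHA, double BETA, double WBETA, int iters)
{
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i) {
            double mutot = 0.0;
            for (int j = 0; j < J; ++j) {
                /* remove this word's previous contribution to the document */
                theta_d[j] -= xi[i] * mu[i * J + j];
                /* assumed update form, as in the step (5) sketch */
                double munew = (phi[wi[i] * J + j] + BETA) / (phitot[j] + WBETA)
                             * (theta_d[j] + ALPHA);
                mu[i * J + j] = munew;
                mutot += munew;
            }
            for (int j = 0; j < J; ++j) {
                mu[i * J + j] /= mutot;                  /* normalize */
                theta_d[j] += xi[i] * mu[i * J + j];     /* add back */
            }
        }
    }
}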
Owing to the above technical scheme, the present invention has the following advantages over the prior art:
1. The present invention applies semantic compression to document storage, which greatly reduces the required storage space; for Internet data retrieval, the compressed documents can be stored and retrieved locally, which greatly increases retrieval speed.
2. By improving the way the document word indices are scanned and by avoiding the complicated digamma operation, the present invention greatly increases the speed of semantic compression. With the prior art, on a 10M document set with 100 themes and 500 cycles, the GS semantic compression time is 1580 seconds and VB needs 10670 seconds, whereas the present invention needs only 175 seconds.
3. While greatly increasing the semantic compression speed, the present invention maintains the accuracy of semantic compression. For example, on the 10M document set with 100 themes, the accuracy of the present invention is 8.9% higher than that of GS and 22.1% higher than that of VB.
Brief description of the drawings
Fig. 1 is a comparison of compression times on the NIPS data set in the embodiment;
Fig. 2 is a comparison of compression accuracies on the NIPS data set in the embodiment;
Fig. 3 is a comparison of compression times on the NYTIMES data set in the embodiment;
Fig. 4 is a comparison of compression accuracies on the NYTIMES data set in the embodiment;
Fig. 5 shows the change in relative perplexity as td is varied in the embodiment;
Fig. 6 shows the change in relative perplexity as tk is varied in the embodiment.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Embodiment one: a document storing method based on semantic compression. First obtain the document collection D and build the W × D matrix; the concrete storage method comprises:
(1) Read the W × D matrix information; set the number of iterations iter = 100 and the scale factors td = 0.2, tk = 0.2.
(2) initialization
Generate a random theme topic for each word:
random theme → topic
phi[wi*J+topic] = phi[wi*J+topic] + xi // xi is the number of times the word occurs in this document
phitot[topic]= phitot[topic]+xi
theta[di*J+topic]= theta[di*J+topic]+xi
1.0→mu[i*J + topic]
Here, the mu matrix is the distribution of document words over themes; its form is shown in Table 4, where Wdi denotes the i-th word of document d.
Table 4 Distribution matrix mu of document words over themes
(3) First iteration
1. Process each document
Reassign a theme to each word.
Cancel the word's distribution over all themes, updating phitot, phi and theta, and produce munew[j] for all themes.
phi[wi*J+j]= phi[wi*J+j]-xi*mu[i*J+j]
phitot[j]= phitot[j]-xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]-xi*mu[i*J+j]
mutot=mutot+munew[j]
Produce the new mu: munew[j] / mutot → mu[i*J+j]
Update phitot, phi and theta with the new mu.
phi[wi*J+j]= phi[wi*J+j]+xi*mu[i*J+j]
phitot[j]= phitot[j]+xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]+xi*mu[i*J+j]
Produce the data for sorting.
rk[di*J+j]+xi*fabs(munew[j] - mu[i*J + j])→ rk[di*J+j]
rd[di] +xi*fabs(munew[j] - mu[i*J + j])→ rd[di]
Sort according to rk.
dsort(J, rk + di*J, -1, ind_rk + di*J)
2. sort according to rd
dsort(D, rd, -1, ind_rd);
(4) Subsequent iterations
1. The scope of the subsequent iterations is td*D documents and tk*J themes per document.
Process td*D documents; the number of cycles is td*D, where td is a preset parameter.
The document index is stored in ind_rd[ii], where ii is the cycle index.
From the preparatory sorting work, ind_rk[jj] stores the themes that need to be processed first (the themes after sorting).
ind_rk[di*J+j] → k // for the given document, j loops over its tk*J themes
rd[di] - rk[di*J+k] → rd[di] // di is the document index
0→rk[di*J + k]
Reassign a theme to each word.
Cancel the word's distribution over the J*tk themes, updating phitot, phi and theta, and produce munew[j] for the J*tk themes.
phi[wi*J+j]= phi[wi*J+j]-xi*mu[i*J+j]
phitot[j]= phitot[j]-xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]-xi*mu[i*J+j]
mutot=mutot+munew[j]
Produce the new mu: munew[j] / mutot → mu[i*J+j]
Update phitot, phi and theta with the new mu.
phi[wi*J+j]= phi[wi*J+j]+xi*mu[i*J+j]
phitot[j]= phitot[j]+xi*mu[i*J+j]
theta[di*J+j]= theta[di*J+j]+xi*mu[i*J+j]
Produce the data for sorting.
rk[di*J+j]+xi*fabs(munew[j] - mu[i*J + j])→ rk[di*J+j]
rd[di] +xi*fabs(munew[j] - mu[i*J + j])→ rd[di]
Sort according to rk.
insertionsort(J, rk + di*J, ind_rk + di*J)
2. sort according to rd
insertionsort(D, rd, ind_rd)
(5) Compute the perplexity every 10 iterations
0.0 → mutot
For all themes:
mutot + prob[j] → mutot
perp -= (log(mutot)*xi).
Below, the method of this embodiment (abbreviated BP), the Gibbs sampling method (GS) and the variational Bayes method (VB) are compared in terms of semantic compression speed and accuracy.
(1) compression speed
Take 10000 documents, each containing 100 word tokens and 50 word indices, as an example. The GS method is the slowest at semantic compression, needing 50000 seconds. The VB method is next, needing 25000 seconds. The BP method only needs 500 cyclic scans over part of the documents (e.g. td = 0.1) and part of the theme space (e.g. tk = 0.5), so in theory its time is 0.00001 s × (tk × 10) themes × 50 word indices × (td × 10000) documents × 500 cycles = 1250 seconds. The BP method is therefore the fastest. In practice, for the same number of compression cycles, the time required by BP semantic compression is usually 1/50 that of GS and 1/200 that of VB.
(2) compression accuracy
Every semantic compression step of the GS method goes through a sampling operation that assigns one definite theme to each word, whereas in reality a word often belongs to several themes, only with different probabilities in each. This hard assignment of the GS method loses information and lowers the semantic compression accuracy. The BP method uses the probability distribution of a word over the different themes, i.e. a soft assignment, which greatly improves the semantic compression accuracy.
The target problem modelled by the VB method cannot be solved directly; to solve it, approximate inference is performed on the target problem, which introduces the complicated digamma operation. This approximate inference and the complicated computation cause loss of semantic information and reduce accuracy. In contrast, the BP method uses neither sampling nor the digamma operation, so its semantic compression accuracy is the highest.
Furthermore, the BP method is compared with the GS and VB methods on different data sets, with the following results:
(1) NIPS data set
This data set consists of paper abstracts from the international conference Neural Information Processing Systems (NIPS). NIPS contains 1500 documents and its word list contains 12419 words. The data set occupies about 5 MB of disk space.
Parameters used for the comparison: number of themes J = {100, 300, 500, 700, 900}; ALPHA = 2/J; BETA = 0.01; number of cycles N = 500; tk = td = 0.2;
The compression times are shown in Fig. 1;
The compression accuracies are shown in Fig. 2.
(2) NYTIMES data set
This data set consists of articles from the New York Times. NYTIMES contains 15000 documents and its word list contains 84258 words. The data set occupies about 10 MB of disk space.
Parameters used for the comparison: J = {100, 300, 500, 700, 900}; ALPHA = 2/J; BETA = 0.01; number of cycles N = 500; tk = td = 0.2;
The compression times are shown in Fig. 3;
The compression accuracies are shown in Fig. 4.
Furthermore, the influence of the values of the scale factors td and tk on the BP method is studied:
Usually td <= 0.5 and tk <= 0.5; the smaller these values are, the faster BP compresses. The influence of the values of td and tk on compression accuracy is tested on the NIPS data set.
With tk fixed at 1, td is varied over {0.1, 0.2, 0.3, 0.4, 0.5}. The relative perplexity under J = {100, 300, 500, 700, 900} is shown in Fig. 5.
It can be seen that as the parameter td increases, the relative perplexity decreases, which means the compression accuracy improves, but the compression speed drops at the same time. The variation range of the relative perplexity between td = 0.1 and td = 0.5 is about 20, a relatively small loss of accuracy, so the smaller value td = 0.1 is recommended in order to obtain a higher compression speed.
With td fixed at 1, tk is varied over {0.1, 0.2, 0.3, 0.4, 0.5}. The relative perplexity under J = {100, 300, 500, 700, 900} is shown in Fig. 6.
It can be seen that the change of tk has very little influence on the relative perplexity; the variation range of the perplexity is usually within 10, a very small loss of accuracy in practice. Therefore, the smaller value tk = 0.1 is recommended in order to obtain a higher compression speed.

Claims (3)

1. A document storing method based on semantic compression, for storing a document collection D, characterized in that it comprises the following steps:
(1) Read the document collection D into a computer and build from it the W × D matrix representing the document information, where W is the set of words occurring in the documents and each matrix element is the number of times a word occurs in a document;
(2) Build the phi matrix of the distribution of themes over the word list and the theta matrix of the distribution of documents over themes, where the phi matrix is the two-dimensional matrix formed by the word set W and the theme set T, with each element being the weight of a word on a theme, the theta matrix is the two-dimensional matrix formed by the document collection D and the theme set T, with each element being the weight of a document on a theme, and the initial value of every element of the phi and theta matrices is 0;
(3) Initialize the phi and theta matrices: assign a random theme Ti to each word of each document in turn, increase the weight of the document on theme Ti by Cj, and increase the weight of the word on theme Ti in the word list by Cj, where i is the index of the randomly assigned theme and Cj is the number of times the word occurs in the document;
(4) Initialize the mu matrix of the distribution of document words over themes from the matrices obtained in step (3); the mu matrix is the two-dimensional matrix formed by the theme set T and the words Wdi, where Wdi denotes the i-th word of document d;
(5) First iteration: reassign a theme to each word of each document as follows.
A. Cancel the current theme assignment of the word being processed and remove its effect on the word list and on the document it belongs to by modifying the phi and theta matrices.
Let wi be the index of the word in the word list; xi the number of times the word occurs in the document; J the total number of themes; j the theme currently being processed, taking values 0 to J-1; i the index of the word within the document; and di the document index. The phi and theta matrices are then modified as follows:
Subtract this word's contribution from the distribution of each theme in the phi matrix:
phi[wi×J+j]= phi[wi×J+j]-xi×mu[i×J+j]
Subtract this document's contribution from the distribution of each theme in the theta matrix:
theta[di×J+j]= theta[di×J+j]-xi×mu[i×J+j]
Subtract the contribution from the accumulated distribution of each theme over the word list:
phitot[j]= phitot[j]-xi×mu[i×J+j]
B. Update the theme distribution information of this word in the document from the theme information of the current document and the theme information of this word over the word list; the update formula is as follows:
In the formula, munew is the updated mu value; BETA, WBETA and ALPHA are preset constants, where BETA and ALPHA take values between 0 and 0.5 and WBETA is the number of words in the word list multiplied by BETA;
C. Update the phi and theta matrices with the new mu value by accumulating the distribution of each theme:
phi[wi×J+j]= phi[wi×J+j]+xi×mu[i×J+j]
phitot[j]= phitot[j]+xi×mu[i×J+j]
theta[di×J+j]= theta[di×J+j]+xi×mu[i×J+j];
(6) Sort all documents and themes in descending order by their degree of change under semantic compression. The sorting method is:
Scan the degree of change of the weight of each word on each theme:
change value = |theme weight before update − theme weight after update| × number of times the word occurs under this theme
Accumulate the change values of all words involved in a theme to obtain the change value of the theme;
Accumulate the change values of all words of a document to obtain the change value of the document;
Obtain the document list and the theme list sorted in descending order of change value;
(7) Carry out subsequent iterations:
In the descending-sorted document collection, update the theme distribution of the words in the first N documents; the themes updated are the first M themes in the descending-sorted theme set; the update is performed by the method of step (5), where N = td × DD, M = tk × TT, DD is the total number of documents, TT is the total number of themes, and td and tk are preset values between 0.01 and 0.5;
(8) Repeat steps (6) and (7) until the iteration termination condition is met;
The iteration termination condition is selected from one of the following two:
1. A preset number of iterations is set; when the iterative operation reaches the preset number of iterations, the termination condition is met; the preset number of iterations is an integer between 100 and 1000;
2. A perplexity threshold is set. The perplexity is exp(−∑_d log P(W_d) / ∑_d N_d), where P(W_d) is the probability of all the words of document d, computed by taking the dot product of the relevant rows and columns of the theta and phi matrices and multiplying by the word counts before accumulating; ∑_d log P(W_d) is the accumulation of log P(W_d) over all documents; and ∑_d N_d is the total number of words in the document collection. The current perplexity is computed after each iteration; if the absolute value of the difference between the current perplexity and the perplexity of the previous cycle is smaller than the set threshold, the termination condition is met; the perplexity threshold is 1;
(9) Output the distribution of themes over the word list and the distribution of documents over themes as the stored data of the document collection D, thereby realizing document storage based on semantic compression.
2. The document storing method based on semantic compression according to claim 1, characterized in that the value of BETA and ALPHA is 0.01.
3. A document storing method based on semantic compression, for storing a document, characterized in that the distribution of themes over the word list obtained by claim 1 is used as the initial value of the phi matrix, the distribution of documents over themes is used as the initial value of the theta matrix, the pending document is processed by the method of claim 1, and the resulting compressed storage file of the document is stored.
CN201210329421.XA 2012-09-08 2012-09-08 Document storing method based on semantic compression Expired - Fee Related CN102867048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210329421.XA CN102867048B (en) 2012-09-08 2012-09-08 Document storing method based on semantic compression


Publications (2)

Publication Number Publication Date
CN102867048A CN102867048A (en) 2013-01-09
CN102867048B true CN102867048B (en) 2015-02-18

Family

ID=47445917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210329421.XA Expired - Fee Related CN102867048B (en) 2012-09-08 2012-09-08 Document storing method based on semantic compression

Country Status (1)

Country Link
CN (1) CN102867048B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9798732B2 (en) * 2011-01-06 2017-10-24 Micro Focus Software Inc. Semantic associations in data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079025A (en) * 2006-06-19 2007-11-28 Tencent Technology (Shenzhen) Co., Ltd. File correlation computing system and method
CN101079024A (en) * 2006-06-19 2007-11-28 Tencent Technology (Shenzhen) Co., Ltd. Special word list dynamic generation system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Zeng et al.; "Coauthor Network Topic Models with Application to Expert Finding"; Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on; 2010-09-03; Vol. 1; pp. 366-373 *

Also Published As

Publication number Publication date
CN102867048A (en) 2013-01-09


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150218

Termination date: 20170908