CN101246501A

CN101246501A - Method and system for polymerizing the same subject network document files

Info

Publication number: CN101246501A
Application number: CNA2008100880557A
Authority: CN
Inventors: 唐年鹏; 王志平
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2008-03-27
Filing date: 2008-03-27
Publication date: 2008-08-20
Anticipated expiration: 2028-03-27
Also published as: CN101246501B

Abstract

The invention relates to a method for aggregating web documents on the same topic, comprising: obtaining the weight of each word in a current web document; successively selecting two or more words with higher weights to compose a search term; retrieving same-topic web documents with the composed search term until the number of same-topic web documents retrieved by some search term exceeds a preset value; and aggregating the current web document with the same-topic web documents. The invention also discloses a system for aggregating web documents on the same topic. The invention addresses the prior-art problem that aggregating same-topic web documents requires processing so much data that web updates are slow, which degrades the user experience. The invention can improve web update speed and the user experience.

Description

A method and system for aggregating same-topic web documents

Technical field

The present invention relates to the field of web document aggregation, and in particular to a method and system for aggregating same-topic web documents.

Background technology

Gathering web documents on the same topic and presenting them to users, so that users can gain a comprehensive and detailed understanding of that topic, is an important part of web services. In the prior art, many websites rely mainly on editors who manually collect web documents on the same topic. Manpower is limited, however, and given the enormous volume of web resources, human editors clearly cannot organize same-topic web documents comprehensively and promptly. At present, some large websites use traditional classification and clustering methods to aggregate same-topic web documents.

Referring to Fig. 1, an existing method for aggregating same-topic web documents is shown, which comprises the following steps.

Step S101: web documents are classified by topic category, and a keyword dictionary is set up for each category. The keywords in a keyword dictionary capture the characteristics of that category of web documents. For example, for web documents about a certain celebrity, the keyword dictionary includes words such as the celebrity's name, the titles of their main songs, and the titles of films in which they starred.

Step S102: for a newly found web document, all words in the document are extracted to form a keyword dictionary.

Step S103: the keyword dictionary of the newly found web document is matched against the keyword dictionary of each category, and the category with the highest word-matching degree is selected; the newly found web document is deemed to share its topic with that category. For example, if the newly found web document is a report on the "911" incident, its keyword dictionary includes words such as "September 11", "terrorist", "aircraft", and "World Trade Center". The keyword dictionary of the "911" event category also contains these words, so the word-matching degree between the two keyword dictionaries will be relatively high.

Step S104: the newly found web document is aggregated into that category of web documents.

Although the above method can aggregate newly found web documents with same-topic web documents fairly well, every retrieved web document must be organized into a keyword dictionary and then matched against the keyword dictionary of every category. Web documents generally need to be subdivided into many categories, so the amount of data to process is excessive, which slows web updates and degrades the user experience.

Moreover, the above method bases its judgment mainly on the keywords in the keyword dictionaries. If the keywords are poorly chosen, or if the keyword dictionaries of web documents on similar topics share most of their keywords, misjudgment easily results, same-topic web documents cannot be aggregated accurately, and the user experience suffers.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for aggregating same-topic web documents, so as to solve the prior-art problem that aggregating same-topic web documents requires processing an excessive amount of data, which slows web updates and degrades the user experience. The method can improve web update speed and the user experience.

Another object of the present invention is to provide a system for aggregating same-topic web documents; the system can improve web update speed and the user experience.

A method for aggregating same-topic web documents according to the present invention comprises: obtaining the weight of each word in a current web document; successively selecting two or more words with higher weights to compose a search term, and retrieving same-topic web documents with the composed search term, until the number of same-topic web documents retrieved by some search term exceeds a preset value; and aggregating the current web document with the same-topic web documents.

Preferably, before the current web document and the same-topic web documents are aggregated, the method further comprises: using a hash table to represent the vector value of each word in the current web document and in the same-topic web documents; calculating the relevance score between each same-topic web document and the current web document from the vector values of the words; and removing same-topic web documents whose relevance score is lower than a preset value.

Preferably, calculating the relevance score between a same-topic web document and the current web document from the vector values of the words specifically comprises: sorting the words in the current web document and in the same-topic web document in ascending order of occurrence frequency; multiplying the vector value of each word in the same-topic web document by the vector value of the corresponding word in the current web document, and summing the products to obtain first data; squaring the vector value of each word in the same-topic web document and summing the squares; squaring the vector value of each word in the current web document and summing the squares; multiplying the two sums together and taking the square root to obtain second data; and taking the quotient of the first data divided by the second data as the relevance score between the same-topic web document and the current web document.

Preferably, successively selecting two or more words with higher weights to compose a search term specifically comprises: sorting the words in descending order of weight; and, starting from the first word, successively combining each word with the next adjacent word to compose a search term.

Preferably, obtaining the weight of each word in the current web document specifically comprises: counting the occurrence frequency of each word in the current web document; obtaining the number of indexed documents that each word hits and the total number of indexed documents; dividing the total number of indexed documents by the number of indexed documents that the word hits, and taking the logarithm; and multiplying the result by the occurrence frequency to obtain the weight of the word.

Preferably, counting the occurrence frequency of each word in the current web document specifically comprises: obtaining the positions at which the word occurs in the current web document and the number of occurrences at each position; multiplying the number of occurrences at each position by the coefficient corresponding to that position; and summing the products to obtain the occurrence frequency of the word.

Preferably, counting the occurrence frequency of each word in the current web document specifically comprises: counting the number of times the word occurs in the current web document; determining whether the word occurs at the topic position of the web document; and, if so, adding a set value to the word's total occurrence count to obtain the occurrence frequency of the word.

A system for aggregating same-topic web documents according to the present invention comprises a weight calculation module, a search-term composition module, a web document retrieval module, and an aggregation module. The weight calculation module is configured to obtain the weight of each word in a current web document. The search-term composition module is configured to successively select two or more words with higher weights to compose a search term. The web document retrieval module is configured to retrieve same-topic web documents with the composed search term, until the number of same-topic web documents retrieved by some search term exceeds a preset value. The aggregation module is configured to aggregate the current web document with the same-topic web documents.

Preferably, the search-term composition module comprises a word sorting submodule and a composition submodule. The word sorting submodule is configured to sort the words in descending order of weight. The composition submodule is configured to, starting from the first word, successively combine each word with the next adjacent word to compose a search term.

Preferably, the system further comprises a vector value module, a relevance calculation module, and a removal module. The vector value module is configured to use a hash table to represent the vector value of each word in the current web document and in the same-topic web documents. The relevance calculation module is configured to calculate the relevance score between each same-topic web document and the current web document from the vector values of the words. The removal module is configured to remove same-topic web documents whose relevance score is lower than a preset value.

Compared with the prior art, the present invention has the following advantages:

The present invention combines the higher-weight words in the current web document into search terms and uses them to retrieve same-topic web documents. Because high-weight words are strongly representative, they reflect the characteristics of the current web document well. A web document retrieved with a search term composed of two or more higher-weight words is therefore very likely to share its topic with the current web document. In selecting same-topic web documents, the present invention only needs to pick suitable words to compose search terms for retrieval. Compared with the prior art shown in Fig. 1, it does not need to compare each retrieved web document one by one against the web documents of every topic category, so less data has to be processed; in application, web updates are fast, which helps improve the user experience.

Description of drawings

Fig. 1 is a flowchart of an existing method for aggregating same-topic web documents;

Fig. 2 is a flowchart of a first embodiment of the method of the present invention for aggregating same-topic web documents;

Fig. 3 is a flowchart of the method of the present invention for calculating the weight of each word in the current web document;

Fig. 4 is a flowchart of a second embodiment of the method of the present invention for aggregating same-topic web documents;

Fig. 5 is a schematic diagram of a first embodiment of the system of the present invention for aggregating same-topic web documents;

Fig. 6 is a schematic structural diagram of the search-term composition module of the present invention;

Fig. 7 is a schematic diagram of a second embodiment of the system of the present invention for aggregating same-topic web documents.

Embodiment

To make the above objects, features, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

The present invention composes search terms from the higher-weight words in the current web document, uses the search terms to retrieve web documents on the same topic as the current web document, and aggregates the retrieved web documents with the current web document. The method can be applied in many related fields to let users read related material in one place, for example topical news aggregation or topical event aggregation.

Referring to Fig. 2, a first embodiment of the method of the present invention for aggregating same-topic web documents is shown; the specific steps are as follows.

Step S201: obtain the weight of each word in the current web document. The words in the current web document are segmented; function words with no substantive meaning, such as prepositions, modal particles, and interjections, are removed; words with substantive meaning, such as nouns and verbs, are extracted; and the weights of the extracted words are calculated in turn. A weight represents how closely the word relates to the topic of the current web document: the closer the relation, the higher the weight.

For example, if the current web document is a patent document, the words in it that are closely related to patents have relatively high weights, such as "patent", "application", "invalid", "examination", and "reexamination".

Step S202: successively select two or more words with higher weights to compose a search term, and retrieve same-topic web documents with the composed search term, until the number of same-topic web documents retrieved by some search term exceeds a preset value. The preset value may be set to a value greater than 10.

Same-topic web documents are first retrieved with the first search term selected, and it is determined whether the number of same-topic web documents retrieved exceeds the preset value. If so, composition of search terms stops and the retrieved web documents are extracted. If not, further search terms are composed and retrieval is repeated until the number of same-topic web documents retrieved by some search term exceeds the preset value.

For example, in the patent document above, the two higher-weight words "patent" and "application" are selected to compose the search term "patent application", which is used to retrieve same-topic web documents. It is then determined whether the number of retrieved web documents exceeds 10. If so, retrieval stops. If not, further higher-weight words in the patent document are selected to compose a search term, for example "patent" and "invalid" composing the search term "patent invalid", and retrieval is repeated until the number of web documents retrieved by some search term exceeds 10.
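The retrieval loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `retrieve_same_topic`, the `search` callable, and the toy index are hypothetical, and returning an empty list when no search term retrieves enough documents is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, Iterable, List


def retrieve_same_topic(
    candidate_terms: Iterable[str],
    search: Callable[[str], List[str]],
    preset_value: int = 10,
) -> List[str]:
    """Try search terms in order; stop at the first one whose hit count
    exceeds preset_value and return its hits."""
    for term in candidate_terms:
        hits = search(term)
        if len(hits) > preset_value:
            return hits  # enough documents found: stop composing terms
    return []  # assumption: nothing is aggregated if no term succeeds


# Hypothetical toy index standing in for a real search back end.
toy_index = {
    "patent application": [f"doc{i}" for i in range(5)],
    "patent invalid": [f"doc{i}" for i in range(12)],
}
hits = retrieve_same_topic(
    ["patent application", "patent invalid"],
    lambda term: toy_index.get(term, []),
    preset_value=10,
)
```

Here the first search term returns only 5 documents, so the loop moves on to the second, which returns 12 and ends the search.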

The present invention can select two or more higher-weight words to compose a search term in various ways; the aim is for the search term to reflect the topical characteristics of the current web document as closely as possible.

For example, the words whose weights exceed a set value form a word pool, and two or more words are picked at random from the pool to compose a search term.

As another example, the words are sorted in descending order of weight, and the first word is combined in turn with the second, third, fourth, and subsequent words to compose search terms. For instance, if the sorted words are A, B, C, D, ..., the search terms composed in turn are AB, AC, AD, ....

As a further example, the words are sorted in descending order of weight and, starting from the first word, each word is combined in turn with the next adjacent word to compose a search term. For instance, if the sorted words are A, B, C, D, ..., the search terms composed in turn are AB, BC, CD, ....
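The two ordered pairing schemes above (AB, AC, AD, ... and AB, BC, CD, ...) can be sketched as below. The function names and the example weights are hypothetical, chosen only for illustration, and each search term is represented here as a pair of words.

```python
from typing import Dict, List, Tuple


def sort_by_weight(weights: Dict[str, float]) -> List[str]:
    """Return the words in descending order of weight."""
    return sorted(weights, key=weights.get, reverse=True)


def pair_with_first(words: List[str]) -> List[Tuple[str, str]]:
    """Combine the first word with each later word: AB, AC, AD, ..."""
    if not words:
        return []
    return [(words[0], w) for w in words[1:]]


def pair_adjacent(words: List[str]) -> List[Tuple[str, str]]:
    """Combine each word with its next neighbour: AB, BC, CD, ..."""
    return list(zip(words, words[1:]))


# Hypothetical weights; only their relative order matters here.
ordered = sort_by_weight({"C": 2.0, "A": 4.0, "D": 1.0, "B": 3.0})
first_scheme = pair_with_first(ordered)   # [("A","B"), ("A","C"), ("A","D")]
adjacent_scheme = pair_adjacent(ordered)  # [("A","B"), ("B","C"), ("C","D")]
```

Both schemes try the highest-weight pair AB first; they differ only in which pairs are tried when AB does not retrieve enough documents.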

Step S203: aggregate the current web document with the same-topic web documents.

The present invention combines the higher-weight words in the current web document into search terms and uses them to retrieve same-topic web documents. Because high-weight words are strongly representative, they reflect the characteristics of the current web document well. A web document retrieved with a search term composed of two or more higher-weight words is therefore very likely to share its topic with the current web document.

In selecting same-topic web documents, the present invention only needs to pick suitable words to compose search terms for retrieval. It does not need to compare each retrieved web document one by one against the web documents of every topic category, so less data has to be processed; in application, web updates are fast, which helps improve the user experience.

In step S201 above, the present invention can calculate the weight of each word in the current web document in various ways. The main idea is to calculate the weight with a suitable formula from the occurrence frequency of the word in the current web document and from how common the word is across web documents in general.

Referring to Fig. 3, the method of the present invention for calculating the weight of each word in the current web document is shown; the specific steps are as follows.

Step S301: count the occurrence frequency of each word in the current web document. The more often a word occurs in the current web document and the more important the positions at which it occurs, the greater its occurrence frequency. The present invention can count occurrence frequency in various ways; two preferred counting schemes are described here.

For example, the positions at which the word occurs in the current web document and the number of occurrences at each position are obtained; the number of occurrences at each position is multiplied by the coefficient corresponding to that position; and the sum of the products is taken as the occurrence frequency of the word. For instance, if a word occurs once in the title of the current web document and 15 times in the body text, and the coefficient of the title position is 8 while that of the body position is 1, the occurrence frequency of the word is 1 × 8 + 15 × 1 = 23.

As another example, the number of times the word occurs in the current web document is counted, and it is determined whether the word occurs at the topic position of the web document; if so, a set value is added to the word's total occurrence count to give its occurrence frequency. For instance, if a word occurs 12 times in total in the current web document, it occurs at the topic position, and the set value is 10, the occurrence frequency of the word is 12 + 10 = 22.
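The two counting schemes above can be sketched as follows, reproducing the two worked examples. The function names and the position labels "title" and "body" are hypothetical and used only for illustration.

```python
from typing import Dict


def position_weighted_frequency(
    counts_by_position: Dict[str, int],
    coefficients: Dict[str, float],
) -> float:
    """First scheme: sum over positions of (count at position) x (coefficient)."""
    return sum(n * coefficients[pos] for pos, n in counts_by_position.items())


def bonus_frequency(total_count: int, in_topic_position: bool, set_value: int) -> int:
    """Second scheme: add a set value if the word occurs at the topic position."""
    return total_count + set_value if in_topic_position else total_count


# First worked example: 1 occurrence in the title (coefficient 8),
# 15 occurrences in the body (coefficient 1).
first = position_weighted_frequency({"title": 1, "body": 15}, {"title": 8, "body": 1})
# first == 23

# Second worked example: 12 occurrences in total, word occurs at the
# topic position, set value 10.
second = bonus_frequency(12, True, 10)
# second == 22
```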

Step S302: obtain the number of indexed documents that each word hits and the total number of indexed documents. The web server obtains the total number of indexed documents across all web documents by traversal, then searches the indexed documents for the word and counts the number of indexed documents that the word hits.

Step S303: calculate the weight of the word. The weight formula is:

Word weight = TF × lg (N/n);

where TF is the occurrence frequency of the word, N is the total number of indexed documents, and n is the number of indexed documents that the word hits.
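A minimal sketch of this weight calculation is given below. It assumes the logarithm is base 10, since the text only says that a logarithm is taken, and it rejects non-positive document counts, a case the text does not discuss. The function name `word_weight` and the sample numbers are hypothetical.

```python
import math


def word_weight(tf: float, total_docs: int, hit_docs: int) -> float:
    """Weight = TF x log10(N / n), with N the total number of indexed
    documents and n the number of indexed documents the word hits."""
    if total_docs <= 0 or hit_docs <= 0:
        # Assumption: a word with no hits cannot be weighted by this formula.
        raise ValueError("document counts must be positive")
    return tf * math.log10(total_docs / hit_docs)


# Hypothetical numbers: occurrence frequency 23, 1000 indexed documents,
# 10 of which contain the word, so log10(100) = 2 and the weight is 46.
weight = word_weight(23, 1000, 10)
```

A word that appears in every indexed document gets N/n = 1 and therefore a weight of 0, which matches the idea that very common words say little about the topic of the current document.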

Of course, the present invention can also use various other weight formulas, for example:

Word weight = TF × K (N/n), where K is a coefficient.

As another example:

Word weight = TF × (N/n) + Z, where Z is a constant.

The present invention calculates the weight of a word with respect to the current web document from the occurrence frequency of the word in the current web document and from how common the word is across web documents in general. This weight reflects well how representative the word is of the characteristics of the current web document.

To further ensure that the retrieved same-topic web documents are highly relevant to the current web document, the present invention can screen the retrieved web documents in various ways and keep those that are highly relevant to the current web document.

Referring to Fig. 4, a second embodiment of the method of the present invention for aggregating same-topic web documents is shown; the specific steps are as follows.

Step S401: obtain the weight of each word in the current web document.

Step S402: successively select two or more words with higher weights to compose a search term, and retrieve same-topic web documents with the composed search term, until the number of same-topic web documents retrieved by some search term exceeds a preset value.

Step S403: use a hash table to represent the vector value of each word in the current web document and in the retrieved web documents.

Step S404: sort the words in the current web document and in the retrieved web documents in ascending order of occurrence frequency.

Step S405: calculate the relevance score between each retrieved web document and the current web document from the vector values of the words. The formula is:

Sim(d, q) = \frac{\sum_{i} (a_{i} \times b_{i})}{\sqrt{\sum_{i} a_{i}^{2} \times \sum_{i} b_{i}^{2}}};

where a_i is the vector value of each word in the current web document and b_i is the vector value of each word in the retrieved web document.
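The relevance formula above is the cosine of the angle between the two word vectors. A minimal sketch follows, using Python dictionaries as the hash tables that map each word to its vector value. The function name `relevance` and the sample vectors are hypothetical; the sketch aligns words by key lookup rather than by the sorted order of step S404, and it returns 0 when either vector is all zeros, a case the text does not cover.

```python
import math
from typing import Dict


def relevance(current: Dict[str, float], retrieved: Dict[str, float]) -> float:
    """Sim(d, q) = sum(a_i * b_i) / sqrt(sum(a_i^2) * sum(b_i^2))."""
    # Words missing from the retrieved document contribute 0 to the dot product.
    dot = sum(a * retrieved.get(word, 0.0) for word, a in current.items())
    norm = math.sqrt(
        sum(a * a for a in current.values())
        * sum(b * b for b in retrieved.values())
    )
    return dot / norm if norm else 0.0  # assumption: empty vectors score 0


# Hypothetical vector values: dot = 9, norm = sqrt(25 * 9) = 15, score = 0.6.
score = relevance({"patent": 3.0, "application": 4.0}, {"patent": 3.0})
```

Identical vectors score 1 and vectors with no words in common score 0, so a threshold between these extremes can separate closely related documents from loosely related ones.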

Step S406: remove same-topic web documents whose relevance score is lower than a preset value. The preset value can be adjusted according to the topic type of the current web document.

Step S407: aggregate the current web document with the same-topic web documents.

By calculating the relevance score between the current web document and each retrieved web document from their word vectors, the present invention selects the web documents that are more relevant to the current web document, which further improves the precision of aggregating same-topic web documents.
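The removal and aggregation steps (S406 and S407) can be sketched as below, taking relevance scores that have already been computed. The function name, the document identifiers, and the threshold of 0.5 are hypothetical; the text only says that the preset value can be adjusted by topic type.

```python
from typing import List, Tuple


def filter_and_aggregate(
    current_doc: str,
    scored_docs: List[Tuple[str, float]],
    preset_value: float,
) -> List[str]:
    """Drop retrieved documents whose relevance score is lower than
    preset_value, then group the survivors with the current document."""
    kept = [doc for doc, score in scored_docs if score >= preset_value]
    return [current_doc] + kept


# Hypothetical scores for three retrieved documents.
group = filter_and_aggregate(
    "current",
    [("d1", 0.9), ("d2", 0.2), ("d3", 0.5)],
    preset_value=0.5,
)
# group == ["current", "d1", "d3"]
```

A document whose score equals the preset value is kept here, because the text removes only documents whose score is lower than the preset value.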

Based on the above method for aggregating same-topic web documents, the present invention also provides a system for aggregating same-topic web documents; the system can improve web update speed and the user experience.

Referring to Fig. 5, a first embodiment of the system of the present invention for aggregating same-topic web documents is shown, comprising a weight calculation module 51, a search-term composition module 52, a web document retrieval module 53, and an aggregation module 54.

The weight calculation module 51 obtains the weight of each word in the current web document. A weight represents how closely the word relates to the topic of the current web document: the closer the relation, the higher the weight. The weight calculation module 51 sends the weights it obtains to the search-term composition module 52.

The search-term composition module 52 successively selects two or more words with higher weights to compose a search term. It may form a word pool from the words whose weights exceed a set value and pick two or more words at random from the pool to compose a search term. It may also sort the words in descending order of weight and combine the first word in turn with the second, third, fourth, and subsequent words. It may also sort the words in descending order of weight and, starting from the first word, combine each word in turn with the next adjacent word. The search-term composition module 52 sends the composed search terms to the web document retrieval module 53.

The web document retrieval module 53 retrieves same-topic web documents with the composed search terms, until the number of same-topic web documents retrieved by some search term exceeds a preset value. It first retrieves same-topic web documents with the first search term selected and determines whether the number retrieved exceeds the preset value. If so, it extracts the retrieved web documents. If not, it obtains further search terms and retrieves again, until the number of same-topic web documents retrieved by some search term exceeds the preset value. The web document retrieval module 53 sends the extracted web documents to the aggregation module 54.

The aggregation module 54 aggregates the current web document with the retrieved web documents.

Referring to Fig. 6, the search-term composition module 52 of the present invention comprises a word sorting submodule 521 and a composition submodule 522. The word sorting submodule 521 sorts the words in descending order of weight and sends them to the composition submodule 522. Starting from the first word, the composition submodule 522 successively combines each word with the next adjacent word to compose a search term.
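How modules 51 to 54 might hand data to one another can be sketched as a single class. This is an illustrative arrangement under stated assumptions, not the structure of the patented system: the class name `AggregationSystem`, the injected `weigh` and `search` callables, the stub data, and the choice to join two words with a space are all hypothetical.

```python
from typing import Callable, Dict, List


class AggregationSystem:
    def __init__(
        self,
        weigh: Callable[[str], Dict[str, float]],   # stands in for module 51
        search: Callable[[str], List[str]],         # back end used by module 53
        preset_value: int = 10,
    ) -> None:
        self.weigh = weigh
        self.search = search
        self.preset_value = preset_value

    def compose_terms(self, weights: Dict[str, float]) -> List[str]:
        """Module 52: sort by descending weight (521), pair neighbours (522)."""
        ordered = sorted(weights, key=weights.get, reverse=True)
        return [f"{a} {b}" for a, b in zip(ordered, ordered[1:])]

    def retrieve(self, terms: List[str]) -> List[str]:
        """Module 53: stop at the first term with more than preset_value hits."""
        for term in terms:
            hits = self.search(term)
            if len(hits) > self.preset_value:
                return hits
        return []  # assumption: no documents if every term falls short

    def aggregate(self, current_doc: str, hits: List[str]) -> List[str]:
        """Module 54: group the current document with the retrieved ones."""
        return [current_doc] + [h for h in hits if h != current_doc]

    def run(self, current_doc: str) -> List[str]:
        weights = self.weigh(current_doc)
        return self.aggregate(current_doc, self.retrieve(self.compose_terms(weights)))


# Hypothetical stubs for the weighting step and the search back end.
stub_weights = {"patent": 5.0, "application": 4.0, "invalid": 3.0}
stub_index = {
    "patent application": ["d1", "d2"],
    "application invalid": ["d1", "d2", "d3"],
}
system = AggregationSystem(
    weigh=lambda doc: stub_weights,
    search=lambda term: stub_index.get(term, []),
    preset_value=2,
)
result = system.run("d0")  # ["d0", "d1", "d2", "d3"]
```

Passing the weighting and search functions in as parameters keeps the sketch independent of any particular index or word segmenter.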

The present invention can also use a relevance calculation module to calculate the relevance between each retrieved web document and the current web document and remove the less relevant web documents, which further improves the quality of the aggregated web documents.

Referring to Fig. 7, a second embodiment of the system of the present invention for aggregating same-topic web documents is shown, comprising a weight calculation module 51, a search-term composition module 52, a web document retrieval module 53, an aggregation module 54, a vector value module 55, a relevance calculation module 56, and a removal module 57.

The vector value module 55 uses a hash table to represent the vector value of each word in the current web document and in the same-topic web documents, and sends the vector values to the relevance calculation module 56.

The relevance calculation module 56 calculates the relevance score between each retrieved web document and the current web document from the vector values of the words. The formula is:

Sim(d, q) = \frac{\sum_{i} (a_{i} \times b_{i})}{\sqrt{\sum_{i} a_{i}^{2} \times \sum_{i} b_{i}^{2}}};

where a_i is the vector value of each word in the current web document and b_i is the vector value of each word in the retrieved web document. The relevance calculation module 56 sends the relevance score between each retrieved web document and the current web document to the removal module 57.

The removal module 57 removes the web documents whose relevance score is lower than a preset value and sends the remaining web documents to the aggregation module 54. The aggregation module 54 aggregates these web documents.

The functions and roles of the weight calculation module 51, the search-term composition module 52, and the web document retrieval module 53 in this embodiment are the same as in the embodiment shown in Fig. 5 and are not repeated here.

The method and system for aggregating same-topic web documents provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Those of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A method for aggregating same-topic web documents, characterized in that it comprises:

obtaining the weight of each word in a current web document;

successively selecting two or more words with higher weights to compose a search term, and retrieving same-topic web documents with the composed search term, until the number of same-topic web documents retrieved by some search term exceeds a preset value;

aggregating the current web document with the same-topic web documents.

2. The method of claim 1, characterized in that, before the current web document and the same-topic web documents are aggregated, the method further comprises:

using a hash table to represent the vector value of each word in the current web document and in the same-topic web documents;

calculating the relevance score between each same-topic web document and the current web document from the vector values of the words;

removing same-topic web documents whose relevance score is lower than a preset value.

3. The method of claim 2, characterized in that calculating the relevance value between the same-subject network document and the current network document according to the vector values of the words specifically comprises:

arranging the words of the current network document and of the same-subject network document in ascending order of occurrence frequency;

multiplying the vector value of each word in the same-subject network document by the vector value of the corresponding word in the current network document, and summing the products to obtain first data;

squaring the vector value of each word in the same-subject network document and summing the squares; squaring the vector value of each word in the current network document and summing the squares; multiplying the two sums and taking the square root of the product to obtain second data;

taking the quotient of the first data divided by the second data as the relevance value between the same-subject network document and the current network document.
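
The computation recited in claim 3 corresponds to the cosine similarity of two word vectors. A minimal sketch follows, assuming each document is held as a Python `dict` (a hash table) mapping words to vector values; the function name `relevance` and the zero-vector fallback are assumptions. The sorting step of the claim is omitted here because a dictionary lookup already pairs each word with its counterpart.

```python
import math
from typing import Dict


def relevance(current: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Relevance of a candidate document to the current document,
    computed as first data divided by second data."""
    # First data: sum of products of the vector values of matching words.
    first = sum(v * current[w] for w, v in candidate.items() if w in current)
    # Second data: square root of (sum of squares) times (sum of squares).
    candidate_squares = sum(v * v for v in candidate.values())
    current_squares = sum(v * v for v in current.values())
    second = math.sqrt(candidate_squares * current_squares)
    # Assumption: an empty or all-zero vector yields a relevance of 0.
    return first / second if second else 0.0
```

Under this sketch a document compared with itself scores 1, and two documents with no words in common score 0.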

4, the method for claim 1 is characterized in that, chooses the higher word composition term of two or more weighted values successively and is specially:

The weighted value descending sort pressed in above-mentioned each word;

From first word, will go up a word successively and form term with next word that this word faces mutually.
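
As an illustrative sketch of claim 4, the words can be sorted by weight and each word paired with its immediate successor. The function name `compose_terms` and the sample weights in the test are hypothetical; words with equal weight keep their insertion order, which is a choice of this sketch rather than something the claim specifies.

```python
from typing import Dict, List, Tuple


def compose_terms(weights: Dict[str, float]) -> List[Tuple[str, str]]:
    """Sort words by descending weight and pair each word with the
    next adjacent word to form two-word search terms."""
    ranked = sorted(weights, key=lambda word: weights[word], reverse=True)
    # zip stops at the shorter sequence, so the last word has no pair of its own.
    return list(zip(ranked, ranked[1:]))
```

With three words this yields two search terms: (first, second) and (second, third).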

5. The method of any one of claims 1 to 4, characterized in that obtaining the weight value of each word in the current network document specifically comprises:

counting the occurrence frequency of each word in the current network document, and obtaining the number of indexed documents hit by each word and the total number of indexed documents;

dividing the total number of indexed documents by the number of indexed documents hit by the word, taking the logarithm of the quotient, and multiplying the result by the occurrence frequency to obtain the weight value of the word.
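
The weight in claim 5 has the form of a term frequency multiplied by an inverse document frequency. A minimal sketch follows; the function name `word_weight` is hypothetical, and the natural logarithm is an assumption because the claim does not state the base.

```python
import math


def word_weight(frequency: float, total_indexed: int, hit_indexed: int) -> float:
    """Weight = frequency * log(total indexed documents / indexed documents hit)."""
    if hit_indexed <= 0:
        # Assumption: a word that hits no indexed document has no defined weight.
        raise ValueError("hit_indexed must be positive")
    return frequency * math.log(total_indexed / hit_indexed)
```

A word that hits every indexed document receives weight 0, so very common words rank last when search terms are formed.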

6. The method of claim 5, characterized in that counting the occurrence frequency of each word in the current network document specifically comprises:

obtaining the positions at which the word occurs in the current network document and the number of occurrences at each position;

multiplying the number of occurrences at each position by the coefficient corresponding to that position, and summing the products as the occurrence frequency of the word.
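
A sketch of the position-weighted count in claim 6, for illustration. The position names and coefficient values used in the test are hypothetical, as is the default coefficient of 1.0 for a position without an entry; the claim does not give concrete values.

```python
from typing import Dict


def weighted_frequency(
    counts_by_position: Dict[str, int],
    coefficients: Dict[str, float],
) -> float:
    """Sum of (occurrences at a position * coefficient of that position)."""
    return sum(
        count * coefficients.get(position, 1.0)  # assumed default coefficient
        for position, count in counts_by_position.items()
    )
```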

7. The method of claim 5, characterized in that counting the occurrence frequency of each word in the current network document specifically comprises:

counting the number of occurrences of the word in the current network document;

determining whether the word occurs at the subject position of the network document and, if so, adding a fixed value to the total number of occurrences of the word to obtain the occurrence frequency of the word.
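
The alternative count in claim 7 can be sketched as follows, assuming the document has already been split into a list of subject tokens and a list of body tokens. The function name, the token lists, and the bonus value of 2.0 are assumptions made for illustration only.

```python
from typing import List


def frequency_with_subject_bonus(
    word: str,
    subject_tokens: List[str],
    body_tokens: List[str],
    bonus: float = 2.0,  # hypothetical fixed value
) -> float:
    """Total occurrences of the word, plus a fixed bonus if the word
    occurs at the subject position."""
    total = subject_tokens.count(word) + body_tokens.count(word)
    if word in subject_tokens:
        return total + bonus
    return float(total)
```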

8. A system for aggregating same-subject network documents, characterized by comprising a weight value calculation module, a search term composition module, a network document retrieval module, and an aggregation module:

the weight value calculation module is configured to obtain a weight value for each word in a current network document;

the search term composition module is configured to successively select two or more words with higher weight values to form search terms;

the network document retrieval module is configured to retrieve same-subject network documents with each search term so formed, until the number of same-subject network documents retrieved by one of the search terms exceeds a preset value;

the aggregation module is configured to aggregate the current network document and the same-subject network documents.

9. The system of claim 8, characterized in that the search term composition module comprises a word sorting submodule and a composition submodule:

the word sorting submodule is configured to sort the words in descending order of weight value;

the composition submodule is configured to, starting from the first word, successively combine each word with the next word adjacent to it to form a search term.

10. The system of claim 8 or 9, characterized by further comprising a vector value module, a relevance calculation module, and a removal module:

the vector value module is configured to use a hash table to represent the vector value of each word in the current network document and in the same-subject network documents;

the relevance calculation module is configured to calculate a relevance value between each same-subject network document and the current network document according to the vector values of the words;

the removal module is configured to remove any same-subject network document whose relevance value is below a preset value.