CN101079025B

CN101079025B - File correlation computing system and method

Info

Publication number: CN101079025B
Application number: CN2006100360943A
Authority: CN
Inventors: 丁江伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2006-06-19
Filing date: 2006-06-19
Publication date: 2010-06-16
Anticipated expiration: 2026-06-19
Also published as: CN101079025A

Abstract

The invention discloses a document relevance computing system comprising a document preprocessing module and a word segmentation module connected in sequence. The input of the document preprocessing module is at least one document to be analyzed, and the output of the word segmentation module is a corresponding first vocabulary list. The system further comprises a sememe processing module and a document relevance computing module. The sememe processing module converts the words of the first vocabulary list into sememes and calculates the weight of each sememe, obtaining at least one topic semantic vector corresponding to the at least one document. The document relevance computing module is connected to the topic semantic vector computing module and calculates the relevance of at least two topic semantic vectors. The invention also discloses a document relevance computing method. The invention eliminates the effects of vocabulary sparsity and polysemous words, improving the accuracy of document relevance calculation.

Description

A document relevance (file correlation) computing system and method

Technical field

The present invention relates to network communication technology, and more particularly to a document relevance computing system and method.

Background technology

Document relevance is a decimal between 0 and 1 that characterizes the degree of semantic relatedness between two documents. For example, the relevance of two identical documents is 1, while the relevance of a document about programming techniques and a document about politics and society is far less than 1 and close to 0. Document relevance calculation has many applications, such as the classification and clustering of documents and the retrieval of related articles.

At present, document relevance calculation is based on topic word extraction: the topic words of the documents to be compared are first extracted by computation, and the relevance of the documents is then derived by calculating the relevance between their topic words.

There are two main existing topic extraction methods. The first is title-based topic extraction: a document parser parses the document, finds its title, and takes the title as the topic of the document. This method is clearly too simple to be used for calculating document relevance.

The second is topic extraction based on word frequency. With the development of statistical natural language processing, methods that identify the topic of a document from its high-frequency keywords have come into wide use, especially for extracting the topics of web pages. Specifically, the tags are first stripped from the web page source, the article content is then segmented into words and the word frequencies are counted, and finally the keywords are sorted by frequency and the top N are given as the topic of the article. However, ideographic languages are highly developed: synonymy (one meaning, many words) and polysemy (one word, many meanings) are universal, and the use of rhetorical devices makes vocabulary sparsity an objective fact. For short articles such as web pages in particular, the overall performance of this algorithm is unsatisfactory, which in turn makes the document relevance calculation unsatisfactory.

Summary of the invention

The object of the present invention is to address the defects of the prior art by providing a document relevance computing system and method which, based on sememe-set semantic analysis, eliminates the negative influence of both polysemous words and vocabulary sparsity on relevance.

The technical solution of the present invention is a document relevance computing system comprising a document preprocessing module and a word segmentation module connected in sequence, where the input of the document preprocessing module is at least one document to be analyzed and the output of the word segmentation module is a first vocabulary list corresponding to the at least one document. The word segmentation module also tags the segmented words with their parts of speech. The system further comprises: a segmentation post-processing module, connected in series between the word segmentation module and the sememe processing module, which removes stop words and function words from the first vocabulary list according to part of speech to obtain a second vocabulary list; a sememe processing module, which annotates the words of the second vocabulary list with sememes to form a third vocabulary list, determines the weights of the multiple sememes corresponding to each polysemous word in the third vocabulary list or determines a unique sememe for each polysemous word to obtain a first sememe table, and calculates a weight for every sememe in the first sememe table to obtain a topic semantic vector sorted by weight; and a document relevance computing module, connected to the sememe processing module, which performs relevance calculation on at least two topic semantic vectors.

The document preprocessing module converts input documents of different formats into a standard format and extracts the document body. The word segmentation module segments the output of the document preprocessing module into words, obtaining the first vocabulary list. The sememe processing module comprises: a sememe annotation module, which uses a sememe dictionary to annotate the words of the second vocabulary list with sememes, forming the third vocabulary list; a word sense disambiguation module, which determines the weights of the multiple sememes corresponding to each polysemous word in the third vocabulary list, or determines a unique sememe for each polysemous word, obtaining the first sememe table; and a topic semantic vector computing module, which calculates a weight for every sememe in the first sememe table, obtaining the topic semantic vector sorted by weight.

As an improvement of the present invention, the system further comprises a topic semantic vector library whose input is connected to the sememe processing module and whose output is connected to the document relevance computing module, and which stores the topic semantic vectors output by the sememe processing module. The document relevance computing module performs relevance calculation on at least two topic semantic vectors. The topic semantic vectors are obtained from the sememe processing module, from the topic semantic vector library, or separately from both.

The present invention also provides a document relevance computing method comprising the following steps: (a) the document preprocessing module converts input documents of different formats into a standard format and extracts the document body; (b) the word segmentation module segments the output of the document preprocessing module into words and tags the segmented words with their parts of speech, obtaining a first vocabulary list, and the segmentation post-processing module removes the stop words and function words from the first vocabulary list, obtaining a second vocabulary list; (c) the sememe processing module annotates the words of the second vocabulary list with sememes to form a third vocabulary list, processes the words of the third vocabulary list by determining the weights of the multiple sememes corresponding to each polysemous word or determining a unique sememe for each polysemous word to obtain a first sememe table, and calculates a weight for every sememe in the first sememe table, obtaining a topic semantic vector sorted by weight; (d) the document relevance computing module performs a calculation on the topic semantic vectors of at least two documents to be analyzed, obtaining the relevance of the at least two documents.

In step (d), the topic semantic vectors of the at least two documents are obtained from the sememe processing module, from a topic semantic vector library connected to the document relevance computing module, or separately from both the sememe processing module and the topic semantic vector library.

Step (a) further comprises: the document preprocessing module obtains the classification information and title information of the corresponding document.

In step (c), the topic semantic vector is obtained as follows: (c1) the sememe annotation module uses a sememe dictionary to annotate the words of the second vocabulary list with sememes, forming the third vocabulary list; (c2) the word sense disambiguation module processes the sememe-annotated words of the third vocabulary list, determining the weights of the multiple sememes corresponding to each polysemous word, or determining a unique sememe for each polysemous word, and obtaining the first sememe table; (c3) the topic semantic vector computing module calculates a weight for every sememe in the first sememe table, obtaining the topic semantic vector sorted by weight.

The beneficial effects of the present invention are: 1. Semantic analysis based on a sememe set avoids the vocabulary sparsity problem, so that relevance analysis works well even between short documents, improving the accuracy of document relevance calculation. 2. Word sense disambiguation eliminates the negative influence of polysemous words on the relevance calculation, improving its accuracy. 3. The pre-classification, title information, and display attributes of a document are fully taken into account, so the topic of the document can be extracted accurately, which further improves the accuracy of document relevance calculation.

Description of drawings

Fig. 1 is a structural diagram of the document relevance computing system of the present invention.

Fig. 2 is a flowchart of the document relevance computing method of the present invention.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and a preferred embodiment.

As shown in Fig. 1, the document relevance computing system of the present invention comprises a document preprocessing module 1, a word segmentation module 2, a segmentation post-processing module 3, a sememe processing module, and a document relevance computing module 8, connected in sequence. The sememe processing module comprises a sememe annotation module 4, a word sense disambiguation module 5, and a topic semantic vector computing module 6, connected in sequence. If required, the system can also include a topic semantic vector library 7, whose input is connected to the topic semantic vector computing module 6 and whose output is connected to the document relevance computing module 8.

The document preprocessing module 1 converts input documents of different formats into a standard format and extracts the document body. The documents of different formats can include web pages, Word documents, text documents, PDF files, and so on. The standard format can be a text document. In a specific implementation, if the document title and classification information can be extracted from the converted standard format, the document preprocessing module can also extract the title and classification information of the converted standard document, which improves the accuracy of topic extraction and therefore of document relevance calculation. For example, if all documents handled by the system are web pages, the standard format is defined as the web page format, and the document preprocessing module needs to be able to extract the web page title and classification information. The document preprocessing module 1 is connected to the word segmentation module 2.

The word segmentation module 2 segments the output of the document preprocessing module 1 into words. In this embodiment, the word segmentation module 2 uses a dictionary to cut the converted body, title, and classification of a web page into words. For example, segmenting "I am a student" yields four words: "I", "am", "a", and "student". Existing word segmentation algorithms fall into three broad classes: methods based on string matching, methods based on understanding, and methods based on statistics. This embodiment uses a segmentation method based on string matching. This method, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds and a word is identified.
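The string-matching approach can be illustrated with a minimal sketch. The patent only says that matching follows "some strategy"; forward maximum matching is assumed here as one common choice, and the function name `forward_max_match`, the toy dictionary, and the Chinese sentence (taken to be the original of the "I am a student" example) are all hypothetical.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest substring that is a dictionary entry (assumed strategy)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real machine dictionary would be far larger.
toy_dictionary = {"我", "是", "一个", "学生"}
segments = forward_max_match("我是一个学生", toy_dictionary)
print(segments)
```

On this input the sketch yields four words, matching the example in the text. Unknown strings fall back to single characters, which is a simplification.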

In the present invention, the word segmentation module 2 also tags the segmented words with their parts of speech, so that the segmentation post-processing module 3 can remove stop words, function words, and the like from the vocabulary list according to part of speech.

The functions of the segmentation post-processing module 3 include, but are not limited to, removing stop words and function words from the output of the word segmentation module 2, thereby discarding information irrelevant to the topic.
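A minimal sketch of this filtering step follows. The stop-word list, the part-of-speech tag names, and the function name `post_process` are assumptions made for illustration; the patent does not specify a tag set.

```python
# Hypothetical stop-word list and function-word tags; the patent names neither.
STOP_WORDS = {"is", "a"}
FUNCTION_POS = {"prep", "conj", "aux", "det"}


def post_process(first_vocabulary):
    """Turn the first vocabulary list of (word, pos) pairs into the second
    vocabulary list by dropping stop words and function words."""
    return [
        (word, pos)
        for word, pos in first_vocabulary
        if word not in STOP_WORDS and pos not in FUNCTION_POS
    ]


first_vocabulary = [
    ("the", "det"), ("student", "n"), ("is", "v"),
    ("in", "prep"), ("school", "n"),
]
second_vocabulary = post_process(first_vocabulary)
print(second_vocabulary)
```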

The sememe annotation module 4 uses the sememe dictionary to annotate the segmented words with sememes. It is connected to the segmentation post-processing module 3, the word sense disambiguation module 5, and the sememe dictionary.

Because the sememe dictionary gives several sememes for a polysemous word, the word sense disambiguation module 5 is needed to determine, from the context, the probable weight of each sememe corresponding to that word. A simpler method can of course be used: determine from the context one specific sememe as the meaning of the polysemous word. This embodiment uses the second method. The determination can be made with methods such as a Bayesian algorithm, a decision tree, or an information entropy calculation.
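The second method, picking one sememe per polysemous word from its context, can be sketched as below. This is not the Bayesian, decision-tree, or entropy method named in the text; it substitutes a much simpler count of overlapping context cues, purely for illustration. The cue table `SEMEME_CUES`, the sememe labels, and the function name `disambiguate` are hypothetical.

```python
# Hypothetical cue table: for each polysemous word, the candidate sememes
# and the context words that suggest each one.
SEMEME_CUES = {
    "bank": {
        "institution|finance": {"money", "loan", "account"},
        "land|riverside": {"river", "water", "fishing"},
    }
}


def disambiguate(word, context, cues=SEMEME_CUES):
    """Return the single sememe whose cue words overlap the context most,
    or None if the word is not listed as polysemous."""
    candidates = cues.get(word)
    if not candidates:
        return None
    context_set = set(context)
    # Ties resolve to the first candidate listed; a simplification.
    return max(candidates, key=lambda s: len(candidates[s] & context_set))


print(disambiguate("bank", ["river", "fishing", "boat"]))
print(disambiguate("bank", ["loan", "interest", "money"]))
```

A probabilistic variant would replace the overlap count with a per-sememe score learned from annotated text, which also gives the per-sememe weights required by the first method.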

In extracting the document topic, the present invention does not use words as the unit of computation. Instead it uses the sememe dictionary to convert words into sememe representations; it is a semantic analysis technique based on a sememe set. Sememes (semantic primitives) are the most basic elements of an ideographic language. They can be understood as a set of meaning symbols in terms of which all other words can be defined. A major difficulty in natural language processing is vocabulary sparsity, and converting keywords into a sememe representation avoids it to a large extent. The sememe set is a small-scale set of words or sememe serial numbers that characterizes all concepts in nature; each element of the set expresses a concept uniquely and without repetition.
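A toy example may make the sparsity argument concrete: two short documents that share no words can still share sememes. The dictionary `SEMEME_DICT`, its entries, and the function name `to_sememes` are invented for illustration and are not taken from any real sememe dictionary.

```python
# Hypothetical sememe dictionary: each word maps to a few basic meaning units.
SEMEME_DICT = {
    "doctor": ["human", "medical"],
    "physician": ["human", "medical"],
    "hospital": ["institution", "medical"],
    "clinic": ["institution", "medical"],
}


def to_sememes(words):
    """Replace each word by its sememes; unknown words are kept as-is."""
    return [s for w in words for s in SEMEME_DICT.get(w, [w])]


doc_a = ["doctor", "hospital"]
doc_b = ["physician", "clinic"]

word_overlap = set(doc_a) & set(doc_b)
sememe_overlap = set(to_sememes(doc_a)) & set(to_sememes(doc_b))
print(word_overlap, sememe_overlap)
```

At the word level the two documents have nothing in common, while at the sememe level they overlap completely.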

The topic semantic vector computing module 6 applies principles of statistical linguistics to all the sememes output by the word sense disambiguation module 5. The result of the calculation is that different sememes are given different weights, yielding a topic semantic vector sorted by weight. If the document preprocessing module 1 has obtained the title and classification information of the document, the topic semantic vector computing module 6 assigns different weights to the classification information, title information, and body information of the document during the calculation.
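One way to give classification, title, and body different weights is to scale each sememe occurrence by the section it came from before any further weighting. The sketch below assumes this; the section names and the numeric values in `SECTION_WEIGHTS` are hypothetical, since the patent does not state them.

```python
from collections import Counter

# Hypothetical per-section multipliers; the patent gives no concrete values.
SECTION_WEIGHTS = {"category": 3.0, "title": 2.0, "body": 1.0}


def weighted_counts(sections):
    """Count sememes across sections, scaling each occurrence by the
    weight of the section in which it appears."""
    counts = Counter()
    for name, sememes in sections.items():
        for sememe in sememes:
            counts[sememe] += SECTION_WEIGHTS.get(name, 1.0)
    return counts


counts = weighted_counts({
    "title": ["medical"],
    "body": ["medical", "human"],
})
print(counts.most_common())
```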

In this embodiment, the Tf-Idf algorithm is used to calculate a weight for every sememe. Other algorithms, such as cross entropy, can of course be used instead. The Tf-Idf algorithm relies on inverted indexing and is mainly used in full-text retrieval. It assigns high weights to sememes of medium frequency and filters out noise words.
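A minimal Tf-Idf sketch over sememes is given below. The exact formula used in the patent is not stated; the smoothed inverse document frequency `log((1 + N) / (1 + df)) + 1` is an assumption, as are the function name `tfidf_vector` and the toy corpus.

```python
import math
from collections import Counter


def tfidf_vector(doc_sememes, corpus, top_n=None):
    """Return (sememe, weight) pairs for one document, sorted by weight.

    doc_sememes: list of sememes for the document being analyzed.
    corpus: list of sememe lists, one per document, used for idf.
    """
    tf = Counter(doc_sememes)
    total = len(doc_sememes)
    n_docs = len(corpus)
    weights = {}
    for sememe, count in tf.items():
        # Document frequency: how many corpus documents contain the sememe.
        df = sum(1 for d in corpus if sememe in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (assumed)
        weights[sememe] = (count / total) * idf
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n] if top_n else ranked


corpus = [
    ["medical", "medical", "human"],
    ["human", "money"],
    ["human", "water"],
]
topic_vector = tfidf_vector(corpus[0], corpus)
print(topic_vector)
```

Here "human" occurs in every document and is ranked below "medical", which is frequent in the first document but rare in the corpus.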

The topic semantic vector library 7 stores the topic semantic vectors output by the topic semantic vector computing module 6.

The document relevance computing module 8 performs a calculation on the topic semantic vectors of at least two documents to be analyzed, obtaining the relevance of the at least two documents. The topic semantic vectors can all be obtained from the sememe processing module, in which case the preceding modules process the at least two documents at the same time. They can also all be obtained from the topic semantic vector library 7, in which case the module looks up the topic semantic vectors corresponding to the documents to be analyzed in the library according to its settings and then performs the calculation. Alternatively, one vector can be obtained from the sememe processing module and the other from the topic semantic vector library 7. For example, if a management module finds that one of two documents to be analyzed has already been analyzed and its topic semantic vector is stored in the topic semantic vector library 7, only the other document is analyzed this time, and the topic semantic vector of the already-analyzed document is obtained directly from the library.
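The choice between computing a vector and fetching a stored one can be sketched as a simple look-up with fall-back. The class `TopicVectorStore`, the helper `vector_for`, the document identifiers, and the stand-in `analyze` function are all hypothetical; the patent does not describe how the library is keyed or implemented.

```python
class TopicVectorStore:
    """In-memory stand-in for the topic semantic vector library."""

    def __init__(self):
        self._vectors = {}

    def get(self, doc_id):
        return self._vectors.get(doc_id)

    def put(self, doc_id, vector):
        self._vectors[doc_id] = vector


def vector_for(doc_id, text, store, analyze):
    """Fetch a stored topic vector, or analyze the document and store it."""
    cached = store.get(doc_id)
    if cached is not None:
        return cached
    vector = analyze(text)
    store.put(doc_id, vector)
    return vector


calls = []


def analyze(text):
    # Placeholder for the full pipeline (segmentation through weighting).
    calls.append(text)
    return {word: 1.0 for word in text.split()}


store = TopicVectorStore()
first = vector_for("doc-1", "medical human", store, analyze)
second = vector_for("doc-1", "medical human", store, analyze)
print(len(calls))
```

The second request for the same document is served from the store, so the analysis runs only once.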

The relevance of the documents can be obtained by calculating the cosine of the angle between their two topic semantic vectors.
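A minimal sketch of the cosine calculation follows, assuming each topic semantic vector is held as a mapping from sememe to non-negative weight. That representation and the function name `cosine_relevance` are assumptions.

```python
import math


def cosine_relevance(vec_a, vec_b):
    """Cosine of the angle between two sparse sememe-weight vectors."""
    dot = sum(w * vec_b.get(s, 0.0) for s, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


programming = {"software": 0.8, "compile": 0.5}
politics = {"state": 0.7, "society": 0.6}
print(cosine_relevance(programming, programming))
print(cosine_relevance(programming, politics))
```

With non-negative weights the result lies between 0 and 1, which matches the range given for document relevance in the background section.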

As shown in Fig. 2, the document relevance computing method of the present invention comprises the following steps:

S1: the document preprocessing module 1 converts input documents of different formats into standard-format documents and extracts their body content and, where possible, their title and classification information as well.

S2: the word segmentation module 2 segments the body content of the documents (and possibly the classification and title) into words and tags the segmented words with their parts of speech, forming a first vocabulary list.

S3: the segmentation post-processing module 3 removes the stop words, function words, and the like from the first vocabulary list, forming a second vocabulary list.

S4: the sememe annotation module 4 annotates the words of the second vocabulary list with sememes according to the correspondence between the dictionary and the sememe dictionary, forming a third vocabulary list.

S5: the word sense disambiguation module 5 processes the polysemous words in the third vocabulary list, determining from context information the corresponding unique sememe for each polysemous word and obtaining a first sememe table.

S6: the topic semantic vector computing module 6 calculates a weight for every sememe in the first sememe table using an algorithm such as the Tf-Idf (feature term weighting) of the vector space model, obtaining a topic semantic vector sorted by weight.

S7: the document relevance computing module 8 performs a calculation on the topic semantic vectors corresponding to the documents to be analyzed, obtains the relevance between the documents, and normalizes it to a value between 0 and 1.

The above is only a preferred embodiment of the present invention and does not limit the present invention. For a person skilled in the art, the present invention may have various changes and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A document relevance computing system, comprising a document preprocessing module and a word segmentation module connected in sequence, the input of the document preprocessing module being at least one document to be analyzed and the output of the word segmentation module being a first vocabulary list corresponding to the at least one document; the word segmentation module also tagging the segmented words with their parts of speech; characterized in that the system further comprises:

a segmentation post-processing module, connected in series between the word segmentation module and a sememe processing module, which removes stop words and function words from the first vocabulary list according to part of speech, obtaining a second vocabulary list;

the sememe processing module, which annotates the words of the second vocabulary list with sememes to form a third vocabulary list, determines the weights of the multiple sememes corresponding to each polysemous word in the third vocabulary list or determines a unique sememe for each polysemous word to obtain a first sememe table, and calculates a weight for every sememe in the first sememe table, obtaining a topic semantic vector sorted by weight;

a document relevance computing module, connected to the sememe processing module, which performs relevance calculation on at least two topic semantic vectors.

2. The document relevance computing system according to claim 1, characterized in that it further comprises a topic semantic vector library whose input is connected to the sememe processing module and whose output is connected to the document relevance computing module, and which stores the topic semantic vectors output by the sememe processing module;

the document relevance computing module performs relevance calculation on at least two topic semantic vectors; the topic semantic vectors are obtained from the sememe processing module, from the topic semantic vector library, or separately from both the sememe processing module and the topic semantic vector library.

3. The document relevance computing system according to claim 1, characterized in that:

the document preprocessing module converts input documents of different formats into a standard format and extracts the document body;

the word segmentation module segments the output of the document preprocessing module into words, obtaining the first vocabulary list.

4. The document relevance computing system according to claim 3, characterized in that the sememe processing module comprises:

a sememe annotation module, which uses a sememe dictionary to annotate the words of the second vocabulary list with sememes, forming the third vocabulary list;

a word sense disambiguation module, which determines the weights of the multiple sememes corresponding to each polysemous word in the third vocabulary list, or determines a unique sememe for each polysemous word, obtaining the first sememe table;

a topic semantic vector computing module, which calculates a weight for every sememe in the first sememe table, obtaining the topic semantic vector sorted by weight.

5. A document relevance computing method, characterized in that it comprises the following steps:

(a) a document preprocessing module converts input documents of different formats into a standard format and extracts the document body;

(b) a word segmentation module segments the output of the document preprocessing module into words and tags the segmented words with their parts of speech, obtaining a first vocabulary list; a segmentation post-processing module removes the stop words and function words from the first vocabulary list, obtaining a second vocabulary list;

(c) a sememe processing module annotates the words of the second vocabulary list with sememes to form a third vocabulary list, processes the words of the third vocabulary list by determining the weights of the multiple sememes corresponding to each polysemous word or determining a unique sememe for each polysemous word to obtain a first sememe table, and calculates a weight for every sememe in the first sememe table, obtaining a topic semantic vector sorted by weight;

(d) a document relevance computing module performs a calculation on the topic semantic vectors of at least two documents to be analyzed, obtaining the relevance of the at least two documents.

6. The document relevance computing method according to claim 5, characterized in that, in step (d), the topic semantic vectors of the at least two documents are obtained from the sememe processing module, from a topic semantic vector library connected to the document relevance computing module, or separately from both the sememe processing module and the topic semantic vector library.

7. The document relevance computing method according to claim 5, characterized in that step (a) further comprises: the document preprocessing module obtains the classification information and title information of the corresponding document.

8. The document relevance computing method according to claim 5, characterized in that, in step (c), the topic semantic vector is obtained as follows:

(c1) a sememe annotation module uses a sememe dictionary to annotate the words of the second vocabulary list with sememes, forming the third vocabulary list;

(c2) a word sense disambiguation module processes the sememe-annotated words of the third vocabulary list, determining the weights of the multiple sememes corresponding to each polysemous word, or determining a unique sememe for each polysemous word, and obtaining the first sememe table;

(c3) a topic semantic vector computing module calculates a weight for every sememe in the first sememe table, obtaining the topic semantic vector sorted by weight.