CN103049568B

CN103049568B - Method for classifying documents in a massive document library

Info

Publication number: CN103049568B
Application number: CN201210593096.8A
Authority: CN
Inventors: 江潮
Original assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Current assignee: Language network (Wuhan) Information Technology Co., Ltd.
Priority date: 2012-12-31
Filing date: 2012-12-31
Publication date: 2016-05-18
Anticipated expiration: 2032-12-31
Also published as: CN103049568A

Abstract

The invention provides a method for classifying documents in a massive document library, comprising: determining every keyword of all documents in the document library, together with the correspondence between each keyword and each document it belongs to; matching each keyword, one by one, against a term base, and taking the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in each of its corresponding documents; determining, according to the correspondence, the industry category attribute that occurs most often in each document; and taking that most frequent industry category attribute as the class of the document. The invention uses reverse matching for term retrieval over the documents of the reference library: because the term corpus is a set with a character-order index, a binary search for a string in it needs at most 1 + log2 n comparisons, which greatly reduces the number of matching operations, simplifies the matching process, and improves the efficiency of document classification.

Description

Method for classifying documents in a massive document library

Technical field

The present invention relates to the computer field, and in particular to a method for classifying documents in a massive document library.

Background technology

A translation reference library (hereinafter, the reference library) is a document library of auxiliary translation resources that holds a massive number of documents. Classifying it by industry, subject, or field with general similarity-retrieval methods requires an enormous amount of text-similarity matching, and the time and space this consumes are more than a system can bear.

Counting the terms in each reference-library document against a large-scale term corpus allows a preliminary division of the documents by attributes such as industry, subject, and field, and the string pattern matching this requires costs far less computation than text-similarity matching.

A large-scale term corpus is a large collection of term data that carries term annotation information and several index structures. Its size is generally on the order of millions to tens of millions of terms and can reach hundreds of millions. The annotation information this method uses is each term's industry, subject, and field; the index structure it uses is the character-order index.

Conventionally, reference-library documents are classified by their term counts per industry, subject, and field as follows: each term in the term base is used as a keyword and string-matched within the documents, giving the number of terms of each industry, subject, and field in each document.

Because the documents in the reference library form an unsorted, unordered text space, classifying them this way means taking millions, tens of millions, or even hundreds of millions of terms as keywords and matching each sequentially against the massive set of reference-library documents, which takes a very long time. (Let n be the number of terms in the term corpus, m the number of documents in the reference library, and k the average number of words per document; the time complexity is O(m × n × k).) In addition, the matching process string-matches the same word again and again wherever it recurs across different documents, so it is highly repetitive.

Summary of the invention

The present invention aims to provide a method for classifying documents in a massive document library, to solve the problem that classifying reference-library documents by term matching is complex and time-consuming.

An embodiment of the present invention provides a method for classifying documents in a massive document library, comprising:

determining every keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to;

matching each keyword, one by one, against a term base, and taking the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in each of its corresponding documents;

determining, according to the correspondence, the industry category attribute that occurs most often in each document;

taking the most frequent industry category attribute as the class of each document.

The present invention uses reverse matching for term retrieval over the reference-library documents: every word in the reference library (that is, the document library) is taken as a keyword and matched against the term corpus. Because the term corpus is a set with a character-order index, a binary search for a string in it needs at most 1 + log2 n comparisons (n being the number of terms in the term corpus). Even in a term corpus of hundreds of millions of terms, a word needs no more than 30 comparisons. This greatly reduces the number of matching operations, simplifies the matching process, improves the efficiency of document classification, and achieves fast automatic classification of massive document sets.
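As a quick check of the bound stated above, the following minimal Python sketch computes the worst-case number of comparisons for a binary search over n sorted terms, taken here as floor(log2 n) + 1. The function name `max_comparisons` and the sample corpus sizes are illustrative assumptions and are not part of the original disclosure.

```python
def max_comparisons(n: int) -> int:
    """Worst-case probes of a binary search over n sorted terms.

    For n >= 1, floor(log2 n) + 1 equals the bit length of n, which avoids
    floating-point rounding. (Hypothetical helper, for illustration only.)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n.bit_length()


# Illustrative corpus sizes: one million, ten million, one hundred million terms.
for n in (10**6, 10**7, 10**8):
    print(n, max_comparisons(n))  # prints 20, 24 and 27 probes respectively
```

Under this reading of the bound, any corpus of fewer than 2^30 (about 1.07 billion) terms stays within 30 comparisons per word.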

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows a flow chart of an embodiment;

Fig. 2 shows a flow chart of another embodiment.

Detailed description of the invention

The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. Referring to Fig. 1, the steps of the embodiment comprise:

S11: determine every keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to;

S12: match each keyword, one by one, against the term base, and take the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in each of its corresponding documents;

S13: according to the correspondence, determine the industry category attribute that occurs most often in each document;

S14: take the most frequent industry category attribute as the class of each document.
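Steps S11 to S14 can be sketched end to end as follows. This is a minimal illustration under simplifying assumptions: documents are already segmented into keyword lists, every term is a single word, and a plain dictionary `term_industry` stands in for the term base. The names `classify`, `docs`, and `term_industry` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set


def classify(docs: Dict[str, List[str]],
             term_industry: Dict[str, str]) -> Dict[str, Optional[str]]:
    # S11: map each distinct keyword to the documents that contain it.
    keyword_docs: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            keyword_docs[word].add(doc_id)

    # S12: look up each distinct keyword once and pass the matched term's
    # industry attribute to every document that contains the keyword.
    counts: Dict[str, Counter] = defaultdict(Counter)
    for keyword, doc_ids in keyword_docs.items():
        industry = term_industry.get(keyword)  # stand-in for the term-base lookup
        if industry is None:
            continue
        for doc_id in doc_ids:
            counts[doc_id][industry] += 1  # S13: tally attributes per document

    # S14: the most frequent attribute becomes the document's class.
    result: Dict[str, Optional[str]] = {}
    for doc_id in docs:
        tally = counts.get(doc_id)
        result[doc_id] = tally.most_common(1)[0][0] if tally else None
    return result


docs = {
    "d1": ["database", "index", "loan"],
    "d2": ["loan", "bank"],
    "d3": ["weather"],
}
term_industry = {"database": "IT", "index": "IT", "loan": "finance", "bank": "finance"}
classes = classify(docs, term_industry)
```

Because each distinct keyword is looked up only once, however many documents contain it, the repeated matching of the conventional approach is avoided.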

The present invention uses reverse matching for term retrieval over the reference-library documents: every word in the reference library (that is, the document library) is taken as a keyword and matched against the term corpus. Because the term corpus is a set with a character-order index, a binary search for a string in it needs at most 1 + log2 n comparisons (n being the number of terms in the term corpus). Even in a term corpus of hundreds of millions of terms, a word needs no more than 30 comparisons. This greatly reduces the number of matching operations, simplifies the matching process, improves the efficiency of document classification, and achieves fast automatic classification of massive document sets.
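The lookup described above can be illustrated with a textbook binary search over a sorted list of term strings that also counts its probes. This is a sketch only: the in-memory list `terms` and the function `binary_search` are assumptions, and a real term corpus would sit behind its own character-order index.

```python
from typing import List, Tuple


def binary_search(terms: List[str], key: str) -> Tuple[int, int]:
    """Return (index, probes); index is -1 when key is not in terms.

    terms must be sorted in character order.
    """
    lo, hi, probes = 0, len(terms) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                 # one string comparison step per iteration
        if terms[mid] == key:
            return mid, probes
        if terms[mid] < key:
            lo = mid + 1            # key sorts after the middle term
        else:
            hi = mid - 1            # key sorts before the middle term
    return -1, probes


# Synthetic, already sorted corpus of 100,000 hypothetical terms.
terms = [f"term{i:06d}" for i in range(100_000)]
index, probes = binary_search(terms, "term054321")
bound = len(terms).bit_length()     # floor(log2 n) + 1 = 17 for n = 100,000
```

Both a hit and a miss terminate within `bound` probes, which is the property the text relies on.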

Preferably, in the embodiment, each document is word-segmented and stop words and words without concrete meaning are removed, yielding the keywords.
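A minimal sketch of this preprocessing step follows. The source does not name a segmenter, so the example assumes English-like text that a regular expression can split into words; Chinese text would need a dedicated word segmenter. The stop-word set `STOP_WORDS` and the function `extract_keywords` are illustrative assumptions.

```python
import re
from typing import List

# Tiny illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def extract_keywords(text: str) -> List[str]:
    """Split text into lowercase words and drop stop words, keeping order."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


keywords = extract_keywords("The database software is in the library")
```

The order of the surviving words is preserved, because the position of each keyword within its document is needed for later matching of multi-word terms.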

Preferably, the method further comprises: determining the position information of each occurrence of each keyword in each of its corresponding documents, where the number of position entries equals the word frequency of the keyword in that document.

This position information records where the keyword occurs in each document. When the word length L of a term exceeds that of the keyword, the keywords following each position can be matched against the term to determine the industry category attribute of the keyword in the current document.

Preferably, the steps of the above embodiment are illustrated by the following embodiment, which comprises:

S21: number all documents of the reference library; the number is denoted docID.

S22: word-segment all documents in the reference library and remove the stop words to obtain the set of all words in the reference library; number each word, denoted wordID. Each word is a keyword.

S23: calculate the number of times each word occurs in each document, that is, the word frequency tf.

S24: calculate the position information of each occurrence of each word in each document, that is, which word of the document it is.

This yields, for each word, a word table structure as shown in Table 1 below:

Table 1

Table 1 records the correspondence between a word and its corresponding documents, together with the word's positions and word frequency in each document.

For example, Table 2 below shows that the word "database" occurs twice in document doc0010, at the 10th and 100th positions, and three times in document doc0020, at the 20th, 200th, and 300th positions.

Table 2

The method above thus builds an information table for all words of the reference library.
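The word information table built in S21 to S24 can be sketched as a nested mapping from each word to its documents and positions. The layout below is an assumption for illustration (the original tables are not reproduced in this text), with positions counted from 1 as word indexes; the filler words and the helper `make_doc` are synthetic and exist only to place "database" at the positions quoted in the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_word_table(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, List[int]]]:
    """Map word -> {docID: [1-based positions]}; tf is the length of each list."""
    table: Dict[str, Dict[str, List[int]]] = defaultdict(dict)
    for doc_id, words in docs.items():
        for position, word in enumerate(words, start=1):
            table[word].setdefault(doc_id, []).append(position)
    return table


def make_doc(length: int, hits: Set[int], word: str = "database") -> List[str]:
    """Synthetic document: `word` at the given positions, filler words elsewhere."""
    return [word if i in hits else f"w{i}" for i in range(1, length + 1)]


docs = {
    "doc0010": make_doc(120, {10, 100}),
    "doc0020": make_doc(320, {20, 200, 300}),
}
table = build_word_table(docs)
entry = table["database"]   # {'doc0010': [10, 100], 'doc0020': [20, 200, 300]}
tf = {doc_id: len(positions) for doc_id, positions in entry.items()}
```

Storing the positions makes the word frequency implicit, which matches the statement that the number of position entries equals the word frequency.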

S25: in the order of the reference-library word information table, take each word as the pattern string and pattern-match it against the term corpus.

Because the term corpus is indexed in character order, a simple binary search can be used for matching, and the number of comparisons required is at most 1 + log2 n (n being the number of terms in the term corpus). The matching process is as follows:

If the word matches the first word of some term, calculate the word length of that term, denoted L. If L = 1, the word itself is the term and the match succeeds: return the industry, subject, and field attribute information of the term to the document the word belongs to; if the word corresponds to several documents, return that attribute information to all of them.

If the word matches the first word of some term and the word length of the matched term is L > 1, traverse one by one the position information of the current keyword in each of its corresponding documents.

For example, the current keyword is "database" and the matched term is "database software": the first word of the term, "database", matches. The word length of the term "database software" is L = 2 > 1, so the positions 10 and 100 of the keyword in document doc0010 are traversed.

For each position traversed in the current document, extract in turn the L-1 keywords that follow that position in the document.

Match the L-1 keywords extracted each time against the last L-1 words of the matched term whose word length L is greater than 1.

After position 10, the next keyword found is "software". The keyword "software" is matched against the second word, "software", of the term "database software".

If the L-1 extracted keywords match the term successfully, take the industry category attribute of the term as the industry category attribute of the current keyword in the current document.

After the match succeeds, the industry category information of the term "database software" is taken as the industry category information of the keyword "database" in document doc0010.
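The position-based matching of multi-word terms can be sketched as follows, assuming each document is available as an ordered list of keywords and positions are 1-based word indexes. The function `match_positions` and the synthetic document are hypothetical; in particular, the word after position 100 is not given in the source and is set here to a filler word so that only position 10 matches.

```python
from typing import List


def match_positions(term_words: List[str], doc_words: List[str],
                    positions: List[int]) -> List[int]:
    """Return the 1-based positions at which the whole term occurs.

    The keyword is assumed to have matched term_words[0] already.
    """
    length = len(term_words)            # word length L of the matched term
    if length == 1:
        return list(positions)          # L = 1: every occurrence is a match
    tail = term_words[1:]               # last L-1 words of the term
    hits = []
    for position in positions:
        # Words at 1-based positions position+1 .. position+L-1.
        following = doc_words[position:position + length - 1]
        if following == tail:
            hits.append(position)
    return hits


# Synthetic doc0010: "database" at positions 10 and 100, "software" at 11.
doc0010 = [f"w{i}" for i in range(1, 121)]
doc0010[9] = "database"
doc0010[10] = "software"
doc0010[99] = "database"

hits = match_positions(["database", "software"], doc0010, [10, 100])
```

Here `hits` contains only position 10, so the industry attribute of "database software" would be attributed to "database" in doc0010 for that occurrence.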

S26: match, in order, all keywords in the reference-library word information table.

S27: count the terms of each industry, subject, and field in each document; from these counts, determine the industry category attribute that occurs most often, and assign the document to the corresponding industry, subject, or field according to that attribute.

Preferably, the recorded word frequency is used when determining the industry category attribute that occurs most often in each document: each matched keyword is weighted by (multiplied by) its word frequency in the document. For example, if the term matched by keyword B of document A belongs to industry C and the word frequency of keyword B in document A is 5, then document A contains the industry category attribute C 5 times.
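The frequency weighting can be sketched with a counter per document. The example reuses the letters from the text (keyword B, industry C) and adds a second keyword D belonging to industry E purely for contrast; D, E, and the function `industry_counts` are illustrative assumptions.

```python
from collections import Counter
from typing import Dict


def industry_counts(keyword_industry: Dict[str, str],
                    keyword_tf: Dict[str, int]) -> Counter:
    """Sum, per industry, the word frequencies of the keywords matched to it."""
    counts: Counter = Counter()
    for keyword, industry in keyword_industry.items():
        counts[industry] += keyword_tf[keyword]   # weight each match by its tf
    return counts


# Document A: keyword B matched a term of industry C and occurs 5 times;
# keyword D (hypothetical) matched a term of industry E and occurs 2 times.
counts_a = industry_counts({"B": "C", "D": "E"}, {"B": 5, "D": 2})
best_industry = counts_a.most_common(1)[0][0]
```

With these inputs, document A contains industry attribute C 5 times and E twice, so it would be assigned to industry C.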

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of several computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into a separate integrated-circuit module, or several of the modules or steps can be made into a single integrated-circuit module. The present invention is therefore not restricted to any specific combination of hardware and software.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; for those skilled in the art, the present invention admits various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for classifying documents in a massive document library, characterized by comprising:

determining every keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to;

matching each keyword, one by one, against a term base, and taking the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in each of its corresponding documents;

determining, according to the correspondence, the industry category attribute that occurs most often in each document;

taking the most frequent industry category attribute as the class of each document;

the method further comprising:

determining the position information of each occurrence of each keyword in each of its corresponding documents, wherein the number of position entries equals the word frequency of the keyword in that document;

the matching comprising:

if the word length of the matched term is L = 1, determining that the match succeeds;

if the word length of the matched term is L > 1, traversing one by one the position information of the current keyword in each of its corresponding documents;

for each position traversed in the current document, extracting in turn the L-1 keywords that follow that position in the document;

matching the L-1 keywords extracted each time against the corresponding last L-1 words of the matched term whose word length L is greater than 1;

determining the industry category attribute of each keyword in each of its corresponding documents;

if the L-1 extracted keywords match the term successfully, taking the industry category attribute of the term as the industry category attribute of the current keyword in the current document.

2. The method according to claim 1, characterized in that each document is word-segmented and stop words and words without concrete meaning are removed, yielding the keywords.

3. The method according to claim 1, characterized in that a binary search is used to look up the current keyword in the term base.