CN103049568B - Method for classifying documents in a massive document library - Google Patents

Publication number
CN103049568B
Authority
CN
China
Prior art keywords
document
keyword
word
category
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210593096.8A
Other languages
Chinese (zh)
Other versions
CN103049568A (en)
Inventor
江潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Language network (Wuhan) Information Technology Co., Ltd.
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201210593096.8A
Publication of CN103049568A
Application granted
Publication of CN103049568B
Current legal status: Active
Anticipated expiration

Abstract

The invention provides a method for classifying documents in a massive document library, comprising: determining each keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to; matching each keyword one by one against a term bank, and taking the industry category attribute of the term matched by each keyword as the industry category attribute that the keyword belongs to in each of its corresponding documents; determining, according to the correspondence, the industry category attribute that each document contains the most instances of; and taking the industry category attribute with the most instances as the category of each document. The invention takes a reverse-matching approach to term retrieval over the documents of a reference library. Because the term corpus is a set with a character-order index structure, a binary search for a string in it needs at most 1 + log2 n comparisons, which greatly reduces the number of matches, simplifies the matching process, and improves the efficiency of document classification.

Description

Method for classifying documents in a massive document library
Technical field
The present invention relates to the computer field, and in particular to a method for classifying documents in a massive document library.
Background art
A translation reference library (hereinafter, the reference library) is a document library of auxiliary translation resources that holds a massive number of documents. Classifying it by industry, discipline, and field with general similarity retrieval requires an enormous amount of text similarity matching, and the time and space consumed are more than a system can bear.
By using a large-scale term corpus to count the terms in each document of the reference library, the documents can be given a preliminary division by attributes such as industry, discipline, and field. The string pattern matching this requires costs far less computation than text similarity matching.
A large-scale term corpus is a large collection of term entries that carries term annotation information and has multiple index structures. Its scale is generally in the millions to tens of millions of entries and can reach hundreds of millions. The annotation information this method uses is the industry, discipline, and field of each term; the index structure it uses is the character-order index.
The usual way to classify the documents of the reference library by their term counts per industry, discipline, and field is to take each term in the term bank as a keyword and string-match it in the documents, obtaining the number of terms of each industry, discipline, and field in each document.
Because the documents of the reference library form an unsorted, unordered text space, classifying them this way means taking millions, tens of millions, or even hundreds of millions of terms as keywords and matching them sequentially against the massive set of reference library documents. The time this takes is enormous: if the term corpus has n terms, the reference library has m documents, and the average document has k words, the time complexity is O(m × n × k). Moreover, the whole matching process string-matches the same word repeatedly across different documents of the reference library, so the matching is highly redundant.
Summary of the invention
The present invention aims to provide a method for classifying documents in a massive document library, to solve the problem that classifying reference library documents by term matching is complex and time-consuming.
An embodiment of the present invention provides a method for classifying documents in a massive document library, comprising:
determining each keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to;
matching each keyword one by one against the term bank, and taking the industry category attribute of the term matched by each keyword as the industry category attribute that the keyword belongs to in each of its corresponding documents;
determining, according to the correspondence, the industry category attribute that each document contains the most instances of;
taking the industry category attribute with the most instances as the category of each document.
The present invention takes a reverse-matching approach to term retrieval over the documents of the reference library: every word in the reference library (that is, the document library) is taken as a keyword and matched against the term corpus. Because the term corpus is a set with a character-order index structure, a binary search for a string in it needs at most 1 + log2 n comparisons (n being the number of terms in the term corpus). Even in a term corpus of hundreds of millions of entries, one word needs no more than 30 comparisons. This greatly reduces the number of matches, simplifies the matching process, improves the efficiency of document classification, and achieves fast automatic classification of massive document collections.
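As a rough check of the stated bound: a binary search over n sorted entries needs at most floor(log2 n) + 1 probes, which for a positive Python integer equals `n.bit_length()`. The function name below is hypothetical and is not taken from the patent; the corpus sizes are the orders of magnitude the description mentions.

```python
def max_binary_search_probes(n: int) -> int:
    """Worst-case number of probes for binary search over n sorted entries."""
    if n < 1:
        raise ValueError("corpus must contain at least one term")
    # For n >= 1, n.bit_length() == floor(log2(n)) + 1.
    return n.bit_length()


# Corpus sizes named in the description: millions, tens of millions, hundreds of millions.
probes = {n: max_binary_search_probes(n) for n in (10**6, 10**7, 10**8)}
```

Under this count the bound of 30 probes holds for any corpus with fewer than 2^30 (about 1.07 billion) terms.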
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows the flow chart of an embodiment;
Fig. 2 shows the flow chart of another embodiment.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. Referring to Fig. 1, the steps of the embodiment comprise:
S11: determining each keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to;
S12: matching each keyword one by one against the term bank, and taking the industry category attribute of the term matched by each keyword as the industry category attribute that the keyword belongs to in each of its corresponding documents;
S13: determining, according to the correspondence, the industry category attribute that each document contains the most instances of;
S14: taking the industry category attribute with the most instances as the category of each document.
The present invention takes a reverse-matching approach to term retrieval over the documents of the reference library: every word in the reference library (that is, the document library) is taken as a keyword and matched against the term corpus. Because the term corpus is a set with a character-order index structure, a binary search for a string in it needs at most 1 + log2 n comparisons (n being the number of terms in the term corpus). Even in a term corpus of hundreds of millions of entries, one word needs no more than 30 comparisons. This greatly reduces the number of matches, simplifies the matching process, improves the efficiency of document classification, and achieves fast automatic classification of massive document collections.
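A minimal sketch of the keyword lookup, assuming the term corpus is held in memory as a character-sorted Python list of strings (the patent only states that it has a character-order index). `lookup_term` is a hypothetical name; `bisect_left` is the standard-library binary search.

```python
from bisect import bisect_left


def lookup_term(sorted_terms, keyword):
    """Return the index of `keyword` in `sorted_terms`, or -1 if it is absent."""
    # bisect_left performs the binary search and returns the insertion point.
    i = bisect_left(sorted_terms, keyword)
    if i < len(sorted_terms) and sorted_terms[i] == keyword:
        return i
    return -1


# Illustrative term corpus; it must be sorted for binary search to be valid.
terms = sorted(["database", "compiler", "network", "kernel"])
found = lookup_term(terms, "database")
missing = lookup_term(terms, "router")
```

Each keyword is looked up once no matter how many documents contain it, which is where the saving over matching every term against every document comes from.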
Preferably, in an embodiment, each document is subjected to word segmentation, and stop words and words with no concrete meaning are removed, to obtain the keywords.
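A minimal sketch of this preprocessing step, under loud assumptions: the text is English, segmentation is a simple regular-expression split, and the stop-word list is a tiny illustrative set. A Chinese reference library would need a real word segmenter, which the patent does not name. `extract_keywords` and `STOP_WORDS` are hypothetical names.

```python
import re

# Illustrative stop-word list only; a production list would be much larger.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "is", "and", "to"})


def extract_keywords(text, stop_words=STOP_WORDS):
    """Segment `text` into lowercase words and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


keywords = extract_keywords("The database is stored in a database server")
```

The result keeps repeated words in document order, so later steps can derive term frequency and positions from it.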
Preferably, the method further comprises: determining the multiple pieces of position information at which each keyword occurs in each of its corresponding documents, wherein the number of pieces of position information equals the term frequency of the keyword in that document.
This position information records where the keyword occurs in each document. When the word length L of a term exceeds that of the keyword, the keywords that follow the recorded position can be matched against the term to determine the industry category attribute that the keyword belongs to in the current document.
Preferably, the steps of the above embodiment are illustrated below by an example, comprising:
S21: Number all documents of the reference library; the number is denoted docID.
S22: Perform word segmentation on all documents in the reference library and remove the stop words to obtain the set of all words of the reference library. Number each word; the number is denoted wordID. Each word is a keyword.
S23: Calculate the number of times each word occurs in each document, i.e. the term frequency tf.
S24: Calculate the position information of each word in each document, i.e. which word of the document it is.
This yields, for each word, a word table structure as shown in Table 1 below:
Table 1
Table 1 establishes the correspondence between a word and its multiple corresponding documents, together with the position information and term frequency of the word in each document.
For example, Table 2 below indicates that the word "database" occurs twice in document doc0010, at the 10th and 100th characters, and three times in document doc0020, at the 20th, 200th, and 300th characters.
Table 2
The above method thus builds an information table of all words of the reference library.
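Steps S21 to S24 amount to building an inverted index with positions. The sketch below assumes each document has already been reduced to its keyword sequence and records positions as zero-based indexes into that sequence; the exact layout of Tables 1 and 2 is not recoverable here, so the nested-dictionary shape and the name `build_word_table` are assumptions.

```python
from collections import defaultdict


def build_word_table(docs):
    """Map each word to {doc_id: [positions]} over all documents.

    `docs` maps a document ID to its ordered list of keywords.
    """
    table = defaultdict(dict)
    for doc_id, keywords in docs.items():
        for position, word in enumerate(keywords):
            table[word].setdefault(doc_id, []).append(position)
    return dict(table)


docs = {
    "doc0010": ["database", "software", "index", "database"],
    "doc0020": ["database", "index"],
}
word_table = build_word_table(docs)
# The term frequency tf is the number of recorded positions per document.
tf = {doc_id: len(pos) for doc_id, pos in word_table["database"].items()}
```

Storing only positions keeps the table compact, since the term frequency never needs to be stored separately.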
S25: In the order of the reference library word information table, take each word as the pattern string and perform pattern matching in the term corpus.
Because the term corpus is indexed in character order, a simple binary search can be used for matching; the number of comparisons required is no more than 1 + log2 n (n being the number of terms in the term corpus). The specific matching process is as follows:
If the keyword matches the first word of a term, calculate the word length of the term, denoted L. If L = 1, the word itself is the term and the match succeeds; the industry, discipline, and field attribute information of the term is returned to the document the word belongs to. If the word corresponds to multiple documents, the attribute information is returned to all of those documents.
If the keyword matches the first word of a term and the word length of the matched term is L > 1, traverse one by one the position information of the current keyword in each corresponding document.
For example, the current keyword is "database" and the matched term is "database software": the keyword matches the first word, "database", of the term. The word length of the term "database software" is L = 2 > 1, so the position information 10, 100 of the keyword in document doc0010 is traversed.
On reaching each piece of position information in the current document, extract in turn the L-1 keywords that follow that position in the document.
Match the L-1 keywords extracted each time against the matched term whose word length L is greater than 1.
After position 10, the next keyword found is "software". The keyword "software" is matched against the second word, "software", of the term "database software".
If the L-1 extracted keywords match the matched term whose word length L is greater than 1, the industry category attribute of the term is taken as the industry category attribute that the current keyword belongs to in its corresponding current document.
After the match succeeds, the industry category information of the term "database software" is taken as the industry category information of the keyword "database" in document doc0010.
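The matching process above can be sketched as follows, under stated assumptions: each term is a tuple of words, the term corpus is a sorted list of such tuples, positions are zero-based indexes into the document's keyword sequence, and `term_industry` maps a term to its industry category. All names are hypothetical; the patent does not prescribe these data structures.

```python
from bisect import bisect_left


def match_keyword_in_doc(keyword, positions, doc_keywords, sorted_terms, term_industry):
    """Return (position, industry) pairs credited to `keyword` in one document."""
    hits = []
    # Binary search for the first term whose first word is >= keyword.
    i = bisect_left(sorted_terms, (keyword,))
    # Terms that share the same first word are contiguous in the sorted list.
    while i < len(sorted_terms) and sorted_terms[i][0] == keyword:
        term = sorted_terms[i]
        length = len(term)  # word length L of the term
        for pos in positions:
            # L = 1: the keyword itself is the term.
            # L > 1: the L-1 keywords after `pos` must equal the rest of the term.
            if length == 1 or tuple(doc_keywords[pos + 1:pos + length]) == term[1:]:
                hits.append((pos, term_industry[term]))
        i += 1
    return hits


doc = ["database", "software", "fast", "database", "index"]
terms = sorted([("database", "software"), ("network",)])
industry = {("database", "software"): "IT", ("network",): "telecom"}
hits = match_keyword_in_doc("database", [0, 3], doc, terms, industry)
```

In this example only the occurrence at position 0 is followed by "software", so only that occurrence is credited with the category of "database software".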
S26: Proceed in order until all keywords in the reference library word information table have been matched.
S27: Calculate the number of terms of each industry, discipline, and field in each document. From these counts, determine the industry category attribute with the most identical instances, and assign the document to the corresponding industry, discipline, and field.
Preferably, the recorded term frequency is used when determining the industry category attribute with the most instances in each document: the count is multiplied by the term frequency of the keyword in the document. For example, if the term matched by keyword B of document A belongs to industry C and the term frequency of keyword B in document A is 5, then document A contains 5 instances of the industry category attribute C.
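A minimal sketch of the final tally, assuming the matching stage has produced, for one document, a list of (industry, term frequency) pairs, one per matched keyword. `classify_document` is a hypothetical name, and the patent does not say how ties are broken, so the sketch does not rely on any tie-breaking behaviour.

```python
from collections import Counter


def classify_document(matches):
    """Pick the industry with the largest frequency-weighted count.

    `matches` is a list of (industry, tf) pairs for one document.
    Returns None when the document matched no term.
    """
    totals = Counter()
    for industry_name, tf in matches:
        # Each matched keyword contributes 1 x tf instances of its category.
        totals[industry_name] += tf
    if not totals:
        return None
    return totals.most_common(1)[0][0]


# Keyword B matches a term of industry C with tf = 5 in document A.
category = classify_document([("C", 5), ("D", 2), ("C", 1)])
```

Here industry C accumulates 6 instances against 2 for industry D, so document A is assigned to C.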
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and do not limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (3)

1. A method for classifying documents in a massive document library, characterized by comprising:
determining each keyword of all documents in the document library, and the correspondence between each keyword and each document it belongs to;
matching each keyword one by one against the term bank, and taking the industry category attribute of the term matched by each keyword as the industry category attribute that the keyword belongs to in each of its corresponding documents;
determining, according to the correspondence, the industry category attribute that each document contains the most instances of;
taking the industry category attribute with the most instances as the category of each document;
the method further comprising:
determining the multiple pieces of position information at which each keyword occurs in each of its corresponding documents, wherein the number of pieces of position information equals the term frequency of the keyword in that document;
the matching process comprising:
if the word length of the matched term is L = 1, determining that the match succeeds;
if the word length of the matched term is L > 1, traversing one by one the position information of the current keyword in each corresponding document;
on reaching each piece of position information in the current document, extracting in turn the L-1 keywords that follow that position in the document;
matching the L-1 keywords extracted each time against the last L-1 words of the matched term whose word length L is greater than 1;
determining the industry category attribute that each keyword belongs to in each of its corresponding documents;
if the L-1 extracted keywords match the matched term whose word length L is greater than 1, taking the industry category attribute of that term as the industry category attribute that the current keyword belongs to in its corresponding current document.
2. The method according to claim 1, characterized in that each document is subjected to word segmentation, and stop words and words with no concrete meaning are removed, to obtain the keywords.
3. The method according to claim 1, characterized in that binary search is used to look up the current keyword in the term bank.
CN201210593096.8A 2012-12-31 2012-12-31 Method for classifying documents in a massive document library Active CN103049568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210593096.8A CN103049568B (en) 2012-12-31 2012-12-31 Method for classifying documents in a massive document library

Publications (2)

Publication Number Publication Date
CN103049568A CN103049568A (en) 2013-04-17
CN103049568B true CN103049568B (en) 2016-05-18

Family

ID=48062208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210593096.8A Active CN103049568B (en) 2012-12-31 2012-12-31 Method for classifying documents in a massive document library

Country Status (1)

Country Link
CN (1) CN103049568B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593200A (en) * 2009-06-19 2009-12-02 淮海工学院 Chinese Web page classification method based on the keyword frequency analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7047236B2 (en) * 2002-12-31 2006-05-16 International Business Machines Corporation Method for automatic deduction of rules for matching content to categories

Also Published As

Publication number Publication date
CN103049568A (en) 2013-04-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Jiang Chao

Inventor after: Zhang Pi

Inventor before: Jiang Chao

COR Change of bibliographic data
C56 Change in the name or address of the patentee
CP03 Change of name, title or address

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Patentee after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Patentee before: Wuhan Transn Information Technology Co., Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method for classifying documents in mass document library

Effective date of registration: 20181115

Granted publication date: 20160518

Pledgee: Bank of Communications Co., Ltd. Wuhan Branch of Hubei Free Trade Experimental Zone

Pledgor: Language network (Wuhan) Information Technology Co., Ltd.

Registration number: 2018420000061

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20200617

Granted publication date: 20160518

Pledgee: Bank of Communications Co.,Ltd. Wuhan Branch of Hubei Free Trade Experimental Zone

Pledgor: IOL (WUHAN) INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: 2018420000061
