Dictionary generation, and document classification method and device using the dictionary
Technical field
The present application relates to the field of computer technology, and more particularly to the generation of a dictionary and to a document classification method and device using the dictionary.
Background technology
With the continued development of Internet technology, more and more network interaction platforms have appeared, and these platforms bring great convenience to daily life. For example, e-commerce platforms let users easily buy goods online; online gaming platforms let users play electronic games over the network; and e-learning platforms let users easily study over the network.
However, while the large number of network interaction platforms makes life more convenient, it also makes it harder to integrate resources between network interaction platforms of the same type. For example, for two e-commerce platforms A and B, if the goods of platform B need to be integrated into platform A, the standards used to classify goods on the two platforms usually differ, and even the classification categories themselves may differ, so it is generally difficult to integrate the goods of platform B into platform A. How to integrate resources between network interaction platforms of the same type, given differing classification standards and categories, is therefore a problem that the industry urgently needs to solve.
Summary of the invention
The embodiments of the present application provide the generation of a dictionary, and a document classification method and device using the dictionary, to solve the prior-art problem that resource integration between network interaction platforms of the same type is generally difficult because their classification standards, and even their classification categories, differ.
An embodiment of the present application provides a method for generating a dictionary, the dictionary being used to classify documents in a target network interaction platform into a specified network interaction platform. The method includes:
Obtaining the title information of each document under each classification category of a specified level in the specified network interaction platform;
Determining, from the title information of each document, the keywords corresponding to each classification category and the weight score of each keyword under its corresponding category, where the weight score is used to assess how well the keyword distinguishes its category from the other categories, and a keyword is a word under its corresponding category that is designated according to a predetermined rule;
Generating the dictionary from each classification category, the keywords corresponding to each category, and the weight score of each keyword.
Preferably, the weight score of a keyword under its corresponding category is determined as follows:
Determining the term frequency of the keyword in the corresponding category;
Determining the inverse document frequency of the keyword from the number of documents, under all the categories, whose titles contain the keyword and the total number of documents under all the categories;
Determining the weight score of the keyword under the corresponding category from the term frequency of the keyword in that category and the inverse document frequency of the keyword.
Preferably, determining the keywords corresponding to each category from the title information of each document specifically includes:
Normalizing the title of each document;
Performing word segmentation and entity word identification on the normalized title of each document to obtain entity words, an entity word being a word that describes an entity;
Determining the term frequency of each entity word corresponding to each category, the term frequency of an entity word describing the number of times, or the frequency with which, the entity word occurs in the corresponding category;
Taking each entity word in turn as the current entity word, judging whether the term frequency of the current entity word is greater than a predetermined threshold, and if so, determining the current entity word as a keyword of the corresponding category.
Preferably, the method further includes: when the term frequencies of all the entity words corresponding to a current category are below the predetermined threshold, determining the entity word with the highest term frequency in the current category as the keyword of the current category, the current category being any one of the categories.
Preferably, generating the dictionary from each category, the keywords corresponding to each category, and the weight score of each keyword specifically includes:
Taking each keyword in turn as the current keyword, and generating an association group from the current keyword, the category corresponding to the current keyword, and the weight score of the current keyword under that category;
Generating the dictionary from the association groups.
Preferably, the categories of the specified level specifically include first-level categories and/or leaf categories.
An embodiment of the present application also provides a document classification method, which includes:
Obtaining the title information of a specified document in a target network interaction platform and determining the entity words corresponding to the title of the specified document, an entity word being a word that describes an entity;
Determining, by querying a dictionary, the weight score of each entity word under each category in the dictionary, the dictionary being one generated by any of the dictionary generation methods described above;
Determining, from the weight score of each entity word under each category, the respective total score of the specified document under each category;
Classifying the specified document according to its respective total score under each category.
Preferably, classifying the specified document according to its respective total score under each category specifically includes:
Generating a queue of the categories ordered from the highest to the lowest total score of the specified document, and selecting a predetermined number of categories in queue order starting from the head of the queue, the predetermined number being at least two;
Obtaining the initial category of the specified document in the target network interaction platform;
Determining a first similarity between each selected category and the initial category, the first similarity describing how similar the initial category is to each selected category;
Determining a second similarity between each selected category and the title of the specified document, the second similarity describing how similar the title of the specified document is to each selected category;
Taking each selected category as a preselected category, computing a weighted sum of the total score, the first similarity and the second similarity corresponding to that preselected category, and classifying the specified document according to the result of the weighted sum.
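The preselection and weighted-sum steps above can be sketched as follows. This is only an illustrative sketch: the function name rerank, the weight values, and the sample similarity values are assumptions, and the text does not specify how the first and second similarities are computed, so they are passed in as precomputed numbers.

```python
def rerank(total_scores, sim_initial, sim_title, k, weights):
    """Preselect the k categories with the highest total score, then
    re-rank them by a weighted sum of total score and two similarities.

    All names and the weight values are hypothetical; the similarity
    measures themselves are not specified in the text.
    """
    if k < 2:
        raise ValueError("at least two categories must be preselected")
    # Queue of categories ordered by total score, highest first.
    queue = sorted(total_scores, key=total_scores.get, reverse=True)
    preselected = queue[:k]
    w_total, w_first, w_second = weights
    combined = {
        c: w_total * total_scores[c]
        + w_first * sim_initial[c]
        + w_second * sim_title[c]
        for c in preselected
    }
    # The document is classified into the category with the largest sum.
    return max(combined, key=combined.get), combined


# Illustrative values only.
total_scores = {"Computer": 7.0, "Stationery": 5.0, "Mobile phone": 4.0}
sim_initial = {"Computer": 0.1, "Stationery": 0.9, "Mobile phone": 0.0}
sim_title = {"Computer": 0.2, "Stationery": 0.8, "Mobile phone": 0.0}
best, combined = rerank(total_scores, sim_initial, sim_title,
                        k=2, weights=(0.1, 1.0, 1.0))
```

With these assumed weights the similarities dominate, so a category that ranks second by total score can still be chosen; how the weights are set is left open by the text.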
Preferably, classifying the specified document according to its respective total score under each category specifically includes:
Determining, from the respective total scores of the specified document under each category, the category corresponding to the maximum total score, and classifying the specified document into that category.
Preferably, determining the respective total score of the specified document under each category from the weight score of each entity word under each category specifically includes:
Taking each category in turn as the current category, and summing the weight scores of the entity words under the current category to determine the total score of the specified document under the current category.
Preferably, determining the entity words corresponding to the title of the specified document specifically includes:
Normalizing the title of the specified document;
Performing word segmentation and entity word identification on the normalized title of the specified document to determine the entity words corresponding to the title.
An embodiment of the present application also provides a device for generating a dictionary, the dictionary being used to classify documents in a target network interaction platform into a specified network interaction platform. The device includes an acquiring unit, a determining unit and a generating unit, wherein:
The acquiring unit is configured to obtain the title information of each document under each category of a specified level in the specified network interaction platform;
The determining unit is configured to determine, from the title information of each document, the keywords corresponding to each category and the weight score of each keyword under its corresponding category, where the weight score is used to assess how well the keyword distinguishes its category from the other categories, and a keyword is a word under its corresponding category that is designated according to a predetermined rule;
The generating unit is configured to generate the dictionary from each category, the keywords corresponding to each category, and the weight score of each keyword.
An embodiment of the present application also provides a document classification device, which includes an acquiring unit, a query unit, a determining unit and a classification unit, wherein:
The acquiring unit is configured to obtain the title information of a specified document in a target network interaction platform and to determine the entity words corresponding to the title of the specified document, an entity word being a word that describes an entity;
The query unit is configured to determine, by querying a dictionary, the weight score of each entity word under each category in the dictionary, the dictionary being one generated by the dictionary generating device described above;
The determining unit is configured to determine, from the weight score of each entity word under each category, the respective total score of the specified document under each category;
The classification unit is configured to classify the specified document according to its respective total score under each category.
At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effect:
The title information of the specified document is obtained and entity words are determined from it; the dictionary is then queried to determine the weight score of each entity word under each category; the total score of the specified document under each category is determined from the weight scores; and the specified document is classified according to the total scores. This solves the existing problem that resource integration between network interaction platforms of the same type is difficult because of differing classification standards.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present application and constitute a part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic flowchart of an implementation of the dictionary generation method provided in Embodiment 1 of the present application;
Fig. 2 is a schematic flowchart of a specific implementation, in a practical application scenario, of the dictionary generation method provided in Embodiment 1 of the present application;
Fig. 3 is a schematic flowchart of an implementation of the document classification method provided in Embodiment 2 of the present application;
Fig. 4 is a schematic flowchart of an implementation of classifying the specified document by total score in the document classification method provided in Embodiment 2 of the present application;
Fig. 5 is a schematic structural diagram of the dictionary generating device provided in Embodiment 3 of the present application;
Fig. 6 is a schematic structural diagram of the document classification device provided in Embodiment 4 of the present application.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, fall within the scope of protection of the application.
The technical solutions provided by the embodiments of the application are described in detail below with reference to the drawings.
Embodiment 1
Embodiment 1 provides a method for generating a dictionary. The dictionary generated by this method is used to classify documents in a target network interaction platform into a specified network interaction platform, which lays the foundation for resource integration between network interaction platforms of the same type. The flow of the method is shown in Fig. 1 and includes the following steps:
Step S11: Obtain the title information of each document under each category of a specified level in the specified network interaction platform.
The specified network interaction platform mentioned here is defined relative to the target network interaction platform: the target network interaction platform is the platform whose resources are being integrated, and the specified network interaction platform is the platform that receives the integrated resources. For example, when the resources of a first online gaming platform are integrated into a second online gaming platform, the first online gaming platform is the target network interaction platform and the second online gaming platform is the specified network interaction platform. The types of the specified and target network interaction platforms are not limited here; they may be e-commerce platforms, online gaming platforms, and so on.
In a network interaction platform, to help users find the resources they need, resources are usually divided into multiple classification categories according to some standard, and these categories may span several levels. For example, in an e-commerce platform the first-level categories include books and audio-visual products, electronic appliances, home building materials, and so on; the first-level category of books and audio-visual products in turn includes second-level categories such as children's books and musical instruments. The categories of the specified level may be first-level categories, second-level categories, leaf categories, or categories of any other specified level. In practice, a preferred choice for the categories of the specified level is first-level categories and/or leaf categories, where a leaf category is a sub-category at the very end of the category hierarchy of the specified network interaction platform, that is, a category that the specified network interaction platform does not subdivide any further.
In a network interaction platform, resources are generally represented on the server side in the form of documents. For example, on an e-commerce platform each commodity generally has its own corresponding document on the server; the document corresponding to a commodity includes the document title, a description of the classification category of the commodity, and the address of the document on the network.
Through this step, the title information of each document under each category of the specified level in the specified network interaction platform can be obtained.
Step S12: From the title information of each document, determine the keywords corresponding to each category and the weight score of each keyword under its corresponding category.
Here, the weight score is used to assess how well the keyword distinguishes its category from the other categories, and a keyword is a word under its corresponding category that is designated according to a predetermined rule.
In this step, the titles of the documents are processed to determine the keywords corresponding to each category and the weight scores of these keywords under their corresponding categories. In practice there are many ways to process the document titles and determine the keywords corresponding to each category; one preferred scheme is shown in step 121.
Step 121: Normalize the title of each document; perform word segmentation and entity word identification on the normalized title of each document to obtain entity words, an entity word being a word that describes an entity; determine the term frequency of each entity word corresponding to each category, the term frequency of an entity word describing the number of times, or the frequency with which, the entity word occurs in the corresponding category; take each entity word in turn as the current entity word and judge whether its term frequency is greater than a predetermined threshold, and if so, determine the current entity word as a keyword of the corresponding category.
In practice, normalization may include letter case conversion, conversion between traditional and simplified Chinese characters, tense conversion, and so on. Word segmentation of a document title can generally be done with a word segmentation tool; a commonly used tool is word2vec. Entity word identification can generally be done with an entity dictionary; an entity word is a word that describes an entity, such as a commodity name, a trademark, a title or a game name. After entity word identification, the term frequency of each entity word under each category is counted; the term frequencies are then compared with the predetermined threshold, and the entity words whose term frequency exceeds the threshold are selected as keywords of the corresponding category. For example, suppose the electronic appliances category has three entity words, "computer", "mobile phone" and "charger", with term frequencies 300, 350 and 100 respectively. The predetermined threshold can be set as needed; if it is set to 200, then "computer" and "mobile phone" are determined as keywords of the electronic appliances category.
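The threshold comparison in the example above can be sketched as follows. The function name select_keywords is an assumption made for illustration; the term frequencies and the threshold are the ones given in the example.

```python
def select_keywords(entity_freq, threshold):
    """Return the entity words whose term frequency exceeds the threshold.

    entity_freq maps each entity word of one category to its term
    frequency in that category. The function name is hypothetical.
    """
    return [word for word, freq in entity_freq.items() if freq > threshold]


# Term frequencies under the electronic appliances category (from the example).
freq = {"computer": 300, "mobile phone": 350, "charger": 100}
keywords = select_keywords(freq, threshold=200)
```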
It should be noted that in step 121 the term frequency of even the most frequent entity word under some category may be below the predetermined threshold, in which case that category would be left without keywords and thus excluded. One way to handle this situation is to lower the predetermined threshold appropriately. A preferred scheme in practice is as follows: when the term frequencies of all the entity words corresponding to the current category are below the predetermined threshold, the entity word with the highest term frequency in the current category is determined as the keyword of the current category, the current category being any one of the categories. For example, suppose the most frequent entity word under the musical instruments category is "guitar", with a term frequency of 86 under that category. If the predetermined threshold is 100, "guitar" can be determined as the keyword of the musical instruments category to prevent the category from being excluded.
It should also be noted that, in step 121, the titles of the documents may instead be segmented first, with normalization and entity word identification then applied to the segmentation result, to obtain the entity words that describe entities.
Step S12 also determines the weight score of each keyword under its corresponding category. The term frequency of a keyword under its corresponding category can be used directly as the weight score, or the weight score can be determined in other ways. In practice, one preferred scheme for determining the weight score of a keyword under its corresponding category is shown in step 122.
Step 122: Determine the term frequency of the keyword in the corresponding category; determine the inverse document frequency of the keyword from the number of documents, under all the categories, whose titles contain the keyword and the total number of documents under all the categories; determine the weight score of the keyword under the corresponding category from the term frequency of the keyword in that category and the inverse document frequency of the keyword.
In step 122, to determine the weight score of a keyword under its corresponding category, first determine the term frequency (Term Frequency, TF) of the keyword under that category; then determine the number n of documents, under all the categories, whose titles contain the keyword and the total number N of documents under all the categories, and determine the inverse document frequency (Inverse Document Frequency, IDF) of the keyword from n and N; finally, determine the weight score of the keyword under the corresponding category from the term frequency of the keyword under that category and the inverse document frequency of the keyword.
In practice, the inverse document frequency of the keyword can be determined by the following formula:
IDF = log(N/n)
where N is the total number of documents under all the categories, and n is the number of documents, under all the categories, whose titles contain the keyword. log(N/n) denotes the logarithm of N/n; the base of the logarithm can be set as needed, and in practice log(N/n) is typically the natural logarithm ln(N/n).
In practice, the weight score of the keyword under its corresponding category can be obtained by multiplying the term frequency of the keyword under that category by the inverse document frequency of the keyword.
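The formula and the multiplication above can be sketched as follows, using the natural logarithm as the text suggests is typical. The function names and the sample counts are assumptions for illustration.

```python
import math


def inverse_document_frequency(total_docs, docs_with_keyword):
    """IDF = ln(N / n), where N is the total number of documents under all
    categories and n is the number whose titles contain the keyword."""
    return math.log(total_docs / docs_with_keyword)


def weight_score(term_frequency, total_docs, docs_with_keyword):
    """Weight score = TF (within the category) multiplied by IDF."""
    return term_frequency * inverse_document_frequency(total_docs, docs_with_keyword)


# Illustrative counts: N = 10000 documents, n = 500 titles contain the
# keyword, and the keyword's term frequency in its category is 300.
score = weight_score(300, 10000, 500)
```

A keyword that appears in the titles of documents across every category has n close to N, so its IDF and weight score approach zero, which matches the intent that the weight score measures how well the keyword separates its category from the others.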
Step S13: Generate the dictionary from each category, the keywords corresponding to each category, and the weight score of each keyword.
The dictionary is generated from the categories, the keywords corresponding to each category, and the weight score of each keyword under its corresponding category. The dictionary may include three attributes: keyword, classification category and weight score.
One preferred scheme for generating the dictionary is to take each keyword in turn as the current keyword, generate an association group from the current keyword, the category corresponding to the current keyword, and the weight score of the current keyword under that category, and then generate the dictionary from the association groups.
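One possible in-memory form of the association groups and the resulting dictionary is sketched below. The nested-mapping layout and the function name are assumptions; the sample triples mirror the values listed in Table 1.

```python
def build_dictionary(groups):
    """Build a dictionary from (keyword, category, weight score) groups.

    The result maps keyword -> {category: weight score}, so that the same
    keyword can carry different scores under different categories.
    """
    dictionary = {}
    for keyword, category, score in groups:
        dictionary.setdefault(keyword, {})[category] = score
    return dictionary


# Association groups with the values from Table 1.
groups = [
    ("notebook", "Computer", 7),
    ("apple", "Mobile phone", 4),
    ("notebook", "Stationery", 5),
]
dictionary = build_dictionary(groups)
```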
With the dictionary generation method provided in Embodiment 1, the title information of each document under each category of the specified level in the specified network interaction platform is obtained; the keywords corresponding to each category and the weight score of each keyword under its corresponding category are determined from the title information; and the dictionary is then generated from the categories, the keywords corresponding to each category, and the weight scores of the keywords. The dictionary, generated from the documents under each category of the specified network interaction platform, is used to classify documents in the target network interaction platform into the specified network interaction platform, which lays the foundation for resource integration between network interaction platforms of the same type.
For ease of understanding, the embodiment of the present application also provides a schematic diagram of the dictionary generation method implemented in a practical application scenario, as shown in Fig. 2. In this scenario, the specified network interaction platform is e-commerce platform A, the target network interaction platform is e-commerce platform B, and the categories of the specified level are leaf categories. The method may specifically include the following steps:
Step S21: Obtain the title information of each document under each leaf category in e-commerce platform A.
Step S22: Normalize the title of each document.
Step S23: Perform word segmentation and entity word identification on the normalized title of each document to obtain entity words, an entity word being a word that describes an entity.
Step S24: Determine the term frequency of each entity word corresponding to each leaf category.
Step S25: Take each entity word in turn as the current entity word and judge whether its term frequency is greater than the predetermined threshold; if so, perform step S26.
Step S26: Determine the current entity word as a keyword of the corresponding leaf category.
Step S27: Determine the inverse document frequency of the keyword from the number of documents, under all the leaf categories, whose titles contain the keyword and the total number of documents under all the leaf categories.
Step S28: Determine the weight score of the keyword under the corresponding leaf category from the term frequency of the keyword in that leaf category and the inverse document frequency of the keyword.
Step S29: Generate the dictionary from each leaf category, the keywords corresponding to each leaf category, and the weight score of each keyword.
In practice, a dictionary generated by this method is shown in Table 1.
Table 1. A dictionary in a practical application.
Keyword | Leaf category | Weight score
Notebook | Computer | 7
Apple | Mobile phone | 4
Notebook | Stationery | 5
It should be noted that all the steps of the method provided in Embodiment 1 may be executed by the same device, or different steps may be executed by different devices. For example, steps S11 and S12 may both be executed by device 1; alternatively, step S11 may be executed by device 1 and step S12 by device 2; and so on.
Embodiment 2
Embodiment 2 provides a document classification method. The method uses the dictionary generated in Embodiment 1 to classify documents in a target network interaction platform into a specified network interaction platform, thereby solving the existing problem that resource integration between network interaction platforms of the same type is difficult because of differing classification standards. The flow of the method is shown in Fig. 3 and includes the following steps:
Step S31: Obtain the title information of a specified document in the target network interaction platform and determine the entity words corresponding to the title of the specified document, an entity word being a word that describes an entity.
The target network interaction platform mentioned here is the platform whose resources are being integrated. For example, when the resources of a first online gaming platform are integrated into a second online gaming platform, the first online gaming platform is the target network interaction platform. The type of the target network interaction platform is not limited here; it may be an e-commerce platform, an online gaming platform, and so on.
In a network interaction platform, resources are generally represented on the server side in the form of documents. For example, on an e-commerce platform each commodity generally has its own corresponding document on the server; the document corresponding to a commodity includes the document title, a description of the classification category of the commodity, and the address of the document on the network. A specified document is a document in the target network interaction platform that corresponds to a resource to be integrated into the specified network interaction platform.
From the title information of the specified document, the entity words corresponding to its title are determined; an entity word is a word that describes an entity, such as a commodity name, a trademark, a title or a game name.
In practice there are many ways to determine the entity words corresponding to the title of the specified document; one preferred scheme is shown in step S311.
Step S311: Normalize the title of the specified document; perform word segmentation and entity word identification on the normalized title of the specified document to determine the entity words corresponding to the title.
In practice, normalization may include letter case conversion, conversion between traditional and simplified Chinese characters, tense conversion, and so on. Word segmentation of the title of the specified document can generally be done with a word segmentation tool; a commonly used tool is word2vec. Entity word identification can generally be done with an entity dictionary. After entity word identification, the term frequency of each entity word under each category is counted; the term frequencies are then compared with the predetermined threshold, and the entity words whose term frequency exceeds the threshold are selected as keywords of the corresponding category.
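The extraction of entity words from a title can be sketched as follows. This is a deliberately simplified stand-in: lower-casing represents normalization, whitespace splitting represents the word segmentation tool, and a plain set represents the entity dictionary. All names and the sample title are assumptions, and real titles (for example Chinese ones) would need a proper segmenter.

```python
def extract_entity_words(title, entity_dictionary):
    """Normalize a title, split it into tokens, and keep the tokens that
    appear in the entity dictionary.

    Simplified sketch: lower-casing stands in for normalization and
    whitespace splitting stands in for a real word segmentation tool.
    """
    normalized = title.lower()
    tokens = normalized.split()
    return [token for token in tokens if token in entity_dictionary]


# Hypothetical entity dictionary and title.
entity_dictionary = {"apple", "notebook", "charger"}
entity_words = extract_entity_words("Apple Notebook 13 inch", entity_dictionary)
```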
The entity words determined by step S311 are more accurate and standardized because they have undergone normalization and entity word identification.
Step S32: By querying the dictionary, determine the weight score of each entity word under each category in the dictionary.
The dictionary is one generated by the dictionary generation method provided in Embodiment 1. The dictionary usually includes three attributes: keyword, classification category and weight score, where the weight score is used to assess how well the keyword distinguishes its category from the other categories, and a keyword is a word under its corresponding category that is designated according to a predetermined rule.
By querying the dictionary, the weight score of each entity word under each category of the dictionary is determined. In practice, an entity word is often absent from some categories; its weight score under those categories can then be set to a predetermined constant, which is usually 0. For example, the entity word "rice" does not appear under the category "computer" in the dictionary, so the weight score of "rice" under that category can be set to the predetermined constant.
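The lookup with a predetermined constant for missing entries might be sketched as follows. Storing the dictionary as a mapping from (keyword, category) pairs to weight scores is an assumption made for illustration, and the weight values are hypothetical.

```python
DEFAULT_WEIGHT = 0.0  # the predetermined constant used when an entry is missing

# Hypothetical dictionary: (keyword, category) -> weight score.
dictionary = {
    ("apple", "mobile phone"): 0.8,
    ("apple", "fruit"): 0.6,
    ("rice", "food"): 0.9,
}

def weight_scores(entity_word, categories, dictionary, default=DEFAULT_WEIGHT):
    """Weight score of one entity word under every category, with a default for gaps."""
    return {c: dictionary.get((entity_word, c), default) for c in categories}

categories = ["mobile phone", "fruit", "food", "computer"]
rice_scores = weight_scores("rice", categories, dictionary)
```

In this sketch "rice" receives the constant 0 under "computer", matching the example in the text.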
Step S33: Determine the total score of the specified document under each classification category from the weight scores of the entity words under each category.
This step determines the total score of the specified document under each category of the dictionary. The process is illustrated with one of those categories, referred to here as the current category. Generally, the total score of the specified document under the current category is determined from the weight scores, under the current category, of the entity words in the document title.
In practice, there are several ways to determine the total score under the current category from these weight scores. For example, the weight scores of the entity words can be combined in a weighted sum, where the weight assigned to each entity word is determined by its type, such as title, place name, or brand word. Alternatively, each category can be taken in turn as the current category, and the weight scores of the entity words under it summed to give the total score of the specified document under that category.
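Both variants of the total score can be sketched with one function, assuming each entity word is passed together with its type. The type labels, the type weights, and the dictionary entries below are hypothetical values chosen for illustration.

```python
def total_scores(entity_words, categories, dictionary, type_weights=None, default=0.0):
    """Total score per category from (word, type) pairs.

    type_weights maps an entity word type to its weight; a missing type gets
    weight 1.0, so omitting type_weights gives the plain sum.
    """
    type_weights = type_weights or {}
    return {
        category: sum(
            type_weights.get(word_type, 1.0) * dictionary.get((word, category), default)
            for word, word_type in entity_words
        )
        for category in categories
    }

# Hypothetical dictionary entries: (keyword, category) -> weight score.
dictionary = {
    ("apple", "mobile phone"): 0.8,
    ("iphone4s", "mobile phone"): 0.9,
    ("data cable", "accessories"): 0.7,
}
words = [("apple", "brand word"), ("iphone4s", "name"), ("data cable", "name")]
categories = ["mobile phone", "accessories"]

plain = total_scores(words, categories, dictionary)
weighted = total_scores(words, categories, dictionary, type_weights={"brand word": 0.5})
```

Lowering the weight of brand words reduces the "mobile phone" total in this sample but does not change the "accessories" total, since no brand word contributes there.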
Step S34: Classify the specified document according to its total score under each classification category.
After the total score of the specified document under each category is determined, the document is classified according to those total scores.
There are several ways to do this. For example, the categories can be ranked by total score and the specified document assigned to the category with the highest total score; that is, the category corresponding to the maximum total score of the specified document is determined from its total scores under all categories, and the specified document is classified into that category.
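A minimal sketch of classification by maximum total score follows; the function name and the scores are hypothetical. How ties are broken is not stated in the text, so this sketch simply keeps the first category that reaches the maximum.

```python
def classify_by_max_total(totals):
    """Return the category with the highest total score (first one wins on ties)."""
    return max(totals, key=totals.get)

# Hypothetical total scores of one specified document.
totals = {"mobile phone": 1.7, "accessories": 0.7, "fruit": 0.6}
best = classify_by_max_total(totals)
```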
Furthermore, various complications arise in practice, so classifying by total score alone may not be accurate enough. For example, for the document title "apple iphone4s data cable", the entity words obtained after segmentation are "apple", "iphone4s", and "data cable". Because "apple" and "iphone4s" both have high weight scores under the category "mobile phone", the document is usually classified under "mobile phone", even though the commodity is a data cable. Therefore, as shown in Fig. 4, a preferred scheme for classifying the specified document by its total scores under each category is given in steps S341 to S345.
Step S341: Arrange the categories in a queue in descending order of the specified document's total score, and select a predetermined number of categories from the head of the queue in queue order.
The predetermined number is at least two. In practice, multiple categories can be selected as needed, for example the 100 categories with the highest total scores.
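Step S341 can be sketched as a sort followed by a slice. The function name `preselect` and the sample scores are hypothetical; the check on `n` reflects the requirement that at least two categories are selected.

```python
def preselect(totals, n):
    """Queue the categories by descending total score and take the first n."""
    if n < 2:
        raise ValueError("the predetermined number must be at least two")
    queue = sorted(totals, key=totals.get, reverse=True)
    return queue[:n]

# Hypothetical total scores of one specified document.
totals = {"fruit": 0.6, "mobile phone": 1.7, "accessories": 0.7}
preselected = preselect(totals, 2)
```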
Step S342: Obtain the initial category of the specified document in the target network interaction platform.
The initial category to which the specified document belongs in the target network interaction platform usually also carries information useful for classifying the document, so the information of this initial category can be obtained.
In addition, document categories in a network interaction platform generally have multiple levels, so the specified document may have multiple initial categories. A preferred way of choosing the initial category is to select the one at the same level as the classification category. For example, when the classification category is a first-level category, the selected initial category is also a first-level category; when the classification category is a leaf category, the selected initial category is also a leaf category.
Step S343: Determine the first similarity between each of the selected categories and the initial category; the first similarity describes how similar the initial category is to each category.
There are several ways to determine the first similarity between a category and the initial category, for example by word matching or by other means. A preferred scheme is to determine the Jaccard similarity between each of the selected categories and the initial category.
Taking the words in the classification category as set A and the words in the initial category as set B, the Jaccard similarity of A and B is computed as follows:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
where |A ∩ B| is the number of words in the intersection of A and B, and |A ∪ B| is the number of words in their union. Jaccard(A, B) lies in the interval [0, 1], and Jaccard(A, B) = 1 when A = B, that is, when A and B are identical.
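The Jaccard computation above can be written directly with set operations. The handling of two empty sets is an assumption of this sketch, since the formula is undefined in that case, and the sample word lists are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as having similarity 0
    return len(a & b) / len(union)

# Hypothetical word sets for a category and an initial category.
category_words = ["mobile", "phone", "accessories"]
initial_words = ["phone", "accessories"]
first_similarity = jaccard(category_words, initial_words)
```

In this sample the intersection has two words and the union has three, so the similarity is 2/3.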
Step S344: Determine the second similarity between each of the selected categories and the title of the specified document; the second similarity describes how similar the title of the specified document is to each category.
The second similarity can be determined in the same way as the first similarity or in a different way. In practice, a preferred approach is to determine the Jaccard similarity between the category and the title of the specified document.
Step S345: Take each of the selected categories as a preselected category, compute a weighted sum of the total score, the first similarity, and the second similarity corresponding to that preselected category, and classify the specified document according to the results of the weighted sums.
That is, for each preselected category, the total score, the first similarity, and the second similarity are combined in a weighted sum, and the specified document is classified by the result. The weights assigned to the total score, the first similarity, and the second similarity can be determined according to the actual situation; in general, all three weights can simply be set to 1.
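The weighted combination of step S345 might look like the sketch below, which assumes that the document is assigned to the preselected category with the largest weighted sum. The function name `rerank` and all scores and similarities are hypothetical illustration values.

```python
def rerank(preselected, totals, first_sim, second_sim, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of total score and both similarities for each preselected category."""
    w_total, w_first, w_second = weights
    combined = {
        c: w_total * totals[c] + w_first * first_sim[c] + w_second * second_sim[c]
        for c in preselected
    }
    # Assumption: the category with the largest weighted sum is chosen.
    return max(combined, key=combined.get), combined

# Hypothetical inputs for the title "apple iphone4s data cable".
preselected = ["mobile phone", "accessories"]
totals = {"mobile phone": 1.7, "accessories": 0.7}
first_sim = {"mobile phone": 0.0, "accessories": 1.0}   # versus the initial category
second_sim = {"mobile phone": 0.0, "accessories": 0.5}  # versus the document title
best, combined = rerank(preselected, totals, first_sim, second_sim)
```

With all weights set to 1, the two similarities lift "accessories" above "mobile phone" in this sample, which is the kind of correction the preferred scheme is meant to provide.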
In practice, to check the accuracy of the classification results, the results can also be sampled and inspected, for example by sampling per first-level category to verify the prediction hit rate.
With the document classification method provided in Embodiment 2, documents in the target network interaction platform are classified into the specified network interaction platform using the dictionary generated in Embodiment 1, which solves the existing problem that differing classification standards make resource integration between homogeneous network interaction platforms difficult.
Embodiment 3
Based on the same inventive concept as Embodiment 1, Embodiment 3 provides a dictionary generating device. The dictionary generated by the device is used to classify documents in the target network interaction platform into the specified network interaction platform, thereby laying a foundation for resource integration between homogeneous network interaction platforms. As shown in Fig. 5, the device 40 includes an acquiring unit 401, a determining unit 402, and a generating unit 403, wherein:
The acquiring unit 401 is configured to obtain the title information of each document under each classification category of a specified level in the specified network interaction platform;
The determining unit 402 is configured to determine, from the title information of each document, the keywords corresponding to each category and the weight score of each keyword under its category, where the weight score measures how well the keyword distinguishes its category from the other categories, and a keyword is a word specified under its corresponding category according to a predetermined rule;
The generating unit 403 is configured to generate the dictionary from the categories, the keywords corresponding to each category, and the weight scores of the keywords.
With the dictionary generating device provided in Embodiment 3, the acquiring unit obtains the title information of each document under each category of the specified level in the specified network interaction platform; the determining unit determines, from that title information, the keywords corresponding to each category and the weight score of each keyword under its category; and the generating unit generates the dictionary from the categories, their keywords, and the keyword weight scores. The dictionary that the device generates from the documents under each category of the specified network interaction platform is used to classify documents in the target network interaction platform into the specified network interaction platform, thereby laying a foundation for resource integration between homogeneous network interaction platforms.
Embodiment 4
Based on the same inventive concept as Embodiment 2, Embodiment 4 provides a document classification device. The device uses the dictionary generated in Embodiment 3 to classify documents in the target network interaction platform into the specified network interaction platform, which solves the existing problem that differing classification standards make resource integration between homogeneous network interaction platforms difficult. As shown in Fig. 6, the device 50 includes an acquiring unit 501, a query unit 502, a determining unit 503, and a classification unit 504, wherein:
The acquiring unit 501 is configured to obtain the title information of the specified document in the target network interaction platform and to determine each entity word corresponding to the title of the specified document, where an entity word describes a concrete thing;
The query unit 502 is configured to query the dictionary to determine the weight score of each entity word under each classification category in the dictionary, the dictionary being generated by the dictionary generating device according to claim 12;
The determining unit 503 is configured to determine the total score of the specified document under each category from the weight scores of the entity words under each category;
The classification unit 504 is configured to classify the specified document according to its total score under each category.
With the document classification device provided in Embodiment 4, the acquiring unit obtains the title information of the specified document and determines the entity words from it; the query unit queries the dictionary generated in Embodiment 3 to determine the weight score of each entity word under each category; the determining unit determines the total score of the specified document under each category from the weight scores; and the classification unit classifies the document by total score. This solves the existing problem that differing classification standards make resource integration between homogeneous network interaction platforms difficult.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.