CN103309857B - Taxonomy determination method and device - Google Patents

Taxonomy determination method and device


Publication number
CN103309857B
Authority
CN
China
Prior art keywords
entry
taxonomy
characteristic
weights
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210056669.3A
Other languages
Chinese (zh)
Other versions
CN103309857A (en)
Inventor
贺翔
亓超
毛少林
翟俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shiji Guangsu Information Technology Co Ltd filed Critical Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority to CN201210056669.3A priority Critical patent/CN103309857B/en
Publication of CN103309857A publication Critical patent/CN103309857A/en
Application granted granted Critical
Publication of CN103309857B publication Critical patent/CN103309857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a taxonomy determination method and device. The method includes: obtaining a preset number of input samples from a database to form an input sample set, where each input sample includes the entry name, classification information, and related entry information of an entry; obtaining feature samples from the input sample set according to preset seed words to form a feature sample set; determining classification feature words according to the feature sample set; and determining a taxonomy and its category according to the classification feature words and a text to be selected. The invention improves the efficiency and accuracy of taxonomy acquisition.

Description

Taxonomy determination method and device
Technical field
The present invention relates to the field of Internet technology applications, and in particular to a taxonomy determination method and device.
Background
Automatic text classification means that a computer program automatically labels a text set (or other data) with categories according to a given classification system or standard.
To enable a computer program to label a text set automatically, the program must be trained with a large taxonomy. Here, a taxonomy (that is, a classification corpus) refers to a large collection of texts carrying category labels, from which the computer program (such as a classifier) learns the labeling rules through training.
In the prior art, there are two main ways to obtain a taxonomy:
(1) manual annotation, that is, manually labeling a large number of texts with categories;
(2) directed crawling, that is, using an automatic crawler or similar means to fetch data on the Internet that has already been classified; for example, when a film and television taxonomy is needed, it can be crawled from the databases of film and television websites on the Internet.
In implementing the present invention, the inventors found that the prior art has at least the following defects:
Manual annotation costs a great deal of labor and time and is inefficient; directed crawling cannot guarantee the accuracy of the taxonomy, that is, it cannot guarantee that a text set obtained from the database of a film and television website actually is a film and television corpus.
Summary of the invention
The present invention provides a method and device for determining a taxonomy, so as to improve the efficiency and accuracy of taxonomy acquisition.
To achieve the above object, an embodiment of the present invention provides a taxonomy determination method, including:
obtaining a preset number of input samples from a database to form an input sample set, where each input sample includes the entry name, classification information, and related entry information of an entry;
obtaining feature samples from the input sample set according to preset seed words to form a feature sample set;
determining classification feature words according to the feature sample set;
determining a taxonomy and its category according to the classification feature words and a text to be selected.
An embodiment of the present invention also provides a taxonomy determination device, including:
a first acquisition module, configured to obtain a preset number of input samples from a database to form an input sample set, where each input sample includes the entry name, classification information, and related entry information of an entry;
a second acquisition module, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;
a first determining module, configured to determine classification feature words according to the feature sample set;
a second determining module, configured to determine a taxonomy and its category according to the classification feature words and a text to be selected.
Compared with the prior art, the embodiments of the present invention have the following advantages:
A certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set; feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined according to the obtained feature sample set; a taxonomy and its category are then determined according to the obtained classification feature words and a text to be selected. This improves the efficiency and accuracy of taxonomy acquisition.
Description of the drawings
Fig. 1 is a flow diagram of a taxonomy determination method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of obtaining feature samples in the technical solution provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of the taxonomy determination method in a specific application scenario provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a taxonomy determination device provided by an embodiment of the present invention.
Detailed description
In view of the above defects of the prior art, an embodiment of the present invention provides a technical solution for determining a taxonomy. In this technical solution, a certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set; feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined according to the obtained feature sample set; a taxonomy and its category are then determined according to the obtained classification feature words and a text to be selected, which improves the efficiency and accuracy of taxonomy acquisition.
In the technical solution provided by the embodiment of the present invention, the database from which the input sample set is obtained may be Baidupedia (Baidu Baike), Wikipedia, WordNet, and the like. The input sample set obtained from the database may include the entry name, classification information, and related entry information of each entry, in the format shown in Table 1:
Table 1
Entry | Classification | Related entries
Metrical poem | Literature, poetry | Prose poem, regulated verse, free verse
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, a flow diagram of a taxonomy determination method provided by an embodiment of the present invention, the method may include the following steps:
Step 101: obtain a preset number of input samples from a database to form an input sample set.
Specifically, take mining a taxonomy from Baidupedia as an example. In the technical solution provided by the embodiment of the present invention, a preset number (such as 1000) of input samples can be obtained from Baidupedia, in the format shown in Table 1.
Step 102: obtain feature samples from the input sample set according to preset seed words to form a feature sample set.
Specifically, in the technical solution provided by the embodiment of the present invention, when a taxonomy needs to be obtained, a certain number of seed words can be chosen in advance. For example, when a sports taxonomy is needed, 10 sports seed words can be chosen in advance, such as sport, football, athlete, track and field, World Cup, and Olympic Games. After the input samples are obtained and the seed words are chosen, feature samples can be obtained from the input sample set according to the seed words to form a feature sample set.
In the technical solution provided by the embodiment of the present invention, the flow for obtaining feature samples can be as shown in Fig. 2 and may include the following steps:
Step 102A: obtain, from the input sample set, the feature samples that contain the current seed words.
For example, if the seed words chosen in advance are football, basketball, and athlete, the feature samples containing the current seed words are obtained from the input sample set according to these seed words. A feature sample containing a seed word may be one whose entry is football, basketball, or athlete, or one whose related entries include one of these seed words.
Step 102B: judge whether the number of feature samples exceeds a first threshold; if yes, end the flow; otherwise, go to step 102C.
The feature sample number threshold can be set according to actual needs, for example 10000.
Step 102C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step 102A.
Specifically, when the number of feature samples obtained is below the preset threshold, the entries and related entries in the feature samples already obtained can be added to the seed words, and more feature samples can be obtained from the input sample set according to the updated seed words.
A sufficient number of feature samples can be obtained through the above flow.
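The loop of steps 102A to 102C can be sketched in Python as follows. This is only an illustrative sketch: the dictionary keys `entry`, `category`, and `related`, the function name, and the extra stop condition when no new seed word appears are assumptions that do not come from the patent text.

```python
def collect_feature_samples(samples, seeds, min_count):
    """Iteratively gather samples whose entry or related entries hit a seed word.

    samples: list of dicts with hypothetical keys "entry", "category", "related".
    seeds: initial seed words; min_count: the first threshold.
    """
    seeds = set(seeds)
    while True:
        # Step 102A: keep the samples that contain a current seed word.
        features = [
            s for s in samples
            if s["entry"] in seeds or seeds.intersection(s["related"])
        ]
        # Step 102B: stop once the number of feature samples exceeds the threshold.
        if len(features) > min_count:
            return features, seeds
        # Step 102C: add entries and related entries to the seed words.
        expanded = set(seeds)
        for s in features:
            expanded.add(s["entry"])
            expanded.update(s["related"])
        # Assumption (not in the source): stop when no new seed word appears,
        # otherwise the loop could never end on a small input sample set.
        if expanded == seeds:
            return features, seeds
        seeds = expanded
```

With a single seed word such as "football", each pass pulls in the samples reachable through related entries, so unrelated samples (for example a poetry entry) are never selected.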
Step 103: determine classification feature words according to the obtained feature sample set.
Specifically, in the embodiment of the present invention, after the feature samples are obtained, the weight of the entry contained in each feature sample can be determined, and the classification feature words can be determined according to the weight of each entry.
Taking the case where the weight of an entry is its discrimination as an example, in the embodiment of the present invention the input sample set is taken as the full set, and two sets are further determined from the feature sample set:
Set 1: all entries in the feature sample set;
Set 2: all related entries in the feature sample set.
For a word W in set 2, its discrimination is defined as:
$$Q_W = \frac{\text{number of times } W \text{ occurs in set 2}}{\text{number of times } W \text{ occurs in the full set}}$$
For a word x in set 1, its discrimination is defined as the mean of the discrimination of all its related entries:
$$Q_x = \frac{1}{n}\sum_{i=1}^{n} Q_{W_i}$$
where n is the number of related entries in the feature sample whose entry is x, and $Q_{W_i}$ is the discrimination of the i-th related entry.
After the discrimination of each entry in the feature sample set is determined, the entries whose discrimination exceeds a threshold (such as K) can be determined to be classification feature words.
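A minimal Python sketch of the discrimination computation described above is given here. The field names `entry` and `related` and the function names are hypothetical, and counting an occurrence in the full set as any appearance of the word as an entry or as a related entry in the input sample set is an assumption. The sketch also assumes that the feature samples are a subset of the input samples, so the denominator is never zero.

```python
from collections import Counter


def related_discrimination(input_samples, feature_samples):
    """Q_W = occurrences of W in set 2 / occurrences of W in the full set."""
    in_set2 = Counter(w for s in feature_samples for w in s["related"])
    # Assumption: an occurrence in the full set is any appearance of W as an
    # entry or as a related entry in the input sample set.
    in_full = Counter()
    for s in input_samples:
        in_full[s["entry"]] += 1
        in_full.update(s["related"])
    return {w: in_set2[w] / in_full[w] for w in in_set2}


def entry_discrimination(feature_samples, q_related):
    """Q_x = mean discrimination of the related entries of entry x."""
    return {
        s["entry"]: sum(q_related[w] for w in s["related"]) / len(s["related"])
        for s in feature_samples
        if s["related"]  # skip entries without related entries
    }


def classification_feature_words(q_entry, k):
    """Keep the entries whose discrimination exceeds the threshold K."""
    return {x for x, q in q_entry.items() if q > k}
```

A related entry that appears almost only inside the feature sample set gets a discrimination close to 1, while a word that is common across the whole input sample set is pushed toward 0.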
Step 104: determine a taxonomy and its category according to the classification feature words and a text to be selected.
Specifically, after the classification feature words are determined, a text to be selected can be chosen and segmented into words; the classification feature words contained in the text are obtained, and the weight of the text is determined from them. When the weight of the text to be selected exceeds a threshold, the text is determined to be a taxonomy, and the category to which the corresponding seed words belong is taken as the category of the taxonomy.
The weight of the text to be selected is determined according to the obtained classification feature words, specifically by the following formula:
where tf is the word frequency, in the text to be selected, of a classification feature word that occurs in that text; n is the number of classification feature words; Q_i is the weight of the i-th classification feature word; and N is the number of words in the text to be selected.
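The formula itself is not reproduced in this text, so the following Python sketch assumes one plausible form that uses exactly the variables defined above: the sum of tf_i × Q_i over the classification feature words found in the text, divided by N. The function names, the already segmented `tokens` input, and the use of the token count as N are all assumptions, not the patent's stated formula.

```python
from collections import Counter


def text_weight(tokens, feature_weights):
    """Assumed form: sum(tf_i * Q_i) / N over feature words found in the text.

    tokens: the text to be selected, already segmented into words.
    feature_weights: mapping from classification feature word to its weight Q_i.
    """
    if not tokens:
        return 0.0
    tf = Counter(tokens)  # word frequency of every token in the text
    score = sum(tf[w] * q for w, q in feature_weights.items() if w in tf)
    return score / len(tokens)  # N taken as the token count (assumption)


def is_taxonomy(tokens, feature_weights, threshold):
    """A text is kept when its weight exceeds the threshold."""
    return text_weight(tokens, feature_weights) > threshold
```

Dividing by N keeps long texts from passing the threshold merely because they contain more words.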
To further improve the accuracy of the taxonomy obtained, in the technical solution provided by the embodiment of the present invention, after the taxonomy is determined, it can also be divided into multiple parts; corpus cross-validation is performed on the parts, and the final taxonomy and its category are determined.
Corpus cross-validation on the parts of the taxonomy can be realized through the following flow:
Step A1: select one not-yet-selected part of the taxonomy as test data;
Step B1: verify the category of the test data using each of the remaining parts of the taxonomy in turn;
Step C1: count the number of correct verifications, and when it exceeds a fifth threshold, determine the test data to be a final taxonomy;
Step D1: judge whether any not-yet-selected part of the taxonomy remains; if yes, go to step A1; otherwise, end the flow.
For example, the determined taxonomy can be divided into 10 parts; in turn, 9 parts are used as training data and 1 part as test data to verify the category of the test data, so that each part of test data undergoes 9 classification tests; the test data whose number of correct category verifications exceeds the threshold is determined to be the final taxonomy.
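Steps A1 to D1 can be sketched in Python as follows. The patent text does not say which classifier performs the verification, so `train` is a hypothetical callable supplied by the caller that turns one part into a prediction function; the function and parameter names are likewise assumptions.

```python
def cross_validate(parts, train, min_correct):
    """Keep the items whose label is confirmed by enough of the other parts.

    parts: list of lists of (text, label) pairs.
    train: hypothetical callable mapping one part to predict(text) -> label.
    min_correct: the fifth threshold on the number of correct verifications.
    """
    # Train one model per part once, then reuse it for every other part.
    models = [train(part) for part in parts]
    kept = []
    for i, test_part in enumerate(parts):      # steps A1 and D1: each part in turn
        for text, label in test_part:
            correct = sum(                     # steps B1 and C1: count agreements
                1
                for j, predict in enumerate(models)
                if j != i and predict(text) == label
            )
            if correct > min_correct:
                kept.append((text, label))
    return kept
```

With 10 parts, every item is checked by the 9 models trained on the other parts, which matches the 9 classification tests mentioned in the example.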
It should be noted that the discrimination-based method given in the above flow is only one specific way of determining entry weights in the technical solution provided by the embodiment of the present invention, and the way of determining entry weights is not limited to it. For example, entries can also be weighted according to preset parameters of each entry, or by the HITS algorithm commonly used in link analysis, and an entry is determined to be a classification feature word when its weight exceeds a threshold. The preset parameters include at least one of the following, or any combination of them: the click volume, the number of positive ratings, and the number of edits of the entry.
The technical solution provided by the embodiment of the present invention is described in more detail below with reference to a specific drawing and a specific application scenario.
Fig. 3 is a flow diagram of the taxonomy determination method in a specific application scenario provided by an embodiment of the present invention. In this embodiment, 5000 sports taxonomies need to be obtained; the preselected seed words include sport, football, athlete, track and field, World Cup, and Olympic Games; and the corpus mining database is Baidupedia. The method may include:
Step 301: obtain 10000 input samples from Baidupedia to form an input sample set.
The format of the input sample set can be as shown in Table 1.
Step 302: obtain 1000 feature samples from the input sample set according to the preset seed words to form a feature sample set.
A feature sample can be as shown in Table 2:
Table 2
Entry | Classification | Related entries
Football | Sport | Basketball, billiards, World Cup
When fewer than 1000 feature samples are obtained with seed words such as sport, football, athlete, track and field, World Cup, and Olympic Games, more feature samples can be obtained according to the related entries contained in the feature samples.
Step 303: determine classification feature words according to the feature sample set.
Specifically, in this embodiment, the weight of each entry in the feature samples can be determined by computing its discrimination, and the entries whose weight exceeds 0.05 are taken as classification feature words.
Take the feature sample shown in Table 2 as an example. Assuming that the discrimination of basketball is 0.08, that of billiards is 0.03, and that of World Cup is 0.07, the discrimination of football is 0.06, so football is a classification feature word.
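As a worked check of these figures, using the definition of an entry's discrimination as the mean of the discrimination of its related entries and the 0.05 threshold of this embodiment:

$$Q_{\text{football}} = \frac{0.08 + 0.03 + 0.07}{3} = \frac{0.18}{3} = 0.06 > 0.05$$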
Step 304: determine taxonomies according to the classification feature words and the texts to be selected.
Specifically, 50000 texts to be selected can be obtained from the Internet; word segmentation and weight calculation are performed on each of them, and the texts whose weight exceeds a certain threshold are determined to be taxonomies. In this step, 5000 taxonomies are obtained.
Step 305: perform corpus cross-validation on the 5000 determined taxonomies, and determine 1000 final taxonomies.
Specifically, in this step, the 5000 taxonomies determined in step 304 can be divided into 5 parts; each part is selected in turn as test data, and its category is verified with each of the remaining 4 parts; the taxonomies are then sorted by verification success rate from high to low, and the top 1000 are chosen as the final taxonomies. Taxonomies with the same verification success rate are ordered randomly among themselves.
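The final selection in step 305, sorting by verification success rate and breaking ties randomly, can be sketched in Python as follows. The function name, the `success_rate` callable, and the optional `rng` parameter are hypothetical; the sketch relies on the fact that Python's sort is stable, so shuffling first yields a random order among equal rates.

```python
import random


def top_by_success_rate(items, success_rate, limit, rng=None):
    """Return the `limit` items with the highest verification success rate.

    success_rate: hypothetical callable mapping an item to its success rate.
    """
    if rng is None:
        rng = random.Random()
    ranked = list(items)
    rng.shuffle(ranked)  # random order among items with equal rates
    # The sort is stable, so tied items keep their shuffled relative order.
    ranked.sort(key=success_rate, reverse=True)
    return ranked[:limit]
```

In the scenario above, `limit` would be 1000 and the success rate of a taxonomy would be the share of its 4 verifications that were correct.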
As the above description shows, in the technical solution provided by the embodiment of the present invention, a certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set; feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined according to the obtained feature sample set; a taxonomy and its category are then determined according to the obtained classification feature words and a text to be selected, which improves the efficiency and accuracy of taxonomy acquisition.
Based on the same inventive concept as the above taxonomy determination method, an embodiment of the present invention also provides a taxonomy determination device, which can be applied in the above method flow.
As shown in Fig. 4, a schematic structural diagram of the taxonomy determination device provided by an embodiment of the present invention, the device may include:
a first acquisition module 41, configured to obtain a preset number of input samples from a database to form an input sample set, where each input sample includes the entry name, classification information, and related entry information of an entry;
a second acquisition module 42, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;
a first determining module 43, configured to determine classification feature words according to the feature sample set;
a second determining module 44, configured to determine a taxonomy and its category according to the classification feature words and a text to be selected.
The second acquisition module 42 obtains feature samples from the input sample set according to the preset seed words through the following flow:
Step A: obtain, from the input sample set, the feature samples that contain the current seed words;
Step B: judge whether the number of feature samples exceeds a first threshold; if yes, end the flow; otherwise, go to step C;
Step C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step A.
The first determining module 43 is specifically configured to obtain the entries in the feature sample set, determine the weight of each of the entries, and determine classification feature words according to the weight of each entry.
The weight of an entry is the discrimination of the entry;
the first determining module 43 is specifically configured to obtain the related entries in the feature sample set, determine the discrimination of each of the related entries, determine the discrimination of each entry according to the discrimination of the related entries, and determine classification feature words according to the discrimination of the entries.
The discrimination of each related entry is, specifically, the ratio of the number of times the related entry occurs in the related entry information of the feature sample set to the number of times it occurs in the input sample set; the discrimination of each entry is, specifically, the mean of the discrimination of the related entries contained in the feature sample in which the entry is located;
the first determining module 43 is specifically configured to determine that an entry is a classification feature word when its discrimination exceeds a second threshold.
Alternatively, the first determining module 43 is specifically configured to determine the weight of each entry according to preset parameters and determine that an entry is a classification feature word when its weight exceeds a third threshold; or to determine the weight of each entry according to the HITS algorithm and determine that an entry is a classification feature word when its weight exceeds the third threshold;
where the preset parameters include one of the following, or any combination of them:
the click volume, the number of positive ratings, and the number of edits of the entry.
The second determining module 44 is specifically configured to perform word segmentation on the text to be selected and obtain the classification feature words contained in it; determine the weight of the text to be selected according to the obtained classification feature words; and, when the weight of the text to be selected exceeds a fourth threshold, determine that the text is a taxonomy and take the category to which the preset seed words belong as the category of the taxonomy.
The second determining module 44 determines the weight of the text to be selected according to the obtained classification feature words, specifically by the following formula:
where tf is the word frequency, in the text to be selected, of a classification feature word that occurs in that text; n is the number of classification feature words; Q_i is the weight of the i-th classification feature word; and N is the number of words in the text to be selected.
The second determining module 44 is further configured to divide the determined taxonomy into multiple parts, perform corpus cross-validation on the parts, and determine the final taxonomy and its category.
The second determining module 44 performs corpus cross-validation on the parts of the taxonomy through the following flow: Step A1: select one not-yet-selected part of the taxonomy as test data;
Step B1: verify the category of the test data using each of the remaining parts of the taxonomy in turn;
Step C1: count the number of correct verifications, and when it exceeds a fifth threshold, determine the test data to be a final taxonomy;
Step D1: judge whether any not-yet-selected part of the taxonomy remains; if yes, go to step A1; otherwise, end the flow.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, can in essence be embodied as a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required for implementing the present invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment can be distributed in the device as described in the embodiment, or can be changed accordingly and located in one or more devices different from that of the embodiment. The modules of the above embodiments can be merged into one module or further split into multiple submodules.
The above discloses only several specific embodiments of the present invention; however, the present invention is not limited to them, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims (20)

1. A taxonomy determination method, characterized by including:
obtaining a preset number of input samples from a database to form an input sample set, where each input sample includes the entry name, classification information, and related entry information of an entry;
obtaining feature samples from the input sample set according to preset seed words to form a feature sample set, where the feature samples in the feature sample set contain the preset seed words;
determining classification feature words according to the feature sample set;
determining a taxonomy and its category according to the classification feature words and a text to be selected.
2. The method as claimed in claim 1, characterized in that obtaining feature samples from the input sample set according to the preset seed words is specifically realized through the following flow:
Step A: obtain, from the input sample set, the feature samples that contain the current seed words;
Step B: judge whether the number of feature samples exceeds a first threshold; if yes, end the flow; otherwise, go to step C;
Step C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step A.
3. The method as claimed in claim 1, characterized in that determining classification feature words according to the feature sample set is specifically:
obtaining the entries in the feature sample set;
determining the weight of each of the entries;
determining classification feature words according to the weight of each entry.
4. The method as claimed in claim 3, characterized in that the weight of an entry is the discrimination of the entry;
determining the weight of each of the entries is specifically:
obtaining the related entries in the feature sample set;
determining the discrimination of each of the related entries;
determining the discrimination of each of the entries according to the discrimination of the related entries;
determining classification feature words according to the weight of each entry is specifically:
determining classification feature words according to the discrimination of each entry.
5. The method as claimed in claim 4, characterized in that
the discrimination of each of the related entries is specifically:
the ratio of the number of times the related entry occurs in the related entry information of the feature sample set to the number of times the related entry occurs in the input sample set;
the discrimination of each of the entries is specifically:
the mean of the discrimination of the related entries contained in the feature sample in which the entry is located;
determining classification feature words according to the discrimination of each entry is specifically:
when the discrimination of an entry exceeds a second threshold, determining that the entry is a classification feature word.
6. method as claimed in claim 3, which is characterized in that the weights of each entry in the determination entry, specially:
The weights of each entry are determined according to parameter preset;Or,
The weights of each entry are determined according to hits algorithms;
Wherein, the parameter preset includes following one or arbitrary combination:
Click volume, favorable comment number and the editor's number of entry;
The weights according to each entry determine characteristic of division word, specially:
When the weights of the entry are more than third threshold value, determine that the entry is characteristic of division word.
7. The method as claimed in claim 3, characterized in that the determining of the taxonomy and its classification according to the characteristic of division word and the text to be selected is specifically:
Performing word segmentation on the text to be selected, and obtaining the characteristic of division words contained in the text to be selected;
Determining the weights of the text to be selected according to the obtained characteristic of division words;
When the weights of the text to be selected are greater than a fourth threshold, determining that the text to be selected is taxonomy, and taking the classification to which the preset seed words belong as the classification of the taxonomy.
8. The method as claimed in claim 7, characterized in that the determining of the weights of the text to be selected according to the obtained characteristic of division words is specifically realized by the following formula:
Wherein tf is the word frequency, in the text to be selected, of a characteristic of division word occurring in the text to be selected; n is the number of characteristic of division words; Q_i is the weight of the i-th characteristic of division word; and N is the number of words in the text to be selected.
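The formula of claim 8 does not appear in this text; only its symbol definitions do. One reading consistent with those definitions is weight = (1/N) * sum over i = 1..n of tf_i * Q_i, and the sketch below implements that reading together with the fourth-threshold test of claim 7. The formula, the function name, the pre-segmented token list, and the threshold value are assumptions, not the patented formula.

```python
from collections import Counter

def text_weight(tokens, feature_weights):
    """Assumed reading: (sum of tf_i * Q_i over the characteristic of division words) / N,
    where tf_i is the raw count of word i and N is the number of tokens."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = sum(counts[word] * q for word, q in feature_weights.items())
    return total / len(tokens)

# Hypothetical candidate text, already segmented into words.
tokens = ["the", "league", "goal", "league", "today"]
feature_weights = {"league": 0.8, "goal": 0.5}  # Q_i for each characteristic of division word

weight = text_weight(tokens, feature_weights)
fourth_threshold = 0.3  # hypothetical value
is_taxonomy = weight > fourth_threshold
```

Dividing by N keeps long texts from scoring high merely because they contain more words, which is presumably why N appears among the defined symbols.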
9. The method as claimed in claim 7, characterized in that the method further includes:
Dividing the determined taxonomy into multiple parts;
Performing corpus cross-validation according to each part of the taxonomy, and determining the final taxonomy and its classification.
10. The method as claimed in claim 9, characterized in that the corpus cross-validation according to each part of the taxonomy is specifically realized by the following flow:
Step A1: selecting, from the parts of the taxonomy, a part that has not yet been selected as test data;
Step B1: verifying the classification of the test data using each of the remaining parts of the taxonomy respectively;
Step C1: counting the number of correct verifications, and when that number is greater than a fifth threshold, determining that the test data is final taxonomy;
Step D1: judging whether any unselected taxonomy remains; if so, going to Step A1; otherwise, ending the flow.
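Steps A1 to D1 can be sketched as a loop over the parts. The claims do not name a classifier or say whether the correct count is kept per part or per text, so this sketch makes two assumptions: a toy word-vote classifier stands in for whatever model is trained on each part, and the count is kept per text. All names and data are hypothetical.

```python
from collections import Counter, defaultdict

def split_into_parts(corpus, k):
    """Divide the corpus into k parts by striding."""
    return [corpus[i::k] for i in range(k)]

def train_word_vote(part):
    """Toy stand-in classifier: each word votes for the labels it was seen with."""
    votes = defaultdict(Counter)
    for tokens, label in part:
        for word in tokens:
            votes[word][label] += 1

    def predict(tokens):
        tally = Counter()
        for word in tokens:
            tally.update(votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else None

    return predict

def cross_validate(corpus, k, fifth_threshold, train=train_word_vote):
    parts = split_into_parts(corpus, k)
    models = [train(part) for part in parts]
    kept = []
    for i, test_part in enumerate(parts):  # Steps A1 and D1: visit every part once
        for tokens, label in test_part:
            # Steps B1 and C1: count how many of the other parts confirm the label.
            correct = sum(
                1 for j, model in enumerate(models) if j != i and model(tokens) == label
            )
            if correct > fifth_threshold:
                kept.append((tokens, label))
    return kept

corpus = [
    (["league", "goal"], "sports"),
    (["market", "bond"], "finance"),
    (["league", "dunk"], "sports"),
    (["market", "stock"], "finance"),
    (["goal", "league"], "sports"),
    (["bond", "market"], "sports"),  # deliberately mislabeled
]
final_corpus = cross_validate(corpus, k=3, fifth_threshold=0)
```

With three parts and a threshold of 0, a text survives if at least one other part confirms its label; the mislabeled text is confirmed by none and is dropped.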
11. Taxonomy determining equipment, characterized by including:
A first acquisition module, configured to obtain a preset quantity of input samples from a database to form an input sample set; wherein each input sample includes the entry name, classification information, and related entry information of an entry;
A second acquisition module, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set, the feature samples in the feature sample set including the preset seed words;
A first determining module, configured to determine the characteristic of division word according to the feature sample set;
A second determining module, configured to determine the taxonomy and its classification according to the characteristic of division word and the text to be selected.
12. The taxonomy determining equipment as claimed in claim 11, characterized in that the second acquisition module obtaining feature samples from the input sample set according to the preset seed words is specifically realized by the following flow:
Step A: obtaining, from the input sample set, the feature samples that include the current seed words;
Step B: judging whether the quantity of feature samples is greater than a first threshold; if so, ending the flow; otherwise, going to Step C;
Step C: obtaining the entries and related entries in the feature samples, adding the obtained entries and related entries to the seed words to update the current seed words, and going to Step A.
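The seed expansion flow of Steps A to C can be sketched as a loop. Several details are assumptions made for illustration: a sample "includes" a seed word when the seed matches its entry name or one of its related entries, samples are dicts with `entry` and `related` keys, and a round limit plus a no-progress check are added so the loop always terminates (the claim itself states no such guard).

```python
def collect_feature_samples(input_samples, seed_words, first_threshold, max_rounds=10):
    """Grow the seed set until more than first_threshold feature samples are found."""
    seeds = set(seed_words)
    features = []
    for _ in range(max_rounds):
        # Step A: samples whose entry name or related entries contain a current seed word.
        features = [
            s for s in input_samples
            if seeds & ({s["entry"]} | set(s["related"]))
        ]
        # Step B: stop once enough feature samples have been collected.
        if len(features) > first_threshold:
            break
        # Step C: add the entries and related entries of the feature samples to the seeds.
        expanded = seeds | {s["entry"] for s in features}
        expanded |= {r for s in features for r in s["related"]}
        if expanded == seeds:  # assumed guard: no new seed words, so stop
            break
        seeds = expanded
    return features, seeds

input_samples = [
    {"entry": "football", "related": ["league", "goal"]},
    {"entry": "basketball", "related": ["league", "dunk"]},
    {"entry": "dunk", "related": ["slam"]},
    {"entry": "stock", "related": ["market"]},
]
features, seeds = collect_feature_samples(input_samples, {"football"}, first_threshold=2)
```

Starting from the single seed "football", the first round finds one sample, the second finds two via "league", and the third finds three via "dunk", at which point the count exceeds the assumed threshold of 2.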
13. The taxonomy determining equipment as claimed in claim 11, characterized in that the first determining module is specifically configured to obtain the entries in the feature sample set; determine the weights of each entry among the entries; and determine the characteristic of division word according to the weights of each entry.
14. The taxonomy determining equipment as claimed in claim 13, characterized in that the weights of an entry are the discrimination of the entry;
The first determining module is specifically configured to obtain the related entries in the feature sample set; determine the discrimination of each related entry among the related entries; determine the discrimination of each entry among the entries according to the discrimination of the related entries; and determine the characteristic of division word according to the discrimination of the entries.
15. The taxonomy determining equipment as claimed in claim 14, characterized in that the discrimination of each related entry among the related entries is specifically the ratio of the number of times the related entry occurs in the related entry information of the feature sample set to the number of times the related entry occurs in the input sample set; and the discrimination of each entry among the entries is specifically the mean of the discrimination of the related entries contained in the feature sample in which the entry is located;
The first determining module is specifically configured to determine, when the discrimination of an entry is greater than a second threshold, that the entry is a characteristic of division word.
16. The taxonomy determining equipment as claimed in claim 13, characterized in that the first determining module is specifically configured to determine the weights of each entry according to preset parameters and, when the weights of an entry are greater than a third threshold, determine that the entry is a characteristic of division word; or to determine the weights of each entry according to the HITS algorithm and, when the weights of an entry are greater than the third threshold, determine that the entry is a characteristic of division word;
Wherein the preset parameters include one or any combination of the following:
The click volume, the number of favorable comments, and the number of edits of the entry.
17. The taxonomy determining equipment as claimed in claim 13, characterized in that the second determining module is specifically configured to perform word segmentation on the text to be selected and obtain the characteristic of division words contained in the text to be selected; determine the weights of the text to be selected according to the obtained characteristic of division words; and, when the weights of the text to be selected are greater than a fourth threshold, determine that the text to be selected is taxonomy and take the classification to which the preset seed words belong as the classification of the taxonomy.
18. The taxonomy determining equipment as claimed in claim 17, characterized in that the second determining module determining the weights of the text to be selected according to the obtained characteristic of division words is specifically realized by the following formula:
Wherein tf is the word frequency, in the text to be selected, of a characteristic of division word occurring in the text to be selected; n is the number of characteristic of division words; Q_i is the weight of the i-th characteristic of division word; and N is the number of words in the text to be selected.
19. The taxonomy determining equipment as claimed in claim 17, characterized in that the second determining module is further configured to divide the determined taxonomy into multiple parts; perform corpus cross-validation according to each part of the taxonomy; and determine the final taxonomy and its classification.
20. The taxonomy determining equipment as claimed in claim 19, characterized in that the second determining module performing corpus cross-validation according to each part of the taxonomy is specifically realized by the following flow:
Step A1: selecting, from the parts of the taxonomy, a part that has not yet been selected as test data;
Step B1: verifying the classification of the test data using each of the remaining parts of the taxonomy respectively;
Step C1: counting the number of correct verifications, and when that number is greater than a fifth threshold, determining that the test data is final taxonomy;
Step D1: judging whether any unselected taxonomy remains; if so, going to Step A1; otherwise, ending the flow.
CN201210056669.3A 2012-03-06 2012-03-06 A kind of taxonomy determines method and apparatus Active CN103309857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210056669.3A CN103309857B (en) 2012-03-06 2012-03-06 A kind of taxonomy determines method and apparatus

Publications (2)

Publication Number Publication Date
CN103309857A CN103309857A (en) 2013-09-18
CN103309857B true CN103309857B (en) 2018-11-09

Family

ID=49135096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210056669.3A Active CN103309857B (en) 2012-03-06 2012-03-06 A kind of taxonomy determines method and apparatus

Country Status (1)

Country Link
CN (1) CN103309857B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631874B (en) * 2013-11-07 2017-01-18 微梦创科网络科技(中国)有限公司 UGC label classification determining method and device for social platform
CN106528615B (en) * 2016-09-29 2019-08-06 北京金山安全软件有限公司 Classification method and device and server
CN106503254A (en) * 2016-11-11 2017-03-15 上海智臻智能网络科技股份有限公司 Language material sorting technique, device and terminal
CN107229731B (en) * 2017-06-08 2021-05-25 百度在线网络技术(北京)有限公司 Method and apparatus for classifying data
CN108304530B (en) * 2018-01-26 2022-03-18 腾讯科技(深圳)有限公司 Knowledge base entry classification method and device and model training method and device
CN109165326A (en) * 2018-08-16 2019-01-08 蜜小蜂智慧(北京)科技有限公司 A kind of character string matching method and device
CN109948142B (en) * 2019-01-25 2020-01-14 北京海天瑞声科技股份有限公司 Corpus selection processing method, apparatus, device and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976246A (en) * 2010-09-30 2011-02-16 互动在线(北京)科技有限公司 Classification retrieval method for encyclopedia entries
CN102073729A (en) * 2011-01-14 2011-05-25 百度在线网络技术(北京)有限公司 Relationship knowledge sharing platform and implementation method thereof
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
US8489523B2 (en) * 2010-03-31 2013-07-16 Alcatel Lucent Categorization automation based on category ontology
CN102207961B (en) * 2011-05-25 2013-10-23 盛乐信息技术(上海)有限公司 Automatic web page classification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
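Title
"Research on building a semantic knowledge base from Wikipedia and its application in text classification" (基于维基百科构建语义知识库及其在文本分类领域的应用研究); Su Xiaokang (苏小康); China Master's Theses Full-text Database, Information Science and Technology; 2010-10-15; full text *
"Research and implementation of a system for automatically creating text classification corpora" (文本分类语料库自动创建系统的研究与实现); Wu Wei (吴韦); China Master's Theses Full-text Database, Information Science and Technology; 2009-09-15; pp. 25-32, 53-64 *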

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131017

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20131017

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Applicant after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Shenzhen Futian District City, Guangdong province 518057 Zhenxing Road, SEG Science Park 2 East Room 403

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant