CN103309857A - Method and equipment for determining classified linguistic data - Google Patents


Info

Publication number
CN103309857A
Authority
CN
China
Prior art keywords
entry
characteristic
language material
word
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100566693A
Other languages
Chinese (zh)
Other versions
CN103309857B (en)
Inventor
贺翔
亓超
毛少林
翟俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201210056669.3A priority Critical patent/CN103309857B/en
Publication of CN103309857A publication Critical patent/CN103309857A/en
Application granted granted Critical
Publication of CN103309857B publication Critical patent/CN103309857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method and equipment for determining classified linguistic data. The method comprises the following steps: obtaining input samples with a preset amount from a database to form an input sample set, wherein the input samples comprise vocabulary entry names of vocabulary entries, classification information and related vocabulary entry information; obtaining characteristic samples from the input sample set according to preset seed words to form a characteristic sample set; determining classified characteristic words according to the characteristic sample set; and determining the classified linguistic data and the type of the classified linguistic data according to the classified characteristic words and texts to be selected. Due to the adoption of the method and the equipment for determining classified linguistic data, the efficiency and the accuracy rate of obtaining the classified linguistic data are improved.

Description

Method and device for determining a classification corpus
Technical Field
The present invention relates to Internet technology applications, and in particular to a method and device for determining a classification corpus.
Background Art
Automatic text classification means that a computer program automatically assigns category labels to a text collection (or other data) according to a given classification system or standard.
For a computer program to label a text collection automatically, it must first be trained on a large classification corpus. A classification corpus is a large collection of texts annotated with category labels, from which the program (for example, a classifier) learns the labeling rules through training.
In the prior art, a classification corpus is obtained mainly in the following two ways:
(1) Manual annotation: a large number of texts are classified and labeled by hand;
(2) Targeted crawling: data that has already been categorized is fetched from the Internet by means such as automatic web crawlers. For example, when a film and television classification corpus is needed, it can be crawled from the databases of film and television websites on the Internet.
In the course of implementing the present invention, the inventors found that the prior art has at least the following defects:
Manual annotation consumes a great deal of manpower and time and is inefficient. Targeted crawling cannot guarantee the accuracy of the classification corpus; that is, there is no guarantee that every text fetched from the database of a film and television website actually belongs to the film and television category.
Summary of the Invention
The present invention provides a method and device for determining a classification corpus, so as to improve the efficiency and accuracy with which a classification corpus is obtained.
To achieve the above object, an embodiment of the present invention provides a method for determining a classification corpus, comprising:
Obtaining a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related entry information;
Obtaining feature samples from the input sample set according to preset seed words to form a feature sample set;
Determining classification feature words according to the feature sample set;
Determining the classification corpus and its category according to the classification feature words and candidate texts.
An embodiment of the present invention also provides a device for determining a classification corpus, comprising:
A first acquisition module, configured to obtain a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related entry information;
A second acquisition module, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;
A first determination module, configured to determine classification feature words according to the feature sample set;
A second determination module, configured to determine the classification corpus and its category according to the classification feature words and candidate texts.
Compared with the prior art, the embodiments of the present invention have the following advantages:
A certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set. Feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from that feature sample set. The classification corpus and its category are then determined according to the classification feature words and candidate texts, which improves the efficiency and accuracy with which the classification corpus is obtained.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for determining a classification corpus provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of obtaining feature samples in the technical solution provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the method for determining a classification corpus in a specific application scenario provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for determining a classification corpus provided by an embodiment of the present invention.
Detailed Description of the Embodiments
In view of the above defects of the prior art, an embodiment of the present invention provides a technical solution for determining a classification corpus. In this solution, a certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set. Feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from that feature sample set. The classification corpus and its category are then determined according to the classification feature words and candidate texts, which improves the efficiency and accuracy with which the classification corpus is obtained.
In the technical solution provided by the embodiment of the present invention, the database from which the input sample set is obtained can be Baidu Baike, Wikipedia, WordNet, or the like. The input sample set obtained from the database can comprise the entry name of each entry, classification information, and related entry information, and its format can be as shown in Table 1:
Table 1
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described below are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the embodiments of the present invention.
Fig. 1 is a schematic flowchart of a method for determining a classification corpus provided by an embodiment of the present invention. The method can comprise the following steps:
Step 101: Obtain a preset number of input samples from a database to form an input sample set.
Specifically, take mining a classification corpus from Baidu Baike as an example. In the technical solution provided by the embodiment of the present invention, a preset number (for example, 1000) of input samples can be obtained from Baidu Baike, and their format can be as shown in Table 1.
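As an illustration of the input sample described in Step 101, the following minimal sketch models one sample as a small record. The class name, field names, and example values are assumptions made for illustration; the source only states that a sample carries an entry name, classification information, and related entry information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputSample:
    """One row of the input sample set (names are illustrative, not from the source)."""
    entry_name: str                          # name of the encyclopedia entry
    categories: tuple[str, ...] = ()         # classification information of the entry
    related_entries: tuple[str, ...] = ()    # names of the related entries


# A tiny hand-made input sample set; all values are invented for illustration.
input_samples = [
    InputSample("football", ("sports",), ("basketball", "billiards", "World Cup")),
    InputSample("World Cup", ("sports", "events"), ("football", "athlete")),
    InputSample("piano", ("music",), ("violin", "concert")),
]
```

Keeping the related entries as a tuple of names makes the later matching and counting steps simple membership tests.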
Step 102: Obtain feature samples from the input sample set according to preset seed words to form a feature sample set.
Specifically, in the technical solution provided by the embodiment of the present invention, when a classification corpus needs to be obtained, a certain number of seed words can be chosen in advance. For example, when a sports classification corpus is needed, 10 sports seed words can be chosen in advance, such as sports, football, athlete, track and field, World Cup, and Olympic Games. After the input samples have been obtained and the seed words chosen, feature samples can be obtained from the input sample set according to the seed words to form the feature sample set.
In the technical solution provided by the embodiment of the present invention, the flow for obtaining feature samples can be as shown in Fig. 2 and can comprise the following steps:
Step 102A: Obtain from the input sample set the feature samples that contain a current seed word.
For example, if the seed words chosen in advance are football, basketball, and athlete, the feature samples containing these seed words are obtained from the input sample set. A feature sample contains a seed word either when its entry is football, basketball, or athlete, or when its related entries include one of these seed words.
Step 102B: Judge whether the number of feature samples exceeds a first threshold. If so, end the flow; otherwise, go to Step 102C.
The feature sample count threshold can be set according to actual requirements, for example 10000.
Step 102C: Obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to Step 102A.
Specifically, when the number of feature samples obtained is below the preset threshold, the entries and related entries in the obtained feature samples can all be added to the seed words, and more feature samples can then be obtained from the input sample set according to the updated seed words.
A sufficient number of feature samples can be obtained through the above flow.
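The loop formed by Steps 102A to 102C can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Sample` type, the function name, and the example data are assumptions, and the extra stop condition for a seed set that no longer grows is not in the source; it is added here only so that the sketch always terminates.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    entry: str                  # entry name
    related: tuple[str, ...]    # related entry names


def collect_feature_samples(samples, seeds, min_count):
    """Grow the seed set until more than min_count feature samples match."""
    current = set(seeds)
    while True:
        # Step 102A: a sample matches if its entry or a related entry is a seed word.
        features = [s for s in samples
                    if s.entry in current or current.intersection(s.related)]
        # Step 102B: stop once the number of feature samples exceeds the threshold.
        if len(features) > min_count:
            return features, current
        # Step 102C: add the entries and related entries of the matches to the seeds.
        expanded = set(current)
        for s in features:
            expanded.add(s.entry)
            expanded.update(s.related)
        # Assumption (not in the source): give up when no new seed word was found,
        # otherwise a small sample set could make the loop run forever.
        if expanded == current:
            return features, current
        current = expanded


samples = [
    Sample("football", ("World Cup", "athlete")),
    Sample("World Cup", ("football", "FIFA")),
    Sample("FIFA", ("referee",)),
    Sample("piano", ("violin",)),
]
features, seeds = collect_feature_samples(samples, {"football"}, min_count=2)
print([s.entry for s in features])
```

In this toy run the first pass matches only two samples, so the seed set is expanded once and the second pass also picks up the "FIFA" sample, while the unrelated "piano" sample is never matched.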
Step 103: Determine classification feature words according to the obtained feature sample set.
Specifically, in the embodiment of the present invention, after the feature samples are obtained, the weight of the entry contained in each feature sample can be determined, and the classification feature words can be determined according to the weights of the entries.
Take the case where the weight of an entry is its discrimination as an example. In the embodiment of the present invention, the input sample set is taken as the universal set, and two sets are derived from the feature sample set:
Set 1: all entries in the feature sample set;
Set 2: all related entries in the feature sample set.
For a word W in Set 2, its discrimination is defined as:
$$Q_W = \frac{\text{number of occurrences of } W \text{ in Set 2}}{\text{number of occurrences of } W \text{ in the universal set}}$$
For a word x in Set 1, its discrimination is defined as the mean of the discrimination of all its related entries:
$$Q_x = \frac{1}{n} \sum_{i=1}^{n} Q_i$$
where n is the number of related entries in the feature sample whose entry is x, and Q_i is the discrimination of the i-th related entry.
After the discrimination of each entry in the feature sample set has been determined, the entries whose discrimination exceeds a threshold (for example, K) can be determined to be classification feature words.
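The discrimination computation of Step 103 can be sketched as below. This is only an illustration under stated assumptions: samples are modeled as (entry, related entries) pairs, the feature samples are a subset of the input samples, and occurrences in the universal set are counted in the related-entry fields of the input samples, which is one possible reading of the definition. The function name and the data are invented.

```python
from collections import Counter


def discrimination_scores(input_samples, feature_samples):
    """Return {entry: Q_x} for every entry of the feature sample set.

    Each sample is an (entry, related_entries) pair, and feature_samples
    is assumed to be a subset of input_samples.
    """
    # Occurrences of each related entry in Set 2 (feature samples only).
    in_set2 = Counter(w for _, related in feature_samples for w in related)
    # Occurrences of each related entry in the universal set (all input samples).
    in_all = Counter(w for _, related in input_samples for w in related)
    # Q_W = occurrences in Set 2 / occurrences in the universal set.
    q_related = {w: in_set2[w] / in_all[w] for w in in_set2}

    # Q_x = mean of the discrimination of the related entries of entry x.
    q_entry = {}
    for entry, related in feature_samples:
        if related:
            q_entry[entry] = sum(q_related[w] for w in related) / len(related)
    return q_entry


input_samples = [
    ("football", ("World Cup", "athlete")),
    ("basketball", ("athlete", "arena")),
    ("concert", ("arena", "piano")),
]
feature_samples = input_samples[:2]
scores = discrimination_scores(input_samples, feature_samples)

K = 0.8  # illustrative threshold
feature_words = [entry for entry, q in scores.items() if q > K]
print(scores, feature_words)
```

Here "arena" also appears outside the feature samples, so it lowers the score of "basketball" to 0.75, and only "football" passes the illustrative threshold.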
Step 104: Determine the classification corpus and its category according to the classification feature words and candidate texts.
Specifically, after the classification feature words are determined, any candidate text can be selected and segmented into words. The classification feature words contained in the candidate text are obtained, and the weight of the candidate text is determined from them. When the weight of the candidate text exceeds a threshold, the candidate text is determined to be a classification corpus text, and the category to which the corresponding seed words belong is used as its category.
The weight of the candidate text can be determined from the obtained classification feature words by the following formula:
[Formula image not reproduced in the source text]
where tf is the term frequency, in the candidate text, of a classification feature word that occurs in that text; i is the number (index) of the classification feature word; Q_i is the weight of the i-th classification feature word; and N is the number of words in the candidate text.
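Because the formula itself is not available in this text, the following sketch assumes one plausible form that is consistent with the listed symbols: the sum of tf_i × Q_i over the classification feature words found in the text, divided by N. That form, the function name, the use of raw occurrence counts for tf, and the token count for N are all assumptions, and the example weights are invented. Word segmentation is assumed to have been done already, so the function takes a list of tokens.

```python
from collections import Counter


def text_weight(tokens, feature_weights):
    """Assumed form: sum of tf_i * Q_i over feature words, divided by N."""
    if not tokens:
        return 0.0
    tf = Counter(tokens)  # raw occurrence count of every token
    total = sum(tf[w] * q for w, q in feature_weights.items() if w in tf)
    return total / len(tokens)  # N taken as the number of tokens


# Hypothetical feature-word weights and an already segmented candidate text.
feature_weights = {"football": 0.06, "World Cup": 0.07}
tokens = ["the", "World Cup", "football", "final", "football"]
score = text_weight(tokens, feature_weights)

THRESHOLD = 0.03  # illustrative threshold
is_corpus_text = score > THRESHOLD
print(round(score, 3), is_corpus_text)
```

Dividing by the text length keeps long texts from passing the threshold merely because they contain more words.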
To further improve the accuracy of the classification corpus obtained, in the technical solution provided by the embodiment of the present invention, the determined classification corpus can also be divided into multiple parts after it has been determined. Corpus cross validation is then performed on the parts, and the final classification corpus and its category are determined.
The corpus cross validation on the parts can be realized by the following flow:
Step A1: Select one part of the classification corpus that has not yet been selected as test data;
Step B1: Use each of the remaining parts separately to verify the category of the test data;
Step C1: Count the number of correct verifications and, when it exceeds a fifth threshold, determine that the test data is final classification corpus;
Step D1: Judge whether any unselected part remains. If so, go to Step A1; otherwise, end the flow.
For example, the determined classification corpus can be divided into 10 parts. In turn, 9 parts serve as training data and 1 part as test data, and the category of the test data is verified, so that each part of test data undergoes 9 classification tests. The test data whose number of correct category verifications exceeds the threshold is determined to be final classification corpus.
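Steps A1 to D1 can be sketched as follows, counting correct verifications per text. The source does not say which classifier is trained on each part, so the sketch takes a `train` function as a parameter and plugs in a deliberately simple word-voting classifier; both function names, the classifier, and the example corpus are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def cross_validate(corpus, n_parts, min_correct, train):
    """Keep the (text, label) pairs confirmed more than min_correct times.

    train(part) must return a function that maps a text to a predicted label.
    """
    parts = [corpus[i::n_parts] for i in range(n_parts)]
    # Train one model per part once and reuse it for every other part.
    models = [train(part) for part in parts]
    kept = []
    for i, test_part in enumerate(parts):            # Steps A1 and D1
        for text, label in test_part:
            correct = sum(                            # Steps B1 and C1
                1 for j, model in enumerate(models)
                if j != i and model(text) == label
            )
            if correct > min_correct:
                kept.append((text, label))
    return kept


def train_word_votes(part):
    """Toy classifier (hypothetical): each known word votes for its labels."""
    votes = defaultdict(Counter)
    for text, label in part:
        for word in text.split():
            votes[word][label] += 1

    def predict(text):
        tally = Counter()
        for word in text.split():
            tally.update(votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else None

    return predict


corpus = [
    ("football match goal", "sports"),
    ("football league goal", "sports"),
    ("goal keeper football", "sports"),
    ("stock market rally", "sports"),  # deliberately mislabeled text
]
kept = cross_validate(corpus, n_parts=4, min_correct=1, train=train_word_votes)
print([text for text, _ in kept])
```

The mislabeled text shares no words with the other parts, so none of them confirms its label and it is dropped, while each genuine sports text is confirmed twice.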
It should be noted that the discrimination method described in the above flow is only one way of determining entry weights in the technical solution provided by the embodiment of the present invention, and the solution is not limited to it. For example, weights can also be assigned to entries according to preset parameters of each entry, or by using the HITS algorithm commonly used in link analysis; when the weight of an entry exceeds a threshold, the entry is determined to be a classification feature word. The preset parameters comprise at least one of the following, or any combination thereof: the number of clicks on the entry, the number of positive ratings, and the number of edits.
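As a rough illustration of the HITS alternative, the sketch below runs the standard hub and authority iteration on a small graph. How the patent builds the graph is not specified, so treating each entry as a node that links to its related entries, and using the authority score as the entry weight, are assumptions; the function name and data are invented.

```python
def hits_authority(links, iterations=20):
    """Return authority scores for a graph given as {node: [linked nodes]}."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the nodes linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum of the authority scores of the nodes it links to.
        hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        # Normalize both score vectors so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return auth


# Assumed graph: each entry links to its related entries (illustrative data).
links = {
    "football": ["World Cup", "athlete"],
    "basketball": ["athlete"],
    "piano": ["concert"],
}
auth = hits_authority(links)
print(max(auth, key=auth.get))
```

Entries that are referenced by many well-connected entries receive a high authority score, which could then be compared against the threshold mentioned above.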
The technical solution provided by the embodiment of the present invention is described in more detail below with reference to a specific drawing and a specific application scenario.
Fig. 3 is a schematic flowchart of the method for determining a classification corpus in a specific application scenario provided by an embodiment of the present invention. In this embodiment, 5000 sports classification corpus texts need to be obtained; the preselected seed words comprise sports, football, athlete, track and field, World Cup, and Olympic Games; and the database from which the corpus is mined is Baidu Baike. The method can comprise:
Step 301: Obtain 10000 input samples from Baidu Baike to form the input sample set.
The format of the input sample set can be as shown in Table 1.
Step 302: Obtain 1000 feature samples from the input sample set according to the preset seed words to form the feature sample set.
The feature samples can be as shown in Table 2:
Table 2
[Table image not reproduced in the source text]
When the number of feature samples obtained according to seed words such as sports, football, athlete, track and field, World Cup, and Olympic Games is less than 1000, more feature samples can be obtained according to the related entries contained in the feature samples.
Step 303: Determine classification feature words according to the feature sample set.
Specifically, in this embodiment, the weight of each entry in the feature samples can be determined by computing its discrimination, and the entries whose weight is greater than 0.05 are taken as classification feature words.
Take the feature sample shown in Table 2 as an example. Suppose the discrimination of basketball is 0.08, that of billiards is 0.03, and that of World Cup is 0.07. The discrimination of football is then 0.06, which is greater than 0.05, so football is a classification feature word.
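The arithmetic behind this example can be written out as follows, assuming that basketball, billiards, and World Cup are the three related entries of the football sample (Table 2 itself is not available in this text):

$$Q_{\text{football}} = \frac{0.08 + 0.03 + 0.07}{3} = \frac{0.18}{3} = 0.06 > 0.05$$

Billiards on its own would fall below the 0.05 threshold, but the weight of football depends only on the mean over its related entries.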
Step 304: Determine the classification corpus according to the classification feature words and candidate texts.
Specifically, 50000 candidate texts can be obtained from the Internet. Each candidate text is segmented into words and its weight is calculated, and the candidate texts whose weight exceeds a certain threshold are determined to be classification corpus texts. In this step, 5000 classification corpus texts are obtained.
Step 305: Perform corpus cross validation on the 5000 classification corpus texts determined, and determine 1000 final classification corpus texts.
Specifically, in this step, the 5000 classification corpus texts determined in Step 304 can be divided into 5 parts. Each part is selected in turn as test data, and each of the remaining 4 parts is used separately to verify the category of the test data. The texts are then sorted by verification success rate from high to low, and the top 1000 are chosen as the final classification corpus. Texts with the same verification success rate are ordered randomly among themselves.
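The final selection in Step 305 (sort by verification success rate, break ties randomly, keep the top entries) can be sketched as below. The function name, the dictionary input, and the example rates are assumptions; the success rate is taken here to be the fraction of the 4 verifications that were correct.

```python
import random


def select_final(success_rates, top_k, rng=None):
    """Return the top_k texts by success rate, with ties in random order."""
    rng = rng or random.Random()
    # Pair each rate with a random number so that equal rates are ordered randomly.
    keyed = [(rate, rng.random(), text) for text, rate in success_rates.items()]
    keyed.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [text for _, _, text in keyed[:top_k]]


# Illustrative success rates: correct verifications divided by 4.
rates = {"text_a": 1.0, "text_b": 0.75, "text_c": 0.75, "text_d": 0.25}
final = select_final(rates, top_k=3, rng=random.Random(0))
print(final)
```

Passing a seeded `random.Random` makes a run repeatable, while the default gives a fresh random order for ties on every call.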
As can be seen from the above description, in the technical solution provided by the embodiment of the present invention, a certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set. Feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from that feature sample set. The classification corpus and its category are then determined according to the classification feature words and candidate texts, which improves the efficiency and accuracy with which the classification corpus is obtained.
Based on the same inventive concept as the above method, an embodiment of the present invention also provides a device for determining a classification corpus, which can be applied in the above method flow.
Fig. 4 is a schematic structural diagram of the device for determining a classification corpus provided by an embodiment of the present invention. The device can comprise:
A first acquisition module 41, configured to obtain a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related entry information;
A second acquisition module 42, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;
A first determination module 43, configured to determine classification feature words according to the feature sample set;
A second determination module 44, configured to determine the classification corpus and its category according to the classification feature words and candidate texts.
The second acquisition module 42 obtains feature samples from the input sample set according to the preset seed words by the following flow:
Step A: Obtain from the input sample set the feature samples that contain a current seed word;
Step B: Judge whether the number of feature samples exceeds a first threshold. If so, end the flow; otherwise, go to Step C;
Step C: Obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to Step A.
The first determination module 43 is specifically configured to obtain the entries in the feature sample set, determine the weight of each of these entries, and determine the classification feature words according to the weights of the entries.
The weight of an entry can be the discrimination of the entry.
In that case, the first determination module 43 is specifically configured to obtain the related entries in the feature sample set, determine the discrimination of each related entry, determine the discrimination of each entry according to the discrimination of the related entries, and determine the classification feature words according to the discrimination of the entries.
The discrimination of each related entry is specifically the ratio of the number of times the related entry occurs in the related entry information of the feature sample set to the number of times it occurs in the input sample set. The discrimination of each entry is specifically the mean of the discrimination of the related entries contained in the feature sample in which the entry is located.
The first determination module 43 is specifically configured to determine that an entry is a classification feature word when the discrimination of the entry exceeds a second threshold.
Alternatively, the first determination module 43 is specifically configured to determine the weight of each entry according to preset parameters and, when the weight of an entry exceeds a third threshold, determine that the entry is a classification feature word; or to determine the weight of each entry according to the HITS algorithm and, when the weight of an entry exceeds the third threshold, determine that the entry is a classification feature word.
The preset parameters comprise one of the following, or any combination thereof:
The number of clicks on the entry, the number of positive ratings, and the number of edits.
The second determination module 44 is specifically configured to segment the candidate text into words and obtain the classification feature words contained in the candidate text; determine the weight of the candidate text according to the classification feature words obtained; and, when the weight of the candidate text exceeds a fourth threshold, determine that the candidate text is a classification corpus text and use the category to which the preset seed words belong as the category of that classification corpus text.
The second determination module 44 determines the weight of the candidate text from the obtained classification feature words by the following formula:
[Formula not reproduced in the source text]
where tf is the term frequency, in the candidate text, of a classification feature word that occurs in that text; i is the number (index) of the classification feature word; Q_i is the weight of the i-th classification feature word; and N is the number of words in the candidate text.
The second determination module 44 is further configured to divide the determined classification corpus into multiple parts, perform corpus cross validation on the parts, and determine the final classification corpus and its category.
The corpus cross validation proceeds as follows:
Step A1: Select one part of the classification corpus that has not yet been selected as test data;
Step B1: Use each of the remaining parts separately to verify the category of the test data;
Step C1: Count the number of correct verifications and, when it exceeds a fifth threshold, determine that the test data is final classification corpus;
Step D1: Judge whether any unselected part remains. If so, go to Step A1; otherwise, end the flow.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required for implementing the present invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment can be distributed in the device as described in the embodiment, or can be changed accordingly and located in one or more devices different from that of the present embodiment. The modules of the above embodiments can be merged into one module or further split into multiple submodules.
The above discloses only several specific embodiments of the present invention. The present invention is not limited to them, however, and any variation that those skilled in the art can conceive of shall fall within the scope of protection of the present invention.

Claims (20)

1. A method for determining a classification corpus, characterized by comprising:
Obtaining a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related entry information;
Obtaining feature samples from the input sample set according to preset seed words to form a feature sample set;
Determining classification feature words according to the feature sample set;
Determining the classification corpus and its category according to the classification feature words and candidate texts.
2. The method of claim 1, characterized in that obtaining feature samples from the input sample set according to the preset seed words is specifically realized by the following flow:
Step A: Obtain from the input sample set the feature samples that contain a current seed word;
Step B: Judge whether the number of feature samples exceeds a first threshold. If so, end the flow; otherwise, go to Step C;
Step C: Obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to Step A.
3. The method of claim 1, characterized in that determining the classification feature words according to the feature sample set is specifically:
Obtaining the entries in the feature sample set;
Determining the weight of each of these entries;
Determining the classification feature words according to the weights of the entries.
4. The method of claim 3, characterized in that the weight of an entry is the discrimination of the entry;
Determining the weight of each of the entries is specifically:
Obtaining the related entries in the feature sample set;
Determining the discrimination of each of these related entries;
Determining the discrimination of each entry according to the discrimination of the related entries;
Determining the classification feature words according to the weights of the entries is specifically:
Determining the classification feature words according to the discrimination of each entry.
5. The method of claim 4, characterized in that
The discrimination degree of each of the related entries is specifically:
The ratio of the number of times the related entry occurs in the related entry information of the feature sample set to the number of times the related entry occurs in the input sample set;
The discrimination degree of each of the entries is specifically:
The mean of the discrimination degrees of the related entries contained in the feature sample in which the entry is located;
Determining the classification feature words according to the discrimination degree of each entry specifically comprises:
When the discrimination degree of an entry exceeds a second threshold, determining that the entry is a classification feature word.
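A minimal sketch of this discrimination calculation is given below. It rests on assumptions that go beyond the claim: samples are dictionaries with the hypothetical keys `name` and `related`, and occurrences in the input sample set are counted in the related entry information only, which the claim does not state explicitly.

```python
from collections import Counter


def related_counts(samples):
    """Count how often each related entry occurs in the related entry information."""
    return Counter(r for s in samples for r in s["related"])


def classification_feature_words(input_samples, feature_samples, second_threshold):
    """Return {entry name: discrimination degree} for entries above the second threshold."""
    in_features = related_counts(feature_samples)
    in_inputs = related_counts(input_samples)
    feature_words = {}
    for s in feature_samples:
        # Discrimination of a related entry: occurrences in the feature sample set
        # divided by occurrences in the whole input sample set.
        ratios = [in_features[r] / in_inputs[r] for r in s["related"] if in_inputs[r]]
        if not ratios:
            continue
        # Discrimination of the entry: mean over the related entries of its sample.
        score = sum(ratios) / len(ratios)
        if score > second_threshold:
            feature_words[s["name"]] = score
    return feature_words
```

A related entry that appears almost exclusively inside the feature sample set has a ratio close to 1, so entries surrounded by such related entries score high and are kept as classification feature words.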
6. The method of claim 3, characterized in that determining the weight of each of the entries specifically comprises:
Determining the weight of each entry according to preset parameters; or
Determining the weight of each entry according to the hit algorithm;
Wherein the preset parameters comprise one or any combination of the following:
The click volume, the number of favorable reviews and the number of edits of the entry;
Determining the classification feature words according to the weight of each entry specifically comprises:
When the weight of an entry exceeds a third threshold, determining that the entry is a classification feature word.
7. The method of claim 3, characterized in that determining the classified linguistic data and its category according to the classification feature words and the text to be selected specifically comprises:
Performing word segmentation on the text to be selected, and obtaining the classification feature words contained in the text to be selected;
Determining the weight of the text to be selected according to the obtained classification feature words;
When the weight of the text to be selected exceeds a fourth threshold, determining that the text to be selected is classified linguistic data, and taking the category to which the preset seed words belong as the category of the classified linguistic data.
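The selection step can be illustrated with the sketch below. The scoring rule used here (the sum of word frequency times feature-word weight, divided by the token count) is purely an assumption for illustration, since the weighting formula itself is not reproduced in this text. Whitespace splitting also stands in for a real word segmenter, and all function names are hypothetical.

```python
from collections import Counter


def text_weight(text, feature_weights, tokenize=str.split):
    """Score a text from the classification feature words it contains."""
    tokens = tokenize(text)  # stand-in for word segmentation (assumption)
    if not tokens:
        return 0.0
    # Word frequency of each classification feature word found in the text.
    tf = Counter(t for t in tokens if t in feature_weights)
    # Assumed scoring rule: weighted frequency normalised by text length.
    return sum(count * feature_weights[w] for w, count in tf.items()) / len(tokens)


def classify_text(text, feature_weights, category, fourth_threshold):
    """Return the seed-word category if the text qualifies, otherwise None."""
    if text_weight(text, feature_weights) > fourth_threshold:
        return category
    return None
```

A text that passes the fourth threshold inherits the category of the preset seed words, so no separate classifier is needed at this stage.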
8. The method of claim 7, characterized in that determining the weight of the text to be selected according to the classification feature words and the obtained feature words is implemented by the following formula:
Wherein tf is the word frequency, in the text to be selected, of a classification feature word occurring in the text to be selected; i is the number of classification feature words; Qi is the weight of the i-th classification feature word; and N is the number of words of the text to be selected.
9. The method of claim 7, characterized in that the method further comprises:
Dividing the determined classified linguistic data into multiple parts;
Performing cross validation on the linguistic data according to each part of the classified linguistic data, and determining the final classified linguistic data and its category.
10. The method of claim 9, characterized in that performing cross validation on the linguistic data according to each part of the classified linguistic data is implemented by the following flow:
Step A1: select, from the parts of the classified linguistic data, one part that has not yet been selected as test data;
Step B1: use each of the remaining parts of the classified linguistic data to verify the category of the test data;
Step C1: count the number of correct verifications, and when the number exceeds a fifth threshold, determine that the test data is final classified linguistic data;
Step D1: judge whether any unselected part of the classified linguistic data remains; if so, go to Step A1; otherwise, end the flow.
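Steps A1 to D1 amount to a leave-one-part-out check, which might be sketched as follows. The claim does not say how one part verifies another, so `verify` is a hypothetical callable supplied by the caller, and `shares_vocabulary` is a toy stand-in rather than the verification method of the patent.

```python
def cross_validate(parts, verify, fifth_threshold):
    """Keep each part whose category is confirmed by enough of the other parts."""
    final_parts = []
    for i, test_part in enumerate(parts):  # Steps A1 and D1: visit every part once
        others = parts[:i] + parts[i + 1:]
        # Steps B1 and C1: count how many of the remaining parts confirm the test data.
        correct = sum(1 for part in others if verify(part, test_part))
        if correct > fifth_threshold:
            final_parts.append(test_part)
    return final_parts


def shares_vocabulary(train_part, test_part):
    """Toy verifier (assumption): the two parts have at least one word in common."""
    train_words = {w for text in train_part for w in text.split()}
    test_words = {w for text in test_part for w in text.split()}
    return bool(train_words & test_words)
```

In practice `verify` would more plausibly train a classifier on one part and check its prediction for the test data, but that choice is left open here.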
11. Equipment for determining classified linguistic data, characterized by comprising:
A first acquisition module, configured to obtain a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name, classification information and related entry information of an entry;
A second acquisition module, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;
A first determination module, configured to determine classification feature words according to the feature sample set;
A second determination module, configured to determine the classified linguistic data and its category according to the classification feature words and a text to be selected.
12. The equipment for determining classified linguistic data of claim 11, characterized in that the second acquisition module obtains feature samples from the input sample set according to the preset seed words by the following flow:
Step A: obtain, from the input sample set, the feature samples that contain a current seed word;
Step B: judge whether the number of feature samples exceeds a first threshold; if so, end the flow; otherwise, go to Step C;
Step C: obtain the entries and related entries in the feature samples, add the obtained entries and related entries to the seed words to update the current seed words, and go to Step A.
13. The equipment for determining classified linguistic data of claim 11, characterized in that the first determination module is specifically configured to: obtain the entries in the feature sample set; determine the weight of each of the entries; and determine the classification feature words according to the weight of each entry.
14. The equipment for determining classified linguistic data of claim 13, characterized in that the weight of an entry is the discrimination degree of the entry;
The first determination module is specifically configured to: obtain the related entries in the feature sample set; determine the discrimination degree of each of the related entries; determine the discrimination degree of each of the entries according to the discrimination degrees of the related entries; and determine the classification feature words according to the discrimination degrees of the entries.
15. The method of claim 14, characterized in that the discrimination degree of each of the related entries is specifically the ratio of the number of times the related entry occurs in the related entry information of the feature sample set to the number of times the related entry occurs in the input sample set; and the discrimination degree of each of the entries is specifically the mean of the discrimination degrees of the related entries contained in the feature sample in which the entry is located;
The first determination module is specifically configured to determine that an entry is a classification feature word when the discrimination degree of the entry exceeds a second threshold.
16. The equipment for determining classified linguistic data of claim 13, characterized in that the first determination module is specifically configured to: determine the weight of each entry according to preset parameters, and determine that an entry is a classification feature word when the weight of the entry exceeds a third threshold; or determine the weight of each entry according to the hit algorithm, and determine that an entry is a classification feature word when the weight of the entry exceeds the third threshold;
Wherein the preset parameters comprise one or any combination of the following:
The click volume, the number of favorable reviews and the number of edits of the entry.
17. The equipment for determining classified linguistic data of claim 13, characterized in that the second determination module is specifically configured to: perform word segmentation on the text to be selected, and obtain the classification feature words contained in the text to be selected; determine the weight of the text to be selected according to the obtained classification feature words; and, when the weight of the text to be selected exceeds a fourth threshold, determine that the text to be selected is classified linguistic data and take the category to which the preset seed words belong as the category of the classified linguistic data.
18. The equipment for determining classified linguistic data of claim 17, characterized in that the second determination module determines the weight of the text to be selected according to the classification feature words and the obtained feature words by the following formula:
[Formula image not reproduced in the source text]
Wherein tf is the word frequency, in the text to be selected, of a classification feature word occurring in the text to be selected; i is the number of classification feature words; Qi is the weight of the i-th classification feature word; and N is the number of words of the text to be selected.
19. The equipment for determining classified linguistic data of claim 17, characterized in that the second determination module is further configured to: divide the determined classified linguistic data into multiple parts; perform cross validation on the linguistic data according to each part of the classified linguistic data; and determine the final classified linguistic data and its category.
20. The equipment for determining classified linguistic data of claim 19, characterized in that the second determination module performs cross validation on the linguistic data according to each part of the classified linguistic data by the following flow:
Step A1: select, from the parts of the classified linguistic data, one part that has not yet been selected as test data;
Step B1: use each of the remaining parts of the classified linguistic data to verify the category of the test data;
Step C1: count the number of correct verifications, and when the number exceeds a fifth threshold, determine that the test data is final classified linguistic data;
Step D1: judge whether any unselected part of the classified linguistic data remains; if so, go to Step A1; otherwise, end the flow.
CN201210056669.3A 2012-03-06 2012-03-06 Method and equipment for determining classified linguistic data Active CN103309857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210056669.3A CN103309857B (en) 2012-03-06 2012-03-06 Method and equipment for determining classified linguistic data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210056669.3A CN103309857B (en) 2012-03-06 2012-03-06 Method and equipment for determining classified linguistic data

Publications (2)

Publication Number Publication Date
CN103309857A true CN103309857A (en) 2013-09-18
CN103309857B CN103309857B (en) 2018-11-09

Family

ID=49135096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210056669.3A Active CN103309857B (en) 2012-03-06 2012-03-06 Method and equipment for determining classified linguistic data

Country Status (1)

Country Link
CN (1) CN103309857B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631874A (en) * 2013-11-07 2014-03-12 微梦创科网络科技(中国)有限公司 UGC label classification determining method and device for social platform
CN106503254A (en) * 2016-11-11 2017-03-15 上海智臻智能网络科技股份有限公司 Language material sorting technique, device and terminal
CN106528615A (en) * 2016-09-29 2017-03-22 北京金山安全软件有限公司 Classification method and device and server
CN107229731A (en) * 2017-06-08 2017-10-03 百度在线网络技术(北京)有限公司 Method and apparatus for grouped data
CN108304530A (en) * 2018-01-26 2018-07-20 腾讯科技(深圳)有限公司 Knowledge base entry sorting technique and device, model training method and device
CN109165326A (en) * 2018-08-16 2019-01-08 蜜小蜂智慧(北京)科技有限公司 A kind of character string matching method and device
CN109948142A (en) * 2019-01-25 2019-06-28 北京海天瑞声科技股份有限公司 Corpus chooses processing method, device, equipment and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
CN101976246A (en) * 2010-09-30 2011-02-16 互动在线(北京)科技有限公司 Classification retrieval method for encyclopedia entries
CN102073729A (en) * 2011-01-14 2011-05-25 百度在线网络技术(北京)有限公司 Relationship knowledge sharing platform and implementation method thereof
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN102207961A (en) * 2011-05-25 2011-10-05 盛乐信息技术(上海)有限公司 Automatic web page classification method and device
US20110258152A1 (en) * 2010-03-31 2011-10-20 Kindsight, Inc. Categorization automation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘峰: ""通用中英文专业搜索引擎技术的研究及应用"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
吴韦: ""文本分类语料库自动创建系统的研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
苏小康: ""基于维基百科构建语义知识库及其在文本分类领域的应用研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631874A (en) * 2013-11-07 2014-03-12 微梦创科网络科技(中国)有限公司 UGC label classification determining method and device for social platform
CN106528615A (en) * 2016-09-29 2017-03-22 北京金山安全软件有限公司 Classification method and device and server
CN106528615B (en) * 2016-09-29 2019-08-06 北京金山安全软件有限公司 Classification method and device and server
CN106503254A (en) * 2016-11-11 2017-03-15 上海智臻智能网络科技股份有限公司 Language material sorting technique, device and terminal
CN107229731A (en) * 2017-06-08 2017-10-03 百度在线网络技术(北京)有限公司 Method and apparatus for grouped data
CN108304530A (en) * 2018-01-26 2018-07-20 腾讯科技(深圳)有限公司 Knowledge base entry sorting technique and device, model training method and device
CN109165326A (en) * 2018-08-16 2019-01-08 蜜小蜂智慧(北京)科技有限公司 A kind of character string matching method and device
CN109948142A (en) * 2019-01-25 2019-06-28 北京海天瑞声科技股份有限公司 Corpus chooses processing method, device, equipment and computer readable storage medium
CN109948142B (en) * 2019-01-25 2020-01-14 北京海天瑞声科技股份有限公司 Corpus selection processing method, apparatus, device and computer readable storage medium

Also Published As

Publication number Publication date
CN103309857B (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN103309857A (en) Method and equipment for determining classified linguistic data
CN105138653B (en) It is a kind of that method and its recommendation apparatus are recommended based on typical degree and the topic of difficulty
CN101937436B (en) Text classification method and device
CN106156372B (en) A kind of classification method and device of internet site
CN103208039B (en) Method and device for evaluating software project risks
CN104516986A (en) Method and device for recognizing sentence
CN106489149A (en) A kind of data mask method based on data mining and mass-rent and system
CN106547871A (en) Method and apparatus is recalled based on the Search Results of neutral net
CN106570109A (en) Method for automatically generating knowledge points of question bank through text analysis
CN109299271A (en) Training sample generation, text data, public sentiment event category method and relevant device
CN106651057A (en) Mobile terminal user age prediction method based on installation package sequence table
CN103324758B (en) A kind of news category method and system
CN102426572A (en) Method and equipment for classifying business entries
CN103092966A (en) Vocabulary mining method and device
CN107463711A (en) A kind of tag match method and device of data
CN106326498A (en) Cheat video identification method and device
CN113918806A (en) Method for automatically recommending training courses and related equipment
CN104809104A (en) Method and system for identifying micro-blog textual emotion
CN109871770A (en) Property ownership certificate recognition methods, device, equipment and storage medium
CN112445897A (en) Method, system, device and storage medium for large-scale classification and labeling of text data
CN101788987A (en) Automatic judging method of network resource types
CN108960884A (en) Information processing method, model building method and device, medium and calculating equipment
CN108717459A (en) A kind of mobile application defect positioning method of user oriented comment information
CN105787004A (en) Text classification method and device
CN105868272A (en) Multimedia file classification method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131017

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20131017

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Applicant after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Shenzhen Futian District City, Guangdong province 518057 Zhenxing Road, SEG Science Park 2 East Room 403

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant