CN103309857A

CN103309857A - Method and equipment for determining classified linguistic data

Info

Publication number: CN103309857A
Application number: CN2012100566693A
Authority: CN
Inventors: 贺翔; 亓超; 毛少林; 翟俊杰
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2012-03-06
Filing date: 2012-03-06
Publication date: 2013-09-18
Anticipated expiration: 2032-03-06
Also published as: CN103309857B

Abstract

The invention discloses a method and equipment for determining classified linguistic data. The method comprises the following steps: obtaining input samples with a preset amount from a database to form an input sample set, wherein the input samples comprise vocabulary entry names of vocabulary entries, classification information and related vocabulary entry information; obtaining characteristic samples from the input sample set according to preset seed words to form a characteristic sample set; determining classified characteristic words according to the characteristic sample set; and determining the classified linguistic data and the type of the classified linguistic data according to the classified characteristic words and texts to be selected. Due to the adoption of the method and the equipment for determining classified linguistic data, the efficiency and the accuracy rate of obtaining the classified linguistic data are improved.

Description

Method and device for determining a classified corpus

Technical field

The present invention relates to the application of Internet technology, and in particular to a method and device for determining a classified corpus.

Background technology

Automatic text classification means that a computer program automatically assigns category labels to a set of texts (or other data) according to a given classification system or standard.

For a computer program to label a text set automatically, it must first be trained on a large classified corpus. Here, a classified corpus is a large collection of texts carrying category labels; the computer program (for example, a classifier) learns the labeling rules from this corpus through training.

In the prior art, a classified corpus is mainly obtained in the following two ways:

(1) Manual labeling, that is, manually assigning category labels to a large number of texts;

(2) Targeted crawling, that is, using automatic crawlers or similar means to fetch data that is already categorized on the Internet. For example, when a film-and-television corpus is needed, it can be crawled from the databases of film-and-television websites.

In the course of implementing the present invention, the inventors found that the prior art has at least the following defects:

Manual labeling consumes a great deal of labor and time and is inefficient. Targeted crawling cannot guarantee the accuracy of the classified corpus; that is, there is no guarantee that every text fetched from a film-and-television website database actually belongs to the film-and-television category.

Summary of the invention

The invention provides a method and device for determining a classified corpus, so as to improve the efficiency and accuracy of obtaining classified corpora.

To achieve the above object, an embodiment of the invention provides a method for determining a classified corpus, comprising:

obtaining a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related-entry information;

obtaining feature samples from the input sample set according to preset seed words to form a feature sample set;

determining classification feature words according to the feature sample set;

determining the classified corpus and its category according to the classification feature words and candidate texts.

An embodiment of the invention also provides a device for determining a classified corpus, comprising:

a first acquisition module, configured to obtain a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related-entry information;

a second acquisition module, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;

a first determination module, configured to determine classification feature words according to the feature sample set;

a second determination module, configured to determine the classified corpus and its category according to the classification feature words and candidate texts.

Compared with prior art, the embodiment of the invention has the following advantages:

A number of seed words of a known category are chosen in advance, and a number of input samples are obtained from a database to form an input sample set. Feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from that feature sample set. The classified corpus and its category are then determined according to the classification feature words and candidate texts, which improves the efficiency and accuracy of obtaining classified corpora.

Description of drawings

Fig. 1 is a schematic flowchart of a method for determining a classified corpus provided by an embodiment of the invention;

Fig. 2 is a schematic flowchart of obtaining feature samples in the technical solution provided by an embodiment of the invention;

Fig. 3 is a schematic flowchart of the method for determining a classified corpus in a specific application scenario provided by an embodiment of the invention;

Fig. 4 is a schematic structural diagram of a device for determining a classified corpus provided by an embodiment of the invention.

Embodiment

To address the above defects in the prior art, an embodiment of the invention provides a technical solution for determining a classified corpus. In this solution, a number of seed words of a known category are chosen in advance, and a number of input samples are obtained from a database to form an input sample set. Feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from that feature sample set. The classified corpus and its category are then determined according to the classification feature words and candidate texts, which improves the efficiency and accuracy of obtaining classified corpora.

In the technical solution provided by the embodiment of the invention, the database from which the input sample set is obtained may be Baidu Baike, Wikipedia, WordNet, or the like. Each input sample obtained from the database may comprise the entry name of an entry, classification information, and related-entry information, and its format may be as shown in Table 1:

Table 1

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the embodiments of the invention.

Fig. 1 is a schematic flowchart of a method for determining a classified corpus provided by an embodiment of the invention. The method may comprise the following steps:

Step 101: obtain a preset number of input samples from a database to form an input sample set.

Specifically, take mining a classified corpus from Baidu Baike as an example. In the technical solution provided by the embodiment of the invention, a preset number (for example, 1000) of input samples may be obtained from Baidu Baike, in the format shown in Table 1.

Step 102: obtain feature samples from the input sample set according to preset seed words to form a feature sample set.

Specifically, in the technical solution provided by the embodiment of the invention, when a classified corpus is needed, a certain number of seed words may be chosen in advance. For example, when a sports corpus is needed, 10 sports seed words may be chosen in advance, such as sports, football, athlete, track and field, World Cup, and Olympic Games. After the input samples have been obtained and the seed words chosen, feature samples can be obtained from the input sample set according to the seed words to form the feature sample set.

In the technical solution provided by the embodiment of the invention, the process of obtaining feature samples is shown in Fig. 2 and may comprise the following steps:

Step 102A: obtain from the input sample set the feature samples that contain a current seed word.

For example, if the seed words chosen in advance are football, basketball, and athlete, the feature samples containing these seed words are obtained from the input sample set. A feature sample containing a seed word may be one whose entry is football, basketball, or athlete, or one whose related entries include the corresponding seed word.

Step 102B: judge whether the number of feature samples exceeds a first threshold. If so, end this process; otherwise, go to step 102C.

The threshold on the number of feature samples can be set according to actual needs, for example 10000.

Step 102C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step 102A.

Specifically, when the number of feature samples obtained is below the preset threshold, all entries and related entries in the feature samples obtained can be added to the seed words, and more feature samples are obtained from the input sample set according to the updated seed words.

Through the above process, a sufficient number of feature samples can be obtained.
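The loop formed by steps 102A to 102C can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the `Sample` record, the function name `collect_feature_samples`, and the extra stop condition (exit when a pass adds no new seed words, so that the loop cannot run forever on a small input set) are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical input sample: entry name, category, related entries."""
    entry: str
    category: str
    related: tuple = ()


def collect_feature_samples(samples, seeds, first_threshold):
    """Grow the seed words until enough feature samples are matched."""
    seeds = set(seeds)
    while True:
        # Step 102A: a sample matches if its entry is a seed word or
        # one of its related entries is a seed word.
        features = [s for s in samples
                    if s.entry in seeds or seeds & set(s.related)]
        # Step 102B: stop once the count exceeds the first threshold.
        if len(features) > first_threshold:
            return features, seeds
        # Step 102C: add the entries and related entries of the matched
        # samples to the seed words.
        expanded = set(seeds)
        for s in features:
            expanded.add(s.entry)
            expanded.update(s.related)
        # Assumption (not in the source): give up when nothing new was added.
        if expanded == seeds:
            return features, seeds
        seeds = expanded
```

Starting from a single seed word such as "football", each pass pulls in the related entries of the samples already matched, so samples that never mention the original seed word can still be reached in a later pass.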

Step 103: determine the classification feature words according to the feature sample set obtained.

Specifically, in the embodiment of the invention, after the feature samples are obtained, the weight of the entry contained in each feature sample can be determined, and the classification feature words are determined according to the weights of the entries.

Take the case where the weight of an entry is its discrimination as an example. In the embodiment of the invention, the input sample set is taken as the full set, and two sets are further determined from the feature sample set:

Set 1: all entries in the feature sample set;

Set 2: all related entries in the feature sample set.

For a word W in Set 2, its discrimination is defined as:

$$Q_W = \frac{\text{number of occurrences of } W \text{ in Set 2}}{\text{number of occurrences of } W \text{ in the full set}}$$

For a word x in Set 1, its discrimination is defined as the mean of the discriminations of all its related entries:

$$Q_x = \frac{1}{n}\sum_{i=1}^{n} Q_i$$

where n is the number of related entries in the feature sample whose entry is x, and $Q_i$ is the discrimination of the i-th related entry.

After the discrimination of each entry in the feature sample set has been determined, the entries whose discrimination exceeds a threshold (for example, K) can be determined to be classification feature words.
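The discrimination computation can be illustrated with a short Python sketch. It is a hedged example, not the reference implementation: samples are modeled as hypothetical `(entry, related_entries)` pairs, the function names are invented for illustration, and the count of a word "in the full set" is assumed to mean its count in the related-entry information of the input sample set. The feature samples are assumed to be a subset of the input samples, so the denominator is never zero.

```python
from collections import Counter


def related_discrimination(feature_samples, input_samples):
    """Q_W = occurrences of W in Set 2 / occurrences of W in the full set."""
    in_features = Counter(w for _, related in feature_samples for w in related)
    # Assumption: occurrences in the full set are counted over related entries.
    in_all = Counter(w for _, related in input_samples for w in related)
    return {w: in_features[w] / in_all[w] for w in in_features}


def entry_discrimination(feature_samples, q_related):
    """Q_x = mean discrimination of the related entries of entry x."""
    result = {}
    for entry, related in feature_samples:
        if related:  # entries without related entries get no score here
            result[entry] = sum(q_related[w] for w in related) / len(related)
    return result


def classification_feature_words(q_entry, threshold):
    """Keep the entries whose discrimination exceeds the threshold."""
    return {entry for entry, q in q_entry.items() if q > threshold}
```

A related entry that appears almost only inside the feature sample set scores close to 1, while one that is spread across the whole input sample set scores lower, which is what lets the mean separate category-specific entries from generic ones.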

Step 104: determine the classified corpus and its category according to the classification feature words and candidate texts.

Specifically, after the classification feature words are determined, any candidate text can be selected and segmented into words to obtain the classification feature words it contains, and the weight of the candidate text is determined according to those classification feature words. When the weight of the candidate text exceeds a threshold, the candidate text is determined to be classified corpus, and the category to which the corresponding seed words belong is taken as the category of this classified corpus.

The weight of the candidate text is determined from the classification feature words obtained, specifically by the following formula:

[Formula: image not reproduced in this text.]

where tf is the term frequency, within the candidate text, of a classification feature word that occurs in the candidate text; i is the number of classification feature words; $Q_i$ is the weight of the i-th classification feature word; and N is the number of words in the candidate text.
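Because the formula itself is not reproduced in this text, the following Python sketch only illustrates one plausible form built from the variables defined above. It assumes, purely for illustration, that the weight is the sum of tf times $Q_i$ over the classification feature words found in the text, divided by N; this assumed form, the function names, and the use of an already segmented token list are not taken from the source.

```python
from collections import Counter


def text_weight(tokens, feature_weights):
    """Assumed form: sum(tf_i * Q_i) / N over the feature words in the text."""
    if not tokens:
        return 0.0
    tf = Counter(tokens)  # term frequency of every token in the text
    score = sum(tf[w] * q for w, q in feature_weights.items() if w in tf)
    return score / len(tokens)  # N is taken as the number of tokens


def is_classified_corpus(tokens, feature_weights, fourth_threshold):
    """A candidate text is kept when its weight exceeds the threshold."""
    return text_weight(tokens, feature_weights) > fourth_threshold
```

Dividing by the text length keeps long texts from passing the threshold merely because they contain more words; whether the original formula normalizes in exactly this way cannot be confirmed from the text available here.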

To further improve the accuracy of the classified corpus obtained, in the technical solution provided by the embodiment of the invention, after the classified corpus has been determined, it can also be divided into multiple parts; corpus cross-validation is performed on the parts, and the final classified corpus and its category are determined.

Corpus cross-validation on the parts may be implemented by the following process:

Step A1: select, from the parts of the classified corpus, one part that has not yet been selected as test data;

Step B1: use each of the remaining parts in turn to verify the category of the test data;

Step C1: count the number of correct verifications and, when it exceeds a fifth threshold, determine that the test data is final classified corpus;

Step D1: judge whether any part has not yet been selected; if so, go to step A1; otherwise, end this process.

For example, the classified corpus determined can be divided into 10 parts. In turn, 9 parts are used as training data and 1 part as test data to verify the category of the test data, so that each part of test data undergoes 9 classification tests. Test data whose number of correct category verifications exceeds the threshold is determined to be final classified corpus.
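Steps A1 to D1 can be sketched as below. The source does not say which classifier is trained on each part, so the sketch takes the training routine as a parameter; `train_keyword_model` is a deliberately trivial, hypothetical stand-in used only to make the example runnable, and all names here are assumptions.

```python
def cross_validate(parts, train, fifth_threshold):
    """parts: list of lists of (text, label); returns the items that pass."""
    final = []
    for i, test_part in enumerate(parts):  # Steps A1 and D1: each part once
        # Step B1: one model per remaining part, each verifying separately.
        models = [train(p) for j, p in enumerate(parts) if j != i]
        for text, label in test_part:
            # Step C1: count how many models confirm the label.
            correct = sum(1 for model in models if model(text) == label)
            if correct > fifth_threshold:
                final.append((text, label))
    return final


def train_keyword_model(part):
    """Toy classifier: remembers the words seen under each label."""
    vocab = {}
    for text, label in part:
        vocab.setdefault(label, set()).update(text.split())

    def predict(text):
        words = set(text.split())
        scores = {label: len(words & seen) for label, seen in vocab.items()}
        best = max(scores, key=scores.get, default=None)
        # No overlap with any label means no prediction.
        return best if best is not None and scores[best] > 0 else None

    return predict
```

With 10 parts, each test item is checked by 9 models, matching the example in the text; an item whose label is confirmed more than the threshold number of times is kept.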

It should be noted that the method of determining discrimination given in the above process is only one specific way of determining entry weights in the technical solution provided by the embodiment of the invention, and the way of determining entry weights is not limited to it. For example, entries can also be weighted according to preset parameters of each entry, or by the HITS algorithm commonly used in link analysis; when the weight of an entry exceeds a threshold, the entry is determined to be a classification feature word. The preset parameters comprise one or any combination of the following: the number of clicks on the entry, the number of positive ratings, and the number of edits.
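As an illustration of the link-analysis alternative, the sketch below runs a plain power-iteration version of HITS. How the link graph is built is not described in the source; here it is assumed, hypothetically, that each entry links to its related entries and that the authority score serves as the entry weight. The function name and the iteration count are likewise assumptions.

```python
def hits(links, iterations=50):
    """links: dict mapping an entry to the entries it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of the authority scores of pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

Entries whose authority score exceeds the chosen threshold would then be taken as classification feature words, in the same way as with discrimination.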

The technical solution provided by the embodiment of the invention is described in more detail below with reference to a specific drawing and a specific application scenario.

Fig. 3 is a schematic flowchart of the method for determining a classified corpus in a specific application scenario provided by an embodiment of the invention. In this embodiment, 5000 sports classified corpus texts need to be obtained; the preselected seed words comprise sports, football, athlete, track and field, World Cup, and Olympic Games; and the corpus-mining database is Baidu Baike. The method may comprise:

Step 301: obtain 10000 input samples from Baidu Baike to form the input sample set.

The format of the input sample set may be as shown in Table 1.

Step 302: obtain 1000 feature samples from the input sample set according to the preset seed words to form the feature sample set.

The feature samples may be as shown in Table 2:

Table 2

[Table 2: image not reproduced in this text.]

When fewer than 1000 feature samples are obtained from seed words such as sports, football, athlete, track and field, World Cup, and Olympic Games, more feature samples can be obtained according to the related entries contained in the feature samples.

Step 303: determine the classification feature words according to the feature sample set.

Specifically, in this embodiment, the weight of each entry in the feature samples can be determined by computing its discrimination, and entries whose weight is greater than 0.05 are taken as classification feature words.

Take the feature samples shown in Table 2 as an example. Suppose the discrimination of basketball is 0.08, that of billiards is 0.03, and that of World Cup is 0.07; then the discrimination of football is 0.06, so football is a classification feature word.
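As a worked check of these numbers, assuming that basketball, billiards, and World Cup are exactly the related entries of football in Table 2 (the table itself is not reproduced in this text), the mean of the three discriminations is:

```latex
Q_{\text{football}} = \frac{0.08 + 0.03 + 0.07}{3} = \frac{0.18}{3} = 0.06 > 0.05
```

Billiards alone, at 0.03, falls below the 0.05 threshold, but it still contributes to the mean that places football above it.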

Step 304: determine the classified corpus according to the classification feature words and candidate texts.

Specifically, 50000 candidate texts can be obtained from the Internet; each candidate text is segmented into words and its weight computed, and the candidate texts whose weight exceeds a certain threshold are determined to be classified corpus. In this step, 5000 classified corpus texts are obtained.

Step 305: perform corpus cross-validation on the 5000 classified corpus texts determined, and determine 1000 final classified corpus texts.

Specifically, in this step, the 5000 classified corpus texts determined in step 304 can be divided into 5 parts. Each part is selected in turn as test data, and each of the remaining 4 parts is used to verify the category of the test data. The 1000 texts with the highest verification success rate are chosen as the final classified corpus; texts with the same verification success rate are ordered randomly among themselves.
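The final selection in step 305, ranking by verification success rate with random ordering among ties, can be sketched as follows. The function name `select_final` and the `(text, success_rate)` pair layout are hypothetical; the sketch relies on Python's sort being stable, so shuffling first and sorting second leaves tied items in random relative order.

```python
import random


def select_final(corpus_with_rates, k, rng=None):
    """corpus_with_rates: iterable of (text, success_rate); keep the top k."""
    rng = rng or random.Random()
    items = list(corpus_with_rates)
    rng.shuffle(items)  # randomizes the relative order of tied items
    # Stable sort: items with equal rates keep their shuffled order.
    items.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in items[:k]]
```

In the scenario above, `k` would be 1000 and each success rate would be the fraction of the 4 verifications that confirmed the text's category.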

As can be seen from the above description, in the technical solution provided by the embodiment of the invention, a number of seed words of a known category are chosen in advance, and a number of input samples are obtained from a database to form an input sample set. Feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from that feature sample set. The classified corpus and its category are then determined according to the classification feature words and candidate texts, which improves the efficiency and accuracy of obtaining classified corpora.

Based on the same inventive concept as the above method for determining a classified corpus, an embodiment of the invention also provides a device for determining a classified corpus, which can be applied in the above method flow.

Fig. 4 is a schematic structural diagram of the device for determining a classified corpus provided by an embodiment of the invention. The device may comprise:

a first acquisition module 41, configured to obtain a preset number of input samples from a database to form an input sample set, wherein each input sample comprises the entry name of an entry, classification information, and related-entry information;

a second acquisition module 42, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;

a first determination module 43, configured to determine classification feature words according to the feature sample set;

a second determination module 44, configured to determine the classified corpus and its category according to the classification feature words and candidate texts.

The second acquisition module 42 obtains feature samples from the input sample set according to the preset seed words, specifically by the following process:

Step A: obtain from the input sample set the feature samples that contain a current seed word;

Step B: judge whether the number of feature samples exceeds a first threshold; if so, end this process; otherwise, go to step C;

Step C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step A.

The first determination module 43 is specifically configured to obtain the entries in the feature sample set, determine the weight of each of these entries, and determine the classification feature words according to the weights of the entries.

Here, the weight of an entry is the discrimination of the entry.

The first determination module 43 is specifically configured to obtain the related entries in the feature sample set; determine the discrimination of each of these related entries; determine the discrimination of each entry according to the discriminations of the related entries; and determine the classification feature words according to the discriminations of the entries.

The discrimination of each related entry is specifically the ratio of the number of times that related entry occurs in the related-entry information of the feature sample set to the number of times it occurs in the input sample set. The discrimination of each entry is specifically the mean of the discriminations of the related entries contained in the feature sample in which the entry is located.

The first determination module 43 is specifically configured to determine that an entry is a classification feature word when the discrimination of the entry exceeds a second threshold.

The first determination module 43 is specifically configured to determine the weight of each entry according to preset parameters and, when the weight of an entry exceeds a third threshold, determine that the entry is a classification feature word; or to determine the weight of each entry according to the HITS algorithm and, when the weight of an entry exceeds the third threshold, determine that the entry is a classification feature word.

The preset parameters comprise one or any combination of the following:

the number of clicks on the entry, the number of positive ratings, and the number of edits.

The second determination module 44 is specifically configured to segment the candidate text into words and obtain the classification feature words it contains; determine the weight of the candidate text according to those classification feature words; and, when the weight of the candidate text exceeds a fourth threshold, determine that the candidate text is classified corpus and take the category to which the preset seed words belong as the category of this classified corpus.

The second determination module 44 determines the weight of the candidate text from the classification feature words obtained, specifically by the following formula:

The second determination module 44 is also configured to divide the classified corpus determined into multiple parts, perform corpus cross-validation on the parts, and determine the final classified corpus and its category.

The corpus cross-validation is implemented by the following process:

Step A1: select, from the parts of the classified corpus, one part that has not yet been selected as test data;

Step B1: use each of the remaining parts in turn to verify the category of the test data;

Step C1: count the number of correct verifications and, when it exceeds a fifth threshold, determine that the test data is final classified corpus;

Step D1: judge whether any part has not yet been selected; if so, go to step A1; otherwise, end this process.

From the above description of the embodiments, those skilled in the art can clearly understand that the invention can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the invention.

Those skilled in the art will understand that the drawings are schematic diagrams of a preferred embodiment, and the modules or processes in the drawings are not necessarily required to implement the invention.

Those skilled in the art will understand that the modules of the device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the present embodiment. The modules of the above embodiments may be merged into one module or further split into multiple sub-modules.

The above discloses only several specific embodiments of the invention; however, the invention is not limited thereto, and any variation that those skilled in the art can conceive shall fall within the protection scope of the invention.

Claims

1. A method for determining a classified corpus, characterized by comprising:

2. The method of claim 1, characterized in that obtaining feature samples from the input sample set according to the preset seed words is specifically implemented by the following process:

3. The method of claim 1, characterized in that determining the classification feature words according to the feature sample set is specifically:

obtaining the entries in the feature sample set;

determining the weight of each of these entries;

determining the classification feature words according to the weights of the entries.

4. The method of claim 3, characterized in that the weight of an entry is the discrimination of the entry;

determining the weight of each of these entries is specifically:

obtaining the related entries in the feature sample set;

determining the discrimination of each of these related entries;

determining the discrimination of each entry according to the discriminations of the related entries;

determining the classification feature words according to the weights of the entries is specifically:

determining the classification feature words according to the discriminations of the entries.

5. The method of claim 4, characterized in that

the discrimination of each related entry is specifically:

the ratio of the number of times that related entry occurs in the related-entry information of the feature sample set to the number of times it occurs in the input sample set;

the discrimination of each entry is specifically:

the mean of the discriminations of the related entries contained in the feature sample in which the entry is located;

determining the classification feature words according to the discriminations of the entries is specifically:

when the discrimination of an entry exceeds a second threshold, determining that the entry is a classification feature word.

6. The method of claim 3, characterized in that determining the weight of each of these entries is specifically:

determining the weight of each entry according to preset parameters; or

determining the weight of each entry according to the HITS algorithm;

wherein the preset parameters comprise one or any combination of the following: the number of clicks on the entry, the number of positive ratings, and the number of edits;

when the weight of an entry exceeds a third threshold, determining that the entry is a classification feature word.

7. method as claimed in claim 3 is characterized in that, and is described according to described characteristic of division word and text to be selected definite classification language material and classification thereof, is specially:

Described text to be selected is cut word, and obtain the characteristic of division word that comprises in this text to be selected;

Determine the weights of described text to be selected according to the characteristic of division word that gets access to;

When the weights of described text to be selected surpass the 4th threshold value, determine that described text to be selected is the classification language material, and the affiliated classification of will be described default seed word is as the classification of described classification language material.

8. The method of claim 7, wherein determining the weight of the candidate text according to the obtained classification feature words is specifically realized by the following formula:

wherein tf is the term frequency, within the candidate text, of a classification feature word occurring in that text; i is the number of classification feature words; Qi is the weight of the i-th classification feature word; and N is the word count of the candidate text.
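
The formula itself is not reproduced in this text. Purely as an assumption, the sketch below treats the text weight as a length-normalized weighted sum, that is, the sum over the classification feature words of tf times Qi, divided by N; the patent's actual formula may differ. Whitespace splitting stands in for word segmentation, and all names and values are hypothetical.

```python
from collections import Counter

def text_weight(tokens, feature_weights):
    """ASSUMED formula: sum(tf_i * Q_i) / N over the classification feature words."""
    if not tokens:
        return 0.0
    tf = Counter(tokens)  # term frequency of every token in the candidate text
    total = sum(tf[word] * q for word, q in feature_weights.items())
    return total / len(tokens)  # N taken as the token count (an assumption)

def classify_candidate(text, feature_weights, fourth_threshold, seed_category):
    """Return the seed word's category if the text qualifies, else None."""
    tokens = text.split()  # stand-in for a real word segmenter
    if text_weight(tokens, feature_weights) > fourth_threshold:
        return seed_category
    return None

# Hypothetical feature-word weights Q_i and candidate text.
feature_weights = {"goal": 2.0, "striker": 1.0}
candidate = "the striker scored a goal and another goal"
label = classify_candidate(candidate, feature_weights, 0.5, "sports")
```

Here the candidate has 8 tokens, "goal" occurs twice and "striker" once, so the assumed weight is (2 × 2.0 + 1 × 1.0) / 8 = 0.625, which exceeds the illustrative threshold of 0.5.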

9. The method of claim 7, further comprising:

dividing the determined classification corpus into multiple parts;

performing corpus cross-validation on the parts of the classification corpus, and determining the final classification corpus and its category.

10. The method of claim 9, wherein performing corpus cross-validation on the parts of the classification corpus is specifically realized by the following flow:

Step A1: selecting, from the parts of the classification corpus, one part that has not yet been selected as test data;

Step B1: using each of the remaining parts of the classification corpus to verify the category of the test data;

Step C1: counting the number of correct verifications, and when that number exceeds a fifth threshold, determining that the test data is a final classification corpus;

Step D1: judging whether any unselected part of the classification corpus remains; if so, returning to Step A1; otherwise, ending the flow.
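
Steps A1 to D1 can be sketched roughly as below. The claim does not say how a part "verifies" a category, so `verify` is a hypothetical callable supplied by the caller; the toy `overlap_verifier` (nearest text by shared words) is only a stand-in for a trained classifier. Counting correct verifications per text rather than per part is also an assumption of this sketch.

```python
def cross_validate(parts, verify, fifth_threshold):
    """Keep each labelled text that enough of the other parts agree with."""
    final = []
    for index, test_part in enumerate(parts):        # Steps A1 and D1: visit every part once
        others = parts[:index] + parts[index + 1:]
        for text, category in test_part:
            # Step B1: each remaining part predicts a category for the test text.
            correct = sum(1 for part in others if verify(part, text) == category)
            # Step C1: keep the text if the correct count exceeds the threshold.
            if correct > fifth_threshold:
                final.append((text, category))
    return final

def overlap_verifier(training_part, text):
    """Toy classifier: category of the training text sharing the most words."""
    words = set(text.split())
    best = max(training_part, key=lambda item: len(words & set(item[0].split())))
    return best[1]

# Hypothetical corpus split into three parts; the last text is mislabelled.
parts = [
    [("goal striker match", "sports"), ("stock market price", "finance")],
    [("striker goal league", "sports"), ("market price bank", "finance")],
    [("match goal referee", "sports"), ("stock bank price", "finance"),
     ("stock loan fund", "sports")],
]
final_corpus = cross_validate(parts, overlap_verifier, fifth_threshold=1)
```

With a threshold of 1, a text must be confirmed by both of the other parts; the mislabelled finance text is filtered out while the six consistent texts are kept.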

11. A classification corpus determination device, comprising:

12. The classification corpus determination device of claim 11, wherein the second acquisition module obtains the feature samples from the input sample set according to the preset seed word, specifically through the following flow:

13. The classification corpus determination device of claim 11, wherein the first determination module is specifically configured to: obtain the entries in the feature sample set; determine the weight of each of the entries; and determine the classification feature words according to the weight of each entry.

14. The classification corpus determination device of claim 13, wherein the weight of an entry is the discrimination of that entry;

the first determination module is specifically configured to: obtain the related entries in the feature sample set; determine the discrimination of each of the related entries; determine the discrimination of each of the entries according to the discriminations of the related entries; and determine the classification feature words according to the discriminations of the entries.

15. The classification corpus determination device of claim 14, wherein the discrimination of each of the related entries is specifically the ratio of the number of times the related entry appears in the related-entry information of the feature sample set to the number of times the related entry appears in the input sample set; and the discrimination of each of the entries is specifically the mean of the discriminations of the related entries contained in the feature sample in which the entry is located;

the first determination module is specifically configured to determine, when the discrimination of an entry exceeds a second threshold, that the entry is a classification feature word.
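
The ratio in this claim can be illustrated with a small sketch. It assumes that occurrences in the input sample set are also counted in the related-entry information of the samples, which the claim does not state explicitly; the sample layout and all names are hypothetical.

```python
from collections import Counter

def related_entry_discrimination(feature_samples, input_samples):
    """Occurrences in the feature sample set divided by occurrences in the input sample set."""
    in_features = Counter(r for s in feature_samples for r in s["related"])
    in_inputs = Counter(r for s in input_samples for r in s["related"])
    # Skip related entries that never occur in the input set to avoid dividing by zero.
    return {r: in_features[r] / in_inputs[r] for r in in_features if in_inputs[r]}

# Hypothetical input samples; the first two are taken as the feature samples
# that a seed word would have selected.
input_samples = [
    {"name": "striker", "related": ["football", "goal"]},
    {"name": "ticket", "related": ["football", "cinema"]},
    {"name": "actor", "related": ["cinema"]},
    {"name": "referee", "related": ["football"]},
]
feature_samples = input_samples[:2]
discrimination = related_entry_discrimination(feature_samples, input_samples)
```

A related entry that appears only alongside feature samples ("goal" here) scores 1.0, while one that is spread across unrelated samples ("cinema") scores lower.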

16. The classification corpus determination device of claim 13, wherein the first determination module is specifically configured to: determine the weight of each entry according to preset parameters, and when the weight of an entry exceeds a third threshold, determine that the entry is a classification feature word; or determine the weight of each entry according to the hit algorithm, and when the weight of an entry exceeds the third threshold, determine that the entry is a classification feature word;

17. The classification corpus determination device of claim 13, wherein the second determination module is specifically configured to: segment the candidate text into words, and obtain the classification feature words contained in the candidate text; determine the weight of the candidate text according to the obtained classification feature words; and when the weight of the candidate text exceeds a fourth threshold, determine that the candidate text is a classification corpus, and take the category to which the preset seed word belongs as the category of the classification corpus.

18. The classification corpus determination device of claim 17, wherein the second determination module determines the weight of the candidate text according to the obtained classification feature words, specifically by the following formula:

[Formula image 2012100566693100001DEST_PATH_IMAGE002; not reproduced in this text]

19. The classification corpus determination device of claim 17, wherein the second determination module is further configured to: divide the determined classification corpus into multiple parts; perform corpus cross-validation on the parts of the classification corpus; and determine the final classification corpus and its category.

20. The classification corpus determination device of claim 19, wherein the second determination module performs corpus cross-validation on the parts of the classification corpus, specifically through the following flow: