Detailed description of the embodiments
To address the defects of the prior art described above, an embodiment of the present invention provides a technical solution for determining a classification corpus. In this solution, a certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set; feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from the feature sample set; the classification corpus and its category are then determined from the classification feature words and candidate texts. This improves the efficiency and accuracy of classification corpus acquisition.
In the technical solution provided by the embodiment of the present invention, the database from which the input sample set is obtained may be Baidu Baike (Baidu Encyclopedia), Wikipedia, WordNet, or the like. Each input sample obtained from the database may include the entry name, category information, and related-entry information of an entry, in the format shown in Table 1:
Table 1
Entry | Category | Related entries
Metrical poetry | Literature, poetry | Prose poem, regulated verse, free verse
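As an illustration only, one input sample in the format of Table 1 could be held in a small record type. The class name `InputSample` and its field names are hypothetical and do not come from the embodiment; treating the category and related-entry columns as lists is also an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InputSample:
    """One row of Table 1 (hypothetical field names)."""
    entry: str                        # entry name
    categories: Tuple[str, ...]       # category information (assumed to be a list)
    related_entries: Tuple[str, ...]  # related-entry information


# The example row from Table 1.
sample = InputSample(
    entry="Metrical poetry",
    categories=("Literature", "Poetry"),
    related_entries=("Prose poem", "Regulated verse", "Free verse"),
)
```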
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the embodiments of the present invention.
Figure 1 is a flow diagram of a classification corpus determination method provided by an embodiment of the present invention. The method may include the following steps:
Step 101: obtain a preset number of input samples from a database to form an input sample set.
Specifically, take mining a classification corpus from Baidu Baike as an example. In the technical solution provided by the embodiment of the present invention, a preset number (for example, 1000) of input samples may be obtained from Baidu Baike, in the format shown in Table 1.
Step 102: obtain feature samples from the input sample set according to preset seed words to form a feature sample set.
Specifically, in the technical solution provided by the embodiment of the present invention, when a classification corpus needs to be obtained, a certain number of seed words may be chosen in advance. For example, when a sports classification corpus is needed, 10 sports seed words may be chosen in advance, such as sport, football, athlete, track and field, World Cup, and Olympic Games. After the input samples are obtained and the seed words are chosen, feature samples can be obtained from the input sample set according to the seed words to form the feature sample set.
In the technical solution provided by the embodiment of the present invention, the flow for obtaining feature samples may be as shown in Fig. 2 and may include the following steps:
Step 102A: obtain, from the input sample set, the feature samples that contain a current seed word.
For example, if the seed words chosen in advance are football, basketball, and athlete, the feature samples containing these seed words are obtained from the input sample set. A feature sample contains a seed word either when its entry is football, basketball, or athlete, or when its related entries include one of these seed words.
Step 102B: judge whether the number of feature samples exceeds a first threshold. If so, end the flow; otherwise, go to step 102C.
The feature sample number threshold can be set according to actual demand, for example 10000.
Step 102C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step 102A.
Specifically, when the number of feature samples obtained is below the preset threshold, the entries and related entries in the feature samples already obtained can be added to the seed words, and more feature samples can be obtained from the input sample set according to the updated seed words.
A sufficient number of feature samples can be obtained through the above flow.
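The loop formed by steps 102A to 102C can be sketched as follows. This is a minimal sketch, not the embodiment's implementation: the names `Sample` and `collect_feature_samples` are hypothetical, and the early exit when the seed set stops growing is an added assumption (the text does not say what happens if the threshold can never be reached).

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Sample(NamedTuple):
    """An input sample reduced to the two fields the loop needs (hypothetical)."""
    entry: str
    related_entries: Tuple[str, ...]


def collect_feature_samples(
    input_samples: List[Sample], seed_words: Iterable[str], first_threshold: int
) -> Tuple[List[Sample], Set[str]]:
    """Return the feature samples and the final (expanded) seed set."""
    seeds = set(seed_words)
    while True:
        # Step 102A: a sample matches if its entry or a related entry is a seed.
        features = [
            s for s in input_samples
            if s.entry in seeds or seeds.intersection(s.related_entries)
        ]
        # Step 102B: stop once the count exceeds the first threshold.
        if len(features) > first_threshold:
            return features, seeds
        # Step 102C: add the matched entries and related entries to the seeds.
        expanded = set(seeds)
        for s in features:
            expanded.add(s.entry)
            expanded.update(s.related_entries)
        # Assumption (not in the text): give up when no new seed was added,
        # otherwise the loop would never terminate.
        if expanded == seeds:
            return features, seeds
        seeds = expanded


samples = [
    Sample("Football", ("Basketball", "Billiards", "World Cup")),
    Sample("Basketball", ("NBA",)),
    Sample("NBA", ("Dunk",)),
    Sample("Metrical poetry", ("Prose poem",)),
]
features, seeds = collect_feature_samples(samples, ["Football"], first_threshold=2)
```

Starting from the single seed "Football", the seed set grows through related entries until three sports samples match, while the unrelated poetry sample is never pulled in.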
Step 103: determine classification feature words according to the feature sample set obtained.
Specifically, in the embodiment of the present invention, after the feature samples are obtained, the weight of the entry contained in each feature sample can be determined, and the classification feature words can be determined according to the weights of the entries.
Take the case where the weight of an entry is its discrimination. In the embodiment of the present invention, the input sample set is taken as the full set, and two further sets are determined from the feature sample set:
Set 1: all entries in the feature sample set;
Set 2: all related entries in the feature sample set.
For a word W in set 2, its discrimination is defined as:
$$Q_W = \frac{\text{number of occurrences of } W \text{ in set 2}}{\text{number of occurrences of } W \text{ in the full set}}$$
For a word x in set 1, its discrimination is defined as the mean of the discriminations of all its related entries:
$$Q_x = \frac{1}{n}\sum_{i=1}^{n} Q_{W_i}$$
where n is the number of related entries in the feature sample whose entry is x, and $Q_{W_i}$ is the discrimination of the i-th related entry.
After the discrimination of each entry in the feature sample set is determined, the entries whose discrimination exceeds a threshold (for example, K) can be determined to be classification feature words.
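The discrimination computation of step 103 can be sketched as below. The function names are hypothetical, each sample is reduced to an `(entry, related_entries)` pair, and two points are assumptions: occurrences in the full set are counted in the related-entry fields of the input samples, and each entry appears in one feature sample. The feature samples must be a subset of the input samples, so the denominator is never zero.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, Tuple[str, ...]]  # (entry, related entries)


def related_discrimination(
    feature_samples: List[Sample], input_samples: List[Sample]
) -> Dict[str, float]:
    """Q_W: occurrences of W in set 2 divided by occurrences in the full set."""
    in_set2 = Counter(w for _, related in feature_samples for w in related)
    # Assumption: the full-set count is taken over related-entry fields only.
    in_full = Counter(w for _, related in input_samples for w in related)
    return {w: in_set2[w] / in_full[w] for w in in_set2}


def entry_discrimination(
    feature_samples: List[Sample], q_related: Dict[str, float]
) -> Dict[str, float]:
    """Q_x: mean of Q_W over the related entries of the sample whose entry is x."""
    return {
        entry: sum(q_related[w] for w in related) / len(related)
        for entry, related in feature_samples
        if related  # skip samples without related entries
    }


def classification_feature_words(q_entry: Dict[str, float], threshold: float) -> Set[str]:
    """Entries whose discrimination exceeds the threshold."""
    return {x for x, q in q_entry.items() if q > threshold}


input_samples = [
    ("Football", ("Basketball", "World Cup")),
    ("Tennis", ("World Cup",)),
    ("Poetry", ("Basketball", "Verse")),
    ("Novel", ("Verse",)),
]
feature_samples = input_samples[:2]
q_related = related_discrimination(feature_samples, input_samples)
q_entry = entry_discrimination(feature_samples, q_related)
feature_words = classification_feature_words(q_entry, threshold=0.8)
```

In this toy data "World Cup" appears only in sports samples, so it discriminates fully, while "Basketball" also appears under "Poetry" and pulls the score of "Football" below the threshold.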
Step 104: determine the classification corpus and its category according to the classification feature words and candidate texts.
Specifically, after the classification feature words are determined, a candidate text can be selected and segmented into words, the classification feature words contained in the candidate text are obtained, and the weight of the candidate text is determined from them. When the weight of the candidate text exceeds a threshold, the candidate text is determined to be a classification corpus text, and the category of the corresponding seed words is taken as its category.
The weight of the candidate text is determined from the classification feature words obtained, by a formula over the following quantities: tf is the term frequency, in the candidate text, of a classification feature word occurring in that text; n is the number of classification feature words; Q_i is the weight of the i-th classification feature word; and N is the number of words in the candidate text.
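The formula itself does not survive in this text. One form that is consistent with the quantities defined here, offered purely as an assumption, is the length-normalised sum $\text{weight} = \frac{1}{N}\sum_{i=1}^{n} tf_i \cdot Q_i$. The sketch below implements that assumed form; the function names are hypothetical, the text is taken to be already segmented into tokens, and N is taken to be the token count.

```python
from collections import Counter
from typing import Dict, Sequence


def text_weight(tokens: Sequence[str], feature_weights: Dict[str, float]) -> float:
    """Assumed form: sum over feature words of tf_i * Q_i, divided by N."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)  # tf of every token in the candidate text
    total = sum(counts[word] * q for word, q in feature_weights.items())
    return total / len(tokens)  # N assumed to be the number of tokens


def is_classification_corpus(
    tokens: Sequence[str], feature_weights: Dict[str, float], threshold: float
) -> bool:
    """Step 104 decision: keep the text when its weight exceeds the threshold."""
    return text_weight(tokens, feature_weights) > threshold


tokens = ["football", "match", "football", "fans"]
weights = {"football": 0.06, "basketball": 0.08}
w = text_weight(tokens, weights)
```

Dividing by N keeps long texts from scoring high merely because they contain more words; whether the original formula normalises in exactly this way cannot be confirmed from the text.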
To further improve the accuracy of the classification corpus obtained, in the technical solution provided by the embodiment of the present invention, after the classification corpus is determined, it can also be divided into multiple parts; corpus cross-validation is performed on the parts, and the final classification corpus and its category are determined.
Corpus cross-validation over the parts can be carried out through the following flow:
Step A1: select one not-yet-selected part of the classification corpus as test data;
Step B1: verify the category of the test data using each of the remaining parts in turn;
Step C1: count the number of correct verifications and, when it exceeds a fifth threshold, determine the test data to be final classification corpus;
Step D1: judge whether any part has not yet been selected. If so, go to step A1; otherwise, end the flow.
For example, the determined classification corpus can be divided into 10 parts; in turn, 9 parts serve as training data and 1 part as test data, and the category of the test data is verified, so that each part of test data undergoes 9 classification tests. The test data whose number of correct category verifications exceeds the threshold is determined to be final classification corpus.
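Steps A1 to D1 can be sketched as follows. The text does not say how one part "verifies" the category of a test text, so the classifier is passed in as a hypothetical `train` callable; `train_first_word` is only a toy stand-in used to make the example runnable, not the classifier of the embodiment.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Doc = Tuple[str, str]              # (text, category)
Predictor = Callable[[str], str]   # maps a text to a predicted category


def cross_validate(
    parts: List[List[Doc]], train: Callable[[List[Doc]], Predictor], threshold: int
) -> List[Doc]:
    """Keep the texts whose category is confirmed by more than `threshold` parts."""
    kept = []
    for i, test_part in enumerate(parts):            # steps A1 and D1
        # One model per remaining part, so each text is tested len(parts) - 1 times.
        models = [train(p) for j, p in enumerate(parts) if j != i]
        for text, category in test_part:             # step B1
            correct = sum(1 for predict in models if predict(text) == category)
            if correct > threshold:                  # step C1
                kept.append((text, category))
    return kept


def train_first_word(part: List[Doc]) -> Predictor:
    """Toy classifier: the majority category seen for a text's first word."""
    votes = defaultdict(Counter)
    for text, category in part:
        votes[text.split()[0]][category] += 1

    def predict(text: str) -> str:
        seen = votes.get(text.split()[0])
        return seen.most_common(1)[0][0] if seen else ""

    return predict


parts = [
    [("football match", "sport"), ("poem rhyme", "literature"),
     ("poem sonnet", "literature"), ("poem ode", "sport")],  # last one is mislabeled
    [("football league", "sport"), ("poem verse", "literature")],
    [("football cup", "sport"), ("poem stanza", "literature")],
]
final = cross_validate(parts, train_first_word, threshold=1)
```

With three parts every text is checked by two models; requiring more than one correct verification drops the mislabeled text and keeps the rest.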
It should be noted that the discrimination-based method in the above flow is only one specific implementation of determining entry weights in the technical solution provided by the embodiment of the present invention, and the way of determining entry weights is not limited to it. For example, entries can also be weighted according to preset parameters of each entry, or feature words can be weighted with the HITS algorithm commonly used in link analysis; when the weight of an entry exceeds a threshold, the entry is determined to be a classification feature word. The preset parameters include at least one of the following, or any combination thereof: the click count, the number of positive ratings, and the number of edits of the entry.
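For reference, a textbook HITS iteration is sketched below. How the embodiment builds the link graph and which score it uses as the entry weight are not stated; linking each entry to its related entries and reading the authority score as the weight are assumptions, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(
    links: Dict[str, List[str]], iterations: int = 50
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed graph given as adjacency lists."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the nodes linking in.
        auth = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = _normalize(auth)
        # Hub update: sum of authority scores of the nodes linked to.
        hub = _normalize(
            {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        )
    return hub, auth


# Assumed graph: an edge from each entry to each of its related entries.
links = {
    "Football": ["World Cup", "Basketball"],
    "Olympic Games": ["World Cup"],
}
hub, auth = hits(links)
```

Entries whose score exceeds the chosen threshold would then be taken as classification feature words, in the same way as with discrimination.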
The technical solution provided by the embodiment of the present invention is described in more detail below with reference to a specific drawing and a specific application scenario.
Figure 3 is a flow diagram of the classification corpus determination method in a specific application scenario provided by an embodiment of the present invention. In this embodiment, 5000 sports classification corpus texts need to be obtained; the preselected seed words include sport, football, athlete, track and field, World Cup, and Olympic Games; and the database for corpus mining is Baidu Baike. The method may include:
Step 301: obtain 10000 input samples from Baidu Baike to form an input sample set.
The format of the input sample set can be as shown in Table 1.
Step 302: obtain 1000 feature samples from the input sample set according to the preset seed words to form a feature sample set.
A feature sample can be as shown in Table 2:
Table 2
Entry | Category | Related entries
Football | Sport | Basketball, billiards, World Cup
When fewer than 1000 feature samples are obtained with the seed words sport, football, athlete, track and field, World Cup, and Olympic Games, more feature samples can be obtained according to the related entries contained in the feature samples.
Step 303: determine classification feature words according to the feature sample set.
Specifically, in this embodiment, the weight of each entry in the feature samples can be determined by computing its discrimination, and the entries whose weight exceeds 0.05 are taken as classification feature words.
Take the feature sample shown in Table 2 as an example. Assuming that the discrimination of basketball is 0.08, that of billiards is 0.03, and that of World Cup is 0.07, the discrimination of football is 0.06, so football is a classification feature word.
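Written out with the definition from step 103, under which the discrimination of an entry is the mean of the discriminations of its related entries, the arithmetic of this example is:

$$Q_{\text{football}} = \frac{0.08 + 0.03 + 0.07}{3} = \frac{0.18}{3} = 0.06 > 0.05$$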
Step 304: determine the classification corpus according to the classification feature words and candidate texts.
Specifically, 50000 candidate texts can be obtained from the Internet, and word segmentation and weight calculation are performed on each candidate text; the candidate texts whose weight exceeds a certain threshold are determined to be classification corpus texts. In this step, 5000 classification corpus texts are obtained.
Step 305: perform corpus cross-validation on the 5000 classification corpus texts determined, and determine the 1000 final classification corpus texts.
Specifically, in this step, the 5000 classification corpus texts determined in step 304 can be divided into 5 parts; each part is selected in turn as test data, and its category is verified with each of the remaining 4 parts. The texts are then sorted by verification success rate from high to low, and the top 1000 are taken as the final classification corpus. Texts with the same verification success rate are ordered randomly among themselves.
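The selection rule of step 305, sorting by verification success rate and ordering ties randomly, can be sketched as below. The function name `select_top` and the `(text, rate)` pair layout are hypothetical. The sketch shuffles first and then relies on Python's stable sort, so texts with equal rates keep their shuffled order.

```python
import random
from typing import List, Optional, Sequence, Tuple


def select_top(
    rated: Sequence[Tuple[str, float]], k: int, rng: Optional[random.Random] = None
) -> List[str]:
    """Return the k texts with the highest success rate, ties in random order."""
    rng = rng or random.Random()
    items = list(rated)
    rng.shuffle(items)  # random order among texts with the same rate
    # Stable sort: equal rates keep the shuffled order, also with reverse=True.
    items.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in items[:k]]


rated = [("a", 1.0), ("b", 0.5), ("c", 0.5), ("d", 0.25)]
top2 = select_top(rated, 2, random.Random(0))
```

In the embodiment, `k` would be 1000 and the rate of a text would be the fraction of the 4 remaining parts that confirm its category.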
As can be seen from the above description, in the technical solution provided by the embodiment of the present invention, a certain number of seed words of a known category are chosen in advance, and a certain number of input samples are obtained from a database to form an input sample set; feature samples are obtained from the input sample set according to the preset seed words to form a feature sample set, and classification feature words are determined from the feature sample set; the classification corpus and its category are determined from the classification feature words and candidate texts. This improves the efficiency and accuracy of classification corpus acquisition.
Based on the same inventive concept as the above classification corpus determination method, an embodiment of the present invention further provides a classification corpus determination device, which can be applied in the above method flow.
Figure 4 is a structural diagram of the classification corpus determination device provided by an embodiment of the present invention. The device may include:
a first acquisition module 41, configured to obtain a preset number of input samples from a database to form an input sample set, where each input sample includes the entry name, category information, and related-entry information of an entry;
a second acquisition module 42, configured to obtain feature samples from the input sample set according to preset seed words to form a feature sample set;
a first determining module 43, configured to determine classification feature words according to the feature sample set;
a second determining module 44, configured to determine the classification corpus and its category according to the classification feature words and candidate texts.
The second acquisition module 42 obtains feature samples from the input sample set according to the preset seed words through the following flow:
Step A: obtain, from the input sample set, the feature samples that contain a current seed word;
Step B: judge whether the number of feature samples exceeds a first threshold. If so, end the flow; otherwise, go to step C;
Step C: obtain the entries and related entries in the feature samples, add them to the seed words to update the current seed words, and go to step A.
The first determining module 43 is specifically configured to obtain the entries in the feature sample set, determine the weight of each of these entries, and determine the classification feature words according to the weights of the entries.
The weight of an entry is the discrimination of the entry.
The first determining module 43 is specifically configured to obtain the related entries in the feature sample set, determine the discrimination of each related entry, determine the discrimination of each entry according to the discriminations of the related entries, and determine the classification feature words according to the discriminations of the entries.
The discrimination of each related entry is the ratio of the number of times that related entry occurs in the related-entry information of the feature sample set to the number of times it occurs in the input sample set. The discrimination of each entry is the mean of the discriminations of the related entries contained in the feature sample where the entry is located.
The first determining module 43 is specifically configured to determine an entry to be a classification feature word when its discrimination exceeds a second threshold.
The first determining module 43 is specifically configured to determine the weight of each entry according to preset parameters and, when the weight of an entry exceeds a third threshold, determine that entry to be a classification feature word; or to determine the weight of each entry with the HITS algorithm and, when the weight of an entry exceeds the third threshold, determine that entry to be a classification feature word.
The preset parameters include one of the following or any combination thereof:
the click count, the number of positive ratings, and the number of edits of the entry.
The second determining module 44 is specifically configured to segment the candidate text into words and obtain the classification feature words contained in the candidate text; determine the weight of the candidate text from the classification feature words obtained; and, when the weight of the candidate text exceeds a fourth threshold, determine the candidate text to be a classification corpus text and take the category to which the preset seed words belong as the category of that text.
The second determining module 44 determines the weight of the candidate text from the classification feature words obtained, by a formula over the following quantities: tf is the term frequency, in the candidate text, of a classification feature word occurring in that text; n is the number of classification feature words; Q_i is the weight of the i-th classification feature word; and N is the number of words in the candidate text.
The second determining module 44 is further configured to divide the determined classification corpus into multiple parts, perform corpus cross-validation on the parts, and determine the final classification corpus and its category. The corpus cross-validation is carried out through the following flow:
Step A1: select one not-yet-selected part of the classification corpus as test data;
Step B1: verify the category of the test data using each of the remaining parts in turn;
Step C1: count the number of correct verifications and, when it exceeds a fifth threshold, determine the test data to be final classification corpus;
Step D1: judge whether any part has not yet been selected. If so, go to step A1; otherwise, end the flow.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required for implementing the present invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment can be distributed in the device as described in the embodiment, or can be changed accordingly and located in one or more devices different from that of this embodiment. The modules of the above embodiments can be combined into one module or further split into multiple sub-modules.
The above discloses only several specific embodiments of the present invention. The present invention is not limited thereto, however, and any variation that those skilled in the art can conceive shall fall within the protection scope of the present invention.