CN102023986A

CN102023986A - Method and equipment for constructing text classifier by referencing external knowledge

Info

Publication number: CN102023986A
Application number: CN2009101719478A
Authority: CN
Inventors: 李建强; 赵彧; 刘博�
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd; Renesas Electronics China Co Ltd
Priority date: 2009-09-22
Filing date: 2009-09-22
Publication date: 2011-04-20
Anticipated expiration: 2029-09-22
Also published as: CN102023986B

Abstract

The invention provides a method and equipment for constructing a text classifier by referencing external knowledge. The method comprises the steps of: inputting a label text set; extracting internal characteristics of the label text set; constructing external characteristics of the label text set by referencing an external knowledge source (such as a dictionary); comprehensively considering the internal characteristics and the external characteristics of the label text set, and selecting training texts from the label text set; and learning the generation of the text classifier by using the selected training texts. According to the invention, sample distribution deviation generated by the label text set can be possibly regulated by the external characteristics automatically generated by the external knowledge source, and therefore, the finally trained classifier has better generalization capability and robustness.

Description

Make up the method and apparatus of text classifier with reference to external knowledge

Technical field

Relate generally to information retrieval of the present invention and text classification.More specifically, the present invention relates to make up the method and apparatus of text classifier with reference to external knowledge.

Background technology

Along with the develop rapidly of electronic office and the Internet, the quantity of information of e-text becomes blast trend growth, and large-scale automatic information processing has become necessary means and the challenge that people utilize this extensive information better.

Information retrieval is meant that information organizes by certain mode, and finds out the process and the technology of relevant information according to information user's needs.And text automatic classification is to realize one of main support technology of information retrieval, and its basic purpose is that text is divided in the predefined class, helps people to retrieve, inquire about, filter and utilize the effective means of information.Early stage text classification is adopted is based on knowledge engineering and method of expert system, and such method is very complicated and lack dirigibility.Along with the rise and the development of machine learning, in the text classification field that the sorter model of a lot of machine learning is introduced into, obtained good effect from different aspects, become the mainstream technology that realizes the autotext classification at present.

Text classification based on machine learning realizes by the text classifier of final structure, and its performance depends on employed training data (text) set to a great extent, so just causes wherein the key of being selected to of training data.

The selection of so-called training data is meant selects one of them subclass to be used for training the corresponding text sorter from the given text collection with class label (mark text set).Good training data system of selection can increase substantially the efficient that makes up sorter by the quantity that reduces training text on the one hand, improve generalization ability and the robustness that training obtains sorter by the quality that improves training text on the other hand, thereby guarantee the precision of classification.

There have been at present some relevant patent and investigative techniques of being used for the training text selection.

For example, in the U.S. Pat 7409404B2 that is entitled as " Creating taxonomies and training data for document categorization ", provide a kind of training text selection technology, it is mainly under the prerequisite of the interference of as far as possible eliminating other feature of overclass, by the refine quality of training text data of the statistical information of given mark text.

In addition, at Wang, J, Neskovic, P and Cooper have also described two kinds and have utilized the internal statistical feature of given mark text collection to carry out equipment and the method that text classifier was selected also and then learnt to generate to training text in the non-patent literature (In:LNCS vol.3610 2005) (calling non-patent literature 1 in the following text) that is entitled as " Training data selection for support vector machines " that L.N write.Its concrete structure block diagram and workflow are as depicted in figs. 1 and 2.

As shown in Figure 1, making up equipment 100 according to the text classifier of the prior art is put 102, is constituted based on the training text selecting arrangement 103 and the sorter learning device 104 of statistical method by input media 101, text vector makeup.Input media 101 is from one group of mark of mark text storage unit 105 inputs text.Each mark text that 102 pairs of inputs are put in the text vector makeup carries out vectorization, and the vector space model (VSM) corresponding to each mark text that will generate stores in vector space model (VSM) storage unit 106.Then, each mark text of vectorization is given a mark based on statistical method based on the training text selecting arrangement 103 of statistical method and therefrom select suitable training text.The training text of selecting is classified device learning device 104 subsequently and is used for study generation text classifier.

Fig. 2 shows the schematic workflow that text classifier shown in Figure 1 makes up equipment 100.In non-patent literature 1 disclosed technology contents, put down in writing two kinds of internal features and carried out the illustrative methods that text classifier was selected also and then learnt to generate to training text, i.e. exemplary method 1 and exemplary method 2 according to the mark text collection.In exemplary method 1, be that the number of the mark text that comprised in the border circular areas of the mark text that does not comprise other classifications of maximum at center is made N (xi) by note with each mark text xi; And select the mark text in the border circular areas that wherein has minimum N (xi), as the training text collection.In exemplary method 2, calculate each mark text xi to the convex set of the mark text of other classifications apart from d (xi); And select mark text, as training text with minimum d (xi).

No matter exemplary method 1 still is an exemplary method 2, all only derives from given mark text set self as the feature of compute statistics, has wherein only considered the sample distribution of given mark text set inside.Therefore, the text classifier that makes up must be subjected to the influence of given mark text fully at last, thereby causes the generalization ability and the poor robustness of this sorter.

Though also there are other training text systems of selection in the prior art, but, present training text system of selection mainly all is to utilize the inside knowledge in the given mark text set to realize, promptly, feature that is adopted and the weight DATA DISTRIBUTION of given mark text collection that places one's entire reliance upon, thus make the training text that chooses can have very strong skewed popularity.This skewed popularity can propagate into the classification orientation of the sorter of final structure, makes its generalization ability and robustness be subjected to very big influence, finally causes the sorter performance undesirable.

Summary of the invention

The present invention considers that just above-mentioned problems of the prior art develop.

According to thought of the present invention, derive from the external knowledge source (for example, meaning of a word dictionary) surface is introduced in the selection course of training text, and specifically, core of the present invention realizes being mainly reflected in following two aspects:

(1) structure of surface: it mainly utilizes in the outside meaning of a word dictionary about the definition of the meaning of a word or notion, is input with the notion that contains in class name or the accurate class name, makes up the surface that is independent of given mark text set; And

(2) select based on the training text of mixing method: be different from the tradition mark text marking sort method of the internal feature of only considering to derive from the mark text set, the internal feature (from bottom to top) that the method synthesis that the present invention provides is considered to derive from the surface (from top to bottom) in external knowledge source and derived from given mark text set realizes that the marking that marks text sorts.This process has not only been utilized the knowledge that given mark text set contained but also has been utilized the external knowledge source to carry out the selection of training text data about the semantic knowledge of class name or accurate class name.

According to first aspect present invention, a kind of method that is used to make up text classifier is provided, it comprises: input mark text set; Make up the surface of mark text set with reference to the external knowledge source; Take all factors into consideration the internal feature and the surface of mark text set and from the mark text set, select training text; And utilize selected training text to learn to generate text classifier.In an embodiment of the present invention, the vocabulary that vector space model comprised that internal feature can generate by the mark text is carried out vectorization constitutes, and surface can be by in the reference meaning of a word dictionary about the definition of the meaning of a word and the semantic relation between the vocabulary, representative characteristic vocabulary in a plurality of these classifications of coming out by class name (or accurate class name) expansion.

According to second aspect present invention, a kind of equipment that is used to make up text classifier is provided, it comprises: input media is used for input mark text set; The surface construction device is used for making up the surface that marks text set with reference to the external knowledge source; The training text selecting arrangement is used to take all factors into consideration internal feature and the surface that marks text set, selects training text from the mark text set; And the sorter learning device, be used to utilize selected training text to learn to generate text classifier.

According to the present invention, the surface that derives from the external knowledge source is introduced in the constructive process of the selection of training text and sorter.Obtained proofreading and correct control owing to derive from the data skewed popularity of given mark text set, therefore can improve the classification representativeness of training text greatly, and increased different classes of between the otherness of training text, thereby finally reach the generalization ability that improves the sorter that training obtains and the purpose of robustness.

Description of drawings

In conjunction with the accompanying drawings, from following detailed description to the embodiment of the invention, will understand the present invention better, similar label is indicated similar part in the accompanying drawing, wherein:

Fig. 1 is the structured flowchart that makes up equipment 100 according to the text classifier of prior art;

Fig. 2 is the indicative flowchart that the workflow of equipment 100 shown in Figure 1 is shown;

Fig. 3 is that text classifier makes up the structured flowchart of equipment 300 according to an embodiment of the invention;

Fig. 4 is the process flow diagram that the course of work of equipment 300 shown in Figure 3 is shown;

Fig. 5 is shown specifically the block diagram that makes up the inner structure of the surface construction device in the equipment 300 according to the text classifier of the embodiment of the invention;

Fig. 6 is the block diagram of inner structure that another example of surface construction device shown in Figure 5 is shown;

Fig. 7 is shown specifically according to the text classifier of the embodiment of the invention to make up the block diagram based on the inner structure of first example of the training text selecting arrangement of mixing method in the equipment 300;

Fig. 8 is the workflow diagram that first example of the training text selecting arrangement based on mixing method shown in Figure 7 is shown;

Fig. 9 illustrates according to the text classifier of the embodiment of the invention to make up the generalized block diagram based on the inner structure of second example of the training text selecting arrangement of mixing method in the equipment 300;

Figure 10 is the detailed diagram of inner structure that is shown in further detail second example of training text selecting arrangement based on mixing method shown in Figure 9;

Figure 11 is the workflow diagram that second example of the training text selecting arrangement based on mixing method shown in Figure 10 is shown; And

Figure 12 illustrates the indicative flowchart that makes up the workflow of equipment 300 according to the text classifier of the embodiment of the invention.

Embodiment

Here, for convenience of description, at first some technical terms that will use among the present invention are carried out brief description:

Fig. 3 is that text classifier makes up the structured flowchart of equipment 300 according to an embodiment of the invention.As shown in Figure 3, in an embodiment of the present invention, text classifier make up that equipment 300 comprises that input media 301, text vector makeup put 302, surface construction device 303, based on the training text selecting arrangement 304 and the sorter learning device 305 of mixing method.Make up equipment 100 with the text classifier according to prior art shown in Figure 1 and compare, text classifier make up input media 301, text vector makeup that equipment 300 comprised put 302 and sorter learning device 305 have and similar function of prior art and structure.Therefore, unique distinction of the present invention should be surface construction device 303 and the action and function of being realized based on the training text selecting arrangement 304 of mixing method.

In addition, in the system of Fig. 3, be different from prior art, also comprise class name thesaurus 309, external knowledge source 310 and surface storage unit 311.Their collaborative works are used for realizing training text selection course of the present invention quoting for external knowledge.Class name thesaurus 309 is used for storing the item name based on natural language of given classification task item appointment.The item name that provides in the classification task item generally can be described the vocabulary of the understandable natural language of behaving, and especially for real text classification system, generally all has good man-machine interface, to make things convenient for the user to browse or to inquire about.External knowledge source 310 for example can be the machine-readable meaning of a word dictionary of storage, has wherein defined the meaning of a word tabulation that each target vocabulary may have.A plurality of meaning of a word of each vocabulary have been defined in the meaning of a word dictionary, the hierarchical structure relation between the promptly different meaning of a word.Surface construction device 303 can be with classification input by name, utilizes in the meaning of a word dictionary about the definition of the meaning of a word and the semantic relation between the vocabulary, expands out the representative feature vocabulary of a plurality of these classifications by class name.Based on the training text selecting arrangement 304 of mixing method comprehensively the factor of internal feature and surface two aspects realize the selection of training text data.

Fig. 4 is the process flow diagram that the course of work of equipment 300 shown in Figure 3 is shown.Process shown in Figure 4 starts from step 401, wherein is similar to prior art, and input media 301 is from the set of one group of mark of mark text storage unit 306 inputs text.In step 402, text vector makeup is put 302 each mark text is carried out vectorization, to obtain the vector space model (VSM) corresponding to each mark text.The vocabulary that each VSM comprised can be used as the internal feature of this given mark text set.Here, it should be noted that text vectorization only is to extract a kind of exemplary instrumentation of mark text set internal feature, and should be as limitation of the present invention.Any means that are used to extract mark text set internal feature known in those skilled in the art can be applied to the present invention similarly.

Then, in step 403, surface construction device 303 makes up surface by reference class another name thesaurus 309 and external knowledge source 310.Constructed surface is stored in the surface storage unit 311 subsequently.About the building process of surface, will describe in detail subsequently.After obtaining marking the internal feature and surface of text set, in step 404, utilize the internal feature and the surface of mixing method, comprehensive reference mark text set to select training text based on the training text selecting arrangement 304 of mixing method.About training text selection course, also will in content subsequently, describe in detail based on mixing method.Can be stored in the training text storage unit 308 by set based on the training text selecting arrangement 304 selected training texts of mixing method.Then, in step 405, sorter learning device 305 generates text classifier by learning selected training text collection.So far, process 400 finishes.

To at first describe according to surface constructive process of the present invention below.Fig. 5 is shown specifically the block diagram that makes up the inner structure of the surface construction device 303 in the equipment 300 according to the text classifier of the embodiment of the invention.As shown in the figure, surface construction device 303 can comprise word segmentation processing unit 501, filter element 502 (optional), the meaning of a word marking unit 503, meaning of a word selected cell 504 and assembled unit 505.As previously mentioned, surface construction device 303 is with classification input by name, utilize external knowledge source 310 (for example meaning of a word dictionary) about the definition of the meaning of a word and the semantic relation between the vocabulary, expand out the representative feature vocabulary of a plurality of these classifications by class name, as surface.It has realizes comprising the steps: that (1) word segmentation processing unit 501 at first carries out pre-service to class name, it is decomposed into one group of vocabulary; (2) finish after the word segmentation processing of class name, alternatively, filter element 502 can filter the vocabulary that generates, to remove stop words (stopwords); (3) meaning of a word that subsequently each vocabulary defined in meaning of a word dictionary of meaning of a word marking unit 503 ordering of giving a mark; (4) meaning of a word selected cell 504 according to each meaning of a word the high pairing vocabulary of one or more meaning of a word of component selections score; (5) last, assembled unit 505 combines the surface that the composing training text selecting is required with meaning of a word selected cell 504 selected high meaning of a word vocabulary and the word segmentation processing unit 501 of all scores by the resulting one group of vocabulary of class name word segmentation processing.

In said external feature-modeling process, another Chinese patent application No.200910129454.8 that is entitled as " Word sense disambiguation method and system " that meaning of a word marking unit 503 can be submitted on March 20th, 2009 with reference to applicant of the present invention the marking sequencer procedure of each target vocabulary.Should be incorporated into this on the whole by reference at the full content of first to file, to be used for all purposes.Put down in writing in first to file as this, meaning of a word marking unit 503 according to each vocabulary in the conceptive semantic feature that is hereinafter contained of classification mode (as Yahoo Directory, the open directory of ODP etc. all is typical stratification classification mode) to the ordering of giving a mark of its meaning of a word that in meaning of a word dictionary, defines.

The so-called conceptive various semantic relations that hereinafter are meant between other notions that comprised in the notion that comprises target vocabulary and the classification mode thereof.The conceptive syntactical and semantical feature that is used for the semantic disambiguation of target vocabulary in a large number that hereinafter contained.

Grammar property: other the co-occurrence vocabulary that appears at simultaneously in the same notion with target vocabulary constitutes its context vocabulary, and in " semantic net ", " semanteme " and " net " constitutes the other side's context vocabulary mutually.

Semantic feature: lie in and every other notion that the notion that comprises target vocabulary links together with certain relation (as fraternal notion, sub-notion, father's notion etc.) in.As comprising " internet " this notion in the conceptional tree that has a hierarchical relationship for, it comprises " semantic net " and is its sub-notion, also have a distance " internet " to have the notion of semantic relation far away simultaneously as " clothes ", in that the vocabulary in " internet " this notion is carried out semantic disambiguation be like this, " semantic net " and " clothes " can regard its conceptive context information as, but distance according to semantic distance, in last meaning of a word marking, they will give different weights.The main basis of this different weight appointment is exactly that relation in the semantic feature is divided.

According to extraction obtain hereinafter conceptive, for example can design multiple utilization semantic feature wherein come to a plurality of meaning of a word that provide in the reference dictionary about target vocabulary sort marking method.For example, put down in writing two kinds of such methods at first to file No.200910129454.8:

First method: utilize conceptive various semantic relations hereinafter, can be to giving different weights with the context vocabulary of (the appearing in the different contiguous notions) of target vocabulary co-occurrence, then realize utilizing conceptive semantic feature hereinafter to carry out high-quality meaning of a word ordering marking (in traditional semantic disambiguation method, general co-occurrence vocabulary has identical weight).

Particularly, for the target vocabulary w that appears in the notion title, { cw appears in its conceptive hereinafter total n context vocabulary ₁, cw ₂..., cw _n, meaning of a word sort algorithm can obtain by following flow process:

(A) for each co-occurrence vocabulary cw _i, can obtain its respective weights W (cw in final semantic disambiguation process by the calculating of certain semantic path length _i);

(B), calculate the meaning of a word w of target vocabulary based on the sense definition that provides in the meaning of a word dictionary ^rWith each co-occurrence vocabulary cw _iThe degree of correlation: 1) calculate w ^rWith context vocabulary cw _iEach meaning of a word carry out degree of correlation Rs; 2) calculate meaning of a word w ^rWith each context vocabulary cw _iDegree of correlation RW (cw _i), i.e. context vocabulary cw _iAll meaning of a word Rs and;

(C) and then obtain the degree of correlation of each meaning of a word with respect to all context vocabulary.

Rank(w ^r)＝∑W(cw _i)RW(cw _i)。

Second method: by with conceptive hereinafter hierarchical structure/graph structure and semantic dictionary in the meaning of a word hierarchical structure that provides mate calculating, reach the purpose of meaning of a word ordering marking.On the one hand, the conceptive hereinafter subclass of body or stratification classification mode normally of target vocabulary, target vocabulary is positioned at the center of this subclass; On the other hand, provide the reference dictionary of sense definition generally also to comprise the hierarchical structure that one or more describes hierarchical relationship between the meaning of a word, the definition of a general meaning of a word is present in one or more such hierarchical structure.Comprehensive top two aspects, new figure matching algorithm can provide:

(A) the context vocabulary (appearing in the notion title) of consideration co-occurrence calculates the ordering score Rank (cd) that gives co-occurrence vocabulary similarity;

(B) consider with the target vocabulary to be the contextual topological structure of notion of core, calculate the similarity score value Rank (cs) of the conceptive hereinafter topological structure of representative;

(C) comprehensive scores Rank (w ^r)=θ Rank (cd)+(1-θ) Rank (cs) realizes that context lexical set and the hierarchical structure of the different meaning of a word by comprising topology information carry out figure and mate and realize that the meaning of a word gives a mark.

After ordering that each vocabulary that obtains by the class name word segmentation processing is given a mark, can select the surface of giving a mark high one or more meaning of a word vocabulary and selecting as training text by the vocabulary that the various semantic relation extensions of meaning of a word dictionary definition are come out.

Fig. 6 is the block diagram of inner structure that another example of surface construction device shown in Figure 5 is shown.Consider that in some cases class name is not to describe with the vocabulary of natural language, this situation can cause class name not have the semantic information of natural language aspect, thereby causes the surface building process normally not carry out.To this, another example shown in Figure 6 can be handled this situation.Different with Fig. 5, Fig. 6 externally feature construction device 303 has also increased an accurate class name construction unit 601, it comes to make up an accurate class name automatically to be used for the structure of surface for each classification by the processing to the text with class label of vectorization.For example, accurate class name construction unit 601 can utilize certain method (for example latent semantic analysis) based on statistics that the mark text that each classification comprised is carried out the relevant meaning of a word analysis of title, to obtain this type of other representative vocabulary as this type of other accurate class name.For example, utilizing under the situation of latent semantic analysis, can carry out cluster (cluster), therefrom selecting the vocabulary cluster relevant, as the accurate class name of respective classes with title to the mark text of each classification.In this process, expert or user's participation can be arranged, to improve the quality that accurate class name makes up.Certainly, the creation method of accurate class name is not limited thereto.Other accurate class name creation methods that those skilled in the art expect easily also can be applied to the present invention similarly.

Describe training text system of selection in detail below with reference to Fig. 7 to Figure 11 based on mixing method according to the embodiment of the invention.As previously mentioned, in an embodiment of the present invention, the factor of comprehensive internal feature and surface two aspects realizes the selection of training text data.On the one hand, the surface that derives from external knowledge source (for example meaning of a word dictionary) has provided the representativeness of respective classes and has described vocabulary, for the selection of training text data provides guiding knowledge from top to bottom.On the other hand, the internal feature that derives from given mark text then by a plurality of text datas with the mode of example from bottom to top reflect the statistical law that the respective classes text should have.According to principle of the present invention, the mixing method of taking all factors into consideration the factor of this two aspect can be embodied as following two kinds of exemplary methods.

＜exemplary method 1 〉

Fig. 7 and Fig. 8 show first example of selecting based on the training text of mixing method according to the embodiment of the invention, and wherein Fig. 7 is shown specifically text classifier according to the embodiment of the invention to make up the block diagram based on the inner structure of first example of the training text selecting arrangement of mixing method in the equipment 300; Fig. 8 is the workflow diagram that first example of the training text selecting arrangement based on mixing method shown in Figure 7 is shown.

In first example, utilize weighted strategy to come balanced surface and internal feature role in the mark text data is selected, thereby can directly use the selection that traditional statistical analysis technique carries out training text.Say that simply first exemplary method comprises the steps:

The first step: the weight of just coming to determine surface according to the quality of external knowledge source and class name.For example, can give higher weight for the representative vocabulary that appears at simultaneously in internal feature and the surface;

Second step: based on given weight strategy, surface is incorporated in traditional training text system of selection based on statistics (for example, shown in Figure 2 exemplary method of the prior art 1 and 2).

Specifically, as shown in Figure 7, in exemplary method 1, comprise metrics calculation unit 701, distance adjusting unit 702, weight generation unit 703 and based on the training text selected cell 704 of statistical method based on the training text selecting arrangement 304 of mixing method.With reference to figure 8, process 800 starts from step 801, and wherein metrics calculation unit 701 is for example calculated the distance that marks in twos between the text by the similarity between the VSM that calculates each mark text.In step 802, distance adjusting unit 702 can utilize the surface of mark text that the distance that calculates is adjusted then.Weight generation unit 703 can just come to determine the weight of surface according to the quality of external knowledge source and class name.For example, the vocabulary that appears at simultaneously in internal feature and the surface can be endowed high weight.The weight at surface (vocabulary that wherein comprises) that is generated by weight generation unit 703 can be used to adjust by what metrics calculation unit 701 calculated and mark distance between the text in twos, thereby influence in the training text selection course, according to the selection of the similarity between the text for training text.Then, in step 803, utilize statistical method, select the suitable training text set according to the distance between the adjusted mark text based on the training text selected cell 704 of statistical method.Here, for example, exemplary method of the prior art 1 shown in Figure 2 and 2 can be used to the selection of training text.

＜exemplary method 2 〉

Fig. 9 to Figure 11 shows second example of selecting according to the training text based on mixing method of the embodiment of the invention.Fig. 9 illustrates according to the text classifier of the embodiment of the invention to make up the generalized block diagram based on the inner structure of second example of the training text selecting arrangement of mixing method in the equipment 300; Figure 10 is the detailed diagram of inner structure that is shown in further detail second example of training text selecting arrangement based on mixing method shown in Figure 9; Figure 11 is the workflow diagram that second example of the training text selecting arrangement based on mixing method shown in Figure 10 is shown.In second exemplary method, utilize surface at first the mark text to be carried out initialization marking.Then, the marking result who utilizes (iteration) learning process to refine initial.

As shown in Figure 9, on the whole, the training text selecting arrangement 304 based on mixing method of second example comprises initialization marking unit 901 and the marking result unit 902 of refining, wherein initialization marking unit 901 is corresponding to the from top to bottom operation in stage, wherein utilize external knowledge to mark the initialization marking of text, the marking result refines unit 902 corresponding to the from bottom to top operation in stage, wherein utilizes the inner knowledge initial marking result that refines.

Figure 10 illustrates in greater detail the inner structure based on the training text selecting arrangement 304 of mixing method of second example.As shown in figure 10, in stage from top to bottom, initialization marking unit 901 can comprise inquiry parts 1001 and surface marking parts 1002.In stage from bottom to top, the marking result unit 902 of refining comprises initial learn parts 1003, internal feature marking parts 1004 and middle study parts 1005.

Figure 11 shows the workflow based on the training text selecting arrangement 304 of mixing method of second example.Process 1100 starts from step 1101, and wherein for each classification, the inquiry parts 1001 in the initialization marking unit 901 can utilize corresponding surface as key word of the inquiry the mark text in this classification to be inquired about.Then, in step 1102, surface marking parts 1002 carry out initialization marking according to inquiry and the similarity between the text that Query Result reflects to the mark text, with the result that initially given a mark.Afterwards, in step 1103, the initial learn parts 1003 that the marking result refines in the unit 902 at first select preceding t% text of each classification among the initialization marking unit 901 resulting preliminary classification results to carry out sorter study, acquisition preliminary classification device.Then, in step 1104, the preliminary classification device is classified to the mark text collection.In step 1105, for each classification that newly obtains, the internal feature marking parts 1004 that the marking result refines in the unit 902 can carry out new marking ordering to the mark text according to the internal feature of the mark text that is comprised.For example, internal feature marking parts 1004 can carry out new marking ordering to the lineoid of sorter or the distance of mark text distribution expectation to the mark text according to the mark text.Then, in step 1106, middle study parts 1005 select preceding p% text of each classification to carry out new sorter study, to obtain the middle classification device.For example, p%=t% or p% ≠ t%.In addition, middle study parts 1005 also can be selected positive example sample and negative routine sample respectively at each classification, and positive example sample and negative routine sample are used for sorter study.For example, at each classification c, be+1 with p+% the text marking that comes the front, that is to say that these texts are high-quality positive example samples; With p-% the text marking that comes other classification fronts is-1, that is to say that these texts are high-quality negative routine samples with respect to classification c.Then, in step 1107, process is identical as described above, and newly-generated middle classification device is used to the mark text collection is classified, and carries out new marking ordering according to the internal feature of mark text.Selectively, said process can iteration carry out, till current training text data set no longer changes.In other words, in step 1108, judge whether the training text data set of current generation changes with respect to last iteration.If change, process is then returned step 1106 to carry out the next round iteration.If current training text data set no longer changes, then in step 1109, export the training text data set that is generated.So far, process 1100 finishes.

The training text set that is generated is stored in the training text storage unit 308 (Fig. 3) subsequently, to be used to learn to generate final text classifier.

Figure 12 illustrates the indicative flowchart that makes up the workflow of equipment 300 according to the text classifier of the embodiment of the invention.Prior art is as shown in Figure 2 compared, make up equipment 300 according to text classifier according to the present invention and utilize external knowledge source (for example meaning of a word dictionary) to introduce the selection course of training text, and proposed above-mentioned two kinds of exemplary mixing methods (exemplary method 1 and 2) and be used to select training text from the surface that class name expands.According to the present invention, may be subjected to the adjustment that external knowledge is derived from the moving surface that generates by the sample distribution deviation that given mark text collection produces, thereby guarantee that the sorter that last training obtains has generalization ability and robustness preferably.

With reference to the accompanying drawings training text selection course according to the present invention and sorter establishment are described in detail above.As previously mentioned, according to the present invention, the surface that derives from the external knowledge source is introduced in the constructive process of the selection of training text and sorter.Obtained proofreading and correct control owing to derive from the data skewed popularity of given mark text set, therefore can improve the classification representativeness of training text greatly, and increased different classes of between the otherness of training text, thereby finally reach the generalization ability that improves the sorter that training obtains and the purpose of robustness.

But, need clearly customized configuration and processing that the present invention is not limited to above describe and illustrates in the drawings.And, for brevity, omit detailed description here to the known method technology.In the above-described embodiments, describe and show some concrete steps as example.But procedure of the present invention is not limited to the concrete steps that institute describes and illustrates, and those skilled in the art can make various changes, modification and interpolation after understanding spirit of the present invention, perhaps change the order between the step.

Element of the present invention can be implemented as hardware, software, firmware or their combination, and can be used in their system, subsystem, parts or the subassembly.When realizing with software mode, element of the present invention is program or the code segment that is used to carry out required task.Program or code segment can be stored in the machine readable media, perhaps send at transmission medium or communication links by the data-signal that carries in the carrier wave." machine readable media " can comprise any medium that can store or transmit information.The example of machine readable media comprises electronic circuit, semiconductor memory devices, ROM, flash memory, can wipe ROM (EROM), floppy disk, CD-ROM, CD, hard disk, fiber medium, radio frequency (RF) link, or the like.Code segment can be downloaded via the computer network such as the Internet, Intranet etc.

The present invention can realize with other concrete form, and do not break away from its spirit and essential characteristic.For example, the algorithm described in the specific embodiment can be modified, and system architecture does not break away from essence spirit of the present invention.Therefore, current embodiment is counted as exemplary but not determinate in all respects, scope of the present invention is by claims but not foregoing description definition, and, thereby the whole changes that fall in the scope of the implication of claim and equivalent all are included among the scope of the present invention.

Claims

1. method that is used to make up text classifier comprises:

Input mark text set;

Extract the internal feature of described mark text set;

Make up the surface of described mark text set with reference to the external knowledge source;

Take all factors into consideration the internal feature and the surface of described mark text set, from described mark text set, select training text; And

Utilize selected training text to learn to generate described text classifier.

2. the method for claim 1, the step of wherein extracting the internal feature of described mark text set comprises:

Each mark text in the described mark text set is carried out vectorization, obtaining vector space model corresponding to this mark text,

Wherein the vocabulary that vector space model comprised of each mark text constitutes the internal feature of described mark text set together.

3. the method for claim 1, the step of wherein creating the surface of described mark text set comprises:

At each classification:

Corresponding class name is carried out word segmentation processing it is decomposed into one group of vocabulary;

Each described vocabulary each meaning of a word in described external knowledge source is given a mark;

Select the high one or more meaning of a word of score of described vocabulary according to the marking result; And

The high one or more meaning of a word vocabulary of score of described one group of vocabulary and each vocabulary are combined, to constitute the surface at described classification of described mark text set.

4. method as claimed in claim 3, the class name of wherein said classification does not have the semantic information of natural language, and the step of creating the surface of described mark text set also comprises:

Automatically create the accurate class name of described classification by analyzing mark text that described classification comprises; And

Wherein, described accurate class name is used to the establishment of surface as the class name of described classification.

5. method as claimed in claim 3, the step of wherein creating the surface of described mark text set also comprises:

From described one group of vocabulary that decomposition obtains, remove stop words.

6. method as claimed in claim 3, the vocabulary that wherein selected each meaning of a word comes out by the semantic relation extension in described external knowledge source also is used as the surface of described classification.

7. the method for claim 1, wherein said external knowledge source is a dictionary.

8. method as claimed in claim 3, wherein from described mark text set, select the step of training text to comprise:

Calculate the distance that marks in twos between the text by the similarity of calculating between the described vector space model;

The described distance that marks in twos between the text of utilizing the surface adjustment of described mark text set to calculate; And

Utilize statistical method, select described training text according to the distance between the adjusted described mark text.

9. method as claimed in claim 8, wherein select the step of described training text to comprise:

Calculating is the number of the mark text that comprised in the border circular areas of the mark text that the most very much not comprises other classifications at center with each mark text; And

Selection has the mark text that is comprised in the border circular areas of mark text of minimum number, as described training text.

10. method as claimed in claim 8, wherein select the step of described training text to comprise:

Calculate the distance of each mark text to the convex set of the mark text of other classifications; And

Selection has the mark text of minor increment, as described training text.

11. method as claimed in claim 8, the described step that marks the distance between the text in twos of wherein utilizing described surface adjustment to calculate comprises:

Give higher weight to the internal feature and the vocabulary in the surface that appear at described mark text set simultaneously; And

The described distance that marks in twos between the text that calculates according to described weight adjustment.

12. method as claimed in claim 3 wherein selects the step of training text to comprise from described mark text set:

Utilize the described surface of described mark text set that each described mark text is carried out initialization marking;

Utilize the described internal feature of the described mark text set described initialization marking result that refines; And

Select described training text according to the marking result after each mark the refining of text.

13. method as claimed in claim 12 wherein comprises the step that each described mark text carries out initialization marking:

For each classification in the described mark text set, the vocabulary that utilizes corresponding surface to comprise is inquired about the mark text in this classification as key word of the inquiry; And

With the similarity of return results reflected each mark text of described inquiry and the described surface of respective classes as the described initialization of this mark text result that gives a mark.

14. method as claimed in claim 13, the described initialization marking result's that wherein refines step comprises:

(a), the mark text of the preceding t% of each classification gathered as training text carry out sorter study, to obtain a middle classification device according to the described initialization marking result of each described mark text;

(b) utilize described middle classification device that the mark text in the described mark text set is classified;

(c) each classification that obtains for described classification, the internal feature that marks text according to each that wherein comprises carries out new marking ordering to the mark text in this classification; And

(d) according to new marking ranking results, select the mark text of the preceding p% of each classification to gather and carry out sorter study as new training text, obtaining new middle classification device,

Repeat above-mentioned steps (b), (c) and (d), till the set of selected training text no longer changes.

15. method as claimed in claim 14, wherein when the internal feature that marks text according to each carried out new marking ordering to the mark text, the mark text was sorted by marking according to the lineoid of its described middle classification device or the distance of mark text distribution expectation.

16. method as claimed in claim 14 wherein selects the mark text of the preceding p% of each classification to comprise as the step of new training text set:

For each classification c:

The mark text of preceding p% of selecting this classification c is as at the positive example sample of this classification c,

The mark text of preceding p-% of selecting other classifications is as at the negative routine sample of this classification c.

17. an equipment that is used to make up text classifier comprises:

Input media is used for input mark text set;

The internal feature extraction element is used to extract the internal feature of described mark text set;

The surface construction device is used for making up with reference to the external knowledge source surface of described mark text set;

The training text selecting arrangement is used to take all factors into consideration the internal feature and the surface of described mark text set, selects training text from described mark text set; And

The sorter learning device is used to utilize selected training text to learn to generate described text classifier.

18. equipment as claimed in claim 17, wherein said internal feature extraction element comprises:

Text vector makeup is put, is used for each mark text of described mark text set is carried out vectorization, and obtaining vector space model corresponding to this mark text,

19. equipment as claimed in claim 17, wherein said surface construction device comprises:

The word segmentation processing unit is used at each classification, and corresponding class name is carried out word segmentation processing it is decomposed into one group of vocabulary;

Meaning of a word marking unit is used for each vocabulary is given a mark at each meaning of a word in described external knowledge source;

Meaning of a word selected cell is used for selecting according to the marking result of described meaning of a word marking unit the high one or more meaning of a word of score of described vocabulary; And

Assembled unit is used for described vocabulary and the high one or more meaning of a word vocabulary of its score are combined, to constitute the surface at described classification of described mark text set.

20. method as claimed in claim 19, the class name of wherein said classification does not have the semantic information of natural language, and described surface construction device also comprises:

Accurate class name generation unit is used for creating the accurate class name of described classification automatically by analyzing mark text that described classification comprises; And

21. method as claimed in claim 19, wherein said surface construction device also comprises:

Filter element is used for removing stop words from described one group of vocabulary that decomposition obtains.

22. equipment as claimed in claim 17, wherein said external knowledge source is a dictionary.

23. equipment as claimed in claim 19, wherein said training text selecting arrangement comprises:

Metrics calculation unit is used for calculating the distance that marks in twos between the text by the similarity of calculating between the described vector space model;

Distance adjusting unit, the described distance that marks in twos between the text that is used to utilize the surface adjustment of described mark text set to calculate; And

Based on the training text selected cell of statistical method, be used to utilize statistical method, select described training text according to the distance between the adjusted described mark text.

24. equipment as claimed in claim 23, wherein said training text selecting arrangement also comprises:

The weight generation unit is used for giving higher weight to the internal feature that appears at described mark text set simultaneously and the vocabulary of surface; And

The described distance that marks in twos between the text that the described weight adjustment that described distance adjusting unit is generated according to the weight generation unit calculates.

25. equipment as claimed in claim 19, wherein said training text selecting arrangement comprises:

Initialization marking unit is used to utilize the described surface of described mark text set that each described mark text is carried out initialization marking;

The marking result unit of refining is used to utilize the described internal feature of the described mark text set described initialization marking result that refines, and selects described training text according to the marking result after the refining of each mark text.

26. equipment as claimed in claim 25, wherein said initialization marking unit comprises:

The inquiry parts are used for each classification for described mark text set, and the vocabulary that utilizes corresponding surface to comprise is inquired about the mark text in this classification as key word of the inquiry; And

Surface marking parts are used for the similarity of each mark text that return results reflected of described inquiry and the described surface of respective classes as the described initialization of this mark text result that gives a mark.

27. equipment as claimed in claim 26, the wherein said marking result unit of refining comprises:

The initial learn parts, be used for described initialization marking result according to each described mark text, the mark text of the preceding t% of each classification gathered as training text carry out sorter study, to obtain a middle classification device, this middle classification device is used for the mark text of described mark text set is classified;

Internal feature marking parts, each classification that is used for obtaining for described classification, the internal feature that marks text according to each that wherein comprises carries out new marking ordering to the mark text in this classification; And

Middle study parts are used for according to new marking ranking results, and select the mark text of the preceding p% of each classification to gather and carry out sorter study as new training text, obtaining new middle classification device,

Wherein said internal feature marking parts and described middle study parts cycle iterative operation thereof are till the set of selected training text no longer changes.

28. equipment as claimed in claim 27, wherein said middle study parts comprise:

Positive example sample selector switch is used at each classification c, and the mark text of preceding p% of selecting this classification c is as at the positive example sample of this classification c, and

Negative routine sample selector switch, the mark text of preceding p-% that is used to select other classifications is as the negative routine sample at this classification c.