CN102023986B - Method and apparatus for building a text classifier with reference to external knowledge - Google Patents

Method and apparatus for building a text classifier with reference to external knowledge

Info

Publication number
CN102023986B
CN102023986B (application CN200910171947.8A)
Authority
CN
China
Prior art keywords
text
label
labeled text
classification
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910171947.8A
Other languages
Chinese (zh)
Other versions
CN102023986A (en)
Inventor
李建强
赵彧
刘博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC China Co Ltd
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd filed Critical NEC China Co Ltd
Priority to CN200910171947.8A
Publication of CN102023986A
Application granted
Publication of CN102023986B
Status: Expired - Fee Related

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a method and apparatus for building a text classifier with reference to external knowledge. The method comprises: inputting a labeled text set; extracting internal features of the labeled text set; building external features of the labeled text set with reference to an external knowledge source (such as a dictionary); selecting training texts from the labeled text set in consideration of both the internal features and the external features; and learning from the selected training texts to generate a text classifier. According to the present invention, the sample distribution bias that the labeled text collection may introduce is automatically corrected by the external features generated from the external knowledge source, thereby ensuring that the finally trained classifier has good generalization ability and robustness.

Description

Method and apparatus for building a text classifier with reference to external knowledge
Technical field
The present invention relates generally to information retrieval and text classification, and more specifically to a method and apparatus for building a text classifier with reference to external knowledge.
Background art
With the rapid development of electronic office work and the Internet, the amount of electronic text is growing explosively, and large-scale automatic information processing has become both a necessary means for people to make better use of this vast information and a challenge.
Information retrieval refers to the processes and techniques of organizing information in a certain way and finding relevant information according to the needs of its users. Automatic text classification is one of the main supporting technologies of information retrieval: its basic goal is to assign texts to predefined classes, providing an effective means of helping people retrieve, query, filter and use information. Early text classification adopted knowledge engineering and expert systems, but such methods were complicated and inflexible. With the rise and development of machine learning, many machine-learning classifier models have been introduced into the field of text classification and have achieved good results in different respects, becoming the current mainstream technology for realizing automatic text classification.
Machine-learning-based text classification is realized by the finally built text classifier, whose performance depends to a great extent on the set of training data (texts) used; the selection of training data is therefore the key.
So-called training data selection refers to selecting a subset of a given set of texts with class labels (the labeled text set) to train the corresponding text classifier. A good training data selection method can, on the one hand, greatly raise the efficiency of building a classifier by reducing the number of training texts, and on the other hand improve the generalization ability and robustness of the trained classifier by improving the quality of the training texts, thereby guaranteeing classification precision.
At present there exist some patents and research techniques concerning training text selection.
For example, US Patent No. 7,409,404 B2, entitled "Creating taxonomies and training data for document categorization", provides a training text selection technique that mainly refines the quality of the training text data from the statistical information of the given labeled texts, under the premise of eliminating as far as possible the interference of class-irrelevant features.
In addition, the non-patent literature entitled "Training data selection for support vector machines" by Wang, J., Neskovic, P. and Cooper, L.N. (In: LNCS vol. 3610, 2005; hereinafter non-patent literature 1) describes two devices and methods that use the internal statistical features of a given labeled text collection to select training texts and then learn a text classifier. Their structural block diagram and workflow are shown in Fig. 1 and Fig. 2.
As shown in Fig. 1, the text classifier building equipment 100 according to the prior art consists of an input device 101, a text vectorization device 102, a statistics-based training text selection device 103 and a classifier learning device 104. The input device 101 inputs a group of labeled texts from a labeled text storage unit 105. The text vectorization device 102 vectorizes each input labeled text and stores the generated vector space model (VSM) corresponding to each labeled text in a VSM storage unit 106. Then, the statistics-based training text selection device 103 scores each vectorized labeled text by a statistical method and selects suitable training texts from them. The selected training texts are subsequently used by the classifier learning device 104 to learn and generate a text classifier.
Fig. 2 shows the schematic workflow of the text classifier building equipment 100 shown in Fig. 1. The technical content disclosed in non-patent literature 1 describes two illustrative methods that select training texts according to the internal features of the labeled text collection and then learn to generate a text classifier, namely exemplary method 1 and exemplary method 2. In exemplary method 1, for each labeled text xi, the maximum number of labeled texts contained in a circular region centered on xi that includes no labeled text of any other class is denoted N(xi), and the labeled texts in the regions with the smallest N(xi) are selected as the training text set. In exemplary method 2, the distance d(xi) from each labeled text xi to the convex hull of the labeled texts of the other classes is calculated, and the labeled texts with the smallest d(xi) are selected as training texts.
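The two prior-art selection criteria can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance over the text vectors, and the convex-hull distance d(xi) of exemplary method 2 approximated by the distance to the nearest other-class point (an upper bound, not the exact hull distance).

```python
import numpy as np

def n_score(X, y, i):
    """N(x_i): number of same-class texts inside the largest ball centered
    at x_i that contains no text of any other class (x_i itself included)."""
    d = np.linalg.norm(X - X[i], axis=1)          # distances from x_i to all texts
    r = d[y != y[i]].min()                        # radius: nearest other-class text
    return int(((d < r) & (y == y[i])).sum())

def d_score(X, y, i):
    """d(x_i): distance to the other classes' convex hull, approximated here
    by the distance to the nearest other-class point (an assumption)."""
    d = np.linalg.norm(X - X[i], axis=1)
    return float(d[y != y[i]].min())
```

Texts with small N(xi) or small d(xi) lie near the class boundary, which is why both methods prefer them as training texts.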
Whether in exemplary method 1 or exemplary method 2, the features used to compute the statistics derive solely from the given labeled text set itself, so only the sample distribution inside the given labeled text set is considered. The finally built text classifier is therefore inevitably and completely subject to the influence of the given labeled texts, resulting in poor generalization ability and robustness.
Although other training text selection methods exist in the prior art, current methods mainly rely on the knowledge inside the given labeled text set; that is, the features and weights adopted depend entirely on the data distribution of the given labeled text collection, so the chosen training texts can be strongly biased. This bias propagates into the classification orientation of the finally built classifier, greatly impairing its generalization ability and robustness and ultimately leading to unsatisfactory classifier performance.
Summary of the invention
The present invention has been developed in consideration of the above problems in the prior art.
According to the idea of the present invention, external features derived from an external knowledge source (for example, a word sense dictionary) are introduced into the training text selection process. Specifically, the core of the present invention is mainly reflected in the following two aspects:
(1) Construction of external features: using the definitions of word senses or concepts in an external word sense dictionary, and taking the concepts contained in the class names or quasi-class names as input, external features independent of the given labeled text set are built; and
(2) Training text selection based on a hybrid method: unlike traditional labeled-text scoring and ranking methods that consider only the internal features derived from the labeled text set, the method provided by the present invention scores and ranks labeled texts by jointly considering the external features derived from the external knowledge source (top-down) and the internal features derived from the given labeled text set (bottom-up). This process exploits both the knowledge contained in the given labeled text set and the semantic knowledge about the class names or quasi-class names in the external knowledge source to select the training text data.
According to a first aspect of the present invention, a method for building a text classifier is provided, comprising: inputting a labeled text set; building external features of the labeled text set with reference to an external knowledge source; selecting training texts from the labeled text set in consideration of both the internal features and the external features of the labeled text set; and learning from the selected training texts to generate a text classifier. In an embodiment of the present invention, the internal features may consist of the vocabulary contained in the vector space models generated by vectorizing the labeled texts, and the external features may be representative feature words of the classes, expanded from the class names (or quasi-class names) by referring to the word sense definitions and the semantic relations between words in a word sense dictionary.
According to a second aspect of the present invention, an equipment for building a text classifier is provided, comprising: an input device for inputting a labeled text set; an external feature construction device for building external features of the labeled text set with reference to an external knowledge source; a training text selection device for selecting training texts from the labeled text set in consideration of both the internal features and the external features of the labeled text set; and a classifier learning device for learning from the selected training texts to generate a text classifier.
According to the present invention, the external features derived from the external knowledge source are introduced into the selection of training texts and the construction of the classifier. Because the data bias derived from the given labeled text set is thereby corrected, the class representativeness of the training texts can be greatly improved and the difference between training texts of different classes increased, so that the generalization ability and robustness of the trained classifier are ultimately improved.
Brief description of the drawings
The present invention will be better understood from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings, in which like reference numerals indicate like parts, and in which:
Fig. 1 is a structural block diagram of a text classifier building equipment 100 according to the prior art;
Fig. 2 is a schematic flowchart illustrating the workflow of the equipment 100 shown in Fig. 1;
Fig. 3 is a structural block diagram of a text classifier building equipment 300 according to an embodiment of the present invention;
Fig. 4 is a flowchart illustrating the working process of the equipment 300 shown in Fig. 3;
Fig. 5 is a block diagram showing in detail the internal structure of the external feature construction device in the text classifier building equipment 300 according to the embodiment of the present invention;
Fig. 6 is a block diagram showing the internal structure of another example of the external feature construction device shown in Fig. 5;
Fig. 7 is a block diagram showing in detail the internal structure of a first example of the hybrid-method training text selection device in the text classifier building equipment 300 according to the embodiment of the present invention;
Fig. 8 is a workflow diagram of the first example of the hybrid-method training text selection device shown in Fig. 7;
Fig. 9 is a generalized block diagram of the internal structure of a second example of the hybrid-method training text selection device in the text classifier building equipment 300 according to the embodiment of the present invention;
Fig. 10 is a detailed block diagram showing in further detail the internal structure of the second example of the hybrid-method training text selection device shown in Fig. 9;
Fig. 11 is a workflow diagram of the second example of the hybrid-method training text selection device shown in Fig. 10; and
Fig. 12 is a schematic flowchart illustrating the workflow of the text classifier building equipment 300 according to the embodiment of the present invention.
Embodiment
Here, for convenience of description, some technical terms used in the present invention are first briefly explained:
Term definitions
Machine-learning-based text classification: machine learning is the mainstream approach to text classification; it generally uses a group of texts with class labels (that is, labeled texts) to supervise the learning process of a classifier.
Training text selection: training text selection is used to reject, in a given labeled text collection, the labeled texts that are irrelevant to the final decision function used by the classifier, in order to improve the effectiveness and efficiency of classifier building.
Word sense dictionary: a word sense dictionary is a container that defines the senses of the words used in a natural language and the semantic relations among them. Depending on the language, there may be monolingual, multilingual or cross-lingual word sense dictionaries.
Internal features: features derived from inside the given labeled text collection and used for training text selection and classifier building.
External features: features derived from a given external knowledge source (such as a word sense dictionary) and used for training text selection and classifier building. They are generally expanded from the semantic concepts contained in the given class names.
Quasi-class name: a representative word or concept of a class. When the given class name is not provided in natural language, such words or concepts can be used in place of the class name.
Fig. 3 is a structural block diagram of a text classifier building equipment 300 according to an embodiment of the present invention. As shown in Fig. 3, in an embodiment of the present invention the text classifier building equipment 300 comprises an input device 301, a text vectorization device 302, an external feature construction device 303, a hybrid-method training text selection device 304 and a classifier learning device 305. Compared with the prior-art text classifier building equipment 100 shown in Fig. 1, the input device 301, text vectorization device 302 and classifier learning device 305 of the equipment 300 have functions and structures similar to those of the prior art. The distinctive features of the present invention therefore lie in the operations and functions realized by the external feature construction device 303 and the hybrid-method training text selection device 304.
In addition, unlike the prior art, the system of Fig. 3 also comprises a class name storage 309, an external knowledge source 310 and an external feature storage unit 311. They work together to realize the reference to external knowledge in the training text selection process of the present invention. The class name storage 309 stores the natural-language class names specified in a given classification task. The class names provided in a classification task can generally be described with words of human-understandable natural language; real text classification systems in particular generally have good human-machine interfaces to facilitate user browsing and querying. The external knowledge source 310 can be, for example, a machine-readable word sense dictionary that stores the list of senses each target word may have, defining the multiple senses of each word as well as the hierarchical relations between different senses. The external feature construction device 303 can take a class name as input and, using the word sense definitions and the semantic relations between words in the word sense dictionary, expand the class name into multiple representative feature words of that class. The hybrid-method training text selection device 304 can select the training text data by jointly considering the two factors of internal features and external features.
Fig. 4 is a flowchart illustrating the working process of the equipment 300 shown in Fig. 3. The process shown in Fig. 4 starts from step 401, where, similarly to the prior art, the input device 301 inputs a set of labeled texts from a labeled text storage unit 306. In step 402, the text vectorization device 302 vectorizes each labeled text to obtain the vector space model (VSM) corresponding to each labeled text. The vocabulary contained in each VSM can serve as internal features of the given labeled text set. It should be noted here that text vectorization is merely one exemplary means of extracting the internal features of a labeled text set and should not be taken as limiting the present invention; any means known to those skilled in the art for extracting the internal features of a labeled text set can similarly be applied to the present invention.
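As a toy illustration of step 402, each tokenized labeled text can be turned into a sparse VSM. The patent does not fix a term-weighting scheme, so the TF-IDF weighting below is an assumption for illustration only:

```python
from collections import Counter
import math

def to_vsm(docs):
    """Build sparse TF-IDF vectors for a list of tokenized documents.
    Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    vsm = []
    for doc in docs:
        tf = Counter(doc)
        vsm.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vsm
```

The union of the keys of these dicts is exactly the vocabulary that serves as the internal features.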
Then, in step 403, the external feature construction device 303 builds external features by referring to the class name storage 309 and the external knowledge source 310. The constructed external features are subsequently stored in the external feature storage unit 311. The external feature construction process will be described in detail later. After the internal features and external features of the labeled text set are obtained, in step 404 the hybrid-method training text selection device 304 selects training texts by the hybrid method, jointly referring to the internal features and the external features of the labeled text set. The hybrid training text selection process is also described in detail later. The set of training texts selected by the hybrid-method training text selection device 304 can be stored in a training text storage unit 308. Then, in step 405, the classifier learning device 305 generates a text classifier by learning from the selected training text collection. The process 400 then ends.
The external feature construction process according to the present invention will be described first. Fig. 5 is a block diagram showing in detail the internal structure of the external feature construction device 303 in the text classifier building equipment 300 according to the embodiment of the present invention. As shown in the figure, the external feature construction device 303 can comprise a word segmentation unit 501, a filter unit 502 (optional), a sense scoring unit 503, a sense expansion unit 504 and a combination unit 505. As mentioned above, the external feature construction device 303 takes a class name as input and, using the word sense definitions and the semantic relations between words in the external knowledge source 310 (such as a word sense dictionary), expands the class name into multiple representative feature words of that class as external features. Its realization comprises the following steps: (1) first, the word segmentation unit 501 preprocesses the class name to decompose it into a group of words; (2) after the word segmentation of the class name is completed, the filter unit 502 can optionally filter the generated words to remove stop words; (3) the sense scoring unit 503 then scores and ranks the senses of each word as defined in the word sense dictionary; (4) the sense expansion unit 504 selects, for each word, the words corresponding to the one or more highest-scoring senses; (5) finally, the combination unit 505 combines all the high-scoring sense words selected by the sense expansion unit 504 with the group of words obtained by the word segmentation unit 501 from the class name, composing the external features needed for training text selection.
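The five-step pipeline above can be sketched as follows. The callables `segment`, `score_senses` and `expand` are hypothetical stand-ins for the word segmentation unit 501, the sense scoring unit 503 and the dictionary-based expansion of unit 504, whose internals the patent defines elsewhere:

```python
def build_external_features(class_name, segment, stopwords, score_senses, expand):
    """Sketch of external feature construction:
    segment -> filter stop words -> score senses -> pick top sense -> expand & combine."""
    # steps (1)-(2): segment the class name and drop stop words
    words = [w for w in segment(class_name) if w not in stopwords]
    features = set(words)                       # step (5) keeps the segmented words too
    for w in words:
        senses = score_senses(w)                # step (3): [(sense_id, score), ...]
        if senses:
            best = max(senses, key=lambda s: s[1])[0]   # step (4): top-scoring sense
            features |= set(expand(best))       # related words from the dictionary
    return features
```

With a real word sense dictionary, `expand` would follow relations such as synonymy or hyponymy from the chosen sense.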
In the above external feature creation process, the sense scoring unit 503 can carry out the scoring and ranking of each target word with reference to another Chinese patent application, No. 200910129454.8, entitled "Word sense disambiguation method and system", filed by the applicant of the present invention on March 20, 2009. The full content of this earlier application is incorporated herein by reference in its entirety for all purposes. As described in this earlier application, the sense scoring unit 503 scores and ranks the senses of each word defined in the word sense dictionary according to the semantic features contained in the concept context of the word in a classification scheme (hierarchical classification schemes such as the Yahoo Directory and the Open Directory Project (ODP) are typical examples).
The so-called concept context refers to the various semantic relations between the concept containing the target word and the other concepts contained in the classification scheme. The concept context contains a large number of syntactic and semantic features for the semantic disambiguation of the target word.
Syntactic features: the other co-occurring terms that appear together with the target word in the same concept form its context words; for example, in "semantic web", "semantic" and "web" are each other's context words.
Semantic features: these lie in all the other concepts linked by some relation (such as sibling concepts, sub-concepts, parent concepts, etc.) to the concept containing the target word. For example, a concept tree with hierarchical relations may contain the concept "Internet", having "semantic web" as a sub-concept, and at the same time a concept whose semantic relation to "Internet" is more distant, such as "clothing". When semantically disambiguating a word in the concept "Internet", both "semantic web" and "clothing" can be regarded as its concept context information, but according to semantic distance they are given different weights in the final sense scoring. The main basis for assigning these different weights is exactly the relation division among the semantic features.
Based on the extracted concept context, multiple methods can be designed that use its semantic features to rank and score the multiple senses of the target word provided in the reference dictionary. For example, the earlier application No. 200910129454.8 describes two such methods:
First method: using the various semantic relations in the concept context, different weights can be given to the context words co-occurring with the target word (appearing in different neighboring concepts), thereby realizing high-quality sense ranking and scoring from the semantic features in the concept context (in traditional semantic disambiguation methods, co-occurring terms generally have identical weights).
Specifically, for a target word w appearing in a concept name, suppose there are n context words {cw_1, cw_2, ..., cw_n} in its concept context. The sense ranking can be obtained by the following procedure:
(A) for each co-occurring term cw_i, obtain its weight W(cw_i) in the final semantic disambiguation process by calculating a certain semantic path length;
(B) based on the sense definitions provided in the word sense dictionary, calculate the relatedness between each sense w_r of the target word and each co-occurring term cw_i: 1) calculate the relatedness Rs between w_r and each sense of the context word cw_i; 2) calculate the relatedness RW(cw_i) between the sense w_r and the context word cw_i, i.e. the sum of Rs over all senses of cw_i; and
(C) then obtain the relatedness of each sense relative to all context words:
Rank(w_r) = Σ_i W(cw_i) · RW(cw_i).
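The scoring in steps (A)-(C) can be written directly in code. The callables `weight`, `relatedness` and `senses_of` below are hypothetical stand-ins for W(cw_i), Rs and the dictionary sense lookup, all of which the patent defines only abstractly:

```python
def rank_sense(w_r, context_words, weight, relatedness, senses_of):
    """Rank(w_r) = sum_i W(cw_i) * RW(cw_i), where RW(cw_i) is the sum of
    the relatedness Rs between sense w_r and every dictionary sense of cw_i."""
    total = 0.0
    for cw in context_words:
        rw = sum(relatedness(w_r, s) for s in senses_of(cw))  # RW(cw_i)
        total += weight(cw) * rw                              # W(cw_i) * RW(cw_i)
    return total
```

The sense of the target word with the highest `rank_sense` value is then chosen for expansion.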
Second method: sense ranking and scoring is achieved by matching the hierarchy/graph structure in the concept context against the sense hierarchy provided in the semantic dictionary. On the one hand, the concept context of the target word is normally a subset of an ontology or hierarchical classification scheme, with the target word at the center of this subset; on the other hand, a reference dictionary providing sense definitions generally also contains one or more hierarchies describing the hierarchical relations between senses, and a sense definition generally exists in one or more such hierarchies. Combining the two aspects above, a new graph matching algorithm can be given:
(A) considering the co-occurring context words (appearing in concept names), calculate the ranking score Rank(cd) given by co-occurring-term similarity;
(B) considering the topological structure of the concept context centered on the target word, calculate the similarity score Rank(cs) representing the concept context topology;
(C) compute the combined score Rank(w_r) = θ·Rank(cd) + (1−θ)·Rank(cs), realizing sense scoring by graph matching between the hierarchies of the context word sets of the different senses, including their topology information.
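The combination step (C) is a plain convex blend of the two partial scores. A minimal sketch, with θ as a free parameter whose value the earlier application leaves open:

```python
def combined_rank(rank_cd, rank_cs, theta=0.5):
    """Rank(w_r) = theta * Rank(cd) + (1 - theta) * Rank(cs): blend the
    co-occurrence similarity score with the topology-matching score."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return theta * rank_cd + (1.0 - theta) * rank_cs
```

Setting θ close to 1 trusts co-occurrence similarity; θ close to 0 trusts the topological match.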
After the words obtained by segmenting the class name have been scored and ranked, the one or more highest-scoring sense words can be selected and expanded, through the various semantic relations defined in the word sense dictionary, into the external features for training text selection.
Fig. 6 is a block diagram of the internal structure of another example of the external feature construction device shown in Fig. 5. In some cases the class name is not described with natural-language words; the class name then carries no natural-language semantic information, and the external feature construction process cannot proceed normally. The other example shown in Fig. 6 can handle this situation. Unlike Fig. 5, Fig. 6 adds a quasi-class-name construction unit 601 to the external feature construction device 303, which, by processing the vectorized texts with class labels, automatically builds a quasi-class name for each class for use in external feature construction. For example, the quasi-class-name construction unit 601 can use some statistics-based method (such as latent semantic analysis) to perform name-related sense analysis on the labeled texts contained in each class, so as to obtain representative words of the class as its quasi-class name. For example, when latent semantic analysis is used, the labeled texts of each class can be clustered, and the word clusters relevant to the name selected from them as the quasi-class name of the respective class. Experts or users may participate in this process to improve the quality of the constructed quasi-class names. Of course, the quasi-class-name creation method is not limited to this; other quasi-class-name creation methods readily conceived by those skilled in the art can similarly be applied to the present invention.
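A very simple statistics-based stand-in for quasi-class-name construction is a frequency-ratio heuristic: pick the words frequent inside a class but rare outside it. This only illustrates the idea; the latent semantic analysis and clustering the text mentions are not reproduced here, and the scoring formula below is an assumption:

```python
from collections import Counter

def quasi_class_name(labeled_docs, label, top_k=3):
    """Pick the top-k most characteristic words of a class as its quasi-class
    name. labeled_docs is a list of (token_list, class_label) pairs."""
    in_cls = Counter(t for doc, y in labeled_docs if y == label for t in doc)
    out_cls = Counter(t for doc, y in labeled_docs if y != label for t in doc)
    # reward in-class frequency, penalize occurrence in other classes
    score = {t: c / (1 + out_cls[t]) for t, c in in_cls.items()}
    return [t for t, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]
```

The returned words would then be fed to the external feature construction device 303 in place of a natural-language class name.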
The hybrid training text selection method according to the embodiment of the present invention is described in detail below with reference to Fig. 7 to Fig. 11. As mentioned above, in an embodiment of the present invention, the selection of training text data is realized by jointly considering the two factors of internal features and external features. On the one hand, the external features derived from the external knowledge source (such as a word sense dictionary) give representative description words of the respective classes and provide top-down guiding knowledge for the selection of training text data. On the other hand, the internal features derived from the given labeled texts reflect, bottom-up and by way of multiple example text data, the statistical regularities that the texts of the respective classes should have. According to the principle of the present invention, a hybrid method that jointly considers these two factors can be embodied as the following two exemplary methods.
< exemplary method 1>
Fig. 7 and Fig. 8 show a first example of hybrid training text selection according to the embodiment of the present invention: Fig. 7 is a block diagram showing in detail the internal structure of the first example of the hybrid-method training text selection device in the text classifier building equipment 300, and Fig. 8 is a workflow diagram of the first example shown in Fig. 7.
In the first example, a weighting strategy is used to balance the roles of the external features and the internal features in labeled text selection, so that traditional statistical analysis methods can be applied directly to training text selection. Briefly, the first exemplary method comprises the following steps:
First step: determine the weights of the external features according to the quality of the external knowledge source and the class names. For example, a higher weight can be given to representative words that appear in both the internal features and the external features;
Second step: based on the given weighting algorithm, incorporate the external features into a traditional statistics-based training text selection method (for example, the prior-art exemplary methods 1 and 2 shown in Fig. 2).
Specifically, as shown in Fig. 7, in exemplary method 1 the hybrid-method-based training text selection apparatus 304 comprises a distance calculation unit 701, a distance adjustment unit 702, a weight generation unit 703, and a statistics-based training text selection unit 704. Referring to Fig. 8, process 800 starts at step 801, where the distance calculation unit 701 calculates pairwise distances between labeled texts, for example by computing the similarity between the VSMs of the labeled texts. Then, at step 802, the distance adjustment unit 702 may adjust the calculated distances using the surface features of the labeled texts. The weight generation unit 703 may determine the weights of the surface features according to the quality of the external knowledge source and the category names; for example, a word appearing in both the internal features and the surface features may be given a high weight. The weights generated by the weight generation unit 703 for the surface features (the words they contain) are used to adjust the pairwise distances calculated by the distance calculation unit 701, thereby influencing the similarity-based selection of training texts in the training text selection process. Then, at step 803, the statistics-based training text selection unit 704 applies a statistical method to select a suitable training text set according to the adjusted distances between the labeled texts. For example, the prior-art exemplary methods 1 and 2 shown in Fig. 2 may be used here to select the training texts.
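As a rough illustration of this weighting scheme, the sketch below up-weights surface-feature words when computing the pairwise distance between two labeled texts represented as bag-of-words vectors. The function name, the simple multiplicative boost, and the choice of cosine distance are all illustrative assumptions, not details taken from the patent.

```python
import math

def weighted_distance(vec_a, vec_b, surface_words, boost=2.0):
    """Cosine distance between two bag-of-words vectors (dicts mapping
    word -> weight), with surface-feature words up-weighted so that
    agreement on them pulls texts closer together.

    surface_words and boost are hypothetical stand-ins for the weights
    produced by the weight generation unit 703."""
    def w(word, value):
        # Boost the contribution of words found in the surface features.
        return value * boost if word in surface_words else value

    words = set(vec_a) | set(vec_b)
    dot = sum(w(t, vec_a.get(t, 0.0)) * w(t, vec_b.get(t, 0.0)) for t in words)
    norm_a = math.sqrt(sum(w(t, v) ** 2 for t, v in vec_a.items()))
    norm_b = math.sqrt(sum(w(t, v) ** 2 for t, v in vec_b.items()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

With the adjusted distances in hand, any of the statistics-based selection criteria (such as those of Fig. 2) can be applied unchanged; the surface features influence the outcome only through the distance metric.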
<Exemplary method 2>
Fig. 9 to Fig. 11 show a second example of hybrid-method-based training text selection according to an embodiment of the present invention. Fig. 9 is a high-level block diagram of the internal structure of the second example of the hybrid-method-based training text selection apparatus in the text classifier building device 300 according to an embodiment of the present invention; Fig. 10 is a block diagram showing the internal structure of the second example shown in Fig. 9 in further detail; Fig. 11 is a workflow diagram of the second example of the hybrid-method-based training text selection apparatus shown in Fig. 10. In the second exemplary method, the surface features are first used to assign an initial score to each labeled text. An (iterative) learning process is then used to refine the initial scoring results.
As shown in Fig. 9, the second example of the hybrid-method-based training text selection apparatus 304 comprises, overall, an initial scoring unit 901 and a scoring refinement unit 902. The initial scoring unit 901 corresponds to the top-down stage, in which external knowledge is used to assign initial scores to the labeled texts; the scoring refinement unit 902 corresponds to the bottom-up stage, in which the internal features of the labeled texts are used to refine the initial scoring results.
Fig. 10 shows the internal structure of the second example of the hybrid-method-based training text selection apparatus 304 in greater detail. As shown in Fig. 10, in the top-down stage the initial scoring unit 901 may comprise a query component 1001 and a surface-feature scoring component 1002. In the bottom-up stage, the scoring refinement unit 902 comprises an initial learning component 1003, an internal-feature scoring component 1004, and an intermediate learning component 1005.
Fig. 11 shows the workflow of the second example of the hybrid-method-based training text selection apparatus 304. Process 1100 starts at step 1101, where, for each category, the query component 1001 in the initial scoring unit 901 queries the labeled texts of that category using the corresponding surface features as query keywords. Then, at step 1102, the surface-feature scoring component 1002 performs initial scoring according to the similarity between the query and each text as reflected in the query results, yielding an initial scoring of the labeled texts. Thereafter, at step 1103, the initial learning component 1003 in the scoring refinement unit 902 first selects the top t% of texts in each category of the preliminary ranking obtained by the initial scoring unit 901 and performs classifier learning, obtaining a preliminary classifier. Then, at step 1104, the preliminary classifier classifies the labeled text set. At step 1105, for each newly obtained category, the internal-feature scoring component 1004 in the scoring refinement unit 902 may re-score and re-rank the labeled texts according to the internal features of the labeled texts it contains. For example, the internal-feature scoring component 1004 may re-rank each labeled text according to its distance from the classifier's hyperplane or from the expectation of the labeled-text distribution. Then, at step 1106, the intermediate learning component 1005 selects the top p% of texts in each category for a new round of classifier learning, obtaining an intermediate classifier; p% may equal t% or differ from it. In addition, the intermediate learning component 1005 may also select positive samples and negative samples separately for each category, both of which are used for classifier learning. For example, for each category c, the top-ranked p+% of texts are labeled +1, i.e., these texts are high-quality positive samples; the top-ranked p-% of texts of the other categories are labeled -1, i.e., these texts are high-quality negative samples with respect to category c. Then, at step 1107, as in the preceding process, the newly generated intermediate classifier is used to classify the labeled text set, and a new scoring and ranking is performed according to the internal features of the labeled texts. Optionally, the above process may iterate until the current training text data set no longer changes. In other words, at step 1108 it is judged whether the currently generated training text data set has changed relative to the previous iteration. If it has changed, the process returns to step 1106 for the next round of iteration. If the current training text data set no longer changes, then at step 1109 the generated training text data set is output. Process 1100 then ends.
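The iterative refinement loop of steps 1103 to 1108 can be sketched roughly as follows. A nearest-centroid scorer stands in for the classifier, and a single category with a fixed selection size stands in for the per-category t% and p% thresholds; all names, the scoring rule, and the convergence test are illustrative assumptions rather than the patent's implementation.

```python
def centroid(vectors):
    """Mean vector of a list of equal-length numeric vectors."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refine_selection(texts, initial_scores, keep, max_iter=20):
    """Iteratively refine an initial (surface-feature-based) scoring.

    texts:          list of feature vectors for the labeled texts
    initial_scores: top-down scores from the surface features (step 1102)
    keep:           how many texts to select (stand-in for the t%/p% cuts)

    Each round trains a trivial nearest-centroid 'classifier' on the
    currently selected texts, re-scores every text against the centroid
    (the internal-feature scoring of step 1105), and re-selects the
    top-scoring texts, until the selection stops changing (step 1108)."""
    order = sorted(range(len(texts)), key=lambda i: -initial_scores[i])
    selected = set(order[:keep])  # top-t% initial selection (step 1103)
    for _ in range(max_iter):
        c = centroid([texts[i] for i in selected])   # classifier learning
        scores = [dot(t, c) for t in texts]          # internal-feature scoring
        ranked = sorted(range(len(texts)), key=lambda i: -scores[i])
        new_selected = set(ranked[:keep])            # top-p% re-selection
        if new_selected == selected:                 # convergence (step 1108)
            break
        selected = new_selected
    return selected
```

In this toy setting, an initial selection misled by the surface features drifts toward the texts that the learned model itself ranks highest, which mirrors how the bottom-up stage corrects the top-down stage.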
The generated training text set is subsequently stored in the training text storage unit 308 (Fig. 3) for learning the final text classifier.
Fig. 12 is a schematic flowchart of the workflow of the text classifier building device 300 according to an embodiment of the present invention. Compared with the prior art shown in Fig. 2, the text classifier building device 300 according to the present invention introduces surface features, expanded from the category names using an external knowledge source (such as a word-sense dictionary), into the training text selection process, and proposes the above two exemplary hybrid methods (exemplary methods 1 and 2) for selecting training texts. According to the present invention, the sample distribution bias that may be produced by a given labeled text collection is adjusted by the surface features automatically generated from the external knowledge source, thereby ensuring that the finally trained classifier has good generalization ability and robustness.
The training text selection process and classifier building according to the present invention have been described in detail above with reference to the accompanying drawings. As noted above, according to the present invention, surface features derived from an external knowledge source are introduced into the selection of training texts and the construction of the classifier. Because the data bias originating from the given labeled text set is corrected, the category representativeness of the training texts can be greatly improved, and the distinctiveness of the training texts across different categories is increased, thereby ultimately improving the generalization ability and robustness of the trained classifier.
It should be made clear, however, that the present invention is not limited to the specific configurations and processes described above and illustrated in the drawings. Also, for brevity, detailed descriptions of known methods and techniques are omitted here. In the above embodiments, some specific steps are described and shown as examples. However, the process of the present invention is not limited to the specific steps described and shown; those skilled in the art, after understanding the spirit of the present invention, may make various changes, modifications, and additions, or change the order between steps.
Elements of the present invention may be implemented as hardware, software, firmware, or a combination thereof, and may be used in systems, subsystems, components, or subassemblies thereof. When implemented in software, the elements of the present invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber-optic media, radio frequency (RF) links, and so on. Code segments may be downloaded via computer networks such as the Internet or an intranet.
The present invention may be embodied in other specific forms without departing from its spirit and essential characteristics. For example, the algorithms described in the specific embodiments may be modified, and the system architecture may vary, without departing from the essential spirit of the present invention. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive; the scope of the present invention is defined by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.

Claims (26)

1. A method for building a text classifier, comprising:
inputting a labeled text set;
extracting internal features of said labeled text set;
building surface features of said labeled text set with reference to an external knowledge source, comprising:
for each category:
performing word segmentation on the corresponding category name to decompose it into a group of words; scoring each sense of each of said words in said external knowledge source;
selecting, according to the scoring results, one or more high-scoring senses of each said word; and
combining said group of words and the one or more high-scoring senses of each word, to form the surface features of said labeled text set for said category;
selecting training texts from said labeled text set in consideration of the internal features and the surface features of said labeled text set; and
learning with the selected training texts to generate said text classifier;
wherein said selecting training texts from said labeled text set in consideration of the internal features and the surface features of said labeled text set specifically comprises:
determining weights of the surface features according to the quality of the external knowledge source and the category names, and incorporating, based on a given weighting algorithm, the surface features into a traditional statistics-based training text selection method;
or
first using the surface features to assign initial scores to the labeled texts, and then using a learning process to refine the initial scoring results.
2. the method for claim 1, the step wherein extracting the internal feature of described mark text set comprises:
Vectorization is carried out to each mark text in described mark text set, to obtain the vector space model corresponding to this mark text,
The vocabulary that wherein vector space model of each mark text comprises forms the internal feature of described mark text set together.
3. the method for claim 1, the class name of wherein said classification does not have the semantic information of natural language, and the step building the surface of described mark text set also comprises:
The mark text comprised by analyzing described classification creates the accurate class name of described classification automatically; And
Wherein, described accurate class name is used to the structure of surface as the class name of described classification.
4. the method for claim 1, the step wherein building the surface of described mark text set also comprises:
Stop words is removed from described one group of vocabulary that decomposition obtains.
5. the method for claim 1, each meaning of a word selected in it is also used as the surface of described classification in described external knowledge sources by semantic relation extension vocabulary out.
6. the method for claim 1, wherein said external knowledge sources is dictionary.
7. The method of claim 2, wherein the step of selecting training texts from said labeled text set comprises:
calculating pairwise distances between labeled texts by calculating the similarity between said vector space models;
adjusting the calculated pairwise distances between labeled texts using the surface features of said labeled text set; and
selecting said training texts according to the adjusted distances between the labeled texts using a statistical method.
8. The method of claim 7, wherein the step of selecting said training texts comprises:
calculating, for each labeled text, the number of labeled texts contained in the largest circular region centered on that labeled text that contains no labeled text of any other category; and
selecting, as said training texts, the labeled texts whose circular regions contain the smallest numbers of labeled texts.
9. The method of claim 7, wherein the step of selecting said training texts comprises:
calculating the distance from each labeled text to the convex hull of the labeled texts of the other categories; and
selecting the labeled texts with the smallest distances as said training texts.
10. The method of claim 7, wherein the step of adjusting the calculated pairwise distances between labeled texts using said surface features comprises:
giving a higher weight to words that appear in both the internal features and the surface features of said labeled text set; and
adjusting the calculated pairwise distances between labeled texts according to said weights.
11. the method for claim 1, wherein from described mark text set, select the step of training text to comprise:
The described surface of described mark text set is utilized to carry out initialization marking to each described mark text;
Utilize the described internal feature of described mark text set to described initialization marking result of refining; And
Described training text is selected according to the marking result after each marks refining of text.
12. The method of claim 11, wherein the step of assigning an initial score to each said labeled text comprises:
for each category in said labeled text set, querying the labeled texts of that category using the words contained in the corresponding surface features as query keywords; and
taking the similarity, reflected in the returned query results, between each labeled text and the surface features of the corresponding category as the initial score of that labeled text.
13. The method of claim 12, wherein the step of refining the initial scoring results comprises:
(a) according to the initial score of each said labeled text, performing classifier learning with the top t% of labeled texts of each category as a training text set, to obtain an intermediate classifier;
(b) classifying the labeled texts in said labeled text set using said intermediate classifier;
(c) for each category obtained by the classification, re-scoring and re-ranking the labeled texts of that category according to the internal features of each labeled text it contains; and
(d) according to the new ranking results, selecting the top p% of labeled texts of each category as a new training text set for classifier learning, to obtain a new intermediate classifier,
and repeating steps (b), (c), and (d) until the selected training text set no longer changes.
14. The method of claim 13, wherein, when the labeled texts are re-scored and re-ranked according to their internal features, each labeled text is scored and ranked according to its distance from the hyperplane of said intermediate classifier or from the expectation of the labeled-text distribution.
15. The method of claim 13, wherein the step of selecting the top p% of labeled texts of each category as the new training text set comprises:
for each category:
selecting the top p+% of labeled texts of that category as positive samples of that category, and
selecting the top p-% of labeled texts of the other categories as negative samples of that category.
16. A device for building a text classifier, comprising:
an input means for inputting a labeled text set;
an internal feature extraction means for extracting internal features of said labeled text set;
a surface feature construction means for building surface features of said labeled text set with reference to an external knowledge source, wherein said surface feature construction means comprises:
a word segmentation unit for performing, for each category, word segmentation on the corresponding category name to decompose it into a group of words;
a word sense scoring unit for scoring each sense of each word in said external knowledge source;
a word sense selection unit for selecting one or more high-scoring senses of each said word according to the scoring results of said word sense scoring unit; and
a combination unit for combining said words and their high-scoring senses, to form the surface features of said labeled text set for said category;
a training text selection means for selecting training texts from said labeled text set in consideration of the internal features and the surface features of said labeled text set; and
a classifier learning means for learning with the selected training texts to generate said text classifier;
wherein said selecting training texts from said labeled text set in consideration of the internal features and the surface features of said labeled text set specifically comprises:
determining weights of the surface features according to the quality of the external knowledge source and the category names, and incorporating, based on a given weighting algorithm, the surface features into a traditional statistics-based training text selection method;
or
first using the surface features to assign initial scores to the labeled texts, and then using a learning process to refine the initial scoring results.
17. The device of claim 16, wherein said internal feature extraction means comprises:
a text vectorization means for vectorizing each labeled text in said labeled text set to obtain a vector space model corresponding to that labeled text,
wherein the words contained in the vector space models of the labeled texts together form the internal features of said labeled text set.
18. The device of claim 16, wherein the category name of said category carries no natural-language semantic information, and said surface feature construction means further comprises:
an accurate category name generation unit for automatically creating an accurate category name for said category by analyzing the labeled texts it contains;
wherein said accurate category name is used as the category name of said category in building the surface features.
19. The device of claim 16, wherein said surface feature construction means further comprises:
a filtering unit for removing stop words from said group of words obtained by the decomposition.
20. The device of claim 16, wherein said external knowledge source is a dictionary.
21. The device of claim 17, wherein said training text selection means comprises:
a distance calculation unit for calculating pairwise distances between labeled texts by calculating the similarity between said vector space models;
a distance adjustment unit for adjusting the calculated pairwise distances between labeled texts using the surface features of said labeled text set; and
a statistics-based training text selection unit for selecting said training texts according to the adjusted distances between the labeled texts using a statistical method.
22. The device of claim 21, wherein said training text selection means further comprises:
a weight generation unit for giving a higher weight to words that appear in both the internal features and the surface features of said labeled text set,
wherein said distance adjustment unit adjusts the calculated pairwise distances between labeled texts according to said weights generated by the weight generation unit.
23. The device of claim 16, wherein said training text selection means comprises:
an initial scoring unit for assigning an initial score to each said labeled text using said surface features of said labeled text set; and
a scoring refinement unit for refining the initial scoring results using said internal features of said labeled text set, and selecting said training texts according to the refined score of each labeled text.
24. The device of claim 23, wherein said initial scoring unit comprises:
a query component for querying, for each category in said labeled text set, the labeled texts of that category using the words contained in the corresponding surface features as query keywords; and
a surface-feature scoring component for taking the similarity, reflected in the returned query results, between each labeled text and the surface features of the corresponding category as the initial score of that labeled text.
25. The device of claim 24, wherein said scoring refinement unit comprises:
an initial learning component for performing classifier learning, according to the initial score of each said labeled text, with the top t% of labeled texts of each category as a training text set, to obtain an intermediate classifier, the intermediate classifier being used to classify the labeled texts in said labeled text set;
an internal-feature scoring component for re-scoring and re-ranking, for each category obtained by the classification, the labeled texts of that category according to the internal features of each labeled text it contains; and
an intermediate learning component for selecting, according to the new ranking results, the top p% of labeled texts of each category as a new training text set for classifier learning, to obtain a new intermediate classifier,
wherein said internal-feature scoring component and said intermediate learning component operate iteratively in a loop until the selected training text set no longer changes.
26. The device of claim 25, wherein said intermediate learning component comprises:
a positive sample selection means for selecting, for each category, the top p+% of labeled texts of that category as positive samples of that category, and
a negative sample selection means for selecting the top p-% of labeled texts of the other categories as negative samples of that category.
CN200910171947.8A 2009-09-22 2009-09-22 The method and apparatus of text classifier is built with reference to external knowledge Expired - Fee Related CN102023986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910171947.8A CN102023986B (en) 2009-09-22 2009-09-22 The method and apparatus of text classifier is built with reference to external knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910171947.8A CN102023986B (en) 2009-09-22 2009-09-22 The method and apparatus of text classifier is built with reference to external knowledge

Publications (2)

Publication Number Publication Date
CN102023986A CN102023986A (en) 2011-04-20
CN102023986B true CN102023986B (en) 2015-09-30

Family

ID=43865294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910171947.8A Expired - Fee Related CN102023986B (en) 2009-09-22 2009-09-22 The method and apparatus of text classifier is built with reference to external knowledge

Country Status (1)

Country Link
CN (1) CN102023986B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682124B (en) * 2012-05-16 2014-07-09 苏州大学 Emotion classifying method and device for text
CN103678419B (en) * 2012-09-25 2016-09-14 日电(中国)有限公司 A kind of data identification method and device
CN103150405B (en) * 2013-03-29 2014-12-10 苏州大学 Classification model modeling method, Chinese cross-textual reference resolution method and system
DE102013007502A1 (en) * 2013-04-25 2014-10-30 Elektrobit Automotive Gmbh Computer-implemented method for automatically training a dialogue system and dialog system for generating semantic annotations
US9817823B2 (en) 2013-09-17 2017-11-14 International Business Machines Corporation Active knowledge guidance based on deep document analysis
CN104156349B (en) * 2014-03-19 2017-08-15 邓柯 Unlisted word discovery and Words partition system and method based on statistics dictionary model
CN105260488B (en) * 2015-11-30 2018-10-02 哈尔滨工业大学 A kind of text sequence alternative manner for semantic understanding
US10073834B2 (en) * 2016-02-09 2018-09-11 International Business Machines Corporation Systems and methods for language feature generation over multi-layered word representation
CN108090040B (en) * 2016-11-23 2021-08-17 北京国双科技有限公司 Text information classification method and system
US20190347571A1 (en) * 2017-02-03 2019-11-14 Koninklijke Philips N.V. Classifier training
CN106951565B (en) * 2017-04-05 2018-04-27 数库(上海)科技有限公司 File classification method and the text classifier of acquisition
CN107578106B (en) * 2017-09-18 2020-03-24 中国科学技术大学 Neural network natural language reasoning method fusing word semantic knowledge
CN108197638B (en) * 2017-12-12 2020-03-20 阿里巴巴集团控股有限公司 Method and device for classifying sample to be evaluated
CN108563786B (en) * 2018-04-26 2019-12-20 腾讯科技(深圳)有限公司 Text classification and display method and device, computer equipment and storage medium
CN111488741A (en) * 2020-04-14 2020-08-04 税友软件集团股份有限公司 Tax knowledge data semantic annotation method and related device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
US7899253B2 (en) * 2006-09-08 2011-03-01 Mitsubishi Electric Research Laboratories, Inc. Detecting moving objects in video by classifying on riemannian manifolds
CN101021842A (en) * 2007-03-09 2007-08-22 清华大学 Automatic learning and extending evolution handling method for Chinese basic block descriptive rule
CN101464954A (en) * 2007-12-21 2009-06-24 三星电子株式会社 Method for training multi-genus Boosting categorizer

Also Published As

Publication number Publication date
CN102023986A (en) 2011-04-20

Similar Documents

Publication Publication Date Title
CN102023986B (en) The method and apparatus of text classifier is built with reference to external knowledge
CN101561805B (en) Document classifier generation method and system
US9317569B2 (en) Displaying search results with edges/entity relationships in regions/quadrants on a display device
CN108132927B (en) Keyword extraction method for combining graph structure and node association
Li et al. Knowledge verification for long-tail verticals
KR20050073429A (en) Machine-learned approach to determining document relevance for search over large electronic collections of documents
EP3025254A1 (en) Query expansion and query-document matching using path-constrained random walks
CN103942302B (en) Method for establishment and application of inter-relevance-feedback relational network
CN105843799B (en) A kind of academic paper label recommendation method based on multi-source heterogeneous information graph model
CN109933720A (en) A kind of dynamic recommendation method based on user interest Adaptive evolution
CN109299357B (en) Laos language text subject classification method
CN103778206A (en) Method for providing network service resources
Ramasundaram et al. Text categorization by backpropagation network
CN103377224B (en) Identify the method and device of problem types, set up the method and device identifying model
Doetsch et al. Logistic model trees with auc split criterion for the kdd cup 2009 small challenge
Adami et al. Clustering documents into a web directory for bootstrapping a supervised classification
Ding et al. The research of text mining based on self-organizing maps
Jang et al. Predictive mining of comparable entities from the web
Thatha et al. An Enhanced Support Vector Machine Based Pattern Classification Method for Text Classification in English Texts
Soonthornphisaj et al. Social media comment management using smote and random forest algorithms
CN104572623A (en) Efficient data summary and analysis method of online LDA model
CN102982063A (en) Control method based on tuple elaboration of relation keywords extension
Jaleel et al. Textual Dataset Classification Using Supervised Machine Learning Techniques
Khalili et al. The ineffectiveness of domain-specific word embedding models for GUI test reuse
Shanthini et al. Advanced Data Mining Enabled Robust Sentiment Analysis on E-Commerce Product Reviews and Recommendation Model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150930

Termination date: 20170922