CN105786898A

CN105786898A - Domain ontology construction method and apparatus

Info

Publication number: CN105786898A
Application number: CN201410822832.1A
Authority: CN
Inventors: 黄毅; 周文辉; 冯俊兰; 李明洋; 张鹏
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2014-12-24
Filing date: 2014-12-24
Publication date: 2016-07-20
Anticipated expiration: 2034-12-24
Also published as: CN105786898B

Abstract

The invention discloses a domain ontology construction method and apparatus, which are used for mining professional and accurate domain concept sets to accurately attribute instances to correct concepts and improve the accuracy of domain ontology construction. The domain ontology construction method comprises the steps of extracting characteristic information of each document contained in a document set; according to the extracted characteristic information, clustering documents contained in the document set by utilizing a clustering algorithm to obtain K1 clusters; extracting at least one domain concept from the obtained clusters, and dividing each cluster obtained by clustering into a positive example cluster and a negative example cluster according to whether the cluster belongs to the extracted domain concept or not; selecting a preset number of documents from the positive example cluster and the negative example cluster, and determining a document classifier according to the selected documents; classifying unselected documents into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept by utilizing the document classifier; and performing next iteration on documents contained in the second document set.

Description

Domain ontology construction method and apparatus

Technical field

The present invention relates to the field of data mining technology, and in particular to a domain ontology construction method and apparatus.

Background

The "Semantic Web" is a term used in computing and on the Internet to describe the next stage of network development. "Semantics" here means the meaning of text. The Semantic Web is a network that can make judgments based on semantics: an intelligent network that understands human language and makes communication between people and computers as easy as communication between people. With the Semantic Web, a network can be built whose links are based on the semantics of the data in web pages, so that the network can automatically search and retrieve web pages according to the user's requirements until the required content is found. How to extract Web information and build it into a semantic, machine-understandable form is the focus of current Semantic Web research.

An ontology, as a modeling tool that can describe concepts at the semantic and knowledge levels, is the core of and key to the semantic representation of Web information. Ontologies play an important role in related fields such as knowledge engineering, natural language processing, question answering systems, information retrieval, and intelligent information integration.

An ontology is a conceptualization: an abstract description of certain phenomena in the world. Concept mining is the process of obtaining domain concepts from documents of the relevant domain, either by manual definition or by machine learning. The abstracted concepts are used to describe the categories of instances in the ontology and to establish relations between concepts. Generating ontology concepts and their hierarchical relations is of prime importance in ontology construction.

A domain ontology is a specialized ontology that describes the concepts of a specific domain and the relations between them, providing an authoritative understanding of the knowledge of that domain. The construction and application of domain ontologies are a focus of ontology research. Domain ontologies are often built manually; both concept acquisition and the establishment of concept relations lack automated means, which hinders the rapid construction of ontology models.

Methods for identifying ontology concepts fall broadly into rule-based methods, statistics-based methods, and methods that combine rules and statistics.

Rule-based methods abstract rules or templates from manually identified concepts and then find the concepts in the text that match a rule or fit a template. These methods usually also rely on natural language processing tools and construct rules from text characteristics such as word segmentation results and part of speech. They are affected by differences in language and domain: new rules must be constructed for each new environment, so the work is laborious and the methods lack generality.

Statistics-based methods use machine learning techniques: features are found in a corpus, and the corpus is annotated and used for training to obtain a concept extraction model. Commonly used methods include the HMM (Hidden Markov Model) and decision trees. These methods are not affected by language or domain.

Methods that combine rules and statistics use linguistic and statistical methods together to obtain concepts. The rule component focuses on obtaining candidate concepts, while the statistical component improves the accuracy and efficiency of concept acquisition. Most current ontology learning systems adopt this combined approach to obtain ontology concepts.

Some industries, for example mobile customer service, hold large amounts of text data to be analyzed and processed, including professional knowledge bases, business specifications, package information, and customer service Q&A. These data are organized differently: there are structured workbooks, semi-structured business specifications, procedures and package information, and unstructured QA pairs, dialogue streams, and so on. Because the organization of these data is scattered and diverse, mobile customer service staff often have to search repeatedly during their work, which hinders them from obtaining the professional knowledge they need quickly and accurately. Building a mobile service ontology knowledge base allows structured, semi-structured, and unstructured data to be modeled uniformly, so that all kinds of business data can be managed in a conceptualized, hierarchical, and intelligent way.

For a domain ontology, however, the concept set is clearly restricted to the domain, so the concepts must be determined accurately and authoritatively. Concept identification therefore requires the supervision and assistance of experts to ensure that the results are professional and credible. As a result, building a domain ontology needs the help of many professionals, and the labor cost is high.

Therefore, for domain ontology construction, existing concept mining methods have the following problems. On the one hand, rule-based methods require different rules to be designed for different domains and cannot generalize. On the other hand, statistics-based methods require the concept set to be specified manually before annotation and training, which requires inspecting all documents; otherwise candidate concepts will be missing, the classification of instances will be affected, and precision cannot be guaranteed. Moreover, both methods depend on the accuracy and coverage of the preliminary manual work, which affects the accuracy of the concept mining results and, in turn, the accuracy of domain ontology construction.

Summary of the invention

Embodiments of the present invention provide a domain ontology construction method and apparatus for mining a professional and accurate set of domain concepts, so that instances are accurately attributed to the correct concepts and the accuracy of domain ontology construction is improved.

An embodiment of the present invention provides a domain ontology construction method, including:

extracting feature information of each document contained in a document set;

clustering, according to the extracted feature information, the documents contained in the document set by using a clustering algorithm to obtain K₁ clusters, where K₁ is a positive integer;

extracting at least one domain concept from the obtained clusters, and dividing each cluster obtained by clustering into a positive example cluster or a negative example cluster according to whether it belongs to the extracted domain concept;

selecting a preset number of documents from the positive example clusters and from the negative example clusters respectively, and determining a document classifier according to the selected documents;

classifying, by using the determined document classifier, the unselected documents into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept;

for the documents contained in the second document set, clustering them according to the feature information of each document by using the clustering algorithm to obtain K₂ clusters, and performing again the step of extracting at least one domain concept from the obtained clusters and dividing each cluster into a positive example cluster or a negative example cluster according to whether it belongs to the extracted domain concept, until the number of documents in the second document set is lower than a preset value, where K₂ is a positive integer.

An embodiment of the present invention provides a domain ontology construction apparatus, including:

a first extraction unit, configured to extract feature information of each document contained in a document set;

a clustering unit, configured to cluster, according to the feature information extracted by the first extraction unit, the documents contained in the document set by using a clustering algorithm to obtain K₁ clusters, where K₁ is a positive integer; and, for the documents contained in the second document set produced by the document division unit, to cluster those documents according to the feature information extracted by the first extraction unit to obtain K₂ clusters;

a second extraction unit, configured to extract at least one domain concept from the clusters obtained by the clustering unit;

a cluster division unit, configured to divide each cluster obtained by clustering into a positive example cluster or a negative example cluster according to whether it belongs to the extracted domain concept;

a first determining unit, configured to select a preset number of documents from the positive example clusters and from the negative example clusters respectively, and to determine a document classifier according to the selected documents;

a document division unit, configured to classify, by using the document classifier determined by the first determining unit, the unselected documents into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept; and, when the number of documents contained in the second document set is not lower than a preset value, to trigger the clustering unit to cluster the documents contained in the second document set, according to the feature information extracted by the first extraction unit and by using the clustering algorithm, to obtain K₂ clusters, where K₂ is a positive integer.

With the domain ontology construction method and apparatus provided by the embodiments of the present invention, during domain ontology construction and with all concepts initially unknown, concepts are abstracted iteratively with the aid of a clustering algorithm, and each document is attributed to the correct concept based on the concepts abstracted in each iteration. This avoids the document misclassification that arises in manual processes when the concept set must be fixed in advance and concepts are therefore missing, and improves the accuracy of domain ontology construction.

Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the written description, the claims, and the accompanying drawings.

Brief description of the drawings

The accompanying drawings described here are provided for further understanding of the present invention and constitute a part of it. The exemplary embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic flowchart of the domain ontology construction method in an embodiment of the present invention;

Fig. 2 is a schematic flowchart of clustering documents in an embodiment of the present invention;

Fig. 3 is a schematic flowchart of determining the number of clusters K₁ in an embodiment of the present invention;

Fig. 4 is a schematic flowchart of determining the initial cluster centers in an embodiment of the present invention;

Fig. 5 is a schematic flowchart of determining d_c in an embodiment of the present invention;

Fig. 6 is a schematic flowchart of determining the document classifier in an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of the domain ontology construction apparatus in an embodiment of the present invention.

Detailed description of the invention

To avoid the errors that arise when concepts are extracted manually during domain ontology construction, and to improve the accuracy of domain ontology construction, embodiments of the present invention provide a domain ontology construction method and apparatus.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the invention and are not intended to limit it, and that, where no conflict arises, the embodiments of the invention and the features in the embodiments may be combined with one another.

Fig. 1 is a schematic flowchart of the domain ontology construction method provided by an embodiment of the present invention, which may include the following steps:

S11. Extract feature information of each document contained in the document set.

S12. According to the extracted feature information, cluster the documents contained in the document set by using a clustering algorithm to obtain K₁ clusters.

Here, K₁ is an integer greater than or equal to 1.

S13. Extract at least one domain concept from the obtained clusters, and divide each cluster obtained by clustering into a positive example cluster or a negative example cluster according to whether it belongs to the extracted domain concept.

S14. Select a preset number of documents from the positive example clusters and from the negative example clusters respectively, and determine a document classifier according to the selected documents.

It should be noted that step S14 corresponds to the number of concepts extracted in step S13: if one concept is extracted in step S13, step S14 determines a binary document classifier from the selected documents; if multiple concepts are extracted in step S13, step S14 determines a multi-class document classifier from the selected documents. For ease of description, the embodiments of the present invention are described with one concept extracted per iteration. In a specific implementation, this may be chosen according to actual needs and is not limited by the embodiments of the present invention.

S15. Using the determined document classifier, classify the unselected documents into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept.

S16. Determine whether the number of documents contained in the second document set is greater than or equal to a preset value; if so, perform step S17; otherwise, the procedure ends.

S17. According to the feature information of each document, cluster the documents contained in the second document set by using the clustering algorithm to obtain K₂ clusters, and perform step S13.

Here, K₂ is an integer greater than or equal to 1. It should be noted that K₁ and K₂ may be the same or different.
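For illustration only, the loop formed by steps S12 to S17 can be sketched in Python as follows. The function name `build_concepts` and the callables `cluster`, `pick_concept`, `sample`, and `train` are hypothetical placeholders supplied by the caller, not part of the described method. The sketch assumes one concept per iteration, hashable documents, and that labeled negative example documents return to the second document set; it also adds a safeguard that stops when an iteration assigns no document.

```python
def build_concepts(docs, cluster, pick_concept, sample, train, min_docs):
    """Iterate S12-S17; return ({concept: documents}, leftover documents).

    All callables are hypothetical placeholders supplied by the caller:
      cluster(docs)          -> list of clusters (lists of documents)
      pick_concept(clusters) -> (concept name, indices of positive clusters)
      sample(docs)           -> documents chosen for labeling
      train(pos, neg)        -> predicate telling whether a document fits
    """
    concepts = {}
    remaining = list(docs)
    while remaining:
        clusters = cluster(remaining)                      # S12 / S17
        concept, pos_idx = pick_concept(clusters)          # S13
        pos = [d for i in pos_idx for d in clusters[i]]
        neg = [d for i, c in enumerate(clusters)
               if i not in pos_idx for d in c]
        pos_lab, neg_lab = sample(pos), sample(neg)        # S14
        belongs = train(pos_lab, neg_lab)
        labeled = set(pos_lab) | set(neg_lab)
        rest = [d for d in remaining if d not in labeled]
        # S15: the first set belongs to the concept, the second does not.
        first = pos_lab + [d for d in rest if belongs(d)]
        second = neg_lab + [d for d in rest if not belongs(d)]
        if not first:          # safeguard added in this sketch: no progress
            break
        concepts[concept] = first
        remaining = second
        if len(remaining) < min_docs:                      # S16
            break
    return concepts, remaining
```

Each pass removes the documents attributed to the new concept, so the second document set shrinks until it falls below `min_docs`.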

The specific implementation of the embodiments of the present invention is described below.

In step S11, initially, for each document contained in the document set, a set of feature information that can characterize the document is extracted from within the document. The optional feature information includes at least one of the following:

1) Document title

The document title expresses the central content and nature of the document; similar documents are likely to have similar titles.

2) Chapter titles

Long documents often contain first-level headings such as "Chapter 1", "Section 1", "Eligible customers", or "Basic package information", which in the body are usually set in bold type slightly larger than the body text. In web page text, such headings usually carry <border> or other custom special styles, so they are easy to identify both by eye and by text parsing.

A chapter title can be regarded as the center or summary of the chapter. Similar documents usually share a similar writing structure and arrange their chapters in much the same way. In addition, documents of a given domain may be generated from fixed templates, and documents that share a template are to a large extent instances of the same concept. Chapter titles can therefore serve as one of the document features.

3) Subheadings within chapters

Long documents often contain second- or third-level headings such as "1.1", "1-1", "Caller rate", or "Monthly usage fee", which in the body are usually set in bold and on web pages usually carry special styles as well. A subheading can be regarded as the center or summary of a paragraph. Like chapter titles, subheadings may help distinguish different types of documents, so they can also serve as one of the document features.

4) Document body content

The word segmentation result of the body text is the most commonly used feature in text classification. Preprocessing techniques such as part-of-speech tagging, stemming, and stop-word removal yield the useful information in the text to characterize the document. However, good segmentation of domain text requires the support of a domain vocabulary, and general-purpose segmentation tools perform poorly. In addition, using segmentation results as features tends to inflate the dimensionality of the feature space and introduce more noise. Using body-text segmentation results as document features is therefore not recommended.

According to the characteristics of the document set, features that can characterize the concept a document belongs to are chosen from the optional features above to form a word set, and word frequencies are computed as feature values. The two-dimensional matrix formed by the feature vectors of the documents is the input to the iterative part.
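As a minimal sketch of how such a matrix might be built, the Python below treats each extracted heading string as one term of the word set and counts its occurrences per document. Treating whole headings as terms, and the name `feature_matrix`, are assumptions made for illustration.

```python
from collections import Counter

def feature_matrix(doc_headings):
    """Build (vocabulary, matrix) from per-document lists of heading strings.

    Each row is one document; each column is the frequency of one heading.
    Using whole heading strings as terms is an assumption of this sketch.
    """
    vocab = sorted({h for headings in doc_headings for h in headings})
    rows = []
    for headings in doc_headings:
        counts = Counter(headings)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, matrix = feature_matrix([
    ["Plan overview", "Monthly fee"],
    ["Plan overview", "Caller rate", "Caller rate"],
])
```

A document with none of the chosen features yields an all-zero row.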

Based on the feature information obtained in step S11, the two-dimensional matrix formed by the documents and their feature vectors is taken as input, and document clustering, concept abstraction, labeling, and document assignment are carried out iteratively. Each iteration abstracts at least one new concept. These steps are introduced below.

Clustering documents

Fig. 2 is a schematic flowchart of clustering documents, which may include the following steps:

S21. Select K₁ documents from the document set as initial cluster centers.

S22. Assign each document contained in the document set to the same cluster as the initial cluster center nearest to it.

S23. For each cluster obtained, determine the center of the cluster as a new cluster center.

S24. Re-divide the documents contained in the document set into K₁ clusters according to the minimum distance between each document and the new cluster centers.

S25. Determine whether the documents contained in each cluster have changed; if so, perform step S23; otherwise, the procedure ends.

In a specific implementation, the two-dimensional matrix formed by the documents and their feature vectors can be taken as input and the documents clustered by a clustering algorithm such as K-Means with specified initial parameters.

The following description uses the K-Means algorithm to cluster documents as an example. In a specific implementation, other clustering algorithms may be used; this is not limited by the embodiments of the present invention.

K in the K-Means algorithm is the desired final number of clusters and is often specified manually. At the start of clustering, K documents can be selected as initial cluster centers (note that a document here means the document's feature vector in the two-dimensional matrix), and every other point joins the cluster of its nearest cluster center. The center of each cluster is then taken to form a new set of cluster centers, the distance between each point and the cluster centers is recomputed, and the points are clustered again. This is repeated until the clustering result no longer changes.
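Steps S21 to S25 can be illustrated with the following plain-Python K-Means sketch. It uses Euclidean distance (`math.dist`) and takes the initial centers as an argument; both choices, and the `max_iter` cap, are assumptions of the sketch rather than requirements of the method.

```python
import math

def kmeans(points, init_centers, max_iter=100):
    """Plain K-Means following S21-S25; returns (labels, centers)."""
    centers = [list(c) for c in init_centers]          # S21
    labels = None
    for _ in range(max_iter):
        # S22 / S24: assign every point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:                       # S25: nothing changed
            break
        labels = new_labels
        # S23: the mean of each cluster becomes its new center.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

labels, centers = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)],
                         init_centers=[(0, 0), (10, 10)])
```

The loop stops as soon as an assignment pass leaves every document in its previous cluster.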

The clustering result is affected by the number of clusters K and by the initial cluster centers, and the quality of the clustering result directly determines the accuracy of concept mining. For selecting initial cluster centers, K-Means++, CCIA, and the like are common algorithms, but their results are not fixed and involve randomness. To address this, in the embodiments of the present invention the number of clusters K can be determined according to the procedure shown in Fig. 3:

S31. Cluster the documents contained in the document set using each value in a preset range of K.

To judge the quality of a clustering result, the embodiments of the present invention introduce the silhouette coefficient (Silhouette Coefficient), which combines cohesion and separation, to measure the clustering result. In implementing the invention, multiple numbers of clusters can be tried.

S32. For each value in the range, determine the silhouette coefficient of the clustering result obtained by clustering the documents contained in the document set with that value.

The silhouette coefficient is computed for each number of clusters: a minimum number of clusters minK and a maximum number of clusters maxK are set, clustering is performed for each K with minK ≤ K ≤ maxK, and the silhouette coefficient of each clustering result is computed. A larger silhouette coefficient indicates a better clustering.

It should be noted that, for each value in the range minK to maxK, the silhouette coefficient of the clustering result obtained with that value is the mean of the document silhouette coefficients. That is, each document taking part in the clustering has an individual silhouette coefficient, computed as:

S_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}

Here, i and j are document identifiers, natural numbers greater than or equal to 1; S_i is the silhouette coefficient of document i; a_i is the average distance between document i and the other documents in the cluster it belongs to; and b_i is the minimum of the average distances between document i and the other clusters. For any other cluster, the average distance between document i and that cluster is the mean of the distances between document i and each document contained in that cluster. The mean of the silhouette coefficients of all documents contained in the document set is the silhouette coefficient of the clustering result obtained by clustering the documents with that value.

S33. Take the value corresponding to the largest silhouette coefficient as K.
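Steps S31 to S33 might be sketched as below. The function `mean_silhouette` implements the per-document formula for S_i and averages it; `best_k` picks the K with the largest mean. Both names are hypothetical. Scoring a document as 0 when it is alone in its cluster or when there is only one cluster is a common convention assumed here, since the text does not cover those cases.

```python
import math

def mean_silhouette(points, labels, dist=math.dist):
    """Mean of S_i = (b_i - a_i) / max(a_i, b_i) over all documents."""
    scores = []
    for i, p in enumerate(points):
        per_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                per_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = per_cluster.pop(labels[i], [])
        if not own or not per_cluster:
            scores.append(0.0)   # assumed convention for degenerate cases
            continue
        a = sum(own) / len(own)                                    # a_i
        b = min(sum(ds) / len(ds) for ds in per_cluster.values())  # b_i
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

def best_k(points, labelings):
    """labelings maps each candidate K to the labels obtained with that K."""
    return max(labelings, key=lambda k: mean_silhouette(points, labelings[k]))

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
chosen = best_k(pts, candidates)
```

In this toy example the two-cluster labeling scores higher than the three-cluster one, so `chosen` is 2.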

In a specific implementation, K₁ and K₂, as well as the numbers of clusters used in subsequent iterations, can be determined according to the steps shown in Fig. 3.

Further, in the embodiments of the present invention, the initial cluster centers can be determined according to the procedure shown in Fig. 4:

S41. For each document contained in the document set, determine the local density of the document.

Preferably, in a specific implementation, the local density of each document can be determined by the following formula:

\rho_{i} = \sum_{j \neq i} \chi (d_{ij} - d_{c})

where χ(d_ij − d_c) = 1 if d_ij < d_c and χ(d_ij − d_c) = 0 if d_ij ≥ d_c; i and j are document identifiers; ρ_i is the local density of document i; d_ij is the distance between document i and document j; and d_c is a preset distance threshold. In other words, ρ_i is the number of other documents whose distance from document i is less than d_c.
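As a minimal sketch, the local density can be computed from a precomputed distance matrix as follows. It follows the description of ρ_i as the number of other documents closer than d_c; the name `local_density` and the matrix representation are assumptions.

```python
def local_density(dist, d_c):
    """rho_i = number of other documents j with dist[i][j] < d_c."""
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]

# Toy example: four documents placed on a line at positions 0, 1, 2, 10.
xs = [0, 1, 2, 10]
dist = [[abs(a - b) for b in xs] for a in xs]
rho = local_density(dist, d_c=1.5)
```

The isolated document at position 10 has no neighbor within d_c and therefore density 0.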

Preferably, d_c is chosen so that the density of each document is moderate (for example, so that the average density of all documents is 1%-2% of the total number of documents). However, since computing the average density is impractical, in the embodiments of the present invention d_c can be determined according to the steps shown in Fig. 5:

S51. For each document contained in the document set, sort the distances between the document and the other documents in ascending order and select the distance at a preset proportion as the distance threshold of the document.

For example, for each document, the distances between the document and the other documents are sorted in ascending order, and the distance value at position 10% × S in the sorted result (where S is the number of documents contained in the document set) is chosen as the distance threshold of the document.

S52. Sort the distance thresholds of the documents contained in the document set in ascending order.

S53. From the sorted distance thresholds, take the first quartile as d_c.
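Steps S51 to S53 can be sketched as follows. How the "10% × S position" and the "first quartile" map onto list indices is not specified in the text; the rounding used here (1-based position truncated to an integer, and the element at index len // 4) is an assumption, as is the name `distance_threshold`.

```python
def distance_threshold(dist, ratio=0.10):
    """Determine d_c from a distance matrix following S51-S53."""
    n = len(dist)
    per_doc = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)   # S51
        # Position ratio * S, converted to a 0-based index and clamped
        # to the valid range (index convention assumed by this sketch).
        pos = min(len(others) - 1, max(0, int(ratio * n) - 1))
        per_doc.append(others[pos])
    per_doc.sort()                                                # S52
    return per_doc[len(per_doc) // 4]                             # S53

xs = [0, 1, 2, 10]
dist = [[abs(a - b) for b in xs] for a in xs]
d_c = distance_threshold(dist, ratio=0.5)
```

Taking a low quantile of the per-document thresholds keeps d_c from being dominated by isolated documents whose nearest neighbors are far away.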

S42. For each document, determine the minimum distance between the document and the documents whose local density is greater than that of the document.

Specifically, for each document, the minimum distance δ_i between the document and the documents of greater local density is determined by the following formula:

\delta_{i} = \min_{j:\, \rho_{j} > \rho_{i}} d_{ij}

That is, for document i, δ_i is the minimum of the distances between document i and all documents whose local density ρ is greater than ρ_i. It should be noted that for the document with the greatest local density, δ_i = max_j(d_ij), i.e. the maximum distance between document i and the other documents.

S43. Plot a two-dimensional graph whose coordinates are, for each document contained in the document set, its local density and its minimum distance to the documents of greater local density.

A two-dimensional graph is drawn with ρ and δ as the horizontal and vertical axes, each document corresponding to one sample point on the graph. The target value of each sample point is the area of the rectangle its coordinates enclose with the axes, i.e. A_i = ρ_i × δ_i, and the K points with the largest A are chosen as the candidate set of initial cluster centers.

S44. In descending order of the area of the rectangle formed with the coordinate axes, select the first K documents as initial cluster centers.
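Steps S42 to S44 can be illustrated with the sketch below, which computes δ_i from a distance matrix and the local densities and then ranks documents by A_i = ρ_i × δ_i. The function names are hypothetical; when several documents tie for the greatest density, the sketch gives each of them the maximum-distance value.

```python
def min_dist_to_denser(dist, rho):
    """delta_i: minimum distance from i to any document of greater density."""
    n = len(dist)
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        if denser:
            delta.append(min(denser))
        else:
            # Densest document(s): use the maximum distance to the others.
            delta.append(max(dist[i][j] for j in range(n) if j != i))
    return delta

def top_k_by_area(rho, delta, k):
    """Indices of the k documents with the largest area rho_i * delta_i."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i],
                   reverse=True)
    return order[:k]

xs = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in xs] for a in xs]
rho = [1, 2, 1, 1, 1]            # local densities for d_c = 1.5
delta = min_dist_to_denser(dist, rho)
centers = top_k_by_area(rho, delta, k=2)
```

In the toy data the two selected documents lie in different groups, one near positions 0 to 2 and one near positions 10 to 11.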

Preferably, in a specific implementation, to prevent an excessively large value of either ρ or δ from making the target area too large, that is, to exclude outlying points and ensure that the computed clustering result has high coverage, the target points can be screened by slope. Specifically, with ρ as the ordinate and δ as the abscissa, the slope k of every point on the graph is computed (that is, for each document, the ratio of its local density to its minimum distance to the documents of greater local density), and the documents are sorted in descending order of k. Of the sorted documents, the ρ and δ of those whose k lies between the maximum and minimum k values, i.e. in the middle part of the ranking, are retained.

Preferably, in a specific implementation, this can be done as follows: sort the documents contained in the document set by their k values and divide the sorted document set evenly into N subsets (N is an integer greater than 2); select the documents contained in the subsets other than the first and the last; and delete from the two-dimensional graph the ρ and δ of all documents other than the selected ones.

For example, the document set is divided evenly into 4 parts, and the largest part and the smallest part are removed. The maximum maxS and minimum minS of the slopes in the middle two parts serve as the bounds of the admissible slope: a document is selected as one of the initial cluster centers only when its slope lies between minS and maxS and the area it encloses with the coordinate axes ranks among the K largest.
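The slope screening can be sketched as follows, using N = 4 as in the example. The names `filter_by_slope` and `pick_centers` are hypothetical, and how to split a document set whose size is not divisible by N is an assumption (integer division is used here).

```python
def filter_by_slope(rho, delta, n_parts=4):
    """Keep documents whose slope rho/delta is in the middle parts."""
    def slope(i):
        return rho[i] / delta[i] if delta[i] else float("inf")
    order = sorted(range(len(rho)), key=slope, reverse=True)
    size = len(order) // n_parts          # size of one part (assumed rounding)
    return order[size:len(order) - size]  # drop the first and last parts

def pick_centers(rho, delta, k, n_parts=4):
    """Among the retained documents, take the k largest areas rho * delta."""
    kept = filter_by_slope(rho, delta, n_parts)
    return sorted(kept, key=lambda i: rho[i] * delta[i], reverse=True)[:k]

rho = [10, 1, 4, 6, 3, 2, 8, 5]
delta = [1, 10, 2, 4, 6, 8, 20, 4]
kept = filter_by_slope(rho, delta)
centers = pick_centers(rho, delta, k=2)
```

Documents 0 and 2 (very steep slopes) and documents 5 and 1 (very flat slopes) are excluded before the areas are ranked.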

For feature information extracted from domain documents, the feature values of a document without distinctive features are generally 0. Suppose there are two documents, A and B, where document A contains only a short passage of plain text and document B contains only a picture; their feature values are all 0. With a general distance metric, their distance is 0 and they are treated as two identical documents, which is clearly wrong. Domain documents are usually written to certain templates, so that they can be recognized by their chapter titles and the like. Ideally, document instances with the same or similar templates are grouped into one cluster, and some of them are chosen by the algorithm as initial cluster centers. The actual clustering process, however, is disturbed by the documents that lack distinctive features. To counter this interference in advance, the initial cluster centers should as far as possible be well-formed documents. In the embodiments of the present invention, the distance between two documents can be determined as follows: determine the number of identical feature information items shared by the two documents, and take the reciprocal of that number as the distance between the two documents.

It is also preferred that the left the distance between two documents can be determined according to below equation:

d_{ij} = \frac{1}{\sum_{m=1}^{M} \left( f_{m}^{(i)} \wedge f_{m}^{(j)} \right) + \epsilon}

In the formula, M is the dimension of the feature vector, i.e., the number of features a document may contain; m is the index of an item of characteristic information in document i and document j; and ε is a very small constant whose purpose is to avoid a calculation error caused by a zero denominator when the two documents share no characteristic information. The distance between two documents is thus the reciprocal of the number of items of characteristic information they share, which the formula expresses as the reciprocal of the sum of the AND results over the characteristic information. If neither document has distinctive features, i.e., all of their characteristic information is 0, the number of shared items is 0 and the distance between the two is effectively infinite (1/ε in the formula).
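As a minimal sketch of this distance, assuming each document is represented by a binary feature vector of equal length (the function name and the default value of ε are illustrative choices, not values given in the embodiment):

```python
def document_distance(features_i, features_j, eps=1e-6):
    """Reciprocal of the number of features present in both documents."""
    if len(features_i) != len(features_j):
        raise ValueError("feature vectors must have the same dimension M")
    # Logical AND per feature, then the sum over all M features.
    shared = sum(1 for a, b in zip(features_i, features_j) if a and b)
    # eps keeps the denominator non-zero when nothing is shared.
    return 1.0 / (shared + eps)
```

Two documents whose feature vectors are all 0 therefore end up about 1/ε apart rather than at distance 0, which is the behavior the paragraph above asks for.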

After the documents contained in the document set have been clustered, new concepts can be abstracted from the clustering result. Specifically, the clustering quality is examined first, and one or more clusters with good clustering quality are chosen; by comparing the clustering result with the number of documents each cluster contains, at least one concept sufficient to summarize these documents is extracted. Preferably, concepts are abstracted first from clusters that are well formed and contain many documents.

According to the extracted concept, the clusters obtained by clustering are divided into positive example clusters and negative example clusters: a cluster that belongs to the concept is a positive example cluster, and a cluster that does not belong to the concept is a negative example cluster. A preset number of documents are selected from the positive example clusters and the negative example clusters respectively and labeled; for example, 30% of the documents in the positive example clusters are randomly selected as positive example documents, and 30% of the documents in the negative example clusters are randomly selected as negative example documents. In a specific implementation, the number of documents to be labeled in each cluster may be allocated in proportion to the number of documents in that cluster, while ensuring that at least one document in every cluster is labeled.
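The proportional allocation can be sketched as below. The function name, the dictionary layout, and the fixed random seed are assumptions for illustration; the 30% figure is the example ratio from the text.

```python
import random


def pick_documents_to_label(clusters, fraction=0.3, seed=0):
    """Choose documents to label from every cluster.

    clusters maps a cluster id to the list of document ids it contains.
    Each cluster contributes about `fraction` of its documents, and
    never fewer than one.
    """
    rng = random.Random(seed)
    picked = {}
    for cluster_id, docs in clusters.items():
        if not docs:
            continue
        # Quota proportional to cluster size, clamped to [1, len(docs)].
        quota = min(len(docs), max(1, int(len(docs) * fraction)))
        picked[cluster_id] = rng.sample(docs, quota)
    return picked
```

Applying the same function to the positive example clusters and to the negative example clusters yields the labeled positive and negative example documents, which can then be reviewed manually.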

It should be noted that, to improve the accuracy of the domain ontology construction method, in the embodiment of the present invention the labeling results may be checked and corrected by manual inspection during the labeling of documents. Specifically, the labeled positive examples are checked to ensure the accuracy of the labeling, and documents in the positive example clusters that are clearly not positive examples are identified and labeled as negative examples.

The document classifier is determined using the above labeling results. Specifically, as shown in Fig. 6, the following steps may be included:

S61: Divide the selected documents into a training document set and a test document set according to a preset ratio.

In a specific implementation, the documents selected from the positive example clusters and the negative example clusters respectively (i.e., the documents labeled above) are divided into a training document set and a test document set according to a certain ratio (for example, 2:1).

S62: Train an SVM classifier using the documents in the training document set.

S63: Test the resulting SVM classifier using the documents in the test document set.

S64: Judge whether the test result is greater than a preset threshold; if so, perform step S65; otherwise, perform step S66.

S65: Determine the resulting SVM classifier to be the document classifier.

In a specific implementation, once the document classifier is obtained, it is used to classify the remaining documents (i.e., the unlabeled documents) into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept. It is then judged whether the number of documents contained in the second document set is greater than a preset threshold; if so, these documents enter the next iteration and take part in the next round of concept abstraction and document classification. It should be noted that the second document set includes both the documents labeled as negative examples above and the documents that the document classifier determines not to belong to the extracted concept; that is, steps S12 to S17 are re-executed for the negative example documents.

After multiple iterations, a certain number of concepts can be extracted and the documents belonging to each extracted concept determined. However, since the document classifier requires a certain amount of training data, if the number of documents contained in the second document set is below the preset threshold, the extraction of all remaining concepts may be completed by clustering combined with manual concept extraction.

S66: Re-execute the step of clustering the documents with the clustering algorithm according to the extracted characteristic information to obtain at least one cluster, until the test result of the resulting SVM classifier meets the preset threshold.

If the test result of the resulting SVM classifier is not greater than the preset threshold, steps S12 to S17 are re-executed for all documents.
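Steps S61 to S65 can be sketched as follows. This is a hedged illustration only: the function name, the callable `train_fn` standing in for SVM training, and the use of accuracy as the test result are assumptions; the embodiment does not specify which measure is compared with the threshold.

```python
import random


def build_document_classifier(labeled, train_fn, threshold,
                              train_parts=2, test_parts=1, seed=0):
    """Split labeled documents, train, test, and accept or reject.

    labeled   : list of (features, label) pairs, label True for positive.
    train_fn  : hypothetical trainer; takes the training pairs and
                returns a predict(features) function (an SVM in S62).
    threshold : minimum test accuracy needed to accept the classifier.
    Returns (predict or None, accuracy); None corresponds to step S66.
    """
    rng = random.Random(seed)
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append((features, label))
    train, test = [], []
    # S61: split positives and negatives separately, e.g. 2:1.
    for items in by_label.values():
        rng.shuffle(items)
        cut = len(items) * train_parts // (train_parts + test_parts)
        train.extend(items[:cut])
        test.extend(items[cut:])
    if not train or not test:
        raise ValueError("need documents in both the training and test sets")
    predict = train_fn(train)                                   # S62
    hits = sum(1 for f, lab in test if predict(f) == lab)       # S63
    accuracy = hits / len(test)
    return (predict if accuracy > threshold else None), accuracy  # S64, S65
```

A caller that receives `None` would go back to clustering, as in step S66, rather than use the rejected classifier.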

The embodiment of the present invention provides, for the domain ontology construction process, a learning-based concept mining method in which concepts are created from nothing. That is, with all concepts unknown in advance, concepts are abstracted during iteration with the aid of a clustering algorithm. On the one hand, this saves the manpower otherwise spent in the early stage of concept abstraction on summarizing the concept set by manually observing the data. On the other hand, because the method needs neither up-front rule definitions nor manual enumeration of the concept set, but generates concepts during iteration, it resolves the document misclassification caused by missing concepts and ensures the accuracy of concept abstraction and document classification, thereby ensuring the correctness of the domain ontology construction.
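The iterative flow summarized above can be sketched as a loop over three pluggable steps. All names here are hypothetical placeholders: `cluster_fn`, `extract_concept_fn`, and `classify_fn` stand for the clustering, concept abstraction (possibly manual), and binary classification stages, and the early exits are safeguards added for this sketch.

```python
def mine_concepts(documents, cluster_fn, extract_concept_fn, classify_fn, min_docs):
    """Abstract one concept per iteration until too few documents remain.

    Returns (concepts, remaining): a dict mapping each extracted concept
    to its documents, and the documents left for manual handling.
    """
    concepts = {}
    remaining = list(documents)
    while len(remaining) >= min_docs:
        clusters = cluster_fn(remaining)
        concept = extract_concept_fn(clusters)
        if concept is None:
            break  # nothing further could be abstracted
        belonging, not_belonging = classify_fn(concept, clusters, remaining)
        if not belonging:
            break  # safeguard: stop if an iteration makes no progress
        concepts[concept] = belonging
        # The second document set feeds the next iteration.
        remaining = not_belonging
    return concepts, remaining
```

Because only the documents outside the current concept are carried forward, each round works on a smaller and more homogeneous set than the one before it.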

The embodiment of the present invention also proposes a method for determining the initial cluster centers. The distance between two documents is determined from the result of the AND operation on their characteristic information, so that documents sharing the same features are closer to each other; the local density of a document serves as the measure for choosing initial cluster centers; and deviation points whose measured values are too high are filtered out, so that the interference of documents with indistinct features is eliminated from the selection of initial cluster centers.

Finally, the embodiment of the present invention classifies documents with a binary classifier rather than a multi-class classifier. Combined with the iterative mode, documents whose concept is ambiguous or whose features are indistinct are deferred to the next iteration for finer analysis, which improves the classification accuracy of each iteration and avoids interference from outlier documents. Meanwhile, because the documents of other concepts enter the next iteration and new concepts are continually added, no concept is missed as a result of a concept set fixed in advance, so that the finally extracted concepts have a higher coverage.

Based on the same inventive concept, the embodiment of the present invention also provides a domain ontology construction apparatus. Since the principle by which the apparatus solves the problem is similar to that of the domain ontology construction method, the implementation of the apparatus may refer to the implementation of the method, and repeated parts are not described again.

Fig. 7 is a schematic structural diagram of the domain ontology construction apparatus provided by the embodiment of the present invention, which includes:

a first extraction unit 71, configured to extract the characteristic information of each document contained in a document set;

a clustering unit 72, configured to cluster, according to the characteristic information extracted by the first extraction unit, the documents contained in the document set by using a clustering algorithm to obtain K₁ clusters, where K₁ is a positive integer; and, for the documents contained in the second document set obtained by the document division unit, to cluster the documents contained in the second document set according to the characteristic information extracted by the first extraction unit to obtain K₂ clusters, where K₂ is a positive integer;

a second extraction unit 73, configured to extract at least one domain concept from the clusters obtained by the clustering unit;

a cluster division unit 74, configured to divide each cluster obtained by clustering into a positive example cluster or a negative example cluster according to whether it belongs to the extracted domain concept;

a first determining unit 75, configured to select a preset number of documents from the positive example clusters and the negative example clusters respectively, and to determine a document classifier according to the selected documents;

a document division unit 76, configured to classify, by using the document classifier determined by the first determining unit, the unselected documents into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept; and, when the number of documents contained in the second document set is not less than a preset value, to trigger the clustering unit to perform, for the documents contained in the second document set, the operation of clustering the documents contained in the second document set by using the clustering algorithm according to the characteristic information extracted by the first extraction unit to obtain K₂ clusters.

The clustering unit 72 may include:

a selection subunit, configured to select K₁ documents from the document set as initial cluster center points;

a first cluster division subunit, configured to assign each document contained in the document set to the same cluster as the initial cluster center point closest to it;

a first determining subunit, configured to determine, for each cluster obtained, the center point of the cluster as a new cluster center point; and

a second cluster division subunit, configured to re-divide the documents contained in the document set into K₁ clusters according to the minimum distance between each document contained in the document set and the new cluster center points determined by the first determining subunit; and, if the resulting K₁ clusters have changed, to trigger the first determining subunit to again perform, for each cluster obtained by the division, the operation of determining the center point of the cluster as a new cluster center point.
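The cooperation of these subunits can be sketched as the loop below. It is an illustration under stated assumptions: documents are addressed by index into a precomputed distance matrix whose diagonal is 0, and the center point of a cluster is taken to be the member with the smallest total distance to the other members (a medoid), since the text does not define how the center point is computed.

```python
def cluster_documents(dist, initial_centers, max_rounds=100):
    """Assign documents to the nearest center and update centers until stable.

    dist            : square matrix, dist[i][j] = distance between docs i and j.
    initial_centers : indices of the documents chosen as initial centers.
    Returns (centers, clusters), clusters being lists of document indices.
    """
    centers = list(initial_centers)
    clusters = None
    for _ in range(max_rounds):
        # Cluster division: every document joins its nearest center.
        grouped = {c: [] for c in centers}
        for doc in range(len(dist)):
            nearest = min(centers, key=lambda c: dist[doc][c])
            grouped[nearest].append(doc)
        new_clusters = [members for members in grouped.values() if members]
        if new_clusters == clusters:
            break  # memberships no longer change
        clusters = new_clusters
        # Center update (assumed medoid): member closest to all others.
        centers = [min(members, key=lambda m: sum(dist[m][o] for o in members))
                   for members in clusters]
    return centers, clusters
```

The `max_rounds` cap is a safeguard added for the sketch; the text itself stops only when the clusters no longer change.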

The selection subunit may include:

a first determining module, configured to determine, for each document contained in the document set, the local density of the document;

a second determining module, configured to determine, for each document, the minimum distance between the document and the documents whose local density is greater than that of the document;

a plotting module, configured to plot a two-dimensional graph using as coordinates, for each document contained in the document set, its local density and the minimum distance between it and the documents whose local density is greater than its own;

a selection module, configured to select the top K₁ documents as initial cluster centers in descending order of the area of the rectangle each document forms with the coordinate axes.
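The quantity computed by the second determining module can be sketched as follows. The function name is illustrative, and the treatment of the densest document, which has no denser neighbor, is an assumption borrowed from common density-peak practice rather than something stated here.

```python
def min_distance_to_denser(dist, rho):
    """For each document, the minimum distance to any denser document.

    dist : square distance matrix; rho : local density per document.
    """
    n = len(rho)
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Assumption: a document with no denser neighbor gets its
        # largest distance to any document instead.
        delta.append(min(denser) if denser else max(dist[i]))
    return delta
```

Each document then contributes the point (ρ, δ) to the two-dimensional graph, and the product ρ·δ is the rectangle area used for ranking.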

In a specific implementation, the first determining module is specifically configured to determine the local density of each document contained in the document set according to the following formula:

\rho_{i} = \sum_{j} \chi(d_{ij} - d_{c})

where \chi(d_{ij} - d_{c}) = 1 if d_{ij} \geq d_{c}, and \chi(d_{ij} - d_{c}) = 0 if d_{ij} < d_{c}, and wherein:

i and j are document identifiers;

ρ_i is the local density of document i;

d_ij is the distance between document i and document j;

d_c is a preset distance threshold.

In a specific implementation, the selection subunit may also include:

a third determining module, configured to determine d_c as follows: for each document contained in the document set, sort the distances between the document and the other documents in ascending order and select the distance at a preset ratio as the distance threshold corresponding to the document; sort the distance thresholds corresponding to the documents contained in the document set in ascending order; and determine the first quartile of the sorted distance thresholds to be d_c.
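One possible reading of this procedure is sketched below. The function name, the way the preset ratio is turned into a position in the sorted list, and the quartile convention (the element at one quarter of the sorted thresholds) are all assumptions; the text does not fix these details.

```python
def distance_threshold(dist, ratio):
    """Derive d_c from per-document distance thresholds.

    dist  : square distance matrix.
    ratio : preset ratio in [0, 1) selecting a position in each
            document's ascending list of distances to other documents.
    """
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two documents")
    per_doc = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        # Distance found at the preset ratio of the sorted list.
        pos = min(len(others) - 1, int(len(others) * ratio))
        per_doc.append(others[pos])
    per_doc.sort()
    # Assumed quartile convention: element one quarter of the way in.
    return per_doc[len(per_doc) // 4]
```

Taking a low quartile of the per-document thresholds keeps d_c from being inflated by documents that lie far from everything else.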

In a specific implementation, the selection module may also be configured to, before selecting the top K₁ documents as initial cluster centers in descending order of the area of the rectangle formed with the coordinate axes, determine, for each document contained in the document set, the ratio of the local density of the document to the minimum distance between the document and the documents whose local density is greater than that of the document; and, according to the ratio corresponding to each document, retain in the two-dimensional graph the local density and the minimum distance of the documents whose ratio lies between a ratio maximum and a ratio minimum.

Preferably, the selection module may be configured to sort the documents contained in the document set by their corresponding ratios; divide the sorted document set evenly into N subsets, N being a positive integer greater than 2; select the documents contained in the subsets other than the first subset and the last subset; and delete from the two-dimensional graph the local density and the minimum distance corresponding to the documents other than the selected documents.

In a specific implementation, the domain ontology construction apparatus provided by the embodiment of the present invention may also include a second determining unit, wherein:

the clustering unit 72 may also be configured to cluster the documents contained in the document set using each value within a preset value range of K₁ respectively;

the second determining unit may be configured to determine, for each value in the value range, the silhouette coefficient corresponding to the clustering result obtained by clustering the documents contained in the document set with that value; and to determine the value corresponding to the largest silhouette coefficient to be K₁.

In a specific implementation, the second determining unit may include:

a second determining subunit, configured to, for each value in the value range and for each document contained in each cluster obtained by clustering the documents contained in the document set with that value, determine the silhouette coefficient of the document according to the following formula:

S_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}

wherein: S_i is the silhouette coefficient of the document; a_i is the average distance between the document and the other documents in the cluster to which it belongs; b_i is the minimum of the average distances between the document and the other clusters;

a third determining subunit, configured to determine the average of the silhouette coefficients of the documents contained in the document set to be the silhouette coefficient corresponding to the clustering result obtained by clustering the documents contained in the document set with that value.
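The selection of K₁ by silhouette coefficient can be sketched as follows. The function names and the `cluster_fn(dist, k)` callable are hypothetical, and giving a document in a single-member cluster a score of 0 is a common convention assumed here, not something the text states.

```python
def silhouette_score(dist, clusters):
    """Mean silhouette coefficient of a clustering result.

    dist     : square distance matrix.
    clusters : list of clusters, each a list of document indices.
    """
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for own in clusters:
        for i in own:
            same = [dist[i][j] for j in own if j != i]
            if not same:
                scores.append(0.0)  # assumed convention for singletons
                continue
            a = sum(same) / len(same)
            # Smallest average distance to the members of another cluster.
            b = min(sum(dist[i][j] for j in other) / len(other)
                    for other in clusters if other is not own)
            top = max(a, b)
            scores.append((b - a) / top if top else 0.0)
    return sum(scores) / len(scores)


def choose_k(dist, candidate_ks, cluster_fn):
    """Return the candidate k whose clustering has the largest score."""
    return max(candidate_ks,
               key=lambda k: silhouette_score(dist, cluster_fn(dist, k)))
```

Scores near 1 indicate compact, well-separated clusters, so the value of K₁ with the highest mean score is kept.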

In a specific implementation, the domain ontology construction apparatus provided by the embodiment of the present invention may also include a document distance determining unit, configured to determine the number of identical items of characteristic information shared by two documents; and to take the reciprocal of that number as the distance between the two documents.

In a specific implementation, the first determining unit 75 may include:

a document test subunit, configured to divide the selected documents into a training document set and a test document set according to a preset ratio;

a training subunit, configured to train a support vector machine (SVM) classifier using the documents in the training document set to obtain an SVM classifier;

a test subunit, configured to test the resulting SVM classifier using the documents in the test document set; and, when the test result does not meet a preset threshold, to trigger the clustering unit to perform, for the documents contained in the second document set, the operation of clustering the documents by using the clustering algorithm according to the characteristic information extracted by the first extraction unit to obtain at least one cluster;

a fourth determining subunit, configured to determine the resulting SVM classifier to be the document classifier if the test result meets the preset threshold.

For convenience of description, the parts above are divided by function into modules (or units) and described separately. Of course, when implementing the present invention, the functions of the modules (or units) may be realized in one or more pieces of software or hardware.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make other changes and modifications to these embodiments once they grasp the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A domain ontology construction method, characterized by comprising:

2. The method of claim 1, characterized in that clustering, according to the extracted characteristic information, the documents contained in the document set by using a clustering algorithm to obtain K₁ clusters specifically comprises:

selecting K₁ documents from the document set as initial cluster center points;

assigning each document contained in the document set to the same cluster as the initial cluster center point closest to it;

for each cluster obtained, determining the center point of the cluster as a new cluster center point; and

re-dividing the documents contained in the document set into K₁ clusters according to the minimum distance between each document contained in the document set and the new cluster center points;

repeating the steps of determining, for each cluster obtained, the center point of the cluster as a new cluster center point, and re-dividing the documents contained in the document set into K₁ clusters according to the minimum distance between each document contained in the document set and the new cluster center points, until the documents contained in each cluster no longer change.

3. The method of claim 1, characterized in that K₁ is determined as follows:

clustering the documents contained in the document set using each value within a preset value range of K₁ respectively;

for each value in the value range, determining the silhouette coefficient corresponding to the clustering result obtained by clustering the documents contained in the document set with that value;

determining the value corresponding to the largest silhouette coefficient to be K₁.

4. The method of claim 3, characterized in that, for each value in the value range, determining the silhouette coefficient corresponding to the clustering result obtained by clustering the documents contained in the document set with that value specifically comprises:

for each value in the value range, and for each document contained in each cluster obtained by clustering the documents contained in the document set with that value, determining the silhouette coefficient of the document according to the following formula:

S_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})},

wherein:

i and j are document identifiers;

S_i is the silhouette coefficient of the document;

a_i is the average distance between the document and the other documents in the cluster to which it belongs;

b_i is the minimum of the average distances between the document and the other clusters;

determining the average of the silhouette coefficients of the documents contained in the document set to be the silhouette coefficient corresponding to the clustering result obtained by clustering the documents contained in the document set with that value.

5. The method of claim 2, characterized in that selecting K₁ documents from the document set as initial cluster center points specifically comprises:

for each document contained in the document set, determining the local density of the document;

for each document, determining the minimum distance between the document and the documents whose local density is greater than that of the document;

plotting a two-dimensional graph using as coordinates, for each document contained in the document set, its local density and the minimum distance between it and the documents whose local density is greater than its own;

selecting the top K₁ documents as initial cluster centers in descending order of the area of the rectangle each document forms with the coordinate axes.

6. The method of claim 5, characterized in that, for each document contained in the document set, determining the local density of the document specifically comprises:

determining the local density of each document contained in the document set according to the following formula:

\rho_{i} = \sum_{j} \chi(d_{ij} - d_{c})

where \chi(d_{ij} - d_{c}) = 1 if d_{ij} \geq d_{c}, and \chi(d_{ij} - d_{c}) = 0 if d_{ij} < d_{c}, and wherein:

i and j are document identifiers;

ρ_i is the local density of document i;

d_ij is the distance between document i and document j;

d_c is a preset distance threshold.

7. The method of claim 6, characterized in that d_c is determined as follows:

for each document contained in the document set, sorting the distances between the document and the other documents in ascending order and selecting the distance at a preset ratio as the distance threshold corresponding to the document;

sorting the distance thresholds corresponding to the documents contained in the document set in ascending order;

determining the first quartile of the sorted distance thresholds to be d_c.

8. The method of claim 5, characterized in that, before selecting the top K₁ documents as initial cluster centers in descending order of the area of the rectangle formed with the coordinate axes, the method further comprises:

for each document contained in the document set, determining the ratio of the local density of the document to the minimum distance between the document and the documents whose local density is greater than that of the document;

according to the ratio corresponding to each document, retaining in the two-dimensional graph the local density and the minimum distance of the documents whose ratio lies between a ratio maximum and a ratio minimum.

9. The method of claim 8, characterized in that, according to the ratio corresponding to each document, retaining in the two-dimensional graph the local density and the minimum distance of the documents whose ratio lies between the ratio maximum and the ratio minimum specifically comprises:

sorting the documents contained in the document set by their corresponding ratios;

dividing the sorted document set evenly into N subsets, N being a positive integer greater than 2;

selecting the documents contained in the subsets other than the first subset and the last subset;

deleting from the two-dimensional graph the local density and the minimum distance corresponding to the documents other than the selected documents.

10. The method of any one of claims 1 to 9, characterized in that, for any two documents contained in the document set, the distance between the two documents is determined as follows:

determining the number of identical items of characteristic information shared by the two documents;

taking the reciprocal of the number of identical items of characteristic information shared by the two documents as the distance between the two documents.

11. The method of claim 1, characterized in that selecting a preset number of documents from the positive example clusters and the negative example clusters respectively, and determining a document classifier according to the selected documents, specifically comprises:

dividing the selected documents into a training document set and a test document set according to a preset ratio;

training a support vector machine (SVM) classifier using the documents in the training document set to obtain an SVM classifier;

testing the resulting SVM classifier using the documents in the test document set;

if the test result meets a preset threshold, determining the resulting SVM classifier to be the document classifier;

if the test result does not meet the preset threshold, returning to the step of clustering the documents by using the clustering algorithm according to the extracted characteristic information to obtain at least one cluster, until the test result of the resulting SVM classifier meets the preset threshold.

12. A domain ontology construction apparatus, characterized by comprising:

a clustering unit, configured to cluster, according to the characteristic information extracted by the first extraction unit, the documents contained in the document set by using a clustering algorithm to obtain K₁ clusters, where K₁ is a positive integer; and, for the documents contained in the second document set obtained by the document division unit, to cluster the documents contained in the second document set according to the characteristic information extracted by the first extraction unit to obtain K₂ clusters, where K₂ is a positive integer;

a document division unit, configured to classify, by using the document classifier determined by the determining unit, the unselected documents into a first document set belonging to the extracted domain concept and a second document set not belonging to the extracted domain concept; and, when the number of documents contained in the second document set is not less than a preset value, to trigger the clustering unit to perform, for the documents contained in the second document set, the operation of clustering the documents contained in the second document set by using the clustering algorithm according to the characteristic information extracted by the first extraction unit to obtain K₂ clusters.

13. device as claimed in claim 12, it is characterised in that described cluster cell, specifically include:

14. The apparatus according to claim 12, characterized by further comprising a second determination unit, wherein:

the clustering unit is further configured to cluster the documents contained in the document set using, in turn, each value contained in a preset value range of K₁;

the second determination unit is configured to determine, for each value in the value range, the silhouette coefficient of the clustering result obtained by clustering the documents contained in the document set with that value; and to determine the value corresponding to the largest silhouette coefficient as K₁.
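The selection of K₁ by largest silhouette coefficient can be sketched as follows. This is an illustrative sketch using the standard mean silhouette definition over a pairwise distance matrix; `silhouette`, `choose_k` and `cluster_fn` are hypothetical names, and `cluster_fn(k)` is assumed to return one cluster label per document for a given candidate value.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def silhouette(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient computed from a pairwise distance matrix."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return -1.0  # assumption: a single cluster gets the worst score
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0
        a = sum(dist[i][j] for j in own) / len(own)  # mean intra-cluster distance
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)


def choose_k(
    dist: Sequence[Sequence[float]],
    candidates: Iterable[int],
    cluster_fn: Callable[[int], Sequence[int]],
) -> int:
    """Return the candidate value whose clustering has the largest silhouette."""
    return max(candidates, key=lambda k: silhouette(dist, cluster_fn(k)))
```

If two candidate values tie, `max` keeps the first one encountered; the claim does not specify a tie-breaking rule.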

15. The apparatus according to claim 14, characterized in that the second determination unit comprises:

16. The apparatus according to claim 13, characterized in that the selection subunit specifically comprises:

17. The apparatus according to claim 16, characterized in that

the first determination module is specifically configured to determine the local density of each document contained in the document set according to the following formula:

ρ_i = Σ_j χ(d_ij − d_c)

where χ(d_ij − d_c) = 1 if d_ij ≥ d_c, and χ(d_ij − d_c) = 0 if d_ij < d_c, wherein:

i and j are document identifiers;

ρ_i is the local density of document i;

d_ij is the distance between document i and document j;

d_c is a preset distance threshold.
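The local density count can be sketched as below. This is a minimal illustration that follows the indicator χ exactly as worded in the claim (1 when d_ij ≥ d_c, 0 otherwise); the function name `local_density` and the exclusion of j = i are assumptions.

```python
from typing import List, Sequence


def local_density(dist: Sequence[Sequence[float]], d_c: float) -> List[int]:
    """rho_i = sum over j != i of chi(d_ij - d_c).

    chi is 1 when d_ij >= d_c and 0 when d_ij < d_c, as the claim words it.
    """
    n = len(dist)
    return [
        # Skipping j == i is an assumption; the claim does not address it.
        sum(1 for j in range(n) if j != i and dist[i][j] >= d_c)
        for i in range(n)
    ]
```

Density-peaks clustering as commonly described counts the neighbours with d_ij < d_c instead; swapping the comparison in the sketch yields that variant.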

18. The apparatus according to claim 17, characterized in that the selection subunit further comprises:

a third determination module, configured to determine d_c as follows: for each document contained in the document set, select, in ascending order of the distances between that document and the other documents, the distance at a preset proportion as the distance threshold corresponding to that document; sort the distance thresholds corresponding to the documents contained in the document set in ascending order; and determine the first quartile of the sorted distance thresholds as d_c.
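The procedure for choosing d_c can be sketched as follows. This is an illustrative sketch only: `cutoff_distance` and `proportion` are hypothetical names, and both the way the "preset proportion" position is picked and the quartile convention are assumed readings, since the claim does not fix them.

```python
from typing import List, Sequence


def cutoff_distance(dist: Sequence[Sequence[float]], proportion: float) -> float:
    """Pick d_c as the first quartile of the per-document distance thresholds.

    Expects at least two documents.
    """
    n = len(dist)
    thresholds: List[float] = []
    for i in range(n):
        # Distances from document i to every other document, ascending.
        others = sorted(dist[i][j] for j in range(n) if j != i)
        # Assumed reading: take the distance located at the given
        # proportion of the ascending list.
        pos = min(len(others) - 1, int(proportion * len(others)))
        thresholds.append(others[pos])
    thresholds.sort()
    # Assumed quartile convention: the element at the 25% position.
    return thresholds[int(0.25 * (len(thresholds) - 1))]
```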

19. The apparatus according to claim 16, characterized in that

the selection module is further configured to, before selecting the first K₁ documents as initial cluster centers in descending order of the area of the rectangle formed with the coordinate axes: determine, for each document contained in the document set, the ratio of the local density corresponding to the document to the minimum distance between the document and the documents whose local density is greater than that of the document; and, according to the ratio corresponding to each document, retain in the two-dimensional graph the local density, and the minimum distance to documents of greater local density, of those documents whose ratio lies between the maximum ratio and the minimum ratio.
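The two per-document quantities used here, the minimum distance to a denser document and the density-to-distance ratio, can be sketched as below. This is a sketch under stated assumptions: the function names are hypothetical, and the fallback for the densest document and the handling of a zero distance are choices the claim does not specify.

```python
import math
from typing import List, Sequence


def min_distance_to_denser(
    dist: Sequence[Sequence[float]], rho: Sequence[float]
) -> List[float]:
    """For each document, the minimum distance to any document of greater density."""
    n = len(rho)
    delta: List[float] = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Assumption: a document with no denser neighbour falls back to
        # its largest distance to any document.
        delta.append(min(denser) if denser else max(dist[i]))
    return delta


def density_distance_ratio(
    rho: Sequence[float], delta: Sequence[float]
) -> List[float]:
    """Ratio of local density to the minimum distance to a denser document."""
    # Assumption: a zero distance maps to an infinite ratio.
    return [r / d if d > 0 else math.inf for r, d in zip(rho, delta)]
```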

20. The apparatus according to claim 19, characterized in that

the selection module is specifically configured to sort the documents contained in the document set according to their corresponding ratios; divide the sorted document set evenly into N subsets, N being a positive integer greater than 2; select the documents contained in the subsets other than the first subset and the last subset; and delete from the two-dimensional graph the local density, and the minimum distance to documents of greater local density, corresponding to the documents other than the selected documents.
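The trimming by ratio and the subsequent choice of initial cluster centers can be sketched as follows. This is a minimal sketch: `trim_by_ratio` and `initial_centers` are hypothetical names, the rectangle area is assumed to be the product of local density and minimum distance, and the handling of a remainder when the documents do not divide evenly into N subsets is also an assumption.

```python
from typing import List, Sequence


def trim_by_ratio(ratios: Sequence[float], n_subsets: int) -> List[int]:
    """Indices of the documents kept after dropping the first and last subsets."""
    if n_subsets <= 2:
        raise ValueError("n_subsets must be greater than 2")
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    # Assumption: any remainder from the even split stays in the middle.
    size = len(order) // n_subsets
    return order[size:len(order) - size]


def initial_centers(
    rho: Sequence[float], delta: Sequence[float], keep: Sequence[int], k: int
) -> List[int]:
    """Top-k kept documents by the assumed rectangle area rho_i * delta_i."""
    ranked = sorted(keep, key=lambda i: rho[i] * delta[i], reverse=True)
    return ranked[:k]
```

Dropping the extreme subsets removes documents whose density is very high relative to their distance, or the reverse, before the area ranking is applied.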

21. The apparatus according to any one of claims 12 to 20, characterized by further comprising:

a document distance determination unit, configured to determine the number of identical items of characteristic information shared by two documents; and to use the reciprocal of that number as the distance between the two documents.
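The reciprocal distance can be sketched in a few lines. This is an illustration only: `document_distance` is a hypothetical name, characteristic information is assumed to be representable as a set of strings, and returning infinity when two documents share nothing is an assumption, since the claim does not cover a zero count.

```python
import math
from typing import Iterable


def document_distance(features_a: Iterable[str], features_b: Iterable[str]) -> float:
    """Reciprocal of the number of characteristic items two documents share."""
    shared = len(set(features_a) & set(features_b))
    # Assumption: documents with nothing in common are infinitely far apart.
    return 1.0 / shared if shared else math.inf
```

For example, two documents sharing two items of characteristic information are at distance 0.5, so the distance shrinks as the overlap grows.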

22. The apparatus according to claim 12, characterized in that the first determination unit specifically comprises: