CN104008090A

CN104008090A - Multi-subject extraction method based on concept vector model

Info

Publication number: CN104008090A
Application number: CN201410178231.1A
Authority: CN
Inventors: 马甲林; 王志坚
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2014-04-29
Filing date: 2014-04-29
Publication date: 2014-08-27

Abstract

The invention provides a multi-subject extraction method based on a concept vector model. The method includes the following steps that firstly, a document is preprocessed through a traditional method and then vectors formed by feature words are preliminarily acquired; then synonyms are merged through the corresponding relation between word meanings and concepts in Hownet, disambiguation is conducted on polysemes through correlation between semantic classes and contexts, and the concept vector model is established to represent the document; concept similarity is calculated through related semantic information of the concepts in Hownet, a K-means algorithm is improved through a 'preset seed' method for clustering of the concepts, and then a plurality of subject concept clusters are formed; eventually, according to the corresponding relation between the concepts and words, a plurality of sub subject term sets are acquired. According to the method, semantic information is taken into consideration, the defects of sensitivity of the K-means algorithm to an initial center, space-time cost instability and the like are overcome, and the quality of extracted subjects is improved.

Description

A multi-topic extraction method based on a concept vector model

Technical field

The present invention relates to the technical field of text information extraction, and in particular to a multi-topic extraction method based on a concept vector model.

Background art

Since human society entered the information age, electronic texts have appeared in enormous numbers. Many of them are multi-topic texts that carry rich topic information on several aspects. For example, a news report about Premier Li Keqiang's visit to Europe is both political news and economic news. As science and technology develop, disciplines become increasingly intertwined and most research spans several fields, so many scientific texts cover multiple topics from different angles; a text on mining biological gene information, for instance, contains topics from both computer science and biomedicine. Real-world text collections therefore contain a large number of multi-topic texts, and extracting from them several valuable sub-topics that reflect different aspects has wide application in fields such as information retrieval and library and information science.

Research on text topic extraction began abroad in the 1950s. The relatively mature approach today is based on statistical models and mainly uses word-frequency statistics to extract topics; researchers later added factors such as title, position, syntactic structure, and cue words, which makes it possible to extract high-quality topics from English text. Research on topic extraction in China began in the late 1980s, but because of the complexity of the Chinese language, many topic extraction methods that succeed on English are not suitable for Chinese.

At present, the methods most widely used in China are still statistical. They work under the vector space model (VSM), which assumes that the vectors are pairwise orthogonal, that is, that the words making up a text are unrelated to one another. This clearly contradicts the fact that word meaning depends on context. Moreover, because the Chinese vocabulary is very large, VSM representations are inevitably high-dimensional and sparse and ignore word semantics and context; extraction is also disturbed by synonyms and polysemous words, so both quality and efficiency are unsatisfactory. Current research on topic extraction focuses on how to add semantic information. Although many scholars have proposed semantics-based topic extraction methods, none has yet achieved a breakthrough at the application level. In addition, multi-topic extraction differs greatly from single-topic extraction algorithmically: identifying several sub-topic words in one text is difficult to achieve with traditional word-frequency statistics alone. The community partitioning algorithm on complex networks proposed by Liao Tao et al. can extract multiple topics, but it does not use the semantic information of words; it is a purely statistical method, and the extracted topics are of low quality.

Therefore, given that existing text processing techniques based on word-frequency statistics can extract only a single topic from a text, and that they suffer from low algorithmic efficiency and low-quality topic words because of high-dimensional, sparse vectors and the lack of word-sense and context information, a multi-topic extraction method based on a concept vector model is needed.

Summary of the invention

The technical problem to be solved by the present invention is that traditional text processing techniques based on word-frequency statistics can extract only a single topic from a text and, because of high-dimensional, sparse vectors and the lack of word-sense and context information, are inefficient and yield low-quality topic words. To address this, the invention provides a multi-topic extraction method based on a concept vector model. The method uses the HowNet semantic knowledge base to map the feature words that represent a text to concepts one by one, so that the text is expressed as a concept model. Synonyms are merged automatically into the same concept during mapping, which reduces the vector dimension, and polysemous words in the text are disambiguated according to the correlation between semantic classes and context.

The object of the present invention is to provide a multi-topic extraction method based on a concept vector model, comprising the following steps:

Step1: vector model representation: preprocess the text to obtain a vector composed of feature words, and use the vector space model to represent the preprocessed text as that feature-word vector;

Step2: concept model mapping: based on a semantic knowledge base that expresses the semantics of natural-language words as concepts and represents the semantic relations between concepts in a tree structure, map the feature words of the preprocessed text to concepts using the correspondence between word senses and concepts. During concept mapping, synonyms in the text are merged automatically; polysemous words in the text are then disambiguated according to the correlation between semantic classes and context; finally, the vector space model of the text, after merging and disambiguation, is mapped to a concept space model;

Step3: multi-topic word extraction: cluster the concepts of the merged and disambiguated concept space model with an improved K-means algorithm to form several topic concept clusters; then, using the correspondence between concepts and the feature words of the original text, map the clusters back to several topic feature-word sets, thereby extracting the multi-topic words of a single Chinese text.

Further, step Step1 may comprise the following steps:

Step1-1: segment the text T to be processed with a word segmentation system, then remove stop words and noise to obtain the primary vector space model of the text, $T = \{C_1, C_2, \ldots, C_n\}$, where $C_1, C_2, \ldots, C_n$ are the n feature words that make up the vector. Stop-word removal filters out the stop words that occur in the text, and noise removal filters out words that carry no practical meaning;

Step1-2: further select features from the primary vector space model to obtain the refined vector space model of the text, $T = \{C_1, C_2, \ldots, C_m\}$, where $m \le n$.
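Step1-1 and Step1-2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the stop-word list, the noise list, whitespace splitting in place of a real word segmentation system, and the frequency threshold used for Step1-2 are all assumptions, since the text names none of them.

```python
from collections import Counter

# Hypothetical stop-word and noise lists; the source does not specify them.
STOP_WORDS = {"the", "of", "and", "in", "a"}
NOISE_WORDS = {"etc", "eg"}


def primary_features(tokens):
    """Step1-1: filter stop words and noise words, keeping each feature word once."""
    features = []
    for token in tokens:
        word = token.lower()
        if word in STOP_WORDS or word in NOISE_WORDS:
            continue
        if word not in features:
            features.append(word)
    return features


def refined_features(tokens, features, min_count=2):
    """Step1-2: keep features seen at least min_count times (an assumed criterion)."""
    counts = Counter(token.lower() for token in tokens)
    return [word for word in features if counts[word] >= min_count]


# Whitespace splitting stands in for a real word segmentation system.
tokens = "the growth of the plant and the soil in growth report".split()
n_features = primary_features(tokens)              # the n feature words
m_features = refined_features(tokens, n_features)  # the m <= n feature words
```

Any other feature selection criterion could replace the frequency threshold; the only property the text requires is that the refined vector has no more components than the primary one.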

Further, step Step2 may comprise the following steps:

The words in a text fall into three cases by meaning: monosemous words, synonyms, and polysemous words;

Concept mapping is carried out by querying the semantic knowledge base, wherein:

when the knowledge base lookup shows that a word in the text is monosemous, its unique corresponding concept is obtained directly;

when the lookup shows that a word in the text is a synonym, its unique corresponding concept is likewise obtained directly; in this process the synonyms occurring in the text are merged automatically into the same concept, which reduces the vector dimension;

when the lookup shows that a word in the text is polysemous, the word corresponds to several concepts, each of which corresponds to one semantic class. The weight of each semantic class is computed from the information carried in the text by the member words of that class, and the concept whose semantic class has the largest weight is chosen as the concept that fits the context, thereby disambiguating the polysemous word.

Further, the disambiguation of a polysemous word described above, in which the weight of each semantic class is computed from the information of its member words in the text and the concept of the highest-weighted semantic class is chosen, comprises the following steps:

in the semantic knowledge base, the semantics of a concept is described mainly by its basic sememe set, and the basic sememe set is in turn described by a group of semantically related words; the words that describe the basic sememe set of a concept form a semantic class;

for each of the concepts corresponding to the polysemous word, compute the information content, in the text being processed, of all member words of the corresponding semantic class, and obtain the weight of each semantic class by weighted calculation;

select the concept corresponding to the semantic class with the largest weight as the concept that fits the context, thereby disambiguating the polysemous word.

Further, step Step2 may comprise the following steps:

Step2-1: query the semantic knowledge base for every feature word in the text T to be processed, in turn, and perform concept mapping;

Step2-1-1: look up the knowledge base; if the feature word $C_m$ of T corresponds to a unique concept, $C_m$ is a monosemous word or a synonym, so obtain the concept of $C_m$ directly and go to Step2-2;

Step2-1-2: look up the knowledge base; if the feature word $C_m$ of T corresponds to several concepts, $C_m$ is polysemous, so perform word sense disambiguation on $C_m$ and select the concept that fits the context of the text;

Step2-2: obtain the concept vector corresponding to text T, $T = \{(G_1, C_1), (G_2, C_2), \ldots, (G_q, C_q)\}$;

Step2-3: group the output by concept to obtain the concept vector of text T, $T = \{(G_1, (C_1, \ldots, C_i)), (G_2, (C_2, \ldots, C_j)), \ldots, (G_q, (C_q, \ldots, C_k))\}$, where $(C_q, \ldots, C_k)$ are the words occurring in the text that correspond to concept $G_q$.
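The mapping in Step2-1 to Step2-3 can be sketched as follows. The lexicon below is a hypothetical toy dictionary standing in for a knowledge base lookup, the concept identifiers are invented for illustration, and the disambiguation function is passed in as a placeholder.

```python
# Hypothetical toy lexicon standing in for a knowledge base lookup:
# each word maps to its candidate concept identifiers.
TOY_LEXICON = {
    "computer": ["G_computer"],
    "PC": ["G_computer"],                 # synonym: same concept as "computer"
    "gene": ["G_gene"],
    "moisture": ["G_water", "G_untrue"],  # polysemous word: two concepts
}


def map_to_concepts(features, lexicon, disambiguate):
    """Map feature words to concepts and group the words by concept."""
    grouped = {}
    for word in features:
        candidates = lexicon.get(word, [])
        if not candidates:
            continue  # word not covered by the lexicon; skipped in this sketch
        if len(candidates) == 1:
            concept = candidates[0]                   # Step2-1-1
        else:
            concept = disambiguate(word, candidates)  # Step2-1-2
        grouped.setdefault(concept, []).append(word)
    # Step2-3: one (concept, words) pair per concept
    return [(concept, tuple(words)) for concept, words in grouped.items()]


# Placeholder disambiguator that simply picks the first candidate.
concept_vector = map_to_concepts(
    ["computer", "gene", "PC", "moisture"], TOY_LEXICON, lambda w, cs: cs[0]
)
```

Because "computer" and "PC" land in the same entry, four feature words become three concept components, which is the dimensionality reduction the text attributes to synonym merging.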

Further, step Step2-1-2 may comprise the following steps:

Step2-1-2-1: look up the knowledge base; the polysemous word $C_m$ corresponds to several concepts, and the basic sememe set that describes the semantics of each concept forms a semantic class, so $C_m$ corresponds to several semantic classes. The group of words describing each basic sememe set can thus be obtained, and it serves as a group of semantically related words that reflects the semantic class;

Step2-1-2-2: compute the information content in the text of each member word of each semantic class of $C_m$. The information content $H(w_i)$ of member word $w_i$ in the text is

$$H(w_i) = -\mathrm{TF}(w_i, ST) \times \log[p(w_i)],$$

where $\mathrm{TF}(w_i, ST)$ is the frequency with which word $w_i$ occurs in the text, $ST$ denotes the text, and $p(w_i)$ is the probability distribution of word $w_i$;

Step2-1-2-3: compute the weight of each semantic class of $C_m$; the weight of its i-th semantic class $L_i$ is

$$\mathrm{CWeight}(L_i) = \sum_{j=1}^{n} H(w_j) \times \log_2 n,$$

where n is the number of member words of $L_i$ that occur in the text and $w_j$ is the j-th such member word; the larger the weight of a semantic class, the greater the contribution of its member words to the semantics of the text;

Step2-1-2-4: select for $C_m$ the optimal concept that fits the context of the text:

$$\mathrm{Best}_{C_m L_i} = \max_i\, \mathrm{CWeight}(L_i).$$
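The three formulas of Step2-1-2-2 to Step2-1-2-4 can be sketched in a few lines. Several choices here are assumptions the text leaves open: TF is taken as the raw count in the text, p(w) is estimated as the relative frequency in the same text, the logarithm in H is natural, and the polysemous word "bank" with its two semantic classes is an invented example.

```python
import math
from collections import Counter


def information(word, counts, total):
    """H(w) = -TF(w, ST) * log p(w), with p(w) assumed to be count / total."""
    tf = counts[word]
    if tf == 0:
        return 0.0
    return -tf * math.log(tf / total)


def class_weight(members, counts, total):
    """CWeight(L): sum of H over member words present in the text, times log2(n)."""
    present = [w for w in members if counts[w] > 0]
    n = len(present)
    if n == 0:
        return 0.0
    return sum(information(w, counts, total) for w in present) * math.log2(n)


def best_class(classes, tokens):
    """Return the name of the highest-weighted semantic class and all weights."""
    counts = Counter(tokens)
    total = len(tokens)
    weights = {name: class_weight(m, counts, total) for name, m in classes.items()}
    return max(weights, key=weights.get), weights


# Hypothetical semantic classes for an invented polysemous word "bank".
classes = {
    "finance": ["money", "loan", "deposit"],
    "river": ["water", "shore", "flow"],
}
tokens = ["bank", "loan", "money", "loan", "water", "rate"]
best, weights = best_class(classes, tokens)
```

Note that, taken literally, the factor $\log_2 n$ gives a class with a single member word in the text a weight of zero, as happens for the "river" class here.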

Further, the improved K-means algorithm may comprise the following steps:

cluster the concepts in the concept vector of the text to be processed to form several topic concept clusters;

use the classical K-means clustering algorithm, improved by a preset-seed method.

Further, improving the classical K-means clustering algorithm by the preset-seed method may comprise the following steps:

following the idea behind statistical topic extraction, observe that a topic in a text is usually expressed around a group of synonyms, which produces a synonym co-occurrence phenomenon;

based on this phenomenon, the synonyms in a text are judged to revolve around the same topic, and in the concept vector model synonyms appear as the same concept;

during concept vector mapping, synonyms are merged into the same concept, so one concept may correspond to several words of the text; during multi-topic word extraction, the K concepts that cover the most words of the text to be processed are chosen as the preset seeds, that is, as the initial centers of K-means clustering.
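The preset-seed selection amounts to ranking concepts by the number of text words they cover. A minimal sketch, with invented concept identifiers and words:

```python
def preset_seeds(concept_vector, k):
    """Pick the k concepts that cover the most text words as initial centers."""
    ranked = sorted(concept_vector, key=lambda item: len(item[1]), reverse=True)
    return [concept for concept, _words in ranked[:k]]


# Hypothetical concept vector: (concept, words of the text mapped to it).
concept_vector = [
    ("G_politics", ("premier", "visit", "government")),
    ("G_economy", ("trade", "market")),
    ("G_sport", ("match",)),
]
seeds = preset_seeds(concept_vector, 2)
```

Because the seeds are determined by the text itself rather than drawn at random, repeated runs start from the same centers.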

Further, step Step3 may comprise the following steps:

Step3-1: from the concept vector of the text T to be processed, select the K concepts $\{G_1, G_2, \ldots, G_K\}$ that cover the most words of the text as the initial cluster centers;

Step3-2: compute the similarity between each remaining concept component of T and the K cluster centers, and assign each concept to the cluster with which its similarity is largest; this computation includes the similarity between two concepts and the similarity between a concept and a concept set;

Step3-3: recompute the center of each cluster; the center of a concept set $GG = \{G_1, G_2, \ldots, G_n\}$ is computed as

$$\mathrm{Center}_{GG} = \frac{\sum_{i=1}^{n} w_i}{n},$$

where $w_i$ is the weight of concept $G_i$, whose value is the number of words in the text that correspond to the concept, and n is the number of concepts in the set;

Step3-4: repeat Step3-2 and Step3-3 until the cluster centers no longer change, obtaining K concept sets: $\{\{\Phi_1\}, \{\Phi_2\}, \ldots, \{\Phi_K\}\}$;

Step3-5: select the $k_1$ concept sets that contain the most concepts, which form the concept sets of $k_1$ sub-topics, $\{\{\Phi_1\}, \{\Phi_2\}, \ldots, \{\Phi_{k_1}\}\}$, and map them back through the correspondence between concepts and feature words to obtain $k_1$ sets of sub-topic keywords: $\{(c_{11}, c_{12}, \ldots, c_{1i}), (c_{21}, c_{22}, \ldots, c_{2j}), \ldots, (c_{k_1 1}, c_{k_1 2}, \ldots, c_{k_1 t})\}$.
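One possible reading of Step3-1 to Step3-5 is sketched below. Several points are assumptions rather than statements of the source: seeds are kept fixed in their own clusters, a concept is compared with each cluster through the maximum similarity to its members (excluding itself), and the similarity function and data are toy placeholders.

```python
def set_similarity(u, members, sim):
    """Similarity of concept u to a set: maximum over the set's members."""
    return max(sim(u, g) for g in members)


def center(members, weights):
    """Cluster center: mean concept weight (word count) of the set."""
    return sum(weights[g] for g in members) / len(members)


def seeded_kmeans(weights, sim, k, max_iter=50):
    seeds = sorted(weights, key=weights.get, reverse=True)[:k]   # Step3-1
    others = [g for g in weights if g not in seeds]
    clusters = [[s] for s in seeds]
    centers = [center(c, weights) for c in clusters]
    for _ in range(max_iter):
        new_clusters = [[s] for s in seeds]
        for g in others:                                          # Step3-2
            scores = [set_similarity(g, [m for m in c if m != g], sim)
                      for c in clusters]
            new_clusters[scores.index(max(scores))].append(g)
        new_centers = [center(c, weights) for c in new_clusters]  # Step3-3
        clusters = new_clusters
        if new_centers == centers:                                # Step3-4
            break
        centers = new_centers
    return clusters


def topic_words(clusters, concept_words, k1):
    """Step3-5: keep the k1 largest clusters and map concepts back to words."""
    largest = sorted(clusters, key=len, reverse=True)[:k1]
    return [tuple(w for g in c for w in concept_words[g]) for c in largest]


# Hypothetical data: concepts, their words in the text, and a toy similarity.
concept_words = {
    "G_politics": ("premier", "visit", "government"),
    "G_economy": ("trade", "market"),
    "G_diplomacy": ("treaty",),
    "G_finance": ("bank",),
}
GROUP = {"G_politics": "p", "G_diplomacy": "p", "G_economy": "e", "G_finance": "e"}


def toy_sim(u, v):
    if u == v:
        return 1.0
    return 0.8 if GROUP[u] == GROUP[v] else 0.1


weights = {g: len(ws) for g, ws in concept_words.items()}
clusters = seeded_kmeans(weights, toy_sim, k=2)
topics = topic_words(clusters, concept_words, k1=2)
```

In a real setting `toy_sim` would be replaced by the sememe-based concept similarity the method defines; the clustering loop itself does not depend on how similarity is computed.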

Further, computing the similarity between two concepts and between a concept and a concept set may comprise the following steps:

the semantic knowledge base describes a concept by several sememes, and the sememes form a tree-shaped sememe hierarchy according to hypernym-hyponym relations;

the similarity between sememes is obtained by computing their distance in the tree-shaped hierarchy;

since the semantics of a concept is described by a group of sememes, the similarity between concepts can be computed from the similarity of their sememes;

the similarity between a concept and a concept set is obtained by computing the similarity between the concept and every concept in the set and taking the largest value.

Further, obtaining the distance between concepts, and hence their similarity, by computing sememe distance may comprise the following steps:

let the path distance between two sememes in the sememe tree hierarchy be d; d is computed as follows:

let $w_i$ be any sememe in the sememe set, $L_i$ the depth of $w_i$ in the concept tree, a the initial distance threshold, and b a positive real number satisfying $\max(L) < a/b$; the distance between $w_i$ and its parent node is

$$d(w_i, \mathrm{parent}(w_i)) = a - L_i \cdot b$$

the distance between any two sememes $w_i$ and $w_j$ is defined as

$$d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b]$$

where $\omega_k$ is the weight of the k-th kind of relation, usually with $\omega_k \ge 1$;

the semantic similarity between any two sememes $(w_i, w_j)$ is

$$\mathrm{Sim}(w_i, w_j) = \frac{\theta}{d(w_i, w_j) + \theta}$$

where d is the path length between $w_i$ and $w_j$ in the sememe hierarchy, a positive integer, and $\theta$ is an adjustable parameter;

concepts U and V are described by their respective sememe groups $(p_{u1}, p_{u2}, \ldots, p_{un})$ and $(p_{v1}, p_{v2}, \ldots, p_{vm})$, and the similarity of U and V is

$$\mathrm{Sim}(U, V) = \frac{(U, V)}{\sqrt{(U, U) \cdot (V, V)}}$$

where

$$(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Sim}(p_{ui}, p_{vj});$$

concept U is represented by the sememe group $(p_1, p_2, \ldots, p_n)$ and concept set G consists of the concepts $\{G_1, G_2, \ldots, G_m\}$; the similarity between U and G is defined as the maximum of the similarities between U and the concepts in G:

$$\mathrm{Sim}(U, G) = \max\{\mathrm{Sim}(U, G_i) \mid G_i \in G\}.$$
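The distance and similarity formulas above can be sketched as follows. The sememe names, their depths, and the values of a, b, and θ are invented for illustration; the code follows the formulas literally, so the distance between two sememes depends only on their depths and the relation weight.

```python
import math


def sememe_distance(depth_i, depth_j, a, b, omega=1.0):
    """d(w_i, w_j) = omega_k * [a - max(L_i, L_j) * b]."""
    return omega * (a - max(depth_i, depth_j) * b)


def sememe_similarity(depth_i, depth_j, a, b, theta, omega=1.0):
    """Sim(w_i, w_j) = theta / (d(w_i, w_j) + theta)."""
    return theta / (sememe_distance(depth_i, depth_j, a, b, omega) + theta)


def pair_sum(us, vs, sim):
    """(U, V): sum of sememe similarities over all pairs from U and V."""
    return sum(sim(p, q) for p in us for q in vs)


def concept_similarity(us, vs, sim):
    """Sim(U, V) = (U, V) / sqrt((U, U) * (V, V))."""
    return pair_sum(us, vs, sim) / math.sqrt(
        pair_sum(us, us, sim) * pair_sum(vs, vs, sim)
    )


def concept_set_similarity(us, concept_set, sim):
    """Sim(U, G): maximum similarity between U and any concept in G."""
    return max(concept_similarity(us, vs, sim) for vs in concept_set)


# Hypothetical sememe depths and parameters (max depth 4 < a / b = 10).
DEPTH = {"human": 2, "animal": 2, "like": 3, "leisure": 4}
A, B, THETA = 10.0, 1.0, 7.0


def sim(p, q):
    return sememe_similarity(DEPTH[p], DEPTH[q], A, B, THETA)


U = ("human", "like", "leisure")
V = ("human", "like")
W = ("animal",)
score = concept_set_similarity(U, [V, W], sim)
```

The normalization in `concept_similarity` guarantees that a concept compared with itself scores 1, even though an individual sememe compared with itself does not under the literal distance formula.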

The invention provides a multi-topic extraction method based on a concept vector model. The method uses a semantic knowledge base to merge synonyms through the correspondence between word senses and concepts, mines the mapping between word senses and semantic classes in a given context to disambiguate polysemous words, and builds a concept vector to represent the text. Semantic similarity is expressed through concept similarity. In the multi-topic word extraction algorithm, an improved K-means algorithm clusters the concepts of the text into several sub-topic clusters, and the correspondence between concepts and the keywords of the original text is then used to map the clusters back to several topic keyword sets. K-means is improved by the "preset seed" method to remedy the unstable time and space cost and the large fluctuation of results caused by the random selection of the K initial centers in the traditional K-means algorithm.

Additional aspects and advantages of the present invention are given in part in the following description; they will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the multi-topic extraction method based on a concept vector model according to the technical solution of the present invention;

Fig. 2 is a schematic flow chart of the method with HowNet as the semantic knowledge base;

Fig. 3 is a schematic diagram of the semantic classes of the polysemous word "moisture", with HowNet as the semantic knowledge base;

Fig. 4 is a schematic plot of how the precision, recall, and F1 of the method vary with different values of K.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the", and "said" used herein may also include the plural. It should further be understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any unit and all combinations of one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless defined as here, are not to be interpreted in an idealized or overly formal sense.

Introduction to HowNet: HowNet is a common-sense knowledge base that takes the concepts represented by Chinese and English words as its objects of description, and whose basic content is the relations between concepts and between the attributes that concepts possess. In HowNet, the semantic description of a word is defined as a concept. Each word may be expressed as several concepts, and a concept is described in a knowledge representation language (DEF). The "vocabulary" used to describe concepts is called sememes; compared with the size of the lexicon, the number of sememes is small. HowNet defines more than 1,500 sememes, divided into three classes: basic sememes, grammatical sememes, and relational sememes. The basic sememes in a DEF reflect the main semantics of a concept. For example, the basic sememes of the word "fan" in its HowNet DEF are DEF={Human|people, *Fondof|like, #WhileAway|leisure}, which means that a "fan" is a person, that this person likes something, and that the word is related to leisure. The several "words" that describe the basic sememes of a concept are semantically related. In HowNet, a word with only one meaning corresponds to one concept, whereas a polysemous word usually corresponds to several concepts.

Introduction to the concept vector: the traditional vector space model (VSM) represents a text with the words that make it up as vector components and treats the components as orthogonal, that is, the words as mutually unrelated, which clearly does not match reality. The words of a text are linked by complex semantic relations. Statistical topic-word extraction under VSM cannot handle synonyms and polysemous words correctly, so the semantic contribution of synonyms is undercounted and that of polysemous words is overcounted; and because the Chinese vocabulary is very large, vectors are high-dimensional and sparse, which seriously harms the quality and efficiency of topic extraction. With the HowNet semantic knowledge base, the VSM representation of a text is converted into a concept vector space model, and the semantic relations between words are handled through the tree-shaped hierarchy in which the knowledge base organizes concepts. To build the model, the text is first segmented and preprocessed to obtain its feature word set. The words in a text fall into three cases by meaning: monosemous words, synonyms, and polysemous words. Concept mapping queries the HowNet semantic knowledge base. For monosemous words and synonyms, the unique concept described in the knowledge representation language DEF is obtained directly, and the synonyms occurring in the text are merged automatically into the corresponding concept during mapping; since Chinese has a great many synonyms, this merging further reduces the dimension. A polysemous word corresponds to several concepts, and its specific meaning in a text usually depends on the context; on the basis of this property of language, semantic classes are used to disambiguate polysemous words.

To find the specific meaning (that is, the corresponding concept) of a polysemous word in a text, the following definitions are made:

Definition 1: suppose the words $\{c_1, c_2, \ldots, c_m\}$ ($m \ge 1$) occurring in a text correspond to concept $G_i$ in HowNet, and the basic sememe set describing $G_i$ is $\{y_1, y_2, \ldots, y_n\}$ ($n \ge 1$); then the word set $\{c_1, c_2, \ldots, c_m, y_1, y_2, \ldots, y_n\}$ is called a semantic class.

Semantic classes correspond one-to-one with concepts. A concept is defined in HowNet by its DEF, and its main semantics is described by the basic sememes. The basic sememes are a group of sememes, that is, a group of semantically related "words", so a semantic class is a group of semantically related words. This group has two parts: the first is the member words of the basic sememe set of the concept, and the second is all the words in the text that correspond to the concept.

When a semantic class matches the context of a text, several member words of that class are likely to occur in the text; these words are semantically related and contribute more to the semantics of the text, and this can be used to resolve lexical ambiguity. Fig. 3 is a schematic diagram of the semantic classes of the polysemous word "moisture", with HowNet as the semantic knowledge base. As shown in Fig. 3, "moisture" corresponds to two concepts in HowNet. The member words of the semantic class of the first concept (that is, its basic sememe set) are {"plant", "soil", "sunlight", "growth"}, and here "moisture" means "the water contained in an object". The member words of the semantic class of the second concept include {"economy", "data", "growth", "report"}, and here "moisture" means "mixed with untrue content".

Because of the complexity of Chinese, polysemy and synonymy are very common within a single text, and simple mechanical word-frequency statistics cannot handle problems that involve word semantics; this is a key factor affecting the quality of topic extraction. To solve polysemous-word disambiguation and synonym identification, the invention uses HowNet to merge synonyms into the same concept and, for a polysemous word with several semantic classes, finds the semantic class that fits the context of the text. The idea for locating the best semantic class of a polysemous word in a text is as follows: if the summed weights of the member words of a semantic class that occur in the text are larger, that semantic class matches the topic of the article better than the others and is the most suitable semantic class for the polysemous word in this text. The information content $H(w_i)$ of word $w_i$ in the text is computed as:

$$H(w_i) = -\mathrm{TF}(w_i, ST) \times \log[p(w_i)] \qquad (1)$$

where $\mathrm{TF}(w_i, ST)$ is the frequency with which word $w_i$ occurs in the text, $ST$ denotes the text, and $p(w_i)$ is the probability distribution of word $w_i$.

Definition 2: for a polysemous word c, the weight of its i-th semantic class $L_i$ is

$$\mathrm{CWeight}(L_i) = \sum_{j=1}^{n} H(w_j) \times \log_2 n \qquad (2)$$

where n is the number of member words of semantic class $L_i$ that occur in the text and $w_j$ is the j-th such word. The larger the weight of a semantic class, the greater the contribution of its member words to the topic of the article.

Definition 3: a polysemous word c corresponds to several semantic classes in HowNet; the optimal semantic class that fits the context of the text is selected by

$$\mathrm{Best}_{c L_i} = \max_i\, \mathrm{CWeight}(L_i) \qquad (3)$$
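As a worked illustration of formulas (1) to (3), consider the "moisture" example with hypothetical counts (none of these numbers come from the source): a text of 100 words in which "plant" occurs 4 times, "soil" and "growth" twice each, and "report" once, while "sunlight", "economy", and "data" do not occur. Assume that $p(w)$ is estimated by relative frequency in the text, that the logarithm in formula (1) is natural, and, for simplicity, that only the sememe member words of each class are counted:

$$H(\text{plant}) = -4 \ln 0.04 \approx 12.88, \quad H(\text{soil}) = H(\text{growth}) = -2 \ln 0.02 \approx 7.82, \quad H(\text{report}) = -\ln 0.01 \approx 4.61$$

$$\mathrm{CWeight}(L_1) \approx (12.88 + 7.82 + 7.82) \times \log_2 3 \approx 45.2, \qquad \mathrm{CWeight}(L_2) \approx (7.82 + 4.61) \times \log_2 2 \approx 12.4$$

Under these assumptions formula (3) selects $L_1$, so "moisture" is read as "the water contained in an object".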

Principle of concept similarity computation: similarity is an important measure of the semantic relation between two words and involves morphology, syntax, semantics, and even pragmatics. Of these, semantics has the greatest effect on word similarity. In HowNet, words are described as concepts, so computing the similarity of words reduces to computing the similarity of concepts. Word distance and word similarity are closely related: the larger the distance between two words, the lower their similarity; conversely, the smaller the distance, the higher their similarity.

HowNet describes a concept by several sememes, and various complex relations hold between sememes, such as hypernym-hyponym, synonym, and converse relations. The most important is the hypernym-hyponym relation, by which all sememes form a tree-shaped sememe hierarchy; the distance between concepts, and hence their similarity, can therefore be obtained by computing sememe distance. Let the path distance between two sememes in the sememe tree hierarchy be d; d is computed as follows:

Let w_i be any sememe in the sememe set, L_i the depth of sememe w_i in the concept tree, a the initial distance threshold, and b a positive real number satisfying the inequality max(L) < a/b. The distance between w_i and its parent node is defined as:

d(w_i, parent(w_i)) = a - L_i \cdot b \qquad (4)

The distance between any two sememes w_i and w_j is defined as:

d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b] \qquad (5)

where ω_k is the weight corresponding to the k-th kind of relation, usually with ω_k ≥ 1. It can be verified that this definition meets the mathematical requirements of a distance function. Formulas (4) and (5) reflect that the deeper two sememes lie in the sememe hierarchy tree, the smaller the distance between them and the more similar they are.

Definition 4: the semantic similarity between any two sememes (w_i, w_j) is:

Sim(w_i, w_j) = \frac{\theta}{d(w_i, w_j) + \theta} \qquad (6)

where d is the path length between w_i and w_j in the sememe hierarchy, a positive integer, and θ is an adjustable parameter.
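Formulas (5) and (6) can be sketched in Python as follows. The function names and the parameter values (a = 20, b = 1, θ = 1.6, ω_k = 1) are illustrative assumptions; the text only requires max(L) < a/b and ω_k ≥ 1.

```python
def sememe_distance(depth_i, depth_j, a, b, omega=1.0):
    """Formula (5): d = omega_k * (a - max(L_i, L_j) * b)."""
    return omega * (a - max(depth_i, depth_j) * b)


def sememe_similarity(depth_i, depth_j, a, b, theta, omega=1.0):
    """Formula (6): Sim = theta / (d + theta)."""
    d = sememe_distance(depth_i, depth_j, a, b, omega)
    return theta / (d + theta)


# Illustrative values (assumptions): every depth used here is below a / b = 20.
A, B, THETA = 20.0, 1.0, 1.6
shallow = sememe_similarity(1, 2, A, B, THETA)  # pair near the root
deep = sememe_similarity(8, 9, A, B, THETA)     # pair deep in the tree
print(shallow, deep)
```

With these values the deeper pair receives the smaller distance and therefore the higher similarity, which is the behavior the text attributes to formulas (4) and (5).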

Definition 5: let concepts U and V be described by the sememe groups (p_{u1}, p_{u2}, ..., p_{un}) and (p_{v1}, p_{v2}, ..., p_{vm}) respectively. The similarity of U and V is:

Sim(U, V) = \frac{(U, V)}{\sqrt{(U, U) \cdot (V, V)}} \qquad (7)

where:

(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} Sim(p_{ui}, p_{vj})

Definition 6: let concept U be represented by the sememe group (p_1, p_2, ..., p_n), and let the concept set G consist of the concepts {G_1, G_2, ..., G_m}. The similarity between concept U and concept set G is defined as the maximum of the similarities between U and the concepts in G:

Sim(U, G) = \max\{Sim(U, G_i) \mid G_i \in G\} \qquad (8)
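Formulas (7) and (8) can be sketched in Python as follows. The sememe similarity is passed in as a function so that any implementation of formula (6) can be used; the function names, the `toy_sim` stand-in, and the sememe labels are hypothetical and serve only to make the example self-contained.

```python
import math


def pair_sum(group_u, group_v, sim):
    """(U, V): sum of sememe similarities over all pairs (p_ui, p_vj)."""
    return sum(sim(p, q) for p in group_u for q in group_v)


def concept_similarity(group_u, group_v, sim):
    """Formula (7): (U, V) / sqrt((U, U) * (V, V))."""
    denom = math.sqrt(pair_sum(group_u, group_u, sim) * pair_sum(group_v, group_v, sim))
    return pair_sum(group_u, group_v, sim) / denom if denom else 0.0


def concept_set_similarity(group_u, concept_set, sim):
    """Formula (8): the largest similarity between U and any concept in G."""
    return max(concept_similarity(group_u, g, sim) for g in concept_set)


def toy_sim(p, q):
    """Hypothetical stand-in for formula (6): 1 for identical labels, else 0.2."""
    return 1.0 if p == q else 0.2


u = ("liquid", "contain")
g = [("liquid", "contain"), ("fake", "information")]
result = concept_set_similarity(u, g, toy_sim)
print(result)
```

Because of the normalization in formula (7), a concept compared with itself always has similarity 1, whatever sememe similarity is plugged in.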

Fig. 1 shows a schematic flowchart of the multi-topic extraction method based on the concept vector model according to the technical solution of the present invention. As shown in Fig. 1, the object of the present invention is to provide a multi-topic extraction method based on a concept vector model, comprising the following steps:

Further, step Step1 may comprise the following steps:

Further, step Step2 may comprise the following steps:

Step2-3: further organize the result by concept and output the concept vector corresponding to text T: T = {(G_1, (C_1, ..., C_i)), (G_2, (C_2, ..., C_j)), ..., (G_q, (C_q, ..., C_k))}.

Further, step Step2-1-2 may comprise the following steps:

H(w_i) = -TF(w_i, ST) \times \log[p(w_i)],

CWeight(L_i) = \sum_{j=1}^{n} H(w_j) \times \log_2 n,

Best_{C_m L_i} = \max(CWeight(L_i)).

Further, the improved K-means algorithm may comprise the following steps:

This compensates for the defects of the traditional K-means algorithm, whose sensitivity to the initial centers causes unstable time and space cost and large fluctuations in the results.

Further, step Step3 may comprise the following steps:

centerLL = \frac{\sum_{i=1}^{n} w_i}{n}

d(w_i, parent(w_i)) = a - L_i \cdot b

The distance between any two sememes w_i and w_j is defined as:

d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b]

The semantic similarity between any two sememes (w_i, w_j) is as follows:

Sim(w_i, w_j) = \frac{\theta}{d(w_i, w_j) + \theta}

Sim(U, V) = \frac{(U, V)}{\sqrt{(U, U) \cdot (V, V)}}

where

(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} Sim(p_{ui}, p_{vj}),

Sim(U, G) = \max\{Sim(U, G_i) \mid G_i \in G\}.

The invention provides a multi-topic extraction method based on a concept vector model. The method uses the HowNet semantic knowledge base: synonyms are merged through the correspondence between word senses and concepts, the mapping between word senses and semantic classes within the same context is mined to disambiguate polysemous words, and a concept vector is constructed to represent the text. Semantic similarity is expressed by computing concept similarity. In the multi-topic word extraction algorithm, an improved K-means algorithm clusters the concepts of the text into several sub-topic clusters, and the correspondence between concepts and the keywords of the original text is then used in reverse to obtain several topic keyword sets. The K-means algorithm is improved by the "preset seed" method to compensate for the unstable time and space cost and the large fluctuation in results caused by the random selection of the K initial centers in the traditional K-means algorithm.

The present invention is further described below with reference to Fig. 2. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope; after reading the present invention, modifications by those skilled in the art to its various equivalent forms all fall within the scope defined by the claims of this application.

Fig. 2 shows a schematic flowchart of the multi-topic extraction method based on the concept vector model according to the technical solution of the present invention, with HowNet as the semantic knowledge base. As shown in Fig. 2, the text T to be processed is first input. Text T is then preprocessed: for example, the ICTCLAS word segmentation system is used for segmentation, stop words and noise are removed, and information gain (IG) is used for preliminary feature extraction. Next, text T is represented with a vector space model. Text T is then mapped to a concept space model: for example, monosemous words and synonyms are mapped directly, while polysemous words are disambiguated according to the correlation between semantic classes and the context. Next, the improved K-means algorithm clusters the concepts, for example K-means improved by the preset "seed" method, with similarity computed from concept semantics. Finally, several sub-topic word sets are obtained in reverse from the correspondence between concepts and words.
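The last step of the flow in Fig. 2, going back from concept clusters to sub-topic word sets, can be sketched in Python as follows. The function name, the dictionary layout, and the toy concept labels are assumptions made for illustration; the text only states that the correspondence between concepts and words is used in reverse.

```python
def subtopic_word_sets(concept_clusters, concept_to_words):
    """Replace each concept in a cluster by the text words mapped to it."""
    word_sets = []
    for cluster in concept_clusters:
        words = set()
        for concept in cluster:
            # Concepts with no recorded words contribute nothing.
            words.update(concept_to_words.get(concept, ()))
        word_sets.append(words)
    return word_sets


# Hypothetical mapping recorded while building the concept vector.
concept_to_words = {
    "C_water": ["moisture"],
    "C_plant": ["plant", "soil"],
    "C_economy": ["economy", "data", "report"],
}
clusters = [["C_water", "C_plant"], ["C_economy"]]
topics = subtopic_word_sets(clusters, concept_to_words)
print(topics)
```

Keeping the concept-to-word mapping from the earlier mapping step is what makes this reverse lookup possible without consulting the knowledge base again.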

Fig. 3 shows a schematic diagram of the semantic classes of the polysemous word "moisture", with HowNet as the semantic knowledge base, according to the multi-topic extraction method based on the concept vector model of the technical solution of the present invention. As shown in Fig. 3, the polysemous word "moisture" corresponds to two concepts in HowNet. The semantic class member words (that is, the basic sememe set) of the first concept are {"plant", "soil", "sunlight", "growth"}, where "moisture" means "the water contained in an object". The semantic class member words of the second concept include {"economy", "data", "growth", "report"}, where "moisture" means "mixed with untrue content".

Experiment and result analysis: the experimental data come from the public standard corpus of the natural language processing laboratory of Fudan University, which contains 20 categories and 19,637 texts in total, none of them annotated with topics. Considering the workload, this experiment selected from 5 categories of the corpus 500 longer texts with relatively evident multi-topic characteristics, which were annotated with topics by professionals in Chinese language work and used as the experimental sample. The results are evaluated with the commonly used precision (P), recall (R), and the combined measure F1.

F_1 = \frac{2PR}{P + R} \qquad (11)
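Formula (11) can be illustrated with a short Python sketch. The text does not spell out how P and R are counted for extracted topics, so computing them from the overlap between an extracted set and a manually annotated reference set is an assumption here, as are the function name and the toy labels.

```python
def precision_recall_f1(extracted, reference):
    """Return (P, R, F1) from set overlap; F1 = 2PR / (P + R) as in formula (11)."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical example: 4 extracted sub-topics against 3 annotated ones.
p, r, f1 = precision_recall_f1(
    ["politics", "economy", "sports", "culture"],
    ["politics", "economy", "diplomacy"],
)
print(p, r, f1)
```

Here two of the four extracted items are correct (P = 1/2) and two of the three reference items are found (R = 2/3), so F1 = 4/7.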

Parameter estimation: to obtain the most appropriate value of the initial cluster number parameter k in Algorithm 2, Chinese language professionals analyzed the actual length, text structure, and other characteristics of the test samples and set the number of sub-topics per sample, k1, to 3, manually annotating 3 sub-topics for each sample as the reference standard. The value of k was then analyzed experimentally with k1 = 3; Fig. 4 shows how precision (P), recall (R), and F1 change with different values of k.

Fig. 4 shows curves of precision, recall, and F1 under different values of k for the multi-topic extraction method based on the concept vector model according to the technical solution of the present invention. As shown in Fig. 4, with 3 sub-topics per sample, the precision of the topics extracted by the improved K-means algorithm rises steadily as k increases, while recall falls. Precision rises because a larger k refines the clusters. The recall of an algorithm is normally fixed, but in this experiment the categories are refined further as k increases, so selecting only the 3 (k1 = 3) largest sub-topics lowers recall. To find the most suitable k, the F1 curve in Fig. 4 is examined: the peak of F1 occurs at k = 7, so the optimal value for Algorithm 2 on this experimental sample is k = 7. Note that the value of k depends on the text being processed.

Algorithm test: to test the quality of multi-topic extraction by the K-means algorithm improved with the "preset seed" method, the experimental sample is again the 500 prepared texts, and the result of the parameter estimation experiment above is adopted: k = 7 and sub-topic number k1 = 3. The traditional K-means algorithm, with k randomly generated initial centers, was first run 5 times; these results and the topic extraction results of the improved K-means are summarized in Table 1:

Table 1: Multi-topic extraction result statistics for K-means and improved K-means

As Table 1 shows, over the 5 runs of traditional K-means with randomly generated initial centers, the precision, recall, and F1 values of the results are all very unstable, and the running time varies considerably. This is because the traditional K-means algorithm is sensitive to the initial cluster centers, so both the results and the running time fluctuate widely with different initial inputs. To eliminate this defect, the present invention relies on a characteristic of topic extraction: each topic usually contains several words sharing the same semantic concept. From the number of words in the text that correspond to each concept, which carries implicit information about the semantic centers of the topics of the text, the K most likely initial centers are set in advance, thereby improving K-means. Not only is the quality of the extracted topics higher, but the execution efficiency of the algorithm is also greatly improved.
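The seed selection idea described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it ranks concepts by the number of distinct text words mapped to them (the text could equally mean occurrence counts), it performs a single assignment pass instead of the full iterative procedure, and all names and similarity values are hypothetical.

```python
def preset_seeds(concept_to_words, k):
    """Pick the k concepts with the most corresponding words in the text."""
    ranked = sorted(concept_to_words, key=lambda c: len(concept_to_words[c]), reverse=True)
    return ranked[:k]


def assign_to_seeds(concepts, seeds, sim):
    """One assignment pass: attach each concept to its most similar seed."""
    clusters = {seed: [] for seed in seeds}
    for concept in concepts:
        nearest = max(seeds, key=lambda s: sim(concept, s))
        clusters[nearest].append(concept)
    return clusters


# Hypothetical concepts, words, and pairwise similarities (illustrative only).
concept_to_words = {
    "economy": ["growth", "data", "report"],
    "plant": ["soil", "sunlight"],
    "travel": ["visit"],
}
toy_table = {frozenset(("travel", "economy")): 0.3, frozenset(("travel", "plant")): 0.1}


def toy_sim(c1, c2):
    """Stand-in for the concept similarity; identical concepts score 1."""
    return 1.0 if c1 == c2 else toy_table.get(frozenset((c1, c2)), 0.0)


seeds = preset_seeds(concept_to_words, 2)
clusters = assign_to_seeds(list(concept_to_words), seeds, toy_sim)
print(seeds, clusters)
```

Because the seeds are derived deterministically from the text, repeated runs start from the same centers, which is the property the text credits for the stable results and running time.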

Those skilled in the art will appreciate that the steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted. Further, steps, measures, and schemes in the prior art that correspond to the various operations, methods, and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.

The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and refinements without departing from the principles of the invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A multi-topic extraction method based on a concept vector model, characterized in that it comprises the following steps:

2. The multi-topic extraction method based on a concept vector model according to claim 1, characterized in that step Step1 further comprises the following steps:

3. The multi-topic extraction method based on a concept vector model according to claim 1, characterized in that step Step2 further comprises the following steps:

4. The multi-topic extraction method based on a concept vector model according to claim 3, characterized in that step Step2-1-2 may comprise the following steps:

H(w_i) = -TF(w_i, ST) \times \log[p(w_i)],

5. The multi-topic extraction method based on a concept vector model according to claim 1, characterized in that the improved K-means algorithm further comprises the following steps:

6. The multi-topic extraction method based on a concept vector model according to claim 1, characterized in that the classical K-means clustering algorithm is selected and improved by the preset seed method, further comprising the following steps:

7. The multi-topic extraction method based on a concept vector model according to claim 1, characterized in that step Step3 further comprises the following steps:

8. The multi-topic extraction method based on a concept vector model according to claim 1, characterized in that obtaining the distance between concepts, and in turn the similarity of concepts, by calculating the distance between sememes may comprise the following steps:

d(w_i, parent(w_i)) = a - L_i \cdot b

The distance between any two sememes w_i and w_j is defined as:

d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b]

The semantic similarity between any two sememes (w_i, w_j) is as follows:

where

Sim(U, G) = \max\{Sim(U, G_i) \mid G_i \in G\}.