CN103970729A - Multi-subject extracting method based on semantic categories - Google Patents

Multi-subject extracting method based on semantic categories

Info

Publication number
CN103970729A
Authority
CN
China
Prior art keywords
semantic
concept
text
semantic category
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410178218.6A
Other languages
Chinese (zh)
Other versions
CN103970729B (en
Inventor
马甲林
王志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201410178218.6A
Publication of CN103970729A
Application granted
Publication of CN103970729B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-subject extraction method based on semantic categories. First, a document is preprocessed by conventional means to obtain an initial vector of feature words. Second, using the correspondence between word senses and concepts in HowNet, synonyms are merged, polysemous words are disambiguated according to the correlation between semantic categories and the context, and a concept vector model is constructed to represent the document. The concept vector model is then converted into a semantic category model through the one-to-one correspondence between concepts and semantic categories. Concept similarity is computed from the semantic information attached to concepts in HowNet, which yields the similarity between semantic categories. The semantic categories are clustered with a K-means algorithm improved by preset seeds, forming several subject semantic category clusters. Finally, several sub-subject word sets are recovered in reverse through the correspondences between semantic categories and concepts and between concepts and words. The method takes semantic information into account, addresses the K-means algorithm's sensitivity to its initial centers and its unstable time and space cost, and improves the quality of the extracted subjects.

Description

A multi-topic (multi-subject) extraction method based on semantic categories
Technical field
The present invention relates to the field of text information extraction, and in particular to a multi-topic extraction method based on semantic categories.
Background art
Since human society entered the information age, electronic texts have appeared in enormous numbers. Many of them are multi-topic texts that carry rich subject information from several angles. For example, a news report on Premier Li Keqiang's visit to Europe belongs to both political news and economic news. As science and technology develop, disciplines become increasingly intertwined, most research spans several fields, and many scientific texts cover multiple topics from different angles. A text on mining biological gene information, for instance, contains topics from both computer science and biomedicine. Real-world text collections therefore contain a large number of multi-topic texts, and extracting from them the valuable sub-topic information that reflects their different aspects has wide application in fields such as information retrieval and library and information science.
Research on text topic extraction began abroad in the 1950s. The most mature approach at present is based on statistical models and extracts topics mainly from word frequency statistics. Researchers later added factors such as titles, position, syntactic structure and cue words, which made it possible to extract high-quality topics from English text. Research on topic extraction in China began in the late 1980s, but because of the complexity of Chinese, many methods that work well for English are not suitable for Chinese.
The methods most widely used in China are still statistical. They operate under the vector space model (VSM), which assumes that the component vectors are pairwise orthogonal, that is, that the words making up a text are unrelated. This clearly contradicts the fact that word meaning depends on context. Moreover, because the Chinese vocabulary is very large, VSM representations are high-dimensional and sparse and ignore word semantics and context, and the extraction process is disturbed by synonyms and polysemous words, so both quality and efficiency suffer. Current research on topic extraction concentrates on how to incorporate semantic information. Many scholars have proposed semantics-based topic extraction methods, but none has yet achieved a breakthrough at the application level. In addition, multi-topic extraction differs considerably from single-topic extraction at the algorithmic level: identifying several sub-topic word sets in one text is difficult to achieve with traditional word-frequency statistics alone. The community partitioning algorithm for complex networks proposed by Liao Tao et al. can extract multiple topics, but it is a purely statistical method that does not use the semantic information of words, and the quality of the extracted topics is low.
Existing text processing techniques based on word frequency statistics can therefore extract only a single topic from a text. They also suffer from high-dimensional, sparse vectors and from a lack of word-sense and context information, which leads to inefficient algorithms and low-quality topic words. A multi-topic extraction method based on semantic categories is therefore needed.
Summary of the invention
The technical problem to be solved by the present invention is that traditional text processing techniques based on word frequency statistics can extract only a single topic from a text, and that they suffer from high-dimensional, sparse vectors and a lack of word-sense and context information, which leads to inefficient algorithms and low-quality topic words. To solve this problem, the invention provides a multi-topic extraction method based on semantic categories. The method uses the HowNet semantic knowledge base to map each feature word of the text to a concept, so that the text is expressed as a concept model. During mapping, synonyms are automatically merged into the same concept, which reduces the dimensionality of the vector, and polysemous words that occur in the text are disambiguated according to the correlation between semantic categories and the context.
The object of the present invention is to provide a multi-topic extraction method based on semantic categories, comprising the following steps:
Step1: Vector model representation. The text is preprocessed to obtain a vector of feature words, and the vector space model is used to represent the preprocessed text as that vector.
Step2: Concept model mapping. The method relies on a semantic knowledge base that expresses the semantics of natural-language words as concepts and represents the semantic relations between concepts in a tree structure. Using the correspondence between word senses and concepts, the feature words of the preprocessed text are mapped to concepts. During mapping, synonyms in the text are merged automatically. Polysemous words in the text are then disambiguated according to the correlation between semantic categories and the context. Finally, the vector space model of the text, after merging and disambiguation, is mapped to a concept space model.
Step3: Semantic category model conversion. Because the representation of a concept in the semantic knowledge base and the definition of a semantic category are in one-to-one correspondence, the text represented by the concept model is converted into a semantic category model.
Step4: Multi-topic word extraction. An improved K-means algorithm clusters all semantic categories of the text represented by the semantic category model, forming several topic semantic category clusters. From these clusters, several topic feature word sets are recovered in reverse through the correspondence between semantic categories and concepts and between concepts and the feature words of the original text, thereby extracting the multi-topic words of a single Chinese text.
Further, step Step1 may comprise the following steps:
Step1-1: Segment the text T to be processed with a word segmentation system, then remove stop words and noise to obtain the primary vector space model of the text, $T = \{C_1, C_2, \ldots, C_n\}$, where $C_1, C_2, \ldots, C_n$ are the $n$ feature words that make up the vector. Stop-word removal filters out the stop words that occur in the text, and denoising filters out words that carry no practical meaning.
Step1-2: Extract features further from the primary vector space model to obtain the refined vector space model of the text, $T = \{C_1, C_2, \ldots, C_m\}$, where $m \le n$.
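Step1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: whitespace splitting stands in for a Chinese word segmentation system, `STOP_WORDS` is a hypothetical stop-word list, dropping non-alphanumeric tokens stands in for denoising, and keeping the `m` most frequent words is an assumed feature-selection rule that the text does not specify.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would load a full one.
STOP_WORDS = {"the", "of", "and", "a", "is"}


def preprocess(text, stop_words=STOP_WORDS, m=None):
    """Return the feature words of `text`, most frequent first.

    Step1-1: segment, remove stop words, remove noise tokens.
    Step1-2: keep at most `m` features (all of them if m is None).
    """
    tokens = text.lower().split()            # stand-in for word segmentation
    kept = [t for t in tokens
            if t not in stop_words and t.isalnum()]  # stop-word removal + denoising
    counts = Counter(kept)
    return [word for word, _ in counts.most_common(m)]


if __name__ == "__main__":
    print(preprocess("the plant and the soil of the plant", m=2))
```

The returned list plays the role of the vector $T = \{C_1, \ldots, C_m\}$ with $m \le n$.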
Further, step Step2 may comprise the following:
The words in a text fall into three cases: monosemous words, synonyms and polysemous words.
Concept mapping is performed by querying the semantic knowledge base, wherein:
When the query shows that a word in the text is monosemous, its unique concept is obtained directly.
When the query shows that a word in the text is a synonym, its unique concept is obtained directly. In this process the synonyms that occur in the text are automatically merged into the same concept, which reduces the dimensionality of the vector.
When the query shows that a word in the text is polysemous, the word corresponds to several concepts, and each concept corresponds to one semantic category. The weight of each semantic category is computed from the information that its member words carry in the text, and the concept of the semantic category with the largest weight is chosen as the concept that fits the context, which disambiguates the polysemous word.
Further, the disambiguation of a polysemous word described above comprises the following steps:
In the semantic knowledge base, the semantics of a concept are described mainly by its basic sememe set, and the basic sememe set is in turn described by a group of semantically related words. The words that describe the basic sememe set of a concept form a semantic category.
For each of the concepts of the polysemous word, the information content in the processed text of all member words of the corresponding semantic category is computed, and a weighted calculation yields the weight of each semantic category.
The concept corresponding to the semantic category with the largest weight is selected as the concept that fits the context, which disambiguates the polysemous word.
Further, step Step2 may comprise the following steps:
Step2-1: Query the semantic knowledge base for each feature word of the text T to be processed, in turn, and perform concept mapping.
Step2-1-1: Query the knowledge base. If feature word $C_m$ of T corresponds to a single concept, $C_m$ is a monosemous word or a synonym; obtain its concept directly and go to Step2-2.
Step2-1-2: Query the knowledge base. If feature word $C_m$ of T corresponds to several concepts, $C_m$ is a polysemous word and must be disambiguated by selecting the concept that fits the context of the text.
Step2-2: Obtain the concept vector of text T: $T = \{(G_1, C_1), (G_2, C_2), \ldots, (G_q, C_q)\}$.
Step2-3: Regroup by concept and output the concept vector of text T: $T = \{(G_1, (C_1, \ldots, C_i)), (G_2, (C_2, \ldots, C_j)), \ldots, (G_q, (C_q, \ldots, C_k))\}$, where $(C_q, \ldots, C_k)$ are the words in the text that correspond to concept $G_q$.
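The mapping in Step2-1 to Step2-3 can be sketched as below. The sketch assumes a toy `lexicon` dictionary from words to concept identifiers in place of the real knowledge base, and a caller-supplied `disambiguate` function for polysemous words; both names are hypothetical, and skipping words that the lexicon does not contain is likewise an assumption.

```python
from collections import defaultdict


def map_to_concepts(features, lexicon, disambiguate):
    """Group feature words by concept (Step2-1 to Step2-3).

    lexicon:      dict word -> list of concept ids (toy knowledge base)
    disambiguate: callable (word, concepts) -> chosen concept id
    Returns dict concept id -> list of words in the text.
    """
    concept_words = defaultdict(list)
    for word in features:
        concepts = lexicon.get(word, [])
        if not concepts:
            continue  # assumption: words missing from the knowledge base are skipped
        if len(concepts) == 1:
            concept = concepts[0]                    # monosemous word or synonym
        else:
            concept = disambiguate(word, concepts)   # polysemous word
        if word not in concept_words[concept]:
            concept_words[concept].append(word)      # synonyms merge under one key
    return dict(concept_words)


if __name__ == "__main__":
    toy_lexicon = {
        "computer": ["C_computer"],
        "PC": ["C_computer"],
        "moisture": ["C_water", "C_exaggeration"],
    }
    pick_first = lambda word, concepts: concepts[0]  # placeholder disambiguator
    print(map_to_concepts(["computer", "PC", "moisture"], toy_lexicon, pick_first))
```

Because synonyms share a key, the number of keys in the result is at most the number of feature words, which is the dimensionality reduction the text describes.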
Further, step Step2-1-2 may comprise the following steps:
Step2-1-2-1: Query the knowledge base. The polysemous word $C_m$ corresponds to several concepts, and the basic sememe set that describes the semantics of each concept constitutes a semantic category, so $C_m$ corresponds to several semantic categories. The words that describe each basic sememe set can thus be obtained, and they form a group of semantically related words that reflects the semantic category.
Step2-1-2-2: Compute the information content in the text of each member word of each semantic category of $C_m$. The information content $H(w_i)$ of member word $w_i$ in the text is:
$$H(w_i) = -TF(w_i, ST) \times \log[p(w_i)]$$
where $TF(w_i, ST)$ is the frequency with which word $w_i$ occurs in the text, $ST$ denotes the text, and $p(w_i)$ is the probability distribution of word $w_i$.
Step2-1-2-3: Compute the weight of each semantic category of $C_m$. The weight of its $i$-th semantic category $L_i$ is:
$$\mathrm{CWeight}(L_i) = \sum_{j=1}^{n} H(w_j) \times \log_2 n$$
where $n$ is the number of member words of semantic category $L_i$ that occur in the text. The larger the weight of a semantic category, the more its member words contribute to the semantics of the text.
Step2-1-2-4: Select for $C_m$ the concept that best fits the context of the text, using the formula:
$$\mathrm{Best}_{C_m} L_i = \mathrm{MAX}(\mathrm{CWeight}(L_i))$$
Further, step Step3 may comprise the following steps:
Starting from the concept vector of the text T obtained in Step2, $T = \{(G_1, (C_1, \ldots, C_i)), (G_2, (C_2, \ldots, C_j)), \ldots, (G_q, (C_q, \ldots, C_k))\}$, query the semantic knowledge base and convert T into a semantic category model in which each component of T is a five-tuple.
Further, the five-tuple has the form $(L_i, w_i, (C_1, \ldots, C_k), G_i, (C_1, \ldots, C_i))$, where $L_i$ is a semantic category, $w_i$ is the weight of $L_i$, $(C_1, \ldots, C_k, C_1, \ldots, C_i)$ are the member words of $L_i$, and $G_i$ is the concept corresponding to $L_i$.
Further, the semantic category weight $w_i$ is obtained by first computing the information content of all member words of $L_i$ with the formula of Step2-1-2-2, and then computing the weight of $L_i$ with the semantic category weight formula of Step2-1-2-3.
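One way to hold the five-tuple in code is sketched below. The field names are hypothetical, and the split of the two word tuples into sememe-set words and text words is an interpretation of the member-word definition, not something the five-tuple notation states explicitly. The weights are taken as given here, since their computation is described in Step2-1-2-2 and Step2-1-2-3.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SemanticCategory:
    category_id: str                # L_i
    weight: float                   # w_i
    sememe_words: Tuple[str, ...]   # words describing the basic sememe set (assumed role)
    concept: str                    # G_i
    text_words: Tuple[str, ...]     # words of the text mapped to G_i (assumed role)

    @property
    def members(self) -> Tuple[str, ...]:
        """All member words of the category."""
        return self.sememe_words + self.text_words


def build_category_model(concept_vector: Dict[str, List[str]],
                         sememe_sets: Dict[str, List[str]],
                         weights: Dict[str, float]) -> List[SemanticCategory]:
    """Turn a concept vector into a list of five-tuples, one per concept."""
    return [
        SemanticCategory(f"L_{concept}", weights[concept],
                         tuple(sememe_sets[concept]), concept, tuple(words))
        for concept, words in concept_vector.items()
    ]


if __name__ == "__main__":
    model = build_category_model(
        {"C_water": ["moisture"]},
        {"C_water": ["plant", "soil"]},
        {"C_water": 1.5},
    )
    print(model[0].members)
```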
Further, the improved K-means algorithm may comprise the following steps:
Cluster the semantic categories of the text to be processed to form several topic semantic category clusters.
Use the classical K-means clustering algorithm, improved by the preset-seed method.
Further, improving the classical K-means clustering algorithm by the preset-seed method may comprise the following:
The micro-topic phenomenon: a semantic category consists of one word or several synonyms in the text together with the words of the corresponding basic sememe set in the semantic knowledge base. Because the basic sememe set describes the main semantics of the concept, the member words of a semantic category are very strongly related in meaning, and together they reflect one micro-topic.
Following the micro-topic phenomenon, a text contains several micro-topics. The K micro-topics with the highest information content in the text are selected as preset seeds and become the initial centers of the K-means algorithm. This overcomes defects of K-means such as its sensitivity to the initial centers and its unstable time and space cost.
The information content of a micro-topic is reflected by the weight of its semantic category. Once the text to be processed is represented by the semantic category model, the K semantic categories with the largest weights are the K micro-topics with the highest information content.
Further, step Step4 may comprise the following steps:
Step4-1: From the semantic category model of the text T to be processed, select the K semantic categories with the largest weights as the initial cluster centers.
Step4-2: Compute the similarity between every other semantic category in T and the K cluster centers, and assign each semantic category to the cluster with which its similarity is largest. Computing this similarity comprises computing the similarity between two semantic categories and the similarity between a semantic category and a set of semantic categories.
Step4-3: Recompute the center of each cluster. The center of cluster $LL$ is computed as:
$$\mathrm{center}_{LL} = \frac{\sum_{i=1}^{n} w_i}{n}$$
where $w_i$ is a semantic category weight and $n$ is the number of semantic categories in the semantic category set.
Step4-4: Repeat Step4-2 and Step4-3 until the cluster centers no longer change, obtaining K semantic category sets: $\{\{\Phi_1\}, \{\Phi_2\}, \ldots, \{\Phi_K\}\}$.
Step4-5: Select the $k_1$ semantic category sets that contain the most semantic categories, obtaining the semantic category sets that form $k_1$ sub-topics: $\{\{\Phi_1\}, \{\Phi_2\}, \ldots, \{\Phi_{k_1}\}\}$. Through the correspondence between semantic categories and concepts and between concepts and feature words, recover in reverse the $k_1$ sets of sub-topic keywords: $\{(c_{11}, c_{12}, \ldots, c_{1i}), (c_{21}, c_{22}, \ldots, c_{2j}), \ldots, (c_{k_1 1}, c_{k_1 2}, \ldots, c_{k_1 t})\}$.
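Steps Step4-1 to Step4-5 can be sketched as follows. This is one possible reading, not the patent's implementation. It assumes that the preset seeds stay pinned to their own clusters, that the similarity between a category and a cluster is the largest similarity to any cluster member, and that the loop stops when the mean weights of the clusters stop changing. The function names and the `sim` callable are hypothetical.

```python
def cluster_semantic_categories(weights, sim, k, max_iter=100):
    """Cluster semantic categories with preset seeds.

    weights: dict category -> weight
    sim:     callable (category, category) -> similarity
    k:       number of clusters, k <= len(weights)
    Returns a list of k clusters (lists of categories), seed first.
    """
    # Step4-1: the k heaviest categories are the preset seeds.
    seeds = sorted(weights, key=weights.get, reverse=True)[:k]
    clusters = [[s] for s in seeds]
    centers = None
    for _ in range(max_iter):
        new_clusters = [[s] for s in seeds]
        for cat in weights:
            if cat in seeds:
                continue

            def set_sim(members, cat=cat):
                # Similarity to a cluster: best similarity to any other member.
                return max((sim(cat, m) for m in members if m != cat), default=0.0)

            # Step4-2: assign to the most similar cluster.
            best = max(range(k), key=lambda i: set_sim(clusters[i]))
            new_clusters[best].append(cat)
        # Step4-3: cluster center = mean weight of its members.
        new_centers = [sum(weights[c] for c in cl) / len(cl) for cl in new_clusters]
        clusters = new_clusters
        if new_centers == centers:  # Step4-4: centers unchanged, stop
            break
        centers = new_centers
    return clusters


def topic_words(clusters, category_words, k1):
    """Step4-5: keep the k1 largest clusters and map them back to text words."""
    largest = sorted(clusters, key=len, reverse=True)[:k1]
    return [[w for cat in cl for w in category_words[cat]] for cl in largest]


if __name__ == "__main__":
    group = {"plant": "bio", "soil": "bio", "economy": "eco", "data": "eco"}
    toy_sim = lambda a, b: 0.9 if group[a] == group[b] else 0.1
    toy_weights = {"plant": 5.0, "economy": 4.0, "soil": 2.0, "data": 1.0}
    print(cluster_semantic_categories(toy_weights, toy_sim, k=2))
```

Because the seeds are chosen deterministically from the weights, repeated runs on the same text give the same clusters, which is the stability benefit the text attributes to preset seeds.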
Further, the multi-topic extraction method based on semantic categories also comprises:
Obtaining concept similarity by computing sememe distance, and using concept similarity to represent the similarity of the corresponding semantic categories.
Representing the similarity between a semantic category and a cluster of semantic categories by the similarity between a concept and a concept set.
Further, representing the similarity between a semantic category and a cluster of semantic categories by the similarity between a concept and a concept set may comprise the following:
The semantic knowledge base describes a concept with several sememes, and the sememes form a tree-shaped sememe hierarchy according to hypernym-hyponym relations.
The similarity between sememes is obtained from the distance between them in the tree-shaped hierarchy.
Because the semantics of a concept are described by a group of sememes, the similarity between concepts can be computed from the similarity between sememes.
The similarity between a concept and a concept set is obtained by computing the similarity between that concept and every concept in the set and taking the largest value.
Further, obtaining the distance between concepts, and from it the similarity between concepts, by computing sememe distance may comprise the following:
Suppose the path distance between two sememes in the sememe tree hierarchy is $d$. The distance $d$ is computed as follows:
Let $w_i$ be any sememe in the sememe set, $L_i$ the depth of $w_i$ in the concept tree, $a$ the initial distance threshold, and $b$ a positive real number satisfying $\max(L) < a/b$. The distance between $w_i$ and its parent node is:
$$d(w_i, \mathrm{parent}(w_i)) = a - L_i \cdot b$$
The distance between any two sememes $w_i$ and $w_j$ is defined as:
$$d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b]$$
where $\omega_k$ is the weight of the $k$-th kind of relation, usually with $\omega_k \ge 1$.
The semantic similarity between any two sememes $(w_i, w_j)$ is:
$$\mathrm{Sim}(w_i, w_j) = \frac{\theta}{d(w_i, w_j) + \theta}$$
where the distance $d$ is the path length between $w_i$ and $w_j$ in the sememe hierarchy and is a positive integer, and $\theta$ is an adjustable parameter.
Concepts $U$ and $V$ are described by their respective sememe groups $(p_{u1}, p_{u2}, \ldots, p_{un})$ and $(p_{v1}, p_{v2}, \ldots, p_{vm})$. The similarity of $U$ and $V$ is:
$$\mathrm{Sim}(U, V) = \frac{(U, V)}{(U, U) \cdot (V, V)}$$
where $(U, V) = \sum_{i}^{n} \sum_{j}^{m} \mathrm{Sim}(p_{ui}, p_{vj})$.
Concept $U$ is represented by the sememe group $(p_1, p_2, \ldots, p_n)$, and the concept set $G$ consists of the concepts $\{G_{11}, G_{21}, \ldots, G_{m1}\}$. The similarity between concept $U$ and concept set $G$ is defined as the maximum of the similarities between $U$ and all concepts in $G$:
$$\mathrm{Sim}(U, G) = \mathrm{Max}\{\mathrm{Sim}(U, G_i) \mid G_i \in G\}$$
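The sememe distance, the sememe similarity and the concept-to-set similarity can be sketched as below. The parameter values `a=10.0`, `b=1.0`, `omega=1.0` and `theta=2.0` are illustrative assumptions chosen so that $\max(L) < a/b$ holds for shallow trees; the text does not give concrete values. The concept-to-concept similarity is passed in as a callable (`concept_sim`, a hypothetical name) instead of being implemented here.

```python
def sememe_distance(depth_i, depth_j, a=10.0, b=1.0, omega=1.0):
    """d(w_i, w_j) = omega * [a - max(L_i, L_j) * b].

    depth_i, depth_j: depths L_i, L_j of the two sememes in the tree.
    Requires max depth < a / b so that the distance stays positive.
    """
    return omega * (a - max(depth_i, depth_j) * b)


def sememe_similarity(depth_i, depth_j, theta=2.0, **distance_params):
    """Sim(w_i, w_j) = theta / (d(w_i, w_j) + theta)."""
    d = sememe_distance(depth_i, depth_j, **distance_params)
    return theta / (d + theta)


def concept_set_similarity(u, concept_set, concept_sim):
    """Sim(U, G): largest similarity between U and any concept in G."""
    return max(concept_sim(u, g) for g in concept_set)


if __name__ == "__main__":
    # Sememes deeper in the tree are closer to each other under this formula.
    print(sememe_similarity(2, 3), sememe_similarity(7, 8))
```

With these formulas a smaller distance gives a larger similarity, and $\theta$ controls how quickly similarity falls off as distance grows.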
The invention provides a multi-topic extraction method based on a concept vector model. The method uses a semantic knowledge base to merge synonyms through the correspondence between word senses and concepts, mines the mapping between word senses and semantic categories within the same context to disambiguate polysemous words, and constructs a concept vector to represent the text. Semantic similarity is expressed through concept similarity. In the multi-topic word extraction algorithm, an improved K-means algorithm clusters the concepts of the text into several sub-topic clusters, and the correspondence between concepts and the keywords of the original text is then used to recover several topic keyword sets in reverse. The K-means algorithm is improved by the preset-seed method to remedy the unstable time and space cost and the large fluctuation in results caused by the random selection of the K initial centers in traditional K-means.
Additional aspects and advantages of the present invention are given in part in the following description; they will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
Fig. 1 is a flow chart of the multi-topic extraction method based on semantic categories according to the technical solution of the present invention.
Fig. 2 is a flow chart of the multi-topic extraction method based on semantic categories according to the technical solution of the present invention, with HowNet as the semantic knowledge base.
Fig. 3 is a schematic diagram of the semantic categories of the polysemous word "moisture", with HowNet as the semantic knowledge base, in the multi-topic extraction method based on semantic categories according to the technical solution of the present invention.
Fig. 4 shows curves of precision, recall and F1 for different values of K in the multi-topic extraction method based on semantic categories according to the technical solution of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural. The word "comprise" used in this specification indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The wording "and/or" used herein includes any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by a person of ordinary skill in the field to which the present invention belongs. Terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless defined as herein, should not be interpreted in an idealized or overly formal sense.
Introduction to HowNet: HowNet is a common-sense knowledge base that takes the concepts represented by Chinese and English words as its objects of description, and whose basic content is the relations between concepts and between the attributes of concepts. In HowNet, the semantic description of a word is defined as a concept. Each word can be expressed as several concepts, and a concept is described in a knowledge representation language (DEF). The "vocabulary" used to describe concepts consists of sememes, and the number of sememes is small compared with the size of the vocabulary. HowNet defines more than 1500 sememes in three classes: basic sememes, grammatical sememes and relational sememes. The basic sememes in a DEF reflect the main semantics of the concept. For example, the word "fan" is described in HowNet by the basic sememes DEF={Human|person, *Fondof|like, #WhileAway|leisure}, which means that a "fan" is a person, that this person likes something, and that the word is related to leisure. The words that describe the basic sememes of a concept are semantically related to one another. In HowNet, a word with only one meaning corresponds to a single concept, whereas a polysemous word usually corresponds to several concepts.
Introduction to the concept vector: The traditional vector space model (VSM) represents a text as a vector whose components are the words of the text and treats the components as orthogonal, that is, it treats the words as mutually unrelated. This clearly does not match reality. The words of a text are linked by complex semantic relations, and statistical topic word extraction under VSM cannot handle synonyms and polysemous words correctly: the semantic contribution of synonyms is undercounted and that of polysemous words is overcounted. Because the Chinese vocabulary is very large, the vectors are also high-dimensional and sparse, which seriously harms the quality and efficiency of topic extraction. With the HowNet semantic knowledge base, the VSM representation of a text is converted into a concept vector space model, and the tree hierarchy of concepts in the knowledge base is used to handle the semantic relations between words. The construction first segments and preprocesses the text to obtain its feature word set. The words in the text fall into three cases: monosemous words, synonyms and polysemous words. Concept mapping queries the HowNet semantic knowledge base. Monosemous words and synonyms directly yield their unique concept, described in the knowledge representation language DEF, and the synonyms that occur in the text are automatically merged into the corresponding concept during mapping. Since Chinese has a great many synonyms, merging them reduces dimensionality further. A polysemous word corresponds to several concepts, and its specific meaning in a text usually depends on the context. Based on this property of language, the invention proposes to use semantic categories to disambiguate polysemous words.
To find the specific meaning (that is, the corresponding concept) of a polysemous word in a text, the following definitions are made:
Definition 1: Suppose the words $\{c_1, c_2, \ldots, c_m\}$ $(m \ge 1)$ occur in a text and correspond in HowNet to the concept $G_i$, and the basic sememe set that describes $G_i$ is $\{y_1, y_2, \ldots, y_n\}$ $(n \ge 1)$. The word set $\{c_1, c_2, \ldots, c_m, y_1, y_2, \ldots, y_n\}$ is then called a semantic category.
Semantic categories correspond one to one with concepts. A concept is defined in HowNet by a DEF, and the main semantics of the concept are described by its basic sememes, which are a group of semantically related "words". A semantic category is therefore a group of semantically related words made up of two parts: the first part is the member words of the basic sememe set of a concept, and the second part is all the words in the text that correspond to this concept.
When a semantic category fits the context of a text, several of its member words are likely to occur in the text. These words are related in meaning and contribute more to the semantics of the text, which can be used to resolve lexical ambiguity. Fig. 3 is a schematic diagram of the semantic categories of the polysemous word "moisture", with HowNet as the semantic knowledge base, in the multi-topic extraction method based on the concept vector model according to the technical solution of the present invention. As shown in Fig. 3, the polysemous word "moisture" corresponds to two concepts in HowNet. The semantic category member words (that is, the basic sememe set) of the first concept are {"plant", "soil", "sunlight", "growth"}, and "moisture" here means "the water contained in an object". The semantic category member words of the second concept are {"economy", "data", "growth", "report"}, and "moisture" here means "padded with untrue content".
Because of the complexity of Chinese, polysemy and synonymy are very common within a single text. Simple mechanical word-frequency statistics cannot handle problems involving lexical semantics, and this is a key factor affecting the quality of text subject extraction. To solve polysemous word disambiguation and synonym identification, the present invention uses HowNet to merge synonyms into the same concept and, for a polysemous word that belongs to several semantic categories, finds the semantic category that fits the context of the text. The idea for locating the best semantic category of a polysemous word in a text is as follows: if the sum of the weights of the member words of a semantic category that occur in the text is larger, that category matches the subject of the article better than the other categories, and it is the most suitable semantic category of the polysemous word in this text. The information content $H(w_i)$ of a word $w_i$ in the text is computed as:
$$H(w_i) = -\mathrm{TF}(w_i, ST) \times \log[p(w_i)] \tag{1}$$
where $\mathrm{TF}(w_i, ST)$ is the frequency with which word $w_i$ occurs in the text, $ST$ denotes the text, and $p(w_i)$ is the probability distribution of word $w_i$.
Definition 2: For a polysemous word $c$, the weight of its $i$-th semantic category $L_i$ is:
$$\mathrm{CWeight}(L_i) = \sum_{j=1}^{n} H(w_j) \times \log_2 n \tag{2}$$
where $n$ is the number of member words of semantic category $L_i$ that occur in the text. The larger the weight of a semantic category, the greater the contribution of its member words to the subject of the article.
Definition 3: A polysemous word $c$ corresponds to several semantic categories in HowNet. The semantic category that best fits the context of the text is selected by:
$$\mathrm{Best}_{c}(L_i) = \max_{i}\,\mathrm{CWeight}(L_i) \tag{3}$$
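Definitions 1 to 3 can be illustrated with a minimal Python sketch. It assumes that TF is the raw count of a word in the text and that $p(w)$ is supplied by the caller as a background probability table; the function names, the category labels and the probability values are hypothetical and only serve the "moisture" example of Fig. 3.

```python
import math
from collections import Counter

def information(word, counts, prob):
    """Formula (1): H(w) = -TF(w, ST) * log p(w)."""
    # Assumption: TF is the raw count of the word in the text and
    # p(w) comes from a background table supplied by the caller.
    return -counts[word] * math.log(prob[word])

def category_weight(members, counts, prob):
    """Formula (2): CWeight(L) = sum_j H(w_j) * log2(n)."""
    present = [w for w in members if counts[w] > 0]
    n = len(present)  # member words that actually occur in the text
    if n == 0:
        return 0.0
    # Note: with n == 1 the factor log2(n) is 0, so the weight is 0.
    return sum(information(w, counts, prob) for w in present) * math.log2(n)

def best_category(categories, tokens, prob):
    """Formula (3): choose the semantic category with the largest weight."""
    counts = Counter(tokens)
    weights = {name: category_weight(members, counts, prob)
               for name, members in categories.items()}
    return max(weights, key=weights.get), weights

# Hypothetical segmented text and the two categories of "moisture" (Fig. 3).
tokens = ["plant", "soil", "moisture", "sunlight", "growth", "moisture"]
categories = {
    "water_content": {"plant", "soil", "sunlight", "growth"},
    "untrue_element": {"economy", "data", "growth", "report"},
}
# Made-up background probabilities, identical for every member word.
prob = {w: 0.01 for members in categories.values() for w in members}

best, weights = best_category(categories, tokens, prob)
print(best, weights)
```

In this toy text four member words of the first category occur, while only "growth" occurs from the second, so the first category receives the larger weight and is chosen as the sense of "moisture".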
Principle of concept similarity computation: Similarity is an important indicator of the semantic relation between two words and involves morphological, syntactic, semantic and even pragmatic information about the words. Among these, the semantics of a word has the greatest influence on word similarity. In HowNet, words are described by concepts, so computing the similarity of words is converted into computing the similarity of concepts. Word distance and word similarity are closely related: the larger the distance between two words, the lower their similarity; conversely, the smaller the distance, the higher their similarity.
HowNet describes a concept by several sememes, and various complex relations exist between sememes, such as hypernym-hyponym relations, synonymy and converse relations. The most important is the hypernym-hyponym relation: all sememes form a tree-shaped sememe hierarchy according to it, so the distance between concepts, and in turn their similarity, can be obtained by computing sememe distances. Suppose the path distance between two sememes in the sememe tree hierarchy is $d$; $d$ is computed as follows:
Let $w_i$ be any sememe in the sememe set, $L_i$ the depth of sememe $w_i$ in the concept tree, $a$ the initial distance threshold, and $b$ a positive real number satisfying the inequality $\max(L) < a/b$. The distance between $w_i$ and its parent node is defined as:
$$d(w_i, \mathrm{parent}(w_i)) = a - L_i \cdot b \tag{4}$$
The distance between any two sememes $w_i$ and $w_j$ is defined as:
$$d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b] \tag{5}$$
where $\omega_k$ is the weight of the $k$-th kind of relation, usually $\omega_k \ge 1$. It can be verified that these definitions satisfy the mathematical requirements of a distance function. Formulas (4) and (5) reflect that the deeper two sememes lie in the sememe hierarchy tree, the smaller the distance between them and the more similar they are.
Definition 4: The semantic similarity between any two sememes $(w_i, w_j)$ is:
$$\mathrm{Sim}(w_i, w_j) = \frac{\theta}{d(w_i, w_j) + \theta} \tag{6}$$
where $d(w_i, w_j)$ is the path distance between $w_i$ and $w_j$ in the sememe hierarchy, a positive integer, and $\theta$ is an adjustable parameter.
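Formulas (5) and (6) can be sketched in Python as follows. The sememe tree, the parameter values for $a$, $b$ and $\theta$, and the convention that the root has depth 0 are all assumptions made for illustration; they are not taken from HowNet or from the present text.

```python
def depth(node, parent):
    """Depth of a sememe in the tree; the root has depth 0 (assumption)."""
    level = 0
    while parent[node] is not None:
        node = parent[node]
        level += 1
    return level

def sememe_distance(wi, wj, parent, a, b, omega=1.0):
    """Formula (5): d(wi, wj) = omega_k * [a - max(L_i, L_j) * b]."""
    deepest = max(depth(wi, parent), depth(wj, parent))
    return omega * (a - deepest * b)

def sememe_similarity(wi, wj, parent, a, b, theta, omega=1.0):
    """Formula (6): Sim(wi, wj) = theta / (d(wi, wj) + theta)."""
    return theta / (sememe_distance(wi, wj, parent, a, b, omega) + theta)

# Hypothetical sememe hierarchy given as child -> parent links.
parent = {
    "entity": None,
    "thing": "entity",
    "plant": "thing",
    "water": "thing",
    "tree": "plant",
}
# Hypothetical parameters; the maximum depth 3 satisfies max(L) < a / b = 5.
A, B, THETA = 10.0, 2.0, 1.6

shallow = sememe_similarity("plant", "water", parent, A, B, THETA)
deep = sememe_similarity("tree", "water", parent, A, B, THETA)
print(shallow, deep)
```

With these values the pair that reaches deeper into the tree has the smaller distance and therefore the larger similarity, which is the behaviour described for formulas (4) and (5).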
Definition 5: Let concepts $U$ and $V$ be described by the sememe groups $(p_{u1}, p_{u2}, \ldots, p_{un})$ and $(p_{v1}, p_{v2}, \ldots, p_{vm})$ respectively. The similarity of $U$ and $V$ is:
$$\mathrm{Sim}(U, V) = \frac{(U, V)}{\sqrt{(U, U) \cdot (V, V)}} \tag{7}$$
where $(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Sim}(p_{ui}, p_{vj})$.
Definition 6: Let concept $U$ be represented by the sememe group $(p_1, p_2, \ldots, p_n)$ and let the concept set $G$ consist of the concepts $\{G_1, G_2, \ldots, G_m\}$. The similarity between concept $U$ and concept set $G$ is defined as the maximum of the similarities between $U$ and all concepts in $G$:
$$\mathrm{Sim}(U, G) = \max\{\mathrm{Sim}(U, G_i) \mid G_i \in G\} \tag{8}$$
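Definitions 5 and 6 can be sketched in Python as below. The sketch assumes a cosine-style normalisation by $\sqrt{(U,U)\cdot(V,V)}$ and takes the sememe similarity as a caller-supplied function; the toy similarity table and the concept contents are hypothetical.

```python
import math
from itertools import product

def pair_sum(U, V, sim):
    """(U, V) = sum over all sememe pairs of Sim(p_ui, p_vj)."""
    return sum(sim(p, q) for p, q in product(U, V))

def concept_similarity(U, V, sim):
    """Formula (7), assuming normalisation by sqrt((U,U) * (V,V))."""
    return pair_sum(U, V, sim) / math.sqrt(pair_sum(U, U, sim) * pair_sum(V, V, sim))

def concept_set_similarity(U, G, sim):
    """Formula (8): the largest similarity between U and any concept in G."""
    return max(concept_similarity(U, Gi, sim) for Gi in G)

# Hypothetical sememe similarities; unrelated pairs default to 0.1.
TABLE = {
    frozenset(("plant", "water")): 0.5,
    frozenset(("economy", "data")): 0.4,
}

def toy_sim(p, q):
    if p == q:
        return 1.0
    return TABLE.get(frozenset((p, q)), 0.1)

U = ("plant", "water")    # sememe group of a hypothetical concept U
V = ("economy", "data")   # sememe group of a hypothetical concept V

print(concept_similarity(U, V, toy_sim))
print(concept_set_similarity(U, [V, U], toy_sim))
```

Under this normalisation a concept compared with itself scores 1, so the similarity of U to a set that contains U is also 1.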
Fig. 1 is a schematic flow chart of the multi-subject extraction method based on semantic categories according to the technical solution of the present invention. As shown in Fig. 1, the object of the present invention is to provide a multi-subject extraction method based on semantic categories, comprising the following steps:
Step1: Vector model representation: preprocess the text to obtain a vector composed of feature words, and use the vector space model to represent the preprocessed text as a vector of feature words;
Step2: Concept model mapping: based on a semantic knowledge base that expresses the semantics of natural-language words with concepts and represents the semantic relations between concepts in a tree structure, use the correspondence between word senses and the concepts to map the feature words of the preprocessed text to concepts. During concept mapping, synonyms in the text are merged automatically; then the polysemous words occurring in the text are disambiguated according to the correlation between semantic categories and the context; finally, the vector space model of the text after merging and disambiguation is mapped to a concept space model;
Step3: Semantic category model conversion: because, by the way concepts are expressed in the semantic knowledge base and by the definition of a semantic category, concepts and semantic categories correspond one-to-one, convert the text represented by the concept model into a semantic category model;
Step4: Multi-subject word extraction: use the improved K-means algorithm to cluster all semantic categories of the text represented as a semantic category model, forming several subject semantic category clusters; then, from the correspondence between semantic categories and concepts and between concepts and the feature words of the original text, obtain several subject feature word sets in reverse, thereby extracting the multi-subject words of a single Chinese text.
Further, Step1 may comprise the following steps:
Step1-1: Segment the text T to be processed with a word segmentation system, then remove stop words and noise to obtain the primary vector space model of the text, $T = \{C_1, C_2, \ldots, C_n\}$, where $C_1, C_2, \ldots, C_n$ denote the $n$ feature words composing the vector; stop-word removal means filtering out the stop words occurring in the text, and noise removal means filtering out the words without practical meaning occurring in the text;
Step1-2: Further extract feature vectors from the primary vector space model to obtain the advanced vector space model of the text, $T = \{C_1, C_2, \ldots, C_m\}$, where $m \le n$.
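A rough Python sketch of Step1-1 and Step1-2 is given below. It assumes the text has already been segmented into tokens by an external word segmentation system, uses a hypothetical stop-word list, and replaces the feature extraction of Step1-2 with a plain top-$m$ term-frequency cut, which is a simplification chosen only to keep the example self-contained.

```python
from collections import Counter

def preprocess(tokens, stop_words, m):
    """Return the primary (n features) and advanced (m <= n features) models."""
    # Step1-1: drop stop words and noise (here: blank tokens and pure digits).
    kept = [t for t in tokens
            if t not in stop_words and t.strip() and not t.isdigit()]
    counts = Counter(kept)
    primary = list(counts)  # distinct feature words in order of first appearance
    # Step1-2 (simplified): keep the m most frequent feature words.
    advanced = [word for word, _ in counts.most_common(m)]
    return primary, advanced

# Hypothetical pre-segmented text and stop-word list.
tokens = ["the", "plant", "needs", "moisture", "and", "sunlight",
          ",", "the", "plant", "grows", "2014"]
stop_words = {"the", "and", ","}

primary, advanced = preprocess(tokens, stop_words, m=3)
print(primary)
print(advanced)
```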
Further, Step2 may comprise the following:
The words in a text fall into three cases by meaning: monosemous words, synonyms and polysemous words;
Concept mapping is carried out by querying the semantic knowledge base, where:
When a lookup in the knowledge base shows that a word in the text is monosemous, its unique corresponding concept is obtained directly;
When a lookup in the knowledge base shows that a word in the text is a synonym, its unique corresponding concept is obtained directly; in this process the synonyms occurring in the text are merged automatically into the same concept, which reduces the dimensionality of the feature vector;
When a lookup in the knowledge base shows that a word in the text is polysemous, the word corresponds to several concepts, and concepts correspond one-to-one with semantic categories. The weight of each semantic category is computed from the information content of its member words in the text, and the concept of the semantic category with the largest weight is chosen as the concept of the polysemous word that fits the context, thereby disambiguating the polysemous word.
Further, the disambiguation of a polysemous word described above, in which the weight of each semantic category is computed from the information content of its member words in the text and the concept of the semantic category with the largest weight is chosen as the concept that fits the context, comprises the following steps:
In the semantic knowledge base, the semantics of a concept are described mainly by its basic sememe set, which is in turn described by a group of semantically related words; the words describing the basic sememe set of a concept form a semantic category;
For the several concepts corresponding to the polysemous word, compute the information content in the text being processed of all member words of the semantic category of each concept, and obtain the weight of each semantic category by weighted computation;
Select the concept of the semantic category with the largest weight as the concept of the polysemous word that fits the context, thereby disambiguating the polysemous word.
Further, Step2 may comprise the following steps:
Step2-1: Query the semantic knowledge base in turn for every feature word in the text T to be processed and carry out concept mapping;
Step2-1-1: Look up the knowledge base; if the feature word $C_m$ of T corresponds to a unique concept, $C_m$ is a monosemous word or a synonym; obtain the concept of $C_m$ directly and go to Step2-2;
Step2-1-2: Look up the knowledge base; if the feature word $C_m$ of T corresponds to several concepts, $C_m$ is a polysemous word; perform word sense disambiguation on $C_m$ and select the concept that fits the context of the text;
Step2-2: Obtain the concept vector of text T, $T = \{(G_1, C_1), (G_2, C_2), \ldots, (G_q, C_q)\}$;
Step2-3: Further arrange the output by concept to obtain the concept vector of text T, $T = \{(G_1, (C_1, \ldots, C_i)), (G_2, (C_2, \ldots, C_j)), \ldots, (G_q, (C_q, \ldots, C_k))\}$, where $(C_q, \ldots, C_k)$ are the words occurring in the text that correspond to concept $G_q$.
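The concept mapping of Step2-1 to Step2-3 can be sketched in Python as follows. The dictionary `lexicon` is a hypothetical stand-in for a lookup in the semantic knowledge base, the concept identifiers are made up, and the disambiguation function is passed in by the caller (here a placeholder that simply takes the first candidate); skipping words unknown to the knowledge base is also an assumption.

```python
from collections import defaultdict

def map_to_concepts(features, lexicon, disambiguate):
    """Group feature words by concept: {G: [C, ...]} as in Step2-3."""
    grouped = defaultdict(list)
    for word in features:
        concepts = lexicon.get(word, [])
        if not concepts:
            continue  # assumption: words missing from the knowledge base are skipped
        if len(concepts) == 1:
            concept = concepts[0]  # Step2-1-1: monosemous word or synonym
        else:
            concept = disambiguate(word, concepts)  # Step2-1-2: polysemous word
        grouped[concept].append(word)  # synonyms merge under one concept
    return dict(grouped)

# Hypothetical lexicon: word -> candidate concept identifiers.
lexicon = {
    "soil": ["G_earth"],
    "earth": ["G_earth"],
    "moisture": ["G_water", "G_untrue"],
}

def first_candidate(word, concepts):
    """Placeholder disambiguator; a real one would compare category weights."""
    return concepts[0]

concept_vector = map_to_concepts(["soil", "moisture", "earth"], lexicon, first_candidate)
print(concept_vector)
```

The two synonyms end up under the same concept, which is how the mapping reduces the dimensionality of the feature vector.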
Further, Step2-1-2 may comprise the following steps:
Step2-1-2-1: Look up the knowledge base; the polysemous word $C_m$ corresponds to several concepts, and the basic sememe set describing the semantics of each concept constitutes a semantic category, so $C_m$ corresponds to several semantic categories. The words describing each basic sememe set can thus be obtained, and they form a group of semantically related words reflecting the semantic category;
Step2-1-2-2: Compute the information content in the text of each member word of each semantic category of $C_m$. The information content $H(w_i)$ of member word $w_i$ in the text is computed as:
$$H(w_i) = -\mathrm{TF}(w_i, ST) \times \log[p(w_i)],$$
where $\mathrm{TF}(w_i, ST)$ is the frequency with which word $w_i$ occurs in the text, $ST$ denotes the text, and $p(w_i)$ is the probability distribution of word $w_i$;
Step2-1-2-3: Compute the weight of each semantic category of $C_m$; the weight of its $i$-th semantic category $L_i$ is:
$$\mathrm{CWeight}(L_i) = \sum_{j=1}^{n} H(w_j) \times \log_2 n,$$
where $n$ is the number of member words of semantic category $L_i$ that occur in the text; the larger the weight of a semantic category, the greater the contribution of its member words to the semantics of the text;
Step2-1-2-4: Select for the polysemous word $C_m$ the optimal concept that fits the context of the text, by:
$$\mathrm{Best}_{C_m}(L_i) = \max_{i}\,\mathrm{CWeight}(L_i).$$
Further, Step3 may comprise the following steps:
From the concept vector of the text T obtained in Step2, $T = \{(G_1, (C_1, \ldots, C_i)), (G_2, (C_2, \ldots, C_j)), \ldots, (G_q, (C_q, \ldots, C_k))\}$, query the semantic knowledge base and convert T into a semantic category model in which each component of T is represented by a five-tuple;
Further, the five-tuple has the form $(L_i, w_i, (C_1, \ldots, C_k), G_i, (C_1, \ldots, C_i))$, where $L_i$ is the semantic category, $w_i$ is the weight of $L_i$, $(C_1, \ldots, C_k, C_1, \ldots, C_i)$ are the member words of $L_i$, and $G_i$ is the concept corresponding to $L_i$;
Further, the semantic category weight $w_i$ is obtained by computing the information content of all member words of $L_i$ with the formula of Step2-1-2-2 for the information content of a word in the text, and then computing the weight of $L_i$ with the semantic category weight formula of Step2-1-2-3.
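One possible in-memory form of the five-tuple is sketched below in Python. The field names are hypothetical, and reading the first word list as the basic sememe words and the second as the words found in the text is an assumption based on the two-part description of a semantic category in Definition 1.

```python
from typing import NamedTuple, Tuple

class SemanticCategory(NamedTuple):
    """Five-tuple (L_i, w_i, (C_1..C_k), G_i, (C_1..C_i)); field names are illustrative."""
    category: str                   # L_i, identifier of the semantic category
    weight: float                   # w_i, weight of the semantic category
    sememe_words: Tuple[str, ...]   # assumed: words of the basic sememe set
    concept: str                    # G_i, the concept corresponding to L_i
    text_words: Tuple[str, ...]     # assumed: words of the concept found in the text

    @property
    def members(self) -> Tuple[str, ...]:
        """All member words of the category, both parts concatenated."""
        return self.sememe_words + self.text_words

# Hypothetical component of the semantic category model of a text.
component = SemanticCategory(
    category="L_water",
    weight=36.8,
    sememe_words=("plant", "soil", "sunlight", "growth"),
    concept="G_water",
    text_words=("moisture",),
)
print(component.members)
```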
Further, the improved K-means algorithm may comprise the following steps:
Cluster the semantic categories of the text to be processed to form several subject semantic category clusters;
Select the classic K-means clustering algorithm and improve it by the preset-seed method.
Further, selecting the classic K-means clustering algorithm and improving it by the preset-seed method may comprise the following:
Description of the micro-subject phenomenon: a semantic category consists of one word or several synonyms in the text together with the words in the basic sememe set that corresponds to these words in the semantic knowledge base. Because the basic sememe set describes the main semantics of the concept, the member words of a semantic category are very strongly correlated in meaning, and together they reflect a micro-subject;
According to the micro-subject phenomenon, several micro-subjects occur in a text. The K micro-subjects with the highest information content in the text are selected as "preset seeds" and become the initial centers of the K-means algorithm, which overcomes defects such as the sensitivity of the K-means algorithm to the initial centers and its unstable time and space cost;
The information content of a micro-subject is reflected by the weight of its semantic category; after the text to be processed is represented as the semantic category model, the K semantic categories with the largest weights are the K micro-subjects with the highest information content.
Further, Step4 may comprise the following steps:
Step4-1: From the semantic category model of the text T to be processed, select the K semantic categories with the largest weights as the initial cluster centers;
Step4-2: Compute the similarity between each of the other semantic categories in T and the K cluster centers, and assign each semantic category to the cluster with which its similarity is highest; computing these similarities comprises computing the similarity between two semantic categories and the similarity between a semantic category and a set of semantic categories;
Step4-3: Recompute the center of each cluster; the center of cluster LL is computed by:
$$\mathrm{center}_{LL} = \frac{\sum_{i=1}^{n} w_i}{n}$$
where $w_i$ is a semantic category weight and $n$ is the number of semantic categories in the semantic category set;
Step4-4: Repeat Step4-2 and Step4-3 until the cluster centers no longer change, obtaining K semantic category sets: $\{\{\Phi_1\}, \{\Phi_2\}, \ldots, \{\Phi_K\}\}$;
Step4-5: Select the $k_1$ sets containing the most semantic categories to obtain the semantic category sets forming $k_1$ sub-subjects, $\{\{\Phi_1\}, \{\Phi_2\}, \ldots, \{\Phi_{k_1}\}\}$; then, from the correspondence between semantic categories and concepts and between concepts and feature words, obtain in reverse the $k_1$ sets of sub-subject keywords: $\{(c_{11}, c_{12}, \ldots, c_{1i}), (c_{21}, c_{22}, \ldots, c_{2j}), \ldots, (c_{k_1 1}, c_{k_1 2}, \ldots, c_{k_1 t})\}$.
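Step4-1 to Step4-5 can be sketched in Python as below. Several details are assumptions rather than statements of the method: the seeds stay pinned to their own clusters, the similarity of a category to a cluster is the maximum similarity to any current member (in the spirit of formula (8)), convergence is checked on the mean-weight centers, and the category names, weights and similarity function are made up.

```python
def seeded_kmeans(weights, sim, k, max_iter=50):
    """Cluster semantic categories, seeding with the k largest weights."""
    # Step4-1: the k categories with the largest weights are the preset seeds.
    seeds = sorted(weights, key=weights.get, reverse=True)[:k]
    clusters = [[s] for s in seeds]
    centers = None
    for _ in range(max_iter):
        new_clusters = [[s] for s in seeds]  # assumption: seeds stay in place
        # Step4-2: assign every other category to the most similar cluster.
        for cat in weights:
            if cat in seeds:
                continue
            best = max(range(k),
                       key=lambda i: max(sim(cat, m) for m in clusters[i]))
            new_clusters[best].append(cat)
        # Step4-3: cluster center = mean weight of its categories.
        new_centers = [sum(weights[c] for c in cl) / len(cl)
                       for cl in new_clusters]
        clusters = new_clusters
        # Step4-4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers

def largest_clusters(clusters, k1):
    """Step4-5: keep the k1 clusters containing the most categories."""
    return sorted(clusters, key=len, reverse=True)[:k1]

# Hypothetical category weights and a toy two-group similarity.
weights = {"water": 9.0, "economy": 8.0, "soil": 3.0, "data": 2.0, "plant": 1.0}
NATURE = {"water", "soil", "plant"}

def toy_sim(x, y):
    return 0.9 if (x in NATURE) == (y in NATURE) else 0.1

clusters, centers = seeded_kmeans(weights, toy_sim, k=2)
print(clusters, centers)
print(largest_clusters(clusters, k1=1))
```

Because the seeds are chosen deterministically from the weights, repeated runs on the same input give the same clusters, in contrast to randomly initialised K-means.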
Further, the multi-subject extraction method based on semantic categories also comprises:
Obtain concept similarity by computing sememe distances, and represent the similarity of the corresponding semantic categories by concept similarity;
Represent the similarity between a semantic category and a cluster of semantic categories by the similarity between a concept and a concept set.
Further, representing the similarity between a semantic category and a cluster of semantic categories by the similarity between a concept and a concept set may comprise the following steps:
The semantic knowledge base describes a concept by several sememes, and the sememes form a tree-shaped sememe hierarchy according to hypernym-hyponym relations;
Obtain the similarity between sememes by computing their distance in the tree hierarchy;
Because the semantics of a concept are described by a group of sememes, compute the similarity between concepts from the similarities of their sememes;
Compute the similarity between a concept and every concept in a concept set, and take the largest value as the similarity between the concept and the concept set.
Further, obtaining the distance between concepts, and in turn their similarity, by computing sememe distances may comprise the following steps:
Suppose the path distance between two sememes in the sememe tree hierarchy is $d$; the distance $d$ is computed as follows:
Let $w_i$ be any sememe in the sememe set, $L_i$ the depth of sememe $w_i$ in the concept tree, $a$ the initial distance threshold, and $b$ a positive real number satisfying the inequality $\max(L) < a/b$. The distance between $w_i$ and its parent node is:
$$d(w_i, \mathrm{parent}(w_i)) = a - L_i \cdot b$$
The distance between any two sememes $w_i$ and $w_j$ is defined as:
$$d(w_i, w_j) = \omega_k \cdot [a - \max(L_i, L_j) \cdot b]$$
where $\omega_k$ is the weight of the $k$-th kind of relation, usually $\omega_k \ge 1$,
The semantic similarity between any two sememes $(w_i, w_j)$ is:
$$\mathrm{Sim}(w_i, w_j) = \frac{\theta}{d(w_i, w_j) + \theta}$$
where the distance $d$ is the path distance between $w_i$ and $w_j$ in the sememe hierarchy, a positive integer, and $\theta$ is an adjustable parameter;
Concepts $U$ and $V$ are described by their respective sememe groups $(p_{u1}, p_{u2}, \ldots, p_{un})$ and $(p_{v1}, p_{v2}, \ldots, p_{vm})$, and the similarity of $U$ and $V$ is:
$$\mathrm{Sim}(U, V) = \frac{(U, V)}{\sqrt{(U, U) \cdot (V, V)}}$$
where $(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Sim}(p_{ui}, p_{vj})$,
Concept $U$ is represented by the sememe group $(p_1, p_2, \ldots, p_n)$ and the concept set $G$ consists of the concepts $\{G_1, G_2, \ldots, G_m\}$; the similarity between concept $U$ and concept set $G$ is defined as the maximum of the similarities between $U$ and all concepts in $G$:
$$\mathrm{Sim}(U, G) = \max\{\mathrm{Sim}(U, G_i) \mid G_i \in G\}.$$
The invention provides a multi-subject extraction method based on the concept vector model. The method uses the HowNet semantic knowledge base, merges synonyms through the correspondence between word senses and concepts, mines the mapping between word senses and semantic categories within the same context to disambiguate polysemous words, and constructs a concept vector to represent the text. Semantic similarity is expressed by computing concept similarity. In the multi-subject word extraction algorithm, the improved K-means algorithm clusters the text concepts into several sub-subject clusters, and the correspondence between concepts and the keywords of the original text is then used to obtain several subject keyword sets in reverse. The K-means algorithm is improved by the "preset seed" method to remedy the unstable time and space cost and the large fluctuation of results caused by the random selection of the K initial centers in the traditional K-means algorithm.
The present invention is further described below with reference to Fig. 2. It should be understood that these embodiments are only intended to illustrate the invention and not to limit its scope; after reading the present invention, modifications by those skilled in the art to various equivalent forms of the invention all fall within the scope defined by the claims of this application.
Fig. 2 is a schematic flow chart, with HowNet as the semantic knowledge base, of the multi-subject extraction method based on the concept vector model according to the technical solution of the present invention. As shown in Fig. 2: first, the text T to be processed is input; then T is preprocessed, for example by segmenting it with the ICTCLAS word segmentation system, removing stop words and noise, and performing a preliminary feature extraction with information gain (IG); then T is represented with the vector space model; then T is mapped to the concept space model, for example, monosemous words and synonyms are mapped directly, and polysemous words are disambiguated according to the correlation between semantic categories and the context; then the concepts are clustered with the improved K-means algorithm, for example K-means improved by the preset "seed" method, with similarity computed from concept semantics; finally, several sub-subject word sets are obtained in reverse from the correspondence between concepts and words.
Experiments and analysis of results: The experimental data of the present invention come from the standard corpus published by the natural language processing laboratory of Fudan University, which comprises 20 categories and 19637 texts in total, none of them annotated with subjects. In view of the workload, the experiments here select from 5 categories of this corpus 500 longer texts with fairly obvious multi-subject features; their subjects were annotated by professionals working in Chinese language, and they serve as the experimental sample. The results are evaluated with the commonly used precision (P), recall (R) and the combined measure F1:
$$F_1 = \frac{2PR}{P + R} \tag{11}$$
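Formula (11) can be computed as in the short Python sketch below. Treating precision and recall as set overlaps between the extracted subject words and the manually annotated ones is an assumption about the evaluation protocol, and the word sets are made up.

```python
def precision_recall_f1(extracted, reference):
    """Return (P, R, F1) for extracted versus annotated subject words."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)  # correctly extracted subject words
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0  # formula (11)
    return p, r, f1

# Hypothetical extracted and annotated subject words for one text.
extracted = {"plant", "soil", "economy", "report"}
reference = {"plant", "soil", "sunlight"}

p, r, f1 = precision_recall_f1(extracted, reference)
print(p, r, f1)
```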
Parameter estimation: To obtain the most appropriate value of the initial cluster number parameter k in the improved K-means algorithm, Chinese-language professionals analyzed the actual length, text structure and other properties of the test samples. The number of sub-subjects k1 per sample text was set to 3, and 3 sub-subjects were annotated manually for each sample as the reference standard. The value of k was then analyzed experimentally with k1 = 3; Fig. 4 shows how precision (P), recall (R) and F1 vary for different values of k.
Fig. 4 is a schematic graph of the variation of precision, recall and F1 with k for the multi-subject extraction method based on the concept vector model according to the technical solution of the present invention. As shown in Fig. 4, with 3 sub-subjects per sample, the precision of the subjects extracted by the improved K-means algorithm rises continuously as k increases, while recall falls. Precision rises because a larger k makes the clusters finer. The recall of an algorithm is normally fixed, but in this experiment the classes become ever finer as k increases, so that choosing the 3 (k1 = 3) largest sub-subjects lowers the recall. To find the most suitable k, the F1 curve in Fig. 4 is analyzed: the peak of F1 occurs at k = 7, so the optimal value for Algorithm 2 on this experimental sample is k = 7. Note that the value of k depends on the text being processed.
Algorithm test: To test the quality of multi-subject extraction by the K-means algorithm improved with the "preset seed" method, the experimental sample is again the 500 prepared texts. Using the result of the parameter estimation above, k = 7 and the number of sub-subjects k1 = 3. The traditional K-means algorithm, with k initial centers generated at random, was first run 5 times; its subject extraction results and those of the improved K-means are summarized in Table 1:
Table 1: Multi-subject extraction result statistics of K-means and improved K-means
As can be seen from Table 1, over the 5 runs with randomly generated initial centers, the precision, recall and F1 of the traditional K-means are all very unstable and its running time varies considerably. This is because the traditional K-means algorithm is sensitive to the initial cluster centers, so that both results and running time fluctuate strongly with different initial inputs. To eliminate this defect, the present invention exploits a feature of subject extraction: each subject usually contains several words with the same semantic concept. Based on the number of words in the text corresponding to each concept, which implicitly indicates the semantic centers of the subjects in the text, the K most probable initial centers are set in advance. K-means improved in this way not only extracts subjects of higher quality but also runs much more efficiently.
Those skilled in the art will appreciate that the steps, measures and schemes in the various operations, methods and flows discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art that correspond to those in the various operations, methods and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A multi-subject extraction method based on semantic categories, characterized by comprising the following steps:
Step1: Vector model representation: preprocess the text to obtain a vector composed of feature words, and use the vector space model to represent the preprocessed text as a vector of feature words;
Step2: Concept model mapping: based on a semantic knowledge base that expresses the semantics of natural-language words with concepts and represents the semantic relations between concepts in a tree structure, use the correspondence between word senses and the concepts to map the feature words of the preprocessed text to concepts. During concept mapping, synonyms in the text are merged automatically; then the polysemous words occurring in the text are disambiguated according to the correlation between semantic categories and the context; finally, the vector space model of the text after merging and disambiguation is mapped to a concept space model;
Step3: Semantic category model conversion: because, by the way concepts are expressed in the semantic knowledge base and by the definition of a semantic category, concepts and semantic categories correspond one-to-one, convert the text represented by the concept model into a semantic category model;
Step4: Multi-subject word extraction: use the improved K-means algorithm to cluster all semantic categories of the text represented as a semantic category model, forming several subject semantic category clusters; then, from the correspondence between semantic categories and concepts and between concepts and the feature words of the original text, obtain several subject feature word sets in reverse, thereby extracting the multi-subject words of a single Chinese text.
2. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that Step1 further comprises the following steps:
Step1-1: Segment the text T to be processed with a word segmentation system, then remove stop words and noise to obtain the primary vector space model of the text, $T = \{C_1, C_2, \ldots, C_n\}$, where $C_1, C_2, \ldots, C_n$ denote the $n$ feature words composing the vector; stop-word removal means filtering out the stop words occurring in the text, and noise removal means filtering out the words without practical meaning occurring in the text;
Step1-2: Further extract feature vectors from the primary vector space model to obtain the advanced vector space model of the text, $T = \{C_1, C_2, \ldots, C_m\}$, where $m \le n$.
3. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that Step2 further comprises the following steps:
Step2-1: Query the semantic knowledge base in turn for every feature word in the text T to be processed and carry out concept mapping;
Step2-1-1: Look up the knowledge base; if the feature word $C_m$ of T corresponds to a unique concept, $C_m$ is a monosemous word or a synonym; obtain the concept of $C_m$ directly and go to Step2-2;
Step2-1-2: Look up the knowledge base; if the feature word $C_m$ of T corresponds to several concepts, $C_m$ is a polysemous word; perform word sense disambiguation on $C_m$ and select the concept that fits the context of the text;
Step2-2: Obtain the concept vector of text T, $T = \{(G_1, C_1), (G_2, C_2), \ldots, (G_q, C_q)\}$;
Step2-3: Further arrange the output by concept to obtain the concept vector of text T, $T = \{(G_1, (C_1, \ldots, C_i)), (G_2, (C_2, \ldots, C_j)), \ldots, (G_q, (C_q, \ldots, C_k))\}$, where $(C_q, \ldots, C_k)$ are the words occurring in the text that correspond to concept $G_q$.
4. The multi-subject extraction method based on semantic categories according to claim 3, characterized in that step Step2-1-2 may comprise the following steps:
Step2-1-2-1: search the knowledge base; polysemous word C_m corresponds to multiple concepts, and the basic sememe set that describes the semantics of each concept forms a semantic category, so C_m corresponds to multiple semantic categories; the words that describe each basic sememe set can thus be obtained, and they form a group of semantically related words that reflects the semantic category;
Step2-1-2-2: compute the information quantity in the text of each member word of every semantic category of polysemous word C_m; the information quantity H(w_i) of member word w_i in the text is computed as:
H(w_i) = -TF(w_i, ST) × log[p(w_i)],
where TF(w_i, ST) is the frequency with which word w_i occurs in the text, ST denotes the text, and p(w_i) is the probability distribution of word w_i;
Step2-1-2-3: compute the weight of each semantic category of polysemous word C_m; the weight of its i-th semantic category L_i is:
where n is the number of member words of semantic category L_i that occur in the text; the larger the weight of a semantic category, the greater the contribution of its member words to the semantics of the text;
Step2-1-2-4: select for polysemous word C_m the optimal concept for the context of the text, by the following formula:
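The information-quantity formula of Step2-1-2-2 can be sketched in Python as below. The weight and selection formulas of Step2-1-2-3 and Step2-1-2-4 are not reproduced in the text, so this sketch makes two assumptions that should be read as such: the weight of a semantic category is taken to be the sum of the information quantities of its member words that occur in the text, and the chosen concept is the one whose semantic category has the largest weight. TF is taken as a raw count, and the probabilities are invented example values.

```python
import math
from collections import Counter


def information(word, counts, prob):
    """H(w) = -TF(w, ST) * log p(w), with TF taken as a raw count."""
    return -counts[word] * math.log(prob[word])


def category_weight(members, counts, prob):
    # ASSUMPTION: sum of H(w) over member words that occur in the text.
    return sum(information(w, counts, prob) for w in members if counts[w] > 0)


def pick_concept(candidates, counts, prob):
    # ASSUMPTION: choose the concept whose semantic category weighs most.
    return max(candidates, key=lambda c: category_weight(candidates[c], counts, prob))


# Hypothetical text and word probabilities.
counts = Counter(["bank", "money", "loan", "money", "river"])
prob = {"bank": 0.1, "money": 0.05, "loan": 0.02, "river": 0.05}
# Candidate concepts of the polysemous word "bank" with their member words.
candidates = {
    "financial_institution": ["money", "loan", "deposit"],
    "river_side": ["river", "water"],
}
chosen = pick_concept(candidates, counts, prob)
print(chosen)  # financial_institution
```

Member words that never occur in the text ("deposit", "water") contribute nothing, so the context words actually present decide the sense.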
5. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that step Step3 further comprises the following steps:
According to the concept vector of the text T to be processed obtained in Step2, T = {(G_1, (C_1, ..., C_i)), (G_2, (C_2, ..., C_j)), ..., (G_q, (C_q, ..., C_k))}, query the semantic knowledge base and convert T into a semantic category model, in which each component of T is represented by a five-tuple;
Further, the five-tuple has the form (L_i, w_i, (C_1, ..., C_k), G_i, (C_1, ..., C_i)), where L_i is the semantic category, w_i is the weight of L_i, (C_1, ..., C_k, C_1, ..., C_i) are the member words of L_i, and G_i is the concept corresponding to L_i;
Further, the semantic category weight w_i is obtained by first computing the information quantity of all member words of semantic category L_i with the word information quantity formula of Step2-1-2-2 and then computing the weight of L_i with the semantic category weight formula of Step2-1-2-3.
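One possible in-memory representation of the five-tuple is sketched below. The field names are hypothetical, and the reading that the first word group holds the sememe-set words from the knowledge base while the second holds the words found in the text is an interpretation of the claim, not something the text states explicitly.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SemanticCategory:
    label: str                      # L_i, the semantic category
    weight: float                   # w_i, the weight of L_i
    sememe_words: Tuple[str, ...]   # first word group (assumed: knowledge-base words)
    concept: str                    # G_i, the concept corresponding to L_i
    text_words: Tuple[str, ...]     # second word group (assumed: words in the text)

    @property
    def members(self) -> Tuple[str, ...]:
        """All member words of the category, both groups together."""
        return self.sememe_words + self.text_words


# Invented example values.
category = SemanticCategory(
    label="L1",
    weight=9.9,
    sememe_words=("money", "loan"),
    concept="financial_institution",
    text_words=("bank",),
)
print(category.members)  # ('money', 'loan', 'bank')
```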
6. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that the improved K-means algorithm further comprises the following steps:
Cluster the semantic categories in the text to be processed to form multiple subject semantic category clusters;
Select the classical K-means clustering algorithm and improve it by the preset-seed method.
7. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that selecting the classical K-means clustering algorithm and improving it by the preset-seed method further comprises the following steps:
Description of the micro-topic phenomenon: a semantic category consists of one word or several synonyms in the text together with the words of the basic sememe set that corresponds to them in the semantic knowledge base; because the basic sememe set describes the main semantics of the concept, the senses of the member words of a semantic category are very strongly correlated, and together they reflect one micro-topic;
According to the micro-topic phenomenon, several micro-topics appear in a text; the K micro-topics with the largest information quantity in the text are selected as "preset seeds" and become the initial centers of the K-means algorithm, in order to overcome defects such as the sensitivity of the K-means algorithm to its initial centers and its unstable time and space cost;
The information quantity of a micro-topic is reflected by the weight of its semantic category: once the text to be processed is represented as the semantic category model, the K semantic categories with the largest weights are the K micro-topics with the largest information quantity.
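Selecting the preset seeds amounts to taking the K heaviest semantic categories. A minimal Python sketch, assuming the weights are already available as a dictionary (the function name `preset_seeds` and the example weights are hypothetical):

```python
import heapq


def preset_seeds(weights, k):
    """Return the k semantic categories with the largest weights,
    heaviest first, to serve as initial K-means centers."""
    top = heapq.nlargest(k, weights.items(), key=lambda kv: kv[1])
    return [name for name, _ in top]


# Invented semantic category weights.
weights = {"finance": 9.9, "river": 3.0, "health": 6.2, "sport": 1.1}
seeds = preset_seeds(weights, k=2)
print(seeds)  # ['finance', 'health']
```

Because the seeds are chosen deterministically from the weights, repeated runs on the same text start from the same centers, which is the stability benefit the claim attributes to preset seeds.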
8. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that step Step4 further comprises the following steps:
Step4-1: select from the semantic category model of the text T to be processed the K semantic categories with the largest weights as the initial cluster centers;
Step4-2: compute the similarity between every other semantic category in text T and the K cluster centers, then assign each semantic category to the cluster with which its similarity is greatest; computing this similarity comprises computing the similarity between two semantic categories and the similarity between a semantic category and a set of semantic categories;
Step4-3: recompute the center point of each cluster; the center point of cluster LL is computed by the following formula:
where w_i is the semantic category weight and n is the number of semantic categories contained in the semantic category set;
Step4-4: repeat Step4-2 and Step4-3 until the cluster centers no longer change, obtaining K semantic category sets: {{Φ_1}, {Φ_2}, ..., {Φ_K}};
Step4-5: select the k_1 semantic category sets that contain the most semantic categories, obtaining the semantic category sets that form k_1 sub-topics: {{Φ_1}, {Φ_2}, ..., {Φ_k1}}; then, working backward through the correspondences between semantic categories and concepts and between concepts and feature words, obtain k_1 sets of sub-topic keywords: {(c_11, c_12, ..., c_1i), (c_21, c_22, ..., c_2j), ..., (c_k1,1, c_k1,2, ..., c_k1,t)}.
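The loop of Step4-1 to Step4-5 can be sketched as follows. The center formula of Step4-3 is not reproduced in the text, so this sketch substitutes a different, clearly labelled technique: a weighted medoid, i.e. the member whose weight-weighted total similarity to the other members is largest. The function names, the toy weights, and the toy similarity function are all hypothetical.

```python
def cluster(weights, sim, k, max_iter=100):
    """Seeded clustering of semantic categories.

    weights: dict category -> weight; sim(a, b) -> similarity.
    """
    items = list(weights)
    # Step4-1: the k heaviest categories are the initial centers.
    centers = sorted(items, key=lambda c: weights[c], reverse=True)[:k]
    for _ in range(max_iter):
        # Step4-2: assign every other category to its most similar center.
        clusters = {c: [c] for c in centers}
        for item in items:
            if item in clusters:
                continue
            best = max(centers, key=lambda c: sim(item, c))
            clusters[best].append(item)
        # Step4-3 (ASSUMPTION: weighted medoid, not the patent's formula).
        new_centers = [
            max(members, key=lambda m: sum(weights[o] * sim(m, o) for o in members))
            for members in clusters.values()
        ]
        # Step4-4: stop once the centers no longer change.
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return list(clusters.values())


def top_subtopics(clusters, k1):
    """Step4-5: keep the k1 clusters with the most semantic categories."""
    return sorted(clusters, key=len, reverse=True)[:k1]


# Toy data: two obvious groups of semantic categories.
group = {"money": "f", "loan": "f", "bank_fin": "f", "river": "r", "water": "r"}
weights = {"money": 5.0, "river": 4.0, "loan": 2.0, "water": 1.5, "bank_fin": 1.0}


def sim(a, b):
    if a == b:
        return 1.0
    return 0.8 if group[a] == group[b] else 0.1


result = cluster(weights, sim, k=2)
print(top_subtopics(result, 1))  # [['money', 'loan', 'bank_fin']]
```

Mapping each retained cluster back to keywords would then follow the category-to-concept and concept-to-word correspondences built in the earlier steps.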
9. The multi-subject extraction method based on semantic categories according to claim 1, characterized in that it further comprises:
Obtaining concept similarity by computing sememe distance, and representing the similarity of the corresponding semantic categories by the concept similarity;
Representing the similarity between a semantic category and a semantic category cluster by computing the similarity between a concept and a concept set.
10. The multi-subject extraction method based on semantic categories according to claim 9, characterized in that representing the similarity between a semantic category and a semantic category cluster by computing the similarity between a concept and a concept set comprises the following steps:
Let d be the path distance between two sememes in the sememe tree hierarchy; d is computed as follows:
Let w_i be any sememe in the sememe set and L_i the depth of sememe w_i in the concept tree; let a be the initial distance threshold and b a positive real number satisfying the inequality max(L) < a/b; the distance between w_i and its parent node is:
d(w_i, parent(w_i)) = a - L_i·b
The distance between any two sememes w_i and w_j is defined as:
d(w_i, w_j) = ω_k·[a - max(L_i, L_j)·b]
where ω_k is the weight corresponding to the k-th kind of relation, usually taken as ω_k >= 1;
The semantic similarity between any two sememes (w_i, w_j) is as follows:
where the distance d is the path length between w_i and w_j in the sememe hierarchy, a positive integer, and θ is an adjustable parameter;
Concepts U and V are described by their respective sememe groups (p_u1, p_u2, ..., p_un) and (p_v1, p_v2, ..., p_vm); the similarity of U and V is:
where,
Concept U is represented by the sememe group (p_1, p_2, ..., p_n), and concept set G consists of the concepts {G_11, G_21, ..., G_m1}; the similarity between concept U and concept set G is defined as the maximum of the similarities between U and all concepts in G:
Sim(U, G) = Max{Sim(U, G_i) | G_i ∈ G}.
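The two distance formulas and the concept-to-set similarity above translate directly into code. The sememe similarity formula itself is not reproduced in the text; the sketch below assumes the commonly used form θ / (d + θ), which should be read as an assumption rather than as the patent's formula. The concept-to-concept similarity is likewise missing, so it is passed in as a callback, and all names and example values are hypothetical.

```python
def parent_distance(depth, a, b):
    """d(w_i, parent(w_i)) = a - L_i * b."""
    return a - depth * b


def sememe_distance(depth_i, depth_j, a, b, omega=1.0):
    """d(w_i, w_j) = omega_k * [a - max(L_i, L_j) * b]."""
    return omega * (a - max(depth_i, depth_j) * b)


def sememe_similarity(d, theta):
    # ASSUMPTION: theta / (d + theta); the source formula is not shown.
    return theta / (d + theta)


def concept_set_similarity(u, concept_set, concept_sim):
    """Sim(U, G) = max over G_i in G of Sim(U, G_i)."""
    return max(concept_sim(u, g) for g in concept_set)


# Example with a = 10 and b = 1, so max(L) < a/b holds for depths below 10.
d = sememe_distance(2, 5, a=10.0, b=1.0)
print(d)  # 5.0

# Toy concept similarities, invented for the example.
table = {("U", "G1"): 0.2, ("U", "G2"): 0.7, ("U", "G3"): 0.4}
best = concept_set_similarity("U", ["G1", "G2", "G3"], lambda u, g: table[(u, g)])
print(best)  # 0.7
```

The constraint max(L) < a/b keeps every distance positive, since a - L·b > 0 for all admissible depths.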
CN201410178218.6A 2014-04-29 2014-04-29 Multi-subject extracting method based on semantic categories Expired - Fee Related CN103970729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410178218.6A CN103970729B (en) 2014-04-29 2014-04-29 A kind of multi-threaded extracting method based on semantic category

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410178218.6A CN103970729B (en) 2014-04-29 2014-04-29 A kind of multi-threaded extracting method based on semantic category

Publications (2)

Publication Number Publication Date
CN103970729A true CN103970729A (en) 2014-08-06
CN103970729B CN103970729B (en) 2016-08-24

Family

ID=51240247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410178218.6A Expired - Fee Related CN103970729B (en) 2014-04-29 2014-04-29 Multi-subject extracting method based on semantic categories

Country Status (1)

Country Link
CN (1) CN103970729B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104298709A (en) * 2014-09-05 2015-01-21 上海中和软件有限公司 Text theme mining method based on intra-sentence association graph
CN104484411A (en) * 2014-12-16 2015-04-01 中国科学院自动化研究所 Building method for semantic knowledge base based on a dictionary
CN105718440A (en) * 2014-12-03 2016-06-29 南开大学 Text semantic representation method based on aggregation weighting matrix compression algorithm
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN106708969A (en) * 2016-12-02 2017-05-24 山西大学 Co-occurrence latent semantic vector space model semantic core method based on literature resource topic clustering
CN106844328A (en) * 2016-08-23 2017-06-13 华南师范大学 A kind of new extensive document subject matter semantic analysis and system
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution
CN107105349A (en) * 2017-05-17 2017-08-29 东莞市华睿电子科技有限公司 A kind of video recommendation method
CN107153672A (en) * 2017-03-22 2017-09-12 中国科学院自动化研究所 User mutual intension recognizing method and system based on Speech Act Theory
CN107209760A (en) * 2014-12-10 2017-09-26 凯恩迪股份有限公司 The sub-symbol data coding of weighting
CN107436916A (en) * 2017-06-15 2017-12-05 百度在线网络技术(北京)有限公司 The method and device of intelligent prompt answer
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN107633472A (en) * 2017-10-31 2018-01-26 广州努比互联网科技有限公司 It is a kind of based on teaching event stream and the internet learning method discussed immediately
CN107832288A (en) * 2017-09-27 2018-03-23 中国科学院自动化研究所 The measure and device of Chinese word semantic similarity
CN108766581A (en) * 2018-05-07 2018-11-06 上海市公共卫生临床中心 The key message method for digging and assistant diagnosis system of health medical treatment data
CN108804410A (en) * 2017-05-05 2018-11-13 北京数洋智慧科技有限公司 A kind of semantic interpretation method based on artificial intelligence text semantic similarity analysis
CN108804641A (en) * 2018-06-05 2018-11-13 鼎易创展咨询(北京)有限公司 A kind of computational methods of text similarity, device, equipment and storage medium
CN109033307A (en) * 2018-07-17 2018-12-18 华北水利水电大学 Word polyarch vector based on CRP cluster indicates and Word sense disambiguation method
CN109710921A (en) * 2018-12-06 2019-05-03 深圳市中农易讯信息技术有限公司 Calculation method, device, computer equipment and the storage medium of Words similarity
CN110196905A (en) * 2018-02-27 2019-09-03 株式会社理光 It is a kind of to generate the method, apparatus and computer readable storage medium that word indicates
JP2019159918A (en) * 2018-03-14 2019-09-19 富士通株式会社 Clustering program, clustering method, and clustering apparatus
CN110442863A (en) * 2019-07-16 2019-11-12 深圳供电局有限公司 A kind of short text semantic similarity calculation method and its system, medium
CN110457325A (en) * 2019-08-12 2019-11-15 北京百度网讯科技有限公司 Method and apparatus for output information
CN110929529A (en) * 2019-11-29 2020-03-27 长沙理工大学 Text clustering method based on synonym forest semantic similarity
CN113326411A (en) * 2020-02-28 2021-08-31 中国移动通信集团福建有限公司 Network behavior knowledge enhancement method and device and electronic equipment
TWI744751B (en) * 2019-03-20 2021-11-01 日商斯庫林集團股份有限公司 Synonym judging method, computer-readable recording medium with synonym judging program recorded, and synonym judging device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079024B (en) * 2006-06-19 2010-06-16 腾讯科技(深圳)有限公司 Special word list dynamic generation system and method
CN103020111B (en) * 2012-10-29 2015-06-17 苏州大学 Image retrieval method based on vocabulary tree level semantic model
CN102945228B (en) * 2012-10-29 2016-07-06 广西科技大学 A kind of Multi-document summarization method based on text segmentation technology

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104298709A (en) * 2014-09-05 2015-01-21 上海中和软件有限公司 Text theme mining method based on intra-sentence association graph
CN105718440B (en) * 2014-12-03 2019-01-29 南开大学 Text semantic representation method based on polymerization weighting matrix compression algorithm
CN105718440A (en) * 2014-12-03 2016-06-29 南开大学 Text semantic representation method based on aggregation weighting matrix compression algorithm
US11061952B2 (en) 2014-12-10 2021-07-13 Kyndi, Inc. Weighted subsymbolic data encoding
CN107209760A (en) * 2014-12-10 2017-09-26 凯恩迪股份有限公司 The sub-symbol data coding of weighting
CN104484411B (en) * 2014-12-16 2017-12-22 中国科学院自动化研究所 A kind of construction method of the semantic knowledge-base based on dictionary
CN104484411A (en) * 2014-12-16 2015-04-01 中国科学院自动化研究所 Building method for semantic knowledge base based on a dictionary
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution
CN106980867B (en) * 2016-01-15 2022-04-15 奥多比公司 Modeling semantic concepts in an embedding space as distributions
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN105955948B (en) * 2016-04-22 2018-07-24 武汉大学 A kind of short text theme modeling method based on semanteme of word similarity
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN106844328B (en) * 2016-08-23 2020-04-21 华南师范大学 Large-scale document theme semantic analysis method and system
CN106844328A (en) * 2016-08-23 2017-06-13 华南师范大学 A kind of new extensive document subject matter semantic analysis and system
CN106708969A (en) * 2016-12-02 2017-05-24 山西大学 Co-occurrence latent semantic vector space model semantic core method based on literature resource topic clustering
CN106708969B (en) * 2016-12-02 2020-01-10 山西大学 Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence
CN107153672A (en) * 2017-03-22 2017-09-12 中国科学院自动化研究所 User mutual intension recognizing method and system based on Speech Act Theory
CN108804410B (en) * 2017-05-05 2022-03-29 北京数洋智慧科技有限公司 Semantic interpretation method based on artificial intelligence text semantic similarity analysis
CN108804410A (en) * 2017-05-05 2018-11-13 北京数洋智慧科技有限公司 A kind of semantic interpretation method based on artificial intelligence text semantic similarity analysis
CN107105349A (en) * 2017-05-17 2017-08-29 东莞市华睿电子科技有限公司 A kind of video recommendation method
CN107436916A (en) * 2017-06-15 2017-12-05 百度在线网络技术(北京)有限公司 The method and device of intelligent prompt answer
CN107832288B (en) * 2017-09-27 2020-06-16 中国科学院自动化研究所 Method and device for measuring semantic similarity of Chinese words
CN107832288A (en) * 2017-09-27 2018-03-23 中国科学院自动化研究所 The measure and device of Chinese word semantic similarity
CN107633472A (en) * 2017-10-31 2018-01-26 广州努比互联网科技有限公司 It is a kind of based on teaching event stream and the internet learning method discussed immediately
CN110196905A (en) * 2018-02-27 2019-09-03 株式会社理光 It is a kind of to generate the method, apparatus and computer readable storage medium that word indicates
JP7006402B2 (en) 2018-03-14 2022-01-24 富士通株式会社 Clustering program, clustering method and clustering device
JP2019159918A (en) * 2018-03-14 2019-09-19 富士通株式会社 Clustering program, clustering method, and clustering apparatus
CN108766581A (en) * 2018-05-07 2018-11-06 上海市公共卫生临床中心 The key message method for digging and assistant diagnosis system of health medical treatment data
CN108804641A (en) * 2018-06-05 2018-11-13 鼎易创展咨询(北京)有限公司 A kind of computational methods of text similarity, device, equipment and storage medium
CN108804641B (en) * 2018-06-05 2021-11-09 鼎易创展咨询(北京)有限公司 Text similarity calculation method, device, equipment and storage medium
CN109033307A (en) * 2018-07-17 2018-12-18 华北水利水电大学 Word polyarch vector based on CRP cluster indicates and Word sense disambiguation method
CN109033307B (en) * 2018-07-17 2021-08-31 华北水利水电大学 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method
CN109710921A (en) * 2018-12-06 2019-05-03 深圳市中农易讯信息技术有限公司 Calculation method, device, computer equipment and the storage medium of Words similarity
TWI744751B (en) * 2019-03-20 2021-11-01 日商斯庫林集團股份有限公司 Synonym judging method, computer-readable recording medium with synonym judging program recorded, and synonym judging device
CN110442863B (en) * 2019-07-16 2023-05-05 深圳供电局有限公司 Short text semantic similarity calculation method, system and medium thereof
CN110442863A (en) * 2019-07-16 2019-11-12 深圳供电局有限公司 A kind of short text semantic similarity calculation method and its system, medium
CN110457325A (en) * 2019-08-12 2019-11-15 北京百度网讯科技有限公司 Method and apparatus for output information
CN110929529A (en) * 2019-11-29 2020-03-27 长沙理工大学 Text clustering method based on synonym forest semantic similarity
CN110929529B (en) * 2019-11-29 2023-04-18 长沙理工大学 Synonym word Lin Yuyi similarity-based text clustering method
CN113326411A (en) * 2020-02-28 2021-08-31 中国移动通信集团福建有限公司 Network behavior knowledge enhancement method and device and electronic equipment
CN113326411B (en) * 2020-02-28 2024-05-03 中国移动通信集团福建有限公司 Network behavior knowledge enhancement method and device and electronic equipment

Also Published As

Publication number Publication date
CN103970729B (en) 2016-08-24

Similar Documents

Publication Publication Date Title
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN104008090A (en) Multi-subject extraction method based on concept vector model
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN107066553B (en) Short text classification method based on convolutional neural network and random forest
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN102419778B (en) Information searching method for discovering and clustering sub-topics of query statement
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN103473283B (en) Method for matching textual cases
Bouaziz et al. Short text classification using semantic random forest
CN105045875B (en) Personalized search and device
CN110674252A (en) High-precision semantic search system for judicial domain
CN104699766A (en) Implicit attribute mining method integrating word correlation and context deduction
CN103049569A (en) Text similarity matching method on basis of vector space model
CN101114298A (en) Method for gaining oral vocabulary entry, device and input method system thereof
CN109376352A (en) A kind of patent text modeling method based on word2vec and semantic similarity
CN104765779A (en) Patent document inquiry extension method based on YAGO2s
CN105447119A (en) Text clustering method
CN105335510A (en) Text data efficient searching method
Sebti et al. A new word sense similarity measure in WordNet
CN114997288A (en) Design resource association method
CN109299251A (en) A kind of abnormal refuse messages recognition methods and system based on deep learning algorithm
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN102314464B (en) Lyrics searching method and lyrics searching engine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160824

CF01 Termination of patent right due to non-payment of annual fee