CN104750777B

CN104750777B - Text annotation method and system

Info

Publication number: CN104750777B
Application number: CN201410850313.6A
Authority: CN
Inventors: 王勇; 张霞; 赵立军
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2014-12-31
Filing date: 2014-12-31
Publication date: 2018-04-06
Anticipated expiration: 2034-12-31
Also published as: CN104750777A

Abstract

The present invention provides a text annotation method and system. The method includes: constructing a concept graph from a domain ontology; building a topic model of a text collection from that collection, the topic model including a topic-word distribution; obtaining, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph; and, according to the topic-concept correlation, using a text-topic distribution to adjust the concept correlations of a text-concept distribution so as to complete the text annotation. The text-topic distribution is obtained by topic-annotating the text to be annotated based on the topic-word distribution, and the text-concept distribution is obtained by concept-annotating the text to be annotated based on the concept graph. The invention addresses the missed annotations and mis-annotations that occur in text annotation.

Description

Text annotation method and system

Technical field

The present invention relates to the technical field of text annotation, and more specifically to a text annotation method and system.

Background

With the spread of the mobile Internet and social networks, a large amount of user-generated content (UGC) is being produced. Because of differences in cultural background and habits of expression, people often use different words and phrasings to express similar content. The word-based inverted index that traditional search engines widely use to manage UGC therefore cannot reveal the inherent correlations within UGC, so these texts cannot be effectively maintained, retrieved, and recommended. Understanding the meaning of text at the semantic level has thus become necessary.

Natural language processing (NLP) techniques can be used to understand UGC in depth. However, because human natural language is complex, a truly deep understanding of UGC cannot yet be achieved, and such deep understanding is often unnecessary anyway. In fact, if the text is semantically annotated so that words are mapped to semantic concepts, then even a shallow analysis of UGC is enough to determine its distribution over the semantic concept space, which provides a practical basis for managing, searching, and recommending UGC.

In the field of text mining, topic analysis is a common method for semantically annotating text. As a statistical method based on unsupervised learning, topic analysis determines, for a given text collection and through its parameters, a number of latent topics. Each topic is a set of words, and each text can then be expressed as a probability distribution over multiple topics. Compared with the words of a bag-of-words model, the latent topics have far fewer dimensions, so word-level noise is effectively avoided.

Although topic analysis handles the high-frequency hot words in a text well, the topic model it uses adopts an exponential-distribution probability assumption and is therefore not suited to learning the large number of long-tail words contained in text. The topics formed by the topic model consequently do not include long-tail words, so when text is annotated with a topic model the long-tail words in the text cannot be annotated, which greatly reduces the usefulness of the annotation.

To solve the problem of identifying long-tail words, a concept graph can be introduced into the text annotation process. Although a concept graph can identify long-tail words, concept-graph algorithms lack an understanding of the overall tendency of a text. If a non-core word occurs many times, the system is misled into treating its corresponding concept as a key concept of the text, which causes mis-annotation; conversely, a core word that occurs only a few times may be missed altogether.

Summary of the invention

In view of the above problems, an object of the present invention is to provide a text annotation method and system that use a hybrid annotation approach combining a topic model and a concept graph, so as to solve the problems of missed annotations and mis-annotations in text annotation.

According to one aspect of the present invention, a text annotation method is provided, including:

constructing, from a domain ontology, a concept graph that carries the correlations between concepts; and

building, from a text collection, a topic model of the text collection, wherein the topic model includes the topic-word distribution corresponding to the text collection;

obtaining, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph;

using, according to the topic-concept correlation, a text-topic distribution to adjust the concept correlations of a text-concept distribution so as to complete the text annotation, wherein the text-topic distribution is obtained by topic-annotating the text to be annotated based on the topic-word distribution, and the text-concept distribution is obtained by concept-annotating the text to be annotated based on the concept graph.

According to another aspect of the present invention, a text annotation system is provided, including:

a concept graph construction unit, configured to construct, from a domain ontology, a concept graph that carries the correlations between concepts;

a topic model construction unit, configured to build, from a text collection, a topic model of the text collection, wherein the topic model includes the topic-word distribution corresponding to the text collection;

a topic-concept correlation acquisition unit, configured to obtain, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph;

a text annotation unit, configured to use, according to the topic-concept correlation, a text-topic distribution to adjust the concept correlations of a text-concept distribution so as to complete the text annotation; wherein

the text annotation unit includes:

a topic annotation module, configured to topic-annotate the text to be annotated based on the topic-word distribution to obtain the text-topic distribution;

a concept annotation module, configured to concept-annotate the text to be annotated based on the concept graph to obtain the text-concept distribution.

As the above technical solution shows, the text annotation method and system provided by the present invention use, according to the topic-concept correlation, the text-topic distribution that the topic model yields for the text to be annotated to adjust the text-concept correlations that the concept graph yields for the same text, thereby solving the problems of missed annotations and mis-annotations in text annotation.

To achieve the above and related objects, one or more aspects of the present invention include the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set out certain illustrative aspects of the invention in detail. These aspects, however, indicate only some of the various ways in which the principles of the invention may be used. Moreover, the invention is intended to include all such aspects and their equivalents.

Brief description of the drawings

Other objects and results of the present invention will become clearer and easier to understand by reference to the following description taken in conjunction with the accompanying drawings and the content of the claims, and as the invention is more fully understood. In the drawings:

Fig. 1 is a schematic flowchart of the text annotation method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the text annotation processing flow according to an embodiment of the present invention;

Fig. 3 is a partial view of the concept-concept correlations generated from Wikipedia concepts and the Google distance formula according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of an example of the correlation calculation process between a topic and concepts according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a first example of the correlation calculation result between a topic and concepts according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a second example of the correlation calculation result between a topic and concepts according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of an example of obtaining the text-concept distribution according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of an example of obtaining the text-topic distribution according to an embodiment of the present invention;

Fig. 9 is a schematic diagram of an example of the annotation adjustment calculation process according to an embodiment of the present invention;

Fig. 10 is a schematic diagram of an example of the annotation adjustment calculation result according to an embodiment of the present invention;

Fig. 11 is a block diagram of the logical structure of the text annotation system according to an embodiment of the present invention.

Identical reference numerals indicate similar or corresponding features or functions throughout the drawings.

Detailed description

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It will be evident, however, that these embodiments may also be practiced without these specific details.

A topic model can capture the overall tendency of a text but cannot annotate the long-tail words in it, whereas concept graph annotation can detect the presence of concepts, especially long-tail concepts, but suffers from missed annotations and mis-annotations. The present invention proposes a hybrid text annotation method that uses a topic model and a concept graph together, so that the topic model's advantage in understanding the overall tendency of a text is kept while the long-tail words covered by the concept graph are still annotated accurately.

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

To illustrate the text annotation method provided by the present invention, Fig. 1 shows the text annotation flow according to an embodiment of the present invention.

As shown in Fig. 1, the text annotation method provided by the present invention includes:

S110: constructing, from a domain ontology, a concept graph that carries the correlations between concepts; and

building, from a text collection, a topic model of the text collection, wherein the topic model includes the topic-word distribution corresponding to the text collection.

Specifically, the concept graph and the topic model are built. A concept graph carrying concept-concept correlations is constructed from the named relations and unnamed relations in the domain ontology; this concept graph is the basis for subsequently determining the topic-concept correlations.

In addition, the text collection is analyzed with a topic analysis method to build the topic model, which includes multiple topics and their topic-word distributions. That is, analyzing the text collection with a topic analysis method (for example, the LDA algorithm) generates the topic-word distributions of multiple topics.

S120: obtaining, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph.

Specifically, the concept corresponding to each word in each topic of the built topic model is obtained from the concept graph. Note that the constructed concept graph provides a concept query service through which the words in each topic can be mapped to their corresponding concepts.

Then, for each topic in the topic model, the correlations between the concepts corresponding to the high-frequency words of that topic and the concepts in the concept graph are calculated and summed, which gives the correlation between that topic and the concepts in the concept graph.

S130: using, according to the topic-concept correlation, a text-topic distribution to adjust the concept correlations of a text-concept distribution so as to complete the text annotation, wherein the text-topic distribution is obtained by topic-annotating the text to be annotated based on the topic-word distribution, and the text-concept distribution is obtained by concept-annotating the text to be annotated based on the concept graph.

Specifically, the text to be annotated is concept-annotated using the constructed concept graph to obtain its text-concept distribution, and is topic-annotated using the topic-word distribution of the built topic model to obtain its text-topic distribution.

When the text to be annotated is concept-annotated with the constructed concept graph, the weight of a concept is proportional to the number of times the concept occurs, and the weight of a concept is increased when it is correlated with other concepts. Based on the topic-word distribution in the built topic model, the text to be annotated is topic-annotated with a topic assessment algorithm to obtain the text-topic distribution.

The above process introduces the concept-topic correlation: a correlation relationship is established between each topic in the topic model and one or more concepts in the concept graph, which combines the two heterogeneous models. According to this concept-topic correlation, the degree to which a topic raises or lowers a concept annotation can then be calculated quantitatively.

The text annotation method of the present invention consists of two parts: Part I, building the topic-concept correlations; and Part II, the text annotation stage. The core of the proposed method is to establish a correlation relationship between the topic model and the concept graph (that is, to build the topic-concept correlations), so that the result of topic annotation can be used to adjust the concept annotation. The method is described in depth below.

Part I: Building the topic-concept correlations

1. Constructing the concept graph

The concept graph is constructed from a domain ontology, whose sources include:

a. semi-structured text similar to Wikipedia;

b. domain concept texts written by domain experts;

c. concept systems similar to WordNet.

Based on the above domain ontology, a concept graph c is constructed, where the concept graph c contains m concepts and at most [expression missing] concept-concept correlations.

For a concept system such as WordNet, the concept-concept correlations are already provided, so they need not be calculated again.

For cases a and b, the concept-concept correlations are calculated using the Google distance, which can be expressed by the following formula:

[formula missing]

where f(c₁) denotes the number of pages that reference concept c₁ (for case a), or the number of texts that contain c₁ (for case b);

f(c₁, c₂) denotes the number of pages that reference both c₁ and c₂ (for case a), or the number of texts that contain both c₁ and c₂ (for case b);

M denotes the total number of pages (for case a) or the total number of texts (for case b).
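
The formula image itself is not reproduced in this text. As a hedged sketch, the three quantities defined above are exactly those of the standard Normalized Google Distance, $\mathrm{NGD}(c_1, c_2) = \frac{\max(\log f(c_1), \log f(c_2)) - \log f(c_1, c_2)}{\log M - \min(\log f(c_1), \log f(c_2))}$. The Python functions below compute it from raw counts. The function names, and the `relatedness` mapping (1 minus the distance, clipped at 0), are assumptions for illustration and not the patent's stated formula.

```python
import math


def google_distance(f1, f2, f12, total):
    """Normalized Google Distance from raw counts.

    f1, f2 -- number of pages (or texts) referencing c1 and c2
    f12    -- number of pages (or texts) referencing both c1 and c2
    total  -- M, the total number of pages (or texts)
    """
    if f12 == 0:
        return math.inf  # the two concepts never co-occur
    log1, log2 = math.log(f1), math.log(f2)
    denom = math.log(total) - min(log1, log2)
    if denom == 0:
        return 0.0  # degenerate case: the rarer concept appears on every page
    return (max(log1, log2) - math.log(f12)) / denom


def relatedness(f1, f2, f12, total):
    """Hypothetical mapping of the distance to a correlation score in [0, 1]."""
    return max(0.0, 1.0 - google_distance(f1, f2, f12, total))
```

A smaller distance means the two concepts are referenced together more often, so a score such as `relatedness` grows as co-occurrence grows; whether the embodiment uses the distance directly or some transform of it cannot be determined from this text.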

2. Building the topic model

Using a topic analysis method (for example, the LDA algorithm), n topics of the domain are generated for a given text collection R, where n < |R|. Each topic k is defined as a multinomial distribution $\phi_k$ over the vocabulary, where $\sum_{w \in V} \phi_{k,w} = 1$, w denotes a word, and V denotes the vocabulary.

Note that in the LDA algorithm $\phi_k$ is an unknown latent variable that must be estimated from the words in the text collection, generally by learning the latent variables of LDA with an approximate inference algorithm. The original LDA paper (Blei02) uses a mean-field variational expectation-maximization algorithm, but the Gibbs sampling used in Griffiths02 is easier to understand, so Gibbs sampling is suggested for learning. Since this is outside the scope of the present invention, the LDA training algorithm is not described further.
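
As an illustration of the Gibbs sampling approach mentioned above, the following is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a sketch for small inputs, not the training procedure of the embodiment; the function name `lda_gibbs`, the use of integer word ids, and the hyperparameter values `alpha` and `beta` are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns phi, where phi[k][w] estimates the probability of word w in topic k.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)
    assign = []                                               # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = j | all other assignments), unnormalized.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimate of the topic-word distribution.
    return [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(n_topics)
    ]
```

Each row of the returned matrix sums to 1, matching the constraint on the topic-word distribution. For a real corpus an optimized library implementation would normally be preferred over this loop.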

3. Building the topic-concept correlations

For each topic, the correlations between the concept corresponding to each word in the topic and the concepts in the concept graph are first calculated.

These correlations are then summed, weighted by the topic-word distribution of each word in the topic, to obtain the topic-concept correlations.

That is, for each topic, the concept-concept correlations provided by the concept graph are used to calculate the correlation between the concept c(w) corresponding to each word w in the topic and any concept c in the concept graph constructed from the domain ontology. These correlations are then weighted by the distribution $\phi_{k,w}$ of word w in topic k to obtain the correlation P(c | k) between topic k and any domain concept c:

$$P(c \mid k) = \sum_{w \in V} \phi_{k,w} \, P(c \mid c(w))$$

where w denotes a word; c(w) denotes the concept corresponding to word w;

c denotes a concept in the concept graph; $\phi_{k,w}$ denotes the topic-word distribution; P(c | c(w)) denotes the correlation between concept c(w) and concept c; and P(c | k) denotes the topic-concept correlation.

Note that, when calculating the correlations between the concept corresponding to each word in each topic and the concepts in the concept graph, the concept corresponding to each word is obtained through the concept query service provided by the constructed concept graph. Obtaining the corresponding concept from a bilingual lexicon is described in detail in the prior art and is not repeated here.
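
The weighted sum described above can be sketched in Python as follows. The dictionary-based data layout, the names `word_to_concept` and `concept_corr`, and the choice to treat a concept as fully correlated with itself are assumptions made for illustration; the concept query service of the embodiment is represented here by a plain lookup table.

```python
def topic_concept_correlation(phi_k, word_to_concept, concept_corr, concepts):
    """Compute P(c | k) = sum over words w of phi_k[w] * P(c | c(w)) for one topic.

    phi_k           -- dict word -> probability of the word in topic k
    word_to_concept -- dict word -> concept id c(w) (hypothetical lookup table)
    concept_corr    -- dict (concept, concept) -> correlation from the concept graph
    concepts        -- iterable of all concept ids in the concept graph
    """
    result = {}
    for c in concepts:
        total = 0.0
        for w, p_w in phi_k.items():
            cw = word_to_concept.get(w)
            if cw is None:
                continue  # the word has no concept in the graph
            if cw == c:
                corr = 1.0  # assumption: a concept is fully correlated with itself
            else:
                # Correlations are treated as symmetric, so try both key orders.
                corr = concept_corr.get((cw, c), concept_corr.get((c, cw), 0.0))
            total += p_w * corr
        result[c] = total
    return result
```

For example, with a topic whose words map to two concepts, each concept's score is its own word probability plus the other word's probability scaled by the correlation between the two concepts.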

Part II: The text annotation stage

1. Concept annotation

The text to be annotated is concept-annotated using the constructed concept graph. The text d to be annotated can be concept-annotated with a simple word-frequency-based annotation algorithm or with the TextLink algorithm to obtain the text-concept distribution $P(c \mid d)$, where $\sum_{c} P(c \mid d) = 1$. This probability distribution represents the weights of the domain concepts that the document contains.

Concept graph annotation works by establishing a domain ontology that contains the concepts of the domain and multiple concept instances. Each concept has one or more names, concepts are linked by hypernym-hyponym relations, and concept instances are linked by various named or unnamed relations; analyzing these relations establishes the degree of correlation between concepts. Concept names are then used to identify the concepts discussed in a text. When a word corresponds to multiple concepts, the context of the text must be used to disambiguate the concept. The concept-concept correlations can also be used to determine the correlation of each annotated concept with all other concepts, and the text-concept distribution $P(c \mid d)$ of the text is then further determined from the ranking of these correlations.
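
A minimal sketch of the simple word-frequency-based variant of concept annotation is given below. The function name, the `name_to_concept` lookup table, and the omission of disambiguation and of the correlation-based weighting are all assumptions; the sketch only shows how occurrence counts become a normalized text-concept distribution.

```python
from collections import Counter


def concept_distribution(tokens, name_to_concept):
    """Frequency-based text-concept distribution.

    tokens          -- the tokenized text to annotate
    name_to_concept -- dict concept name -> concept id (hypothetical lookup;
                       disambiguation of names with several concepts is omitted)
    """
    counts = Counter(
        name_to_concept[t] for t in tokens if t in name_to_concept
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no known concept name occurs in the text
    # Weight of each concept is proportional to its number of occurrences.
    return {c: n / total for c, n in counts.items()}
```

This variant shows why pure concept annotation can go wrong: a non-core concept name that is repeated often receives a high weight, which is the situation the later adjustment step is meant to correct.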

2. Topic annotation

The text d to be annotated is topic-annotated with a topic assessment algorithm, using the topic-word distribution $\phi_k$ obtained from the built topic model, to obtain the text-topic distribution $\theta_d$ of text d, where $\sum_{k} \theta_{d,k} = 1$. This probability distribution represents the weights of the topics that the document contains.
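
The topic assessment algorithm is not specified in this text. As one possible stand-in, the sketch below estimates the text-topic distribution of a single text by a simple EM fold-in with the topic-word distributions held fixed. The function name, the integer word ids, and the choice of EM fold-in are assumptions and not the patent's stated method.

```python
def infer_topic_distribution(tokens, phi, iters=50):
    """Estimate the text-topic distribution of one text with phi held fixed.

    tokens -- word ids of the text to annotate
    phi    -- phi[k][w], probability of word w in topic k
    Uses a simple EM fold-in as an assumed stand-in for the unspecified
    topic assessment algorithm.
    """
    n_topics = len(phi)
    theta = [1.0 / n_topics] * n_topics  # start from a uniform distribution
    if not tokens:
        return theta
    for _ in range(iters):
        expected = [0.0] * n_topics
        for w in tokens:
            # E-step: responsibility of each topic for this token.
            joint = [theta[k] * phi[k][w] for k in range(n_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # the word has zero probability under every topic
            for k in range(n_topics):
                expected[k] += joint[k] / norm
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize the expected topic counts.
        theta = [e / total for e in expected]
    return theta
```

With two topics that own disjoint halves of the vocabulary, a text with three tokens from the first half and one from the second yields weights of 0.75 and 0.25.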

Note that, because there is no direct logical dependency between concept annotation and topic annotation, concept annotation can be performed first and topic annotation second, topic annotation can be performed first and concept annotation second, or the two can be performed at the same time.

3. Annotation adjustment

For each concept c in the text-concept distribution $P(c \mid d)$ obtained by concept annotation, the following formula is used to calculate the correlation between concept c and all topics T(d) of text d obtained by topic annotation, where E(c) is the expected value of concept c:

$$E(c) = \sum_{k \in T(d)} \theta_{d,k} \, P(c \mid k)$$

where E(c) denotes the expected value of concept c; c denotes each concept in the text-concept distribution; T(d) denotes all topics in the text d to be annotated; and $\theta_{d,k}$ denotes the text-topic distribution of the text d to be annotated.
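
The expected value can be computed directly from the text-topic weights and the topic-concept correlations, as in the following sketch. The dictionary layout and the function name are assumptions, and the numbers in the usage note are hypothetical rather than taken from the figures.

```python
def expected_concept_relevance(theta_d, topic_concept):
    """E(c) = sum over topics k in T(d) of theta_d[k] * P(c | k).

    theta_d       -- dict topic id -> weight of the topic in text d
    topic_concept -- dict topic id -> {concept id: P(c | k)}
    """
    expectation = {}
    for k, weight in theta_d.items():
        for c, p in topic_concept.get(k, {}).items():
            # Accumulate each topic's contribution to the concept's expectation.
            expectation[c] = expectation.get(c, 0.0) + weight * p
    return expectation
```

For instance, with hypothetical weights `{110: 0.9, 128: 0.1}` and correlations `{110: {"Cloud": 0.5}, 128: {"Cloud": 0.1, "Sandbox": 0.4}}`, the concept "Sandbox" receives a nonzero expectation only through topic 128, which is how a minor topic can support a long-tail concept.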

If the resulting E(c) is greater than the text-concept correlation $P(c \mid d)$, the correlation is raised using the following adjustment function:

[formula missing]

where $P(c \mid d)$ denotes the text-concept correlation; E(c) denotes the expected value of the concept;

κ is a scale factor, and the ratio between concept annotation and topic annotation is adjusted by changing the value of κ.

Each concept is then traversed to obtain the adjusted text-concept correlations $P'(c \mid d)$. Finally, normalization yields the final annotation result. Adjusting the ratio between concept annotation and topic annotation through κ lets the system keep the advantage that topic annotation captures the overall tendency of a text while ensuring that the long-tail words in the text to be annotated are not ignored.
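
The adjustment function itself is not reproduced in this text, so the sketch below substitutes a hypothetical rule: each text-concept weight is moved toward E(c) by linear interpolation with factor `kappa`, and the result is normalized. The interpolation rule, the function name, and the default value of `kappa` are assumptions chosen only to show the traverse-adjust-normalize sequence described above.

```python
def adjust_annotations(concept_dist, expectation, kappa=0.5):
    """Pull each text-concept weight toward its topic-based expectation E(c).

    Assumption: linear interpolation stands in for the adjustment function,
    which is not reproduced in the text. kappa = 0 keeps the pure concept
    annotation; kappa = 1 uses only the topic-based expectation.
    """
    adjusted = {}
    for c, p in concept_dist.items():
        e = expectation.get(c, 0.0)
        # Raises the weight when E(c) > p and lowers it when E(c) < p.
        adjusted[c] = p + kappa * (e - p)
    total = sum(adjusted.values())
    if total == 0.0:
        return adjusted
    # Normalize so that the adjusted weights again sum to 1.
    return {c: v / total for c, v in adjusted.items()}
```

Under this assumed rule, a concept whose weight exceeds its expectation is lowered and one whose expectation exceeds its weight is raised, which mirrors the qualitative behavior described for the scale factor κ.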

To further illustrate the overall text annotation flow, Fig. 2 shows the text annotation processing flow of an embodiment of the present invention.

In the embodiment shown in Fig. 2, the overall flow includes a topic-concept correlation building stage, a concept annotation stage, a topic annotation stage, and an annotation adjustment stage.

In the topic-concept correlation building stage, the concept graph is constructed from the domain ontology, and the LDA algorithm is trained on the text collection to generate the topic-word distribution. The topic-concept correlations are then built from the generated concept graph and topic-word distribution.

In the concept annotation stage, the text to be annotated is concept-annotated using the generated concept graph.

In the topic annotation stage, the text to be annotated is topic-annotated using the generated topic-word distribution.

In the annotation adjustment stage, the result of topic annotation is used, according to the built topic-concept correlations, to raise or lower the result of concept annotation, forming the final annotation result.

Example

Building the topic-concept correlations

Fig. 3 shows a fragment of the concept-concept correlations generated from Wikipedia concepts and the Google distance formula. As shown in Fig. 3, the concept graph is constructed from Wikipedia concepts and the concept-concept correlations are generated.

Then, a topic model construction algorithm is used to train a topic model on a set of news texts, with the target number of topics set to 250, to learn the topic-word distribution $\phi_k$. The topic-word distributions of topic 110 and topic 128 are as follows:

Topic 110 (total words 1428)

computing 206, cloud computing 173, China 76, conference 68, the 6th 48, forum 48, speech 39, welcome guest 30, innovation 28, big data 28, using 25, data 22, technology 22, center 21, openstack 21, general manager 18, everybody 16, subject under discussion 15, manager 15, it is believed 13, introduce 13, ibm 13, focus 13, traffic 12, meeting 12, cloud 10, Yu'e Bao 10, country 10, advantage 10, Beijing 10, manufacturer 10, new projects 9, the author 9, case 9, project 9, channel 8, title 8, aviation 8, aws 8

Topic 128 (total words 886)

safety 159, leak 75, software 30, network 24, discover 23, hacker 22, Microsoft 21, security 20, fund 19, influence 15, foundation 14, website 14, google 14, heartbleed 12, security 11, linux 11, openssl 10, code 10, network security 9, version 9, basis 9, challenge 8, encrypt 8, cloudpassage 8, malice 8, facility 8, computer 8, operate 7, full company 7, third party 7, infrastructure 7, ibm 7, report 7, today 7, information security 6, uneasiness 6, measure 6, joint 6, 90 6, agreement 6, linux 5, cve 5, ioactive 5, website 5

Finally, the topic-concept correlation formula is used to calculate the correlation P(c | k) between topic k and concept c. For topic 110, for example, the calculation process is shown in Fig. 4 and the result in Fig. 5. Similarly, the topic-concept correlation formula can be used to obtain the correlations between topic 128 and the domain concepts; the result is shown in Fig. 6.

Text annotation

The text to be annotated is as follows:

An analysis of the sandbox mechanism and technology in a cloud computing security system. In a multi-tenant cloud computing environment, the most central security principle is multi-tenant isolation. The Apsara security sandbox was created to uphold this principle: it provides security protection for Apsara itself and provides the most basic multi-tenant isolation scheme for every cloud computing product hosted in the Apsara ecosystem. In the Apsara-based odps product, the Apsara security sandbox provides multi-layer isolation and security protection measures from the java language layer down to the kernel layer. In this talk we will use it as an example to discuss the structure and system of security sandboxes in a multi-tenant cloud computing environment

Concept annotation is performed with the concept graph algorithm introduced above, giving the text-concept distribution $P(c \mid d)$ shown in Fig. 7.

The above text is then assessed with the topic assessment algorithm, giving the 2 most relevant topics in the document-topic distribution $\theta_d$ shown in Fig. 8.

The topic annotation shows that the text to be annotated is related not only to topic 110 but also, to some degree, to topic 128 (weight 0.057). The annotation adjustment formula is used to calculate, as shown in Fig. 9, the correlation between each concept c in the text-concept distribution $P(c \mid d)$ and all topics of text d.

As shown in Fig. 10, for concept 19541494 (Cloud computing) and concept 3157360 (Multitenancy), the concept correlation is greater than the topic correlation, so the concept correlation is reduced. For concepts such as concept 7398 (Computer security), the topic correlation is greater than the concept correlation, so the concept correlation is raised.

Fig. 10 also shows that a long-tail concept such as concept 1291932 (Sandbox (Computer security)) has its concept correlation raised because of the presence of topic 128 and is therefore annotated correctly, which illustrates the effectiveness of the present invention.

Corresponding to the above method, the present invention also provides a text annotation system. Fig. 11 shows the logical structure of the text annotation system according to an embodiment of the present invention.

As shown in Fig. 11, the text annotation system 1100 provided by the present invention includes a concept graph construction unit 1110, a topic model construction unit 1120, a topic-concept correlation acquisition unit 1130, and a text annotation unit 1140.

The concept graph construction unit 1110 is configured to construct, from a domain ontology, a concept graph that carries the correlations between concepts.

The topic model construction unit 1120 is configured to build, from a text collection, a topic model of the text collection, wherein the topic model includes the topic-word distribution corresponding to the text collection.

The topic-concept correlation acquisition unit 1130 is configured to obtain, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph.

The text annotation unit 1140 is configured to use, according to the topic-concept correlation, a text-topic distribution to adjust the concept correlations of a text-concept distribution so as to complete the text annotation.

The text annotation unit 1140 further includes:

a topic annotation module 1141, configured to topic-annotate the text to be annotated based on the topic-word distribution to obtain the text-topic distribution;

a concept annotation module 1142, configured to concept-annotate the text to be annotated based on the concept graph to obtain the text-concept distribution.

When the concept graph construction unit 1110 builds the concept graph with the correlations between concepts from the domain ontology,

The sources of the domain ontology include semi-structured pages or domain concept texts;

For semi-structured pages or domain concept texts, the correlation between concepts in the domain ontology is calculated by the following formula:

Here, f(c₁) denotes the number of pages that cite concept c₁, or the number of texts that contain c₁;

f(c₁, c₂) denotes the number of pages that cite both c₁ and c₂, or the number of texts that contain both c₁ and c₂;

M denotes the total number of pages or the total number of texts.
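The three quantities f(c₁), f(c₁, c₂) and M can be gathered with plain counting. The sketch below assumes that each page or text has already been reduced to the collection of concepts it cites or contains; the function name `count_concepts` and that input format are hypothetical. It only collects the counts and does not implement the correlation formula itself.

```python
from collections import Counter
from itertools import combinations


def count_concepts(pages):
    """Return (f_single, f_pair, M) for an iterable of concept collections.

    f_single[c]       -> number of pages/texts mentioning concept c
    f_pair[(c1, c2)]  -> number of pages/texts mentioning both (c1 < c2)
    M                 -> total number of pages/texts
    """
    f_single = Counter()
    f_pair = Counter()
    total = 0
    for concepts in pages:
        total += 1
        unique = sorted(set(concepts))  # count each concept once per page
        f_single.update(unique)
        f_pair.update(combinations(unique, 2))
    return f_single, f_pair, total
```

Pairs are stored with their two concepts in sorted order, so a lookup for an unordered pair should sort the two names first.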

A topic analysis method is used to analyze the text collection in order to build the topic model.

Using the topic analysis method, the text collection R is analyzed and n topics are generated, where n < |R|. The distribution of each topic k over the words is expressed as:

Wherein,

Here, w denotes a word and V denotes the vocabulary.
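Whatever topic analysis method is chosen, each topic k ends up as a probability distribution over the words of V, so its weights sum to 1. The sketch below assumes, purely for illustration, that a topic arrives as a dictionary of non-negative word weights; `normalize_topic` is a hypothetical helper and not part of any particular topic modelling library.

```python
def normalize_topic(word_weights):
    """Turn non-negative word weights for one topic into a distribution."""
    total = sum(word_weights.values())
    if total <= 0:
        raise ValueError("topic has no positive weight")
    # Each entry becomes the probability of word w under this topic.
    return {word: weight / total for word, weight in word_weights.items()}
```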

When the topic-concept correlation acquiring unit 1130 obtains the topic-concept correlation between each topic in the topic model and the concepts in the concept graph,

First, the correlation between the concept corresponding to each word in each topic and the concepts in the concept graph is obtained;

Then, these correlations are weighted by the topic-word distribution of each word on each topic and summed, which gives the topic-concept correlation. The formula is:

$$P(c \mid k) = \sum_{w} P(c \mid c(w)) \cdot \Phi_k^w$$
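The two steps above amount to a weighted sum over the words of a topic. The following is a minimal sketch under stated assumptions: the topic-word distribution of one topic is a dictionary `phi_k`, `word_to_concept` maps a word w to its concept c(w), and `concept_corr` holds the concept-to-concept correlations taken from the concept graph. All of these names and data layouts are hypothetical.

```python
def topic_concept_correlation(phi_k, word_to_concept, concept_corr, concepts):
    """Compute P(c | k) for one topic k and every concept c in `concepts`.

    phi_k           : {word: probability of the word under topic k}
    word_to_concept : {word: concept c(w)}
    concept_corr    : {(c(w), c): correlation between c(w) and c}
    """
    result = {c: 0.0 for c in concepts}
    for word, weight in phi_k.items():
        source = word_to_concept.get(word)
        if source is None:
            continue  # word has no concept in the graph (assumed to add 0)
        for c in concepts:
            # Missing pairs are treated as uncorrelated (assumption).
            result[c] += concept_corr.get((source, c), 0.0) * weight
    return result
```

Words without a matching concept are skipped here, which is one possible choice; the source text does not say how such words are handled.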

When the text marking unit 1140 uses the text-topic distribution of the text to be marked to adjust the concept-labeling correlations of that text's text-concept distribution,

The mathematical expectation of the text-concept correlation is:

$$E(c) = P(c \mid T(d)) = \sum_{k} P(c \mid k) \cdot \theta_d^k$$

Here, E(c) denotes the expected value of concept c; c denotes each concept in the text-concept distribution; T(d) denotes all topics in the text d to be marked; $\theta_d^k$ denotes the text-topic distribution of the text d to be marked;
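The expectation E(c) is a sum over the topics of the text, with each topic-concept correlation weighted by the share of that topic in the text. A minimal sketch, assuming the text-topic distribution and the topic-concept correlations are both held in dictionaries (the names `theta_d` and `topic_concept` are illustrative only):

```python
def concept_expectation(theta_d, topic_concept):
    """Compute E(c) for every concept reachable from the topics of text d.

    theta_d       : {topic k: weight of k in text d}
    topic_concept : {topic k: {concept c: P(c | k)}}
    """
    expectation = {}
    for topic, weight in theta_d.items():
        for concept, p in topic_concept.get(topic, {}).items():
            expectation[concept] = expectation.get(concept, 0.0) + p * weight
    return expectation
```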

If E(c) is greater than the text-concept correlation $\delta_d^c$, the correlation is raised using the following formula:

$${\delta_d^c}' = \kappa \cdot \delta_d^c + (1 - \kappa) \cdot E(c)$$

κ is a scale factor; adjusting the size of κ adjusts the relative weight of concept labeling and topic labeling;

Then, each concept is traversed to obtain the adjusted text-concept correlations ${\delta_d^c}'$. Finally, normalization is performed, which completes the text marking.
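The adjustment and the final normalization can be sketched as one pass over the concepts. The sketch makes two assumptions that the text does not state: a concept that appears only in the expectation is treated as having a text-concept correlation of 0, and normalization means dividing by the sum of the adjusted values. The function name and the default value of `kappa` are likewise illustrative.

```python
def adjust_and_normalize(delta_d, expectation, kappa=0.5):
    """Raise text-concept correlations that fall below E(c), then normalize.

    delta_d     : {concept c: text-concept correlation of text d}
    expectation : {concept c: E(c)}
    kappa       : scale factor balancing concept labeling and topic labeling
    """
    adjusted = {}
    for c in sorted(set(delta_d) | set(expectation)):
        delta = delta_d.get(c, 0.0)    # assumed 0 when c was not labelled
        e = expectation.get(c, 0.0)
        if e > delta:
            # Promotion step: blend the original value with the expectation.
            adjusted[c] = kappa * delta + (1.0 - kappa) * e
        else:
            adjusted[c] = delta
    total = sum(adjusted.values())
    if total == 0:
        return adjusted
    return {c: value / total for c, value in adjusted.items()}
```

With `kappa` equal to 1 the concept labels are left as they were before normalization, and smaller values give the topic-based expectation more influence.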

For the more specific interactions among the above modules or units, refer to the description of the method flow; they are not repeated here.

As can be seen from the above embodiments, the text marking method and system provided by the present invention use the text-topic distribution obtained from the topic model, together with the constructed topic-concept correlation, to raise the text-concept correlations obtained from the concept graph, thereby solving the problems of missed markings and erroneous markings that occur in text marking.

The text marking method and system proposed by the present invention have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art will appreciate that various improvements can be made to the text marking method and system proposed above without departing from the content of the present invention. Therefore, the protection scope of the present invention shall be determined by the content of the appended claims.

Claims

1. A text marking method, including:

Building a topic model of a text collection according to the text collection; wherein the topic model includes the topic-word distribution corresponding to the text collection;

Obtaining the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, according to the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph;

According to the topic-concept correlation, using a text-topic distribution to adjust the concept correlations of a text-concept distribution, to complete the text marking; wherein the text-topic distribution is obtained by performing topic labeling on a text to be marked based on the topic-word distribution, and the text-concept distribution is obtained by performing concept labeling on the text to be marked based on the concept graph.

2. The text marking method as claimed in claim 1, wherein, in the process of obtaining the topic-concept correlation between each topic in the topic model and the concepts in the concept graph,

First, the correlation between the concept corresponding to each word in each topic and the concepts in the concept graph is obtained;

Then, these correlations are weighted by the topic-word distribution of each word on each topic and summed to obtain the topic-concept correlation; the formula is:

$$P(c \mid k) = \sum_{w} P(c \mid c(w)) \cdot \Phi_k^w$$

Wherein, w denotes a word; c(w) denotes the concept corresponding to word w; c denotes a concept in the concept graph; $\Phi_k^w$ denotes the topic-word distribution;

P(c | c(w)) denotes the correlation between concept c(w) and concept c; P(c | k) denotes the topic-concept correlation.

3. The text marking method as claimed in claim 1, wherein, in the process of using the text-topic distribution to adjust the concept-labeling correlations of the text-concept distribution according to the topic-concept correlation,

The expected value of the text-concept correlation is:

$$E(c) = P(c \mid T(d)) = \sum_{k} P(c \mid k) \cdot \theta_d^k$$

E(c) denotes the expected value of concept c; c denotes each concept in the text-concept distribution; T(d) denotes all topics in the text d to be marked; $\theta_d^k$ denotes the text-topic distribution of the text d to be marked;

$${\delta_d^c}' = \kappa \cdot \delta_d^c + (1 - \kappa) \cdot E(c)$$

Then, each concept is traversed to obtain the adjusted text-concept correlations ${\delta_d^c}'$;

Finally, normalization is performed, which completes the text marking.

4. A text marking system, including:

A topic model construction unit, configured to build a topic model of a text collection according to the text collection; wherein the topic model includes the topic-word distribution corresponding to the text collection;

A topic-concept correlation acquiring unit, configured to obtain the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, according to the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph;

A text marking unit, configured to use a text-topic distribution to adjust the concept correlations of a text-concept distribution according to the topic-concept correlation, to complete the text marking; wherein,

The text marking unit further includes:

A topic labeling module, configured to perform topic labeling on a text to be marked based on the topic-word distribution, to obtain the text-topic distribution;

A concept labeling module, configured to perform concept labeling on the text to be marked based on the concept graph, to obtain the text-concept distribution.

5. The text marking system as claimed in claim 4, wherein, when the topic-concept correlation acquiring unit obtains the topic-concept correlation between each topic in the topic model and the concepts in the concept graph,

c denotes a concept in the concept graph; $\Phi_k^w$ denotes the topic-word distribution; P(c | c(w)) denotes the correlation between concept c(w) and concept c; P(c | k) denotes the topic-concept correlation.

6. The text marking system as claimed in claim 4, wherein, when the text marking unit uses the text-topic distribution to adjust the concept-labeling correlations of the text-concept distribution,

The expected value of the text-concept correlation is:

$$E(c) = P(c \mid T(d)) = \sum_{k} P(c \mid k) \cdot \theta_d^k$$

Wherein, E(c) denotes the expected value of concept c; c denotes each concept in the text-concept distribution; T(d) denotes all topics in the text d to be marked; $\theta_d^k$ denotes the text-topic distribution of the text d to be marked;

Finally, normalization is performed, which completes the text marking.