CN104750777B - text marking method and system - Google Patents
- Publication number: CN104750777B
- Application number: CN201410850313.6A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Abstract
The present invention provides a text marking method and system. The method includes: constructing a concept graph from a domain ontology; building a topic model of a text collection from that collection, the topic model including a topic-word distribution; obtaining, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph; and, according to the topic-concept correlation, using a text-topic distribution to adjust the concept correlations of a text-concept distribution, thereby completing the text marking. The text-topic distribution is obtained by topic-marking the text to be marked based on the topic-word distribution; the text-concept distribution is obtained by concept-marking the text to be marked based on the concept graph. The invention addresses the missed marks and false marks that occur in text marking.
Description
Technical field
The present invention relates to the technical field of text marking and, more specifically, to a text marking method and system.
Background
With the spread of the mobile Internet and social networks, a large amount of user-generated content (UGC) is being produced. Because of differences in cultural background and habits of expression, people often use different words and phrasings to express similar content. The word-based inverted index that traditional search engines widely use to manage content therefore cannot reveal the inherent correlations within UGC, and so cannot support effective maintenance, retrieval, and recommendation of these texts. Understanding the meaning of text at the semantic level thus becomes necessary.
Natural language processing (NLP) techniques can be used to understand UGC in depth, but because of the complexity of human natural language, a truly deep understanding of UGC cannot be achieved, and such deep understanding is often unnecessary anyway. In fact, if text is semantically annotated by building a mapping from words to semantic concepts, then even with only shallow analysis the distribution of the UGC over the semantic concept space can be determined, which provides a practical basis for managing, searching, and recommending UGC.
In text mining, topic analysis is a common method for semantically annotating text. As a statistical method based on unsupervised learning, topic analysis determines, for a given text collection and given parameters, a number of latent topics. Each topic is a set of words, and each text can be expressed as a probability distribution over multiple topics. Compared with the words of a bag-of-words model, the latent topics have far fewer dimensions and can therefore effectively avoid word-level noise.
Although topic analysis handles the high-frequency hot words in text well, the topic model it uses relies on a probabilistic assumption of exponential distributions and is therefore not suited to learning the many long-tail words contained in text. As a result, the topics formed by the topic model do not contain long-tail words, so marking text with a topic model cannot mark the long-tail words in the text, which greatly reduces the usefulness of the text marking.
To solve the problem of identifying long-tail words, a concept graph can be introduced into the text marking process. Although a concept graph can identify long-tail words, concept-graph algorithms lack an understanding of the overall tendency of the text. If a non-core word appears many times, the system is misled into treating the corresponding concept as a key concept of the text, which causes false marks; conversely, a core word that appears only a few times may cause missed marks.
Summary of the invention
In view of the above problems, the object of the invention is to provide a text marking method and system that use a hybrid marking approach combining a topic model and a concept graph, to solve the missed marks and false marks that occur in text marking.
According to one aspect of the invention, a text marking method is provided, including:
constructing, from a domain ontology, a concept graph that carries the correlations between concepts; and
building, from a text collection, a topic model of the text collection, where the topic model includes the topic-word distribution corresponding to the text collection;
obtaining, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, where the concept corresponding to each word in each topic is obtained from the concept graph;
according to the topic-concept correlation, using a text-topic distribution to adjust the concept correlations of a text-concept distribution, thereby completing the text marking; where the text-topic distribution is obtained by topic-marking the text to be marked based on the topic-word distribution, and the text-concept distribution is obtained by concept-marking the text to be marked based on the concept graph.
According to another aspect of the invention, a text marking system is provided, including:
a concept graph construction unit, for constructing, from a domain ontology, a concept graph that carries the correlations between concepts;
a topic model construction unit, for building, from a text collection, a topic model of the text collection, where the topic model includes the topic-word distribution corresponding to the text collection;
a topic-concept correlation acquisition unit, for obtaining, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, where the concept corresponding to each word in each topic is obtained from the concept graph;
a text marking unit, for using, according to the topic-concept correlation, a text-topic distribution to adjust the concept correlations of a text-concept distribution, thereby completing the text marking; where
the text marking unit includes:
a topic marking module, for topic-marking the text to be marked based on the topic-word distribution to obtain the text-topic distribution;
a concept marking module, for concept-marking the text to be marked based on the concept graph to obtain the text-concept distribution.
As the above technical solution shows, the text marking method and system provided by the invention use, according to the topic-concept correlation, the text-topic distribution of the text to be marked (obtained from the topic model) to raise the text-concept correlations of that text (obtained from the concept graph), thereby solving the missed marks and false marks that occur in text marking.
To achieve the above and related objects, one or more aspects of the invention include the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set out certain illustrative aspects of the invention in detail. These aspects, however, indicate only some of the various ways in which the principles of the invention may be employed. Moreover, the invention is intended to include all such aspects and their equivalents.
Brief description of the drawings
Other objects and results of the invention will become clearer and easier to understand by reference to the following description taken together with the accompanying drawings and the claims, and as the invention is more fully understood. In the drawings:
Fig. 1 is a schematic flowchart of the text marking method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the text marking processing flow according to an embodiment of the invention;
Fig. 3 is a partial view of the concept-concept correlations generated from Wikipedia concepts and the Google distance formula according to an embodiment of the invention;
Fig. 4 is a schematic diagram of an example of the correlation calculation process between a topic and concepts according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the first example of correlation calculation results between a topic and concepts according to an embodiment of the invention;
Fig. 6 is a schematic diagram of the second example of correlation calculation results between a topic and concepts according to an embodiment of the invention;
Fig. 7 is a schematic diagram of an example of obtaining the text-concept distribution according to an embodiment of the invention;
Fig. 8 is a schematic diagram of an example of obtaining the text-topic distribution according to an embodiment of the invention;
Fig. 9 is a schematic diagram of an example of the marking adjustment calculation process according to an embodiment of the invention;
Fig. 10 is a schematic diagram of an example of the marking adjustment calculation results according to an embodiment of the invention;
Fig. 11 is a block diagram of the logical structure of the text marking system according to an embodiment of the invention.
The same reference numerals denote similar or corresponding features or functions throughout the drawings.
Detailed description
In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of one or more embodiments. It will be apparent, however, that these embodiments may also be practiced without these specific details.
A topic model can provide the overall tendency of a text but cannot mark the long-tail words in it; concept-graph marking can establish the presence of concepts, especially long-tail concepts, but suffers from missed marks and false marks. The invention proposes a hybrid text marking method that uses a topic model and a concept graph together. It retains the topic model's advantage in understanding the overall tendency of the text while still ensuring that the long-tail words covered by the concept graph are marked accurately.
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
To illustrate the text marking method provided by the invention, Fig. 1 shows the text marking flow according to an embodiment of the invention.
As shown in Fig. 1, the text marking method provided by the invention includes:
S110: Construct, from a domain ontology, a concept graph that carries the correlations between concepts; and build, from a text collection, a topic model of the text collection, where the topic model includes the topic-word distribution corresponding to the text collection.
Specifically, a concept graph and a topic model are built. The concept graph with concept-concept correlations is built from the named relations and unnamed relations in the domain ontology; this concept graph is the basis for subsequently determining the topic-concept correlations.
In addition, the text collection is analyzed with a topic analysis method to build the topic model, which includes multiple topics and their topic-word distributions. That is, a topic analysis method (for example, the LDA algorithm) is applied to the text collection to generate the topic-word distributions of multiple topics.
S120: From the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, obtain the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, where the concept corresponding to each word in each topic is obtained from the concept graph.
Specifically, the concept corresponding to each word in each topic of the built topic model is obtained from the concept graph. Note that the words in each topic obtain their corresponding concepts through the concept query service provided by the built concept graph.
Then, for each topic in the topic model, the correlations between the concepts corresponding to the high-frequency words of that topic and the concepts in the concept graph are computed and summed, giving the correlation between that topic and the concepts in the concept graph.
S130: According to the topic-concept correlation, use a text-topic distribution to adjust the concept correlations of a text-concept distribution, thereby completing the text marking; where the text-topic distribution is obtained by topic-marking the text to be marked based on the topic-word distribution, and the text-concept distribution is obtained by concept-marking the text to be marked based on the concept graph.
Specifically, the text to be marked is concept-marked according to the concept graph to obtain its text-concept distribution, and topic-marked according to the topic-word distribution of the topic model to obtain its text-topic distribution.
In concept marking with the built concept graph, the weight of a concept is proportional to the number of times the concept occurs, and the weight of a concept is increased when it is correlated with other concepts. In topic marking, a topic evaluation algorithm is applied to the text to be marked, based on the topic-word distribution of the built topic model, to obtain the text-topic distribution.
The above process introduces the concept-topic correlation: a correlation is established between each topic in the topic model and one or more concepts in the concept graph, which joins two heterogeneous models. Based on the concept-topic correlation, the degree to which a topic raises or lowers a concept marking can then be computed quantitatively.
The text marking method of the invention has two parts: Part I builds the topic-concept correlations, and Part II is the text marking stage. The core of the proposed method is to establish the correlation between the topic model and the concept graph (that is, to build the topic-concept correlations), so that the results of topic marking can be used to adjust the concept marking. The method is described in depth below.
Part I: Building the topic-concept correlations
1. Building the concept graph
The concept graph is built from a domain ontology. Sources of the domain ontology include:
a. semi-structured text such as Wikipedia;
b. domain concept texts written by domain experts;
c. concept systems such as WordNet.
Based on the above domain ontology, a concept graph C is built. The concept graph C contains m concepts and at most one concept-concept correlation for each pair of concepts.
For a WordNet-style concept system, the concept-concept correlations are already provided, so no further computation is needed.
For cases a and b, the concept-concept correlations are computed with the Google distance, whose standard normalized form is:
$$\mathrm{NGD}(c_1, c_2) = \frac{\max\{\log f(c_1), \log f(c_2)\} - \log f(c_1, c_2)}{\log M - \min\{\log f(c_1), \log f(c_2)\}}$$
where $f(c_1)$ denotes the number of pages that reference concept $c_1$ (case a) or the number of texts that contain $c_1$ (case b);
$f(c_1, c_2)$ denotes the number of pages that reference both $c_1$ and $c_2$ (case a) or the number of texts that contain both $c_1$ and $c_2$ (case b);
$M$ denotes the total number of pages (case a) or the total number of texts (case b).
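The distance above can be sketched in a few lines of Python. This is an illustration only, not the implementation of the invention: the function names are hypothetical, and the mapping from distance to a correlation in [0, 1] (`1 - distance`, clipped at 0) is an assumption made for the example, since the description does not state how the distance is converted into a correlation.

```python
import math


def normalized_google_distance(f1, f2, f12, total):
    """Standard normalized Google distance from occurrence counts.

    f1, f2 -- number of pages (or texts) that mention each concept
    f12    -- number of pages (or texts) that mention both concepts
    total  -- total number of pages (or texts), M in the description
    """
    if f12 == 0:
        return math.inf  # the two concepts never co-occur
    log_f1, log_f2, log_f12 = math.log(f1), math.log(f2), math.log(f12)
    denominator = math.log(total) - min(log_f1, log_f2)
    if denominator == 0:
        return 0.0  # both concepts occur on every page
    return (max(log_f1, log_f2) - log_f12) / denominator


def correlation_from_distance(distance):
    """ASSUMPTION: map a distance to a correlation in [0, 1] as 1 - distance."""
    return max(0.0, 1.0 - distance)
```

Concepts that always co-occur get distance 0 (correlation 1 under the assumed mapping), and concepts that never co-occur get an infinite distance (correlation 0).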
2. Building the topic model
A topic analysis method (for example, the LDA algorithm) is applied to a given text collection R to generate n topics of the domain, with $n < |R|$. Each topic k is defined as a multinomial distribution $\varphi_k$ over the vocabulary, where $\sum_{w \in V} \varphi_{k,w} = 1$, w denotes a word, and V denotes the vocabulary.
Note that in the LDA algorithm the topic-word distribution $\varphi_k$ is an unknown latent variable that must be estimated from the words in the text collection; the latent variables in LDA are generally learned with approximate inference algorithms. The original LDA paper (Blei02) uses a mean-field variational expectation-maximization algorithm, but the Gibbs sampling used in Griffiths02 is easier to understand, so Gibbs sampling is suggested for learning. As this is outside the scope of the invention, the LDA training algorithm is not described further.
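For readers unfamiliar with Gibbs sampling, the following is a minimal collapsed Gibbs sampler for LDA in pure Python. It is a teaching sketch rather than the training procedure of the invention: the function name, the symmetric hyperparameters `alpha` and `beta`, and the fixed iteration count are all assumptions, and a production system would more likely rely on an existing LDA library.

```python
import random


def train_lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (vocab, phi).

    docs is a list of token lists; phi[k][w] is the probability of word w
    under topic k, so the values of each phi[k] sum to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # n(d, k)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                            # n(k)
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = index[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                # Unnormalised full conditional p(z = t | all other tokens).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1

    # Smoothed estimate of the topic-word distributions.
    phi = [
        {w: (topic_word[k][index[w]] + beta) / (topic_total[k] + n_words * beta)
         for w in vocab}
        for k in range(n_topics)
    ]
    return vocab, phi
```

Each returned `phi[k]` is a dictionary from word to probability, which matches the topic-word distribution described above in that its values sum to 1 over the vocabulary.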
3. Building the topic-concept correlations
For each topic, the correlation between the concept corresponding to each word in the topic and the concepts in the concept graph is first computed.
These correlations are then summed, weighted by the topic-word distribution of each word on the topic, to obtain the topic-concept correlation.
That is, for each topic, using the concept-concept correlations provided by the concept graph, the correlation between the concept c(w) corresponding to each word w in the topic and any concept c in the concept graph built from the domain ontology is computed and then weighted by the distribution $\varphi_{k,w}$ of word w on topic k. This gives the correlation P(c | k) between topic k and any domain concept c:
$$P(c \mid k) = \sum_{w} \varphi_{k,w}\, P(c \mid c(w))$$
where the sum runs over the words w of topic k; c(w) denotes the concept corresponding to word w; c denotes a concept in the concept graph; $\varphi_{k,w}$ denotes the topic-word distribution; P(c | c(w)) denotes the correlation between concept c(w) and concept c; and P(c | k) denotes the topic-concept correlation.
Note that when the correlation between the concept corresponding to each word in each topic and the concepts in the concept graph is computed, the concept corresponding to each word is obtained through the concept query service provided by the built concept graph. Obtaining the concept that corresponds to a word from a lexicon is described in detail in the prior art and is not repeated here.
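The weighted sum that defines the topic-concept correlation can be illustrated as follows. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, the concept query service is stood in for by a plain dictionary from word to concept, and a concept is assumed to correlate fully (1.0) with itself.

```python
def topic_concept_correlation(topic_words, word_to_concept, concept_corr, concepts):
    """Compute P(c | k) = sum over words w of phi[k][w] * P(c | c(w)).

    topic_words     -- {word: phi[k][w]} for one topic k
    word_to_concept -- {word: concept}; stand-in for the concept query service
    concept_corr    -- {(c, source_concept): P(c | source_concept)}
    concepts        -- the concepts c of the concept graph to score
    """
    scores = {c: 0.0 for c in concepts}
    for word, weight in topic_words.items():
        source = word_to_concept.get(word)
        if source is None:
            continue  # the word has no concept in the graph
        for c in concepts:
            # ASSUMPTION: a concept is fully correlated with itself.
            corr = 1.0 if c == source else concept_corr.get((c, source), 0.0)
            scores[c] += weight * corr
    return scores


# Toy data, invented for illustration only.
topic = {"cloud": 0.6, "security": 0.3, "the": 0.1}
lookup = {"cloud": "Cloud computing", "security": "Computer security"}
corr = {
    ("Sandbox", "Cloud computing"): 0.2,
    ("Sandbox", "Computer security"): 0.8,
    ("Cloud computing", "Computer security"): 0.5,
    ("Computer security", "Cloud computing"): 0.5,
}
result = topic_concept_correlation(
    topic, lookup, corr, ["Cloud computing", "Computer security", "Sandbox"]
)
```

In this toy example the concept "Sandbox" never appears among the topic's words, yet it still receives a nonzero correlation with the topic through its correlations with the concepts that do appear, which is the effect the method relies on for long-tail concepts.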
Part II: The text marking stage
1. Concept marking
For the text to be marked, the invention uses the built concept graph to concept-mark the text. A simple word-frequency-based marking algorithm, or the TextLink algorithm, can be used to concept-mark the text d to be marked, giving the text-concept distribution $P(c \mid d)$, where $\sum_{c} P(c \mid d) = 1$. This probability distribution represents the weights of the domain concepts contained in the document.
Concept-graph marking works by establishing a domain ontology that contains the concepts of the domain and multiple concept instances. Each concept has one or more names, concepts have hypernym-hyponym relations, and concept instances are linked by various named or unnamed relations. Analyzing these relations establishes the degree of correlation between concepts. The concepts discussed in a text are then identified by concept name; when a word corresponds to multiple concepts, the concept must be disambiguated using the context of the text. The concept-concept correlations can also be used to determine the correlation between each marked concept and all other concepts, and the text-concept distribution of the text is then further determined according to the ranking of these correlations.
2. Topic marking
For the text to be marked, the topic-word distribution obtained from the built topic model is used with a topic evaluation algorithm to topic-mark the text d, giving the text-topic distribution $P(k \mid d)$ of text d, where $\sum_{k} P(k \mid d) = 1$. This probability distribution represents the weights of the topics contained in the document.
Note that because concept marking and topic marking have no direct logical dependency on each other, concept marking can be performed first and topic marking second, topic marking first and concept marking second, or both at the same time.
3. Marking adjustment
For each concept c in the text-concept distribution $P(c \mid d)$ obtained in concept marking, the following formula computes the correlation between concept c and all topics T(d) of text d obtained in topic marking, where E(c) is the expected value of concept c:
$$E(c) = \sum_{k \in T(d)} P(k \mid d)\, P(c \mid k)$$
where E(c) denotes the expected value of concept c; c denotes each concept in the text-concept distribution; T(d) denotes all topics of the text d to be marked; $P(k \mid d)$ denotes the text-topic distribution of text d; and $P(c \mid k)$ denotes the topic-concept correlation.
If the resulting E(c) is greater than the text-concept correlation $P(c \mid d)$, the correlation is raised by an adjustment function of $P(c \mid d)$, E(c), and κ, where $P(c \mid d)$ denotes the text-concept correlation, E(c) denotes the expected value of the concept, and κ is a scale factor whose size sets the ratio between concept marking and topic marking.
Each concept is then traversed to obtain the adjusted text-concept correlations, which are finally normalized to give the final marking result. Tuning κ in the adjustment function lets the system keep the advantage of topic marking, namely its ability to analyze the overall tendency of the text, while also ensuring that the long-tail words in the text to be marked are not ignored.
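The adjustment step can be sketched as follows. The expected value E(c) follows the definition given above, but the adjustment function in this sketch is an assumption: a linear interpolation, P'(c | d) = P(c | d) + κ (E(c) − P(c | d)), is used purely for illustration and should not be read as the adjustment function of the invention. All function and parameter names are hypothetical.

```python
def expected_concept_value(concept, text_topics, topic_concept):
    """E(c) = sum over topics k in T(d) of P(k | d) * P(c | k)."""
    return sum(
        p_k * topic_concept[k].get(concept, 0.0) for k, p_k in text_topics.items()
    )


def adjust_marking(text_concepts, text_topics, topic_concept, kappa=0.5):
    """Pull each P(c | d) towards E(c), then renormalise.

    text_concepts -- {concept: P(c | d)} from concept marking
    text_topics   -- {topic: P(k | d)} from topic marking
    topic_concept -- {topic: {concept: P(c | k)}}
    kappa         -- scale factor balancing concept and topic marking
    """
    adjusted = {}
    for c, p_cd in text_concepts.items():
        e_c = expected_concept_value(c, text_topics, topic_concept)
        # ASSUMED adjustment function: linear interpolation controlled by kappa.
        adjusted[c] = p_cd + kappa * (e_c - p_cd)
    total = sum(adjusted.values())
    return {c: v / total for c, v in adjusted.items()} if total > 0 else adjusted


# Toy data, invented for illustration only.
concepts = {"Cloud computing": 0.8, "Sandbox": 0.2}
topics = {110: 0.5, 128: 0.5}
correlations = {
    110: {"Cloud computing": 0.5, "Sandbox": 0.1},
    128: {"Cloud computing": 0.1, "Sandbox": 0.9},
}
final = adjust_marking(concepts, topics, correlations, kappa=0.5)
```

With κ = 0 the concept marking is returned unchanged, and with κ = 1 the result is determined entirely by the topic marking; intermediate values trade one off against the other, which is the role the description assigns to the scale factor.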
To further illustrate the whole text marking flow, Fig. 2 shows the text marking processing flow of an embodiment of the invention.
In the embodiment shown in Fig. 2, the whole flow includes a topic-concept correlation building stage, a concept marking stage, a topic marking stage, and a marking adjustment stage.
In the topic-concept correlation building stage, the concept graph is built from the domain ontology, and the LDA algorithm is trained on the text collection to generate the topic-word distribution. The topic-concept correlations are then built from the generated concept graph and topic-word distribution.
In the concept marking stage, the text to be marked is concept-marked according to the generated concept graph.
In the topic marking stage, the text to be marked is topic-marked according to the generated topic-word distribution.
In the marking adjustment stage, according to the built topic-concept correlations, the result of topic marking is used to raise or lower the result of concept marking, forming the final marking result.
Example
Building the topic-concept relations
Fig. 3 shows a fragment of the concept-concept correlations generated from Wikipedia concepts and the Google distance formula. As shown in Fig. 3, a concept graph is built from Wikipedia concepts and the concept-concept correlations are generated.
A topic model construction algorithm is then used to train a topic model on a number of news texts, with the target number of topics set to 250, and the topic-word distributions are learned. The topic-word distributions of topic 110 and topic 128 are as follows:
Topic 110 (total words: 1428)
computing 206, cloud computing 173, China 76, conference 68, sixth 48, forum 48, speech 39, guest 30, innovation 28, big data 28, application 25, data 22, technology 22, center 21, openstack 21, general manager 18, everyone 16, agenda 15, manager 15, believe 13, introduce 13, ibm 13, focus 13, traffic 12, meeting 12, cloud 10, Yu'e Bao 10, country 10, advantage 10, Beijing 10, manufacturer 10, new project 9, author 9, case 9, project 9, channel 8, title 8, aviation 8, aws 8
Topic 128 (total words: 886)
safety 159, leak 75, software 30, network 24, discover 23, hacker 22, Microsoft 21, security 20, fund 19, influence 15, foundation 14, website 14, google 14, heartbleed 12, security 11, linux 11, openssl 10, code 10, network security 9, version 9, basis 9, challenge 8, encrypt 8, cloudpassage 8, malice 8, facility 8, computer 8, operating 7, full company 7, third party 7, infrastructure 7, ibm 7, report 7, today 7, information security 6, uneasiness 6, measure 6, joint 6, 90 6, agreement 6, linux 5, cve 5, ioactive 5, website 5
Finally, the topic-concept correlation formula is used to compute the correlation P(c | k) between topic k and concept c. For topic 110, for example, the calculation process is shown in Fig. 4 and the result in Fig. 5. Likewise, the formula gives the correlations between topic 128 and the domain concepts; the result is shown in Fig. 6.
Text marking
The text to be marked is as follows:
Analysis of sandbox mechanisms and technology in the cloud computing security system. In a multi-tenant cloud computing environment, the most central security principle is multi-tenant isolation. The Apsara security sandbox was created in keeping with this principle: it provides security protection for Apsara itself and offers the most basic multi-tenant isolation scheme for every cloud computing product carried in the Apsara ecosystem. In the Apsara-based ODPS product, the Apsara security sandbox provides multi-layer isolation and security protection measures from the Java language level down to the kernel level. In this talk we will use it as an example to discuss the structure and system of security sandboxes in multi-tenant cloud computing environments.
Concept marking with the concept-graph algorithm introduced above gives the text-concept distribution shown in Fig. 7.
The text to be marked is then evaluated with the topic evaluation algorithm, giving the 2 most relevant topics in the document-topic distribution, as shown in Fig. 8.
The topic marking shows that the text to be marked is related not only to topic 110 but also, to some degree, to topic 128 (weight 0.057). The marking adjustment formula is then used to compute, as shown in Fig. 9, the correlation between each concept c in the text-concept distribution and all topics of text d.
As shown in Fig. 10, for concept 19541494 (Cloud computing) and concept 3157360 (Multitenancy), the concept correlation is greater than the topic correlation, so the concept correlation is reduced. For concept 7398 (Computer security) and others, the topic correlation is greater than the concept correlation, so the concept correlation is raised.
Fig. 10 also shows that a long-tail concept such as concept 1291932 (Sandbox (Computer security)) has its concept correlation raised because of topic 128 and is therefore marked correctly, which illustrates the effectiveness of the invention.
Corresponding to the above method, the invention also provides a text marking system. Fig. 11 shows the logical structure of the text marking system according to an embodiment of the invention.
As shown in Fig. 11, the text marking system 1100 provided by the invention includes a concept graph construction unit 1110, a topic model construction unit 1120, a topic-concept correlation acquisition unit 1130, and a text marking unit 1140.
The concept graph construction unit 1110 constructs, from a domain ontology, a concept graph that carries the correlations between concepts.
The topic model construction unit 1120 builds, from a text collection, a topic model of the text collection, where the topic model includes the topic-word distribution corresponding to the text collection.
The topic-concept correlation acquisition unit 1130 obtains, from the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, the topic-concept correlation between each topic in the topic model and the concepts in the concept graph; the concept corresponding to each word in each topic is obtained from the concept graph.
The text marking unit 1140 uses, according to the topic-concept correlation, the text-topic distribution to adjust the concept correlations of the text-concept distribution, thereby completing the text marking.
The text marking unit 1140 further includes:
a topic marking module 1141, which topic-marks the text to be marked based on the topic-word distribution to obtain the text-topic distribution;
a concept marking module 1142, which concept-marks the text to be marked based on the concept graph to obtain the text-concept distribution.
When the concept graph construction unit 1110 constructs, according to the domain ontology, the concept graph that carries the correlations between concepts:
The sources of the domain ontology include semi-structured pages or domain concept texts;
For the semi-structured pages or domain concept texts, the correlation between concepts in the domain ontology is calculated, and its formula is expressed as:
where $f(c_1)$ denotes the number of pages that reference concept $c_1$, or the number of texts that contain $c_1$;
$f(c_1, c_2)$ denotes the number of pages that reference both concepts $c_1$ and $c_2$, or the number of texts that contain both $c_1$ and $c_2$;
$M$ denotes the total number of pages or the total number of texts.
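By way of illustration only, the quantities f(c1), f(c1, c2) and M can be gathered from a set of pages or texts as sketched below. The function names and the input format (one set of concept names per page or text) are assumptions. Because the correlation formula itself is not reproduced above, the function `ngd_relatedness` substitutes a relatedness score derived from the Normalized Google Distance, which is built from the same three quantities; this substitution is an assumption and should not be read as the formula of the embodiment.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(texts):
    """Count f(c), f(c1, c2) and M from an iterable of concept sets.

    Each element of `texts` is the set of concepts referenced by one page
    (or contained in one text). This input format is an assumption.
    """
    single = Counter()   # f(c): number of pages/texts mentioning c
    pair = Counter()     # f(c1, c2): number mentioning both c1 and c2
    total = 0            # M: total number of pages/texts
    for concepts in texts:
        concepts = set(concepts)
        total += 1
        single.update(concepts)
        pair.update(frozenset(p) for p in combinations(concepts, 2))
    return single, pair, total


def ngd_relatedness(c1, c2, single, pair, total):
    """Assumed stand-in score: 1 - Normalized Google Distance, floored at 0."""
    f1, f2 = single[c1], single[c2]
    f12 = f1 if c1 == c2 else pair[frozenset((c1, c2))]
    if f1 == 0 or f2 == 0 or f12 == 0:
        return 0.0  # concepts that never co-occur are treated as unrelated
    log_f1, log_f2 = math.log(f1), math.log(f2)
    denominator = math.log(total) - min(log_f1, log_f2)
    if denominator == 0:
        return 1.0  # both concepts occur in every page/text
    distance = (max(log_f1, log_f2) - math.log(f12)) / denominator
    return max(0.0, 1.0 - distance)
```

For example, with four pages in which concepts A and B are referenced together twice, A once alone, and C once alone, the counts are f(A) = 3, f(B) = 2, f(A, B) = 2 and M = 4.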
The topic model is built by analyzing the text collection with a topic analysis method, wherein:
the topic analysis method is applied to a text collection R to generate n topics, where n < |R|, and the distribution of each topic k over the words is expressed as:
where
where w denotes a word and V denotes the vocabulary.
When the topic-concept correlation acquiring unit 1130 obtains the topic-concept correlation between each topic in the topic model and the concepts in the concept graph:
First, the correlation between the concept corresponding to each word in each topic and the concepts in the concept graph is obtained;
Then, these correlations are weighted by the topic-word distribution of each word on each topic and summed, to obtain the topic-concept correlation; its formula is:
$$P(c \mid k) = \sum_{w} P(c \mid c(w)) \cdot \Phi_k^w$$
where $w$ denotes a word; $c(w)$ denotes the concept corresponding to word $w$; $c$ denotes a concept in the concept graph; $\Phi_k^w$ denotes the topic-word distribution; $P(c \mid c(w))$ denotes the correlation between concept $c(w)$ and concept $c$; and $P(c \mid k)$ denotes the topic-concept correlation.
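By way of illustration only, the accumulation described above, in which P(c | k) is obtained by summing the correlation P(c | c(w)) weighted by the topic-word probability of each word w, can be sketched as follows. The function name, the dictionary-based data layout, and the rule that words without a corresponding concept are skipped are assumptions made for this example.

```python
from collections import defaultdict


def topic_concept_correlation(phi_k, word_to_concept, concept_relatedness):
    """Compute P(c | k) = sum over w of P(c | c(w)) * phi_k[w] for one topic k.

    phi_k:               dict word -> probability of the word under topic k
    word_to_concept:     dict word -> concept c(w) from the concept graph
    concept_relatedness: dict concept c(w) -> dict concept c -> P(c | c(w))
    """
    p_c_given_k = defaultdict(float)
    for word, prob in phi_k.items():
        source = word_to_concept.get(word)
        if source is None:
            continue  # assumption: words with no corresponding concept are skipped
        for concept, relatedness in concept_relatedness.get(source, {}).items():
            p_c_given_k[concept] += relatedness * prob
    return dict(p_c_given_k)


# Hypothetical toy data, not taken from the embodiment.
phi_k = {"neuron": 0.5, "layer": 0.3, "the": 0.2}
word_to_concept = {"neuron": "NeuralNetwork", "layer": "DeepLearning"}
concept_relatedness = {
    "NeuralNetwork": {"NeuralNetwork": 1.0, "DeepLearning": 0.8},
    "DeepLearning": {"DeepLearning": 1.0, "NeuralNetwork": 0.8},
}
correlation = topic_concept_correlation(phi_k, word_to_concept, concept_relatedness)
```

With the toy data above, the value for "NeuralNetwork" is 1.0 × 0.5 + 0.8 × 0.3 = 0.74 and the value for "DeepLearning" is 0.8 × 0.5 + 1.0 × 0.3 = 0.70.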
When the text marking unit 1140 uses the text-topic distribution of the text to be marked to adjust the concept tagging correlations of the text-concept distribution of the text to be marked:
The mathematical expectation of the text-concept correlation is:
$$E(c) = P(c \mid T(d)) = \sum_{k} P(c \mid k) \cdot \theta_d^k$$
where $E(c)$ denotes the expected value of concept $c$; $c$ denotes each concept in the text-concept distribution; $T(d)$ denotes all topics in the text $d$ to be marked; and $\theta_d^k$ denotes the text-topic distribution of the text $d$ to be marked;
If $E(c)$ is greater than the text-concept correlation, the correlation is boosted, using the following formula:
$${\delta_d^c}' = \kappa \cdot \delta_d^c + (1 - \kappa) \cdot E(c)$$
where $\delta_d^c$ denotes the text-concept correlation and $E(c)$ denotes the expected value of the concept;
$\kappa$ is a scale factor; the proportion between concept tagging and topic marking is adjusted by adjusting the value of $\kappa$;
Then, each concept is traversed to obtain the adjusted text-concept correlation ${\delta_d^c}'$;
Finally, normalization is performed, completing the text marking.
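By way of illustration only, the adjustment described above (computing the expected value E(c) from the text-topic distribution, boosting a text-concept correlation only where E(c) exceeds it, and normalizing) can be sketched as follows. The function names, the dictionary-based layout, and the default κ = 0.5 are assumptions. The sketch also assumes that a concept absent from the text-concept distribution has correlation 0, so that it can be boosted when its expected value is positive.

```python
def expected_concept_scores(theta_d, p_c_given_k):
    """E(c) = sum over k of P(c | k) * theta_d[k].

    theta_d:     dict topic -> weight of the topic in text d
    p_c_given_k: dict topic -> dict concept -> P(c | k)
    """
    expected = {}
    for topic, weight in theta_d.items():
        for concept, p in p_c_given_k.get(topic, {}).items():
            expected[concept] = expected.get(concept, 0.0) + p * weight
    return expected


def adjust_text_concept(delta_d, theta_d, p_c_given_k, kappa=0.5):
    """Boost text-concept correlations below E(c), then normalize to sum to 1."""
    expected = expected_concept_scores(theta_d, p_c_given_k)
    adjusted = {}
    # Assumption: concepts missing from delta_d start from a correlation of 0.
    for concept in set(delta_d) | set(expected):
        current = delta_d.get(concept, 0.0)
        e_c = expected.get(concept, 0.0)
        if e_c > current:
            # delta' = kappa * delta + (1 - kappa) * E(c)
            current = kappa * current + (1.0 - kappa) * e_c
        adjusted[concept] = current
    total = sum(adjusted.values())
    if total == 0:
        return adjusted
    return {concept: value / total for concept, value in adjusted.items()}
```

For instance, with a text-concept distribution {A: 0.6, B: 0.4}, a single topic with weight 1.0, P(c | k) of {A: 0.2, B: 0.8, C: 0.4}, and κ = 0.5, concept A is left unchanged at 0.6, B is raised to 0.6, and C is raised from 0 to 0.2, before all three are divided by their sum 1.4.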
For the more specific interaction of the above modules or units, reference may be made to the description of the method flow, which is not repeated here.
As can be seen from the above embodiments, the text marking method and system provided by the present invention use the text-topic distribution obtained from the topic model, together with the constructed topic-concept correlation, to boost the text-concept correlations obtained from the concept graph, thereby solving the problems of missing labels and incorrect labels that occur in text marking.
The text marking method and system proposed by the present invention have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements can be made to the text marking method and system proposed by the present invention without departing from the content of the present invention. Therefore, the protection scope of the present invention should be determined by the content of the appended claims.
Claims (6)
1. A text marking method, comprising:
constructing, according to a domain ontology, a concept graph that carries the correlations between concepts; and
building a topic model of a text collection according to the text collection; wherein the topic model includes the topic-word distribution corresponding to the text collection;
obtaining the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, according to the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph;
using a text-topic distribution to adjust the concept correlations of a text-concept distribution according to the topic-concept correlation, so as to complete the text marking; wherein the text-topic distribution is obtained by performing topic marking on a text to be marked based on the topic-word distribution, and the text-concept distribution is obtained by performing concept tagging on the text to be marked based on the concept graph.
2. The text marking method as claimed in claim 1, wherein, in the process of obtaining the topic-concept correlation between each topic in the topic model and the concepts in the concept graph:
first, the correlation between the concept corresponding to each word in each topic and the concepts in the concept graph is obtained;
then, these correlations are weighted by the topic-word distribution of each word on each topic and summed, to obtain the topic-concept correlation; its formula is:
$$P(c \mid k) = \sum_{w} P(c \mid c(w)) \cdot \Phi_k^w$$
where $w$ denotes a word; $c(w)$ denotes the concept corresponding to word $w$; $c$ denotes a concept in the concept graph; $\Phi_k^w$ denotes the topic-word distribution;
$P(c \mid c(w))$ denotes the correlation between concept $c(w)$ and concept $c$; and $P(c \mid k)$ denotes the topic-concept correlation.
3. The text marking method as claimed in claim 1, wherein, in the process of using the text-topic distribution to adjust the concept tagging correlations of the text-concept distribution according to the topic-concept correlation:
the expected concept value of the text-concept correlation is:
$$E(c) = P(c \mid T(d)) = \sum_{k} P(c \mid k) \cdot \theta_d^k$$
where $E(c)$ denotes the expected value of concept $c$; $c$ denotes each concept in the text-concept distribution; $T(d)$ denotes all topics in the text $d$ to be marked; and $\theta_d^k$ denotes the text-topic distribution of the text $d$ to be marked;
if $E(c)$ is greater than the text-concept correlation, the correlation is boosted, using the following formula:
$${\delta_d^c}' = \kappa \cdot \delta_d^c + (1 - \kappa) \cdot E(c)$$
where $\delta_d^c$ denotes the text-concept correlation and $E(c)$ denotes the expected value of the concept;
$\kappa$ is a scale factor; the proportion between concept tagging and topic marking is adjusted by adjusting the value of $\kappa$;
then, each concept is traversed to obtain the adjusted text-concept correlation ${\delta_d^c}'$;
finally, normalization is performed, completing the text marking.
4. A text marking system, comprising:
a concept graph construction unit, configured to construct, according to a domain ontology, a concept graph that carries the correlations between concepts;
a topic model construction unit, configured to build a topic model of a document collection according to the document collection; wherein the topic model includes the topic-word distribution corresponding to the text collection;
a topic-concept correlation acquiring unit, configured to obtain the topic-concept correlation between each topic in the topic model and the concepts in the concept graph, according to the topic-word distribution, the concept corresponding to each word in each topic, and the concepts in the concept graph, wherein the concept corresponding to each word in each topic is obtained from the concept graph;
a text marking unit, configured to use a text-topic distribution to adjust the concept correlations of a text-concept distribution according to the topic-concept correlation, so as to complete the text marking; wherein
the text marking unit further comprises:
a topic labeling module, configured to perform topic marking on a text to be marked based on the topic-word distribution, so as to obtain the text-topic distribution;
a concept tagging module, configured to perform concept tagging on the text to be marked based on the concept graph, so as to obtain the text-concept distribution.
5. The text marking system as claimed in claim 4, wherein, when the topic-concept correlation acquiring unit obtains the topic-concept correlation between each topic in the topic model and the concepts in the concept graph:
first, the correlation between the concept corresponding to each word in each topic and the concepts in the concept graph is obtained;
then, these correlations are weighted by the topic-word distribution of each word on each topic and summed, to obtain the topic-concept correlation; its formula is:
$$P(c \mid k) = \sum_{w} P(c \mid c(w)) \cdot \Phi_k^w$$
where $w$ denotes a word; $c(w)$ denotes the concept corresponding to word $w$;
$c$ denotes a concept in the concept graph; $\Phi_k^w$ denotes the topic-word distribution; $P(c \mid c(w))$ denotes the correlation between concept $c(w)$ and concept $c$; and $P(c \mid k)$ denotes the topic-concept correlation.
6. The text marking system as claimed in claim 4, wherein, when the text marking unit uses the text-topic distribution to adjust the concept tagging correlations of the text-concept distribution:
the expected concept value of the text-concept correlation is:
$$E(c) = P(c \mid T(d)) = \sum_{k} P(c \mid k) \cdot \theta_d^k$$
where $E(c)$ denotes the expected value of concept $c$; $c$ denotes each concept in the text-concept distribution; $T(d)$ denotes all topics in the text $d$ to be marked; and $\theta_d^k$ denotes the text-topic distribution of the text $d$ to be marked;
if $E(c)$ is greater than the text-concept correlation, the correlation is boosted, using the following formula:
$${\delta_d^c}' = \kappa \cdot \delta_d^c + (1 - \kappa) \cdot E(c)$$
where $\delta_d^c$ denotes the text-concept correlation and $E(c)$ denotes the expected value of the concept;
$\kappa$ is a scale factor; the proportion between concept tagging and topic marking is adjusted by adjusting the value of $\kappa$;
then, each concept is traversed to obtain the adjusted text-concept correlation ${\delta_d^c}'$;
finally, normalization is performed, completing the text marking.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410850313.6A CN104750777B (en) | 2014-12-31 | 2014-12-31 | text marking method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104750777A CN104750777A (en) | 2015-07-01 |
CN104750777B true CN104750777B (en) | 2018-04-06 |
Family
ID=53590461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410850313.6A Active CN104750777B (en) | 2014-12-31 | 2014-12-31 | text marking method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104750777B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN104750777A (en) | 2015-07-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |