CN105677684A - Method for making semantic annotations on content generated by users based on external data sources - Google Patents

Method for making semantic annotations on content generated by users based on external data sources

Info

Publication number
CN105677684A
CN105677684A CN201410675420.XA
Authority
CN
China
Prior art keywords
page
semantic
key word
user
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410675420.XA
Other languages
Chinese (zh)
Inventor
钱卫宁
杜鹃
章群燕
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201410675420.XA priority Critical patent/CN105677684A/en
Publication of CN105677684A publication Critical patent/CN105677684A/en
Pending legal-status Critical Current


Abstract

The invention discloses a method for semantic annotation of user-generated content based on external data sources. The method comprises the following steps: 1. a preprocessing step: clustering the user-generated content to obtain one or more semantic entities; 2. a configuration step: generating query statements from the keywords of the semantic entities, locating and crawling page sets related to the semantic entities by searching external resources with the query statements, and assigning a weight value to each page in a page set according to its degree of relevance; 3. a semantic annotation step: using the weight values to extract information related to the semantic entities from the page sets and using that information to supplement the annotation of the semantic entities, thereby obtaining expanded and optimized semantic entities. The method expands and optimizes low-quality semantic entities. Its beneficial effect is that, on top of conventional information extraction based on word segmentation and clustering, expansion and optimization are carried out with external resources that are rich in information and standardized in format, so that semantic entities of higher quality are obtained.

Description

A method for semantic annotation of user-generated content based on external data sources
Technical field
The invention belongs to the field of computer technology, and specifically relates to a method for semantic annotation of user-generated content based on external data sources.
Background art
With the development of Web 2.0, more and more applications on the Internet contain data produced by users, for instance microblogs, online forums, and video websites. Such data are referred to as user-generated content (UGC). Unlike traditional web data, they are submitted freely by users and directly reflect the topics users are discussing. Semantic extraction based on UGC makes it possible to grasp more accurately both the topics users are discussing and their attitudes toward those topics, which plays an important role in public-opinion analysis and hot-topic tracking.
In the study of UGC, grasping and understanding its semantics is critical. For example, consider a UGC item whose content is the word "RIO" (the animated film Rio). A user who does not know the film may not know what this refers to, or may take it to mean the city of Rio de Janeiro rather than the film. Clearly, establishing semantics for the data in UGC is necessary for understanding UGC. If semantic information can be established for the entity "RIO", such as {"Rio", "film", "3D", "2011", "animation", "Carlos Saldanha", ...}, then both computers and users can understand the entity "RIO" more correctly. In fact, the concept of the semantic web was proposed long ago; its main idea is to describe the information on the web with metadata so that users or applications can process it better. By analogy with the semantic web, if semantic entities can be established for UGC, research on analyzing user behavior and tracking social dynamics will be improved significantly.
Because UGC is generated directly by users, it reflects their individuality and is therefore valuable; but for the same reason, its quality is low. It has no fixed form and may contain erroneous information, which poses a major challenge for analyzing and studying UGC. In summary, UGC differs from the news data commonly found on the Internet, and its low quality is mainly manifested as follows: 1) for a given piece of information, the data produced by users are usually fragmented, expressing an idea or event in only a few words or a single sentence; 2) when entering information on the Internet, users usually employ non-standard expressions, for instance abbreviations, aliases, symbols, and emoticons; 3) the data produced by users usually contain many errors, for instance spelling mistakes; 4) the data produced by users may mix several languages. Therefore, unlike traditional information extraction data sets such as news data, UGC is of low quality, and applying traditional information extraction methods such as SVM to entity extraction on UGC yields unsatisfactory results because of the high noise in the data. Processing UGC has thus become one of the thornier problems in information extraction. Traditional data mining and entity extraction methods designed for news data cannot be applied to UGC unchanged, and a new method is needed for analyzing and processing UGC data.
Natural language processing aims to make computers understand human language. In processing, lexical and structural preference information is usually learned from text corpora and the syntax is analyzed. Such learning is based on context and statistical information: word frequencies, mutual information, and the like are used for morphological analysis, while Markov models, probabilistic context-free grammars, and probabilistic parsing are used for syntactic analysis. All of these methods rely on high-quality, normalized data sets. In UGC, however, the grammatical structure is arbitrary: when expressing themselves, people generally pay little attention to grammar, write whatever comes to mind, and introduce new words and homophone substitutions. For natural language processing these are treated as different words, and the results are unsatisfactory.
In Chinese natural language processing, word segmentation is a difficult problem. English words are separated by spaces, whereas in Chinese only sentences are separated by punctuation; there is no explicit boundary between words, so high-quality segmentation results are required before Chinese text can be processed. Existing Chinese word segmentation methods include string-matching approaches such as: 1) forward maximum matching (scanning left to right); 2) reverse maximum matching (scanning right to left); 3) minimum segmentation (minimizing the number of words cut out of each sentence); 4) bidirectional maximum matching (two scans, left to right and right to left). Understanding-based segmentation methods imitate human understanding of grammar and use syntactic and semantic analysis to resolve ambiguity. Likewise, for low-quality UGC, the arbitrariness of the grammar prevents these methods from producing satisfactory results.
Information extraction extracts relevant topics and events from a well-segmented text collection. Common approaches are supervised, semi-supervised, and unsupervised learning. Supervised learning builds a model from a labelled training set; typical methods include support vector machines, nearest neighbors, Gaussian mixture models, Bayesian algorithms, and decision trees. In general, however, labelled training sets are hard to obtain. Semi-supervised learning starts from a small amount of labelled information and builds the model iteratively, using the results of each round as the training data for the next. Unsupervised learning requires no pre-labelled information; a common unsupervised method is clustering, such as single-pass clustering.
So far, topics can be extracted from UGC with natural language processing and information extraction methods; however, as noted above, the low quality of UGC introduces considerable noise into both, the results are unsatisfactory, and optimization is needed. In existing research on processing UGC data, some methods choose to filter out low-quality data, for example by using link information in the content and ratings between users to produce a quality score for the data. With such quality scores, low-quality data can be filtered out before extraction so that the work is done directly on a high-quality data set. This approach sidesteps the low-quality problem of UGC to some extent and has contributed to knowledge question-answering systems such as Yahoo! Answers. However, this avoidance strategy easily loses much important information. For UGC such as forum posts and microblogs, content is short and arbitrary; the quality of what a user posts varies with time and mood, and even with the tool used to post it: content posted from a computer tends to be of slightly higher quality, while content posted from a mobile phone tends to be slightly lower. It is therefore difficult to partition the data by user and assign quality scores.
At present, many studies make use of external resources, for instance to build dictionaries, but they are mainly based on offline processing: the structured information of external resources is used to form training data sets and train models. There has been no research on using online data sources to semantically annotate user-generated data.
In order to overcome the defects of the prior art, namely the impact of Chinese word segmentation errors, the loss of important information caused by filtering out low-quality data, and the lack of support for searching online data sources, the present invention proposes a method for semantic annotation of user-generated content based on external data sources.
Summary of the invention
The present invention proposes a method for semantic annotation of user-generated content based on external data sources, comprising the following steps:
Preprocessing step: clustering the user-generated content to obtain one or more semantic entities;
Configuration step: generating query statements from the keywords in said semantic entities, searching external resources with said query statements, locating and crawling from them the page sets related to said semantic entities, and assigning a weight value to each page in said page set according to its degree of relevance; said weight value measures the degree of correlation between the page and the semantic entity, and the higher the weight value, the higher the correlation;
Semantic annotation step: using said weight values to extract information related to said semantic entities from said page sets and using that information to supplement the annotation of said semantic entities, thereby obtaining expanded and optimized semantic entities.
In the method for semantic annotation of user-generated content based on external data sources proposed by the present invention, in said preprocessing step, said user-generated content is clustered into said semantic entities using natural language processing and information extraction techniques; said information extraction techniques include the single-pass clustering algorithm and support vector machines.
In the method proposed by the present invention, each semantic entity consists of one or more keywords, and the process of combining the keywords into query statements for searching external resources comprises the following steps:
Step a1: following the Apriori algorithm, search with each single keyword of said semantic entity as a query statement;
Step a2: collect the single keywords whose searches return results into an interim set, and search again with each keyword in said interim set combined with one additional keyword as the query statement;
Step a3: repeat step a2 until no keyword combination returns any result, or all keywords in said interim set have been combined and searched as query statements.
In the method proposed by the present invention, the external resource is an online data source shared over the network or an offline data source stored on a local device.
In the method proposed by the present invention, if the external resource is an online data source, the process of searching the online data source and crawling the page set comprises the following steps:
Step b1: set up the search word, the related page set, and the keyword list, the keyword list being sorted in descending or ascending order;
Step b2: combine the search word with each word in the keyword list in turn, search the external resource with the combined search word, and, whenever the search returns a related page, crawl the page and add it to the related page set;
Step b3: assign a weight value to each page in the related page set and sort the pages by weight value in descending or ascending order.
In the method proposed by the present invention, the process of configuring the weight value of a page comprises the following steps:
Step c1: compute the position weight parameter of the keywords within the query statement;
Step c2: compute the count weight parameter, i.e. the number of times the page is crawled into the page set;
Step c3: compute the match weight parameter, i.e. the degree to which the page matches the keywords;
Step c4: compute the special weight parameter, based on the number of occurrences of special phrases in the page;
Step c5: normalize the position weight parameter, the count weight parameter, the match weight parameter, and the special weight parameter, and multiply the normalized values together to obtain the weight value of the page.
In the method proposed by the present invention, in the preprocessing step, relevant information is extracted from the page set with the following priority:
Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U > Pl-U;
where Pt denotes the page title (the name of the semantic entity described by the page), Pa denotes the first paragraph (a brief introduction to the semantic entity, similar to an abstract), Pi denotes the infobox (attributes associated with the semantic entity), Pl denotes the remainder of the page (everything apart from the preceding parts that describes the semantic entity), S indicates that the information also appears in the user-generated data, and U indicates that the information does not appear in the user-generated data.
In the method proposed by the present invention, in the preprocessing step, six heuristic rules are established on the basis of this priority, each heuristic rule being expressed by one of the following formulas:
L'w1 = {Pt > Pa-S};
L'w2 = {Pt > Pa-S > Pi-S};
L'w3 = {Pt > Pa-S > Pi-S > Pl-S};
L'w4 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U};
L'w5 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U};
L'w6 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U > Pl-U};
where L'w1 to L'w6 denote the semantic entities optimized by the six heuristic rules respectively, Pt denotes the page title, Pa denotes the first paragraph, Pi denotes the infobox, Pl denotes the remainder of the page, S indicates that the information also appears in the user-generated data, and U indicates that the information does not appear in the user-generated data.
The beneficial effects of the present invention are:
The present invention optimizes and extends low-quality semantic entities. On the basis of existing information extraction methods for word segmentation and clustering, such as single-pass clustering and SVM, external resources that are rich in information and standardized in format, such as Wikipedia and Amazon, are then used for optimization and extension, so that high-quality semantic entities are obtained.
Brief description of the drawings
Fig. 1 illustrates the method of the present invention for semantic annotation of user-generated content based on external data sources.
Fig. 2 is a schematic diagram of generating query statements following the Apriori principle in the embodiment.
Fig. 3 is a schematic diagram of semantic annotation with the method of the invention in the embodiment.
Fig. 4 is a schematic diagram of precisionN values for the hedge forum data set in the embodiment.
Fig. 5 is a schematic diagram of precisionN values for the Sina Weibo data set in the embodiment.
Fig. 6 is a schematic diagram of F-measure values for the hedge forum data set with α = 0.3 in the embodiment.
Fig. 7 is a schematic diagram of F-measure values for the hedge forum data set with α = 0.5 in the embodiment.
Fig. 8 is a schematic diagram of F-measure values for the hedge forum data set with α = 0.7 in the embodiment.
Fig. 9 is a schematic diagram of F-measure values for the Sina Weibo data set with α = 0.3 in the embodiment.
Fig. 10 is a schematic diagram of F-measure values for the Sina Weibo data set with α = 0.5 in the embodiment.
Fig. 11 is a schematic diagram of F-measure values for the Sina Weibo data set with α = 0.7 in the embodiment.
Detailed description of the invention
The present invention is described in further detail below in conjunction with specific embodiments and the accompanying drawings. Except where specifically mentioned below, the procedures, conditions, and experimental methods used in implementing the invention are common knowledge in the art, and the invention places no particular restrictions on them.
Referring to Fig. 1, the method of the present invention for semantic annotation of user-generated content based on external data sources comprises the following steps:
Preprocessing step: clustering the user-generated content to obtain one or more semantic entities;
Configuration step: generating query statements from the keywords of the semantic entities, searching external resources with the query statements, locating and crawling from them the page sets related to the semantic entities, and assigning a weight value to each page in a page set according to its degree of relevance;
Semantic annotation step: the weight value assigned to a page measures the degree of correlation between the page and the semantic entity, and the higher the weight value, the higher the correlation. Information related to the semantic entity is extracted from the page set and used to supplement the annotation of the semantic entity, thereby obtaining an expanded and optimized semantic entity.
(1) Preprocessing step: generating the keywords of the semantic entities
At present, the difficulties faced by data clustering algorithms fall mainly into the following two categories:
1) As data volumes grow rapidly, the distribution of the data can no longer be grasped completely, and it is no longer possible to determine in advance which concrete categories the data should be divided into. Traditional clustering algorithms, such as k-means, are not suited to such rapidly growing data.
2) High-dimensional data break the limits of the usual models. Especially for long texts, high-dimensional features cause common similarity measures to fail: many distances that are discriminative in low dimensions become indistinguishable in high-dimensional space, so that samples from different patterns can no longer be separated.
In the preprocessing step, the present invention uses the single-pass clustering algorithm. Single-pass clustering is an unsupervised algorithm that is widely used in web information extraction; its core is the setting of a similarity threshold. For n data samples its computational cost is O(n*n). Text is usually unstructured data, and most of the data accumulated in the Internet era are stored as text; understanding and using text data is a technical challenge in data mining. In data mining, clustering algorithms are a class of methods suited to data that exhibit specific patterns and a measurable similarity. In text clustering, the most common representation is the feature vector: entities in the data are assigned to categories by computing the similarity between feature vectors. Each cluster in the result is a semantic entity, represented by a list of keywords Lw.
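A minimal single-pass clustering sketch of the kind described here, for illustration only. The cosine similarity over term-frequency vectors and the threshold value are assumptions; the text above only states that single-pass clustering with a similarity threshold is used and that each resulting cluster is represented by a keyword list Lw.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold=0.3):
    """docs: list of token lists (segmented UGC items).  Returns clusters,
    each holding a term-frequency centroid and the indices of its members;
    every cluster later yields one semantic entity described by keywords Lw."""
    clusters = []
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:                      # compare against every existing cluster
            sim = cosine(vec, c['centroid'])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best['members'].append(i)
            best['centroid'].update(vec)        # accumulate terms into the centroid
        else:                                   # below threshold: open a new cluster
            clusters.append({'centroid': Counter(vec), 'members': [i]})
    return clusters
```

Each document is compared against every existing cluster, which gives the O(n*n) cost mentioned above.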
(2) Generating query statements and searching external resources
As shown above, a semantic entity consists of one or more keywords. Combining the keywords into query statements for searching external resources comprises the following steps:
Step a1: following the Apriori algorithm, search with each single keyword of the semantic entity as a query statement;
Step a2: collect the single keywords whose searches return results into an interim set, and search again with each keyword in the interim set combined with one additional keyword as the query statement;
Step a3: repeat step a2 until no keyword combination returns any result, or all keywords in the interim set have been combined and searched as query statements.
For example, step a1 generates query statements from the different keyword combinations of Lw. For Lw = {w1, w2, w3, w4}, the generated query statements are {{w1}, {w2}, {w3}, {w4}, {w1 w2}, {w1 w3}, {w1 w4}, {w2 w3}, {w2 w4}, {w3 w4}, {w1 w2 w3}, {w1 w2 w4}, {w1 w3 w4}, {w2 w3 w4}, {w1 w2 w3 w4}}. In addition, to reduce the number of queries, the present invention starts from the combinations with the fewest keywords and follows the Apriori principle. Referring to the marked part of Fig. 2, following the Apriori principle, if the search for {w1, w2} returns no result, then {w1, w2, w3}, {w1, w2, w4}, and {w1, w2, w3, w4} are not searched.
The process of generating query statements and searching the external resource is:
1) Start by searching with each single keyword wi of Lw as a query statement.
2) Record the keywords whose queries return results as the interim set Ltemp = {wi, wj, ..., wk}; if the current queries contain N keywords each (N = 1 for single keywords), construct queries of N + 1 keywords from the interim set Ltemp and search with them.
3) Repeat step 2) until every combination in Ltemp has been used as a query without returning results, or all keywords in Lw have been combined into a query and searched.
With this generation rule, if only page 1 of the results returned by the search engine, i.e. the most relevant content, needs to be browsed, the total number of accesses lies in [0, 2^n - 1]. In practice, when more than three keywords from different domains are searched together on an encyclopedia, the probability of getting any result is very low.
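A minimal sketch of the Apriori-style query generation described above. The function search() stands for one query against the external resource (for example an encyclopedia search) and is an assumption; it is expected to return True when the query yields at least one result page.

```python
def generate_queries(lw, search):
    """lw: keyword list Lw of one semantic entity.
    Queries are issued from small combinations to larger ones; only
    combinations that returned results are extended, so any superset of a
    fruitless combination is never queried (the Apriori principle)."""
    productive, seen = [], set()
    frontier = [frozenset([w]) for w in lw]       # step a1: single keywords
    while frontier:
        next_frontier = []
        for combo in frontier:
            if combo in seen:
                continue
            seen.add(combo)
            if search(' '.join(sorted(combo))):
                productive.append(sorted(combo))
                for w in lw:                      # step a2: add one more keyword
                    if w not in combo:
                        next_frontier.append(combo | {w})
        frontier = next_frontier                  # step a3: repeat until exhausted
    return productive
```

With Lw = {w1, w2, w3, w4}, this sketch would never query {w1, w2, w3} or {w1, w2, w4} once {w1, w2} returned nothing, matching the marked part of Fig. 2.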
(3) Obtaining the external data source
At present, many external resources actively provide free data dumps for research. However, mining UGC on social networks requires real-time data, for instance for currently popular political events, and using such backup data would usually miss much important information. To obtain better optimization results, the present invention chooses to actively crawl the data in the external resource in real time, so as to obtain the latest version. When a crawler is used to fetch these data, an important problem is how to strike a balance between accuracy and efficiency. In general, there are two crawling modes. The first is the offline mode, which crawls as many pages as possible in advance, stores them locally, and uses them for later computation; most current web crawlers work this way. The second is the online mode, which crawls the external resource again, according to the current demand, at the time the computation is performed.
The online-mode algorithm is described as follows: given a search word searchWord, a related page set P, and a sorted keyword list Lw, for each word in Lw set searchWord = searchWord + word and search the external resource with searchWord; whenever related pages P' are returned, crawl them and add them to the page set, i.e. P = P ∪ P'; finally assign a weight value to each page in P and sort.
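A small sketch of the online mode just described, for illustration. fetch_results() and download() stand in for the search call against the external resource and the page download, and score_pages() for the weighting of step (4) below; all three names are assumptions.

```python
def online_crawl(lw_sorted, fetch_results, download, score_pages):
    """lw_sorted: the sorted keyword list Lw of one semantic entity.
    The query is extended one keyword at a time; every related page a query
    returns is crawled into the candidate pool P, which is finally scored
    and sorted."""
    search_word = ''
    pages = []                                   # related page set P
    for word in lw_sorted:
        search_word = (search_word + ' ' + word).strip()
        for url in fetch_results(search_word):   # related pages P'
            page = download(url)
            if page is not None and page not in pages:
                pages.append(page)               # P = P ∪ P'
    return score_pages(pages)                    # assign weight values and sort
```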
The present invention chooses the online mode for semantic entity optimization so as to better follow the hot topics of UGC: it is triggered on demand and is more timely, whereas the older data in an offline data source contribute little to the analysis performed here. Moreover, an offline data source must keep its local data up to date and maintain indexes over a large data set, which makes page search inconvenient, while the external resource itself can find related pages online more simply. The online mode only needs to decide, from Lw, which pages to crawl. Although the time spent crawling pages is added to the computation time, the number of pages Nsp(SEi) crawled for one semantic entity is controllable: Nsp(SEi) ∈ [0, 2^NL - 1], and as NL increases, Nsp(SEi) does not grow linearly but converges to a point. Assuming that the time spent searching the external resource for semantic entity SEi is Ts(SEi) and that the time needed to crawl one page is μ, then Ts(SEi) = μ * Nsp(SEi).
This embodiment selects two data sets for the experiments; their statistics are given in Table 1:
Table 1  Overview of the hedge forum and Sina Weibo data sets

Data source                       Hedge forum        Sina Weibo
Content                           Titles             Microblog posts
Topic size                        68 MB              2 GB
Number of topics                  224                41
Topics with an external resource  141                33
Time span                         2009.10-2010.10    2009.08-2012.02
First, the pre-labelled semantic entities are divided into a training set and a test set, and the parameters are then tuned to obtain the highest accuracy. In addition, this embodiment introduces the precisionN measure, which expresses how accurately related pages can be located in the external resource when different numbers of keywords are used to construct the query statement; N refers to the top N pages of the candidate page pool.
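A small helper illustrating how the precisionN measure can be computed: the fraction of the top-N pages in the candidate page pool that are truly related to the semantic entity. Representing the ground truth as a set of relevant page identifiers is an assumption.

```python
def precision_at_n(ranked_page_ids, relevant_ids, n):
    """ranked_page_ids: candidate page pool sorted by weight value, best first.
    relevant_ids: set of page ids judged related to the semantic entity."""
    top = ranked_page_ids[:n]
    if not top:
        return 0.0
    hits = sum(1 for page_id in top if page_id in relevant_ids)
    return hits / len(top)
```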
To obtain higher accuracy, this embodiment tunes the α parameters of the weight features. For the hedge forum, three features are selected and set to αw = 0.9, αo = 0.1, αm = 0.3. For Sina Weibo, four features are selected and set to αw = 0.2, αo = 0.1, αm = 0.5, αs = 0.6.
Fig. 4 shows, for the stage of obtaining external resources on the hedge forum data set, the precisionN values obtained when query statements are constructed from combinations of 1, 2, 3, and 5 keywords. From the figure:
1) The highest precisionN, 85%, is obtained when three keywords are combined for the search and N is 5. The peak occurs with three keywords rather than five precisely because the semantic entity extracted by the traditional method does not express the entity well: if all the words in Lw referred to the same semantic entity, then the more keywords used in the search, the more accurate the result should be. The opposite is observed, which in itself confirms the necessity of optimizing with external resources.
2) Of the remaining 15% of errors, part is due to pages in the external resource that describe identically named semantic entities from different domains.
Fig. 5 shows the precisionN values reached in the stage of obtaining external resources for Sina Weibo. Unlike the hedge forum, the peak occurs when five keywords are combined for the search. Analysis of the raw data shows that the main reason is that some important keywords are ranked low in Lw; searching with only the first three words therefore misses the core vocabulary and is not fully related to the event. This is again one of the problems produced by extraction with the traditional method.
Table 2  Online search performance for the hedge forum
The pseudocode above determines the number of pages that must be crawled during the search. When three keywords are combined, an average of 6.29 pages must be crawled per semantic entity; when the number of keywords is five, each semantic entity requires 20.361 to 24.536 pages on average. For different data sets, searches using the same number of keywords require a similar number of actual pages. The table also shows that as the number of search keywords grows, the number of pages to crawl per semantic entity increases accordingly, while the actual values remain far below the theoretical maximum; the more keywords are used, the larger the gap between the actual number of searched pages and the maximum. μ is set to 1.43 seconds, the average time needed to crawl one page, computed as the arithmetic mean over ten random pages. Ttotal denotes the total time spent in the process, including searching, analyzing, and fetching pages. To simplify Ttotal, only the time spent searching and fetching pages is counted; precisionN reaches its peak at N = 5. Therefore
Ttotal = μ * (5 + Σ Nsp(SEi))
Tables 2 and 3 both present the Ttotal values. For the hedge forum, processing 224 semantic entities with query statements built from three-keyword combinations takes about 30 minutes in total for searching and fetching pages, about 8 seconds per semantic entity on average. For Sina Weibo, processing 41 semantic entities with query statements built from five-keyword combinations takes about 23 minutes in total, about 36 seconds per semantic entity on average.
Table 3  Online search performance for Sina Weibo
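As a rough consistency check on these per-entity averages: 224 semantic entities × 8 s ≈ 1792 s ≈ 30 minutes for the hedge forum, and 41 semantic entities × 36 s ≈ 1476 s ≈ 25 minutes for Sina Weibo, in line with the reported totals once rounding of the averages is taken into account.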
(4) Configuring the weight values
After searching with the generated query statements, the present invention builds a candidate page pool. Every page in this pool is more or less related to some semantic entity, and the present invention needs to find, in this pool, the pages most related to the semantic entity. In this discovery process, several feature parameters serve as the basis for configuring the weight value of each page.
1) The position weight parameter δw of the keywords within Lw.
For different clustering results, the weight of a keyword in Lw is computed differently, according to its position in the ordering. (Alternatively, the keyword weights produced during preprocessing can be used.) A page obtained by searching with high-weight keywords from Lw has a higher degree of relevance. The present invention searches with combinations of keywords, so if several keywords are used in a search, the resulting page accumulates the weight values of all those keywords. Suppose the query statement consists of {w1, w2, ..., wi}, where 1 ≤ i ≤ N(Lw); the weight of keyword wi is denoted S(wi).
S(wi) = Len(Lw) - Pos(wi), where Len(Lw) is the length of Lw and Pos(wi) is the position of keyword wi in Lw. The weight of the whole search statement is then: δw = Σ S(wx).
2) The count weight parameter δo: the number of times the page occurs in the candidate page pool.
Searching with the different keyword combinations of a semantic entity yields a candidate page pool consisting of several pages. Within this pool, the more times a particular page occurs, the more relevant it is to the semantic entity. Because the related pages of each semantic entity are processed independently, the candidate page pool of each semantic entity is also independent; when this feature is computed, the overall size of the pool does not matter, only the number of times a specific page occurs in it.
3) The match weight parameter δm: whether the name of the page matches the keywords in Lw.
In the external resource, a page describes one semantic entity, for instance on Wikipedia, Amazon, or YouTube. If the page name exactly matches the semantic entity, referring to Fig. 3, for example the page "Rio" for the entity {"Rio", "film", "parrot", ...}, then the relevance of the page should be higher. If the names are merely similar, for example when the entity keywords contain only fragments of the page name, as in {"Rio", "Adventure", "film", ...}, the relevance is slightly elevated. Following this rule, the present invention computes the δm value as follows for the different semantic entities and their related pages:
In the formula, Name(pagex) denotes the name of the page that is matched against the semantic entity.
4) The special weight parameter δs: whether the page contains certain special words.
If semantic entities are extracted for events, then pages containing special words such as "*** incident" or "*** accident" receive a larger weight value. The definition of the special words therefore differs for semantic entity extraction in different domains.
δs(pagex) = 2 if Name(pagex) ∈ {SpecialWords}, and δs(pagex) = 1 otherwise.
Each parameter δw, δo, δm, δs is normalized and mapped onto the interval [αw, 1], [αo, 1], [αm, 1], or [αs, 1] respectively, where 1 represents the best value and αw, αo, αm, αs are defined by the user. The weight value of each page in the candidate page pool is then computed as:
Score_item_page = αw* × αo* × αm* × αs*,
where αw*, αo*, αm*, αs* are the normalized weight values and Score_item_page is the weight value configured for the page. Besides the four weight parameters mentioned above, other features could also be included, for instance: the position of the page in the result ranking returned by the external resource search; the time the page was created or last edited in the external resource; the credibility of the external resource; and so on. In practice, the weights δw, δo, δm, δs contribute the most to finding the most related pages. It should be noted that for different combinations of UGC data and external resources, the choice of features may differ.
(5) Optimizing the semantic annotation
In the optimization stage, information on the related pages of the external resource is extracted to annotate the semantic entity. When keywords are extracted, it is observed that the words important for describing a semantic entity appear in distinctive forms, for instance as hyperlinks or in boldface; such words express the semantic entity more clearly. Compared with using traditional word segmentation, the reasons for extracting these keywords directly are:
1) Traditional word segmentation has to traverse the text, for instance with forward maximum matching or reverse maximum matching; the more factors are considered, the more accurate the segmentation, but the complexity of the algorithm rises accordingly. On Chinese text collections in particular, the segmentation results will not be completely correct, and for the optimization work such introduced errors must be avoided.
2) As shown in Table 1, for a given semantic entity, the words marked as hyperlinks or in boldface are all important nouns, such as times, places, and people, which are exactly what is needed to express a semantic entity clearly. Because of their special presentation, they can be extracted directly without word segmentation.
Therefore, when processing user-generated data with complicated context, the present invention chooses not to segment the statements extracted from the external resource. Meanwhile, to prevent redundancy among the extracted keywords, an extra check is performed during extraction: when two words overlap and their lengths differ by less than 2, only the longer word is kept, for instance the full film title "里约大冒险" ("Rio") is kept rather than the fragment "大冒险". This directly reduces the length of Lw while losing as little as possible of the information needed to express the semantic entity. Taking Wikipedia as an example, the observations above show that a page is well organized; its composition is given in Table 4 (external resource page attributes). In general, the keywords appearing in a page should be ordered according to the following rule:
Rule 1:
Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U > Pl-U
where "S" indicates that the keyword also occurs in the UGC data and "U" indicates that it does not. Because the length of Lw' must be controlled, the keywords extracted from the external resource are ranked according to rule 1, and the keywords most important for expressing the semantic entity are selected and added to Lw'. Based on the structure of the page, the present invention provides the following six heuristic rules for keyword extraction.
Heuristic rule 1:
L'w1 = {Pt > Pa-S}
The keywords in L'w1 occur in both the external resource and the UGC, and they appear in the title and the abstract of the page. Of the six heuristic rules this one is the strictest, so it extracts the fewest words.
Heuristic rule 2:
L'w2 = {Pt > Pa-S > Pi-S}
L'w2 adds, on top of L'w1, the keywords in the infobox that occur in both the external resource and the UGC. The infobox contains many important keywords related to the semantic entity, but it is observed that these keywords often partly repeat those in the abstract.
Heuristic rule 3:
L'w3 = {Pt > Pa-S > Pi-S > Pl-S}
L'w3 contains all keywords that occur in both the external resource and the UGC. Evidently L'w1, L'w2, and L'w3 contain no important keywords that are absent from the UGC; that is, the first three heuristic rules can only correct the keywords of a semantic entity, not supplement them.
Heuristic rule 4:
L'w4 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U}
Compared with the results of the first three heuristic rules, L'w4 contains keywords that do not occur in the UGC. On top of heuristic rule 3, L'w4 adds all keywords appearing in the abstract of the external resource; these keywords may not occur in the UGC, but they are very important for expressing the semantic entity.
Heuristic rule 5:
L'w5 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U}
Compared with heuristic rule 4, the result L'w5 of heuristic rule 5 contains more words that do not occur in the UGC; L'w5 is longer than the results of the preceding rules and may contain many keywords of relatively low importance.
Heuristic rule 6:
L'w6 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U > Pl-U}
The result L'w6 of heuristic rule 6 contains all keywords of the external resource; it is therefore the longest and the most redundant, and it is a superset of all the rules above.
Each of the six heuristic rules above is a superset of the previous one and contains more keywords than the previous one, while the average importance of those keywords decreases. With these six heuristic rules, Lw' achieves different recall-precision trade-offs. Moreover, which heuristic rule is the most suitable differs for different external resources.
The concrete algorithm is as follows: the inputs are the related page P and a threshold n that limits the number of keywords in the semantic entity, and the output is the optimized semantic entity Lw'. Starting from heuristic rule 1, the heuristic rules are applied to page p in turn, keywords are extracted and de-duplicated, and the keywords carrying the most information are added to Lw'.
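A minimal sketch of this extraction step under the six heuristic rules. The page field names ('title', 'abstract', 'infobox', 'rest'), the way highlighted words are passed in, and the exact redundancy test are assumptions; the priority classes and the rule definitions follow the text above.

```python
PRIORITY = ['Pt', 'Pa-S', 'Pi-S', 'Pl-S', 'Pa-U', 'Pi-U', 'Pl-U']

def classify(word, field, ugc_words):
    """Assign a priority class to a highlighted (hyperlinked or boldface) word
    taken from the page field 'title', 'abstract', 'infobox' or 'rest'."""
    if field == 'title':
        return 'Pt'
    prefix = {'abstract': 'Pa', 'infobox': 'Pi', 'rest': 'Pl'}[field]
    suffix = 'S' if word in ugc_words else 'U'
    return prefix + '-' + suffix

def drop_redundant(words):
    """Keep only the longer of two overlapping words whose lengths differ
    by less than 2 (the redundancy check described above)."""
    kept = []
    for w in sorted(set(words), key=len, reverse=True):
        if not any(w in k and len(k) - len(w) < 2 for k in kept):
            kept.append(w)
    return kept

def optimize_entity(page_words, ugc_words, rule=4, n=20):
    """page_words: (word, field) pairs for the highlighted words of the related
    page; rule selects heuristic rule 1..6; n bounds the length of Lw'."""
    allowed = PRIORITY[:rule + 1]        # rule k admits the first k+1 classes
    buckets = {cls: [] for cls in allowed}
    for word, field in page_words:
        cls = classify(word, field, ugc_words)
        if cls in buckets:
            buckets[cls].append(word)
    lw_opt = []
    for cls in allowed:                  # highest-priority classes first
        for word in drop_redundant(buckets[cls]):
            if len(lw_opt) >= n:
                return lw_opt
            if word not in lw_opt:
                lw_opt.append(word)
    return lw_opt
```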
In the experiments for the optimization stage, query statements are built from the first three keywords of Lw for the hedge forum, yielding five candidate pages, and from the first five keywords of Lw for Sina Weibo, yielding three candidate pages. When Lw' is constructed, both the UGC and the related pages are taken into account. With the six heuristic rules, seven different expressions of each semantic entity are obtained, including the one before optimization: Lw denotes the semantic entity before optimization, i.e. the result obtained with the traditional method, and L'w1, L'w2, ..., L'w6 denote the semantic entities produced by the six heuristic rules respectively.
Figs. 6 to 11 show the F-measure values for the hedge forum and Sina Weibo data sets under the different α settings. From these six figures the following conclusions can be drawn:
1) Under all three α settings, the results obtained after optimization with the external resource are more accurate than those of the original method; the best value is nearly three to four times higher than that of the original semantic entity.
2) On average, in all six figures the F-measure values of L'w4, L'w5, and L'w6 are higher than that of Lw, so the optimization of the present invention is indeed necessary. Heuristic rules 4 to 6 can include keywords from outside the UGC; in other words, extracting semantic entities from the target data set alone is not suitable for low-quality UGC.
3) For the hedge forum the peak occurs at L'w4, roughly twice the value of Lw, so the method presented here greatly improves the accuracy of semantic entity expression. For Sina Weibo the peak lies between L'w5 and L'w2. This shows that different data sets should select different optimization strategies when external resources are used.
4) The values of L'w4 and L'w5 are quite close, which means that in the Wikipedia external resource the keywords most related to a semantic entity in fact appear in the abstract. Later processing could therefore extract only the keywords in the title and the abstract, improving the efficiency of optimization with external resources.
5) In the hedge forum, the values of L'w1 and L'w2 are actually lower than that of Lw. The reason is that heuristic rules 1 and 2 only extract keywords that occur simultaneously in the UGC and in the abstract and infobox of the external resource. Since (1) UGC is short and irregular and (2) the abstract and infobox are short, the probability that a keyword appears both in the UGC and in the abstract or infobox of the external resource is very low. Among the six heuristic rules, rules 1 and 2 are the strictest and produce the fewest keywords per semantic entity, so their results are actually worse than the semantic entity before optimization. In Sina Weibo this does not happen, because Sina Weibo is of even lower quality than the hedge forum.
6) When recall is emphasized, the F-measure reached by heuristic rule 3 is close to that of the semantic entity Lw before optimization. That is, the external resource essentially covers the keywords used in the UGC to express a given semantic entity, so expanding with the external resource does not lose the content of the UGC.
The protected content of the present invention is not limited to the above embodiments. Changes and advantages conceivable to those skilled in the art without departing from the spirit and scope of the inventive concept are all included in the present invention, the scope of protection being defined by the appended claims.

Claims (8)

1. A method for semantic annotation of user-generated content based on external data sources, characterized in that it comprises the following steps:
Preprocessing step: clustering the user-generated content to obtain one or more semantic entities;
Configuration step: generating query statements from the keywords in said semantic entities, searching external resources with said query statements, locating and crawling from them the page sets related to said semantic entities, and assigning a weight value to each page in said page set according to its degree of relevance, said weight value measuring the degree of correlation between the page and the semantic entity;
Semantic annotation step: using said weight values to extract information related to said semantic entities from said page sets and using that information to supplement the annotation of said semantic entities, thereby obtaining expanded and optimized semantic entities.
2. The method for semantic annotation of user-generated content based on external data sources according to claim 1, characterized in that, in said preprocessing step, said user-generated content is clustered into said semantic entities using natural language processing and information extraction techniques, said information extraction techniques including the single-pass clustering algorithm and support vector machines.
3. The method for semantic annotation of user-generated content based on external data sources according to claim 1, characterized in that, in said configuration step, said semantic entity consists of one or more keywords, and the process of combining the keywords into query statements for searching external resources comprises the following steps:
Step a1: following the Apriori algorithm, search with each single keyword of said semantic entity as a query statement;
Step a2: collect the single keywords whose searches return results into an interim set, and search again with each keyword in said interim set combined with one additional keyword as the query statement;
Step a3: repeat step a2 until no keyword combination returns any result, or all keywords in said interim set have been combined and searched as query statements.
4. The method for semantic annotation of user-generated content based on external data sources according to claim 1, characterized in that said external resource is an online data source shared over the network or an offline data source stored on a local device.
5. The method for semantic annotation of user-generated content based on external data sources according to claim 4, characterized in that, if said external resource is an online data source, the process of searching said online data source and crawling the page set comprises the following steps:
Step b1: set up the search word, the related page set, and the keyword list, said keyword list being sorted in descending or ascending order;
Step b2: combine the search word with each word in said keyword list in turn, search the external resource with the combined search word, and, whenever the search returns a related page, crawl said page and add it to said related page set;
Step b3: assign a weight value to each page in said related page set and sort the pages by said weight value in descending or ascending order.
6. The method for semantic annotation of user-generated content based on external data sources according to claim 1, characterized in that the process of configuring the weight value of said page comprises the following steps:
Step c1: compute the position weight parameter of said keywords within said query statement;
Step c2: compute the count weight parameter, i.e. the number of times said page is crawled into said page set;
Step c3: compute the match weight parameter, i.e. the degree to which said page matches the keywords;
Step c4: compute the special weight parameter, based on the number of occurrences of special phrases in said page;
Step c5: normalize said position weight parameter, said count weight parameter, said match weight parameter, and said special weight parameter, and multiply the normalized values together to obtain the weight value of said page.
7. The method for semantic annotation of user-generated content based on external data sources according to claim 1, characterized in that, in said preprocessing step, relevant information is extracted from said page set with the following priority:
Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U > Pl-U;
wherein Pt denotes the page title, Pa denotes the first paragraph, Pi denotes the infobox, Pl denotes the remainder of the page, S indicates that the information also appears in said user-generated data, and U indicates that the information does not appear in said user-generated data.
8. The method for semantic annotation of user-generated content based on external data sources according to claim 7, characterized in that, in said preprocessing step, six heuristic rules are established on the basis of said priority, each heuristic rule being expressed by one of the following formulas:
L'w1 = {Pt > Pa-S};
L'w2 = {Pt > Pa-S > Pi-S};
L'w3 = {Pt > Pa-S > Pi-S > Pl-S};
L'w4 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U};
L'w5 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U};
L'w6 = {Pt > Pa-S > Pi-S > Pl-S > Pa-U > Pi-U > Pl-U};
wherein L'w1 to L'w6 denote the semantic entities optimized by the six heuristic rules respectively, Pt denotes the page title, Pa denotes the first paragraph, Pi denotes the infobox, Pl denotes the remainder of the page, S indicates that the information also appears in said user-generated data, and U indicates that the information does not appear in said user-generated data.
CN201410675420.XA 2014-11-21 2014-11-21 Method for making semantic annotations on content generated by users based on external data sources Pending CN105677684A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410675420.XA CN105677684A (en) 2014-11-21 2014-11-21 Method for making semantic annotations on content generated by users based on external data sources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410675420.XA CN105677684A (en) 2014-11-21 2014-11-21 Method for making semantic annotations on content generated by users based on external data sources

Publications (1)

Publication Number Publication Date
CN105677684A true CN105677684A (en) 2016-06-15

Family

ID=56958212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410675420.XA Pending CN105677684A (en) 2014-11-21 2014-11-21 Method for making semantic annotations on content generated by users based on external data sources

Country Status (1)

Country Link
CN (1) CN105677684A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229733A (en) * 2017-06-12 2017-10-03 上海智臻智能网络科技股份有限公司 Evaluation method and device are asked in extension
CN107291930A (en) * 2017-06-29 2017-10-24 环球智达科技(北京)有限公司 The computational methods of weight number
CN107885722A (en) * 2017-10-31 2018-04-06 北京奇艺世纪科技有限公司 A kind of keyword abstraction method and device
CN109299129A (en) * 2018-09-05 2019-02-01 深圳壹账通智能科技有限公司 Data query method, apparatus, computer equipment and the storage medium of natural language

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101625695A (en) * 2009-08-20 2010-01-13 中国科学院计算技术研究所 Method and system for extracting complex named entities from Web video pages
US20130031076A1 (en) * 2011-07-28 2013-01-31 Kikin, Inc. Systems and methods for contextual searching of semantic entities

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101625695A (en) * 2009-08-20 2010-01-13 中国科学院计算技术研究所 Method and system for extracting complex named entities from Web video pages
US20130031076A1 (en) * 2011-07-28 2013-01-31 Kikin, Inc. Systems and methods for contextual searching of semantic entities
CN104025085A (en) * 2011-07-28 2014-09-03 纪金有限公司 Systems And Methods For Providing Information Regarding Semantic Entities Included In A Page Of Content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杜鹃 (Du Juan): "Web data extraction based on external resources", China Masters' Theses Full-text Database, Information Science and Technology *
段宇锋 (Duan Yufeng) et al.: "Research on semantic annotation of Chinese species description text based on self-learning rules", New Technology of Library and Information Service *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229733A (en) * 2017-06-12 2017-10-03 上海智臻智能网络科技股份有限公司 Evaluation method and device are asked in extension
CN107229733B (en) * 2017-06-12 2020-01-14 上海智臻智能网络科技股份有限公司 Extended question evaluation method and device
CN107291930A (en) * 2017-06-29 2017-10-24 环球智达科技(北京)有限公司 The computational methods of weight number
CN107885722A (en) * 2017-10-31 2018-04-06 北京奇艺世纪科技有限公司 A kind of keyword abstraction method and device
CN107885722B (en) * 2017-10-31 2021-05-25 北京奇艺世纪科技有限公司 Keyword extraction method and device
CN109299129A (en) * 2018-09-05 2019-02-01 深圳壹账通智能科技有限公司 Data query method, apparatus, computer equipment and the storage medium of natural language

Similar Documents

Publication Publication Date Title
JP6309644B2 (en) Method, system, and storage medium for realizing smart question answer
US9009134B2 (en) Named entity recognition in query
CN111143479A (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
Bouaziz et al. Short text classification using semantic random forest
CN105045875B (en) Personalized search and device
CN104281702B (en) Data retrieval method and device based on electric power critical word participle
CN106951438A (en) A kind of event extraction system and method towards open field
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
CN112256939B (en) Text entity relation extraction method for chemical field
CN101169780A (en) Semantic ontology retrieval system and method
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN102200975A (en) Vertical search engine system and method using semantic analysis
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN109597995A (en) A kind of document representation method based on BM25 weighted combination term vector
CN104765779A (en) Patent document inquiry extension method based on YAGO2s
CN107341188A (en) Efficient data screening technique based on semantic analysis
CN104346382B (en) Use the text analysis system and method for language inquiry
CN105677684A (en) Method for making semantic annotations on content generated by users based on external data sources
Nethra et al. WEB CONTENT EXTRACTION USING HYBRID APPROACH.
CN114298020A (en) Keyword vectorization method based on subject semantic information and application thereof
JP6173958B2 (en) Program, apparatus and method for searching using a plurality of hash tables
CN112115269A (en) Webpage automatic classification method based on crawler

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160615