CN103164428B

CN103164428B - Method and apparatus for determining the relevance of microblog posts to a given entity

Info

Publication number: CN103164428B
Application number: CN201110414476.6A
Authority: CN
Inventors: 张姝; 孟遥; 夏迎炬; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-12-13
Filing date: 2011-12-13
Publication date: 2016-01-20
Anticipated expiration: 2031-12-13
Also published as: CN103164428A

Abstract

The present invention relates to a method and apparatus for determining the relevance of microblog posts to a given entity. The method for determining the relevance of each of a plurality of microblog posts to a given entity comprises: extracting features of each of the plurality of microblog posts; determining the similarity between microblog posts according to the extracted features; and, using the determined similarity between microblog posts, determining the relevance of each of the plurality of microblog posts to the given entity based on a semi-supervised classifier.

Description

Method and apparatus for determining the relevance of microblog posts to a given entity

Technical field

The present invention relates to the field of microblog information mining, and in particular to a method and apparatus for determining the relevance of microblog posts to a given entity.

Background art

Microblogs (for example, Twitter, Sohu Weibo, and Tencent Weibo), as a form of social media, have rapidly gained worldwide popularity. How to manage microblog-related information in order to grasp people's responses to government policies and their feedback on and reviews of products has attracted a great deal of attention from research institutions. Some lines of research, such as opinion mining and online reputation management, focus on monitoring user-generated media. One of the key tasks in such research is to first obtain the information relevant to the entity under study (for example, a product, a company, or a particular event).

Obtaining the information relevant to the entity under study faces two problems. First, both microblog posts and entities carry little information. Microblogs differ from traditional user-generated media: they allow users to post messages of no more than 140 characters, so little contextual information is available. Monitoring and analyzing these messages is therefore challenging. In addition, entity names can be ambiguous, which makes the task harder. For example, Apple, the name of Apple Inc., can also denote the fruit, and Amazon, the name of the company Amazon, can also denote the Amazon River. Filtering out spurious name matches is very important for effectively detecting and analyzing the content in which people actually talk about the entity. Second, the organizations in the training data differ from the entities in the test data, which makes it difficult to train a classifier for a specific entity.

Therefore, a technique that can solve the above problems is needed.

Summary of the invention

A brief summary of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit the scope of the invention. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.

A main object of the present invention is to provide a method and apparatus for determining the relevance of microblog posts to a given entity.

According to one aspect of the present invention, a method for determining the relevance of each of a plurality of microblog posts to a given entity is provided, comprising: extracting features of each of the plurality of microblog posts; determining the similarity between microblog posts according to the extracted features; and, using the determined similarity between microblog posts, determining the relevance of each of the plurality of microblog posts to the given entity based on a semi-supervised classifier.

According to another aspect of the present invention, an apparatus for determining the relevance of each of a plurality of microblog posts to a given entity is provided, comprising: a microblog feature extraction unit configured to extract features of each of the plurality of microblog posts; a similarity determining unit configured to determine the similarity between microblog posts according to the extracted features; and a relevance determination unit configured to use the determined similarity between microblog posts to determine the relevance of each of the plurality of microblog posts to the given entity based on a semi-supervised classifier.

According to a further aspect of the present invention, a computer program for implementing the above method is provided.

According to a further aspect of the present invention, a computer program product in the form of a computer-readable medium is provided, on which computer program code for implementing the above method is recorded.

These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the invention taken in conjunction with the accompanying drawings.

Brief description of the drawings

The above and other objects, features, and advantages of the present invention will be more readily understood from the following description of embodiments of the invention with reference to the accompanying drawings. The components in the drawings are intended only to illustrate the principles of the invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals.

Fig. 1 is a flowchart showing a method for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention;

Fig. 2 is a flowchart showing a method for determining the relevance of microblog posts to an entity based on the label propagation algorithm according to an embodiment of the present invention;

Fig. 3 is a flowchart showing a method for determining the relevance of microblog posts to a given entity by combining a supervised classifier with a semi-supervised classifier according to an embodiment of the present invention;

Fig. 4 is a schematic diagram showing a web page with encyclopedia attributes, which is used to disambiguate a term;

Fig. 5 is a schematic diagram showing a related-term query page, which is used to look up terms related to a specific term;

Fig. 6 is a block diagram showing the configuration of an apparatus for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention;

Fig. 7 is a block diagram showing a schematic configuration of the relevance determination unit according to an embodiment of the present invention;

Fig. 8 is a block diagram showing an exemplary configuration of an apparatus for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention;

Fig. 9 is a block diagram showing the configuration of the necessity judging unit according to an embodiment of the present invention;

Fig. 10 is a block diagram showing the configuration of the seed selection module according to an embodiment of the present invention; and

Fig. 11 is a structural diagram showing an example of a computing device that can be used to implement the method and apparatus for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described below with reference to the accompanying drawings. Elements and features described in one drawing or embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that, for the sake of clarity, representations and descriptions of components and processes that are unrelated to the invention and known to those of ordinary skill in the art are omitted from the drawings and the description.

A method 100 for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention is described below with reference to Fig. 1.

Fig. 1 is a flowchart showing the method 100 for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention.

As shown in Fig. 1, in step S102, features of each of a plurality of microblog posts can be extracted.

In step S104, the similarity between microblog posts can be determined according to the extracted features.

In step S106, the determined similarity between microblog posts can be used to determine the relevance of each of the plurality of microblog posts to the given entity based on a semi-supervised classifier. For example, based on the semi-supervised classifier, each microblog post can be labeled true or false, where true indicates that the post is relevant to the given entity and false indicates that it is not.

The semi-supervised classifier can be any suitable semi-supervised classifier, as required. For example, it can be a classifier based on label propagation or a classifier based on the bootstrapping algorithm.

The bootstrapping algorithm is described below. Bootstrapping, also known as self-training, is a semi-supervised learning method. Its core idea is to first estimate the initial parameters of the system from a small amount of manually labeled data. While the system is running, if it finds unlabeled data that is highly similar to the manually labeled data, it adds that data to the training set as "automatically labeled" data and retrains, thereby improving system performance.
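The self-training loop described above can be illustrated with a minimal sketch. It is not the patented implementation: the Jaccard similarity, the nearest-neighbor "model", the `threshold` value, and the function names are all assumptions chosen only to keep the example self-contained.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def self_train(labeled, unlabeled, threshold=0.5, max_rounds=10):
    """Grow the labeled set by repeatedly auto-labeling similar items.

    labeled:   list of (tokens, label) pairs (the manually labeled data)
    unlabeled: list of token lists
    threshold: hypothetical similarity cut-off for accepting an auto-label
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        accepted, remaining = [], []
        for tokens in pending:
            # Nearest labeled example under the assumed similarity measure.
            best_tokens, best_label = max(
                labeled, key=lambda pair: jaccard(tokens, pair[0])
            )
            if jaccard(tokens, best_tokens) >= threshold:
                accepted.append((tokens, best_label))
            else:
                remaining.append(tokens)
        if not accepted:
            break  # nothing new was labeled, so further rounds change nothing
        # "Retraining" here simply means the enlarged labeled set is the
        # model used in the next round.
        labeled.extend(accepted)
        pending = remaining
    return labeled, pending
```

In this sketch an item that is not similar enough to any manually labeled example can still be labeled in a later round, once an intermediate item has been added to the training set.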

A method 200 for determining the relevance of microblog posts to an entity based on the label propagation algorithm is described below with reference to Fig. 2. Fig. 2 is a flowchart showing the method 200 according to an embodiment of the present invention.

The basic idea of the label propagation algorithm is that, on a weighted graph, labels propagate from labeled nodes to unlabeled nodes. During propagation, the larger the weight of an edge, the more easily a label propagates across it; that is, the higher the similarity between two nodes, the more they tend to belong to the same class. In other words, if the similarity between two microblog posts is high, the two posts tend to be either both relevant or both irrelevant to a particular entity.

As shown in Fig. 2, in step S202, a microblog node graph can be built by treating each of the plurality of microblog posts as a node, building an edge between any two microblog posts that have a common feature, and representing the weight of the edge by the similarity between the two posts.

Specifically, label propagation is carried out on a graph G = {V, E, W}, where V is the set of n nodes, E is the set of m edges, and W is the n × n matrix of weights W_ij, where W_ij is the weight of edge (i, j).
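As an illustration of step S202, the following sketch builds the weight matrix W for a small set of posts. The text does not fix a similarity measure, so Jaccard similarity over feature sets is used here purely as an assumption; `build_weight_matrix` is a hypothetical name.

```python
from itertools import combinations


def build_weight_matrix(features):
    """Build the symmetric n x n weight matrix W of the microblog node graph.

    features: list of feature sets, one per microblog post (one node each).
    An edge (non-zero weight) exists only between posts sharing a feature.
    The edge weight is the Jaccard similarity, an assumed similarity measure.
    """
    n = len(features)
    W = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = features[i] & features[j]
        if shared:
            weight = len(shared) / len(features[i] | features[j])
            W[i][j] = W[j][i] = weight
    return W
```

Posts without any common feature keep a weight of zero, which corresponds to having no edge between them.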

In step S204, some of the nodes can be selected as seeds. Seeds can be selected in various ways. For example, they can be selected manually. Alternatively, a supervised classifier (for example, a maximum entropy classifier or a naive Bayes classifier) can be used to select seeds; the process of selecting seeds with a maximum entropy classifier is described in detail later.

In step S206, the relevance of each of the plurality of microblog posts to the given entity can be determined based on the label propagation algorithm.
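A minimal sketch of step S206 follows. The text does not specify which variant of label propagation is used; the version below (iterative weighted averaging over neighbors with seed labels clamped) is one common formulation and is an assumption, as are the function name and the iteration count.

```python
def propagate_labels(W, seeds, iterations=100):
    """Propagate seed labels over a weighted graph.

    W:     symmetric n x n weight matrix (list of lists)
    seeds: dict mapping node index -> True (relevant) / False (irrelevant)
    Returns a list of scores in [0, 1]; a score above 0.5 suggests "relevant".
    """
    n = len(W)
    score = [0.5] * n  # unlabeled nodes start undecided
    for i, label in seeds.items():
        score[i] = 1.0 if label else 0.0
    for _ in range(iterations):
        updated = score[:]
        for i in range(n):
            if i in seeds:
                continue  # seed labels stay clamped
            total = sum(W[i])
            if total > 0:
                # Heavier edges pull a node's score more strongly.
                updated[i] = sum(W[i][j] * score[j] for j in range(n)) / total
        score = updated
    return score
```

For a chain of four posts where the first is a relevant seed and the last an irrelevant seed, a strongly weighted edge pulls each middle node toward the seed it is most similar to.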

Fig. 3 is a flowchart showing a method 300 for determining the relevance of microblog posts to a given entity by combining a supervised classifier with a semi-supervised classifier according to an embodiment of the present invention. Here, steps S102, S104, and S106 are the same as those described with reference to Fig. 1.

As shown in Fig. 3, in step S102, features of each of the plurality of microblog posts can be extracted.

In step S302, features associated with the given entity can be extracted.

Specifically, words associated with the given entity can be extracted as features from at least one of the following pages: the entity homepage associated with the given entity, a web page with online-encyclopedia attributes, and a web page that helps users obtain associated keywords from a few keywords. The reason for doing so is that an entity name usually contains few words, and some entity names are ambiguous, for example Apple and Amazon. More information about the entity can be obtained by introducing external resources.

For example, words associated with the given entity can be extracted as features from the entity homepage of the given entity. The entity homepage can be located from the URL of each entity. Words in the entity homepage are usually more relevant to the entity and more representative of it, so words from the entity homepage, excluding stop words, are selected to represent the entity. However, the web pages of some entities are created with JavaScript, or even with Flash, so it is still difficult to extract text from these pages.

In addition, words associated with the given entity can be extracted as features from a web page with encyclopedia attributes (for example, the page shown in Fig. 4). One example of such a page is a Wikipedia page. To obtain higher-quality information about the entity and to overcome the problem of missing homepages, the disambiguation pages of, for example, Wikipedia can be used. For example, if the name of the given entity is ambiguous, candidate related pages can be queried from the web page with online-encyclopedia attributes. Each candidate related page can then be analyzed to determine whether it contains the URL of the entity homepage of the given entity. If it does, the candidate related page can be considered to be truly associated with the given entity, and the words in that page are extracted as features for the entity.
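The candidate-page check described above can be sketched as follows. The input format (a mapping from candidate page title to the URLs found in that page), the URL normalization rules, and both function names are hypothetical; fetching and parsing the encyclopedia pages is deliberately left out.

```python
from urllib.parse import urlparse


def normalize_url(url):
    """Reduce a URL to host plus path so trivially different forms compare equal."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def pages_linking_to_homepage(candidates, homepage_url):
    """Return the candidate pages that contain the entity homepage URL.

    candidates: dict mapping a candidate page title to the list of URLs
                found in that page (an assumed, already-extracted input).
    """
    target = normalize_url(homepage_url)
    return [
        title
        for title, links in candidates.items()
        if any(normalize_url(link) == target for link in links)
    ]
```

Only the pages returned by this check would then be used as a source of words for the entity.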

Fig. 4 is a schematic diagram showing an example of a web page with encyclopedia attributes, which is used to disambiguate a term. As shown in Fig. 4, entering Linux in the page, for example, returns several explanations of Linux, and these explanations can be used to disambiguate the term.

In addition, words associated with the given entity can be extracted as features from a web page for obtaining associated keywords (for example, the page shown in Fig. 5). Google Sets is one example of such a page. Google Sets provides words similar to the query terms and can therefore be used to enrich the information about an entity. For example, if "Yale University" is entered in the Google Sets page, the associated words "Stanford" and "Columbia" are returned. This information is useful because it provides latent semantic information to some extent.

Fig. 5 is a schematic diagram showing a related-term query page, which is used to look up related terms. As shown in Fig. 5, entering Linux in the related-term query page, for example, returns terms related to Linux such as windows, windows 7, mac, windows xp, windows vista, android, mobile, unix, iphone, mac os, solaris, internet explorer, and windows live. All of the returned terms are related to Linux, which gives latent semantic information about Linux to some extent.

In addition, the URLs in the homepage and in the Wikipedia page are also very strong indicators. If a microblog post contains the same URL as the homepage or the Wikipedia page, the post is more likely to be relevant to the entity.

Corresponding to the above features for the entity, unigrams, bigrams, capitalized words, and URLs can be extracted from the microblog post as features. Taking "西安交通大学" (Xi'an Jiaotong University) as an example, in the unigram case it is represented as 西/安/交/通/大/学, and in the bigram case it is represented as 西安/安交/交通/通大/大学.
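The unigram and bigram representations in the example above are character n-grams. A minimal sketch (the function name is an assumption) is:

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Applied to the six-character name 西安交通大学 (Xi'an Jiaotong University), n = 1 gives six single characters and n = 2 gives five overlapping character pairs.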

In addition, the metadata in the entity homepage can be defined as important features. The meta tags in an HTML page provide high-quality keywords that represent the page. If a web page has metadata, that metadata serves as a good feature for representing the entity.
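As a rough illustration of using meta tags as features, the sketch below collects the comma-separated values of a `keywords` meta tag with Python's standard `html.parser` module. Restricting the example to the `keywords` tag and the class name `MetaKeywordParser` are assumptions; the text does not say which meta tags are used.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                part.strip() for part in content.split(",") if part.strip()
            )
```

A parser instance is fed the page source with `feed(...)`, after which `keywords` holds the extracted terms.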

In addition, capitalized words and/or uniform resource locators (URLs) in the entity homepage and/or the web page with online-encyclopedia attributes can also be defined as important features. Capitalized words are likely to be more important words or named entities. Selecting these words as a separate kind of feature gives them extra emphasis.

The features of a microblog post can be extracted in any appropriate way, and the post is then represented by the extracted features. When the features of the entity are extracted from external resources, the microblog post corresponding to the given entity can be represented as:

\mathrm{Vector}(T_i, O_k) = \{F_1, F_2, \ldots, F_n\} \qquad (1)

Here, T_i is a microblog post, O_k is an entity, and F_i is one of the feature categories described above. For example, F_1 can denote the features extracted from the homepage, F_2 the features extracted from the Wikipedia page, and F_3 the features extracted from Google Sets. The value of each F_i can be calculated by formula (2).

\mathrm{Value}(F_i) = \sum_{m} Wt_m \qquad (2)

Here, Wt_m is the weight of feature t_m. The weight can be calculated by term frequency-inverse document frequency (TF-IDF), or can simply be assigned a value in {0, 1}. t_m is a feature that occurs in both F_i and T_i.
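A minimal sketch of formulas (1) and (2) is given below, assuming each feature category F_i is available as a set of terms. The function names and the convention that `weights=None` selects the binary {0, 1} weighting are assumptions; computing TF-IDF weights is outside the scope of the sketch.

```python
def feature_value(post_terms, category_terms, weights=None):
    """Value(F_i): sum of the weights of the terms shared by the post and F_i.

    weights: optional dict term -> weight (for example TF-IDF values computed
             elsewhere); None means every shared term has weight 1.
    """
    shared = set(post_terms) & set(category_terms)
    if weights is None:
        return float(len(shared))
    return sum(weights.get(term, 0.0) for term in shared)


def post_vector(post_terms, categories, weights=None):
    """Vector(T_i, O_k): one value per feature category F_1 ... F_n."""
    return [feature_value(post_terms, terms, weights) for terms in categories]
```

With the categories ordered as homepage terms, encyclopedia terms, and related-keyword terms, each post is reduced to a short numeric vector that can be passed to a classifier or a similarity function.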

The description of method 300 now continues with reference to Fig. 3. In step S304, a trained supervised classifier can be used to preliminarily determine the relevance of each of the plurality of microblog posts to the given entity.

In the process of training the supervised classifier, training data can be used to train a supervised classifier with general characteristics. Preferably, these characteristics are neither too general with respect to microblog posts labeled true nor too narrow with respect to microblog posts labeled false.

For example, the supervised classifier can be a naive Bayes classifier or a maximum entropy classifier.

The naive Bayes classifier is described below. The naive Bayes classifier is a special kind of Bayes classifier. In essence, it uses Bayes' conditional probability formula to calculate, given the feature vector of a text document, the conditional probability (that is, the posterior probability) that the document belongs to each text class, and then assigns the document to the class with the maximum posterior probability. It is called naive because it assumes that the features making up the feature vector are mutually independent. The naive Bayes classifier has received wide attention and is commonly used in research on automatic text classification.

Suppose that the text feature vector contains d different features, that is, x = (x_1, x_2, ..., x_d)^T, and suppose that there are l different text classes in total, with class labels w_1, w_2, ..., w_l. The Bayes classifier then determines the text class of the feature vector x as:

w(x) = \arg\max_{1 \leq j \leq l} p(w_j \mid x) = \arg\max_{1 \leq j \leq l} p(x \mid w_j) \, p_j \qquad (3)

Here, p_j, j = 1, 2, ..., l, is the class prior probability of each text class.

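Since the features are assumed to be mutually independent, the class-conditional probability of the feature vector can be rewritten as:

p(x \mid w_j) = \prod_{i=1}^{d} p(x_i \mid w_j) \qquad (4)

Thus, the naive Bayes classifier determines the text class of the feature vector x as:

w(x) = \arg\max_{1 \leq j \leq l} p_j \prod_{i=1}^{d} p(x_i \mid w_j) \qquad (5)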

The class prior probability p_j can be estimated as the proportion of training samples that belong to the j-th class. There are many ways to estimate the conditional probability p(x_i | w_j). One commonly used estimate is:

p(x_i \mid w_j) = \frac{n_i + 1}{n + d} \qquad (6)

Here, n_i is the total number of times feature x_i occurs in the training documents of the j-th class, n is the total number of times all features occur in the training documents of the j-th class, and d is the dimension of the feature vector.
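The following sketch illustrates a naive Bayes text classifier that uses the class priors and the smoothed estimate of equation (6). It is an illustration under assumptions rather than the patented implementation: the function names, the token-list input format, and the use of log probabilities to avoid numerical underflow are all choices made for this example.

```python
import math
from collections import Counter


def train_naive_bayes(docs, vocab):
    """Estimate class priors p_j and conditionals p(x_i | w_j).

    docs:  list of (tokens, label) training pairs
    vocab: set of features; its size is d in equation (6)
    """
    d = len(vocab)
    label_counts = Counter(label for _, label in docs)
    prior = {w: count / len(docs) for w, count in label_counts.items()}
    cond = {}
    for w in prior:
        counts = Counter(
            t for tokens, label in docs if label == w for t in tokens if t in vocab
        )
        n = sum(counts.values())
        # Equation (6): (n_i + 1) / (n + d)
        cond[w] = {t: (counts[t] + 1) / (n + d) for t in vocab}
    return prior, cond


def classify_naive_bayes(tokens, prior, cond):
    """Return the class with the largest (log) posterior score."""
    def log_score(w):
        return math.log(prior[w]) + sum(
            math.log(cond[w][t]) for t in tokens if t in cond[w]
        )
    return max(prior, key=log_score)
```

Tokens outside the vocabulary are ignored at classification time, which is a simplification made for this sketch.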

The maximum entropy classifier is described below. The basic principle of the maximum entropy model is as follows: when only partial knowledge about the unknown distribution is available, there are generally several probability distributions consistent with that knowledge, and the one with the maximum entropy among them should be selected. Entropy measures the uncertainty of a random variable; uncertainty is greatest when entropy is greatest. Therefore, given partial knowledge, keeping entropy at its maximum subject to the known knowledge is the only choice that can be made without any subjective bias. The advantage of the maximum entropy model is that, when building the model, one only needs to focus on how to select features and does not need to consider how to use them.

Because the selected classification features are limited, maximum entropy classification results are sometimes poor, and how to further improve classification accuracy is a problem that urgently needs solving. There are many ways to improve it, for example selecting other features or selecting another classification model. However, these approaches all amount to discarding the classification results of the earlier maximum entropy stage, whereas mining new knowledge from the maximum entropy classification results and making use of it is also a way to improve. In addition, the set of entity names in the training data differs from the set of entity names in the test data, which makes a supervised classifier trained on the training data perform poorly on the test data. In order to exploit or mine the specific information of the particular entities in the test data to improve performance, a semi-supervised classification method (for example, the label propagation algorithm) can be used to revise the classification results given by the maximum entropy classifier.

Therefore, next, in step S306, it can be judged from the preliminary determination result of step S304 whether it is necessary to determine the relevance of each of the plurality of microblog posts to the given entity based on a semi-supervised classifier (for example, the label propagation algorithm). In this way, the task can be solved by combining supervised and semi-supervised methods.

Specifically, only the microblog sets of some entities, rather than of all entities, are selected for classification by the semi-supervised classifier. For example, the number of microblog posts determined to be irrelevant to the given entity can be compared with a corresponding threshold.

For example, the ratio of the number of microblog posts labeled false (that is, irrelevant to the given entity) by the supervised classifier (for example, the maximum entropy classifier) to the number of microblog posts for the given entity can be calculated as follows:

\mathrm{Ratio}(O_k) = \mathrm{Num}(\mathrm{False}) / \mathrm{Num}(O_k) \qquad (7)

Here, O_k denotes the entity, Num(False) denotes the number of microblog posts labeled false by the supervised classifier, and Num(O_k) is the number of microblog posts for the given entity.

If Ratio(O_k) for the given entity O_k is less than the threshold, the semi-supervised classifier is then used in steps S104 and S106 to classify the microblog posts for the given entity O_k. The microblog posts for other entities need not be classified with the semi-supervised classifier.

In other words, if the number of microblog posts determined to be irrelevant to the given entity is less than the threshold, the relevance of each of the plurality of microblog posts to the given entity is determined based on the semi-supervised classifier. If that number is not less than the threshold, the relevance is not further determined based on the semi-supervised classifier. If it is judged in step S306 that it is necessary to determine the relevance of each of the plurality of microblog posts to the given entity based on the semi-supervised classifier, the flow proceeds to step S102. If it is judged in step S306 that this is not necessary, the flow ends.
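The necessity check of step S306 can be sketched as follows. The function names are hypothetical and the text does not give a threshold value, so the caller must supply one.

```python
def false_ratio(preliminary_labels):
    """Ratio(O_k): share of an entity's posts labeled false by the supervised classifier."""
    if not preliminary_labels:
        return 0.0
    false_count = sum(1 for label in preliminary_labels if not label)
    return false_count / len(preliminary_labels)


def needs_semi_supervised(preliminary_labels, threshold):
    """True when the entity's posts should be reclassified by the semi-supervised classifier."""
    return false_ratio(preliminary_labels) < threshold
```

Entities for which the check returns `False` simply keep the labels assigned by the supervised classifier.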

Next, the process of selecting seeds using the preliminary determination result of step S304 is described.

When a supervised classifier is used, although the overall performance of the maximum entropy classifier is relatively low, its precision is very high on outputs with high confidence. These more accurately classified samples can therefore be used as seeds, and the semi-supervised algorithm is then used for classification.

Specifically, the trained supervised classifier can be used to determine the confidence of the relevance of each microblog post to the given entity. Then, microblog posts with high confidence can be selected as seeds from the posts relevant to the given entity and from the posts irrelevant to the given entity, respectively.

When the supervised classifier is a maximum entropy classifier, the maximum entropy classifier classifies each microblog post as true or false, where true indicates that the post is relevant to the given entity and false indicates that it is not. Next, the N posts with the highest confidence are selected from the posts labeled true, the N posts with the highest confidence are selected from the posts labeled false, and these 2N posts are used as seeds. Preferably, in order to obtain high seed-selection accuracy, N can be set to 10, for example. Of course, N can be a larger or smaller value as required.
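The seed selection just described can be sketched as follows, assuming the supervised classifier's output is available as `(post_id, label, confidence)` tuples; that input format and the function name are assumptions made for illustration.

```python
def select_seeds(predictions, n=10):
    """Pick the n most confident true posts and the n most confident false posts.

    predictions: list of (post_id, label, confidence) tuples produced by the
                 supervised classifier (assumed format).
    Returns a dict mapping post_id -> label for the selected seeds.
    """
    seeds = {}
    for wanted in (True, False):
        group = [p for p in predictions if p[1] is wanted]
        group.sort(key=lambda p: p[2], reverse=True)
        for post_id, label, _confidence in group[:n]:
            seeds[post_id] = label
    return seeds
```

If a class has fewer than n posts, all posts of that class become seeds, so the result can contain fewer than 2N entries.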

According to an embodiment of the present invention, an exception is proposed: after the step of determining the relevance of each of the plurality of microblog posts to the given entity based on the semi-supervised classifier, if a microblog post contains two or more words of the name of the given entity, the post is determined to be relevant to the given entity.

For example, if an entity name contains more than one word, such as Yale University, the name is considered to carry more semantic information for distinguishing it from other entities. If a microblog post contains the complete entity name (more than one word), that is, the post contains both "Yale" and "University", the post is labeled true, that is, it is regarded as relevant to the entity.
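The exception rule can be sketched as a post-processing step over the semi-supervised result. The whitespace/word-character tokenization and the function name are assumptions; for languages without word delimiters a different tokenizer would be needed.

```python
import re


def apply_name_rule(post_text, entity_name, predicted_label):
    """Override the label to True when the post contains two or more name words.

    Single-word names (for example "Apple") never trigger the rule, so the
    label predicted by the classifier is returned unchanged for them.
    """
    name_words = set(re.findall(r"\w+", entity_name.lower()))
    post_words = set(re.findall(r"\w+", post_text.lower()))
    if len(name_words & post_words) >= 2:
        return True
    return predicted_label
```

For the two-word name in the example above, the rule fires only when both "Yale" and "University" occur in the post.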

An apparatus 600 for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention is described below with reference to Fig. 6.

Fig. 6 is a block diagram showing the configuration of the apparatus 600 for determining the relevance of microblog posts to a given entity according to an embodiment of the present invention.

As shown in Fig. 6, the apparatus 600 comprises a microblog feature extraction unit 602, a similarity determining unit 604, and a relevance determination unit 606.

The microblog feature extraction unit 602 can extract features of each of a plurality of microblog posts. The similarity determining unit 604 can determine the similarity between microblog posts according to the extracted features. The relevance determination unit 606 can use the determined similarity between microblog posts to determine the relevance of each of the plurality of microblog posts to the given entity based on a semi-supervised classifier.

For example, the semi-supervised classifier can be a classifier based on label propagation or a classifier based on the bootstrapping algorithm. It should be understood that the semi-supervised classifiers listed here are only examples; in practice, any other suitable semi-supervised classifier can be used as required. The classifier based on label propagation and the classifier based on the bootstrapping algorithm have been described in detail above and, for brevity, are not described again here.

Fig. 7 is a block diagram showing a schematic configuration of the relevance determination unit 606 according to an embodiment of the present invention.

The relevance determination unit 606 can comprise a node graph building module 606-2, a seed selection module 606-4, and a first relevance determining module 606-6.

The node graph building module 606-2 can build a microblog node graph by treating each of the plurality of microblog posts as a node, building an edge between any two microblog posts that have a common feature, and representing the weight of the edge by the similarity between the two posts.

The seed selection module 606-4 can select some of the nodes as seeds. For example, as described below with reference to Fig. 8, the seed selection module 606-4 can select some of the nodes as seeds according to the preliminary determination result of the preliminary relevance determining unit 610.

The first relevance determining module 606-6 can determine the relevance of each of the plurality of microblog posts to the given entity based on the label propagation algorithm.

Fig. 8 is a block diagram showing an exemplary configuration of a device 800 for determining the correlation between a microblog and a given entity according to an embodiment of the present invention.

As shown in Fig. 8, in addition to the microblog feature extraction unit 602, the similarity determination unit 604, and the correlation determination unit 606, the device 800 can further include an entity feature extraction unit 608, a preliminary correlation determination unit 610, and a necessity judgment unit 612.

Since the functions of the microblog feature extraction unit 602, the similarity determination unit 604, and the correlation determination unit 606 have already been described with reference to Fig. 6, they are not repeated here for brevity.

The entity feature extraction unit 608 can extract features associated with the given entity. Specifically, it can extract words associated with the given entity as features from at least one of the following pages: the entity homepage associated with the given entity, a web page having the nature of an online encyclopedia, and a web page that helps users obtain related keywords from a few keywords. Since these three kinds of pages have already been described with reference to Fig. 3, Fig. 4, and Fig. 5, they are not repeated here for brevity.

The preliminary correlation determination unit 610 can use a trained supervised classifier to make a preliminary determination of the correlation between each microblog of the plurality of microblogs and the given entity. For example, the supervised classifier can be a naive Bayes classifier or a maximum entropy classifier.
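To make the preliminary determination concrete, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing. It is an illustration under stated assumptions, not the classifier of the embodiment: the class name, the "related"/"unrelated" labels, the word-list input format and the smoothing choice are all hypothetical.

```python
import math
from collections import Counter


class NaiveBayes:
    """Minimal multinomial naive Bayes over word lists (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, words):
        """Return a dict mapping each class to its posterior probability."""
        log_scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c])
            for w in words:
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.counts[c][w] + 1) / total)
            log_scores[c] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}
```

The class with the highest posterior serves as the preliminary determination, and the posterior itself can be read as the confidence of that determination.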

The necessity judgment unit 612 can judge, according to the preliminary determination result of the preliminary correlation determination unit 610, whether it is necessary to determine the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier.

Fig. 9 is a block diagram showing the configuration of the necessity judgment unit 612 according to an embodiment of the present invention.

As shown in Fig. 9, the necessity judgment unit 612 can include a comparison module 612-2 and a second correlation determination module 612-4.

The comparison module 612-2 can compare the number of microblogs determined to be unrelated to the given entity against a corresponding threshold. If the number of microblogs determined to be unrelated to the given entity is less than the threshold, the second correlation determination module 612-4 can determine the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier.
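The comparison performed by the comparison module 612-2 reduces to a single count and test, sketched below. The function name and the "related"/"unrelated" label strings are assumptions; the threshold value is left to the caller because the text does not state one.

```python
def needs_semi_supervised(preliminary_labels, threshold):
    """Return True when the semi-supervised step should run.

    preliminary_labels: one label per microblog from the preliminary
    determination, each either "related" or "unrelated" (assumed encoding).
    threshold: the corresponding threshold on the unrelated count.
    """
    unrelated = sum(1 for label in preliminary_labels if label == "unrelated")
    return unrelated < threshold
```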

Fig. 10 is a block diagram showing the configuration of the seed selection module 606-4 according to an embodiment of the present invention.

As shown in Fig. 10, the seed selection module 606-4 can include a confidence determination submodule 606-4a and a seed selection submodule 606-4b.

The confidence determination submodule 606-4a can use the trained supervised classifier to determine the confidence of the correlation between each microblog and the given entity. The seed selection submodule 606-4b can select microblogs with high confidence as seeds, separately from the microblogs related to the given entity and from the microblogs unrelated to the given entity.
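A sketch of this seed selection follows. The text only says that microblogs "with high confidence" are chosen from each of the two groups; picking the top `per_class` microblogs of each group is an assumption of this sketch, as are the function name and the (label, confidence) input format.

```python
def select_seeds(predictions, per_class):
    """Pick the most confident microblogs of each class as seeds.

    predictions: list of (label, confidence) pairs, one per microblog,
    where label is "related" or "unrelated" (assumed encoding).
    per_class: how many seeds to take from each class (assumed policy).
    Returns a dict mapping microblog index to its seed label.
    """
    seeds = {}
    for label in ("related", "unrelated"):
        ranked = sorted(
            (i for i, (lab, _) in enumerate(predictions) if lab == label),
            key=lambda i: predictions[i][1],
            reverse=True,
        )
        for i in ranked[:per_class]:
            seeds[i] = label
    return seeds
```

Selecting from both groups separately keeps the seed set from being dominated by whichever class the supervised classifier happens to be more confident about.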

According to an embodiment of the present invention, the device 600 or the device 800 can further include an important feature determination unit (not shown).

The important feature determination unit can determine capitalized words and/or uniform resource locators (URLs) in the entity homepage and/or in the web page having the nature of an online encyclopedia to be important features, and can determine the metadata in the entity homepage to be important features.
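As a rough illustration of how such important features might be collected from page text, consider the sketch below. The regular expressions, the function name, and the representation of homepage metadata as a dict of strings are all assumptions; a real page would first need HTML parsing, which is omitted here.

```python
import re

# Assumed patterns: a simple http(s) URL matcher and a capitalized-word matcher.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")
CAPITALIZED_PATTERN = re.compile(r"\b[A-Z][A-Za-z0-9]*\b")


def important_features(page_text, metadata=None):
    """Collect candidate important features from plain page text.

    page_text: text of the entity homepage or encyclopedia-style page.
    metadata: optional dict of homepage metadata fields, e.g. keywords.
    Returns a dict with three sets: urls, capitalized words, metadata words.
    """
    urls = set(URL_PATTERN.findall(page_text))
    # Remove URLs first so their fragments are not counted as words.
    text_without_urls = URL_PATTERN.sub(" ", page_text)
    capitalized = set(CAPITALIZED_PATTERN.findall(text_without_urls))
    meta_words = set()
    for value in (metadata or {}).values():
        meta_words.update(re.findall(r"\w+", value.lower()))
    return {"urls": urls, "capitalized": capitalized, "metadata": meta_words}
```

Note that this simple pattern also captures sentence-initial words such as "See", so a practical version would need an additional filter, for example a stop-word list.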

In addition, according to an embodiment of the present invention, if a microblog contains two or more words of the name of the given entity, the correlation determination unit 606 can determine that microblog to be related to the given entity.
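This name-based rule can be sketched as a set intersection. The function name, whitespace tokenization of the entity name, and case-insensitive matching are assumptions of the sketch.

```python
def matches_entity_name(microblog_words, entity_name, min_words=2):
    """Return True if the microblog contains at least min_words distinct
    words of the entity name (case-insensitive, assumed tokenization)."""
    name_words = set(entity_name.lower().split())
    present = name_words & {w.lower() for w in microblog_words}
    return len(present) >= min_words
```

For an entity named "Apple Inc", a microblog containing both "Apple" and "Inc" satisfies the rule, whereas one that mentions only "apple" (as in "apple pie") does not.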

The basic principles of the present invention have been described above in connection with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and device of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including a processor, a storage medium, and the like) or in a network of computing devices, and that they can do so using their basic programming skills after reading the description of the present invention.

Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device. The computing device can be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future.

When the embodiments of the present invention are implemented by software and/or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose computer 1100 shown in Fig. 11. With the various programs installed, this computer can perform various functions.

In Fig. 11, a central processing unit (CPU) 1101 performs various processes according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores, as needed, the data required when the CPU 1101 performs the various processes. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another via a bus 1104. An input/output interface 1105 is also connected to the bus 1104.

The following components are connected to the input/output interface 1105: an input section 1106 (including a keyboard, a mouse, and the like), an output section 1107 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), the storage section 1108 (including a hard disk and the like), and a communication section 1109 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 can also be connected to the input/output interface 1105 as needed. A removable medium 1111 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1110 as needed, so that a computer program read from it is installed into the storage section 1108 as needed.

When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1111.

Those skilled in the art will understand that this storage medium is not limited to the removable medium 1111 shown in Fig. 11, in which the program is stored and which is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1111 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium can be the ROM 1102, a hard disk included in the storage section 1108, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.

The present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to the embodiments of the present invention can be performed.

Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.

Those of ordinary skill in the art should understand that what is illustrated here is exemplary, and that the present invention is not limited thereto.

In this specification, expressions such as "first", "second", and "N-th" are used to distinguish the described features verbally so as to describe the present invention clearly. They should therefore not be regarded as having any limiting meaning.

As an example, each step of the above method and each module and/or unit of the above device can be implemented as software, firmware, hardware, or a combination thereof, and can form part of the corresponding device. The specific means or manner that can be used when the modules and units of the above device are configured by software, firmware, hardware, or a combination thereof are well known to those skilled in the art and are not repeated here.

As an example, in the case of implementation by software or firmware, a program constituting the software can be installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 1100 shown in Fig. 11). With the various programs installed, this computer can perform various functions.

In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment can be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.

It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.

In addition, the method of the present invention is not limited to being performed in the chronological order described in the specification; it can also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the method described in this specification does not limit the technical scope of the present invention. With regard to the embodiments above, the following remarks are also disclosed:

Remark 1. A method for determining the correlation between each microblog of a plurality of microblogs and a given entity, comprising:

extracting features of each microblog of the plurality of microblogs;

determining the similarity between the microblogs according to the extracted features; and

determining, using the determined similarity between the microblogs, the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier.

Remark 2. The method according to Remark 1, wherein the semi-supervised classifier is a classifier based on label propagation.

Remark 3. The method according to Remark 2, wherein the step of determining the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier comprises:

building a microblog node graph by treating each microblog of the plurality of microblogs as a node, creating an edge between two microblogs that share a common feature, and representing the weight of the edge by the similarity between the two microblogs that share the common feature;

selecting a subset of the nodes as seeds; and

determining the correlation between each microblog of the plurality of microblogs and the given entity using a label propagation algorithm.

Remark 4. The method according to Remark 3, further comprising, before the step of determining the similarity between the microblogs according to the extracted features:

extracting features associated with the given entity;

making a preliminary determination of the correlation between each microblog of the plurality of microblogs and the given entity using a trained supervised classifier; and

judging, according to the preliminary determination result, whether it is necessary to determine the correlation between each microblog of the plurality of microblogs and the given entity based on the semi-supervised classifier.

Remark 5. The method according to Remark 4, wherein the step of judging, according to the preliminary determination result, whether it is necessary to determine the correlation between each microblog of the plurality of microblogs and the given entity based on the semi-supervised classifier comprises:

comparing the number of microblogs determined to be unrelated to the given entity against a corresponding threshold; and

if the number of microblogs determined to be unrelated to the given entity is less than the threshold, determining the correlation between each microblog of the plurality of microblogs and the given entity based on the semi-supervised classifier.

Remark 6. The method according to Remark 4, wherein the step of selecting a subset of the nodes as seeds comprises:

selecting a subset of the nodes as seeds according to the preliminary determination result.

Remark 7. The method according to Remark 6, wherein the step of selecting a subset of the nodes as seeds according to the preliminary determination result comprises:

determining the confidence of the correlation between each microblog and the given entity using the trained supervised classifier; and

selecting microblogs with high confidence as seeds, separately from the microblogs related to the given entity and from the microblogs unrelated to the given entity.

Remark 8. The method according to Remark 4, wherein the supervised classifier is a maximum entropy classifier or a naive Bayes classifier.

Remark 9. The method according to Remark 4, wherein the step of extracting features associated with the given entity comprises:

extracting words associated with the given entity as features from at least one of the following pages: the entity homepage associated with the given entity, a web page having the nature of an online encyclopedia, and a web page that helps users obtain related keywords from a few keywords.

Remark 10. The method according to Remark 9, further comprising:

determining capitalized words and/or uniform resource locators (URLs) in the entity homepage and/or in the web page having the nature of an online encyclopedia to be important features; and

determining the metadata in the entity homepage to be important features.

Remark 11. The method according to Remark 9, wherein:

if the name of the given entity is ambiguous, candidate related pages are queried from the web page having the nature of an online encyclopedia;

the candidate related pages are analyzed to determine whether they contain URL information of the entity homepage of the given entity; and

if a candidate related page contains URL information of the entity homepage of the given entity, that candidate related page is considered to be associated with the given entity.

Remark 12. The method according to any one of Remarks 1 to 11, further comprising, after the step of determining the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier:

if a microblog contains two or more words of the name of the given entity, determining that microblog to be related to the given entity.

Remark 13. A device for determining the correlation between each microblog of a plurality of microblogs and a given entity, comprising:

a microblog feature extraction unit configured to extract features of each microblog of the plurality of microblogs;

a similarity determination unit configured to determine the similarity between the microblogs according to the extracted features; and

a correlation determination unit configured to determine, using the determined similarity between the microblogs, the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier.

Remark 14. The device according to Remark 13, wherein the semi-supervised classifier is a classifier based on label propagation.

Remark 15. The device according to Remark 14, wherein the correlation determination unit comprises:

a node graph construction module configured to build a microblog node graph by treating each microblog of the plurality of microblogs as a node, creating an edge between two microblogs that share a common feature, and representing the weight of the edge by the similarity between the two microblogs that share the common feature;

a seed selection module configured to select a subset of the nodes as seeds; and

a first correlation determination module configured to determine the correlation between each microblog of the plurality of microblogs and the given entity using a label propagation algorithm.

Remark 16. The device according to Remark 15, further comprising:

an entity feature extraction unit configured to extract features associated with the given entity;

a preliminary correlation determination unit configured to make a preliminary determination of the correlation between each microblog of the plurality of microblogs and the given entity using a trained supervised classifier; and

a necessity judgment unit configured to judge, according to the preliminary determination result, whether it is necessary to determine the correlation between each microblog of the plurality of microblogs and the given entity based on the semi-supervised classifier.

Remark 17. The device according to Remark 16, wherein the necessity judgment unit comprises:

a comparison module configured to compare the number of microblogs determined to be unrelated to the given entity against a corresponding threshold; and

a second correlation determination module configured to determine, if the number of microblogs determined to be unrelated to the given entity is less than the threshold, the correlation between each microblog of the plurality of microblogs and the given entity based on the semi-supervised classifier.

Remark 18. The device according to Remark 16, wherein the seed selection module is configured to select a subset of the nodes as seeds according to the preliminary determination result.

Remark 19. The device according to Remark 18, wherein the seed selection module comprises:

a confidence determination submodule configured to determine the confidence of the correlation between each microblog and the given entity using the trained supervised classifier; and

a seed selection submodule configured to select microblogs with high confidence as seeds, separately from the microblogs related to the given entity and from the microblogs unrelated to the given entity.

Remark 20. The device according to Remark 16, wherein the supervised classifier is a maximum entropy classifier or a naive Bayes classifier.

Remark 21. The device according to Remark 16, wherein the entity feature extraction unit is configured to extract words associated with the given entity as features from at least one of the following pages:

the entity homepage associated with the given entity,

a web page having the nature of an online encyclopedia, and

a web page that helps users obtain related keywords from a few keywords.

Remark 22. The device according to Remark 21, further comprising an important feature determination unit, the important feature determination unit being configured to:

determine the metadata in the entity homepage to be important features.

Remark 23. The device according to any one of Remarks 13 to 22, wherein the correlation determination unit is further configured to: if a microblog contains two or more words of the name of the given entity, determine that microblog to be related to the given entity.

Remark 24. A program product storing machine-readable instruction code which, when read and executed by a machine, can perform the method according to any one of Remarks 1 to 12 for determining the correlation between each microblog of a plurality of microblogs and a given entity.

Remark 25. A storage medium carrying the program product according to Remark 24.

Although the present invention has been disclosed above through the description of its specific embodiments, it should be understood that those skilled in the art can devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.

Claims

1. A method for determining the correlation between each microblog of a plurality of microblogs and a given entity, comprising:

extracting features of each microblog of the plurality of microblogs;

determining, using the determined similarity between the microblogs, the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier;

wherein the semi-supervised classifier is a classifier based on label propagation,

wherein the step of determining the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier comprises:

selecting a subset of the nodes as seeds; and

2. The method according to claim 1, further comprising, before the step of determining the similarity between the microblogs according to the extracted features:

extracting features associated with the given entity;

3. The method according to claim 2, wherein the step of judging, according to the preliminary determination result, whether it is necessary to determine the correlation between each microblog of the plurality of microblogs and the given entity based on the semi-supervised classifier comprises:

4. The method according to claim 2, wherein the step of selecting a subset of the nodes as seeds comprises:

5. The method according to claim 4, wherein the step of selecting a subset of the nodes as seeds according to the preliminary determination result comprises:

6. The method according to claim 2, wherein the supervised classifier is a maximum entropy classifier or a naive Bayes classifier.

7. The method according to claim 2, wherein the step of extracting features associated with the given entity comprises:

8. A device for determining the correlation between each microblog of a plurality of microblogs and a given entity, comprising:

a correlation determination unit configured to determine, using the determined similarity between the microblogs, the correlation between each microblog of the plurality of microblogs and the given entity based on a semi-supervised classifier,

wherein the correlation determination unit comprises: