CN102662986A - System and method for microblog message retrieval - Google Patents

Info

Publication number
CN102662986A
Authority
CN
China
Prior art keywords
word
user
retrieval
microblog
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100658040A
Other languages
Chinese (zh)
Inventor
程学旗
李静远
房伟伟
王元卓
李一为
方滨兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN2012100658040A
Publication of CN102662986A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a microblog message retrieval method. A retrieval system periodically obtains the latest microblog messages, computes a digest of each message, extracts the association relations between the words in the digest and saves them in a word relation database. After receiving a query word from a retrieval user, the retrieval system recommends to the retrieval user, in the form of a list, a plurality of words that have association relations with the query word. The retrieval user constructs a topic by establishing logical relations between the recommended words and the query word, and the retrieval system uses the constructed topic as the retrieval condition and returns the microblog messages that match it. Because retrieval is based on a topic consisting of a plurality of words, the retrieval effect and the user experience can be improved. Furthermore, by drawing on the interests and attributes of microblog users, the system provides a user-based way of searching for messages, so that the retrieval user is given, to the greatest extent possible, the messages on the topics they care about.

Description

Microblog message retrieval system and method
Technical field
The present invention relates to information retrieval, and in particular to the retrieval of microblog messages.
Background
Since Twitter, the first microblogging service, was launched in 2006, microblogging services have kept growing substantially. This growth shows mainly in the following two respects:
1) User growth: Twitter's user base began to surge in 2008 and reached roughly 300 million users worldwide by the end of 2011. The number of microblog users in China has also grown remarkably: within two years of its launch, Sina Weibo went from zero to more than 200 million users. Tencent Weibo, backed by its huge QQ user base, has even overtaken Sina Weibo in number of microblog users.
2) Growth in influence: as the user base has grown rapidly, microblogging has come to exert enormous influence worldwide, surpassing every other Internet service of the same period. Its open, media-like character has changed how the right to speak is distributed in the world, and its strong timeliness draws the attention of everyone from governments and companies to the general public. One illustration is that government departments and major companies opened official microblog accounts one after another in 2011.
A key service of microblogging is real-time message retrieval. With the expansion in user base and influence described above, the rate at which microblog messages are generated has risen sharply. By the end of 2011, the daily message volume of Sina Weibo had reached 200 million, and that of Twitter was also on the order of hundreds of millions. Data at this scale challenges both the response time and the accuracy of real-time retrieval.
The approach taken by mainstream microblogging services is to use hashtags to tell the retrieval system that a message belongs to a certain topic. For example, Twitter uses "#China" and Sina Weibo uses "#China#" to indicate that the current message belongs to the topic titled "China". However, this approach has several limitations:
First, a hashtag has to be written by the message publisher, on their own initiative and according to prescribed rules, and the format is not uniform across microblogging services. As a result, a message may fit a certain topic, yet the user does not know how to add a hashtag, or does not know which hashtag denotes the topic and adds a wrong or little-used one, so the retrieval system may be unable to retrieve and return the message in real time.
Second, the hashtag approach limits a topic to a single word and cannot guarantee that a searching user obtains all messages related to a topic or event. For example, searching with the hashtag "Capital Airport" cannot retrieve all the information about the dense fog in Beijing on the same day, although that information is very likely what the searcher wants.
Finally, besides microblog messages (a kind of short text), a microblogging service also holds information about the users themselves, such as user type, attributes and preferences. The hashtag approach cannot apply this user attribute information to the retrieval service.
Summary of the invention
Therefore, the object of the present invention is to overcome the above defects of the prior art by providing a microblog message retrieval system that takes into account both the association relations between multiple words and user attributes, thereby improving retrieval effectiveness and user experience.
The object of the present invention is achieved through the following technical solutions:
In one aspect, the invention provides a microblog message retrieval system, comprising:
a microblog storage module, used to store newly published microblog messages and microblog user information;
a word association module, used to periodically obtain newly published microblog messages from the microblog storage module, and to extract and store the association relations between the words in each message;
a retrieval management module, used to return to the retrieval user, as a recommended word list and according to the association relations between words, a plurality of words associated with the query word entered by the retrieval user, and used to search the microblog storage module according to a topic constructed by the retrieval user; the topic is constructed by the retrieval user by establishing logical relations between the recommended words and the query word.
In the above technical solution, the word association module may periodically extract newly published microblog messages from the microblog storage module, compute a digest of each message, and extract and store the association relations between the words in that digest.
In the above technical solution, the retrieval management module may provide to the retrieval user, as the recommended word list, the n words with the highest edge weights to the query word; the edge weight between two words is the number of times the association relation between those two words has occurred.
In the above technical solution, the logical relations may include "logical AND" and/or "logical OR" and/or "logical NOT". The retrieval user may select zero or more words from the recommended word list and establish a "logical OR" or "logical AND" relation between this group of words and the query word to form a topic. Alternatively, the retrieval user may select some of the words from the recommended word list and divide them into groups, with a "logical OR" relation between words in the same group and a "logical AND" and/or "logical NOT" relation between groups, thereby forming a topic.
In the above technical solution, the word association module may treat microblog messages as short texts, build a word-segmentation dictionary dedicated to short texts from short texts accumulated over a long period, and form the digest of a short text by filtering it through this dictionary.
In the above technical solution, the word association module may treat microblog messages as short texts, group a batch of short texts that are close in publication time and geographic location using text clustering, and assign the same digest to every message in a group.
In the above technical solution, the word association module may save the digest computed for each microblog message into the microblog storage module as an attribute of that message.
In the above technical solution, the system may further comprise a microblog user attribute computing module, used to obtain the digests of the m microblog messages most recently published by a microblog user and to select the k words that occur most frequently in these digests as the personal attribute tags of that microblog user.
In the above technical solution, the microblog user attribute computing module may also be used to update the personal attribute tags of microblog users periodically.
In the above technical solution, the retrieval management module may also be used to search the personal attribute tags of microblog users with the constructed topic as the retrieval condition, and to recommend to the retrieval user the microblog users that meet the retrieval condition and/or the messages they have published.
In another aspect, the invention provides a microblog message retrieval method, comprising:
Step 1) the retrieval management module receives the query word entered by the retrieval user;
Step 2) the retrieval management module returns to the retrieval user, as the recommended word list, the n words with the highest edge weights to the query word;
Step 3) based on the recommended word list, the retrieval user constructs a topic by establishing logical relations between the recommended words and the query word;
Step 4) the retrieval management module searches the microblog storage module with the constructed topic as the retrieval condition and returns the microblog messages that meet this condition to the retrieval user.
In the above technical solution, in step 3) the retrieval user may select zero or more words from the recommended word list and establish a "logical OR" or "logical AND" relation between this group of words and the query word, thereby forming a topic.
In the above technical solution, in step 3) the retrieval user may select some of the words from the recommended word list and divide them into groups, with a "logical OR" relation between words in the same group and a "logical AND" and/or "logical NOT" relation between groups, thereby forming a topic.
In the above technical solution, the method may further comprise step 5): the retrieval management module searches the personal attribute tags of microblog users with the constructed topic as the retrieval condition and returns to the retrieval user the microblog users that meet this condition and/or the messages they have published.
In the above technical solution, before step 2) the method may further comprise a step in which the retrieval management module returns to the retrieval user the microblog messages that contain the query word.
Compared with the prior art, the advantages of the invention are as follows:
Existing microblog retrieval based on a single keyword is extended to retrieval based on a topic made up of a plurality of words, which can improve retrieval effectiveness and user experience. In addition, by using the interests and attributes of microblog users, a people-based way of querying information is provided, so that the retrieval user can be given, to the greatest extent possible, the messages on the topics they care about.
Description of the drawings
The embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic structural diagram of the microblog message retrieval system according to an embodiment of the invention;
Fig. 2 is a flowchart of the microblog message retrieval method according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below through specific embodiments and with reference to the accompanying drawings. It should be understood that the specific embodiments described here only illustrate the invention and do not limit it.
Fig. 1 is a schematic structural diagram of a microblog message retrieval system according to an embodiment of the invention. The system comprises a retrieval management module 100, a word association module 200, a microblog user attribute module 300 and a microblog storage module 400. The retrieval management module 100 provides the interface for interacting with the retrieval user and provides the retrieval service to the retrieval user. The word association module 200 periodically obtains newly published microblog messages from the microblog storage module, and extracts and stores the association relations between the words in each message (an association relation between words may also be called a word association or a word relation). The microblog user attribute module 300 extracts the messages of a microblog user, determines the personal attribute tags of that user from the digests of the messages, and saves the tags in the microblog storage module. The microblog storage module 400 stores recently published microblog messages and information about recently active microblog users. In this application, "retrieval user" refers to a user of the microblog message retrieval system provided by this application, while "microblog user" refers to a user of a microblogging service, for example a user of Twitter or Sina Weibo.
More specifically, with reference to Fig. 1, the microblog storage module 400 comprises a microblog message storage module 401 and a microblog user storage module 402. The microblog message storage module 401 caches newly published microblog messages. The microblog user storage module 402 caches the personal information of active or important users. In one embodiment, Redis may be used as the storage tool for the cached data. Redis is a log-type key-value database that can run in memory, and its read, write and query efficiency is higher than that of databases based on persistent storage. One storage collection is created in the Redis database to hold recently published microblog messages (for example, messages published within the last 5 days). The values of the records in this collection do not need to have a strictly consistent format, as they would in a relational database. For example, within the same collection the value for key1 may be a picture while the value for key2 may be a passage of text. In addition, another storage collection is created in the Redis database to hold the personal information of microblog users who meet certain conditions, for example users with more than 1,000 historical messages, users who publish more than 5 messages per day, or users who have published a message within the last 5 days. In other embodiments, other storage means known to those skilled in the art may also be used. For example, from a cost point of view, a storage tool based on persistent storage, which is cheaper than in-memory caching, may be used, such as MongoDB. Alternatively, for convenience of data processing, a relational database with better SQL support may be used, such as Oracle or MySQL.
The word association module 200 comprises a latest-message extraction management module 201, a digest computation module 202 and a word relation storage module 203. The latest-message extraction management module 201 periodically reads newly published microblog messages from the microblog storage module 400 and passes them to the digest computation module 202.
The digest computation module 202 computes the digest of a microblog message from its content. A microblog message can be regarded as a short text, and its digest can be computed by extracting the core words of the message with an existing text summarization method. For example, in one embodiment, the digest is computed by building a word-segmentation dictionary dedicated to short texts from short texts accumulated over a long period, and filtering the short text through this dictionary to form its digest. In another embodiment, the digest is computed by grouping a batch of short texts that are close in publication time and geographic location using text clustering, and assigning the same digest to every message in a group. The digest is thus in effect a set of words: it contains the main words of the short text, with meaningless function words such as modal particles removed. The digest can therefore serve as the feature of the message. After computing the digest, the digest computation module 202 saves it in the microblog storage module 400 as an attribute of the message, for example in a specific field of that microblog message in the microblog message storage module 401.
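The dictionary-filtering variant of the digest computation can be sketched as follows. This is only a minimal illustration under stated assumptions: the message is assumed to be already segmented into tokens, and the function name `compute_digest`, the sample dictionary and the function-word list are hypothetical and do not come from the patent.

```python
def compute_digest(tokens, dictionary, function_words):
    """Keep tokens found in the short-text dictionary, drop function words,
    and remove repeats while preserving the original order."""
    seen = set()
    digest = []
    for token in tokens:
        if token in dictionary and token not in function_words and token not in seen:
            seen.add(token)
            digest.append(token)
    return digest


# Hypothetical, pre-segmented message and dictionary (assumptions for illustration).
tokens = ["I", "am", "stuck", "at", "airport", "Beijing", "fog", "oh", "fog"]
dictionary = {"airport", "Beijing", "fog", "oh"}
function_words = {"oh"}

digest = compute_digest(tokens, dictionary, function_words)
print(digest)
```

The result is a small set of content words that can be stored with the message as its digest attribute.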
The word relation storage module 203 stores the association relations between the words of a digest (word relations for short). An association relation between words is the relation that exists between two words that appear together in the same microblog message or the same digest. For example, take the microblog message {I am stuck at the Capital Airport; with this dense fog in Beijing I probably cannot leave tonight.}. Digest computation gives the digest {airport, Beijing, dense fog}, which forms three word relations: {Beijing-airport}, {Beijing-dense fog} and {airport-dense fog}. A database, which may be called the word relation database, can be used to store the word relations. The word relation database also records the edge weight between two words, which is the number of times the association relation between those two words has occurred in the word relation database. For example, if the word relation {Beijing-airport} already exists in the word relation database with a weight (that is, the edge weight between the two words) of 230, meaning that this relation has occurred 230 times in the database, then the weight is increased by 1 after this record is added. Compared with existing hashtag-based search, using the association relations between words widens the search scope and lets the retrieval user obtain relevant information beyond the search term itself.
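As a rough sketch of the bookkeeping described above, the following uses an in-memory `Counter` in place of the word relation database. The helper names `pair_key` and `update_word_relations` are hypothetical, and storing each unordered pair under a sorted key is an assumption made here so that {Beijing-airport} and {airport-Beijing} share one record.

```python
from collections import Counter
from itertools import combinations


def pair_key(a, b):
    """Canonical key for an unordered word pair (assumed storage convention)."""
    return tuple(sorted((a, b)))


def update_word_relations(weights, digest):
    """Add 1 to the edge weight of every pair of distinct words in one digest."""
    for a, b in combinations(sorted(set(digest)), 2):
        weights[pair_key(a, b)] += 1
    return weights


# The relation {Beijing-airport} has already been seen 230 times.
weights = Counter({pair_key("Beijing", "airport"): 230})
update_word_relations(weights, ["airport", "Beijing", "dense fog"])

print(weights[pair_key("airport", "Beijing")])
print(weights[pair_key("dense fog", "airport")])
```

A relational table with columns for the two words and the weight, as the embodiment with MySQL suggests, would hold the same information.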
For example, the latest-message extraction management module 201 computes the current system timestamp t2 and obtains from the microblog storage module 400 the largest entry timestamp t1 among the messages whose digests have already been computed. It then extracts, in batches and according to any suitable rule, all messages whose entry time lies in the open interval (t1, t2), that is, the new messages cached in the most recent period whose digests have not yet been computed. The digest computation module 202 computes the digest of each message and writes the result to the microblog storage module 400. At the same time, the word relation storage module updates the records in the word relation database that relate to the current digest: the edge weight of any two words that appear in the same digest is increased by 1. In one embodiment, the word relation storage module 203 may use the relational database MySQL to store the word relations. In other embodiments, other relational databases (for example, Oracle or SQL Server) may also be used to store the word relations.
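The selection of not-yet-processed messages can be illustrated with a small sketch. It assumes messages are held as dictionaries with hypothetical `entry_time` and `digest` fields; the real system would query the microblog storage module instead of a Python list.

```python
def select_unprocessed(messages, now):
    """Return the messages whose entry time lies in the open interval (t1, now),
    where t1 is the largest entry time among messages that already have a digest."""
    t1 = max(
        (m["entry_time"] for m in messages if m["digest"] is not None),
        default=float("-inf"),
    )
    return [m for m in messages if t1 < m["entry_time"] < now]


# Hypothetical cached messages (field names are assumptions for illustration).
messages = [
    {"id": 1, "entry_time": 100, "digest": ["airport", "dense fog"]},
    {"id": 2, "entry_time": 105, "digest": None},
    {"id": 3, "entry_time": 110, "digest": None},
]

batch = select_unprocessed(messages, now=110)
print([m["id"] for m in batch])
```

Because the interval is open, a message stamped exactly at the current time is left for the next run.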
The microblog user attribute module 300 comprises a user message extraction management module 301 and a user attribute computation module 302. The user message extraction management module 301 periodically extracts from the microblog storage module 400 the messages published by a microblog user and their digests. The user attribute computation module 302 computes the personal attribute tags of the user (attribute tags or user tags for short) from the set of digests of the messages the user has published. Personal attribute tags are a group of words describing the interests, points of focus and so on of a microblog user. For example, by analysing the digests of the messages a user has published, the several words that occur most frequently in those digests are chosen as the tags of this user; {actor, cooking, film, love} could be the personal attribute tags of a certain microblog user.
As a further example, the user message extraction management module 301 takes from the microblog storage module 400 the user whose last refresh time is the earliest and is more than 5 days before the current system time, together with that user's 200 most recent messages. The 20 words with the highest frequency in the digests of these 200 messages are taken as the personal attribute information of this user (that is, the attribute tags of this user). The personal attribute information of a user should not include function words such as modal particles, and should consist mainly of content words such as persons, places, times and actions.
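The tag computation amounts to a word-frequency count over a user's recent digests. A minimal sketch, assuming the digests are lists of words and using a hypothetical `attribute_tags` function, might look like this:

```python
from collections import Counter


def attribute_tags(digests, k, function_words=frozenset()):
    """Return the k most frequent words across a user's digests, ignoring function words."""
    counts = Counter(
        word for digest in digests for word in digest if word not in function_words
    )
    return [word for word, _ in counts.most_common(k)]


# Hypothetical digests of one user's recent messages.
digests = [
    ["actor", "film"],
    ["film", "cooking"],
    ["film", "actor", "oh"],
]

tags = attribute_tags(digests, k=2, function_words={"oh"})
print(tags)
```

In the embodiment above the inputs would be the digests of the 200 most recent messages and k would be 20.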
The retrieval management module 100 comprises a retrieval user interface module 101 and a retrieval session management module 102. The retrieval user interface module 101 receives the input of the retrieval user, returns the query results, and provides the retrieval user with an interface for constructing a topic from the recommended word list. The retrieval session management module 102 searches the word relation storage module 203 with the query word entered by the retrieval user to obtain the recommended word list, searches the microblog storage module 400 with the topic constructed by the retrieval user, and passes the results to the retrieval user interface module 101 to be returned.
The recommended word list is a plurality of recommended words (for example, 20) related to the query word, which the retrieval session management module 102 obtains by querying the word relation storage module 203 and recommends to the retrieval user. In one embodiment, the 20 words most related to the word entered by the retrieval user, that is, the 20 words with the highest edge weights to the query word in the word relation database, are returned as the recommended word list. In another embodiment, the 15 words most related to the query word are returned as the recommended word list; in addition, the 50 most recent messages are chosen from the returned microblog messages that contain the query word, and the 5 words that occur most frequently in the digests of these messages are added as a supplement to the recommended word list.
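Selecting the recommended words is a top-n lookup over the edge weights. The sketch below assumes the word relations are available as a dictionary keyed by word pairs; the function name `recommend_words`, the sample weights and the alphabetical tie-break are illustrative assumptions rather than part of the patent.

```python
def recommend_words(weights, query, n):
    """Return the n words with the highest edge weight to the query word."""
    neighbours = {}
    for (a, b), weight in weights.items():
        if a == query:
            neighbours[b] = weight
        elif b == query:
            neighbours[a] = weight
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    ranked = sorted(neighbours, key=lambda word: (-neighbours[word], word))
    return ranked[:n]


# Hypothetical edge weights.
weights = {
    ("Beijing", "airport"): 231,
    ("airport", "dense fog"): 120,
    ("airport", "delay"): 95,
    ("Beijing", "dense fog"): 300,
}

print(recommend_words(weights, "airport", n=2))
```

With a relational store, the same result could be obtained by a query ordered by weight and limited to n rows.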
Through the interface provided by the retrieval user interface module 101, the retrieval user selects zero or more recommended words and constructs a topic by setting the logical relations (for example logical AND, logical OR, logical NOT) between the selected words. In one embodiment, zero or more words are selected from the recommended word list, and the relation between this group of words and the original query word is logical OR (or) or logical AND (and), forming a topic made up of a plurality of words. In another embodiment, some of the words are selected from the recommended word list and divided into groups; the relation between words in the same group is logical OR and the relation between groups is logical AND, thereby forming a topic made up of a plurality of words. In yet another embodiment, some of the words are selected from the recommended word list and divided into groups; the relation between words in the same group is logical OR (or), while the relation between groups may be logical AND (and) and/or logical NOT (not), thereby forming a topic made up of a plurality of words, for example (k1 or k2 or k3) and (k4 or k5) not (k6 or k7), where each ki is a recommended word. This forms a topic containing logical AND, OR and NOT relations. It should be noted that the AND, OR and NOT relations can be changed freely and can be customized by the user through the retrieval user interface module.
For example, the retrieval user enters "airport". The retrieval session management module 102 searches the word relation storage module 203 for the word relations of "airport" and returns the result to the retrieval user as the recommended word list, for example {Capital, Beijing, Nanyuan, Hongqiao, Pudong, New Baiyun, dense fog, visibility, heavy rain, thunderstorm, delay, cancellation, on-time rate}. Then, through the retrieval user interface module 101, the retrieval user selects some of the words from the recommended word list to construct a word-centred topic of the form {(k1 or k2 or k3) and (k4 or k5) not (k6 or k7), where each ki is a recommended word}, such as {(Beijing or Capital or Nanyuan) and (dense fog or visibility) and (delay or cancellation) not (heavy rain or thunderstorm)}. The retrieval session management module 102 searches the microblog storage module with the topic constructed by the retrieval user as the retrieval condition and returns all messages that satisfy it.
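A topic of this form can be represented as a list of OR-groups that must all be satisfied plus a list of OR-groups that must not be. The following sketch checks a message's words against such a topic; the function name `matches_topic`, the representation, and the choice to match against a plain set of words are assumptions made for illustration.

```python
def matches_topic(words, all_of, none_of=()):
    """True if the words hit every group in all_of (OR within a group, AND between
    groups) and hit no group in none_of (logical NOT)."""
    words = set(words)
    required_ok = all(words & set(group) for group in all_of)
    excluded_hit = any(words & set(group) for group in none_of)
    return required_ok and not excluded_hit


# (Beijing or Capital or Nanyuan) and (dense fog or visibility)
# and (delay or cancellation) not (heavy rain or thunderstorm)
all_of = [
    {"Beijing", "Capital", "Nanyuan"},
    {"dense fog", "visibility"},
    {"delay", "cancellation"},
]
none_of = [{"heavy rain", "thunderstorm"}]

print(matches_topic({"Capital", "airport", "dense fog", "delay"}, all_of, none_of))
print(matches_topic({"Capital", "dense fog", "delay", "thunderstorm"}, all_of, none_of))
```

In a deployed system the same condition would more likely be translated into a query against the storage back end than evaluated message by message.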
As can be seen, the embodiments of the invention use a mechanism in which the retrieval user constructs a topic and retrieval is performed on the basis of the constructed topic. Fully automatic topic-based retrieval is found only in retrieval systems for long texts such as news and blogs: because each document is long, common document summarization methods or high-dimensional feature vectors can describe the similarity between two documents fairly accurately. For short texts such as microblogs or SMS messages, the inventors found through testing in real systems that these methods are not applicable. The inventors therefore adopted a word-centred topic mechanism with user intervention, and practice has shown that it ensures the accuracy of the returned messages.
Retrieval with a topic mechanism in which the retrieval user intervenes has two benefits. First, compared with existing hashtag-based search, using the association relations between words widens the search scope and lets the retrieval user obtain relevant information beyond the search term itself. Second, because microblog messages are short, fully automatic topic recommendation often strays from the subject and fails to meet the needs of the retrieval user; the mechanism overcomes this shortcoming by giving the retrieval user a better way to form topics through manual intervention.
In yet another embodiment, the retrieval session management module 102 may also search the personal attribute tags of microblog users with the constructed topic as the retrieval condition and return the messages published by the microblog users that satisfy it. The attributes of microblog users are thereby taken into account as well. In the Capital Airport example above, by searching the attribute tags of microblog users, the microblog users closely related to the Capital Airport (for example, users who often publish microblogs about the Capital Airport) and/or the messages they publish can be recommended to the retrieval user. The messages published by these users may not match the topic rule above, but they can supplement this retrieval as useful peripheral information, so that the retrieval user is provided, to the greatest extent possible, with the messages related to the topic they care about. In addition, a retrieval method based on user information is provided, which can offer the retrieval user the microblog users closely related to the topic they care about, so that the retrieval user can follow these microblog users. For example, the constructed topic is used to search the attribute tags of microblog users, and the microblog users who fully satisfy the topic are recommended. Alternatively, the microblog users are recommended whose attribute tags contain more than m words that are identical to words in the constructed topic (excluding words in a NOT relation).
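The overlap-based variant of user recommendation can be sketched as a simple set intersection. The function name `recommend_users`, the sample tags and the user identifiers below are hypothetical; only the rule "more than m shared words, excluding words in a NOT relation" comes from the text above.

```python
def recommend_users(user_tags, topic_words, negated_words, m):
    """Return users whose tags share more than m words with the topic,
    not counting words that appear in a NOT relation."""
    positive = set(topic_words) - set(negated_words)
    return [
        user for user, tags in user_tags.items()
        if len(positive & set(tags)) > m
    ]


# Hypothetical attribute tags per microblog user.
user_tags = {
    "user_a": ["Capital", "airport", "delay", "dense fog"],
    "user_b": ["film", "actor", "delay"],
    "user_c": ["Capital", "thunderstorm"],
}
topic_words = ["Beijing", "Capital", "Nanyuan", "dense fog", "visibility",
               "delay", "cancellation", "heavy rain", "thunderstorm"]
negated_words = ["heavy rain", "thunderstorm"]

print(recommend_users(user_tags, topic_words, negated_words, m=1))
```

Here only the first user shares more than one non-negated topic word with the topic, so only that user is recommended.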
In yet another embodiment of the present invention, a kind of search method based on said system also is provided.This method may further comprise the steps:
Step 1) receives by the query word of retrieval user through 101 inputs of retrieval user interface module.
Step 2), select a plurality of recommendation speech relevant (for example, 20) by retrieval session management module 102 from the speech relation data, and recommend said retrieval user with this query word.In one embodiment, can get maximally related 20 speech of importing with retrieval user of speech and return, just in the speech relation data, connect 20 the highest speech of limit weights as recommending the speech tabulation with query word as recommending speech to tabulate.Also can get among another embodiment with maximally related 15 speech of this query word and return as recommending speech to tabulate; Simultaneously, from the Twitter message that returns that comprises this query word, choose nearest 50 message, from the summary of these message, select 5 the highest speech of the frequency of occurrences replenishing as aforementioned recommendation speech tabulation.
Step 3) is selected 0 or a plurality of recommendation speech by retrieval user, makes up topic through the logical relation (for example logical and, logical OR, logic NOT) that is provided with between the selected speech.In one embodiment, can from said recommendation speech tabulation, select 0 or a plurality of speech, be the relation of logical OR (or) and/or logical and (and) between this group speech and the former query word, form a topic of forming by a plurality of speech.Another embodiment is: from said recommendation speech tabulation, select part speech and grouping, be the relation of logical OR on the same group between the speech, and be the relation of logical and between group and the group, thereby form a topic of being made up of a plurality of speech.In yet another embodiment, from said recommendation speech tabulation, selecting part speech and grouping, is the relation of logical OR (or) on the same group between the speech; And can be the relation that concerns group and/or logic NOT (not) of logical and (and) between group and the group; Thereby form the topic of forming by a plurality of speech, for example, (k 1Or k 2Or k 3) and (k 4Or k 5) not (k 6Or k 7), k iFor recommending speech, comprise the topic that the logic AOI concerns thereby form.Should point out that AOI relation can randomly changing, can carry out self-defined through the retrieval user interface module by the user.
Step 4), the retrieval session management module 102 uses the topic built by the retrieval user in step 3) as the search condition and returns the microblog messages that meet this search condition to the retrieval user.
In yet another embodiment, the method may further comprise step 5): based on the constructed topic and the personal attribute tags of the users in the microblog user storage module 402, the retrieval session management module 102 recommends to the retrieval user a plurality of (for example, 30) microblog users most relevant to the topic, and/or provides the retrieval user with the messages posted by those microblog users. For example, the rules of the constructed topic are used to search the personal attribute tags of microblog users, and the microblog users that fully satisfy the rules, and/or the messages they post, are recommended. As another example, microblog users whose attribute tags share more than m words with the constructed topic (not counting words under a NOT relation), and/or the messages they post, are recommended to the retrieval user.
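The second example in step 5), matching users by tag overlap, might look like the following sketch. The function name `recommend_users`, the dictionary layout of `user_tags`, and the ranking by overlap size are assumptions for illustration; the source only states the "more than m shared words" condition and the example limit of 30 users.

```python
def recommend_users(topic_words, user_tags, m=1, limit=30):
    """Return up to `limit` users whose tags share more than m words with the topic.

    topic_words: the topic's words, excluding those under a NOT relation.
    user_tags: dict mapping user name -> list of personal attribute tags (assumed).
    """
    topic_words = set(topic_words)
    scored = []
    for user, tags in user_tags.items():
        overlap = len(topic_words & set(tags))
        if overlap > m:  # strictly more than m shared words
            scored.append((overlap, user))
    # Largest overlap first; ties broken by user name (an assumed ordering).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [user for _, user in scored[:limit]]
```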
Fig. 2 shows a flow chart of a microblog retrieval method according to another embodiment of the present invention. This method differs from the method above in that, after receiving the query word entered by the retrieval user, the retrieval management module can use an existing retrieval method to return the microblog messages containing the query word to the retrieval user together with the recommended word list. If the retrieval user wishes to query further, the user can refine the search on the basis of these microblog messages by building a topic. This gives the retrieval user considerable flexibility and also improves the user experience.
More specifically, this method comprises:
Step S101), the retrieval management module receives the query word entered by the retrieval user;
Step S102), the retrieval management module retrieves and returns the microblog messages containing the query word;
Step S103), the retrieval management module selects a plurality of words associated with the query word and returns them to the retrieval user as the recommended word list;
Step S104), based on the recommended word list, the retrieval user builds a topic by establishing logical relations between the recommended words and the query word;
Step S105), the retrieval management module obtains the microblog messages that satisfy the topic and returns them incrementally to the retrieval user;
Step S106), based on the constructed topic and the personal attribute tags of microblog users, the retrieval management module recommends to the retrieval user the microblog users that satisfy the topic and/or the messages they post.
In step S103), the top r words with the highest occurrence counts may also be selected from the summaries of the returned microblog messages containing the query word, as a supplement to the recommended word list.
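The source does not say how step S105) returns messages "incrementally". One possible reading is that matching messages are delivered in batches as they are found, which can be sketched with a generator. The name `return_incrementally`, the `matches` predicate, and the batch size are assumptions made for this example.

```python
from itertools import islice

def return_incrementally(messages, matches, batch_size=20):
    """Yield lists of at most batch_size messages for which matches(msg) is true.

    Messages are scanned lazily, so the first batch can be shown to the user
    before the whole message store has been examined.
    """
    matching = (msg for msg in messages if matches(msg))
    while True:
        batch = list(islice(matching, batch_size))
        if not batch:
            return
        yield batch
```

A caller would iterate over the generator and display each batch as it arrives, passing a predicate that tests a message against the constructed topic.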
Although the present invention has been described by way of preferred embodiments, it is not limited to the embodiments described here and also covers various changes and variations made without departing from the scope of the present invention.

Claims (17)

1. A microblog message retrieval system, the system comprising:
a microblog storage module, configured to store the most recently posted microblog messages and microblog user information;
a word association relationship module, configured to periodically obtain the most recently posted microblog messages from the microblog storage module, and to extract and store the associations between the words in each message;
a retrieval management module, configured to return, according to the associations between words, a plurality of words associated with the query word entered by the retrieval user to the retrieval user as a recommended word list, and to search the microblog storage module according to a topic built by the retrieval user; the topic being built by the retrieval user by establishing logical relations between the recommended words and the query word.
2. The system according to claim 1, wherein the word association relationship module periodically extracts the most recently posted microblog messages from the microblog storage module, calculates a summary of each message, and extracts and stores the associations between the words in the summary.
3. The system according to claim 1 or 2, wherein the retrieval management module provides the top n words whose connecting edges to the query word have the highest weights to the retrieval user as the recommended word list, the edge weight between two words being the number of times the association between these two words occurs.
4. The system according to claim 1 or 2, wherein the logical relations comprise "logical AND" and/or "logical OR" and/or "logical NOT".
5. The system according to claim 4, wherein the retrieval user selects zero or more words from the recommended word list and establishes a "logical OR" or "logical AND" relation between this group of words and the query word to form a topic.
6. The system according to claim 4, wherein the retrieval user selects some words from the recommended word list and groups them, the words within a group being related by "logical OR" and the groups being related by "logical AND" and/or "logical NOT", thereby forming a topic.
7. The system according to claim 2, wherein the word association relationship module treats a microblog message as a short text, builds a word segmentation dictionary dedicated to short texts from the long-term accumulation of short texts, and forms the summary of the short text by filtering it through the word segmentation dictionary.
8. The system according to claim 2, wherein the word association relationship module treats microblog messages as short texts, groups a batch of short texts that are close in posting time and close in geographic location using a text clustering method, and assigns the same summary to every message in a group.
9. The system according to claim 2, wherein the word association relationship module stores the summary calculated for each microblog message in the microblog storage module as an attribute of that message.
10. The system according to claim 2, 7, 8 or 9, further comprising a microblog user attribute computing module, configured to obtain the summaries of the m microblog messages most recently posted by a microblog user and to select the top k words with the highest frequency of occurrence in these summaries as the personal attribute tags of that microblog user.
11. The system according to claim 10, wherein the microblog user attribute computing module is further configured to periodically update the personal attribute tags of microblog users.
12. The system according to claim 10, wherein the retrieval management module is further configured to use the constructed topic as a search condition to search the personal attribute tags of microblog users, and to recommend to the retrieval user the microblog users that meet the search condition and/or the messages they post.
13. A microblog retrieval method based on the system according to any one of the preceding claims, the method comprising:
Step 1), the retrieval management module receives the query word entered by the retrieval user;
Step 2), the retrieval management module returns the top n words whose connecting edges to the query word have the highest weights to the retrieval user as the recommended word list;
Step 3), based on the recommended word list, the retrieval user builds a topic by establishing logical relations between the recommended words and the query word;
Step 4), the retrieval management module uses the constructed topic as the search condition to search the microblog storage module, and returns the microblog messages that meet this search condition to the retrieval user.
14. The method according to claim 13, wherein in step 3) the retrieval user selects zero or more words from the recommended word list and establishes a "logical OR" or "logical AND" relation between this group of words and the query word, thereby forming a topic.
15. The method according to claim 13, wherein in step 3) the retrieval user selects some words from the recommended word list and groups them, the words within a group being related by "logical OR" and the groups being related by "logical AND" and/or "logical NOT", thereby forming a topic.
16. The method according to claim 13, further comprising step 5), in which the retrieval management module uses the constructed topic as the search condition to search the personal attribute tags of microblog users, and returns the microblog users that meet this search condition and/or the messages they post to the retrieval user.
17. The method according to claim 13, further comprising, before step 2), a step in which the retrieval management module returns the microblog messages containing the query word to the retrieval user.
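The data that claims 3 and 10 rely on, namely edge weights counted from word associations and personal attribute tags taken from a user's recent summaries, could be derived along the following lines. This is only an illustrative sketch: treating co-occurrence within one summary as an "association", the function names `build_edge_weights` and `compute_user_tags`, and the tie-breaking order are assumptions not stated in the claims.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_edge_weights(summaries):
    """Count, for each pair of words, how many summaries contain both.

    summaries: iterable of word lists, one per message summary (assumed form).
    Returns a dict mapping word -> Counter of neighbour -> weight.
    """
    weights = defaultdict(Counter)
    for words in summaries:
        # Each unordered pair is counted once per summary.
        for a, b in combinations(sorted(set(words)), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights

def compute_user_tags(recent_summaries, k=10):
    """Pick the k most frequent words in a user's m most recent summaries."""
    counts = Counter(w for summary in recent_summaries for w in summary)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```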
CN2012100658040A 2012-01-13 2012-01-13 System and method for microblog message retrieval Pending CN102662986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100658040A CN102662986A (en) 2012-01-13 2012-01-13 System and method for microblog message retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100658040A CN102662986A (en) 2012-01-13 2012-01-13 System and method for microblog message retrieval

Publications (1)

Publication Number Publication Date
CN102662986A true CN102662986A (en) 2012-09-12

Family

ID=46772477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100658040A Pending CN102662986A (en) 2012-01-13 2012-01-13 System and method for microblog message retrieval

Country Status (1)

Country Link
CN (1) CN102662986A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150383A (en) * 2013-03-15 2013-06-12 中国科学院计算技术研究所 Event evolution analysis method of short text data
CN103902628A (en) * 2012-12-28 2014-07-02 腾讯科技(北京)有限公司 User relation information storing method and device
CN104065677A (en) * 2013-03-20 2014-09-24 腾讯科技(深圳)有限公司 Service data recommending method and device
CN104135529A (en) * 2014-08-05 2014-11-05 北京视像元素技术有限公司 Information discovery and sharing system based on entire space-time label web
WO2014180196A1 (en) * 2013-05-08 2014-11-13 华为技术有限公司 Information recommendation processing method and device
CN105138512A (en) * 2015-08-12 2015-12-09 小米科技有限责任公司 Phrase recommendation method and apparatus
CN106294405A (en) * 2015-05-22 2017-01-04 国家计算机网络与信息安全管理中心 A kind of microblogging topic evolution analysis method and device
CN106649585A (en) * 2016-11-18 2017-05-10 福建中金在线信息科技有限公司 Retrieval method and device
CN107423999A (en) * 2017-03-31 2017-12-01 优品财富管理股份有限公司 A kind of orientation based on user grouping issues advertising method and system
CN112749546A (en) * 2021-01-13 2021-05-04 叮当快药科技集团有限公司 Retrieval matching processing method and device for medical semantics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101253498A (en) * 2005-05-31 2008-08-27 谷歌公司 Learning facts from semi-structured text
CN101714144A (en) * 2008-10-07 2010-05-26 英业达股份有限公司 Associated characters and words query system and method thereof

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902628A (en) * 2012-12-28 2014-07-02 腾讯科技(北京)有限公司 User relation information storing method and device
CN103902628B (en) * 2012-12-28 2018-09-28 腾讯科技(北京)有限公司 A kind of storage method and device of customer relationship information
CN103150383A (en) * 2013-03-15 2013-06-12 中国科学院计算技术研究所 Event evolution analysis method of short text data
CN103150383B (en) * 2013-03-15 2015-07-29 中国科学院计算技术研究所 A kind of event evolution analysis method of short text data
CN104065677A (en) * 2013-03-20 2014-09-24 腾讯科技(深圳)有限公司 Service data recommending method and device
CN104065677B (en) * 2013-03-20 2018-05-25 腾讯科技(深圳)有限公司 A kind of business datum recommends method and apparatus
WO2014180196A1 (en) * 2013-05-08 2014-11-13 华为技术有限公司 Information recommendation processing method and device
CN104135529B (en) * 2014-08-05 2017-10-13 北京视像元素技术有限公司 INFORMATION DISCOVERY, share system based on full-time empty label net
CN104135529A (en) * 2014-08-05 2014-11-05 北京视像元素技术有限公司 Information discovery and sharing system based on entire space-time label web
CN106294405A (en) * 2015-05-22 2017-01-04 国家计算机网络与信息安全管理中心 A kind of microblogging topic evolution analysis method and device
CN105138512A (en) * 2015-08-12 2015-12-09 小米科技有限责任公司 Phrase recommendation method and apparatus
CN106649585A (en) * 2016-11-18 2017-05-10 福建中金在线信息科技有限公司 Retrieval method and device
CN107423999A (en) * 2017-03-31 2017-12-01 优品财富管理股份有限公司 A kind of orientation based on user grouping issues advertising method and system
CN107423999B (en) * 2017-03-31 2021-03-30 优品财富管理股份有限公司 Directional advertisement issuing method and system based on user grouping
CN112749546A (en) * 2021-01-13 2021-05-04 叮当快药科技集团有限公司 Retrieval matching processing method and device for medical semantics

Similar Documents

Publication Publication Date Title
CN102662986A (en) System and method for microblog message retrieval
US11580104B2 (en) Method, apparatus, device, and storage medium for intention recommendation
McMinn et al. Building a large-scale corpus for evaluating event detection on twitter
Sankaranarayanan et al. Twitterstand: news in tweets
CN103049440B (en) A kind of recommendation process method of related article and disposal system
CN107451861B (en) Method for identifying user internet access characteristics under big data
CN110869968A (en) Event processing system
CN111475509A (en) Big data-based user portrait and multidimensional analysis system
CN110825769A (en) Data index abnormity query method and system
CN101727454A (en) Method for automatic classification of objects and system
CN102667761A (en) Scalable cluster database
WO2015096609A1 (en) Method and system for creating inverted index file of video resource
CN104516910A (en) Method and system for recommending content in client-side server environment
Yao et al. Provenance-based indexing support in micro-blog platforms
CN111522846B (en) Data aggregation method based on time sequence intermediate state data structure
CN113609374A (en) Data processing method, device and equipment based on content push and storage medium
CN104317877A (en) Netuser behavior data real-time processing method based on distributed computation
CN103559258A (en) Webpage ranking method based on cloud computation
Kim et al. TwitterTrends: a spatio-temporal trend detection and related keywords recommendation scheme
Ghane Big data pipeline with ML-based and crowd sourced dynamically created and maintained columnar data warehouse for structured and unstructured big data
CN112541119A (en) Efficient and energy-saving small recommendation system
US9405846B2 (en) Publish-subscribe based methods and apparatuses for associating data files
CN102597969A (en) Database management device using key-value store with attributes, and key-value-store structure caching-device therefor
Dave et al. Identifying big data dimensions and structure
Rudenko et al. A Preference-based Stream Analyzer.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C05 Deemed withdrawal (patent law before 1993)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120912