CN105183803A

CN105183803A - Personalized search method and search apparatus thereof in social network platform

Info

Publication number: CN105183803A
Application number: CN201510529035.9A
Authority: CN
Inventors: 喻梅; 邢文涛; 侯德俊; 姜月; 吕方; 汪腾海
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2015-08-25
Filing date: 2015-08-25
Publication date: 2015-12-23

Abstract

The present invention discloses a personalized search method and a search apparatus for a social network platform. The method comprises the following steps: performing Chinese word segmentation on a user's microblog text; extracting feature words from the segmentation result as the user's interest tags; quantifying the interest tags with a vector space model to obtain a matching degree value between a page and the user's interests; and finally producing a personalized ranking based on the user's interests by combining the Lucene scoring mechanism with that matching degree value. The apparatus comprises a processing module, an extraction module, an acquisition module, and a ranking module. The method and apparatus personalize search results to the greatest extent, so that pages that better match a user's interests receive larger weights and rank higher, which improves user satisfaction. They also raise the degree of automation, so that the method and apparatus adapt better to changes in user demand.

Description

Personalized search method in a social network platform and search apparatus thereof

Technical field

The present invention relates to the fields of natural language processing, data mining, and information retrieval, and in particular to a personalized search method in a social network platform and a search apparatus thereof.

Background art

A search engine helps users quickly find the information they need among the vast amount of information on the network. However, a traditional search engine offers only a simple search service and returns the same undifferentiated results for a given search term. The user must then inspect the results, which in many cases is both time-consuming and laborious. Traditional search engines are therefore increasingly unable to meet the individual needs of different users.

Microblog text is short, has a complex data structure, and contains many special symbols. It also contains special vocabulary that differs from the vocabulary found in general corpora. Word segmentation of microblog text is therefore a challenge.

Many personalized service systems exist, and various approaches to personalized service have been proposed. They solve some problems of personalized search to some degree, but shortcomings remain: the learning and representation of user interests is not highly automated, users are required to enter personal information and provide a large amount of feedback, and the systems do not adapt well to changes in user demand.

Summary of the invention

The invention provides a personalized search method in a social network platform and a search apparatus thereof. The invention personalizes search results and improves user satisfaction, as described below:

A personalized search method in a social network platform, the method comprising the following steps:

performing Chinese word segmentation on a user's microblog text;

extracting feature words from the segmentation result as the user's interest tags;

quantifying the interest tags with a vector space model to obtain a matching degree value between a page and the user's interests;

combining the scoring mechanism of Lucene with the matching degree value between the page and the user's interests to produce a personalized ranking result based on the user's interests.

The step of extracting feature words from the segmentation result as the user's interest tags is specifically:

extracting feature words from the segmentation result by means of microblog user keywords, as the user's interest tags.

The step of quantifying the interest tags with a vector space model to obtain the matching degree value between the page and the user's interests is specifically:

when the page contains the user's first interest tag, the first element of the vector is 1; when the page does not contain the user's second interest tag, the second element of the vector is 0;

obtaining the matching degree value between the page and the user's interests from the quantified interest tags and the weight of each interest tag.

A personalized search apparatus in a social network platform, the apparatus comprising:

a processing module, configured to perform Chinese word segmentation on a user's microblog text;

an extraction module, configured to extract feature words from the segmentation result as the user's interest tags;

an acquisition module, configured to quantify the interest tags with a vector space model and obtain a matching degree value between a page and the user's interests;

a ranking module, configured to combine the scoring mechanism of Lucene with the matching degree value between the page and the user's interests to produce a personalized ranking result based on the user's interests.

The extraction module comprises:

an extraction submodule, configured to extract feature words from the segmentation result by means of microblog user keywords, as the user's interest tags.

The acquisition module comprises:

a quantization submodule, configured so that when the page contains the user's first interest tag, the first element of the vector is 1, and when the page does not contain the user's second interest tag, the second element of the vector is 0;

an obtaining submodule, configured to obtain the matching degree value between the page and the user's interests from the quantified interest tags and the weight of each interest tag.

The technical solution provided by the invention has the following beneficial effects. Taking Sina Weibo as an example, the invention builds a Lucene search engine on that platform and, by analyzing the microblogs a user has posted over a period of time, improves Lucene's page-ranking algorithm according to the interest tendencies the user shows in the social network. It introduces the concept of a tag matching factor to measure how well a page matches the user's interests, and proposes a weight formula suited to personalized ranking. By analyzing the user interest information stored in a user interest table, it obtains ranking results that match the user's interests, so the returned search results follow the user's interest tendencies. Search results are personalized to the greatest extent: pages that better match the user's interests receive larger weights and rank higher, which improves user satisfaction. The degree of automation is also raised, so the method adapts better to changes in user demand.

Brief description of the drawings

Fig. 1 is a flowchart of the personalized search method in a social network platform;

Fig. 2 is a schematic diagram of feature word extraction;

Fig. 3 is a schematic diagram of the calculated mean MRR values;

Fig. 4 is a structural diagram of the personalized search apparatus in a social network platform;

Fig. 5 is a schematic diagram of the extraction module;

Fig. 6 is a schematic diagram of the acquisition module.

The reference numerals in the drawings denote the following parts:

1: processing module; 2: extraction module;

3: acquisition module; 4: ranking module;

21: extraction submodule; 31: quantization submodule;

32: obtaining submodule.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in further detail below.

Lucene (Hatcher E, Gospodnetic O, McCandless M. Lucene in Action [J]. 2004) is a full-text indexing engine toolkit written in Java that can easily be embedded in applications to provide full-text indexing and search. Its application programming interface is fairly generic, and its input and output structures closely resemble the table ==> record ==> field structure of a database, so many traditional application files and databases can be mapped onto Lucene's storage structures and interfaces with little effort. Overall, Lucene can be regarded as a database system that supports full-text indexing. The key to personalized search is understanding the user's interests and, on that basis, predicting the information the user most needs. The user's microblog content therefore has to be processed as text to obtain tags that represent the user's interests. Lucene has its own complete scoring mechanism. It does not precompute a score for each web resource; instead it evaluates and scores in real time when the user searches. A document's score varies with the keywords the user enters: documents closer to the user's need score higher and appear earlier in the results. Lucene's scoring mechanism can be regarded as the frequency with which a given keyword occurs in a document.

A search engine also needs a standard for measuring how well page content matches the user's interest tags. Each user may have several interest tags, and each page may contain several matching items, but a page is not necessarily more interesting to the user merely because it contains more interest tags, because each tag has a different importance to the user. It is generally assumed that the more often an interest tag appears in the user's microblogs, the more important that tag is to the user, i.e. the more interested the user is in that word. Embedding Lucene in a social network to make it intelligent goes well beyond what users expect of a traditional search engine and is a problem with bright prospects.

Embodiment 1

This embodiment provides a personalized search method in a social network platform. Referring to Fig. 1, the method comprises the following steps:

101: Chinese word segmentation is performed on the user's microblog text;

The embodiment uses ICTCLAS, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences. The main functions of ICTCLAS include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word identification; it also supports user dictionaries. Its main idea is to use a cascaded hidden Markov model to process the text in layers, which increases segmentation accuracy while preserving segmentation efficiency. In a specific implementation, other segmentation software may also be used; the embodiment does not elaborate on this.

102: feature words are extracted from the segmentation result as the user's interest tags;

103: the interest tags are quantified with a vector space model to obtain the matching degree value between a page and the user's interests;

That is, the parameter TagMatch(A, U) represents the degree to which page A matches the interest tags of user U. The higher this value, the better page A matches the interests of user U.

104: the scoring mechanism of Lucene is combined with the matching degree value between the page and the user's interests to produce a personalized ranking result based on the user's interests.

That is, given the obtained matching degree values between pages and the user's interests, the ranking algorithm of the Lucene-based personalized search is used to rank and recommend the pages retrieved by the user's search.

In summary, through steps 101 to 104 the embodiment personalizes search results to the greatest extent: pages that better match the user's interests receive larger weights and rank higher, which improves user satisfaction.

Embodiment 2

The scheme of Embodiment 1 is described in detail below with reference to specific formulas, examples, and Fig. 2:

201: in the process of personalized search, word segmentation is first performed on the microblog text;

To deal with the complex data structure of microblog text, the embodiment extracts the useful microblog information and associates the text of forwarded content with it. To deal with the many special symbols in microblog text: because the embodiment is mainly concerned with user features derived from microblog text and does not consider relationships between users, the user information following an "@" symbol is ignored, while the topic noun inside "#...#" is taken directly as one of the user's keywords. To deal with the special vocabulary in microblog text, WUK adds new stop words obtained by statistical methods and filters out URL-formatted data in the microblogs.
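As an illustration only, the handling of special symbols described in step 201 might be sketched in Python as follows. The function name `preprocess` and the regular expressions are assumptions made for this sketch and are not specified by the embodiment; in particular, the sketch assumes that an "@" mention ends at the first character that is not a letter, digit, underscore, or hyphen.

```python
import re

# Assumed patterns (not taken from the embodiment).
URL_RE = re.compile(r"https?://[\w./?=&%\-]+", re.ASCII)  # URL-formatted data
MENTION_RE = re.compile(r"@[\w\-]+")                      # "@user" references
TOPIC_RE = re.compile(r"#([^#]+)#")                       # "#topic#" markers


def preprocess(text):
    """Return (cleaned_text, topic_keywords) for one microblog post."""
    # Topic nouns inside "#...#" are kept directly as user keywords.
    topics = [t.strip() for t in TOPIC_RE.findall(text)]
    text = URL_RE.sub(" ", text)        # filter URL-formatted data
    text = MENTION_RE.sub(" ", text)    # ignore user information after "@"
    text = TOPIC_RE.sub(r" \1 ", text)  # keep the topic text, drop the "#" marks
    return " ".join(text.split()), topics
```

For example, `preprocess("@小明 今天看了 #世界杯# 真精彩 http://t.cn/abc")` yields the cleaned text `"今天看了 世界杯 真精彩"` and the topic keyword list `["世界杯"]`.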

202: feature words are extracted from the segmentation result as the user's interest tags;

The traditional weighting method TF-IDF has two evident weaknesses that affect the accuracy and authority of feature word extraction. The core cause is that combining the term frequency (TF) weight with the inverse document frequency (IDF) weight biases the combined weight. Because microblog text, unlike an ordinary text collection, is random in nature, the WUK (microblog user keyword) algorithm is adopted, which performs feature extraction based on TF alone.

Referring to Fig. 2, the input to the WUK algorithm is microblog text data. Redundant information is removed first, including URL-formatted data, emoticon data, and special symbols. The microblog data is then segmented with a hidden Markov model (HMM) segmentation technique and stored in a keyword list. A stop-word list is loaded, and the stop words present in the keyword list are removed. The keyword list is sorted by TF and a word cloud is generated (a word cloud visually highlights the keywords that occur most frequently in a text). Finally, the TF word cloud is output.
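The WUK flow of Fig. 2 can be approximated by the following Python sketch. It is not the embodiment's implementation: the function name `wuk_keywords` is hypothetical, the default tokenizer is a plain whitespace split standing in for an HMM-based segmenter such as ICTCLAS, the posts are assumed to have had redundant information removed already, and the word cloud rendering is omitted.

```python
from collections import Counter


def wuk_keywords(posts, stop_words, tokenize=str.split, top_n=10):
    """Return the top_n (keyword, TF) pairs over a user's cleaned posts.

    tokenize is a stand-in for a real Chinese word segmenter (assumption).
    """
    counts = Counter()
    for post in posts:
        for token in tokenize(post):
            # Drop stop words before counting term frequency.
            if token not in stop_words:
                counts[token] += 1
    # TF ranking: the most frequent keywords come first.
    return counts.most_common(top_n)
```

Passing a segmenter function as `tokenize` keeps the TF counting independent of the segmentation software, which the embodiment allows to vary.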

After the user's feature words have been extracted by the WUK algorithm, each word is one of the user's interest tags. To measure the importance of each interest tag, the TF value of each word is divided by the sum of the TF values of all feature words, which gives the weight of each interest tag.
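The weight normalization just described amounts to dividing each TF value by the total. A minimal Python sketch follows; the function name `tag_weights` and the example tags are illustrative assumptions.

```python
def tag_weights(tf):
    """Map each interest tag to TF(tag) / sum of TF over all tags."""
    total = sum(tf.values())
    if total == 0:
        return {}  # no feature words, hence no weights
    return {tag: count / total for tag, count in tf.items()}
```

With hypothetical TF values `{"football": 6, "news": 3, "music": 1}`, the weights are 0.6, 0.3, and 0.1, and they sum to 1.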

203: the interest tags are quantified with a vector space model to obtain the matching degree value between a page and the user's interests;

To realize personalized search, the search engine also needs a standard for measuring how well page content matches the user's interest tags. Each user may have several interest tags, and each page may contain several matching items, but a page is not necessarily more interesting to the user merely because it contains more interest tags, because each interest tag has a different importance to the user. It is generally assumed that the more often an interest tag appears in the user's microblogs, the more important that tag is to the user, i.e. the more interested the user is in that word; accordingly, the TF value of that interest tag is also higher.

Let the vector IncTag_{A,U} record which interest tags of user U are contained in page A. For example, if page A contains the first interest tag of user U, the first element of IncTag_{A,U} is 1; if page A does not contain the second interest tag of user U, the second element of IncTag_{A,U} is 0. With this representation, the interest tag matching degree between the page and the user is given by formula (1):

\mathrm{TagMatch}(A, U) = \mathrm{IncTag}_{A,U} \cdot \mathrm{TagWeight}_{U} \qquad (1)

Here TagWeight_{U} is an N-element column vector that records the weight of each interest tag of user U; each element of this vector is a number between 0 and 1. TagMatch(A, U) is a scalar that represents the degree to which page A matches the interest tags of user U. The higher this value, the better page A matches the interests of user U.
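Formula (1) is an inner product between a 0/1 indicator vector and the weight vector. The Python sketch below illustrates it under one assumption not stated in the embodiment: a page "contains" a tag when the tag occurs as a substring of the page text. The function names are hypothetical.

```python
def inc_tag(page_text, tags):
    """IncTag_{A,U}: 1 where the page contains the tag, 0 otherwise."""
    # Substring containment is an assumed definition of "contains".
    return [1 if tag in page_text else 0 for tag in tags]


def tag_match(page_text, weights):
    """TagMatch(A, U) = IncTag_{A,U} . TagWeight_U, as in formula (1)."""
    tags = list(weights)
    indicator = inc_tag(page_text, tags)
    return sum(i * weights[t] for i, t in zip(indicator, tags))
```

For hypothetical weights `{"football": 0.6, "news": 0.3, "music": 0.1}` and the page text `"football news today"`, the indicator vector is `[1, 1, 0]` and the matching degree is 0.6 + 0.3 = 0.9 (up to floating-point rounding).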

204: the scoring mechanism of Lucene is combined with the matching degree value between the page and the user's interests to produce a personalized ranking result based on the user's interests.

\text{T-Rank} = \alpha \cdot S(q, d) + \beta \cdot \mathrm{TagMatch}(A, U) \qquad (2)

Here α and β are coefficients between 0 and 1, with α + β = 1; TagMatch(A, U), referred to here as Rank, is the ranking score obtained from the user's interest tags; S(q, d) is the score produced by Lucene's original scoring mechanism; and T-Rank is the final ranking score that is output.
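The combination in formula (2) can be sketched in Python as follows. The function names and the result tuple layout are assumptions for illustration. The sketch also assumes that the Lucene score S(q, d) has been brought to a scale comparable with TagMatch, which the embodiment does not discuss.

```python
def t_rank(lucene_score, tag_match_value, alpha):
    """T-Rank = alpha * S(q, d) + beta * TagMatch(A, U), with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    beta = 1.0 - alpha
    return alpha * lucene_score + beta * tag_match_value


def rerank(results, alpha):
    """Sort (doc_id, lucene_score, tag_match) tuples by descending T-Rank."""
    return sorted(results, key=lambda r: t_rank(r[1], r[2], alpha), reverse=True)
```

Setting `alpha` to 1 reproduces the original Lucene order, while smaller values give the interest tags more influence.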

Lucene's ranking system comprises several classes: the query class (Query), the weight class (Weight), the scorer class (Scorer), and the similarity class (Similarity). Together these four classes form the framework of Lucene's default scoring system.

1. The Query class encapsulates the user's search request. It is an abstract class and the central controller of the final scoring of Lucene's retrieval results; the other scoring-related classes and objects are managed and created through it. The Query class is implemented in Query.java and provides methods for accessing the other classes;

2. The Weight class defines the interface for computing Query weights and can be reused. It can be used to create a Scorer and to explain the details of a score; it also defines the methods for obtaining the Query weight. It is defined in Weight.java;

3. The Scorer class is the core class of Lucene scoring. It is defined as an abstract class that provides the basic abstract scoring methods implemented by all scorer classes, and it also defines the method for explaining a score in detail. Internally, the Scorer holds a Similarity object that specifies the scoring formula. The Scorer class is implemented in Scorer.java;

4. The Similarity class is the core abstract class for similarity computation in Lucene scoring. It mainly handles score calculation; by default the system scores results with the default similarity class (DefaultSimilarity). The Similarity object used for scoring can be set by calling an internal method. It is defined in Similarity.java.

In summary, through steps 201 to 204 the embodiment personalizes search results to the greatest extent: pages that better match the user's interests receive larger weights and rank higher, which improves user satisfaction.

Embodiment 3

The feasibility of the schemes of Embodiments 1 and 2 is verified below with reference to specific formulas, examples, and Fig. 3:

In the experiment, the input data for text vectorization comprise a corpus built from a Chinese data set (i.e. the elements of the vector space, which can be used only after preliminary dimensionality reduction to keep the dimension from becoming too high) and the target microblog text after specific preprocessing.

The vector space model is a classical model for text mining and can be implemented directly with an open-source SDK package; the code is shown in Table 1.

Table 1 VSM code example

In this experiment, five microblog users were selected at random, and the 10 keywords with the highest TF values were chosen as each user's interest tags; the results are shown in Table 2.

Table 2 Interest tags of the users and their weights

After vectorization, the TF value is used as the weight for feature extraction, so as to select the keywords that represent the user's interests.

To evaluate objectively the performance of this method against Lucene's original scoring algorithm, the mean reciprocal rank (MRR) criterion is introduced. MRR is an internationally used measure for evaluating search algorithms: if the first result matches, the score is 1; if the second matches, the score is 0.5; if the n-th matches, the score is 1/n; and if nothing matches, the score is 0. The final score is the mean of all scores.

\mathrm{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i} \qquad (3)

Here r_i is the position of the relevant document in the search results of the i-th query, and n is the total number of queries. The higher the MRR value, the better the search algorithm performs.

Under Lucene's original scoring mechanism, the relevant document for the first query ("news") appears at position 9 in the search results, and in the following four queries the relevant documents appear at positions 3, 8, 7, and 11, so the MRR of these five queries is 0.16. Under this method, the relevant document of the first query appears at position 1, and in the following four queries at positions 2, 3, 3, and 5, so the MRR of these five queries is 0.47. However, a small amount of data from one user is not conclusive. In the final experiment, 20 queries were run for the 5 users in Table 2, both under Lucene's original scoring mechanism (i.e. α=1) and under this method with different coefficients, and the mean MRR was calculated; the results are shown in Fig. 3.
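The two five-query examples above can be checked with a short Python sketch of the MRR definition. The function name is hypothetical, and representing "no match" as `None` is a convention chosen for this sketch.

```python
def mean_reciprocal_rank(positions):
    """Mean of 1/r_i over all queries; a query with no match (None) scores 0."""
    if not positions:
        raise ValueError("at least one query is required")
    total = sum(0.0 if r is None else 1.0 / r for r in positions)
    return total / len(positions)


lucene_mrr = mean_reciprocal_rank([9, 3, 8, 7, 11])  # original Lucene scoring
method_mrr = mean_reciprocal_rank([1, 2, 3, 3, 5])   # this method
print(round(lucene_mrr, 2), round(method_mrr, 2))    # prints: 0.16 0.47
```

Rounded to two decimals, the values agree with the 0.16 and 0.47 stated in the text.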

The mean MRR of Lucene's original scoring mechanism is 0.200. For this method, the MRR is only 0.01 when α=0. As α increases, the MRR of this method increases and reaches its maximum of 0.263 at α=0.6, which means the personalization effect is best at that point. As α increases further, the MRR gradually falls back, because the influence of the tag matching score in this method becomes smaller and smaller. With α=0.6, this method achieves good personalized ranking.

In summary, the method provided by the embodiment offers an improved approach to personalized search in social networks: it proposes and defines a way of computing the matching degree between a page and the user's interest tags and rewrites the traditional Lucene scoring algorithm, thereby improving the accuracy with which pages are matched to the user's interest tags.

Embodiment 4

A personalized search apparatus in a social network platform. Referring to Fig. 4, the apparatus comprises:

a processing module 1, configured to perform Chinese word segmentation on a user's microblog text;

an extraction module 2, configured to extract feature words from the segmentation result as the user's interest tags;

an acquisition module 3, configured to quantify the interest tags with a vector space model and obtain a matching degree value between a page and the user's interests;

a ranking module 4, configured to combine the scoring mechanism of Lucene with the matching degree value between the page and the user's interests to produce a personalized ranking result based on the user's interests.

Referring to Fig. 5, the extraction module 2 comprises:

an extraction submodule 21, configured to extract feature words from the segmentation result by means of microblog user keywords, as the user's interest tags.

Referring to Fig. 6, the acquisition module 3 comprises:

a quantization submodule 31, configured so that when the page contains the user's first interest tag, the first element of the vector is 1, and when the page does not contain the user's second interest tag, the second element of the vector is 0;

an obtaining submodule 32, configured to obtain the matching degree value between the page and the user's interests from the quantified interest tags and the weight of each interest tag.

The embodiment does not limit the entity that executes the above modules and submodules; it may be any device with computing capability, such as a single-chip microcomputer or a PC, as long as it can perform the above functions.
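Purely as an illustration of how the four modules could be composed in software, the following Python sketch wires them together as injected functions. The class name, field names, and the layout of the `hits` tuples are assumptions of this sketch; the embodiment does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PersonalizedSearchDevice:
    segment: Callable[[str], List[str]]                    # processing module 1
    extract_tags: Callable[[List[str]], Dict[str, float]]  # extraction module 2
    match: Callable[[str, Dict[str, float]], float]        # acquisition module 3
    alpha: float                                           # ranking coefficient

    def rank(self, microblog_text: str,
             hits: List[Tuple[str, str, float]]) -> List[str]:
        """Ranking module 4. hits holds (doc_id, page_text, lucene_score)."""
        tags = self.extract_tags(self.segment(microblog_text))
        scored = [
            (self.alpha * score + (1 - self.alpha) * self.match(text, tags), doc_id)
            for doc_id, text, score in hits
        ]
        # Highest combined score first.
        return [doc_id for _, doc_id in
                sorted(scored, key=lambda s: s[0], reverse=True)]
```

Injecting the segmenter, extractor, and matcher as callables mirrors the statement that the executing entity and the concrete segmentation software are not limited.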

In summary, through the above modules and submodules the embodiment personalizes search results to the greatest extent: pages that better match the user's interests receive larger weights and rank higher, which improves user satisfaction.

Unless otherwise specified, the embodiment does not limit the model of each device, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the sequence numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A personalized search method in a social network platform, characterized in that the method comprises the following steps:

performing Chinese word segmentation on a user's microblog text;

2. The personalized search method in a social network platform according to claim 1, characterized in that the step of extracting feature words from the segmentation result as the user's interest tags is specifically:

3. The personalized search method in a social network platform according to claim 1, characterized in that the step of quantifying the interest tags with a vector space model to obtain the matching degree value between the page and the user's interests is specifically:

4. A personalized search apparatus in a social network platform, characterized in that the apparatus comprises:

5. The personalized search apparatus in a social network platform according to claim 4, characterized in that the extraction module comprises:

6. The personalized search apparatus in a social network platform according to claim 4, characterized in that the acquisition module comprises: