CN101719145A

CN101719145A - Personalized search method based on a book domain ontology

Info

Publication number: CN101719145A
Application number: CN200910238155A
Authority: CN
Inventors: 张铭; 孙韬
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2009-11-17
Filing date: 2009-11-17
Publication date: 2010-06-02
Anticipated expiration: 2029-11-17
Also published as: CN101719145B

Abstract

The invention provides a personalized search method based on a book domain ontology, belonging to the field of personalized web search services. The method comprises the following steps: building the domain ontology, introducing the idea of collaborative filtering, and adding a semantic relation that represents the influence among users; analyzing and processing logs to build a user model based on user interests and preferences; computing a personalized score by Spreading Activation (SA) over the user model and the domain ontology; and re-ranking the search results, that is, reordering the results returned by the original search engine by personalized score from high to low and returning them to the user. By introducing collaborative filtering into the domain ontology, building a user model that promptly reflects changes in user interest, and analyzing user needs accurately with SA, the invention effectively eliminates keyword ambiguity and substantially improves user satisfaction with the search results.

Description

Personalized search method based on a book domain ontology

Technical field

The present invention relates to personalized services in libraries, and in particular to a method of providing personalized search for a library. It belongs to the field of information management within applied computer technology.

Background technology

In the information age, the explosive growth in the amount of information has made "information overload" an increasingly important problem. General-purpose search engines such as Google and Baidu can return thousands of search results, which makes it hard for users to sift through them. Moreover, users often submit ambiguous keywords. For example, user A, who likes thrillers, uses the keyword "Da Vinci" to search for Dan Brown's novel "The Da Vinci Code", while user B, who is keen on Renaissance art, uses the same keyword "Da Vinci" to search for Leonardo da Vinci's paintings. They clearly have different information needs, yet Google and Baidu return the same search results to both. Digital libraries face the same problem: because the number of digital documents keeps growing and keywords are ambiguous, users have to spend more and more time picking out what they really need from the returned results. Personalized search can analyze a user's history, build a user model, and return search results that match the user's real needs more accurately, thereby addressing the "information overload" problem.

At present, researchers in China and abroad have carried out extensive and in-depth work on personalized search. The main personalized search methods include the following. Methods based on the personalized PageRank algorithm (T. H. Haveliwala, Topic-sensitive PageRank, Proceedings of the 11th International Conference on World Wide Web, New York, USA, 2002) analyze user interest from the user's browsing history and then bias the random walk (Random Walk) toward documents of particular categories. Methods based on clustering (Ferragina, Gulli, A Personalized Search Engine Based on Web Snippet Hierarchical Clustering, Software Practice and Experience, Volume 38, 2008) cluster the documents and then highlight the particular categories the user is interested in. Although these methods achieve personalized search to some degree based on user interest, they cannot effectively eliminate keyword ambiguity, and they do not consider the semantic knowledge of the searched domain, so the semantic relations between documents are lost.

Yolanda Blanco-Fernández et al. proposed a method that uses semantic reasoning to provide personalized services (Yolanda Blanco-Fernández, José J. Pazos Arias, et al., Semantic Reasoning: A Path to New Possibilities of Personalization, the 5th Annual European Semantic Web Conference, Tenerife, Spain, 2008). This method first builds a domain ontology from domain knowledge, and then uses the semantic reasoning techniques ρ-path, ρ-join and ρ-cp to obtain the latent semantic relations between instances (Instance). After the ontology has been extended by reasoning, the similarity between each instance and the user's preferences is computed over the domain ontology. Although this method can use the domain ontology to eliminate keyword ambiguity effectively, it has the following shortcomings:

1. It does not consider the mutual influence between users, and computes similarity only from the perspective of instances;

2. It does not consider that user interest changes over time, and so cannot track the user's latest needs.

Summary of the invention

To overcome the deficiencies of the prior art, the invention provides a personalized search method and system based on a book domain ontology. The method first builds a book domain ontology and, taking the influence between users into account, adds a new semantic relation. It then builds a user model (User Profile) in which interests are classified and weighted in chronological order. Finally, it computes a personalized score with the graph mining algorithm SA and re-ranks the results returned by the search engine accordingly, realizing personalized search. By using the domain ontology, the method effectively eliminates keyword ambiguity and promptly reflects changes in user interest, so that user needs are analyzed accurately and user satisfaction with the search results improves significantly. The technical solution adopted by the invention to solve its technical problem is as follows:

A personalized search method based on a book domain ontology comprises an offline part, in which the user model and the domain ontology are built, and an online part, in which the personalized score is computed and the search results are re-ranked. The specific steps are as follows:

Step 1, build the domain ontology: taking a specific domain as the object of description, build the domain ontology according to the classification thesaurus and, based on the users' history, add the idea of collaborative filtering, thereby providing the concept definitions of this specific domain and the semantic relations between the concepts.

The domain ontology provides rich semantic information and strengthens the semantic relations between entities. It thereby overcomes polysemy, synonymy and word dependence in the original search results, serves to disambiguate, and can reflect how users' interests influence one another. This domain ontology serves as the propagation network of the search algorithm of the invention.

Step 2, build the user model: analyze and process the logs to examine the user's history. A reader's interest changes continuously over time, so user interest is classified and weighted in chronological order. According to the borrowing period, interest is divided into three classes, namely instant interest, recent interest and long-term interest, which are given weights from high to low, thereby building a user model based on user interest preference.

The user model can promptly reflect the updating and shifting of user interest.

Step 3, compute the personalized score: based on the domain ontology and user model that have been built, compute the personalized score with the graph mining algorithm SA. The specific calculation steps are as follows:

First, the domain ontology is treated as a graph and the SA algorithm is run on it, with the books that have been given weights in the user model as the initial nodes (Initial Nodes) of the spreading. Then, an iteration limit, a propagation path limit and a propagation end-point limit are set for the SA algorithm to improve its efficiency. Finally, the activation value of each node is updated iteratively with the score update formula until the algorithm finishes.

After each iteration of the graph mining algorithm SA, user feedback is collected and the activation values of the initial nodes for the next propagation are updated. The collected feedback comprises the books whose links the user has newly clicked and the books the user has newly borrowed; both sets of books are treated as instant interest and given weights.

Step 4, re-rank the search results: according to the personalized score obtained by SA, reorder the results returned by the original search engine from high to low, and then return them to the user.

A personalized search system based on a book domain ontology comprises:

A domain ontology module, in which the system builds the domain ontology by taking a specific domain as the object of description, thereby providing the concept definitions of this specific domain and the semantic relations between the concepts;

A user model module, which analyzes and processes the logs to examine the user's history; because user interest changes continuously over time, it classifies and weights the user interest in chronological order, thereby building a user model based on user interest preference;

A personalized score computing module, which computes the score with the graph mining algorithm SA based on the domain ontology and user model that have been built; and

A search result re-ranking module, which, according to the personalized score obtained by SA, reorders the results returned by the original search engine from high to low and then returns them to the user.

The graph mining algorithm SA in the personalized score computing module comprises the following units:

Initial value determining unit: treats the domain ontology as a graph and takes the books that have been given weights in the user model as the initial nodes of the spreading;

SA iteration setting unit: sets the iteration limit, propagation path limit and propagation end-point limit of the SA algorithm to improve its efficiency; and

Iteration unit: iteratively updates the activation value of each node with the score update formula until the algorithm finishes.

Beneficial effects of the invention:

1. The method combines the idea of collaborative filtering with the SA algorithm. When the domain ontology is built, a new semantic relation, borrowIntent, is introduced, which reflects the similarity between two entities from the perspective of user interest. This greatly enriches and improves the expressive power of the domain ontology and also ensures that the information in the SA propagation network is complete.

2. User interest is analyzed more accurately and at a finer grain when the user model is built. The model distinguishes the user's interests in different time periods and gives them different weights, so that it represents user interest objectively and comprehensively, reflects and tracks changes in user interest, and ensures that the weights of the SA initial nodes are reasonable.

The method was evaluated on real log data from the Peking University Library (http://www.lib.pku.edu.cn). The experimental data show that re-ranking the results with the method effectively eliminates keyword ambiguity in the returned search results and significantly raises the rank of the books the user is interested in, which saves the user's browsing time and improves user satisfaction. The method is not limited to the library domain and can be extended to other domains, so it has relatively high experimental value.

Description of drawings

Fig. 1 is a schematic diagram of the domain ontology of the method of the invention;

Fig. 2 is a schematic diagram of the process of providing a personalized search service to the user according to the invention;

Fig. 3 compares the mean Norm DCG of the Top N results of the method of the invention and three other methods;

Fig. 4 compares the number of results of interest to the user among the Top N results of the method of the invention and three other methods.

Embodiment

The invention is described in further detail below with reference to the drawings and specific embodiments:

Embodiment 1: taking the real log data of the Peking University Library from January 2008 to June 2008 as an example, a specific embodiment of the invention is described in detail with reference to the domain ontology diagram of Fig. 1.

This embodiment describes a method of providing a personalized search service to library users. The goal is that, for a query with the same keyword, different users each obtain the information closest to their own needs, giving them a better user experience. In this embodiment, the system architecture of the personalized search method based on the book domain ontology is shown in Fig. 2.

The details are as follows:

Part one, the offline part. The work of the offline part is done offline, that is, before the user submits a keyword search. The specific steps are as follows:

Step 1, build the domain ontology: taking a specific domain as the object of description, build the domain ontology, thereby providing the concept definitions of this specific domain and the semantic relations between the concepts. Specifically, the domain ontology is built as follows:

The domain ontology is shown schematically in Fig. 1. When the ontology is built, the idea of collaborative filtering (Badrul Sarwar, George Karypis, Joseph Konstan, John Riedl, Item-Based Collaborative Filtering Recommendation Algorithms, Proceedings of the 10th International Conference on World Wide Web, Hong Kong, 2001) is introduced: the influence between users is taken into account and a new semantic relation is added. A user model is then built, in which interests are classified and weighted in chronological order. Next, the Spreading Activation (SA) model (A. M. Collins, E. F. Loftus, A spreading-activation theory of semantic processing, Psychological Review, vol. 82, pp. 407-428, 1975) is used to re-rank the results returned by the search engine, realizing personalized search. In the domain ontology built by the invention, concepts (concept) comprise book classes (class) and book entities (instance), and the relations between concepts comprise the W3C-recommended rdfs:subClassOf, as well as rdf:type, dc:creator, dc:subject, and borrowIntent, which is newly proposed by the invention.

Specifically, for the book domain, the top-level classes "F Economics", "I Literature", "J Art" and so on, and subclasses such as "J2 Painting" and "J22 Chinese painting works", are set up based on the Chinese Library Classification (CLC). A class and its subclass are linked by rdfs:subClassOf. Each book is assigned to the lowest-level CLC class according to its CLC call number. For example, the call number of "The Da Vinci Code" is I712.45/598, so it is assigned to "I7". A book and its class are linked by rdf:type. Based on the author and subject information of the books, dc:creator and dc:subject relations are further introduced into the domain ontology. For example, "The Da Vinci Code" and its author "Dan Brown" are linked by dc:creator, and "Paintings of Leonardo da Vinci" and its subject "Renaissance art" are linked by dc:subject.
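The class hierarchy and book relations described above can be sketched as a list of (subject, predicate, object) triples. This is a minimal illustration, not the patent's implementation: the helper names parent_class, book_class and build_triples are hypothetical, and the rule that a book belongs to the longest class code prefixing its call number is an assumption chosen so that I712.45/598 maps to "I7" as in the text.

```python
# Class codes taken from the examples in the text; everything else is illustrative.
CLC_CLASSES = {"F", "I", "I7", "J", "J2", "J22"}

def parent_class(code, classes=CLC_CLASSES):
    """Return the longest proper prefix of `code` that is itself a class, or None."""
    for end in range(len(code) - 1, 0, -1):
        if code[:end] in classes:
            return code[:end]
    return None

def book_class(call_number, classes=CLC_CLASSES):
    """Assumption: a book belongs to the longest class code that prefixes its call number."""
    matches = [c for c in classes if call_number.startswith(c)]
    return max(matches, key=len) if matches else None

def build_triples(books, classes=CLC_CLASSES):
    """Build rdfs:subClassOf, rdf:type, dc:creator and dc:subject triples."""
    triples = []
    for code in sorted(classes):
        parent = parent_class(code, classes)
        if parent is not None:
            triples.append((code, "rdfs:subClassOf", parent))
    for call_number, info in books.items():
        cls = book_class(call_number, classes)
        if cls is not None:
            triples.append((call_number, "rdf:type", cls))
        if "creator" in info:
            triples.append((call_number, "dc:creator", info["creator"]))
        for subject in info.get("subjects", []):
            triples.append((call_number, "dc:subject", subject))
    return triples

books = {"I712.45/598": {"creator": "Dan Brown"}}
triples = build_triples(books)
```

A production system would more likely store these triples in an RDF store; plain tuples are used here only to keep the sketch self-contained.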

Afterwards, drawing on the idea of collaborative filtering and starting from readers' reading interests, a weighted, directed, asymmetric relation borrowIntent is introduced between every pair of books.

borrowIntent is defined as follows: if n₁ readers have borrowed book b₁ and n₂ readers have borrowed book b₂, the weight of the edge b₁→b₂ (link weight) is borrowIntent(b₁, b₂) = |n₁∩n₂| / n₁, where |n₁∩n₂| is the number of readers who have borrowed both books; likewise, the weight of the edge b₂→b₁ is borrowIntent(b₂, b₁) = |n₁∩n₂| / n₂.
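The definition above can be computed directly from borrowing records. The following is a minimal sketch that assumes the records are available as (user_id, book_id) pairs; the function name borrow_intent and the choice to omit zero-weight edges are illustrative assumptions.

```python
from collections import defaultdict
from itertools import permutations

def borrow_intent(loans):
    """Compute directed borrowIntent weights from (user_id, book_id) pairs.

    Returns {(b1, b2): weight}, where weight is the number of readers who
    borrowed both books divided by the number of readers who borrowed b1.
    """
    readers = defaultdict(set)
    for user_id, book_id in loans:
        readers[book_id].add(user_id)

    weights = {}
    for b1, b2 in permutations(readers, 2):
        shared = len(readers[b1] & readers[b2])
        if shared:  # assumption: pairs with no shared reader get no edge
            weights[(b1, b2)] = shared / len(readers[b1])
    return weights

loans = [("u1", "A"), ("u2", "A"), ("u1", "B")]
weights = borrow_intent(loans)
```

In this toy input two readers borrowed A and one of them also borrowed B, so the edge A→B is weaker than the edge B→A, which shows the asymmetry of the relation. Enumerating all ordered pairs is quadratic in the number of books, so a real catalogue would need a sparser strategy.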

After the domain ontology has been built, the logs are cleaned up:

In general, log records are rather messy in format and contain a large amount of useless information. The logs therefore first need to be cleaned by removing invalid or erroneous records, such as those containing "MISSING" or "??". All log information is then organized into the form of Table 1 and stored in a relational database. Here entry_id is the record number, book_id is the CLC call number of the book, user_id is the user number (the user number is used only to let the system distinguish users; no user information can be inferred from it, so user privacy is not involved), and timestamp is the date of the record.

Table 1: Log information table

entry_id	book_id	user_id	timestamp
1	B516.47/9.2	00000001	2008-01-02
...	...	...	...
389,138	C37/2	00010009	2008-06-30
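The log cleaning step that produces Table 1 might look like the sketch below. It assumes each raw record is a four-field sequence in the column order of Table 1; the function name clean_log, the field count check and the ISO date check are assumptions added for illustration, while the "MISSING" and "??" markers come from the text.

```python
from datetime import date

BAD_MARKERS = ("MISSING", "??")  # markers of invalid records named in the text

def clean_log(raw_rows):
    """Keep only well-formed (entry_id, book_id, user_id, timestamp) records."""
    cleaned = []
    for row in raw_rows:
        if len(row) != 4:
            continue  # assumption: a valid record has exactly four fields
        fields = [str(field).strip() for field in row]
        if any(marker in field for field in fields for marker in BAD_MARKERS):
            continue
        entry_id, book_id, user_id, timestamp = fields
        try:
            day = date.fromisoformat(timestamp)  # expects YYYY-MM-DD
        except ValueError:
            continue
        cleaned.append({"entry_id": entry_id, "book_id": book_id,
                        "user_id": user_id, "timestamp": day})
    return cleaned

raw = [
    ("1", "B516.47/9.2", "00000001", "2008-01-02"),
    ("2", "MISSING", "00000002", "2008-01-03"),
    ("3", "C37/2", "??", "2008-01-04"),
    ("4", "C37/2", "00010009", "not-a-date"),
]
cleaned = clean_log(raw)
```

Only the first record survives here; the cleaned rows could then be written to the relational table with any database driver.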

At the same time, a second table of book information needs to be maintained, in which book_title is the full title of the book.

Table 2: Book information table

book_id	book_title
I712.45/598	The Da Vinci Code
...	...
K835.4657/6e	Pictorial Biography of Leonardo da Vinci = Da vinci

Step 2, build the user model: analyze and process the logs to examine the user's history. Because a reader's interest changes continuously over time, the books the user has borrowed are classified and weighted in chronological order, thereby building a user model based on user interest preference. The detailed process is as follows:

By analyzing the log information table, a specific user's borrowing history can be obtained, from which user interest is analyzed and the user interest model is built. The invention divides user interest into three classes by time period: instant interest (the other books borrowed in the current visit), recent interest (books borrowed within the last month) and long-term interest (all others). For each book i in the user interest model, the weight A[i] is assigned by the following formula:

A[i] = \begin{cases} α & \text{book } i \text{ is an instant interest} \\ β & \text{book } i \text{ is a recent interest} \\ γ & \text{book } i \text{ is a long-term interest} \end{cases}

In the formula above, the parameters finally selected by the invention after experimental comparison are α=4, β=2, γ=1. Intuitively, instant interest is twice as important as recent interest, and recent interest is twice as important as long-term interest.
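The weighting by time period can be sketched as follows, using the parameters α=4, β=2, γ=1 from the text. Several details are assumptions made for illustration: the function name build_user_profile, treating books borrowed on the current day as instant interest, taking "one month" as 30 days, and keeping the highest weight when a book was borrowed more than once. The call number "J2/1" is a made-up example.

```python
from datetime import date

ALPHA, BETA, GAMMA = 4, 2, 1  # instant, recent, long-term weights from the text

def build_user_profile(loans, today, recent_days=30):
    """Map each borrowed book to a weight A[i] based on how long ago it was borrowed.

    loans: iterable of (book_id, borrow_date) for a single user.
    """
    profile = {}
    for book_id, borrowed_on in loans:
        age = (today - borrowed_on).days
        if age <= 0:
            weight = ALPHA   # assumption: borrowed in the current visit
        elif age <= recent_days:
            weight = BETA    # borrowed within roughly one month
        else:
            weight = GAMMA   # everything older
        # assumption: repeated loans keep the highest weight
        profile[book_id] = max(weight, profile.get(book_id, 0))
    return profile

today = date(2008, 6, 30)
loans = [
    ("I712.45/598", date(2008, 6, 30)),
    ("J2/1", date(2008, 6, 10)),        # hypothetical call number
    ("K835.4657/6e", date(2008, 1, 2)),
]
profile = build_user_profile(loans, today)
```

Because the weights depend on the current date, rebuilding the profile at query time is one simple way to keep it up to date as interests shift.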

Because a user's interests are not fixed, the user interest model of the invention is updated continuously as time passes.

Part two, the online part. The work of the online part is done online, that is, after the user submits a keyword search. The specific steps of the online part are as follows:

Step 3, compute the personalized score: based on the domain ontology and user model that have been built, compute the personalized score with the graph mining algorithm SA. The specific way of computing the personalized score with SA is as follows:

The personalized score is the activation value (Activation Score) computed by SA. The domain ontology built in the offline part is the propagation network of SA. In this network, a node represents a book, class, author or subject, and link(i, j) represents the edge connecting node i and node j. During SA propagation, all edges other than borrowIntent are treated as undirected, which ensures that an activation value can propagate from book b₁ to the class, author or subject of b₁ and then on to book b₂. In other words, of all the edges, only borrowIntent is a directed, weighted edge. The user interest model obtained in the offline part supplies the weighted initial nodes of SA, and the initial activation value of every other node is 0. During SA propagation, the activation value A[j] of node j is updated as follows.

A[j] = A[j] + \sum_{i \in \{ i \mid link(i, j) \}} A[i] \cdot DecayFactor

In the formula above, DecayFactor is the decay coefficient, which determines the activation value that remains after decay when node i propagates to its neighbor node j. Following the parameter setting used by Ming-Hung Hsu (Ming-Hung Hsu, Hsin-Hsi Chen, A method to predict social annotations, CIKM, Napa Valley, CA, USA, 2008), the invention sets the default decay coefficient to 0.8. Unlike Ming-Hung Hsu, when the edge is a borrowIntent edge, the decay coefficient is the link weight of that edge.

During SA propagation, to improve efficiency, the invention sets the following limits for SA:

(1) Iteration limit. In this embodiment, the number of SA iterations is limited to 3.

(2) Propagation path limit. This controls the propagation distance; in this embodiment, the maximum propagation distance is limited to 2.

(3) Propagation end-point limit. In this embodiment, propagation end points are restricted so that propagation stops once it reaches a specified node.

It should be noted that the invention can collect user feedback after each SA iteration and update the activation values of the initial nodes for the next propagation. The feedback that can be collected comprises the books whose links the user has newly clicked and the books the user has newly borrowed. All of these books are treated as instant interest and given weights.
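One possible reading of the propagation step with its three limits is sketched below. The text does not fix every detail, so several choices are assumptions: each node fires once, in the round after it is first reached; distances are counted in hops from the initial nodes; and the names build_network, spreading_activation and stop_nodes are hypothetical. The decay of 0.8 for undirected edges and the use of the borrowIntent weight as decay follow the text. Feedback collection between rounds is omitted.

```python
from collections import defaultdict

DECAY_FACTOR = 0.8  # default decay for undirected edges, as stated in the text

def build_network(undirected_edges, borrow_intent):
    """Return {node: [(neighbor, decay), ...]} for the propagation network."""
    network = defaultdict(list)
    for a, b in undirected_edges:
        network[a].append((b, DECAY_FACTOR))
        network[b].append((a, DECAY_FACTOR))
    for (src, dst), weight in borrow_intent.items():
        network[src].append((dst, weight))  # directed; decay is the edge weight
    return network

def spreading_activation(network, initial, max_iterations=3,
                         max_distance=2, stop_nodes=frozenset()):
    """Propagate activation from the weighted initial nodes."""
    activation = dict(initial)
    distance = {node: 0 for node in initial}
    frontier = set(initial)
    for _ in range(max_iterations):          # (1) iteration limit
        incoming = defaultdict(float)
        next_frontier = set()
        for i in frontier:
            # (2) path limit and (3) end-point limit: these nodes do not fire
            if i in stop_nodes or distance[i] >= max_distance:
                continue
            for j, decay in network.get(i, ()):
                incoming[j] += activation[i] * decay
                if j not in distance:
                    distance[j] = distance[i] + 1
                    next_frontier.add(j)
        # A[j] = A[j] + sum of decayed activation from firing neighbors
        for j, received in incoming.items():
            activation[j] = activation.get(j, 0.0) + received
        if not next_frontier:
            break
        frontier = next_frontier
    return activation

network = build_network([("b1", "class"), ("class", "b2")], {("b1", "b3"): 0.5})
scores = spreading_activation(network, {"b1": 4.0})
```

In this toy network, activation flows from book b1 through its class to book b2, and directly along the borrowIntent edge to b3. Marking "class" as a stop node would keep b2 from being reached at all.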

Step 4, re-rank the search results: according to the personalized score obtained by SA, reorder the results returned by the original search engine from high to low, and then return them to the user. The specific embodiment is as follows:

After SA finishes, personalized re-ranking can be regarded as the process of reordering the original search results according to the final personalized scores obtained by SA.
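The re-ranking step reduces to a sort by personalized score. In the sketch below, the function name rerank is hypothetical, and two details are assumptions: results that SA never reached get a score of 0, and ties keep the order given by the original search engine.

```python
def rerank(results, scores):
    """Order the engine's results by personalized score, highest first.

    results: book ids in the order returned by the original search engine.
    scores:  {book_id: activation value} produced by SA.
    """
    # sorted() is stable, so equal scores keep the engine's original order
    return sorted(results, key=lambda book_id: scores.get(book_id, 0.0), reverse=True)

results = ["x", "y", "z"]
reranked = rerank(results, {"y": 2.0, "z": 2.0})
```

Here "y" and "z" move ahead of "x" and stay in their original relative order.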

The actual evaluation results of the method are as follows:

Evaluation metric. The evaluation metric is Discounted Cumulative Gain (DCG). Jaime Teevan of the Massachusetts Institute of Technology proposed, in J. Teevan, S. T. Dumais, E. Horvitz, Personalizing Search via Automated Analysis of Interests and Activities, Proceedings of the 28th Annual International ACM SIGIR, Salvador, 2005, New York, ACM Press, 2005: 178-185, a way of evaluating a personalized retrieval system that combines manual scoring of query results with the DCG formula. The method assigns different importance to different web pages according to their position in the ranking: the higher a result is ranked, the more important it is, and the more the user's score for it affects the measured system performance. The DCG formula therefore combines the user's score for each result with the result's rank position, and the computed value serves as the metric of system performance.

In the evaluation of the invention, among the search results provided, a book the user actually borrowed has G(i)=3 and any other result has G(i)=1. DCG is computed iteratively as follows:

DCG(i) = \begin{cases} G(1) & i = 1 \\ DCG(i - 1) + G(i) / \log(i) & i > 1 \end{cases}

Because the number of results returned differs from search to search, normalization is also needed. The higher the relevant results are ranked, the closer the search is to ideal; the DCG(i) of the ideal ranking is taken as idealDCG(i), and the final evaluation formula is as follows.

normalizedDCG(i) = \frac{DCG(i)}{idealDCG(i)}

Clearly, the higher the normalized DCG (Norm DCG for short), the better the search results match user interest and the better the personalized search performs.
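The DCG recurrence and its normalization can be sketched as follows. The text does not state the base of the logarithm, so base 2 is used here as an assumption and exposed as a parameter; the function names dcg and normalized_dcg are illustrative. The ideal ranking is obtained by sorting the gains in descending order.

```python
import math

def dcg(gains, base=2):
    """DCG over a ranked list of gains G(1), G(2), ... (log base is an assumption)."""
    total = 0.0
    for i, gain in enumerate(gains, start=1):
        # DCG(1) = G(1); DCG(i) = DCG(i-1) + G(i) / log(i) for i > 1
        total += gain if i == 1 else gain / math.log(i, base)
    return total

def normalized_dcg(gains, base=2):
    """Divide DCG by the DCG of the ideal ordering (highest gains first)."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal else 0.0

# G = 3 for a book the user actually borrowed, 1 for any other result
best = normalized_dcg([3, 1, 1])
worse = normalized_dcg([1, 1, 3])
```

A ranking that places the borrowed book first reaches the maximum value of 1, while placing it third scores lower. With base 2, positions 1 and 2 carry the same discount, so swapping the top two results does not change the score.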

The test data for the evaluation are the real log data of the Peking University Library from January 2008 to June 2008. The method of the invention is compared with three other methods, each described below:

"Lucene/VSM" method: the original search results produced by the Lucene Score API of the open-source search engine Lucene, which uses the Vector Space Model (VSM). Lucene's results are used for comparison because Lucene is adopted as the indexing and search engine by many digital libraries around the world, such as the National Library of Florence (http://www.planetware.com/florence/national-library-i-to-fbc.htm) and the New York Public Library (http://www.nypl.org/). Lucene's results therefore simulate, as closely as possible, the search results of real libraries.

"SA" method: similar to the method used by Ahu Sieg in Ahu Sieg, Bamshad Mobasher, Robin Burke, Web search personalization with ontological user profiles, CIKM, Lisbon, Portugal, 2007: the domain ontology has no borrowIntent, and user interest is not classified.

"SA+B" method: borrowIntent is added to the domain ontology, but user interest is not classified.

The method of the invention, "SA+B+S": borrowIntent is added to the domain ontology and user interest is also classified. The performance comparison of the methods is shown in the following table:

Table 3: Performance comparison of the methods

As the table above shows, the method of the invention performs best. Compared with Lucene's original results, re-ranking with the SA method raises the mean Norm DCG by 12.9%; by introducing borrowIntent and classifying and weighting user interest when the user interest model is built, the method of the invention reaches a mean Norm DCG of 0.848.

In actual searches, users often browse only the top-ranked Top N results listed on the first two pages. Experiments were therefore designed to compare the performance of the method of the invention and the three methods above when only the Top N results are taken; the results are shown in Fig. 3 and Fig. 4. The curves in Fig. 3 and Fig. 4 show clearly that the method of the invention outperforms the other methods and greatly raises the rank of the results the user is interested in.

The invention is not limited to the embodiment above; the method is equally applicable to extended, user-relevance-based sales of electronic products, e-books, mobile phones and the like. Furthermore, the above is merely a preferred embodiment of the invention and is not intended to limit its scope of practice. That is, all equivalent changes and modifications made within the scope of the claims of the invention are covered by the claims of the invention.

Claims

1. A personalized search method based on a book domain ontology, characterized in that it comprises an offline part, in which the user model and the domain ontology are built, and an online part, in which the personalized score is computed and the search results are re-ranked, the specific steps being as follows:

Step 1, build the domain ontology: taking a specific domain as the object of description, build the domain ontology according to the classification thesaurus and, based on the users' history, add the idea of collaborative filtering, thereby providing the concept definitions of this specific domain and the semantic relations between the concepts;

Step 2, build the user model: analyze and process the logs, examine the user's history, and build the user model in chronological order;

Step 3, compute the personalized score: based on the domain ontology and user model that have been built, compute the personalized score with the graph mining algorithm SA;

Step 4, re-rank the search results: according to the personalized score obtained by SA, reorder the results returned by the original search engine from high to low, and then return them to the user.

2. The personalized search method based on a book domain ontology according to claim 1, characterized in that: the domain ontology in step 1 is the propagation network of the search algorithm; by providing rich semantic information and strengthening the semantic relations between entities, it overcomes polysemy, synonymy and word dependence in the original search results and thereby eliminates semantic ambiguity; and by introducing the idea of collaborative filtering, it can reflect the mutual influence between users.

3. The personalized search method based on a book domain ontology according to claim 1, characterized in that: in step 2, because a reader's interest changes continuously over time, user interest is classified and weighted in chronological order, thereby building a user model that reflects user interest preference.

4. The personalized search method based on a book domain ontology according to claim 3, characterized in that classifying and weighting user interest in chronological order means dividing interest, according to the borrowing period, into three classes, namely instant interest, recent interest and long-term interest, and giving these three classes weights from high to low.

5. The personalized search method based on a book domain ontology according to claim 1 or 4, characterized in that: in step 2, the user model, by analyzing the user's history, identifies the user's behavioral preferences during retrieval from the user's habits and reflects the updating and shifting of user interest, so that user interest preference is used to provide a personalized service.

6. The personalized search method based on a book domain ontology according to claim 1, characterized in that: in step 3, the specific calculation steps of the graph mining algorithm SA are as follows:

First, the domain ontology is treated as a graph and the graph mining algorithm SA is run on it, with the books that have been given weights in the user model as the initial nodes of the spreading;

Then, an iteration limit, a propagation path limit and a propagation end-point limit are set for the SA algorithm to improve its efficiency;

Finally, the score of each node is updated iteratively with the score update formula until the algorithm finishes.

7. The personalized search method based on a book domain ontology according to claim 1 or 6, characterized in that: after each iteration of the graph mining algorithm SA, user feedback is collected and the activation values of the initial nodes for the next propagation are updated; the collected feedback comprises the books whose links the user has newly clicked and the books the user has newly borrowed, both of which are treated as instant interest and given weights.

8. A personalized search system based on a book domain ontology, characterized in that it comprises:

9. The personalized search system based on a book domain ontology according to claim 8, characterized in that: the graph mining algorithm SA in the personalized score computing module comprises the following units: