CN102915335B

CN102915335B - Information association method based on user operation records and resource content

Info

Publication number: CN102915335B
Application number: CN201210345320.1A
Authority: CN
Inventors: 杨智强; 殷钊; 王衡; 汪国平
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2012-09-17
Filing date: 2012-09-17
Publication date: 2016-04-27
Anticipated expiration: 2032-09-17
Also published as: CN102915335A

Abstract

The present invention relates to an information association method based on user operation records and resource content. First, from the user's operation history on a personal computer and the content of the resources involved in those operations, the method automatically mines a task model based on the operation records and a topic model based on the resource content. It then combines the task model and the topic model to measure the association between pieces of information. Finally, when the user uses a resource, it finds the other resources most relevant to the current resource and recommends them to the user. The whole process requires no additional operation by the user. The invention aims to reduce the time the user spends searching for files, to preserve the continuity of the user's tasks as far as possible, and to ease the burden of switching tasks.

Description

Information association method based on user operation records and resource content

Technical field

The present invention relates to an information association method based on user operation records and resource content in an operating system environment, and belongs to the technical field of computer software.

Background art

The amount of personal information held by users keeps growing. Excessive information causes problems such as wasted time, delayed decisions, inability to focus on the main task, and stress; see Waddington, P. (1997). Dying for Information? A report on the effects of information overload in the UK and worldwide. Reuters, 1997. Knowledge workers, such as professors, lawyers and engineers, feel information overload most acutely, because their daily work involves many different tasks, and carrying out each task requires searching for and processing a large amount of information. This inevitably creates a problem: when a task is interrupted or switched, considerable effort must be spent retrieving the information or resources relevant to the current task.

Information is growing explosively, and current operating systems do not provide the information management users need, so the inability of individual users to obtain and manage personal information effectively has become a prominent problem. Personal information management is the research field concerned with helping people solve this problem. Providing users with good personal information management runs into a number of psychological challenges, which can be summarized in two points. First, classifying items (such as files) is cognitively difficult. Second, the details a user remembers about an item often cannot be used for retrieval. Current research proposes many solutions that address these two challenges.

Providing users with better ways of organizing and presenting information is an important research direction in personal information management. The project folders implemented by Ofer Bergman et al. (see Ofer Bergman, Ruth Beyth-Marom, Rafi Nachmias, The Project Fragmentation Problem in Personal Information Management, CHI 2006 Proceedings, 2006) store all of a user's information on the same subject (documents, e-mails, bookmarked pages and so on) in one folder, so that the user can store and retrieve information on the same subject under one directory.

Besides better organization and presentation, more powerful information retrieval is another important means of personal information management. Dumais et al. implemented the Stuff I've Seen (SIS) system; for the implementation see Dumais, S. T., Cutrell, E., Cadiz, J. J., Jancke, G., Sarin, R. and Robbins, D. C. (2003). Stuff I've Seen: A system for personal information retrieval and re-use. In Proc. SIGIR 2003, 72-79. The design of SIS has two key aspects. One is to provide a unified index for information with different organizational structures, so that unified retrieval can be built on it. The other is to support retrieval using contextual information that users remember more easily, such as browsing time and file author.

Approaches based on information organization and presentation require the user to classify resources in advance, and so fail to free the user from a heavy interaction burden. Information retrieval reduces the cost of finding resources to some extent, but frequent searches interrupt the user's task and prevent concentration on the work at hand.

An information association method based on user operation records and resource content can provide computer users with a real-time and accurate resource recommendation service, solving the problems described above.

Summary of the invention

The object of the invention is to provide a resource information association method. The invention is mainly intended for personal computers. Based on the user's past operation records and the content of the resources accessed, it recommends relevant information to the user before the user searches for a resource, saving the user the time cost of searching for information.

To achieve the above object, the technical solution of the invention is an information association method based on user operation records and resource content, comprising the steps of:

1) monitoring a plurality of operation events of the user on a computer, obtaining resource content and operation records, and storing them in a local or remote database;

2) converting the operation records into vectors of a specific format and building a task model based on the operation records;

2-1) performing time slice sequence cutting and vector conversion on the operation records;

2-2) building the task model with a latent Dirichlet allocation (LDA) model, taking the operation events as data and the time slice as the unit;

3) building a topic model based on resource content from the resource content;

3-1) converting the content of each resource into a word frequency vector according to the word set extracted from the resource content and the vocabulary;

3-2) representing the word frequency vectors with a latent Dirichlet allocation model to build the topic model;

4) computing the degree of association between the current resource and each other resource under the topic model and the task model respectively, completing the information association, and returning the resources with the highest degree of association to the user.

The operation events comprise: an open-resource event, a close-resource event, and an event of switching from one resource to another. The resource content comprises documents and web pages.

The attributes collected for document-related operation events comprise the time, the event type, the resource name and the resource path; the attributes collected for web-page-related operation events comprise the time, the event type, the page title and the page URL.

The time slice sequence cutting method is:

i) counting all resources in the operation records and forming a vocabulary from them, in which each resource has its own number;

ii) defining the sample vector A_j = {a_1, a_2, ..., a_n, ..., a_N} to represent the state of all resources at the j-th sampling, where a ∈ {0, 1}, n is the number of the resource an operation event refers to, N is the total number of resources, and j denotes the j-th sampling;

iii) sampling the time slice with period c to obtain the time slice sequence K, where m is the total number of vectors, i is the sample index, t is the length of the time slice, and c is the sampling period.

Extracting the resource content comprises: removing punctuation, Chinese word segmentation, removing stop words, and building the vocabulary to obtain word frequency vectors. These operations convert the content of each resource into a word frequency vector.

Preferably, the task model yields the task distribution probability for a given time slice, the resource distribution probability for a given task, and the distribution probability of tasks given a certain resource.

Preferably, the topic model yields the topic distribution probability for a given resource and the word distribution probability for a given topic.

Preferably, the degree of association is computed by using the Kullback-Leibler distance to measure the similarity between the probability distributions of the current resource and each other resource under the task model and the topic model, and weighting the two results to obtain a total distance.

Preferably, parameters are estimated by Gibbs sampling when learning the task model and the topic model.

Preferably, the user's computer runs Windows or Android.

The beneficial effects of the invention are:

The invention provides a resource association and recommendation method. From the user's operation history on a personal computer and the content of the resources involved, it automatically mines a task model based on the operation history and a topic model based on resource content, so that when the user uses a resource it can automatically recommend other resources that may be related, with no additional operation required. The invention aims to reduce the time the user spends searching for files, to preserve the continuity of the user's tasks as far as possible, and to ease the burden of switching tasks.

Brief description of the drawings

Fig. 1 is the system architecture diagram of an embodiment of the information association method based on user operation records and resource content;

Fig. 2 is the flow chart of the information acquisition module in the embodiment;

Fig. 3 is the flow chart of the information management module in the embodiment;

Fig. 4 is an example of time slice cutting applied to operation events by the method;

Fig. 5 is the flow chart of resource content preprocessing in the method;

Fig. 6 shows the correspondence between the task model of the invention and the latent Dirichlet allocation (LDA) model;

Fig. 7 is the flow chart of the recommendation module;

Fig. 8 is an example of the recommendation system interface.

Detailed description

Principle of the invention

From the user's operation history on a personal computer and the content of the resources involved in those operations, the invention automatically mines a task model based on the operation records and a topic model based on the resource content. It then combines the task model and the topic model to measure the association between pieces of information and, when the user uses a resource, finds the other resources most relevant to the current resource and presents them to the user. The whole process requires no additional operation by the user.

a) In an operating system environment, the user's operation events on two or more target resources are monitored and the content of the resources is obtained.

b) The user's historical operation records are cut into a time slice sequence and converted, and a task model based on the operation records is then built with a latent Dirichlet allocation model. For the algorithm see [Blei2002] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2002). Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.

c) The content of the resources involved in the user's operation records is extracted, preprocessed and converted into word frequency vectors, and a topic model based on resource content is then built with a latent Dirichlet allocation model.

d) The relevance between resources is measured according to the task model based on operation records and the topic model based on resource content.

The user operation events in step a comprise: an open-resource event, a close-resource event, and an event of switching from one resource to another.

Step a monitors the two or more target resources as follows: the user's operation events are monitored; those events (the three kinds listed above) are converted into interaction data; the content of the resources involved in the operation events is converted into content data; the interaction data is used to filter the operations the user performed on the two or more target resources; and the interaction data and content data as a whole can be stored and backed up in a local or remote database.

In summary, the method of the invention can be implemented as follows:

1) The information acquisition module monitors the user's operation events on the computer, obtains the content of the resources involved, and sends the operation events and resource content to the information management module.

2) The information management module receives the operation events and resource content from the information acquisition module, records them in a database, and responds to data query requests from the data preprocessing module.

3) The data preprocessing module filters and converts the operation data and resource content obtained by query, then passes the operation data and resource content in a specific format to the data mining module.

4) The data mining module learns from the operation data with a predetermined algorithm to generate the task model, and learns from the resource content with a predetermined algorithm to generate the topic model. It then passes both models to the recommendation module.

5) After obtaining the two models, the recommendation module selects the top N resources most relevant to the current resource according to the specified recommendation strategy and recommends them.

The resources in operation events include, but are not limited to, the following kinds: documents and web pages.

The information acquisition module works as follows:

1) It collects the user's operation events on the computer and the associated content; different operation events require different attributes. The attributes collected for document-related operation events comprise the time, the event type, the resource name and the resource path; the attributes collected for web-page-related operation events comprise the time, the event type, the page title and the page URL.

2) It sends the collected operation events and resource content to the information management module.
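The attributes listed above map naturally onto a small record type. The following is a minimal sketch in Python; the names `EventType`, `OperationEvent` and `is_web_page`, and the choice to store the path or URL in one `location` field, are illustrative assumptions and do not appear in the patent.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    # The three event kinds named in the text.
    OPEN = "open"
    CLOSE = "close"
    SWITCH = "switch"


@dataclass(frozen=True)
class OperationEvent:
    timestamp: float       # time of the event, in seconds
    event_type: EventType  # open, close or switch
    title: str             # resource name, or page title for web pages
    location: str          # file path for documents, URL for web pages

    @property
    def is_web_page(self) -> bool:
        # Assumption: web pages are recognized by their URL scheme.
        return self.location.startswith(("http://", "https://"))


doc_event = OperationEvent(0.0, EventType.OPEN, "report.docx", r"C:\docs\report.docx")
web_event = OperationEvent(5.0, EventType.SWITCH, "Example", "https://example.com/")
```

Keeping one record type for both documents and web pages means later stages can treat every resource uniformly and key it by `location`.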

The information management module works as follows:

1) It receives the operation events and resource content sent by the information acquisition module and stores them in a database.

2) It responds to data query requests from the data preprocessing module by sending back the operation events and resource content for the time period specified in the request.
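The store-then-query behaviour of this module can be sketched with a local SQLite database. This is only an illustration under assumptions: the patent does not name a database engine, and the table layout and the function names `open_store`, `record_event` and `query_period` are hypothetical. Resource content is left out for brevity.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the event store; an in-memory database by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "ts REAL, event_type TEXT, title TEXT, location TEXT)"
    )
    return conn


def record_event(conn, ts, event_type, title, location):
    """Persist one operation event received from the acquisition module."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (ts, event_type, title, location),
    )
    conn.commit()


def query_period(conn, start, end):
    """Return all events with start <= ts < end, ordered by time."""
    cur = conn.execute(
        "SELECT ts, event_type, title, location FROM events "
        "WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


conn = open_store()
record_event(conn, 10.0, "open", "a.txt", "/home/u/a.txt")
record_event(conn, 20.0, "switch", "b.txt", "/home/u/b.txt")
record_event(conn, 99.0, "close", "a.txt", "/home/u/a.txt")
rows = query_period(conn, 0.0, 50.0)  # only the first two events fall in range
```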

The preprocessing module has two parts: one preprocesses the historical data of operation events, and the other preprocesses resource content.

Operation events are preprocessed as follows:

The historical data of the user's operation events is cut into time slices, and the operation events in each time slice are represented as word frequencies. First, all resources involved in the operation data are counted; together they form a vocabulary in which each resource has a unique number. An operation event is represented by the number of the resource it operates on. Two parameters are needed: the length t of a time slice, and the sampling period c, meaning that a sample is taken every time a period c elapses; both t and c are expressed in seconds. In addition, the sample vector A_j = {a_1, a_2, ..., a_n, ..., a_N} is defined, where a ∈ {0, 1}, n is the number of the resource an operation event refers to, N is the total number of resources, and j denotes the j-th sampling; A_j reflects the state of all resources at the j-th sampling. Sampling a time slice of length t with period c yields m vectors A. The time slice sequence K is then defined from these vectors; it describes the running state of each resource within the time slice, where i denotes the i-th sample and m is the number of samples in the time slice. In this way, with the time slice as the unit and the operation events within it as data, the time slice sequence is converted into a word frequency vector.
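The sampling procedure can be sketched as follows. Because the formulas for m and K are not reproduced in this text, the sketch makes three labeled assumptions: m = t // c samples per slice, a resource's state is 1 while it is open, and the per-slice word frequency of a resource is the number of samples in which it was open. The function names are hypothetical.

```python
def sample_time_slice(events, vocab, start, t, c):
    """Return the sample vectors A_j for the time slice [start, start + t).

    events: list of (time, kind, resource) sorted by time, kind is
            "open" or "close" (switch events are omitted in this sketch).
    vocab:  list of resource names; the list index is the resource number.
    """
    open_resources = set()
    idx = 0
    samples = []
    m = t // c  # assumption: number of samples per slice
    for j in range(m):
        sample_time = start + j * c
        # Replay every event that happened up to this sample time.
        while idx < len(events) and events[idx][0] <= sample_time:
            _, kind, res = events[idx]
            if kind == "open":
                open_resources.add(res)
            elif kind == "close":
                open_resources.discard(res)
            idx += 1
        samples.append([1 if r in open_resources else 0 for r in vocab])
    return samples


def slice_to_counts(samples):
    """Assumption: sum the sample vectors to get per-resource frequencies."""
    return [sum(col) for col in zip(*samples)]


vocab = ["a.doc", "b.doc", "c.html"]
events = [(0, "open", "a.doc"), (12, "open", "b.doc"), (25, "close", "a.doc")]
samples = sample_time_slice(events, vocab, start=0, t=40, c=10)
counts = slice_to_counts(samples)
# samples -> [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
# counts  -> [3, 2, 0]
```

Each `counts` vector then plays the role of one document's word frequency vector when the task model is trained.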

Resource content is preprocessed as follows:

1) Remove punctuation

Punctuation carries no meaning of its own and harms content analysis. It can be removed with a filter, each punctuation mark being replaced by a space.

2) Chinese word segmentation

A Chinese word segmentation program is used to segment the resource content.

3) Remove stop words

Stop words are words that occur frequently in a text but carry little meaning, mainly adverbs, function words and modal particles (see Hua Bolin, Stop word processing techniques in knowledge extraction, New Technology of Library and Information Service, 2007, No. 8, 48-51). These words interfere with data mining and must be deleted. Deletion uses a stop word list: every word that appears in the list is deleted and its original position is replaced with a space.

4) Build the vocabulary and obtain word frequency vectors

All distinct words across all texts are collected to form a vocabulary, in which each word has a unique number. After segmentation, the content of each resource is a set of words. For each resource, its content is converted into a word frequency vector according to its word set and the vocabulary.
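The four preprocessing steps can be sketched end to end as below. This is an illustration under assumptions: whitespace splitting stands in for the Chinese word segmentation program (which the patent does not name), and the stop word list and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "of", "and"}


def tokenize(text):
    # Step 1: replace each punctuation mark with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Step 2: stand-in for Chinese word segmentation (split on whitespace).
    tokens = cleaned.split()
    # Step 3: drop stop words.
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_vocabulary(token_lists):
    # Step 4a: give every distinct word a unique number.
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_frequency_vector(tokens, vocab):
    # Step 4b: count occurrences of each vocabulary word in one resource.
    counts = Counter(tokens)
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        vec[vocab[tok]] = n
    return vec


docs = ["The cat, the dog.", "A dog and a dog!"]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [to_frequency_vector(t, vocab) for t in token_lists]
# vocab   -> {"cat": 0, "dog": 1}
# vectors -> [[1, 1], [0, 2]]
```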

The data mining module works as follows:

Once the vectorized data is available, learning with the latent Dirichlet allocation model yields distribution probabilities. The task model yields the task distribution probability for a given time slice and the resource distribution probability for a given task, together with the distribution probability of tasks given a certain resource. The topic model yields the topic distribution probability for a given resource and the word distribution probability for a given topic.
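As an illustration of how such distributions can be learned, the following is a minimal collapsed Gibbs sampler for LDA in pure Python. It is a sketch, not the patent's implementation: the hyperparameter values, iteration count and function name `lda_gibbs` are assumptions, and documents are given as lists of word numbers (for the task model, a document is a time slice and a word is a resource).

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Return (theta, phi): per-document topic and per-topic word distributions."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # counts n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n_kw
    topic_total = [0] * num_topics                              # counts n_k
    assignments = []
    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this occurrence.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi


docs = [[0, 0, 1, 1], [0, 1, 0], [2, 3, 3], [3, 2, 2, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=4, iters=100)
```

Here `theta[d]` corresponds to the task (or topic) distribution of time slice (or resource) d, and `phi[k]` to the resource (or word) distribution of task (or topic) k. A word frequency vector is turned into the list form used here by repeating each word number as many times as its count.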

The recommendation module works as follows:

Related resources are recommended for the current resource on the basis of the degree of association between resources. The basic approach is to rank resources by their degree of association. The recommendation module obtains the task model and the topic model from the data mining module; the two models measure the degree of association between resources from two different angles, operation and content. The recommendation module weights the two degrees of association to obtain a total degree of association, then ranks and recommends resources accordingly.

The task model maps time slice sequence, task and resource onto document, topic and word in the latent Dirichlet allocation model, so measuring the degree of association between resources is equivalent to measuring the degree of association between words. The association between two words can be measured by how similar they are across topics. Given words w_1 and w_2, their degree of association can be measured through the conditional topic distributions P(Z|w_1) and P(Z|w_2).
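One way to obtain the conditional distribution P(Z|w) from a learned model is Bayes' rule, P(z|w) ∝ P(w|z) P(z). The sketch below assumes this route and assumes a topic prior P(z) is available (for example from topic counts); the patent does not state how P(Z|w) is computed, and the function name is hypothetical.

```python
def topic_given_word(phi, topic_prior, w):
    """Compute P(z | w) for word number w via Bayes' rule.

    phi[k][w] is P(w | z = k); topic_prior[k] is P(z = k).
    """
    joint = [phi[k][w] * topic_prior[k] for k in range(len(phi))]
    total = sum(joint)
    return [x / total for x in joint]


# Toy model with two tasks and three resources.
phi = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
prior = [0.5, 0.5]
p_z_w0 = topic_given_word(phi, prior, 0)  # resource 0 only occurs in task 0
p_z_w1 = topic_given_word(phi, prior, 1)  # resource 1 is shared by both tasks
```

In practice the smoothed estimates produced by LDA contain no exact zeros, which keeps a subsequent KL computation finite.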

In the topic model, measuring the degree of association between resources means measuring the degree of association between documents. In the latent Dirichlet allocation model, a document is reduced from a very high-dimensional word vector to a relatively low-dimensional topic vector. The similarity between two documents can therefore be computed from their topic probability distributions.

The standard way to measure the difference between two distributions is to compute their KL distance; see Steyvers, M., & Griffiths, T. (2007). Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, in press. The Kullback-Leibler distance, also called the KL divergence (Kullback-Leibler divergence), measures the difference between two probability distributions over the same event space. Given two probability distributions p and q, their KL distance is computed as:

D(p, q) = \sum_{i=1}^{T} p_{i} \log_{2} \frac{p_{i}}{q_{i}}

where p_i and q_i are the i-th components of the probability distribution vectors p and q, and T is the dimension of p and q.

The formula shows that when p and q are equal in every dimension, that is, when the two distributions are identical, D(p, q) = 0. The KL distance is asymmetric: in general D(p, q) ≠ D(q, p). The recommendation application therefore uses a symmetric distance based on the KL distance, computed as:

KL(p, q) = \frac{1}{2} [D(p, q) + D(q, p)]
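The two formulas translate directly into code. The sketch below follows them term by term; the function names are illustrative, and it assumes every component of q is strictly positive wherever p is positive (which holds for smoothed LDA estimates).

```python
import math


def kl_divergence(p, q):
    """D(p, q) = sum_i p_i * log2(p_i / q_i), with 0 * log 0 taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """KL(p, q) = 1/2 * [D(p, q) + D(q, p)]."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
sym = symmetric_kl(p, q)
```

For these two distributions `d_pq` and `d_qp` differ, which illustrates the asymmetry, while `symmetric_kl` gives the same value for either argument order.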

For any two resources, the task model and the topic model each give a KL distance that measures the difference between their probability distributions. Let KL_1 be the distance obtained from the task model and KL_2 the distance obtained from the topic model; the total distance L is computed as:

L = \alpha \cdot KL_{1} + \beta \cdot KL_{2}

where α and β are configurable parameters, which can be set empirically or according to user preference.

A resource with a smaller total distance L is more similar to the current resource, so its degree of association is higher and it should be recommended with higher priority.
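The weighting and ranking step can be sketched as follows. The distributions are toy values, the default weights α = β = 0.5 are an arbitrary assumption, and the function names are hypothetical; the sketch only shows how the total distance orders candidate resources.

```python
import math


def symmetric_kl(p, q):
    """Symmetric KL distance, using base-2 logarithms as in the text."""
    def d(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * (d(p, q) + d(q, p))


def recommend(current, task_dist, topic_dist, alpha=0.5, beta=0.5, top_n=3):
    """Rank the other resources by total distance L = alpha*KL1 + beta*KL2."""
    scored = []
    for res in task_dist:
        if res == current:
            continue
        kl1 = symmetric_kl(task_dist[current], task_dist[res])    # task model
        kl2 = symmetric_kl(topic_dist[current], topic_dist[res])  # topic model
        scored.append((alpha * kl1 + beta * kl2, res))
    scored.sort()  # smaller total distance means higher association
    return [res for _, res in scored[:top_n]]


# Toy per-resource distributions: P(task | resource) and P(topic | resource).
task_dist = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9]}
topic_dist = {"a": [0.7, 0.3], "b": [0.6, 0.4], "c": [0.2, 0.8]}
ranking = recommend("a", task_dist, topic_dist)
# ranking -> ["b", "c"]
```

Raising α relative to β makes the ranking follow usage patterns more closely, while raising β favours resources with similar content.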

The invention is further described below through an embodiment, with reference to the drawings.

The method of the invention is implemented by the embodiment system shown in Fig. 1.

As shown in Fig. 1, the embodiment system mainly comprises: an information acquisition module, which monitors the user's operation events, obtains the resource content involved, and sends both to the information management module; an information management module, which receives operation events and resource content and responds to data query requests; a data preprocessing module, which converts operation events and resource content into a specific format and passes them to the data mining module; a data mining module, which learns from the operation data and resource content with a predetermined algorithm to produce the task model and the topic model respectively; and a recommendation module, which provides a list of the resources most relevant to the current resource according to the specified recommendation strategy.

The internal flow of each module is described below.

The information acquisition module (Fig. 2) works in the background. It collects the user's operation events on the computer in real time, including opening a resource, closing a resource, and switching from one resource to another, and obtains the resource content involved. The attributes collected for document-related operation events comprise the time, the event type, the resource name and the resource path; the attributes collected for web-page-related operation events comprise the time, the event type, the page title and the page URL. The collected operation events and resource content are sent to the information management module in real time.

The information management module (Fig. 3) receives operation events and resource content from the information acquisition module in real time. First, it converts operation events into operation data and records the operation data and resource content in the database; the data in the database is also available to other applications. Then it responds to data query requests from the data preprocessing module, returning the operation data and resource content for the time period specified in the request.

The data preprocessing module handles two kinds of preprocessing. On the one hand, it preprocesses the operation data obtained by query, using the method for preprocessing operation events described above: the historical data of the user's operation events is cut into time slices, and the operation events in each time slice are represented as word frequencies, finally giving a word frequency vector. Fig. 4 shows an example of time slice cutting applied to operation events.

On the other hand, it preprocesses resource content, using the method for preprocessing resource content described above (Fig. 5): the four steps of removing punctuation, Chinese word segmentation, removing stop words, and building the vocabulary to obtain word frequency vectors convert the content of each resource into a word frequency vector.

The resulting word frequency vectors of the operation data and of the resource content are supplied to the data mining module.

The data mining module uses the latent Dirichlet allocation (LDA) model to learn from the vectorized data and obtains distribution probabilities, which the recommendation module uses to compute the degree of association between two resources.

In the task model, each task to be mined corresponds to a topic in the LDA model, the vector describing each time slice corresponds to a document, and each resource corresponds to a word (Fig. 6). The LDA model is learned by estimating its parameters with Gibbs sampling, which gives the task distribution probability for a given time slice and the resource distribution probability for a given task, as well as the distribution probability of tasks given a certain resource.

In the topic model, the vector describing each resource's content corresponds to a document in the LDA model, and each word obtained by segmenting the resource content corresponds to a word. As with the task model, the LDA model is learned by estimating its parameters with Gibbs sampling, giving the topic distribution probability for a given resource and the word distribution probability for a given topic.

The recommendation module (Fig. 7) refreshes the recommendation interface in real time and recommends related resources for the current resource on the basis of the degree of association between resources. The task model and the topic model measure this degree of association from two different angles, operation and content, so the recommendation strategy weights the degrees of association from the two models to obtain a total degree of association, then ranks and recommends resources accordingly.

In the task model, a resource corresponds to a word in the LDA model, so the degree of association between resources is the degree of association between words. The association between two words can be measured by the similarity of their conditional topic distributions. Given words w_1 and w_2 with conditional topic distributions P(Z|w_1) and P(Z|w_2), the KL distance between P(Z|w_1) and P(Z|w_2) measures the difference between them, and hence the degree of association between w_1 and w_2, that is, between the resources they correspond to.

In the topic model, a resource corresponds to a document in the LDA model, so measuring the degree of association between resources means computing the similarity between two documents. Given documents d_1 and d_2 with topic probability distributions P(Z|d_1) and P(Z|d_2) obtained from the topic model, their KL distance likewise measures the difference between them, and hence the degree of association between the resources corresponding to d_1 and d_2.

The KL distance is computed as described above. While the user is using a resource, for each other resource in the user's operation history a KL distance measuring the difference from the current resource's probability distribution is computed in the task model and in the topic model. Let KL_1 be the distance obtained in the task model and KL_2 the distance obtained in the topic model; the total distance combining the two models is obtained by weighting, L = α·KL_1 + β·KL_2, where α and β are configurable parameters.

The smaller the total distance L, the higher the degree of association between a resource and the current resource. Finally, the recommendation system interface is refreshed in real time and shows the resource recommendation list in descending order of degree of association (that is, ascending order of total distance L). Fig. 8 shows an example of the recommendation system interface; the user can double-click a resource in the interface to open it.

Claims

1. An information association method based on user operation records and resource content, comprising the steps of:

4) computing the degree of association between the current resource and each other resource under the topic model and the task model respectively, completing the information association, and returning the resources with the highest degree of association to the user;

4-1) in the task model, obtaining the task distribution probability for a given time slice, the resource distribution probability for a given task, and the distribution probability of tasks given a certain resource;

4-2) in the topic model, obtaining the topic distribution probability for a given resource and the word distribution probability for a given topic;

4-3) computing the degree of association by using the Kullback-Leibler distance to measure the similarity between the probability distributions of the current resource and each other resource under the topic model and the task model, and weighting the results to obtain a total distance, a smaller total distance indicating a higher degree of association;

4-4) displaying the resource recommendation list in descending order of degree of association.

2. The information association method based on user operation records and resource content according to claim 1, characterized in that the operation events comprise: an open-resource event, a close-resource event, and an event of switching from one resource to another; and the resource content comprises documents and web pages.

3. The information association method based on user operation records and resource content according to claim 2, characterized in that the attributes collected for operation events related to the documents comprise the time, the event type, the resource name and the resource path, and the attributes collected for operation events related to the web pages comprise the time, the event type, the page title and the page URL.

4. The information association method based on user operation records and resource content according to claim 1, characterized in that the time slice sequence cutting method is:

ii) defining the sample vector A_j = {a_1, a_2, ..., a_n, ..., a_N} to represent the state of all resources at the j-th sampling, where a ∈ {0, 1}, n is the number of the resource an operation event refers to, N is the total number of resources, and j denotes the j-th sampling;

5. The information association method based on user operation records and resource content according to claim 1, characterized in that extracting the resource content comprises: removing punctuation, Chinese word segmentation, removing stop words, and building the vocabulary to obtain word frequency vectors, these operations converting the content of each resource into a word frequency vector.

6. The information association method based on user operation records and resource content according to claim 1, characterized in that parameters are estimated by Gibbs sampling when learning the task model and the topic model.

7. The information association method based on user operation records and resource content according to claim 1, characterized in that the user's computer runs Windows or Android.