CN102915335A

CN102915335A - Information associating method based on user operation record and resource content

Info

Publication number: CN102915335A
Application number: CN2012103453201A
Authority: CN
Inventors: 杨智强; 殷钊; 王衡; 汪国平
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2012-09-17
Filing date: 2012-09-17
Publication date: 2013-02-06
Anticipated expiration: 2032-09-17
Also published as: CN102915335B

Abstract

The invention relates to an information association method based on user operation records and resource content. The method comprises the following steps: first, a task model based on the user's operation records and a topic model based on resource content are automatically mined from the user's operation history on a personal computer and from the content of the resources involved in those operations; next, the association measures given by the task model and the topic model are combined; finally, when the user uses a resource, the other resources most relevant to the current resource are found and recommended to the user, with no extra operation required of the user at any point. The invention aims to save the time the user spends looking for files, to preserve the continuity of the user's tasks as far as possible, and to reduce the burden of switching between tasks.

Description

Information association method based on user operation record and resource content

Technical Field

The invention relates to an information association method based on user operation records and resource contents in an operating system environment, and belongs to the technical field of computer software.

Background

Today, users hold increasingly large amounts of personal information, and the excess causes problems such as wasted time, delayed decisions, inability to concentrate on the main task, and stress (see Waddington, P. (1997)). Knowledge workers such as professors, lawyers, and engineers experience information overload most acutely, because they perform a variety of different tasks in their day-to-day activities and must search for and process a large amount of information while doing so. As a result, a great deal of effort is inevitably spent recovering the information or resources related to the current task whenever a task is interrupted or switched.

The avalanche-like growth of information, together with the fact that current operating systems do not provide users with a mode of information management, has made prominent the problem that individual users cannot effectively acquire and manage their personal information. Personal information management is the research area concerned with helping people solve this problem. Providing users with sound personal information management runs into many psychological challenges, which can be attributed to two points: first, classifying items (e.g., documents) is cognitively very difficult; second, the details of an item that the user can remember are often not usable for retrieval. Current research has proposed many solutions aimed at these two challenges.

Providing better information organization and presentation for users is an important research direction in personal information management. The project folder implemented by Ofer Bergman et al. stores all of a user's information on the same subject (including documents, mails, collected web pages, etc.) in the same folder, so that the user can store and retrieve information on the same subject in the same directory; see Ofer Bergman, Ruth Beyth-Marom, Rafi Nachmias, The Project Fragmentation Problem in Personal Information Management, CHI 2006.

In addition to better organization and presentation of information, more powerful information retrieval is also an important means of personal information management. Dumais et al. implemented the Stuff I've Seen (SIS) system; see Dumais, S.T., Cutrell, E.S., Cadiz, J.J., Jancke, G.S., Sarin, R. and Robbins, D.C. (2003). Stuff I've Seen: A system for personal information retrieval and re-use. In Proc. SIGIR 2003, 72-79. The design of SIS has two key aspects. One is to provide uniform tagging for information with different organizational structures, so that uniform retrieval can be performed over the uniform tags. The other is to let the user search with context information that is easier to remember, such as the time of browsing or the author of the document.

Approaches based on information organization and presentation require the user to classify resources in advance and do not fundamentally relieve the user of a heavy interaction burden. Approaches based on information retrieval reduce the cost of finding resources to some extent, but frequent retrieval interrupts the user's task, so the user cannot concentrate on the current work.

The information association method based on the user operation record and the resource content can provide a real-time and accurate resource recommendation service for computer users, and solve the existing problems.

Disclosure of Invention

The invention aims to provide a resource information correlation method. The method is mainly applied to the personal computer, and relevant information is recommended to the user before the user searches for the resources according to the past operation records of the user and the accessed resource content, so that the time overhead of searching for the information is saved for the user.

In order to achieve the purpose, the technical scheme of the invention is as follows: the information association method based on the user operation record and the resource content comprises the following steps:

1) monitoring a plurality of operation events of a user in a computer, acquiring resource content and operation records and storing the resource content and the operation records in a local or remote database;

2) converting the operation records into specific format vectors, and establishing a task model based on the operation records;

2-1) carrying out time slice sequence segmentation and vector conversion on the operation records;

2-2) establishing a task model according to a latent Dirichlet allocation (LDA) model, with the time slice as the unit and the operation events as the data;

3) establishing a topic model based on the resource content;

3-1) converting the content of each resource into a word frequency vector according to the word set and the vocabulary table extracted from the resource content;

3-2) modeling the word frequency vectors with a latent Dirichlet allocation model to establish the topic model;

4) calculating the degree of association between the current resource and other resources in the topic model and in the task model respectively, completing the information association processing, and selecting the resources with the highest degree of association to return to the user.

The operation events include: a resource open event, a resource close event, and an event of switching from one resource to another; the resources include documents and web pages.

The attributes which need to be collected by the operation event related to the document comprise time, event type, title of the resource and path of the resource, and the attributes which need to be collected by the operation event related to the webpage comprise time, event type, title of the webpage and webpage URL.

The time slice sequence segmentation method comprises the following steps:

i) counting all resources in the operation records and forming them into a vocabulary table in which each resource has a unique number;

ii) defining a sampling vector A_j = {a₁, a₂, …, a_n, …, a_N} to represent the states of all resources at the j-th sampling, where a_n ∈ {0, 1}, n is the number of the resource corresponding to the operation event, N is the total number of resources, and j denotes the j-th sampling;

iii) sampling each time slice with period c to obtain the sliced time slice sequence

K = \sum_{i=1}^{m} A_{i}

where m = t / c is the total number of vectors, i is the sampling index, t is the length of the time slice, and c is the sampling period.

The extraction of the resource content comprises: removing punctuation marks, performing Chinese word segmentation, removing stop words, and counting a vocabulary table to obtain word frequency vectors; through these operations the content of each resource is converted into a word frequency vector.

Preferably, in the task model, the task distribution probability of a given time slice, the resource distribution probability of a given task, and the task distribution probability given the occurrence of a certain resource are obtained.

Preferably, in the topic model, a topic distribution probability of a given resource and a word distribution probability of a given topic are obtained.

Preferably, the degree of association is calculated as follows: the similarity between the probability distributions of the current resource and of other resources is calculated in the task model and in the topic model according to the Kullback-Leibler distance, and the two results are weighted to obtain the total distance.

Preferably, parameter estimation is carried out through Gibbs sampling in the learning of the task model and the topic model.

Preferably, the user computer is installed with a Windows or Android system.

The invention has the positive effects that:

The invention provides a resource association and recommendation method. From the user's operation history on a personal computer and the content of the resources involved in those operations, a task model based on the operation history and a topic model based on the resource content are automatically mined, so that other resources likely to be related to a resource can be recommended automatically while the user is using it, without any additional operation. The invention aims to save the time the user spends looking for files, to preserve the continuity of the user's tasks as far as possible, and to reduce the user's burden when switching tasks.

Drawings

FIG. 1 is a block diagram of a system architecture in an embodiment of an information association method based on user operation records and resource contents according to the present invention;

FIG. 2 is a flow diagram of an information collection module in an embodiment of the invention;

FIG. 3 is a flow diagram of an information management module in an embodiment of the invention;

FIG. 4 is an example of time slicing of operation events in the information association method based on user operation records and resource contents according to the present invention;

FIG. 5 is a flow chart of resource content preprocessing in the information association method based on user operation records and resource contents according to the present invention;

FIG. 6 is a diagram of the correspondence between the task model and the latent Dirichlet allocation (LDA) model of the present invention;

FIG. 7 is a recommendation module flow diagram of the present invention;

FIG. 8 is an example recommendation system interface of the present invention.

Detailed Description

Principle of the invention

From the user's operation history on a personal computer and the content of the resources involved in those operations, a task model based on the user's operation records and a topic model based on the resource content are automatically mined. The association measures given by the task model and the topic model are then combined, and when the user uses a resource, the other resources most relevant to the current resource are found and presented to the user. The user does not need to perform any additional operation in the whole process.

a) In the operating system environment, the operating events of a user on two or more target resources are monitored, and the content of the resources is obtained.

b) The historical data of the user's operation records is segmented into a time slice sequence and converted, and a task model based on the operation records is then established using the latent Dirichlet allocation model; for the specific algorithm see [Blei 2002] Blei, D.M., Ng, A.Y., & Jordan, M.I. (2002).

c) Content extraction, preprocessing, and word frequency vectorization are performed on the resources involved in the user's operation records, and a topic model based on the resource content is then established using the latent Dirichlet allocation model.

d) The relevance of resources is measured according to the task model based on the operation records and the topic model based on the resource content.

In step a, the user operation events comprise: resource open events, resource close events, and events of switching from one resource to another.

Step a monitors the two or more target resources as follows: monitoring the user's operation events; converting the three kinds of events above into interaction data; converting the content of the resources involved in the operation events into content data; and screening, by means of the interaction data, the user's operations on the two or more target resources. The interaction data and the content data may be stored together in a local or remote database for later use.

Overall, the process of the invention can be carried out in the following manner:

1) The information acquisition module monitors the user's operation events on the computer, acquires the content of the resources involved in those events, and sends the operation events and resource content to the information management module.

2) The information management module receives the operation events and resource content from the information acquisition module, records them in the database, and responds to data query requests from the data preprocessing module.

3) The data preprocessing module filters and converts the operation data and resource content obtained by query, and then passes the operation data and resource content, in a specific format, to the data mining module.

4) The data mining module learns from the operation data through a preset algorithm to generate a task model, and learns from the resource content through a preset algorithm to generate a topic model. The generated task model and topic model are then passed to the recommendation module.

5) After obtaining the two models, the recommendation module selects the top N resources most relevant to the current resource according to the specified recommendation strategy and recommends them.

Resources in an operational event include, but are not limited to, the following types of resources: documents, web pages.

The working method of the information acquisition module comprises the following steps:

1) Collect the user's operation events and the related content on the computer; different attributes need to be collected for different operation events. For operation events involving documents, the attributes to be collected are the time, the event type, the title of the resource, and the path of the resource; for operation events involving web pages, they are the time, the event type, the web page title, and the web page URL.

2) Send the collected operation events and resource content to the information management module.
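The attributes listed above can be pictured as a single event record. The following is only a minimal sketch: the class names `OperationEvent`, `EventType`, and `ResourceKind`, the single `location` field holding either a path or a URL, and the `resource_key` rule are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventType(Enum):
    OPEN = "open"
    CLOSE = "close"
    SWITCH = "switch"  # switching from one resource to another


class ResourceKind(Enum):
    DOCUMENT = "document"
    WEB_PAGE = "web_page"


@dataclass(frozen=True)
class OperationEvent:
    time: datetime          # when the event occurred
    event_type: EventType   # open, close, or switch
    kind: ResourceKind      # document or web page
    title: str              # resource title or web page title
    location: str           # file path for documents, URL for web pages

    def resource_key(self) -> str:
        # Assumption: a resource is identified by its path or URL.
        return self.location


events = [
    OperationEvent(datetime(2012, 9, 17, 9, 0, 0), EventType.OPEN,
                   ResourceKind.DOCUMENT, "report.doc", "C:/docs/report.doc"),
    OperationEvent(datetime(2012, 9, 17, 9, 0, 30), EventType.SWITCH,
                   ResourceKind.WEB_PAGE, "LDA overview", "http://example.com/lda"),
]
```

Keeping documents and web pages in one record type is a convenience of this sketch; it lets later steps treat every resource uniformly through `resource_key()`.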

The working method of the information management module comprises the following steps:

1) Receive the operation events and resource content sent by the information acquisition module and store them in the database.

2) Respond to data query requests from the data preprocessing module by sending it the operation events and resource content for the time period specified in the request.
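The two duties of the information management module (record, then answer time-period queries) could be sketched with a local SQLite database as follows. The table names, column layout, and function names are hypothetical; the source only states that a local or remote database is used.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open the database and create the two (hypothetical) tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS operation_event ("
        " ts REAL NOT NULL,"          # event time in seconds
        " event_type TEXT NOT NULL,"  # 'open', 'close' or 'switch'
        " title TEXT NOT NULL,"
        " location TEXT NOT NULL)"    # file path or URL
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resource_content ("
        " location TEXT PRIMARY KEY,"
        " content TEXT NOT NULL)"
    )
    return conn


def record_event(conn, ts, event_type, title, location, content=None):
    """Store one operation event and, if given, the resource content."""
    conn.execute("INSERT INTO operation_event VALUES (?, ?, ?, ?)",
                 (ts, event_type, title, location))
    if content is not None:
        conn.execute("INSERT OR REPLACE INTO resource_content VALUES (?, ?)",
                     (location, content))
    conn.commit()


def query_period(conn, start_ts, end_ts):
    """Return events with start_ts <= ts < end_ts and the content of the resources they touch."""
    events = conn.execute(
        "SELECT ts, event_type, title, location FROM operation_event"
        " WHERE ts >= ? AND ts < ? ORDER BY ts", (start_ts, end_ts)).fetchall()
    contents = {}
    for location in sorted({row[3] for row in events}):
        row = conn.execute("SELECT content FROM resource_content WHERE location = ?",
                           (location,)).fetchone()
        if row is not None:
            contents[location] = row[0]
    return events, contents


conn = create_store()
record_event(conn, 0.0, "open", "a.txt", "/docs/a.txt", "topic models for documents")
record_event(conn, 40.0, "switch", "b.txt", "/docs/b.txt", "task mining from logs")
record_event(conn, 500.0, "close", "a.txt", "/docs/a.txt")
events, contents = query_period(conn, 0.0, 100.0)
```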

The preprocessing module can be divided into two parts, one part is responsible for preprocessing historical data of the operation events, and the other part is responsible for preprocessing resource contents.

The working method for preprocessing the operation event comprises the following steps:

The historical data of the user's operation events is segmented into time slices, and the operation events in each time slice are converted into a word frequency representation. First, all resources involved in the operation data are counted and formed into a vocabulary table in which each resource has a unique number. An operation event is represented by the number of the resource it operates on. Two parameters need to be defined: the length t of a time slice and the sampling period c, meaning that one sample is taken every c; both t and c are in seconds. A sampling vector A_j = {a₁, a₂, …, a_n, …, a_N} is also defined, where a_n ∈ {0, 1}, n is the number of the resource corresponding to the operation event, N is the total number of resources, and j denotes the j-th sampling; A_j reflects the state of all resources at the j-th sampling. Sampling a time slice of length t with period c yields m vectors A, where

m = \frac{t}{c}

We finally define

K = \sum_{i=1}^{m} A_{i}

where K is a time slice sequence representing the operating state of each resource within a time slice, i denotes the i-th sample, and m is the number of samples within a time slice. In this way, with the time slice as the unit and the operation events within it as the data, the time slice sequence is converted into a word frequency vector.
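The slicing and sampling procedure can be sketched as below. Several points are assumptions made only for illustration: each slice contains m = t / c samples taken at offsets 0, c, 2c, …; K is the element-wise sum of the sampling vectors in the slice; and a resource's state is 1 when it is the most recently opened or switched-to resource that has not been closed. Events are simplified to `(timestamp, event_type, resource)` tuples sorted by time.

```python
def build_vocabulary(events):
    """Give every resource in the operation records a unique number."""
    vocab = {}
    for _, _, resource in events:
        vocab.setdefault(resource, len(vocab))
    return vocab


def focused_resource_at(events, ts):
    """Return the resource assumed to be in use at time ts, or None."""
    current = None
    for event_ts, event_type, resource in events:
        if event_ts > ts:
            break
        if event_type in ("open", "switch"):
            current = resource
        elif event_type == "close" and current == resource:
            current = None
    return current


def slice_to_frequency_vectors(events, start, end, t, c):
    """Cut [start, end) into slices of length t and sample each with period c."""
    if t % c != 0:
        raise ValueError("this sketch assumes t is a multiple of c")
    vocab = build_vocabulary(events)
    m = t // c  # number of sampling vectors per time slice
    slices = []
    slice_start = start
    while slice_start + t <= end:
        k = [0] * len(vocab)  # running sum of the sampling vectors A_i
        for i in range(m):
            resource = focused_resource_at(events, slice_start + i * c)
            if resource is not None:
                k[vocab[resource]] += 1  # A_i has a 1 at this resource's number
        slices.append(k)
        slice_start += t
    return vocab, slices


events = [(0, "open", "a"), (20, "switch", "b"), (50, "close", "b"), (60, "open", "a")]
vocab, slices = slice_to_frequency_vectors(events, start=0, end=120, t=60, c=10)
# vocab  -> {"a": 0, "b": 1}
# slices -> [[2, 3], [6, 0]]
```

Each inner list plays the role of a word frequency vector for one time slice, with resources in place of words.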

The working method for preprocessing the resource content comprises the following steps:

1) removing punctuation marks

Punctuation marks have no practical significance and can have negative effects on content analysis, and can be removed by adopting a filter method, and the original positions of the punctuation marks are replaced by blank spaces.

2) Chinese word segmentation

A Chinese word segmentation program is used to segment the resource content.

3) Removing stop words

Stop words are words that appear frequently in an article but carry no practical meaning, mainly adverbs, function words, and modal particles, such as "still", "of" and "o" (see Hua Bolin, Stop word processing technology in knowledge extraction, Modern Library and Information Technology, 2007, No. 8, 48-51). These words interfere with data mining and must be deleted. To delete them, each word that appears in the stop word list is removed and its original position is replaced by a blank space.

4) Counting the vocabulary to obtain word frequency vector

All different words of all articles are counted to form a vocabulary, and each word has a unique number in the vocabulary. After word segmentation, the content of each resource is a set of words. For each resource, the content of each resource is converted into a word frequency vector to be expressed according to the word set and the vocabulary of each resource.
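The four preprocessing steps can be sketched as follows. This is a simplified illustration: the whitespace `tokenize` function is only a stand-in for a real Chinese word segmentation program, and `STOP_WORDS` is a tiny hypothetical list rather than the stop word list the method refers to.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "and", "is"}  # hypothetical stop word list


def tokenize(text):
    """Stand-in for a Chinese word segmenter: lower-case and split on whitespace."""
    return text.lower().split()


def preprocess(text):
    """Steps 1-3: replace punctuation with spaces, segment, drop stop words."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return [word for word in tokenize(no_punct) if word not in STOP_WORDS]


def build_term_vectors(documents):
    """Step 4: build the vocabulary table and one word frequency vector per resource."""
    token_lists = [preprocess(doc) for doc in documents]
    vocab = {}
    for tokens in token_lists:
        for word in tokens:
            vocab.setdefault(word, len(vocab))  # unique number per word
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts.get(word, 0) for word in vocab])
    return vocab, vectors


docs = ["The topic model, and the task model.", "Task mining is a model of tasks!"]
vocab, vectors = build_term_vectors(docs)
# vocab   -> {"topic": 0, "model": 1, "task": 2, "mining": 3, "tasks": 4}
# vectors -> [[1, 2, 1, 0, 0], [0, 1, 1, 1, 1]]
```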

The working method of the data mining module comprises the following steps:

After the vectorized data is obtained, the distribution probabilities are obtained by learning a latent Dirichlet allocation model. In the task model, the task distribution probability of a given time slice and the resource distribution probability of a given task are obtained, together with the task distribution probability given the occurrence of a certain resource. In the topic model, the topic distribution probability of a given resource and the word distribution probability of a given topic are obtained.

The working method of the recommendation module comprises the following steps:

The basis for recommending related resources for the current resource is the degree of association between resources; the basic approach is to rank resources by that degree. The recommendation module obtains a task model and a topic model from the data mining module, and the two models measure the degree of association between resources from two different aspects, operation and content. The recommendation module weights the two degrees of association to obtain a total degree of association, and then ranks and recommends according to the total.

The task model maps time slice sequences, tasks, and resources to the articles, topics, and words of the latent Dirichlet allocation model, so measuring the degree of association between resources amounts to measuring the degree of association between words. The degree of association between words can be measured by their similarity across the individual topics. Given words w₁ and w₂, their degree of association can be measured by the conditional topic distributions P(Z | w₁) and P(Z | w₂).
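The text does not state how the conditional distribution P(Z | w) is derived from the learned model. One plausible route, shown here purely as an assumption, is Bayes' rule: P(z | w) is proportional to P(w | z) · P(z). The function name and the toy numbers are hypothetical.

```python
def conditional_topic_distribution(word_given_topic, topic_prior, word):
    """Return P(z | word) for every topic z via Bayes' rule (an assumed derivation).

    word_given_topic[z][w] is P(w | z); topic_prior[z] is P(z).
    """
    joint = [word_given_topic[z][word] * topic_prior[z]
             for z in range(len(topic_prior))]
    total = sum(joint)
    return [value / total for value in joint]


# Toy model: 2 tasks (topics) over 3 resources (words).
word_given_topic = [
    [0.5, 0.4, 0.1],  # P(w | z = 0)
    [0.1, 0.1, 0.8],  # P(w | z = 1)
]
topic_prior = [0.5, 0.5]

p_w0 = conditional_topic_distribution(word_given_topic, topic_prior, 0)
p_w1 = conditional_topic_distribution(word_given_topic, topic_prior, 1)
p_w2 = conditional_topic_distribution(word_given_topic, topic_prior, 2)
```

In this toy model, resources 0 and 1 both concentrate on task 0 and so end up with similar conditional distributions, while resource 2 concentrates on task 1.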

In the topic model, measuring the degree of association between resources amounts to measuring the degree of association between articles. In the latent Dirichlet allocation model, an article is in effect reduced from a very high-dimensional word vector to a lower-dimensional topic vector. Therefore, the similarity between two articles is calculated from their topic probability distributions.

The standard way to measure the difference between two distributions is to calculate their KL distance; see Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. The Kullback-Leibler distance, also called the KL divergence (Kullback-Leibler divergence), measures the difference between two probability distributions over the same event space. Assuming that the two probability distributions are p and q, their KL distance can be calculated by the following equation:

D(p, q) = \sum_{i=1}^{T} p_{i} \log_{2} \frac{p_{i}}{q_{i}}

where p_i and q_i denote the i-th dimension of the probability distribution vectors p and q, respectively, and T denotes the overall dimension of p and q.

As can be seen from the equation, D(p, q) = 0 when each dimension of p and q is equal, i.e., when the two distributions are completely equal. The KL distance is asymmetric, that is, D(p, q) ≠ D(q, p). In the proposed application, a symmetric distance based on the KL distance is used, and the calculation formula is as follows:

KL(p, q) = \frac{1}{2} [D(p, q) + D(q, p)]
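The two formulas translate directly into code. In this sketch, the small floor `eps` applied to q is an added assumption to avoid division by zero when a component of q is exactly 0; it is not part of the formulas above.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D(p, q) = sum_i p_i * log2(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """KL(p, q) = 1/2 * [D(p, q) + D(q, p)]."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence(p, q)   # about 0.2075
d_qp = kl_divergence(q, p)   # about 0.1887, so D is not symmetric
dist = symmetric_kl(p, q)    # about 0.1981
```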

For two resources, the task model and the topic model each yield a KL distance measuring the difference between their probability distributions. Let the distance obtained from the task model be KL₁ and the distance obtained from the topic model be KL₂; the total distance L is calculated by the following equation:

L = α*KL₁ + β*KL₂

where α and β are set parameters, and may be set by empirical values or according to user preferences.

The smaller the final distance L, the more similar the resource is to the current resource, so the higher the association degree is, the more the resource should be recommended to the user.
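The ranking step can be sketched as below, assuming the total distance is the weighted sum of the two model distances and that the per-model distances to the current resource have already been computed. The resource names, distance values, and default weights are made up for illustration.

```python
def total_distance(kl_task, kl_topic, alpha, beta):
    """Weighted total distance L from the task-model and topic-model distances."""
    return alpha * kl_task + beta * kl_topic


def recommend(distances, alpha=0.5, beta=0.5, top_n=2):
    """Return the top_n resources with the smallest total distance.

    distances maps each candidate resource to (KL1, KL2) relative to the
    current resource.
    """
    scored = sorted(
        (total_distance(kl_task, kl_topic, alpha, beta), name)
        for name, (kl_task, kl_topic) in distances.items()
    )
    return [name for _, name in scored[:top_n]]


distances = {
    "notes.txt": (0.10, 0.40),
    "paper.pdf": (0.30, 0.05),
    "photo.jpg": (0.90, 0.80),
}
balanced = recommend(distances)                          # ["paper.pdf", "notes.txt"]
task_heavy = recommend(distances, alpha=0.9, beta=0.1)   # ["notes.txt", "paper.pdf"]
```

Shifting weight toward α favors resources used together with the current one, while shifting it toward β favors resources with similar content.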

The invention is further described below by way of example with reference to the accompanying drawings.

The process of the present invention is carried out by the embodiment system shown in fig. 1.

As shown in FIG. 1, the system of the present embodiment mainly includes: the information acquisition module, which monitors the user's operation events, acquires the resource content involved in those events, and sends both to the information management module; the information management module, which receives the operation events and resource content and responds to data query requests; the data preprocessing module, which converts the operation events and resource content into a specific format and passes them to the data mining module; the data mining module, which learns from the operation data and the resource content through a preset algorithm and generates a task model and a topic model respectively; and the recommendation module, which provides a list of the resources most relevant to the current resource according to the specified recommendation strategy.

The internal flow of each module is described below.

The information acquisition module (as shown in FIG. 2) works in the background and collects the user's operation events on the computer in real time, including opening a resource, closing a resource, and switching from one resource to another, together with the resource content involved in those events. For operation events involving documents, the attributes to be collected are the time, the event type, the title of the resource, and the path of the resource; for operation events involving web pages, they are the time, the event type, the web page title, and the web page URL. The collected operation events and resource content are sent to the information management module in real time.

The information management module (as shown in fig. 3) is responsible for receiving the operation events and resource contents from the information acquisition module in real time. Firstly, converting an operation event into operation data, recording the operation data and resource content into a database, wherein the data in the database can be used by other applications; then, responding to the data query request from the data preprocessing module, and returning the operation data and the resource content of the time period specified in the request to the data preprocessing module.

The data preprocessing module performs two kinds of preprocessing. On the one hand, it preprocesses the operation data obtained by query: using the working method for preprocessing operation events described above, the historical data of the user's operation events is segmented into time slices, and the operation events in each time slice are converted into word frequencies and finally represented as word frequency vectors. FIG. 4 illustrates an example of time slicing operation events.

On the other hand, it preprocesses the resource content: using the working method for preprocessing resource content described above (shown in FIG. 5), the content of each resource is converted into a word frequency vector through four steps: removing punctuation marks, Chinese word segmentation, removing stop words, and counting the vocabulary table to obtain the word frequency vector.

The resulting word frequency vectors of the operation data and of the resource content are used by the data mining module.

The data mining module learns from the vectorized data using a latent Dirichlet allocation (LDA) model to obtain the distribution probabilities, which the recommendation module uses to calculate the degree of association between two resources.

In the task model, each task to be mined corresponds to a topic in the LDA model, the vector of each time slice corresponds to an article in the LDA model, and each resource corresponds to a word in the LDA model (as shown in FIG. 6). The LDA model is learned by parameter estimation through Gibbs sampling, which yields the task distribution probability of a given time slice and the resource distribution probability of a given task, as well as the task distribution probability given the occurrence of a certain resource.

In the topic model, the vector describing the content of each resource corresponds to an article in the LDA model, and each word obtained by segmenting the resource content corresponds to a word in the LDA model. As with the task model, the LDA model is learned by parameter estimation through Gibbs sampling, which yields the topic distribution probability of a given resource and the word distribution probability of a given topic.
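For orientation, a compact collapsed Gibbs sampler for LDA is sketched below. It is a generic textbook-style sampler, not necessarily the exact procedure used in the embodiment; the hyperparameters `doc_prior` and `word_prior`, the iteration count, and the toy data are all assumptions. The same function serves both models: for the task model the "articles" are time-slice vectors and the "words" are resource numbers.

```python
import random


def expand(freq_vector):
    """Turn a word frequency vector into a flat list of word ids."""
    return [w for w, count in enumerate(freq_vector) for _ in range(count)]


def lda_gibbs(docs, n_topics, vocab_size, doc_prior=0.1, word_prior=0.01,
              iters=200, seed=0):
    """Return (theta, phi): per-article topic and per-topic word distributions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # counts per article and topic
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # counts per topic and word
    topic_total = [0] * n_topics
    assignments = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]  # remove the token's current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [(doc_topic[d][k] + doc_prior)
                           * (topic_word[k][w] + word_prior)
                           / (topic_total[k] + vocab_size * word_prior)
                           for k in range(n_topics)]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z  # record the newly sampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [[(doc_topic[d][k] + doc_prior) / (len(doc) + n_topics * doc_prior)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_word[k][w] + word_prior) / (topic_total[k] + vocab_size * word_prior)
            for w in range(vocab_size)] for k in range(n_topics)]
    return theta, phi


docs = [expand(v) for v in ([4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 4], [0, 1, 3, 5])]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

Here `theta[d]` corresponds to the topic (or task) distribution of article d, and `phi[k]` to the word (or resource) distribution of topic k.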

The recommendation module (as shown in FIG. 7) refreshes the recommendation interface in real time and recommends related resources for the current resource according to the degree of association between resources. The task model and the topic model measure the degree of association between resources from two different angles, operation and content, so the recommendation strategy weights the degrees of association obtained from the two models into a total degree of association and then ranks and recommends according to it.

In the task model, resources correspond to words in the LDA model, so the degree of association between resources is the degree of association between words in the LDA model. The degree of association of two words can be measured by the similarity of their conditional topic distributions. Given words w₁ and w₂ with conditional topic distributions P(Z | w₁) and P(Z | w₂), the KL distance between P(Z | w₁) and P(Z | w₂) can be calculated to measure the difference between them, and thus the degree of association between the words w₁ and w₂, i.e., between their corresponding resources.

In the topic model, resources correspond to articles in the LDA model, so measuring the degree of association between resources means calculating the similarity between two articles. Given articles d₁ and d₂, their topic probability distributions P(Z | d₁) and P(Z | d₂) are obtained in the topic model; likewise, their KL distance can be calculated to measure the difference between them, and thus the degree of association between the resources corresponding to d₁ and d₂.

The KL distance is calculated as described above. When the user uses a certain resource, for each other resource in the user's operation history, the KL distance measuring the difference between its probability distribution and that of the current resource is calculated in both the task model and the topic model according to the above method. Let the distance obtained in the task model be KL₁ and the distance obtained in the topic model be KL₂; the total distance of the two models is then calculated by weighting, L = α*KL₁ + β*KL₂, where α and β are set parameters.

The smaller the total distance L, the higher the degree of association between a resource and the current resource. Finally, the interface of the recommendation system is refreshed in real time, and the resource recommendation list is displayed in the interface in order of degree of association from high to low (that is, total distance L from small to large). FIG. 8 shows an example of the recommendation system interface; the user can double-click a resource in the interface to open it directly.

Claims

1. The information association method based on the user operation record and the resource content comprises the following steps:

2. The information association method based on user operation record and resource content as claimed in claim 1, wherein the operation events include: a resource open event, a resource close event, and an event of switching from one resource to another; and the resources include documents and web pages.

3. The information correlation method based on user operation record and resource content according to claim 1, wherein the attributes required to be collected of the operation event related to the document comprise time, event type, title of resource and path of resource, and the attributes required to be collected of the operation event related to the web page comprise time, event type, web page title and web page URL.

4. The information association method based on user operation record and resource content as claimed in claim 1, wherein the time slice sequence slicing method is:

ii) defining a sampling vector A_j = {a₁, a₂, …, a_n, …, a_N} to represent the states of all resources at the j-th sampling, wherein a_n ∈ {0, 1}, n is the resource number corresponding to the operation event, N is the total number of resources, and j denotes the j-th sampling;

wherein m = t / c is the total number of vectors, i is the sampling index, t is the length of the time slice, and c is the sampling period.

5. The information association method based on user operation record and resource content as claimed in claim 1, wherein the extracting of the resource content comprises: removing punctuation marks, performing Chinese word segmentation, removing stop words, and counting a vocabulary table to obtain word frequency vectors, the content of each resource being converted into a word frequency vector through these operations.

6. The method according to claim 1, wherein in the task model, the task distribution probability of a given time slice, the resource distribution probability of a given task, and the task distribution probability given the occurrence of a certain resource are obtained.

7. The method of claim 1, wherein in the topic model, a topic distribution probability of a given resource and a word distribution probability of a given topic are obtained.

8. The information association method based on user operation record and resource content as claimed in claim 1, wherein the degree of association is calculated as follows: the similarity between the probability distributions of the current resource and of other resources is calculated in the topic model and in the task model according to the Kullback-Leibler distance, and the two results are weighted to obtain the total distance.

9. The information association method based on user operation record and resource content as claimed in claim 6, wherein parameter estimation is performed by Gibbs sampling in the learning of the task model and the topic model.

10. The information association method based on user operation record and resource content according to claim 1, wherein the user's computer runs a Windows or Android system.