CN1645395A

CN1645395A - Method for discovering user interest in e-mail flow and transmitting document effectively

Info

Publication number: CN1645395A
Application number: CN 200510009506
Authority: CN
Inventors: 诸葛海; 丁连红
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2005-02-22
Filing date: 2005-02-22
Publication date: 2005-07-27

Abstract

A method for effectively recommending documents based on user interests discovered from an e-mail stream. Members of a scientific research team share scientific documents with one another. Each member's interests are extracted from the e-mail stream among the members; because the interests are refreshed as e-mails are sent and received, the correct documents continue to be recommended when the questions a member is concerned with change. A member only needs to upload a document to the team document database, and the program completes the recommendation.

Description

Method for discovering user interest in e-mail stream and effectively pushing document according to user interest

Technical Field

The invention relates to the field of computer technology, in particular to a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest. It involves semantic understanding, text classification, document sharing and e-mail streams.

Background Art

The research fields of different members of a research team usually overlap. On the one hand, team members often repeat the same work to obtain the same document, which wastes manpower and money. On the other hand, members often exchange information by e-mail and sometimes send valuable documents to other members as attachments. This achieves document sharing among members to some extent, but the following problems remain:

First, there is no guarantee that every member is willing to send the documents that other members need, so the repeated work that team members do to obtain the same documents cannot be fundamentally avoided.

Second, even if every member is willing to send the documents that other members need, the following situations still occur. A member's interests often change over time, and other members who do not notice the change may keep sending documents that are no longer needed while failing to send documents that are newly needed. It is also difficult for one member to grasp the interests of all other members accurately, so a document cannot be pushed to every member who needs it, and full sharing of documents cannot be achieved.

To realize full sharing of scientific and technical documents within the team, the method first extracts each team member's interests in scientific research work and then regularly pushes related documents to the members according to those interests. Accurate extraction of the members' interests is the basis for full document sharing among them. An e-mail stream forms among team members as they send and receive e-mails, and the e-mails a member sends and receives reflect the problems that member is concerned with, so the members' interests can be extracted from the e-mail stream. The invention extracts user interest from the e-mail stream among team members using existing e-mail functions, which secures the precondition for full document sharing. The basic idea is that the sub-fields in which a member's sent and received e-mails are concentrated are the sub-fields in which that member's research work is concentrated. First, the e-mails among the members are stored in a database, a process that also eliminates the interference of junk mail. Next, a natural language learning method is used to obtain the valid e-mails, that is, those that provide useful information for describing user interest. Then the research fields related to the team are divided into smaller sub-fields, and the valid e-mails are classified on that basis. Finally, according to the distribution of the valid e-mails over the sub-fields, the user interest is expressed as the set of sub-fields the member is concerned with.

Because user interest may change over time, a time factor is introduced into the interest extraction process, so that the user interest is updated promptly as new e-mails arrive and time passes. Pushing documents according to this up-to-date interest ensures that documents always reach all team members who need them, without being sent in error or missed.

The invention uses the interest point set, which describes the semantics of a sub-field, as a template: each document is assigned to the sub-field whose semantics it most resembles and is pushed to the users concerned with that sub-field. This ensures that the pushed documents are semantically what the user needs, and that pushing is accurate and effective.

A team member who wants to share a document with other members only needs to upload it to the team's document database; the document is then analyzed and pushed automatically. Most team members can accept a simple upload operation, so document sharing among team members is realized to a great extent and tedious repeated work is avoided.

Disclosure of Invention

The invention aims to provide a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest, thereby making effective use of team resources and fully realizing the sharing of scientific and technical documents among team members. The method comprises the following steps. First, the e-mails among the team members are stored in a database. Then user interests are extracted from the e-mail stream among the team members; when the problems a member is concerned with change, the member's interests are updated promptly from the e-mails the member sends and receives, so the correct documents can still be pushed. Next, semantic analysis is performed on the documents in the team document database. Finally, on the basis of that semantic analysis, the documents consistent with each user's interests are pushed to the team members.

The main features of the method are as follows. E-mails among the team members are forwarded to a fixed account through a function provided by the e-mail server program. A mail collection program is executed regularly; it decodes the e-mails in the fixed account and stores the results in an e-mail database, which completes the automatic storage of e-mails. Because most junk mail comes from unfamiliar e-mail addresses and only e-mails among members are stored, junk mail is filtered out in the process. Only the members' interests in scientific research work are considered: a natural language learning method divides the e-mails among members into valid and invalid e-mails, and user interest is extracted from the valid e-mails, which provide useful information for describing it, so that the extracted interest is accurate. Each research field related to the team is subdivided into sub-fields, whose background knowledge and semantics are represented by a prior knowledge set and an interest point set respectively. Valid e-mails are classified by calculating their similarity to the prior knowledge sets. Since the sub-fields in which a user's valid e-mails are concentrated are the sub-fields in which the user's research work is concentrated, the user interest is extracted from the distribution of the user's valid e-mails over the sub-fields and expressed as the set of sub-fields of concern. User interest may change over time, and an e-mail's ability to describe user interest declines as the e-mail ages, so time is introduced into the extraction process; when a user's work focus shifts, the user's interest is adjusted promptly. Documents can therefore always be pushed to all team members who need them, without being sent in error or missed, which secures the precondition for full sharing of scientific and technical documents among team members. Using the interest point sets that describe sub-field semantics as templates, each document is assigned to a sub-field according to its semantic similarity to each sub-field and is pushed to the users whose set of sub-fields of concern contains that sub-field, which ensures that the pushed documents are semantically what the users need and that pushing is accurate and effective. Team members only need to upload a document to the team's document database for it to be analyzed and pushed; most team members can accept a simple upload operation, so document sharing among the team members is simple and feasible.

Technical scheme

The invention discloses a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest. First, each research field related to the team is subdivided into sub-fields, and for each sub-field a prior knowledge set representing its background knowledge and an interest point set describing its semantics are constructed. An e-mail collection program is run periodically to store the e-mails among team members in an e-mail database, and the valid e-mails, which provide useful information for describing user interest, are extracted from that database; team members can also upload valuable scientific documents to a document database. Then each valid e-mail is assigned to the sub-field whose prior knowledge set it is most similar to, the user interests are extracted from the distribution of the valid e-mails over the sub-fields, and the documents in the document database are semantically analyzed and classified using the interest point sets of the sub-fields as templates. Finally, the document pushing program pushes the documents consistent with each user's interests to the team members, according to the user interests and the document classification results.

The scheme mainly comprises the following technical points:

1. Automatic storage of e-mails between team members

First, an e-mail database is constructed in which each record stores one e-mail, and the e-mails among team members are automatically forwarded to a fixed account by the e-mail server program. Then a mail collection program is run regularly; it decodes the e-mails in the fixed account and stores the results in the e-mail database, which realizes automatic storage of e-mails. Spam usually comes from unfamiliar e-mail addresses, so because only e-mails between members are saved, the automatic storage process itself filters out spam.

2. Extracting valid emails

The invention concerns only the user's interest in scientific research work, so only e-mails related to the content of that work are valid. The valid e-mails, which provide useful information for describing user interest, are extracted from the e-mail database by a natural language learning method.

3. Refining scientific research field division, establishing prior knowledge set and interest point set of sub-fields

The research fields of the team are subdivided to obtain the set of sub-fields related to the team. A prior knowledge set and an interest point set are established for each sub-field, representing the background knowledge and the semantics of the sub-field respectively. Each element of the prior knowledge set consists of a keyword representing the main content of the sub-field and the keyword's influence factor (descriptive ability) with respect to the sub-field. The interest point set consists of the semantic chain networks corresponding to the interest points contained in the sub-field; one semantic chain network describes the semantic information of one interest point.

The prior knowledge set expresses the background knowledge of a sub-field. Valid e-mails are classified by calculating their similarity to the prior knowledge sets of the sub-fields, and the user interest is expressed as the set of sub-fields the member is concerned with, according to the distribution of the valid e-mails over the sub-fields.

The interest point set describes the semantics of a sub-field. Using the interest point sets as templates, each document is assigned to the sub-field whose semantics it resembles, and the document pushing program pushes the document to the members concerned with that sub-field, which ensures that a pushed document is semantically what the user needs. Team members only need to upload a document to the team's document database for the program to push it, so the method is simple and easy to implement.

4. Obtaining user interest from classification results of valid emails

The sub-field to which each valid e-mail belongs is determined by matching the e-mail against the prior knowledge sets of the sub-fields, which classifies the valid e-mails. On the basis of this classification, the set of sub-fields a member is currently concerned with is determined from the distribution of the valid e-mails related to that member, and the user interest is expressed by this set. The basic idea is that the sub-fields in which a user's e-mails are concentrated are also the sub-fields in which the user's research work is concentrated.

5. Timely updating user interests

Because user interest may change over time, the method introduces a time factor into the interest extraction process: the user interest is updated promptly as new e-mails arrive and time passes, and is adjusted when the problems the user is concerned with change. Pushing documents according to this up-to-date interest ensures that documents are always pushed to all team members who need them, without being sent in error or missed.

6. Judging the sub-domain of the document according to semantic analysis

The documents in the document database are semantically analyzed using the interest point sets of the sub-fields as templates, and each document is assigned to the sub-field with similar semantics, which ensures the accuracy of document classification at the semantic level. Documents newly added to the document database are semantically analyzed and classified at regular intervals.

7. Pushing documents according to user interests and document classification results

The document pushing program is run periodically; according to each user's current interest, it pushes the documents in the document database that match that interest to the corresponding team members by e-mail. Pushing according to user interest ensures that the correct documents are always pushed to team members. Because the push is based on the result of semantic analysis of the documents rather than on simple keyword matching, the pushed documents are semantically what the user needs, and pushing is accurate and effective.

Drawings

FIG. 1 is a flow chart of a method for discovering user interests in an email stream and effectively pushing documents according to the method.

FIG. 2 is a representation of a semantic chain network and its adjacency matrix of the present invention.

FIG. 3 is a flow chart of document understanding of the present invention.

Detailed Description

The invention discloses a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest. The method subdivides each research field related to a team into smaller sub-fields and establishes a prior knowledge set and an interest point set for each sub-field, representing the sub-field's background knowledge and semantics respectively; a user's interest is the set of sub-fields the user is concerned with. First, the e-mails among the team members are stored in an e-mail database, and the valid e-mails, whose contents relate to scientific research, are extracted from them. Then each valid e-mail is assigned to the sub-field whose prior knowledge set it is most similar to, which classifies the valid e-mails. From the classification result, the proportion of the valid e-mails sent and received by each member that falls in each sub-field is calculated, and the sub-fields whose proportion exceeds a threshold are added to the user's set of sub-fields of concern, giving the user interest. Meanwhile, semantic analysis is performed on the documents in the team document database, using the interest point sets of the sub-fields as templates, and each document is assigned to the sub-field with similar semantics. Finally, the document pushing program pushes relevant documents to users according to their interests: each document in the document database is sent, as an e-mail attachment, to every user whose set of sub-fields of concern contains the sub-field to which the document belongs.
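The final pushing rule in the paragraph above is essentially a set-membership test. The following is a minimal Python sketch of that rule only; the function name `push_plan`, the dictionary layout and the sample names are illustrative assumptions, and the actual sending of e-mail attachments is left out.

```python
from typing import Dict, List, Set


def push_plan(
    doc_subfields: Dict[str, str],
    interests: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each document to the users whose interest set contains its sub-field.

    doc_subfields: document name -> sub-field it was classified into (assumed layout)
    interests:     user name -> set of sub-fields the user is concerned with
    """
    plan: Dict[str, List[str]] = {}
    for doc, subfield in doc_subfields.items():
        # A user receives the document only if its sub-field is in the user's set.
        plan[doc] = sorted(
            user for user, fields in interests.items() if subfield in fields
        )
    return plan
```

With this layout, a user whose interest set is empty receives nothing, and a document whose sub-field nobody follows maps to an empty recipient list.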

Fig. 1 is a flow chart of the implementation of the present invention, which mainly includes the following four parts:

First, automatic storage of e-mails and extraction of valid e-mails

1. Building an email database

Team members use a unified e-mail server and server program (e.g., WebEasyMail). A database file (e.g., mail.mdb, hereinafter the e-mail database) is created in a directory of the e-mail server (e.g., F:\database, hereinafter the database directory) to store the e-mails exchanged between team members. Each e-mail is stored as one record in the e-mail database, containing six fields whose names and meanings are as follows:

Sender: e-mail address of the sender

Recipient: e-mail address of the recipient

Cc: e-mail addresses copied on the e-mail

Sending time: time at which the e-mail was sent

Subject: subject of the e-mail

Body: text content of the e-mail; because it can exceed 255 characters, it is stored as an OLE (Object Linking and Embedding) object

2. Automatic e-mail storage

First, all e-mails between team members are automatically forwarded to a fixed account (e.g., an account with the user name group) through a service provided by WebEasyMail. The mail for this account is stored in a fixed directory on the mail server (e.g., C:\WebEasyMail\mail\group, hereinafter the undecoded mail directory). Traditional junk mail generally comes from e-mail addresses unfamiliar to users, and only e-mails among team members are collected in this process, so junk mail does not interfere with the extraction of user interest.

Then a mail collection program written for this purpose (e.g., MailGather) is run periodically (e.g., once a day) to store e-mails automatically. The program reads each e-mail in the undecoded mail directory in turn, parses the mail header, decodes the mail body, and stores the decoded information in the corresponding fields of the e-mail database file. Each processed e-mail is moved to another directory on the e-mail server (e.g., C:\WebEasyMail\mail\group_deleted, hereinafter the decoded mail directory) so that it is not processed again the next time MailGather runs.
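The collection step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MailGather program itself: it assumes each message is stored as one RFC 5322 file in the undecoded directory, it substitutes SQLite for the Access-style database file, and the function name `gather_mail` and the column names are hypothetical.

```python
import email
import shutil
import sqlite3
from email import policy
from pathlib import Path


def gather_mail(undecoded_dir: Path, decoded_dir: Path, db_path: Path) -> int:
    """Decode every message file, store it as one record, then move the file away."""
    decoded_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    # Six fields per record, mirroring the field list in the text (names assumed).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "sender TEXT, receiver TEXT, cc TEXT, sent_time TEXT, subject TEXT, body TEXT)"
    )
    stored = 0
    for path in sorted(undecoded_dir.iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            # policy.default decodes encoded headers and transfer encodings.
            msg = email.message_from_binary_file(fh, policy=policy.default)
        part = msg.get_body(preferencelist=("plain",))
        body = part.get_content() if part is not None else ""
        conn.execute(
            "INSERT INTO mail VALUES (?, ?, ?, ?, ?, ?)",
            (
                str(msg.get("From", "")),
                str(msg.get("To", "")),
                str(msg.get("Cc", "")),
                str(msg.get("Date", "")),
                str(msg.get("Subject", "")),
                body,
            ),
        )
        conn.commit()
        # Moving the file ensures it is not processed again on the next run.
        shutil.move(str(path), str(decoded_dir / path.name))
        stored += 1
    conn.close()
    return stored
```

Running the function from a daily scheduled job would match the "once a day" example; a second run over an already emptied directory stores nothing.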

3. Extracting valid emails

Although spam in the traditional sense has been filtered out in the previous step, not every e-mail stored in the e-mail database provides useful information for describing user interest. An e-mail that reflects the user's interest is called a valid e-mail, and an e-mail that does not is called an invalid e-mail. Because only the members' interest in scientific research work is considered, e-mails related to the team's research content are valid, while the jokes, activity notices and similar messages that team members frequently send one another are invalid. To obtain accurate user interest, the valid e-mails must be extracted from the e-mail database, which is done by a natural language learning method.

First, a certain number of valid e-mails and invalid e-mails are selected as the valid e-mail training set $C_1$ and the invalid e-mail training set $C_2$, respectively. The standard vectors of valid and invalid e-mails, denoted $\vec{c}_1$ and $\vec{c}_2$, are obtained by the following formulas:

$$\vec{c}_1 = 16\,\frac{1}{|C_1|}\sum_{e \in C_1}\frac{\vec{e}}{|\vec{e}|} - 4\,\frac{1}{|C_2|}\sum_{e \in C_2}\frac{\vec{e}}{|\vec{e}|} \tag{1}$$

$$\vec{c}_2 = 16\,\frac{1}{|C_2|}\sum_{e \in C_2}\frac{\vec{e}}{|\vec{e}|} - 4\,\frac{1}{|C_1|}\sum_{e \in C_1}\frac{\vec{e}}{|\vec{e}|} \tag{2}$$

where $\vec{e} = (e_1, e_2, \ldots, e_{|F|})$ is the vector representation of an e-mail $e$; $e_i$ is the number of occurrences of keyword $w_i$ in the subject and body of the e-mail; $|\vec{e}|$ is the length of the vector $\vec{e}$; and $|C_1|$ and $|C_2|$ are the sizes of $C_1$ and $C_2$, i.e., the number of e-mails they contain.

Then, for each e-mail $e$ in the e-mail database, the similarity between its vector representation $\vec{e}$ and each of the standard vectors $\vec{c}_1$ and $\vec{c}_2$ is calculated as follows:

$$\cos(\vec{e}, \vec{c}_n) = \frac{\sum_{i=1}^{|F|} e_i c_i}{\sqrt{\sum_{i=1}^{|F|} e_i^2}\,\sqrt{\sum_{i=1}^{|F|} c_i^2}} \tag{3}$$

where $n = 1$ or $n = 2$, and $c_i$ is the $i$-th component of $\vec{c}_n$. If $\cos(\vec{e}, \vec{c}_1) > \cos(\vec{e}, \vec{c}_2)$, $e$ is a valid e-mail; otherwise $e$ is an invalid e-mail. This yields the valid e-mails from which user interests are extracted.
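Formulas (1) to (3) can be illustrated with a short Python sketch. It assumes e-mails have already been converted to keyword-count vectors of equal length; the function names and the sample vectors are illustrative assumptions, while the coefficients 16 and 4 and the decision rule come from the formulas above.

```python
import math
from typing import List, Tuple

Vector = List[float]


def normalise(v: Vector) -> Vector:
    """Return v divided by its Euclidean length (v itself if the length is zero)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)


def mean_unit_vector(training_set: List[Vector]) -> Vector:
    """Average of the length-normalised vectors of a training set."""
    units = [normalise(v) for v in training_set]
    return [sum(column) / len(units) for column in zip(*units)]


def standard_vectors(c1: List[Vector], c2: List[Vector]) -> Tuple[Vector, Vector]:
    """Formulas (1) and (2): 16 times the own-class mean minus 4 times the other."""
    m1, m2 = mean_unit_vector(c1), mean_unit_vector(c2)
    s1 = [16 * a - 4 * b for a, b in zip(m1, m2)]
    s2 = [16 * b - 4 * a for a, b in zip(m1, m2)]
    return s1, s2


def cosine(u: Vector, v: Vector) -> float:
    """Formula (3): cosine similarity, defined as 0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def is_valid(e: Vector, s1: Vector, s2: Vector) -> bool:
    """An e-mail is valid when it is closer to the valid standard vector."""
    return cosine(e, s1) > cosine(e, s2)
```

For example, with hypothetical keywords ("grid", "semantic", "party"), training on `[[3, 2, 0], [1, 4, 0]]` as valid and `[[0, 0, 5], [0, 1, 3]]` as invalid makes `[2, 1, 0]` valid and `[0, 0, 2]` invalid.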

Second, valid e-mail classification and user interest extraction

Each research field related to the team is divided into smaller sub-fields, and the background knowledge of a sub-field $nd_i$ is represented by its prior knowledge set $K_i$. $K_i$ is a set of pairs $(n_k, a_k)$, where $n_k$ is one of a group of keywords that together reflect the main content of $nd_i$, and $a_k$ is the weight of $n_k$, representing the ability of $n_k$ to describe $nd_i$: the higher $a_k$, the stronger the descriptive ability of $n_k$.

The valid e-mails are classified by calculating their similarity to the prior knowledge set of each sub-field, and the user interest is expressed as the set of sub-fields the member is concerned with, according to the distribution of the valid e-mails over the sub-fields.

First, the probability that each valid e-mail $e$ relates to the sub-field $nd_i$ is computed:

where $n_k$ is a keyword of $K_i$ contained in the subject and body of the e-mail; $D_l$ is the part of the e-mail in which $n_k$ is located, so that $(n_k, a_k) \in K_i$ and $n_k \in D_l$; $S_{kl}$ is the number of occurrences of keyword $n_k$ in that part of the e-mail, and $f(S_{kl}) = \tanh(S_{kl}/3)$; $r$ and $N$ are the numbers of elements of $D_l$ and $K_i$, respectively.

Then $e$ is assigned to the sub-field with the highest probability, which completes the classification of valid e-mails.
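The assignment step can be sketched as an argmax over sub-fields. The score used below is a stand-in chosen for illustration, not the method's own probability formula: it is assumed to be the average, over the keywords of a prior knowledge set, of the keyword weight times a saturating `tanh(count / 3)` of the keyword's count. The function names, the single-word keyword restriction and the sample data are also assumptions.

```python
import math
import re
from typing import Dict

PriorKnowledge = Dict[str, float]  # keyword n_k -> weight a_k


def score(text: str, prior: PriorKnowledge) -> float:
    """Stand-in relevance score (assumption): mean of a_k * tanh(count_k / 3)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = 0.0
    for keyword, weight in prior.items():
        count = words.count(keyword)  # single-word keywords only, for simplicity
        total += weight * math.tanh(count / 3)
    return total / len(prior) if prior else 0.0


def classify(text: str, priors: Dict[str, PriorKnowledge]) -> str:
    """Assign the e-mail text to the sub-field with the highest score."""
    return max(priors, key=lambda name: score(text, priors[name]))
```

The `tanh` term keeps a keyword repeated many times from dominating the score, which is one plausible reason for using a saturating function of the count.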

Generally, most of the valid e-mails a user sends or receives are concentrated in a few sub-fields, and the user's research work should be concentrated in those same sub-fields. That is, the sub-fields in which the user's valid e-mails are concentrated are the sub-fields the user's research work is concerned with, and the member's interest is represented by the set of those sub-fields. The percentage of the user's research work related to each sub-field can therefore be calculated from the classification results of the valid e-mails.

Then the percentage $per_{ij}$ of the research work of user $i$ that relates to sub-field $j$ is calculated:

$$per_{ij} = \frac{\alpha \sum_{(e \in nd_j) \cap (e \in \mathrm{from}_i)} 2^{-\frac{\mathrm{age}(e)}{hl}}\, \mathrm{sim}(e, K_j) + \beta \sum_{(e \in nd_j) \cap (e \in \mathrm{to}_i)} 2^{-\frac{\mathrm{age}(e)}{hl}}\, \mathrm{sim}(e, K_j)}{\alpha \sum_{e \in \mathrm{from}_i} 2^{-\frac{\mathrm{age}(e)}{hl}}\, \mathrm{sim}(e, K_j) + \beta \sum_{e \in \mathrm{to}_i} 2^{-\frac{\mathrm{age}(e)}{hl}}\, \mathrm{sim}(e, K_j)} \times 100\% \tag{5}$$

Here $per_{ij}$ is the percentage of the research work of user $i$ related to sub-field $j$; $\mathrm{from}_i$ is the set of valid e-mails sent by user $i$, and $\mathrm{to}_i$ is the set of valid e-mails received by user $i$. The coefficients $\alpha = 1$ and $\beta = 0.8$ represent the ability of sent and received valid e-mails, respectively, to describe the user's interest.

The factor $2^{-\mathrm{age}(e)/hl}$ makes an e-mail's descriptive ability decline as the e-mail ages, where $\mathrm{age}(e)$ is the difference between the current date and the sending date of the e-mail, and $hl = 30$ means that an e-mail sent 30 days ago has half the descriptive ability of a current e-mail.

How well the valid e-mails a user receives from other members describe the user's interest depends on how well the senders know the user's research work, whereas the valid e-mails a user sends normally reflect the user's research interests correctly; sent valid e-mails are therefore given the stronger descriptive ability. Because a user's research focus changes over time, the descriptive ability of an e-mail should also decline as the e-mail ages, which is achieved by introducing the factor $2^{-\mathrm{age}(e)/hl}$ into the formula.

Finally, if $per_{ij}$ is greater than the threshold, sub-field $nd_j$ is added to the set of sub-fields of interest to user $i$; the threshold is 10%.
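Formula (5) and the threshold rule can be worked through in Python. The sketch below assumes each valid e-mail already carries its assigned sub-field, its age in days and its similarity to every sub-field's prior knowledge set; the `ValidMail` class, the function names and the sample numbers are illustrative, while α = 1, β = 0.8, hl = 30 and the 10% threshold are the values stated above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

ALPHA = 1.0        # descriptive ability of sent valid e-mails
BETA = 0.8         # descriptive ability of received valid e-mails
HALF_LIFE = 30.0   # hl, in days
THRESHOLD = 10.0   # per cent


@dataclass
class ValidMail:
    subfield: str            # sub-field nd_j the e-mail was classified into
    age_days: float          # age(e): days between sending date and today
    sims: Dict[str, float]   # sim(e, K_j) for each sub-field j (assumed given)


def decayed(mail: ValidMail, j: str) -> float:
    """Time-decayed similarity 2^(-age(e)/hl) * sim(e, K_j)."""
    return 2 ** (-mail.age_days / HALF_LIFE) * mail.sims.get(j, 0.0)


def percentage(j: str, sent: List[ValidMail], received: List[ValidMail]) -> float:
    """per_ij of formula (5) for one user, given that user's sent/received mail."""
    num = ALPHA * sum(decayed(e, j) for e in sent if e.subfield == j)
    num += BETA * sum(decayed(e, j) for e in received if e.subfield == j)
    den = ALPHA * sum(decayed(e, j) for e in sent)
    den += BETA * sum(decayed(e, j) for e in received)
    return 100.0 * num / den if den else 0.0


def interests(
    subfields: Iterable[str], sent: List[ValidMail], received: List[ValidMail]
) -> Set[str]:
    """Sub-fields whose percentage exceeds the threshold."""
    return {j for j in subfields if percentage(j, sent, received) > THRESHOLD}
```

Because the denominator sums similarities to the same $K_j$ over all of the user's e-mails, the percentages of different sub-fields need not add up to 100%.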

Third, document understanding and classification

A basic concept, viewpoint or method is called a point of interest, and a semantic chain network (SG) is used to represent the semantic information of a point of interest. The interest point set SG-set_i of sub-field nd_i describes all the semantics implied by nd_i; its elements are the semantic chain networks corresponding to the points of interest that nd_i contains. Documents are assigned to semantically similar sub-fields by using the interest point set of each sub-field as a template.

SG = (N, R), where N is the set of nodes, comprising one point of interest N₁ and a group of keywords {N₂, N₃, ..., N_m} that together express the semantics of N₁; R is a set of directed arcs representing causal relationships between nodes.

FIG. 2(a) shows a semantic chain network. A directed arc starting at N_i and ending at N_j represents a causal relationship from N_i to N_j, and its weight w_ij ∈ [-1, +1] indicates the degree of influence of the cause node N_i on the result node N_j.

FIG. 2(b) shows the adjacency matrix representation of the semantic chain network, an n × n matrix where n is the number of nodes in the network. If a causal relationship from N_i to N_j exists, the element in row i, column j of the adjacency matrix is w_ij; otherwise it is 0.
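As an illustration of the adjacency matrix representation, the following Python sketch builds the n × n matrix from a list of weighted arcs. The function name and the dictionary layout of the arcs are assumptions made for this example and are not part of the original disclosure.

```python
from typing import Dict, List, Tuple


def adjacency_matrix(n: int, arcs: Dict[Tuple[int, int], float]) -> List[List[float]]:
    """Build the n x n adjacency matrix of a semantic chain network.

    arcs maps (i, j) to w_ij, using the 1-based node numbering N_1..N_n
    of the text; entries without an arc stay 0.
    """
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), w in arcs.items():
        if not -1.0 <= w <= 1.0:
            raise ValueError(f"weight w_{i}{j} = {w} is outside [-1, +1]")
        matrix[i - 1][j - 1] = w   # row i, column j holds w_ij
    return matrix


# Keywords N_2 and N_3 both influence the point of interest N_1.
E = adjacency_matrix(3, {(2, 1): 0.8, (3, 1): -0.5})
```

Here only the first column of `E` is non-zero, because both arcs end at N₁.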

FIG. 3 is a flowchart of document understanding and partitioning, including the following steps:

S3-1. Select a document d from the team document database;

S3-2. Select a sub-field nd_i and obtain the corresponding interest point set SG-set_i;

S3-3. Calculate the semantic similarity md(d, nd_i) between document d and sub-field nd_i:

S3-3.1 Divide document d into several small parts p₁, p₂, ..., p_m. The division may be made by number of bytes or by paragraph; this method divides by paragraph, and paragraphs may be further subdivided;

S3-3.2 For each small part p_j, set md_Part-ji = 0;

S3-3.3 For each element SG_r of the interest point set SG-set_i of sub-field nd_i:

(1) Calculate the state value V_k′ in p_j of each keyword N_k contained in SG_r: V_k′ = tanh(S_k/3), where S_k is the number of occurrences of N_k in p_j;

(2) (V₁′, V₂′, ..., V_m′) = (0, V₂′, ..., V_m′) × E_r, where E_r is the adjacency matrix representation of SG_r;

(3) If md_Part-ji < V₁′, then md_Part-ji = V₁′.

Wherein the document d is divided into m small parts

S3-4. If md(d, nd_i) > 0.65, assign document d to sub-field nd_i. Go to S3-2.
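Steps S3-3.2 and S3-3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: keyword occurrences are counted by case-insensitive substring matching, each semantic chain network is given as a hypothetical pair (keyword list, adjacency matrix) with the point of interest N₁ first, and the final aggregation of the per-part values into md(d, nd_i) is omitted because its formula is not reproduced in the text above.

```python
import math
from typing import List, Sequence, Tuple

# A semantic chain network as (node names, adjacency matrix); nodes[0] is N_1.
SemanticChain = Tuple[Sequence[str], Sequence[Sequence[float]]]


def count_occurrences(text: str, keyword: str) -> int:
    """S_k: occurrences of a keyword in a part (substring counting is an assumption)."""
    return text.lower().count(keyword.lower())


def interest_point_value(part: str, chain: SemanticChain) -> float:
    """V_1' = first component of (0, V_2', ..., V_m') x E_r."""
    nodes, matrix = chain
    # Step (1): state values of the keywords; the slot of N_1 itself is 0.
    v = [0.0] + [math.tanh(count_occurrences(part, kw) / 3) for kw in nodes[1:]]
    # Step (2): only the first column of the row-vector product is needed.
    return sum(v[k] * matrix[k][0] for k in range(len(nodes)))


def part_match(part: str, sg_set: List[SemanticChain]) -> float:
    """md_Part-ji: the largest V_1' over all chains in SG-set_i (steps S3-3.2, S3-3.3)."""
    md_part = 0.0
    for chain in sg_set:
        v1 = interest_point_value(part, chain)
        if md_part < v1:          # step (3)
            md_part = v1
    return md_part


chain = (["semantic grid", "ontology", "peer"],
         [[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [0.5, 0.0, 0.0]])
score = part_match("Ontology mapping links each ontology to a peer.", [chain])
```

Because tanh saturates, repeating a keyword many times in one part raises its state value only up to 1, so a single frequent word cannot dominate the match.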

The method calculates the semantic similarity between each document and each sub-field from the interest point level, so that the documents are classified into the sub-fields with higher semantic similarity, and one document may belong to a plurality of sub-fields at the same time. The team's document database includes a large number of existing technical documents and also receives documents uploaded by team members to continuously increase the capacity of the document database. Therefore, it is periodically checked whether there are newly added documents in the document database, and if so, they are divided into corresponding sub-fields as described above.

Fourthly, effectively pushing documents according to user interests

The result of document understanding is that the documents in the team document database are divided among the sub-domains related to the team. A document push program (e.g., filedelivery) is written that, based on the document classification results and on user interests, selects appropriate documents from the team's document database and pushes them to team members. Because team members tend to read documents related to the sub-domains they are interested in, the program pushes each document as an e-mail attachment to the users whose set of sub-domains of interest includes a sub-domain to which the document belongs.

Each document in the document database has two lists, "sent people" and "upload people". The "sent people" list records the team members to whom the document has already been pushed; when FileDeliver runs, it pushes the document only to team members who do not appear in the document's "sent people" list. When a member uploads a document to the team document database, the upload succeeds if the document is not yet in the database; otherwise the member is notified that the upload is a duplicate. Whether or not the upload succeeds, the member is recorded in the document's "upload people" list. FileDeliver also does not push a document to members who appear in its "upload people" list, because a document that a member tries to upload must be one the member already has. Team members thus need only perform a simple upload operation to share a document among all members who need it, so the method is simple and effective.
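The bookkeeping with the two lists can be sketched in Python as below. The `Document` class, the function names and the `send` callback are hypothetical stand-ins chosen for this example; the actual push program and its e-mail delivery are not specified at this level of detail in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class Document:
    # Hypothetical record; sub-domains are filled in by the classification step.
    name: str
    subfields: Set[str] = field(default_factory=set)
    sent_people: Set[str] = field(default_factory=set)
    upload_people: Set[str] = field(default_factory=set)


def upload(database: Dict[str, Document], name: str, member: str) -> bool:
    """Return True for a new upload, False for a repeated one.

    The member is recorded in "upload people" in both cases.
    """
    is_new = name not in database
    if is_new:
        database[name] = Document(name)
    database[name].upload_people.add(member)
    return is_new


def recipients(doc: Document, interests: Dict[str, Set[str]]) -> Set[str]:
    """Members interested in one of the document's sub-domains and in neither list."""
    return {member for member, subs in interests.items()
            if subs & doc.subfields
            and member not in doc.sent_people
            and member not in doc.upload_people}


def push(doc: Document, interests: Dict[str, Set[str]],
         send: Callable[[str, Document], None]) -> Set[str]:
    """Send the document to every eligible member and record them as sent."""
    targets = recipients(doc, interests)
    for member in sorted(targets):
        send(member, doc)              # e.g. mail the document as an attachment
        doc.sent_people.add(member)
    return targets
```

Running `push` a second time with unchanged interests returns an empty set, which is how repeated sending is avoided in this sketch.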

Claims

1. A method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest, comprising the following steps: first, storing the e-mails exchanged among team members in an e-mail database and extracting the valid e-mails from them; then, extracting user interests according to the distribution of the valid e-mails, and classifying the documents in the team document database through semantic analysis; and finally, according to the user interests and the document classification result, pushing documents consistent with each member's interests to the team members by e-mail.

2. The method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest as recited in claim 1, wherein the e-mails among team members are decoded by an e-mail collecting program and the decoded contents are stored in the e-mail database; automatic storage of the e-mails is realized by running the e-mail collecting program periodically; and, because spam mostly comes from unfamiliar e-mail addresses and the method considers only the e-mails exchanged among members, interference from spam is eliminated when user interests are extracted.

3. The method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest as claimed in claim 1, wherein a natural language learning method is used to obtain the valid e-mails, i.e. those that provide useful information for describing user interests, and only the interests of team members in their research and development work are considered, thereby ensuring the accuracy of the extracted user interests.

4. The method of claim 1 for discovering user interests in an email stream and for efficiently pushing documents based thereon, further comprising subdividing a research domain associated with a team into sub-domains, establishing a priori knowledge sets of the sub-domains to represent background knowledge thereof, classifying the valid emails by similarity calculations between the valid emails and the prior knowledge sets of the sub-domains, and representing user interests by a set of sub-domains of interest to the members based on the distribution of the valid emails in the sub-domains.

5. The method of claim 1 for discovering user interests in an email stream and efficiently pushing documents in response thereto, wherein a time factor is introduced into the interest extraction process in view of possible changes in user interests over a longer period of time, the user interests are updated in time as new emails are generated and time passes, and pushing documents to users in response to user interests ensures that documents are always pushed to all team members who need the documents without either misposting or missed posting.

6. The method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest as claimed in claim 1, wherein an interest point set describing the semantics of each sub-domain is constructed; documents are assigned to semantically similar sub-domains by using the interest point sets as templates; a document pushing program pushes each document to the members interested in a sub-domain to which the document belongs, which guarantees that the documents pushed to a user are semantically what the user needs; and team members need only upload documents to the team document database for the program to complete the pushing, so the method is simple and easy to implement.

7. A method for discovering user interest in an email stream and effectively pushing documents according to the user interest is characterized by mainly comprising the following four parts:

firstly, the e-mail is automatically stored, and the effective e-mail is extracted, wherein,

1. building an email database

The team members use the uniform E-mail server and server program to establish a database file under a certain directory of the E-mail server to store the E-mail information among the team members;

2. automatic e-mail storage

Firstly, all the e-mails among the team members are automatically forwarded to a fixed account by a mail server program, and the mails of the account are stored in a certain fixed directory of a mail server; then, regularly running the compiled mail collection program to realize the automatic storage of the e-mails, and decoding the e-mails by the program and storing the decoding result in the corresponding field of the e-mail database;

3. extracting valid emails

The invention only considers the interest of the user in the aspect of scientific research work, and extracts the effective e-mail which can provide useful information for describing the interest of the user through a natural language learning method;

second, valid e-mail classification and user interest extraction

The research fields related to the team are divided into smaller sub-fields, and the background knowledge of each sub-field nd_i is represented by its prior knowledge set K_i; K_i is a set of pairs (n_k, a_k), where n_k is one of a group of keywords that together reflect the main content of nd_i, and a_k is the weight of n_k, representing the ability of n_k to describe nd_i: the higher a_k, the stronger the descriptive ability of n_k;

third, document understanding and classification

A basic concept, viewpoint or method is called a point of interest, and a semantic chain network (SG) represents the semantic information of a point of interest, where SG = (N, R); N is the set of nodes, comprising one point of interest N₁ and a group of keywords {N₂, N₃, ..., N_m} that together express the semantics of N₁; R is a set of directed arcs representing causal relationships between nodes; the interest point set SG-set_i of sub-field nd_i describes all the semantic information implied by nd_i, and its elements are the semantic chain networks corresponding to the points of interest that nd_i contains; documents are assigned to semantically similar sub-fields by using the interest point set of each sub-field as a template;

fourthly, effectively pushing documents according to user interests

A document pushing program is written which pushes each document, in the form of an e-mail attachment, to the users whose set of sub-fields of interest includes a sub-field to which the document belongs; each document has two lists, "sent people" and "upload people", and the document pushing program pushes the document only to team members who appear in neither list, so that repeated sending is avoided; members can share a document among all members who need it simply by uploading it to the team document database, so the method is simple and effective.

8. The method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest as claimed in claim 7, wherein, in the first part (automatic e-mail storage and extraction of valid e-mails),

3. extracting valid emails

First, a number of valid e-mails and invalid e-mails are selected as the valid e-mail training set C₁ and the invalid e-mail training set C₂, respectively, and the standard vectors of valid e-mails and of invalid e-mails are obtained by the following formula:

wherein,

is the vector length; |C₁| and |C₂| are the numbers of training samples in C₁ and C₂, respectively, i.e., the numbers of e-mails they contain.

Then the similarity between the vector representation of e-mail e and each of the two standard vectors is calculated as follows:

wherein n = 1 or n = 2.

If

then e is a valid e-mail; otherwise e is an invalid e-mail.

9. The method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest as claimed in claim 7, wherein, in the second part (valid e-mail classification and user interest extraction),

first, the probability that a valid e-mail e relates to sub-domain nd_i is calculated:

wherein n_k is a keyword of K_i contained in the subject and body of the e-mail; D_l is the set of such n_k; S_kl is the number of occurrences of keyword n_k in the above-mentioned part of the e-mail, and f(S_kl) = tanh(S_kl/3); r and N are the numbers of elements of D_l and K_i, respectively; e is assigned to the sub-domain with the highest probability, which realizes the classification of valid e-mails;

$$per_{ij} = \frac{\alpha \sum_{(e \in nd_j) \cap (e \in from_i)} 2^{-\frac{age(e)}{hl}}\, sim(e, K_j) + \beta \sum_{(e \in nd_j) \cap (e \in to_i)} 2^{-\frac{age(e)}{hl}}\, sim(e, K_j)}{\alpha \sum_{(e \in from_i)} 2^{-\frac{age(e)}{hl}}\, sim(e, K_j) + \beta \sum_{(e \in to_i)} 2^{-\frac{age(e)}{hl}}\, sim(e, K_j)} \times 100\% \qquad (5)$$

wherein α = 1 and β = 0.8 represent, respectively, the ability of valid e-mails sent by the user and of valid e-mails received by the user to describe the user's interest; the factor 2^(-age(e)/hl) makes the descriptive ability of an e-mail decrease as the e-mail ages, where age(e) is the difference between the current date and the sending date of the e-mail, and hl = 30 means that an e-mail sent 30 days ago has half the descriptive ability of a current e-mail; from_i is the set of valid e-mails sent by user i and to_i is the set of valid e-mails received by user i; if per_ij is greater than the threshold, sub-domain nd_j is added to the set of sub-domains of interest to user i, where the threshold is 10%.

10. The method for discovering user interest in an e-mail stream and effectively pushing documents according to the user interest as claimed in claim 7, wherein the third part, document understanding and classification, comprises the following steps:

S3-1. Select a document d from the team document database;

S3-2. Select a sub-field nd_i and obtain its interest point set SG-set_i;

S3-3. Calculate the semantic matching degree md(d, nd_i) between document d and sub-field nd_i:

S3-3.1 Divide document d into several small parts p₁, p₂, …, p_m;

S3-3.2 For each small part p_j, set md_Part-ji = 0;

S3-3.3 For each element SG_r of the interest point set SG-set_i of sub-field nd_i:

(1) Calculate the state value V_k′ in p_j of each keyword N_k contained in SG_r: V_k′ = tanh(S_k/3), where S_k is the number of occurrences of N_k in p_j;

(2) (V₁′, V₂′, ..., V_m′) = (0, V₂′, ..., V_m′) × E_r, where E_r is the adjacency matrix representation of SG_r;

(3) If md_Part-ji < V₁′, then md_Part-ji = V₁′.

S3-3.4

Wherein the document d is divided into m small parts