CN1645395A - Method for discovering user interest in e-mail flow and transmitting document effectively - Google Patents

Publication number
CN1645395A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510009506
Other languages
Chinese (zh)
Inventor
诸葛海
丁连红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

A method for discovering user interests in an e-mail stream and effectively recommending documents according to those interests. Members of a scientific research team share scientific documents with one another. Each member's interests are extracted from the e-mail stream among the members and are refreshed as the member sends and receives e-mails, so that the correct documents can still be recommended when the questions a member is concerned with change. A member only needs to upload a document to the team's document database; the program then completes the recommendation.

Description

Method for discovering user interest in e-mail stream and effectively pushing document according to user interest
Technical Field
The invention relates to the field of computer technology, and in particular to a method for discovering user interests in an e-mail stream and effectively pushing documents according to those interests. It involves semantic understanding, text classification, document sharing and e-mail streams.
Background Art
The research fields of different members of a research team usually overlap. On the one hand, members often repeat the same work to obtain the same document, which wastes manpower and money. On the other hand, members often exchange information by e-mail and sometimes send valuable documents to other members as attachments. This achieves document sharing among members to some extent, but the following problems remain:
First, there is no guarantee that every member is willing to send other members the documents they need, so the repeated work that team members do to obtain the same documents cannot be fundamentally avoided.
Second, even if every member is willing to send other members the documents they need, the following situations still occur. A member's interests often change over time; other members may not notice the change, and may keep sending documents that are no longer needed while failing to send newly needed ones. In addition, one member can hardly grasp the interests of all other members accurately, so a document cannot be pushed to every member who needs it, and documents cannot be fully shared.
To fully share scientific and technical documents within the team, the method first extracts each team member's interests in scientific research and then periodically pushes related documents to members according to those interests. Accurate extraction of member interests is the basis for fully sharing technical documents among team members. An e-mail stream forms among team members as they send and receive e-mails, and the e-mails each member sends and receives reflect the problems that member is concerned with, so member interests can be extracted from the e-mail stream. The invention extracts user interest from the e-mail stream among team members using the existing e-mail function, which secures the precondition for fully sharing documents among team members. The basic idea is that the areas in which a member's sent and received e-mails are concentrated are the areas in which the member's research work is concentrated. First, the e-mails among members are stored in a database, and interference from spam is eliminated in the process. Next, a natural language learning method is used to obtain the valid e-mails, that is, those that provide useful information for describing user interest. Then the research fields related to the team are divided into smaller sub-fields, and the valid e-mails are classified on that basis. Finally, according to the distribution of the valid e-mails over the sub-fields, user interest is expressed as the set of sub-fields the member is concerned with.
Because user interest may change over a long period, a time factor is introduced into the interest extraction process, so that user interest is updated promptly as new e-mails arrive and time passes. Pushing documents according to this up-to-date interest ensures that documents always reach all team members who need them, without being sent in error or omitted.
The invention uses the interest point sets that describe sub-field semantics as templates, assigns each document to the sub-field that is semantically similar to it, and pushes the document to the users concerned with that sub-field. This ensures that the pushed documents are semantically what the users need, and that pushing is accurate and effective.
A team member who wants to share a document with other members only needs to upload it to the team's document database; the document is then understood and pushed automatically. Most team members can accept such a simple upload operation, so document sharing among team members is largely achieved and tedious repeated work is avoided.
Disclosure of Invention
The invention aims to provide a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest, thereby making effective use of team resources and fully sharing scientific and technical documents among team members. The method comprises the following steps: first, the e-mails among team members are stored in a database; then user interests are extracted from the e-mail stream among team members, and when the problems a member is concerned with change, the member's interests are updated promptly along with the e-mails the member sends and receives, so that the correct documents can be pushed to the member; next, semantic analysis is performed on the documents in the team document database; finally, on the basis of that analysis, documents consistent with user interests are pushed to team members.
The method mainly comprises the following steps. E-mails among team members are forwarded to a fixed account through a function provided by the e-mail server program; a mail collection program is run periodically to decode the e-mails in that account and store the results in an e-mail database, which completes the automatic storage of e-mails. Most spam comes from unfamiliar e-mail addresses, so it is excluded in this process. Only the members' interests in scientific research are considered: a natural language learning method divides the e-mails among members into valid and invalid e-mails, and user interest is extracted from the valid e-mails only, which ensures its accuracy. Each research field related to the team is subdivided into sub-fields, and the background knowledge and semantics of each sub-field are represented by its prior knowledge set and its interest point set. Valid e-mails are classified by computing their similarity to the prior knowledge sets. The sub-fields in which a user's valid e-mails are concentrated are the sub-fields in which the user's research work is concentrated, so user interest is extracted from the distribution of the user's valid e-mails over the sub-fields and is expressed as the set of sub-fields the user is concerned with. User interest may change over time, and the power of an e-mail to describe user interest decreases as the e-mail ages; time is therefore introduced into the extraction process, so that the user's interest is adjusted promptly when the focus of the user's work shifts. Documents can thus always be pushed to all team members who need them, without being sent in error or omitted, which secures the precondition for fully sharing scientific and technical documents among team members. Using the interest point sets that describe sub-field semantics as templates, each document is assigned to a sub-field according to its semantic similarity to each sub-field, and is then pushed to the users whose set of sub-fields of interest contains the document's sub-field, so that the pushed documents are semantically what the users need. A team member only needs to upload a document to the team's document database for it to be understood and pushed; most team members can accept such a simple upload operation, which makes document sharing among team members simple and feasible.
Technical scheme
The invention discloses a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest. First, each research field related to the team is subdivided into sub-fields, and for each sub-field a prior knowledge set representing its background knowledge and an interest point set describing its semantics are constructed. An e-mail collection program is run periodically to store the e-mails among team members in an e-mail database, and the valid e-mails, which provide useful information for describing user interest, are extracted from that database; team members can also upload valuable scientific documents to a document database. Then each valid e-mail is assigned to the sub-field whose prior knowledge set it is most similar to, user interest is extracted from the distribution of the valid e-mails over the sub-fields, and the documents in the document database are semantically analyzed and classified using the interest point sets of the sub-fields as templates. Finally, the document pushing program pushes the documents consistent with user interest to team members, according to the user interests and the document classification results.
The scheme mainly comprises the following technical features:
1. Automatic storage of e-mails between team members
First, an e-mail database is constructed in which each record stores one e-mail, and the e-mails among team members are automatically forwarded to a fixed account by the e-mail server program. Then a mail collection program is run periodically to decode the e-mails in the fixed account and store the results in the e-mail database, which realizes automatic storage of e-mails. Spam usually comes from unfamiliar e-mail addresses; because only e-mails between members are saved, the automatic storage process itself filters out spam.
2. Extracting valid emails
The invention is concerned only with users' interests in scientific research, so only e-mails whose content relates to research work are valid. Valid e-mails, which provide useful information for describing user interest, are extracted from the e-mail database by a natural language learning method.
3. Refining scientific research field division, establishing prior knowledge set and interest point set of sub-fields
The research fields of the team are subdivided to obtain the set of sub-fields related to the team. A prior knowledge set and an interest point set are established for each sub-field to represent its background knowledge and its semantics, respectively. Each element of the prior knowledge set consists of a keyword representing the main content of the sub-field and the influence factor (descriptive power) of that keyword for the sub-field. The interest point set consists of the semantic chain networks corresponding to the interest points contained in the sub-field; one semantic chain network describes the semantic information of one interest point.
The prior knowledge set of a sub-field expresses its background knowledge. Valid e-mails are classified by computing their similarity to the prior knowledge set of each sub-field, and user interest is expressed as the set of sub-fields the member is concerned with, according to the distribution of the valid e-mails over the sub-fields.
The interest point set describes the semantics of a sub-field. Using the interest point sets as templates, each document is assigned to the sub-field that is semantically similar to it, and the document pushing program pushes the document to the members concerned with that sub-field, which ensures that a document pushed to a user is semantically what the user needs. Team members only need to upload a document to the team's document database for the program to complete the pushing, which is simple and easy to do.
4. Obtaining user interest from classification results of valid emails
The sub-field to which each valid e-mail belongs is determined by matching the e-mail against the prior knowledge sets of the sub-fields, which classifies the valid e-mails. On the basis of this classification, the set of sub-fields a member is currently concerned with is determined from the distribution of the valid e-mails related to that member, and user interest is expressed by this set. The basic idea is that the sub-fields in which a user's e-mails are concentrated are also the sub-fields in which the user's research work is concentrated.
5. Timely updating user interests
Because user interest may change over a long period, the method introduces a time factor into the extraction of user interest. The user's interest is updated promptly as new e-mails arrive and time passes, and is adjusted when the problems the user is concerned with change. Pushing documents according to this interest ensures that documents always reach all team members who need them, without being sent in error or omitted.
6. Judging the sub-domain of the document according to semantic analysis
The documents in the document database are semantically analyzed using the interest point sets of the sub-fields as templates, and each document is assigned to the sub-field that is semantically similar to it, which ensures the accuracy of document classification at the semantic level. Documents newly added to the document database are analyzed and classified at regular intervals.
7. Pushing documents according to user interests and document classification results
The document pushing program is run periodically; according to each user's current interest, it pushes the documents in the document database that are consistent with that interest to the corresponding team members by e-mail. Pushing according to user interest ensures that the correct documents always reach team members. Because what is pushed is the result of semantic analysis of the documents rather than a simple keyword match, the pushed documents are semantically what the users need.
Drawings
FIG. 1 is a flow chart of a method for discovering user interests in an email stream and effectively pushing documents according to the method.
FIG. 2 is a representation of a semantic chain network and its adjacency matrix of the present invention.
FIG. 3 is a flow chart of document understanding of the present invention.
Detailed Description
The invention discloses a method for discovering user interest in an e-mail stream and effectively pushing documents according to that interest. The method subdivides each research field related to the team into smaller sub-fields and establishes a prior knowledge set and an interest point set for each sub-field to represent its background knowledge and its semantics, respectively; user interest is the set of sub-fields the user is concerned with. First, the e-mails among team members are stored in an e-mail database, and the valid e-mails, whose content relates to scientific research, are extracted from them. Then each valid e-mail is assigned to the sub-field whose prior knowledge set it is most similar to, which classifies the valid e-mails. From the classification result, the proportion of the valid e-mails sent and received by each member that falls in each sub-field is calculated, and the sub-fields whose proportion exceeds a threshold are added to the set of sub-fields the user is concerned with, which gives the user interest. Meanwhile, the documents in the team document database are semantically analyzed using the interest point sets of the sub-fields as templates, and each document is assigned to the sub-field that is semantically similar to it. Finally, the document pushing program pushes relevant documents to users according to their interests: each document in the document database is sent, as an e-mail attachment, to the users whose set of sub-fields of interest contains the sub-field to which the document belongs.
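The final matching step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `plan_pushes` and the two dictionaries are hypothetical, and the actual sending of attachments by e-mail is left out.

```python
def plan_pushes(user_interests, document_subfields):
    """Map each user to the documents whose sub-field is in the user's interest set.

    user_interests:     user name -> set of sub-fields the user is concerned with
    document_subfields: document name -> sub-field assigned by semantic analysis
    Both structures are hypothetical; the patent does not prescribe a data layout.
    """
    return {
        user: sorted(
            doc for doc, subfield in document_subfields.items()
            if subfield in subfields
        )
        for user, subfields in user_interests.items()
    }
```

A document whose sub-field appears in no user's interest set is simply not pushed to anyone.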
Fig. 1 is a flow chart of the implementation of the invention, which mainly includes the following four parts:
First, automatic e-mail storage and valid e-mail extraction
1. Building an email database
Team members use a unified e-mail server and server program (e.g., WebEasyMail). A database file (e.g., mail.mdb, hereinafter referred to as the e-mail database) is created under a directory of the e-mail server (e.g., F:\database, hereinafter referred to as the database directory) to save the e-mails exchanged between team members. Each e-mail is stored as one record in the e-mail database and contains six fields, whose names and meanings are as follows:
Sender: e-mail address of the sender
Receiver: e-mail address of the receiver
Cc: e-mail addresses copied on the message
Sending time: time at which the e-mail was sent
Subject: subject of the e-mail
Body: text content of the e-mail; content longer than 255 characters is stored as an OLE (object linking and embedding) object
2. Automatic e-mail storage
First, all e-mails between team members are automatically forwarded to a fixed account (e.g., an account with the user name group) through a service provided by WebEasyMail. The mail for this account is stored in a fixed directory on the mail server (e.g., C:\WebEasyMail\mail\group, hereinafter referred to as the undecoded mail directory). Spam in the traditional sense generally comes from e-mail addresses unfamiliar to the users; because only the e-mails among team members are collected here, interference from spam in the user interest extraction process is eliminated.
A mail collection program (e.g., MailGather) is then run periodically (e.g., once a day) to store the e-mails automatically. The program reads each e-mail in the undecoded mail directory in turn, parses the mail header, decodes the mail body, and stores the decoded information in the corresponding fields of the e-mail database file. Each processed e-mail is moved to another directory of the e-mail server (e.g., C:\WebEasyMail\mail\group_deleted, hereinafter referred to as the decoded mail directory), so that it is not processed again the next time MailGather runs.
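The collection step can be sketched in Python as follows. This is a minimal sketch and not the MailGather program itself: it assumes each file in the undecoded mail directory holds one RFC 5322 message, and it uses SQLite in place of the .mdb file; the function name `gather_mail`, the table name and the column names are hypothetical.

```python
import sqlite3
from email import message_from_bytes, policy
from pathlib import Path


def gather_mail(undecoded_dir: Path, decoded_dir: Path, db: sqlite3.Connection) -> int:
    """Decode each message file, store it as one record, then move the file away.

    Returns the number of e-mails stored in this run.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS email ("
        "sender TEXT, receiver TEXT, cc TEXT, sent_time TEXT, subject TEXT, body TEXT)"
    )
    decoded_dir.mkdir(parents=True, exist_ok=True)
    stored = 0
    for path in sorted(undecoded_dir.iterdir()):
        if not path.is_file():
            continue
        # Parse the header and decode the body (MIME and charset handling).
        msg = message_from_bytes(path.read_bytes(), policy=policy.default)
        part = msg.get_body(preferencelist=("plain",))
        body = part.get_content() if part is not None else ""
        db.execute(
            "INSERT INTO email VALUES (?, ?, ?, ?, ?, ?)",
            (
                str(msg["From"] or ""), str(msg["To"] or ""), str(msg["Cc"] or ""),
                str(msg["Date"] or ""), str(msg["Subject"] or ""), body,
            ),
        )
        # Move the processed file so the next run does not see it again.
        path.rename(decoded_dir / path.name)
        stored += 1
    db.commit()
    return stored
```

Running the function twice on the same directories stores nothing the second time, which mirrors the role of the decoded mail directory.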
3. Extracting valid emails
Although spam in the traditional sense has been filtered out in the previous step, not every e-mail stored in the e-mail database provides useful information for describing user interest. An e-mail that reflects the user's interest is called a valid e-mail, and one that does not is called an invalid e-mail. Because only the members' interests in scientific research are considered, e-mails related to the team's research content are valid, while the jokes, activity notices and the like that team members frequently send one another are invalid. To obtain accurate user interest, the valid e-mails must be extracted from the e-mail database, which is done by a natural language learning method.
First, a certain number of valid e-mails and invalid e-mails are selected as the valid e-mail training set $C_1$ and the invalid e-mail training set $C_2$, respectively, and the standard vectors $\vec{c}_1$ and $\vec{c}_2$ of valid and invalid e-mails are obtained by the following formulas:

$$\vec{c}_1 = 16\,\frac{1}{|C_1|}\sum_{e \in C_1}\frac{\vec{e}}{|\vec{e}|} - 4\,\frac{1}{|C_2|}\sum_{e \in C_2}\frac{\vec{e}}{|\vec{e}|} \tag{1}$$

$$\vec{c}_2 = 16\,\frac{1}{|C_2|}\sum_{e \in C_2}\frac{\vec{e}}{|\vec{e}|} - 4\,\frac{1}{|C_1|}\sum_{e \in C_1}\frac{\vec{e}}{|\vec{e}|} \tag{2}$$

where $\vec{e} = (e_1, e_2, \ldots, e_{|F|})$ is the vector representation of an e-mail $e$, and $e_i$ is the number of occurrences of the keyword $w_i$ in the subject and body of the e-mail; $|\vec{e}|$ is the length of the vector $\vec{e}$; $|C_1|$ and $|C_2|$ are the sizes of $C_1$ and $C_2$, i.e., the number of e-mails they contain. Then the similarity between the vector representation $\vec{e}$ of each e-mail $e$ in the e-mail database and the standard vectors $\vec{c}_1$ and $\vec{c}_2$ is calculated as follows:

$$\cos(\vec{e}, \vec{c}_n) = \frac{\sum_{i=1}^{|F|} e_i c_i}{\sqrt{\sum_{i=1}^{|F|} e_i^2}\,\sqrt{\sum_{i=1}^{|F|} c_i^2}} \tag{3}$$

where $n = 1$ or $n = 2$, and $c_i$ is the $i$-th component of $\vec{c}_n$.
If $\cos(\vec{e}, \vec{c}_1) > \cos(\vec{e}, \vec{c}_2)$, $e$ is a valid e-mail; otherwise $e$ is an invalid e-mail. This yields the valid e-mails from which user interest is extracted.
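Equations (1) to (3) can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the patent's implementation: the keyword list `KEYWORDS` and all function names are hypothetical, text is tokenized by whitespace only, and an all-zero e-mail vector is treated as contributing nothing to the sums and as having cosine similarity 0.

```python
import math
from collections import Counter

# Hypothetical keyword list; the patent does not say how the keywords are chosen.
KEYWORDS = ["grid", "semantic", "workflow", "lunch", "party"]


def to_vector(text):
    """e_i = number of occurrences of keyword w_i in the text (subject plus body)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in KEYWORDS]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def mean_unit(vectors):
    """(1/|C|) * sum over the training set of e/|e|; zero vectors add nothing."""
    total = [0.0] * len(KEYWORDS)
    for v in vectors:
        n = norm(v)
        if n > 0:
            total = [t + x / n for t, x in zip(total, v)]
    return [t / len(vectors) for t in total]


def standard_vectors(valid, invalid):
    """Equations (1) and (2): 16 * own-class mean minus 4 * other-class mean."""
    m1, m2 = mean_unit(valid), mean_unit(invalid)
    c1 = [16 * a - 4 * b for a, b in zip(m1, m2)]
    c2 = [16 * b - 4 * a for a, b in zip(m1, m2)]
    return c1, c2


def cosine(u, v):
    """Equation (3); defined as 0 here when either vector has zero length."""
    denom = norm(u) * norm(v)
    return sum(a * b for a, b in zip(u, v)) / denom if denom else 0.0


def is_valid(text, c1, c2):
    e = to_vector(text)
    return cosine(e, c1) > cosine(e, c2)
```

Under these assumptions, an e-mail containing none of the keywords gets similarity 0 to both standard vectors and is therefore classified as invalid, since the comparison is strict.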
Second, efficient email classification and user interest extraction
Each research field related to the team is divided into smaller sub-fields, and the background knowledge of a sub-field $nd_i$ is represented by its prior knowledge set $K_i$. $K_i$ is a set of pairs $(n_k, \alpha_k)$, where $n_k$ is one of a group of keywords that together reflect the main content of $nd_i$, and $\alpha_k$ is the weight of $n_k$, representing the power of $n_k$ to describe $nd_i$: the higher $\alpha_k$, the stronger the descriptive power of $n_k$.
Valid e-mails are classified by computing their similarity to the prior knowledge set of each sub-field, and user interest is expressed as the set of sub-fields the member is concerned with, according to the distribution of the valid e-mails over the sub-fields.
First, the probability that the content of each valid e-mail $e$ relates to sub-field $nd_i$ is computed:

$$\mathrm{Sim}(e, K_i) = \sum_{k=1}^{R} \alpha_k\, f(S_{kl}) / N \tag{4}$$

where $n_k$ is a keyword of $K_i$ that occurs in the subject and body of the e-mail; $D_l$ is the set of such keywords $n_k$, so that $(n_k, \alpha_k) \in K_i$ and $n_k \in D_l$; $S_{kl}$ is the number of occurrences of the keyword $n_k$ in those parts of the e-mail, and $f(S_{kl}) = \tanh(S_{kl}/3)$; $R$ and $N$ are the numbers of elements of $D_l$ and $K_i$, respectively.
Then $e$ is assigned to the sub-field with the highest probability, which classifies the valid e-mails.
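The similarity in equation (4) and the resulting classification can be sketched as follows. Several points are assumptions made for illustration: f is read as tanh(S/3), keywords are single whitespace-separated tokens, each prior knowledge set is a plain dictionary from keyword to weight, and the names `sim` and `classify` are hypothetical.

```python
import math


def sim(text, prior):
    """Equation (4): sum of weight * tanh(count / 3) over the keywords of K_i
    that occur in the text, divided by N = |K_i|.

    `prior` maps keyword -> weight (a hypothetical layout for K_i).
    """
    words = text.lower().split()
    score = 0.0
    for keyword, weight in prior.items():
        count = words.count(keyword)
        if count > 0:  # only keywords present in the e-mail (the set D_l) contribute
            score += weight * math.tanh(count / 3)
    return score / len(prior)


def classify(text, priors):
    """Assign the e-mail to the sub-field with the highest similarity."""
    return max(priors, key=lambda name: sim(text, priors[name]))
```

The tanh squashing means that repeating a keyword many times raises the score only up to the keyword's weight, so one heavily repeated word cannot dominate the similarity.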
Generally speaking, most of the valid e-mails a user sends or receives are concentrated in a few sub-fields, and the user's research work should be concentrated in the same sub-fields. That is, the sub-fields in which the user's valid e-mails are concentrated are the sub-fields of interest in the user's research work, and member interest is represented by the set of these sub-fields. The percentage of the user's research work related to each sub-field can therefore be calculated from the classification results of the valid e-mails.
Then, the research work to calculate user i involves the percentage per of the sub-domain jij
<math> <mrow> <msub> <mi>per</mi> <mi>ij</mi> </msub> <mo>=</mo> <mfrac> <mrow> <mi>&alpha;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mrow> <mo>(</mo> <msub> <mrow> <mi>e</mi> <mo>&Element;</mo> <mi>nd</mi> </mrow> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>&cap;</mo> <mrow> <mo>(</mo> <msub> <mrow> <mi>e</mi> <mo>&Element;</mo> <mi>from</mi> </mrow> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mi>&beta;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>nd</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>&cap;</mo> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>to</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> </mrow> <mrow> <mi>&alpha;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>from</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mi>&beta;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>to</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> 
<mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> </mrow> </mfrac> <mo>&times;</mo> <mn>100</mn> <mo>%</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>5</mn> <mo>)</mo> </mrow> </mrow> </math>
where per_ij is the percentage of user i's research work that relates to sub-domain j; from_i is the set of valid e-mails sent by user i, and to_i is the set of valid e-mails received by user i; α = 1 and β = 0.8 represent how well the valid e-mails sent by the user and the valid e-mails received by the user, respectively, describe the user's interests. The factor <math> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> </math> makes the descriptive power of an e-mail decrease as the e-mail ages, where age(e) is the difference between the current date and the date on which the e-mail was sent, and hl = 30, meaning that an e-mail sent 30 days ago has half the descriptive power of an e-mail sent today.
How well a valid e-mail received from another member describes the user's interests depends on how well the sender understands the user's scientific research work, whereas the valid e-mails sent by the user normally reflect the user's research interests correctly; the valid e-mails sent by the user are therefore given greater descriptive power. Because a user's research focus may change over a longer period, the descriptive power of an e-mail should also decrease as the e-mail ages, which is achieved by introducing the factor <math> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> </math> into the formula;
Finally, if per_ij is greater than the threshold, sub-domain nd_j is added to the set of sub-domains of interest to user i, where the threshold is 10%.
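The interest computation of formula (5) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Email` record, its field names, and the helper functions are hypothetical, and each e-mail is assumed to already carry its similarity sim(e, K_j) to every sub-domain and the sub-domain it was classified into. Only α = 1, β = 0.8, hl = 30 and the 10% threshold come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

ALPHA = 1.0               # descriptive power of e-mails sent by the user (from the text)
BETA = 0.8                # descriptive power of e-mails received by the user (from the text)
HALF_LIFE_DAYS = 30.0     # hl in formula (5)
THRESHOLD_PERCENT = 10.0  # threshold on per_ij


@dataclass
class Email:
    """Hypothetical record for one valid e-mail."""
    age_days: float        # age(e): days between today and the sending date
    sim: Dict[str, float]  # sim(e, K_j) for each sub-domain j
    subdomain: str         # sub-domain the e-mail was classified into


def decay(age_days: float, half_life: float = HALF_LIFE_DAYS) -> float:
    """Time factor 2^(-age/hl): halves every `half_life` days."""
    return 2.0 ** (-age_days / half_life)


def _weighted_sum(emails: Iterable[Email], j: str, only_classified_in_j: bool) -> float:
    # Sum of decay * sim(e, K_j), optionally restricted to e-mails classified into j.
    return sum(
        decay(e.age_days) * e.sim.get(j, 0.0)
        for e in emails
        if not only_classified_in_j or e.subdomain == j
    )


def interest_percentage(sent: List[Email], received: List[Email], j: str) -> float:
    """per_ij of formula (5), in percent; 0.0 when the denominator is zero."""
    numerator = ALPHA * _weighted_sum(sent, j, True) + BETA * _weighted_sum(received, j, True)
    denominator = ALPHA * _weighted_sum(sent, j, False) + BETA * _weighted_sum(received, j, False)
    return 100.0 * numerator / denominator if denominator > 0 else 0.0


def subdomains_of_interest(sent: List[Email], received: List[Email],
                           subdomains: Iterable[str]) -> Set[str]:
    """Sub-domains whose per_ij exceeds the threshold."""
    return {j for j in subdomains
            if interest_percentage(sent, received, j) > THRESHOLD_PERCENT}
```

Returning 0.0 for an empty denominator is a choice made for this sketch; the text does not say how a user with no valid e-mails is handled.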
Third, document understanding and classification
A basic concept, viewpoint, or method is called a point of interest, and a semantic chain network (SG) is used to represent the semantic information of a point of interest. The interest point set SG-set_i of sub-domain nd_i describes all the semantics implied by nd_i; its elements are the semantic chain networks corresponding to the points of interest contained in nd_i. Using the interest point sets of the sub-domains as templates, each document is classified into the sub-domains with similar semantics;
SG = (N, R), where N is the set of nodes, comprising one point of interest N_1 and a group of keywords {N_2, N_3, ..., N_m} that together represent the semantics of N_1; R is the set of directed arcs, which represent causal relationships between nodes.
FIG. 2(a) shows a semantic chain network. The directed arc that starts at N_i and ends at N_j represents the causal relationship from N_i to N_j; its weight w_ij indicates the degree of influence of the cause node N_i on the result node N_j, with w_ij ∈ [-1, +1].
FIG. 2(b) is the adjacency matrix representation of the semantic chain network, an n × n matrix, where n is the number of nodes in the semantic chain network. If there is a causal relationship from N_i to N_j, the element in row i, column j of the adjacency matrix is w_ij; otherwise it is 0.
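The adjacency matrix representation described above can be illustrated with a short sketch. The function name `adjacency_matrix` and the arc dictionary layout are assumptions made for this example; node indices are 1-based to match N_1, ..., N_n in the text.

```python
from typing import Dict, List, Tuple


def adjacency_matrix(n: int, arcs: Dict[Tuple[int, int], float]) -> List[List[float]]:
    """Build the n x n adjacency matrix of a semantic chain network.

    `arcs` maps a pair (i, j) of 1-based node indices to the weight w_ij
    of the directed arc from cause node N_i to result node N_j.
    """
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), weight in arcs.items():
        if not (1 <= i <= n and 1 <= j <= n):
            raise ValueError(f"node index out of range: ({i}, {j})")
        if not -1.0 <= weight <= 1.0:
            raise ValueError(f"weight w_{i}{j} must lie in [-1, +1], got {weight}")
        matrix[i - 1][j - 1] = weight  # row i, column j holds w_ij
    return matrix


# Illustrative network: keywords N_2 and N_3 both influence the point of interest N_1.
example = adjacency_matrix(3, {(2, 1): 0.8, (3, 1): 0.5, (3, 2): -0.2})
```

The weights in `example` are made up for illustration; positions without an arc stay 0, as the text specifies.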
FIG. 3 is a flowchart of document understanding and partitioning, including the following steps:
S3-1. Select a document d from the team document database;
S3-2. Select a sub-domain nd_i and obtain the corresponding interest point set SG-set_i;
S3-3. Calculate the semantic similarity md(d, nd_i) between document d and sub-domain nd_i:
S3-3.1 Divide document d into several small parts p_1, p_2, ..., p_m; the division can be made by number of bytes or by paragraph. In this method the division is by paragraph;
S3-3.2 For each small part p_j, let md_Part-ji = 0;
S3-3.3 For each element SG_r of the interest point set SG-set_i of sub-domain nd_i:
(1) Calculate the state value V_k′ in p_j of each keyword N_k contained in SG_r: V_k′ = tanh(S_k/3), where S_k is the number of occurrences of N_k in p_j;
(2) (V_1′, V_2′, ..., V_m′) = (0, V_2′, ..., V_m′) × E_r, where E_r is the adjacency matrix representation of SG_r;
(3) If md_Part-ji < V_1, then md_Part-ji = V_1;
S3-3.4 <math> <mrow> <mi>md</mi> <mrow> <mo>(</mo> <mi>d</mi> <mo>,</mo> <msub> <mi>nd</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>m</mi> </munderover> <msub> <mi>md</mi> <mrow> <mi>Part</mi> <mo>-</mo> <mi>ji</mi> </mrow> </msub> <mo>/</mo> <mi>m</mi> <mo>,</mo> </mrow> </math> where the document d is divided into m small parts;
S3-4. If md(d, nd_i) > 0.65, classify document d into sub-domain nd_i. Go to S3-2.
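Steps S3-3.1 to S3-4 can be sketched as follows. This is an illustrative reading of the steps, under several assumptions that are not in the text: paragraphs are separated by blank lines, keyword occurrences are counted with a case-insensitive substring match, and V_1 in step (3) is taken to be the first component of the product in step (2). The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SemanticChainNetwork:
    """Hypothetical container for one SG_r."""
    keywords: List[str]           # N_2..N_m, the keywords describing point of interest N_1
    adjacency: List[List[float]]  # m x m matrix E_r; row/column 0 corresponds to N_1


def part_score(part: str, sg: SemanticChainNetwork) -> float:
    """Steps (1)-(2): state vector (0, V_2', ..., V_m') times E_r, first component."""
    text = part.lower()
    state = [0.0] + [math.tanh(text.count(k.lower()) / 3.0) for k in sg.keywords]
    # First component of a row vector times a matrix is a dot product with column 0.
    return sum(v * row[0] for v, row in zip(state, sg.adjacency))


def semantic_similarity(document: str, sg_set: List[SemanticChainNetwork]) -> float:
    """md(d, nd_i): mean over parts of the best score of any SG_r (steps S3-3.1 to S3-3.4)."""
    parts = [p for p in document.split("\n\n") if p.strip()]
    if not parts:
        return 0.0
    total = 0.0
    for part in parts:
        best = 0.0  # md_Part-ji starts at 0 (step S3-3.2)
        for sg in sg_set:
            best = max(best, part_score(part, sg))  # step (3)
        total += best
    return total / len(parts)


def belongs_to_subdomain(document: str, sg_set: List[SemanticChainNetwork],
                         threshold: float = 0.65) -> bool:
    """Step S3-4 with the 0.65 threshold from the text."""
    return semantic_similarity(document, sg_set) > threshold
```

Because each part contributes only its best-matching semantic chain network, a document that discusses one point of interest in depth can still score highly against a sub-domain with many points of interest.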
The method calculates the semantic similarity between each document and each sub-domain at the level of points of interest, so that documents are classified into the sub-domains with which they have high semantic similarity; one document may belong to several sub-domains at the same time. The team's document database contains a large number of existing technical documents and also receives documents uploaded by team members, so it grows continuously. The database is therefore checked periodically for newly added documents, and any that are found are classified into the corresponding sub-domains as described above.
Fourthly, effectively pushing documents according to user interests
Document understanding results in the documents in the team document database being classified into the sub-domains related to the team. A document push program (e.g., FileDeliver) is written that, based on user interests and the document classification results, selects appropriate documents from the team's document database and pushes them to team members. Because team members tend to read documents related to the sub-domains they are interested in, the program pushes each document, as an e-mail attachment, to the users whose set of sub-domains of interest includes a sub-domain to which the document belongs.
Each document in the document database has two lists, "sent people" and "upload people". The "sent people" list records the team members to whom the document has already been pushed; when FileDeliver runs, it pushes the document only to team members who do not appear in the document's "sent people" list. When a member uploads a document to the team document database, the upload succeeds if the document is not yet in the database; otherwise the member is informed that the upload is a duplicate. Whether or not the upload succeeds, the member is recorded in the document's "upload people" list. FileDeliver also does not push a document to members who appear in its "upload people" list, because a document that a member tried to upload must be one the member already has. Team members need only perform a simple upload to share a document among all the members who need it, so the method is simple and effective.
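The bookkeeping with the "sent people" and "upload people" lists can be sketched as follows. This is only an illustration of the rules stated above, not the FileDeliver program itself: the `Document` class, the use of the file name as document identity, and passing the sub-domains in at upload time (rather than computing them by classification) are assumptions, and actual e-mail delivery is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Document:
    """Hypothetical record for one document in the team document database."""
    name: str
    subdomains: Set[str]                                  # sub-domains the document was classified into
    sent_people: Set[str] = field(default_factory=set)    # members already sent the document
    upload_people: Set[str] = field(default_factory=set)  # members who uploaded (or tried to)


def recipients(doc: Document, interests: Dict[str, Set[str]]) -> List[str]:
    """Members interested in one of the document's sub-domains and in neither list."""
    return sorted(
        member for member, subdomains in interests.items()
        if subdomains & doc.subdomains
        and member not in doc.sent_people
        and member not in doc.upload_people
    )


def push(doc: Document, interests: Dict[str, Set[str]]) -> List[str]:
    """Select recipients and record them; sending the attachment is left out of this sketch."""
    targets = recipients(doc, interests)
    doc.sent_people.update(targets)
    return targets


def upload(database: Dict[str, Document], name: str,
           subdomains: Iterable[str], member: str) -> bool:
    """Return True for a new document, False for a duplicate; record the member either way."""
    is_new = name not in database
    doc = database.setdefault(name, Document(name, set(subdomains)))
    doc.upload_people.add(member)
    return is_new
```

Running `push` twice on the same document returns an empty list the second time, which is how repeated sending is avoided.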

Claims (10)

1. A method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest, comprising the following steps: first, storing the e-mails exchanged among team members in an e-mail database and extracting the valid e-mails from them; then, extracting user interests according to the distribution of the valid e-mails, and classifying the documents in the team document database through semantic analysis; and finally, according to the user interests and the document classification results, pushing documents consistent with each member's interests to the team members by e-mail.
2. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 1, wherein the e-mails among team members are decoded by an e-mail collection program and the decoded contents are stored in the e-mail database; automatic storage of the e-mails is achieved by running the e-mail collection program periodically; and, because spam mostly comes from unfamiliar e-mail addresses and the method considers only e-mails among the members, interference from spam is eliminated when user interests are extracted.
3. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 1, wherein a natural language learning method is used to obtain the valid e-mails, which provide useful information for describing user interests, and only the interests of team members in their research and development work are considered, thereby ensuring the accuracy of the extracted user interests.
4. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 1, wherein the research domain related to the team is subdivided into sub-domains; a prior knowledge set is established for each sub-domain to represent its background knowledge; the valid e-mails are classified by calculating the similarity between each valid e-mail and the prior knowledge set of each sub-domain; and user interest is represented by the set of sub-domains of interest to the member, determined from the distribution of the valid e-mails across the sub-domains.
5. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 1, wherein, considering that user interests may change over a longer period, a time factor is introduced into the interest extraction process so that user interests are updated promptly as new e-mails are generated and time passes; and pushing documents to users according to their interests ensures that each document is always pushed to all the team members who need it, with neither wrong deliveries nor missed deliveries.
6. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 1, wherein an interest point set describing the semantics of each sub-domain is constructed; documents are classified into the sub-domains with similar semantics using the interest point sets as templates; a document push program pushes each document to the members interested in the sub-domains to which the document belongs, which ensures that the documents pushed to a user are semantically what the user needs; and team members need only upload a document to the team document database for the program to complete the pushing, so the method is simple and easy to carry out.
7. A method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest, characterized by mainly comprising the following four parts:
first, automatic storage of e-mails and extraction of valid e-mails, wherein,
1. building an e-mail database
The team members use a common e-mail server and server program, and a database file is created in a directory on the e-mail server to store the e-mail information exchanged among the team members;
2. automatic e-mail storage
First, the mail server program automatically forwards all e-mails among the team members to a fixed account, and the mail of that account is stored in a fixed directory on the mail server; then, the e-mail collection program is run periodically to store the e-mails automatically, the program decoding each e-mail and storing the decoded result in the corresponding fields of the e-mail database;
3. extracting valid e-mails
The invention considers only the user's interests in scientific research work, and extracts, by a natural language learning method, the valid e-mails that provide useful information for describing user interests;
second, valid e-mail classification and user interest extraction
The research domain related to the team is divided into smaller sub-domains, and the background knowledge of each sub-domain nd_i is represented by its prior knowledge set K_i. K_i is a set of pairs (n_k, a_k), where n_k is one of a group of keywords that together reflect the main content of nd_i, and a_k is the weight of n_k, representing how well n_k describes nd_i: the higher a_k, the stronger the descriptive power of n_k;
the valid e-mails are classified by calculating the similarity between each valid e-mail and the prior knowledge set of each sub-domain, and user interest is represented by the set of sub-domains of interest to the member, determined from the distribution of the valid e-mails across the sub-domains;
third, document understanding and classification
A basic concept, viewpoint, or method is called a point of interest, and a semantic chain network (SG) represents the semantic information of a point of interest, where SG = (N, R); N is the set of nodes, comprising one point of interest N_1 and a group of keywords {N_2, N_3, ..., N_m} that together represent the semantics of N_1; R is the set of directed arcs, which represent causal relationships between nodes. The interest point set SG-set_i of sub-domain nd_i describes all the semantic information implied by nd_i; its elements are the semantic chain networks corresponding to the points of interest contained in nd_i. Using the interest point sets of the sub-domains as templates, each document is classified into the sub-domains with similar semantics;
fourthly, effectively pushing documents according to user interests
A document push program is written that pushes each document, as an e-mail attachment, to the users whose set of sub-domains of interest includes a sub-domain to which the document belongs. Each document has two lists, "sent people" and "upload people"; the document push program pushes the document only to team members who appear in neither list, which avoids repeated sending. Members need only upload a document to the team document database to share it among all the members who need it, so the method is simple and effective.
8. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 7, wherein, in the first part, automatic storage of e-mails and extraction of valid e-mails,
3. extracting valid emails
First, a certain number of valid e-mails and invalid e-mails are selected as the valid e-mail training set C_1 and the invalid e-mail training set C_2 respectively, and the standard vectors of valid e-mails and invalid e-mails, denoted <math> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>1</mn> </msub> </math> and <math> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>2</mn> </msub> </math>, are obtained by the following formulas:
<math> <mrow> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>1</mn> </msub> <mo>=</mo> <mn>16</mn> <mfrac> <mn>1</mn> <mrow> <mo>|</mo> <msub> <mi>C</mi> <mn>1</mn> </msub> <mo>|</mo> </mrow> </mfrac> <munder> <mi>&Sigma;</mi> <mrow> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>C</mi> <mn>1</mn> </msub> </mrow> </munder> <mfrac> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mrow> <mo>|</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>|</mo> </mrow> </mfrac> <mo>-</mo> <mn>4</mn> <mfrac> <mn>1</mn> <mrow> <mo>|</mo> <msub> <mi>C</mi> <mn>2</mn> </msub> <mo>|</mo> </mrow> </mfrac> <munder> <mi>&Sigma;</mi> <mrow> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>C</mi> <mn>2</mn> </msub> </mrow> </munder> <mfrac> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mrow> <mo>|</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>|</mo> </mrow> </mfrac> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>1</mn> <mo>)</mo> </mrow> </mrow> </math>
<math> <mrow> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>2</mn> </msub> <mo>=</mo> <mn>16</mn> <mfrac> <mn>1</mn> <mrow> <mo>|</mo> <msub> <mi>C</mi> <mn>2</mn> </msub> <mo>|</mo> </mrow> </mfrac> <munder> <mi>&Sigma;</mi> <mrow> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>C</mi> <mn>2</mn> </msub> </mrow> </munder> <mfrac> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mrow> <mo>|</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>|</mo> </mrow> </mfrac> <mo>-</mo> <mn>4</mn> <mfrac> <mn>1</mn> <mrow> <mo>|</mo> <msub> <mi>C</mi> <mn>1</mn> </msub> <mo>|</mo> </mrow> </mfrac> <munder> <mi>&Sigma;</mi> <mrow> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>C</mi> <mn>1</mn> </msub> </mrow> </munder> <mfrac> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mrow> <mo>|</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>|</mo> </mrow> </mfrac> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>2</mn> <mo>)</mo> </mrow> </mrow> </math>
where <math> <mrow> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>=</mo> <mrow> <mo>(</mo> <msub> <mi>e</mi> <mn>1</mn> </msub> <mo>,</mo> <msub> <mi>e</mi> <mn>2</mn> </msub> <mo>,</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mo>,</mo> <msub> <mi>e</mi> <mrow> <mo>|</mo> <mi>F</mi> <mo>|</mo> </mrow> </msub> <mo>)</mo> </mrow> </mrow> </math> is the vector representation of an e-mail e, and e_i is the number of occurrences of keyword w_i in the subject and body of the e-mail; <math> <mrow> <mo>|</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>|</mo> </mrow> </math> is the length of the vector <math> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> </math>; |C_1| and |C_2| are the numbers of training samples, i.e., the numbers of e-mails, contained in C_1 and C_2 respectively;
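then, the similarity between the vector representation <math> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> </math> of an e-mail e and each of the standard vectors <math> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>1</mn> </msub> </math> and <math> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>2</mn> </msub> </math> is calculated as follows: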
<math> <mrow> <mi>cos</mi> <mrow> <mo>(</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>,</mo> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mi>n</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <mfrac> <mrow> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>F</mi> <mo>|</mo> </mrow> </msubsup> <msub> <mi>e</mi> <mi>i</mi> </msub> <msub> <mi>c</mi> <mi>i</mi> </msub> </mrow> <mrow> <msqrt> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>F</mi> <mo>|</mo> </mrow> </msubsup> <msubsup> <mi>e</mi> <mi>i</mi> <mn>2</mn> </msubsup> </msqrt> <msqrt> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>F</mi> <mo>|</mo> </mrow> </msubsup> <msubsup> <mi>c</mi> <mi>i</mi> <mn>2</mn> </msubsup> </msqrt> </mrow> </mfrac> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>3</mn> <mo>)</mo> </mrow> </mrow> </math>
where n = 1 or n = 2;
if <math> <mrow> <mi>cos</mi> <mrow> <mo>(</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>,</mo> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>1</mn> </msub> <mo>)</mo> </mrow> <mo>></mo> <mi>cos</mi> <mrow> <mo>(</mo> <mover> <mi>e</mi> <mo>&RightArrow;</mo> </mover> <mo>,</mo> <msub> <mover> <mi>c</mi> <mo>&RightArrow;</mo> </mover> <mn>2</mn> </msub> <mo>)</mo> </mrow> </mrow> </math>, then e is a valid e-mail; otherwise e is an invalid e-mail.
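Formulas (1) to (3) and the decision rule of this claim can be illustrated with a short sketch. Formulas (1) and (2) have the shape of a Rocchio-style centroid classifier with weights 16 and 4; that label is an observation about their form, not a term used in the patent. The function names are hypothetical, e-mails are assumed to be already converted to keyword-count vectors of equal length, and skipping zero-length vectors is a choice made for this sketch.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalized_mean(samples: Sequence[Sequence[float]]) -> Vector:
    """(1/|C|) * sum of e/|e| over the training set C; zero vectors contribute nothing."""
    total = [0.0] * len(samples[0])
    for e in samples:
        length = norm(e)
        if length == 0:
            continue
        for i, x in enumerate(e):
            total[i] += x / length
    return [t / len(samples) for t in total]


def standard_vectors(valid: Sequence[Sequence[float]],
                     invalid: Sequence[Sequence[float]]) -> Tuple[Vector, Vector]:
    """Formulas (1) and (2): c1 = 16*m1 - 4*m2 and c2 = 16*m2 - 4*m1."""
    m1, m2 = normalized_mean(valid), normalized_mean(invalid)
    c1 = [16 * a - 4 * b for a, b in zip(m1, m2)]
    c2 = [16 * b - 4 * a for a, b in zip(m1, m2)]
    return c1, c2


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Formula (3); defined as 0.0 when either vector has zero length."""
    denominator = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denominator if denominator else 0.0


def is_valid_email(e: Sequence[float], c1: Vector, c2: Vector) -> bool:
    """Valid when the e-mail is closer to c1 than to c2."""
    return cosine(e, c1) > cosine(e, c2)
```

With a two-keyword vocabulary, training on valid samples that mostly use the first keyword and invalid samples that mostly use the second yields a c1 dominated by the first component, so a new e-mail with counts such as (4, 1) is classified as valid.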
9. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 7, wherein, in the second part, valid e-mail classification and user interest extraction,
first, the probability that a valid e-mail e relates to sub-domain nd_i is calculated:
<math> <mrow> <mi>Sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>R</mi> </munderover> <msub> <mi>&alpha;</mi> <mi>k</mi> </msub> <mi>f</mi> <mrow> <mo>(</mo> <msub> <mi>S</mi> <mi>kl</mi> </msub> <mo>)</mo> </mrow> <mo>/</mo> <mi>N</mi> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>4</mn> <mo>)</mo> </mrow> </mrow> </math>
where n_k is a keyword of K_i contained in the subject and body of the e-mail; D_l is the set of such n_k; S_kl is the number of occurrences of keyword n_k in those parts of the e-mail, and f(S_kl) = tanh(S_kl/3); R and N are the numbers of elements in D_l and K_i respectively. The e-mail e is assigned to the sub-domain with the highest probability, which classifies the valid e-mails;
then, the percentage per_ij of user i's research work that involves sub-domain j is calculated:
<math> <mrow> <msub> <mi>per</mi> <mi>ij</mi> </msub> <mo>=</mo> <mfrac> <mrow> <mi>&alpha;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mrow> <mo>(</mo> <msub> <mrow> <mi>e</mi> <mo>&Element;</mo> <mi>nd</mi> </mrow> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>&cap;</mo> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>from</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mi>&beta;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>nd</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>&cap;</mo> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>to</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> </mrow> <mrow> <mi>&alpha;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>from</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mi>&beta;</mi> <msub> <mi>&Sigma;</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>&Element;</mo> <msub> <mi>to</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </msub> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> <mi>sim</mi> <mrow> 
<mo>(</mo> <mi>e</mi> <mo>,</mo> <msub> <mi>K</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> </mrow> </mfrac> <mo>&times;</mo> <mn>100</mn> <mo>%</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>5</mn> <mo>)</mo> </mrow> </mrow> </math>
where α = 1 and β = 0.8 represent how well the valid e-mails sent by the user and the valid e-mails received by the user, respectively, describe the user's interests; the factor <math> <msup> <mn>2</mn> <mrow> <mo>-</mo> <mfrac> <mrow> <mi>age</mi> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> </mrow> <mi>hl</mi> </mfrac> </mrow> </msup> </math> makes the descriptive power of an e-mail decrease as the e-mail ages, where age(e) is the difference between the current date and the date on which the e-mail was sent, and hl = 30, meaning that an e-mail sent 30 days ago has half the descriptive power of an e-mail sent today; from_i is the set of valid e-mails sent by user i, and to_i is the set of valid e-mails received by user i; if per_ij is greater than the threshold, sub-domain nd_j is added to the set of sub-domains of interest to user i, where the threshold is 10%.
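Formula (4) and the assignment of an e-mail to its most probable sub-domain can be sketched as follows. The sketch assumes that α_k in formula (4) denotes the keyword weight a_k of the prior knowledge set, that keyword occurrences are counted by case-insensitive substring matching over the concatenated subject and body, and that each prior knowledge set is given as a dictionary from keyword to weight; the function names and the sample knowledge sets are hypothetical.

```python
import math
from typing import Dict


def sim(text: str, knowledge: Dict[str, float]) -> float:
    """Sim(e, K_i): sum of weight * tanh(count / 3) over matched keywords, divided by |K_i|."""
    if not knowledge:
        return 0.0
    lowered = text.lower()
    total = 0.0
    for keyword, weight in knowledge.items():
        count = lowered.count(keyword.lower())  # S_kl
        if count:                               # only keywords present in the e-mail (the set D_l)
            total += weight * math.tanh(count / 3.0)
    return total / len(knowledge)               # N = |K_i|


def classify(text: str, knowledge_sets: Dict[str, Dict[str, float]]) -> str:
    """Assign the e-mail to the sub-domain with the highest Sim value."""
    return max(knowledge_sets, key=lambda name: sim(text, knowledge_sets[name]))


# Illustrative prior knowledge sets; keywords and weights are made up.
KNOWLEDGE_SETS = {
    "grid": {"grid": 1.0, "semantic": 0.5},
    "p2p": {"peer": 1.0, "overlay": 0.8},
}
```

The tanh(count / 3) term saturates quickly, so repeating one keyword many times adds little beyond the first few occurrences; matching several different keywords of a knowledge set matters more.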
10. The method for discovering user interest in an e-mail stream and effectively pushing documents based on that interest according to claim 7, wherein the third part, document understanding and classification, comprises the following steps:
S3-1. Select a document d from the team document database;
S3-2. Select a sub-domain nd_i and obtain its interest point set SG-set_i;
S3-3. Calculate the semantic matching degree md(d, nd_i) between document d and sub-domain nd_i:
S3-3.1 Divide document d into several small parts p_1, p_2, …, p_m;
S3-3.2 For each small part p_j, let md_Part-ji = 0;
S3-3.3 For each element SG_r of the interest point set SG-set_i of sub-domain nd_i:
(1) Calculate the state value V_k′ in p_j of each keyword N_k contained in SG_r: V_k′ = tanh(S_k/3), where S_k is the number of occurrences of N_k in p_j;
(2) (V_1′, V_2′, ..., V_m′) = (0, V_2′, ..., V_m′) × E_r, where E_r is the adjacency matrix representation of SG_r;
(3) If md_Part-ji < V_1, then md_Part-ji = V_1;
S3-3.4 <math> <mrow> <mi>md</mi> <mrow> <mo>(</mo> <mi>d</mi> <mo>,</mo> <msub> <mi>nd</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>m</mi> </munderover> <msub> <mi>md</mi> <mrow> <mi>Part</mi> <mo>-</mo> <mi>ji</mi> </mrow> </msub> <mo>/</mo> <mi>m</mi> <mo>,</mo> </mrow> </math> where the document d is divided into m small parts;
S3-4. If md(d, nd_i) > 0.65, classify document d into sub-domain nd_i. Go to S3-2.
CN 200510009506 2005-02-22 2005-02-22 Method for discovering user interest in e-mail flow and transmitting document effectively Pending CN1645395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510009506 CN1645395A (en) 2005-02-22 2005-02-22 Method for discovering user interest in e-mail flow and transmitting document effectively

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510009506 CN1645395A (en) 2005-02-22 2005-02-22 Method for discovering user interest in e-mail flow and transmitting document effectively

Publications (1)

Publication Number Publication Date
CN1645395A true CN1645395A (en) 2005-07-27

Family

ID=34875381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510009506 Pending CN1645395A (en) 2005-02-22 2005-02-22 Method for discovering user interest in e-mail flow and transmitting document effectively

Country Status (1)

Country Link
CN (1) CN1645395A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication