CN104182549A

CN104182549A - E-mail digest generation method and device

Info

Publication number: CN104182549A
Application number: CN201410469526.4A
Authority: CN
Inventors: 张基恒; 魏进武; 李丹; 汤雅妃; 张呈宇
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2014-09-15
Filing date: 2014-09-15
Publication date: 2014-12-03

Abstract

The invention provides an e-mail digest generation method and device. The method comprises: converting the mail body into a sentence sequence and performing word segmentation; extracting, from the words obtained by segmentation, the words representing person names, times and places, and storing them in a keyword set; calculating a weight for each remaining word according to the frequency with which the word occurs in the mail body and the proportion of sentences containing the word among all sentences; storing the words whose weight exceeds a preset weight threshold in the keyword set; calculating, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the digest; and outputting as the digest the keywords whose probability exceeds a probability threshold, the keywords representing names, times and places, and the subject of the mail, in the order in which they appear in the mail body. The method and device are suitable for e-mail digest generation.

Description

An e-mail summary generation method and device

Technical field

The present invention relates to the field of networks, and in particular to an e-mail summary generation method and device.

Background technology

E-mail is a means of communication that exchanges information electronically, and it is the most widely used service on the Internet. Through a networked e-mail system, users can contact network users anywhere in the world cheaply and quickly, and with the rapid development of mobile terminals, users can send and receive mail anytime and anywhere.

An e-mail can take many forms, such as text, images and sound. Users can also receive large amounts of free news and topical mail, and can search for information easily. E-mail greatly facilitates communication between people and has promoted social development. With the development of the cloud, functions such as cloud storage of mail, attachment sharing across multiple terminals and cloud management have gradually been realized.

When a user sends and receives mail on a mobile terminal, pictures, large attachments and long mail bodies all cause considerable inconvenience. Taking iOS as an example, under ideal 3G conditions it takes 3 to 5 minutes to receive a mail with a 2 MB attachment; if the user is on a bus or subway, signal fluctuations can suspend or completely interrupt reception, which both wastes data traffic and disrupts the user's normal life and work.

If, when the user is on a mobile terminal, only a summary of each mail is sent so that the user can get a rough idea of its content, the user can pick out the important mails from the summaries and receive those, and receive the other mails later over a wireless network or on a computer, which effectively reduces data traffic. However, existing summary generation algorithms are usually designed for long documents, and the summaries they generate are built from sentences. Most mail bodies are of limited length, resemble traditional letters in form, and contain few sentences. If an existing algorithm is used to extract sentences as the summary, it may extract only a single sentence of the mail body, so that other important information is ignored. Existing algorithms therefore cannot extract an effective summary. In addition, a cloud system needs to extract summaries from mail in bulk, and existing summary generation algorithms are too complex to be suitable.

Summary of the invention

The technical problem to be solved by the present invention is to provide a summary generation scheme suitable for e-mail.

To solve the above problem, the invention provides an e-mail summary generation method, comprising:

S101: converting the mail body into a sentence sequence, and performing word segmentation;

S102: extracting, from the words obtained by segmentation, the words representing person names, times and places, and storing them in a keyword set;

S103: for each remaining word, calculating a weight for the word according to the frequency with which the word occurs in the mail body and the proportion of sentences containing the word among all sentences; and storing the words whose weight exceeds a preset weight threshold in the keyword set;

S104: calculating, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the summary; and finally outputting, as the summary, the keywords whose probability is higher than a preset probability threshold together with the keywords representing names, times and places and the subject of the mail, in the order of their appearance in the mail body.

Optionally, before the word segmentation step, the method further comprises:

determining whether some characters in the mail body have a special format and whether the proportion of the mail body taken up by the specially formatted characters is less than a preset ratio threshold, and if so, using the specially formatted characters as a component of the summary;

after the specially formatted characters are removed from the mail body, performing the word segmentation step on the remaining document.

Optionally, the weight is:

W_{f}(w_{i}) = F(w_{i}) \times \log\left(\frac{S}{S_{f}(w_{i})}\right)

where W_f(w_i) is the calculated weight of word i, F(w_i) is the frequency with which word i occurs in the mail body, S is the number of sentences in the sentence sequence, and S_f(w_i) is the number of sentences in the sentence sequence in which word i occurs.

Optionally, step S104 comprises:

matching the set W, obtained by removing the keywords representing names, times and places from the keyword set, against the feature word sets corresponding to the document classes; if fewer than a ratio threshold of the keywords in W appear in the feature word set of any document class, directly outputting the keywords in W, together with the keywords representing names, times and places and the subject of the mail, as the summary, in the order of their appearance in the mail body;

when no fewer than the ratio threshold of the keywords in W appear in the feature word set of document class C_i, marking the mail as being of type C_i, and calculating the probability that each keyword not appearing in the feature word set of class C_i matches the prediction dictionary F_1, F_2, ..., F_k:

P(w \in W | F_{1}, \ldots, F_{k}) = \frac{\prod_{i=1}^{k} P(F_{i} | w \in W) \times P(w \in W)}{\prod_{i=1}^{k} P(F_{i})}

where \prod denotes the product; w is any keyword in W that does not appear in the feature word set of class C_i; P(F_i) is the probability that F_i occurs in the prediction dictionary formed by the k feature words; P(F_i | w ∈ W) is the probability that F_i occurs when w ∈ W; and P(w ∈ W) is a constant. The keywords whose P(w ∈ W | F_1, ..., F_k) is greater than the probability threshold are selected and, together with the keywords in the keyword set that were successfully matched with the feature word set of class C_i, the keywords representing names, times and places, and the subject of the mail, are output as the summary in the order of their appearance in the mail body.

Optionally, the method further comprises:

when the mail body contains a picture, obtaining the resolution information of the terminal corresponding to the receiving address of the mail, and determining whether the picture, l pixels long by d pixels wide, exceeds the terminal's resolution of m × n;

if so, compressing the picture to the aspect ratio of the terminal's resolution, so that the length l' and width d' in pixels after compression satisfy:

\frac{l'}{d'} = \frac{m}{n};

if the compression ratios a and b are not within a preset range, adding to the summary only a notice that the body contains a picture; otherwise carrying the compressed picture in the summary generated in step S104; where

a = \frac{l}{l'}, \quad b = \frac{d}{d'}.

The invention also provides an e-mail summary generating device, comprising:

a conversion module, configured to convert the mail body into a sentence sequence and perform word segmentation;

a pre-selection module, configured to extract, from the words obtained by segmentation, the words representing person names, times and places, and store them in a keyword set;

a weight screening module, configured to calculate, for each remaining word, a weight according to the frequency with which the word occurs in the mail body and the proportion of sentences containing the word among all sentences, and to store the words whose weight exceeds a preset weight threshold in the keyword set;

a probability screening module, configured to calculate, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the summary, and finally to output, as the summary, the keywords whose probability is higher than a preset probability threshold together with the keywords representing names, times and places and the subject of the mail, in the order of their appearance in the mail body.

Optionally, the probability screening module is further configured to, before the conversion module performs word segmentation, determine whether some characters in the mail body have a special format and whether the proportion of the mail body taken up by the specially formatted characters is less than a preset ratio threshold, and if so, to use the specially formatted characters as a component of the summary;

the conversion module performs word segmentation on the remaining document after the specially formatted characters are removed from the mail body.

Optionally, the weight calculated by the weight screening module is:

W_{f} (w_{i}) = F (w_{i}) \times \log (\frac{S}{S_{f} (w_{i})})

Optionally, the probability screening module comprises:

a matching unit, a judging unit, an output unit and a computing unit;

the matching unit is configured to match the set W, obtained by removing the keywords representing names, times and places from the keyword set, against the feature word sets corresponding to the document classes;

the judging unit is configured to, when fewer than a ratio threshold of the keywords in W appear in the feature word set of any document class, instruct the output unit to output the keywords in W directly, together with the keywords representing names, times and places and the subject of the mail, as the summary, in the order of their appearance in the mail body; and, when no fewer than the ratio threshold of the keywords in W appear in the feature word set of document class C_i, to mark the mail as being of type C_i and instruct the computing unit to perform its calculation;

the computing unit is configured to calculate the probability that each keyword not appearing in the feature word set of class C_i matches the prediction dictionary F_1, F_2, ..., F_k:

P(w \in W | F_{1}, \ldots, F_{k}) = \frac{\prod_{i=1}^{k} P(F_{i} | w \in W) \times P(w \in W)}{\prod_{i=1}^{k} P(F_{i})}

where \prod denotes the product; w is any keyword in W that does not appear in the feature word set of class C_i; P(F_i) is the probability that F_i occurs in the prediction dictionary formed by the k feature words; P(F_i | w ∈ W) is the probability that F_i occurs when w ∈ W; and P(w ∈ W) is a constant;

the output unit is configured to, after the computing unit has finished its calculation, select the keywords whose P(w ∈ W | F_1, ..., F_k) is greater than the probability threshold and output them, together with the keywords in the keyword set that were successfully matched with the feature word set of class C_i, the keywords representing names, times and places, and the subject of the mail, as the summary in the order of their appearance in the mail body.

Optionally, the device further comprises:

a terminal acquisition module, configured to obtain, when the mail body contains a picture, the resolution information of the terminal corresponding to the receiving address of the mail;

a picture compression module, configured to determine whether the picture in the mail body, l pixels long by d pixels wide, exceeds the terminal's resolution of m × n; if so, to compress the picture to the aspect ratio of the terminal's resolution, so that the length l' and width d' in pixels after compression satisfy l'/d' = m/n; and, if the compression ratios a = l/l' and b = d/d' are not within a preset range, to add to the summary only a notice that the body contains a picture, and otherwise to carry the compressed picture in the summary generated by the summary generation module.

The technical scheme of the invention can extract summaries from e-mail conveniently and in batches, so that the user can quickly get a brief understanding of a mail when the signal is poor or the mail is large. The summary of the mail body is formed by extracting keywords and processing them. In a preferred scheme of the invention, pictures in the mail body are compressed according to the resolution of the identified terminal, which reduces their size and suits them better to the terminal.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the e-mail summary generation method of Embodiment 1;

Fig. 2 is a schematic diagram of the e-mail delivery process;

Fig. 3 is an architecture diagram of the mail summary generation engine in the example of Embodiment 1;

Fig. 4 is a schematic flowchart of mail summary generation in the example of Embodiment 1;

Fig. 5 is a flowchart of body summary generation in the example of Embodiment 1.

Embodiment

The technical scheme of the invention is described in detail below with reference to the drawings and embodiments.

It should be noted that, provided they do not conflict, the embodiments of the invention and the features in the embodiments may be combined with one another, and all such combinations fall within the protection scope of the invention. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

Embodiment 1 is an e-mail summary generation method which, as shown in Fig. 1, comprises steps S101 to S104 set out in the summary of the invention above.

This embodiment takes into account that most mail bodies are of limited length and that summaries may need to be extracted from mail in bulk, so a fast, simple method must be chosen. It therefore adopts a feature-based approach: the mail body text is treated as a linear sequence of sentences and each sentence as a linear sequence of words, and the summary information is obtained by extracting keywords and calculating word weights and frequencies.

In this embodiment, if a keyword in the summary occurs several times in the mail body, its position of first occurrence in the mail body is used when outputting in order of appearance.
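The ordering rule can be illustrated with a minimal Python sketch. The function name `order_by_first_occurrence` and the sample text are hypothetical and only serve the illustration; ties on the same start position are broken alphabetically, which is an assumption the text does not address.

```python
def order_by_first_occurrence(body: str, keywords: set[str]) -> list[str]:
    """Order keywords by the position of their first occurrence in the body."""
    present = [k for k in keywords if k in body]
    # str.find returns the index of the first occurrence, so repeated
    # occurrences later in the body do not affect the order.
    return sorted(present, key=lambda k: (body.find(k), k))


body = "Meeting with Li Lei on Friday in Beijing. Friday works for Li Lei."
ordered = order_by_first_occurrence(body, {"Friday", "Beijing", "Li Lei"})
```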

In this embodiment, steps S101 to S104 may be performed on a mail server or implemented in a cloud environment. For example, the mail summary generation and send/receive engine may be deployed at the server front end with its exchange architecture in the cloud, using cloud computing capacity to generate and send mail summaries in batches.

In this embodiment, existing text mining techniques may be used to convert the mail body into a sentence sequence, segment the document into words, and extract the words representing names, times and places. For example, punctuation marks may be used to convert the text of the mail into a sentence sequence, and a given Chinese vocabulary may be used to segment the document and extract the words representing names, times and places.
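As a rough illustration of these two steps, the following Python sketch splits a body on end-of-sentence punctuation and segments each sentence against a vocabulary. The text does not name a segmentation algorithm; forward maximum matching is used here purely as an assumption, and the toy vocabulary `VOCAB` and the sample sentence are hypothetical.

```python
import re

# Hypothetical toy vocabulary; the text only says a "given Chinese vocabulary" is used.
VOCAB = {"明天", "上午", "会议", "在", "北京", "举行", "请", "准时", "参加"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)


def split_sentences(body: str) -> list[str]:
    """Split text into sentences on common Chinese and Western end punctuation."""
    parts = re.split(r"[。！？!?；;\n]+", body)
    return [p.strip() for p in parts if p.strip()]


def segment(sentence: str) -> list[str]:
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in VOCAB:
                words.append(piece)
                i += size
                break
    return words


body = "明天上午会议在北京举行。请准时参加！"
sentences = split_sentences(body)
tokens = [segment(s) for s in sentences]
```

Extracting names, times and places would then be a lookup of each token in tagged sub-vocabularies, which this sketch leaves out.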

In one implementation of this embodiment, the special-format check described in the summary of the invention may also be performed before the word segmentation step.

In this implementation, the special format may include, but is not limited to, bold, italic, underline and strikethrough.
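The special-format check can be sketched as follows. The representation of the body as `(text, is_special)` runs, the function name and the threshold value 0.3 are all assumptions made for illustration; the text specifies only that the specially formatted share must be below a preset ratio threshold.

```python
def extract_special_format(runs, ratio_threshold=0.3):
    """Return (special_texts, remaining_text) for a body given as (text, is_special) runs.

    Specially formatted text is pulled into the summary only when it exists and
    makes up less than ratio_threshold of the body; otherwise nothing is pulled
    out and the whole body goes on to word segmentation.
    """
    total = sum(len(text) for text, _ in runs)
    special = [text for text, flagged in runs if flagged]
    special_len = sum(len(text) for text in special)
    if total and 0 < special_len / total < ratio_threshold:
        remaining = "".join(text for text, flagged in runs if not flagged)
        return special, remaining
    return [], "".join(text for text, _ in runs)


runs = [
    ("Please review the attached plan. ", False),
    ("Deadline: Friday", True),   # e.g. a bold run
    (" Thanks for your help with this project.", False),
]
special, remaining = extract_special_format(runs)
```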

In one implementation of this embodiment, the weight may specifically be:

W_{f}(w_{i}) = F(w_{i}) \times \log\left(\frac{S}{S_{f}(w_{i})}\right) \qquad (1)

In this implementation, after the weights of the words are calculated with the above formula, the words with higher weights are selected according to the preset weight threshold and put into the keyword set. The preset weight threshold may be set from experience or obtained by experiment.
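Formula (1) can be computed in a few lines of Python. This is a sketch under two assumptions that the text leaves open: F(w_i) is taken as the raw occurrence count, and the logarithm is the natural logarithm. The sample sentences and the threshold 1.5 are hypothetical.

```python
import math
from collections import Counter


def word_weights(sentences: list[list[str]]) -> dict[str, float]:
    """Compute W_f(w) = F(w) * log(S / S_f(w)) for every word in the body."""
    S = len(sentences)                                            # number of sentences
    freq = Counter(w for s in sentences for w in s)               # F(w): occurrences in body
    sent_count = Counter(w for s in sentences for w in set(s))    # S_f(w): sentences with w
    return {w: freq[w] * math.log(S / sent_count[w]) for w in freq}


sentences = [
    ["budget", "meeting", "budget"],
    ["meeting", "room"],
    ["meeting", "agenda"],
]
weights = word_weights(sentences)
keywords = {w for w, value in weights.items() if value > 1.5}
```

A word that occurs in every sentence, such as "meeting" here, gets weight zero, which shows how the sentence-ratio factor suppresses words spread evenly across the whole body.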

Because keywords in this embodiment are extracted on the basis of word frequency and weight, some words that contribute nothing to a summary (for example, common function words) may be extracted. The keywords therefore need to be screened again to pick out the meaningful words. At the same time, matching the keywords against feature words also gives an indicative judgment of the mail's content.

Suppose C is the feature set of all mails in the database. According to the content of the documents in the mails, C can be divided into feature subsets C_i such as "computer", "administrative", "daily" and "business" (and these subsets can be further subdivided), each feature subset corresponding to one document type. A training document set can be built in advance (the class of each document is assigned manually); by calculating the degree of correlation between the words and the document classes, the words with a high degree of correlation are selected as the feature word set of that class of document.

Since this embodiment is ultimately to be applied in a real production system, which must process many mails concurrently and update the feature word sets iteratively at short intervals, the feature selection algorithm needs low computational complexity and high speed. Feature words are therefore selected by TF-IDF (term frequency–inverse document frequency).

The TF-IDF value of a word a in documents of class C_i is calculated as:

\mathrm{TFIDF}(a, C_{i}) = \mathrm{tf}(a, C_{i}) \times \log\left(\frac{N}{X_{a}}\right) \qquad (2)

where X_a is the total number of mails containing a, and N is the total number of mails in the database;

\mathrm{tf}(a, C_{i}) = \frac{\mathrm{Count}(a | C_{i})}{\mathrm{Count}(a' | C_{i})};

Count(a | C_i) is the number of times a occurs in documents of class C_i, and Count(a' | C_i) is the total number of words in documents of class C_i.

After the TF-IDF value of each word in documents of class C_i has been calculated, a threshold is set and the words whose value exceeds it are selected as the feature words of class C_i. The feature word sets of all classes are then merged to form the feature word set of the whole database. The threshold needs to be set through extensive training so that the number of feature words stays within a suitable range (for the computer class, for example, not every computer term is a feature word, but too strict a threshold leaves too few words and defeats the purpose of feature word selection).
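The per-class feature word selection of formula (2) can be sketched in Python as follows. The training data, the class labels and the threshold 0.1 are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def feature_words(docs, threshold):
    """Select feature words per class with TFIDF(a, C_i) = tf(a, C_i) * log(N / X_a).

    docs is a list of (class_label, tokens) pairs, one pair per training mail.
    """
    N = len(docs)                                                     # total number of mails
    containing = Counter(w for _, toks in docs for w in set(toks))    # X_a per word
    per_class = {}
    for label, toks in docs:
        per_class.setdefault(label, Counter()).update(toks)           # Count(a | C_i)
    features = {}
    for label, counts in per_class.items():
        total = sum(counts.values())                                  # Count(a' | C_i)
        features[label] = {
            w for w, c in counts.items()
            if (c / total) * math.log(N / containing[w]) > threshold
        }
    return features


docs = [
    ("computer", ["server", "disk", "meeting"]),
    ("computer", ["server", "network", "meeting"]),
    ("business", ["invoice", "contract", "meeting"]),
    ("business", ["invoice", "payment", "meeting"]),
]
features = feature_words(docs, threshold=0.1)
```

The word "meeting" appears in every mail, so its inverse document frequency is zero and it is not selected for any class.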

In the time that system need to be processed a certain envelope mail, calculate weight and screen the set W (being the set of name in described keyword set, time, place keyword formation in addition) obtaining according to formula (1) and mate with described feature word set, the keyword that is for example less than Y% in W appears in all kinds of document characteristic of correspondence set of words, directly by the keyword in W, and the order that special key word (name, time, place etc.) and mail matter topics occur according to mail original text is exported as summary.

When no fewer than the ratio threshold Y% of the keywords in W appear in the feature word set of class C_i, feature matching is performed between the keyword set and the feature word set of class C_i, and the mail is marked as being of type "C_i". For example, suppose W contains j keywords in total, of which o appear in C_i (o/j ≥ Y%). The k feature words F_1, F_2, ..., F_k in the C_i library most relevant to these o feature words are chosen as the prediction dictionary (initially k = o^2 may be used, and the formula can be adjusted further; these feature words may span classes and may repeat, and their correlations can be obtained at the outset when the feature words are classified). The following formula is then used to calculate the probability that each of the other j − o words matches the prediction dictionary F_1, F_2, ..., F_k, that is:

P(w_{j \setminus o} \in W | F_{1}, \ldots, F_{k}) = \frac{P(F_{1}, \ldots, F_{k} | w_{j \setminus o} \in W) \times P(w_{j \setminus o} \in W)}{P(F_{1}, \ldots, F_{k})}

Because F_1, ..., F_k are chosen so as to be conditionally independent, the above expression equals

\frac{\prod_{i=1}^{k} P(F_{i} | w_{j \setminus o} \in W) \times P(w_{j \setminus o} \in W)}{\prod_{i=1}^{k} P(F_{i})} \qquad (3)

where \prod denotes the product. W is the set of keywords in the keyword set other than those representing person names, place names and times, that is, the set of keywords selected after calculating weights according to formula (1); F_1, ..., F_k is the prediction dictionary formed by feature words. Formula (3) matches the set W against the feature words, which improves the accuracy of the generated summary and also yields the document type of the mail.

The left side of the above formula is the probability that, given the feature words F_1, ..., F_k, one of the j − o words w_{j \setminus o} in W that did not match the feature word set of class C_i becomes part of the summary; the right side is the expansion of this probability by Bayes' theorem. P(F_i) is the probability that F_i occurs in the prediction dictionary formed by the k feature words (for example, it is 1/k if F_1 occurs once, and 2/k if two entries of the dictionary are the same word). P(F_i | w_{j \setminus o} ∈ W) is the probability that F_i occurs when w_{j \setminus o} ∈ W. P(w_{j \setminus o} ∈ W) is a constant that can be obtained during training; for example, if during training g of the j − o keywords not matched by features became final summary words, then P(w_{j \setminus o} ∈ W) = g/(j − o).

A probability threshold is set. The keywords whose P(w_{j \setminus o} ∈ W | F_1, ..., F_k) is greater than this threshold, the keywords in the keyword set that were successfully matched with the feature word set of class C_i, and the keywords representing names, times and places are output together with the mail subject in order of appearance.
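Formula (3) can be evaluated as in the following Python sketch. The conditional probabilities P(F_i | w ∈ W) and the prior P(w ∈ W) would come from training; the numbers used here, the dictionary contents and the function name `summary_score` are hypothetical.

```python
from collections import Counter
from math import prod


def summary_score(cond_probs, prior, dictionary):
    """Evaluate formula (3) for one keyword.

    cond_probs[i] is P(F_i | w in W) for the i-th dictionary entry,
    prior is P(w in W), and P(F_i) is the share of dictionary entries equal to F_i.
    """
    k = len(dictionary)
    counts = Counter(dictionary)
    numerator = prod(cond_probs) * prior
    denominator = prod(counts[f] / k for f in dictionary)
    return numerator / denominator


# "server" occurs twice among the k = 4 entries, so P(F_i) is 2/4 for it and 1/4 for the others.
dictionary = ["server", "disk", "server", "network"]
score_a = summary_score([0.5, 0.25, 0.5, 0.25], prior=0.5, dictionary=dictionary)
score_b = summary_score([0.25, 0.25, 0.25, 0.25], prior=0.5, dictionary=dictionary)
selected = [name for name, s in (("a", score_a), ("b", score_b)) if s > 0.3]
```

With a threshold of 0.3, only the first keyword would be kept for the summary in this example.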

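In one implementation of this embodiment, the method may further comprise the picture handling described in the summary of the invention, in which the length l' and width d' in pixels after compression satisfy

\frac{l'}{d'} = \frac{m}{n};

and the length and width compression ratios are

a = \frac{l}{l'}, \quad b = \frac{d}{d'}.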

In this implementation, the aim of summary extraction is an algorithm that is simple and fast and that tolerates distortion, so Huffman coding may be, but need not be, used for compression. Terminal identification technology is now mature, and the mail server can easily obtain terminal information.
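The size decision for a picture can be sketched as below. Several points are assumptions, because the text does not fix them: a picture "exceeds" the resolution when either dimension is larger, the target size is exactly m × n (which satisfies l'/d' = m/n), and the preset range for the ratios a and b is taken as 1.0 to 4.0.

```python
def plan_picture(l, d, m, n, ratio_range=(1.0, 4.0)):
    """Decide how to treat an l x d picture for a terminal with resolution m x n.

    Returns ("keep", l, d), ("compress", l2, d2) or ("notice_only", None, None).
    """
    if l <= m and d <= n:
        return ("keep", l, d)
    l2, d2 = m, n                 # assumed target size, so that l2 / d2 == m / n
    a, b = l / l2, d / d2         # length and width compression ratios
    low, high = ratio_range
    if low <= a <= high and low <= b <= high:
        return ("compress", l2, d2)
    return ("notice_only", None, None)


plan = plan_picture(1920, 1080, 960, 540)
```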

In one implementation of this embodiment, the method may further comprise:

when the mail contains attachments, using the names of the attachments as a component of the summary.

Adding attachments when sending mail is very common. To address the data consumption and slow speed of receiving mail with large attachments over a 2G/3G network, this implementation makes the attachment names a component of the summary, which helps the user decide whether to receive the mail.

In one implementation of this embodiment, the following may also be performed after step S104:

obtaining the default text format of the terminal corresponding to the receiving address of the mail, and setting the text format of the summary to that default text format.

This implementation in effect strips the text formatting of the mail body.

This embodiment is illustrated below with an example. The process by which an e-mail is sent from the sender to the recipient is shown in Fig. 2: the sender connects by an SMTP session to mail server A of the mailbox being used and sends the e-mail to it; mail server A sends the e-mail over the Internet by an SMTP session to mail server B of the recipient's mailbox; and mail server B delivers the e-mail to the recipient.

In this example, the architecture of the mail summary generation engine is shown in Fig. 3. The engine consists of a mail server database and four queues: a to-be-loaded task queue, a summary generation queue, an update queue and a cleaning queue.

The mail server database is the database that stores mail information.

The to-be-loaded task queue holds the tasks waiting to be loaded after a batch task scan of the mail server database. The number of tasks in this queue is limited; when it is full, no new task can be added to it.

The summary generation queue holds the tasks loaded from the to-be-loaded task queue that need summary generation. The number of tasks in this queue is limited; when it is full, no new task can be added to it.

The update queue holds the tasks from the summary generation queue whose summary generation has completed. Each task in the update queue updates the mail server database in turn and is then put into the cleaning queue.

The cleaning queue processes each task it holds in turn, clearing the temporary files generated during summary generation and the history records.
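The four-queue flow can be sketched with Python's standard `queue` module. The class name `SummaryEngine`, the capacity of 2, the dict standing in for the mail server database and the `summarize` callback are all hypothetical; a real engine would run the stages concurrently rather than in one pass.

```python
from queue import Full, Queue


class SummaryEngine:
    """Single-threaded sketch of the to-be-loaded, generation, update and cleaning queues."""

    def __init__(self, capacity=2):
        self.to_load = Queue(maxsize=capacity)      # bounded: rejects tasks when full
        self.generating = Queue(maxsize=capacity)   # bounded: rejects tasks when full
        self.updating = Queue()
        self.cleaning = Queue()

    def scan(self, mail_ids):
        """Queue scanned tasks until the to-be-loaded queue is full; return those accepted."""
        accepted = []
        for mail_id in mail_ids:
            try:
                self.to_load.put_nowait(mail_id)
            except Full:
                break
            accepted.append(mail_id)
        return accepted

    def step(self, summarize, database):
        """Move every queued task through generation, database update and cleaning."""
        while not self.to_load.empty() and not self.generating.full():
            self.generating.put_nowait(self.to_load.get_nowait())
        while not self.generating.empty():
            mail_id = self.generating.get_nowait()
            self.updating.put((mail_id, summarize(mail_id)))
        while not self.updating.empty():
            mail_id, summary = self.updating.get_nowait()
            database[mail_id] = summary             # update the mail server database
            self.cleaning.put(mail_id)
        cleaned = []
        while not self.cleaning.empty():
            cleaned.append(self.cleaning.get_nowait())   # temp files would be removed here
        return cleaned


engine = SummaryEngine(capacity=2)
accepted = engine.scan(["m1", "m2", "m3"])
database = {}
cleaned = engine.step(lambda mail_id: f"summary of {mail_id}", database)
```

The third task is not accepted because the to-be-loaded queue is already full, which mirrors the bounded behaviour described for the first two queues.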

In this example, the mail summary generation flow is shown in Fig. 4.

Based on the mail information in the mail server database, terminal identification is performed first. After the information of the terminal is obtained, body summary generation and attachment summary generation are performed separately, generating the body summary and the attachment summary according to the corresponding rules in the rule base. The state of the mail is then updated and sent to the mail server database, and the temporary files are cleared.

In the above flow, terminal identification has two main parts. The first part is identification of the mobile terminal model, for which several mature techniques already exist. The model of the terminal is identified from the first 8 digits of the IMEI (International Mobile Equipment Identity), and the attributes of the mobile terminal, such as its resolution, are then obtained from a mapping table established between user mobile terminal models and device attributes.
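The two lookups can be sketched as follows. The mapping tables, the 8-digit prefix "12345678", the model name and the resolution are all made up for illustration; a real deployment would load these tables from its own device database.

```python
# Hypothetical mapping tables; the prefix, model name and attributes are invented.
MODEL_BY_PREFIX = {"12345678": "ExamplePhone A"}
ATTRIBUTES_BY_MODEL = {"ExamplePhone A": {"resolution": (640, 960)}}


def terminal_attributes(imei: str):
    """Look up device attributes from the first 8 digits of a 15-digit IMEI."""
    digits = imei.strip()
    if len(digits) != 15 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("an IMEI is expected to be 15 decimal digits")
    model = MODEL_BY_PREFIX.get(digits[:8])
    return ATTRIBUTES_BY_MODEL.get(model) if model else None


attrs = terminal_attributes("123456780000001")
```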

The second part is identification of the mobile mailbox terminal. At present, mobile terminals use pushmail, a mail push service that connects to Internet personal mailboxes and enterprise mailbox systems, that is, a service that pushes mail newly arrived at the mail server to the user's mobile terminal. There are two existing pushmail schemes. (1) SMS PUSH: a short message triggers the software client installed on the mobile terminal, which then cooperates with the server to complete the whole mail reception process. (2) IP PUSH: mail is received over a persistent IP connection maintained by the mailbox client software on the mobile terminal. The enterprise installs a mail forwarding server connected to its internal mail server; the mail forwarding server is connected to the mail push gateway at the operator's office end, and the mail push gateway is connected to the user terminal through the GSM/GPRS/3G network. When new mail arrives in the enterprise mailbox, the mail forwarding server notifies the mail push gateway. Under the SMS PUSH scheme, the mail push gateway sends a background notification short message to the mobile terminal user who has installed the terminal software; on receiving the notification, the terminal software activates the GPRS network, fetches the mail from the mail forwarding server and notifies the user, completing the mail push. Under the IP PUSH scheme, the mail push gateway notifies the software client on the mobile terminal over the IP data connection to fetch the mail from the mail forwarding server and notify the user, completing the mail push. With either SMS PUSH or IP PUSH, once the mail forwarding server receives the PUSH command, it can obtain information such as the resolution and default text format of the mobile terminal that is the PUSH target, and generate a summary of the mail requested.

The workflow of body summary generation is shown in Fig. 5. The full text is scanned to see whether the mail body contains a picture. If it does, the picture is compressed (if, after compression, the length and width compression ratios are not within the preset range, only a notice that the mail contains a picture is added). After compression, or when there is no picture, a text summary is generated on the basis of the rule base, and the text summary and the compressed picture (or the notice) together form the body summary. The rule base may store the Chinese vocabulary and so on.

The attachment summary generation step forms the attachment summary from the attachment names.

During the state update, the body summary and the attachment summary together form the summary of the e-mail; the summary generation queue is updated, and the state of the task in the mail server database is updated at the same time. After the update, the cleaning queue is notified.

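Embodiment 2 is an e-mail summary generating device, comprising the conversion module, pre-selection module, weight screening module and probability screening module described in the summary of the invention above.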

In one implementation of this embodiment, the probability screening module may further be configured to, before the conversion module performs word segmentation, determine whether some characters in the mail body have a special format and whether the proportion of the mail body taken up by the specially formatted characters is less than a preset ratio threshold, and if so, to use the specially formatted characters as a component of the summary;

In one implementation of this embodiment, the weight calculated by the weight screening module may be:

W_{f} (w_{i}) = F (w_{i}) \times \log (\frac{S}{S_{f} (w_{i})})

In one implementation of this embodiment, the probability screening module may specifically comprise:

a matching unit, a judging unit, an output unit and a computing unit;

P(w \in W | F_{1}, \ldots, F_{k}) = \frac{\prod_{i=1}^{k} P(F_{i} | w \in W) \times P(w \in W)}{\prod_{i=1}^{k} P(F_{i})}

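In one implementation of this embodiment, the device may further comprise the terminal acquisition module and the picture compression module described in the summary of the invention above.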

A person of ordinary skill in the art will understand that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Alternatively, all or some of the steps of the above embodiments may be implemented with one or more integrated circuits. Accordingly, each module or unit in the above embodiments may be implemented in hardware or as a software function module. The invention is not limited to any particular combination of hardware and software.

Of course, the invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and variations according to the invention, but all such changes and variations shall fall within the protection scope of the claims of the invention.

Claims

1. An e-mail summary generation method, comprising:

2. The method of claim 1, characterized in that, before the word segmentation step, the method further comprises:

3. The method of claim 1, characterized in that the weight is:

W_{f}(w_{i}) = F(w_{i}) \times \log\left(\frac{S}{S_{f}(w_{i})}\right)

4. The method of claim 1, characterized in that step S104 comprises:

P(w \in W | F_{1}, \ldots, F_{k}) = \frac{\prod_{i=1}^{k} P(F_{i} | w \in W) \times P(w \in W)}{\prod_{i=1}^{k} P(F_{i})}

5. The method of claim 1, characterized in that it further comprises:

\frac{l'}{d'} = \frac{m}{n};

a = \frac{l}{l'}, \quad b = \frac{d}{d'}.

6. An e-mail summary generating device, characterized in that it comprises:

7. The device of claim 6, characterized in that:

the probability screening module is further configured to, before the conversion module performs word segmentation, determine whether some characters in the mail body have a special format and whether the proportion of the mail body taken up by the specially formatted characters is less than a preset ratio threshold, and if so, to use the specially formatted characters as a component of the summary;

8. The device of claim 6, characterized in that the weight calculated by the weight screening module is:

W_{f}(w_{i}) = F(w_{i}) \times \log\left(\frac{S}{S_{f}(w_{i})}\right)

9. The device of claim 6, characterized in that the probability screening module comprises:

a matching unit, a judging unit, an output unit and a computing unit;

P(w \in W | F_{1}, \ldots, F_{k}) = \frac{\prod_{i=1}^{k} P(F_{i} | w \in W) \times P(w \in W)}{\prod_{i=1}^{k} P(F_{i})}

10. The device of claim 6, characterized in that it further comprises: