CN104182549A - E-mail digest generation method and device - Google Patents

Info

Publication number
CN104182549A
Authority
CN
China
Prior art keywords
keyword
mail
vocabulary
message body
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410469526.4A
Other languages
Chinese (zh)
Inventor
张基恒
魏进武
李丹
汤雅妃
张呈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN201410469526.4A
Publication of CN104182549A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides an e-mail digest generation method and device. The method comprises: converting the body of a mail into a sentence sequence and performing word segmentation; extracting, from the words obtained by segmentation, the words representing names, times and places, and storing them in a keyword set; calculating a weight for each remaining word according to the frequency with which the word occurs in the mail body and the proportion of sentences containing the word among all sentences; storing the words whose weights exceed a preset weight threshold in the keyword set; calculating, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the digest; and outputting as the digest the keywords whose probabilities exceed a probability threshold, the keywords representing names, times and places, and the subject of the mail, in the order in which they appear in the mail body. The method and device are suitable for generating e-mail digests.

Description

E-mail summary generation method and device
Technical field
The present invention relates to the field of networks, and in particular to an e-mail summary generation method and device.
Background art
E-mail is a means of communication that exchanges messages electronically and is the most widely used service on the Internet. Through a networked e-mail system, users can contact network users anywhere in the world cheaply and quickly, and with the rapid development of mobile terminals, users can send and receive mail anytime and anywhere.
An e-mail can take various forms, such as text, images and sound. Users can also receive large numbers of free news and topical mails and search for information easily. E-mail greatly facilitates communication between people and has promoted social development. With the development of cloud technology, functions such as cloud storage of mail, attachment sharing across multiple terminals, and cloud management have gradually been realized.
When users send and receive mail on a mobile terminal, pictures, large attachments and long message bodies all cause considerable inconvenience. Taking iOS as an example, even under ideal 3G conditions, receiving a mail with a 2 MB attachment takes 3 to 5 minutes; if the user is on a bus or subway, signal fluctuations can pause or completely interrupt reception, which wastes data traffic and disrupts the user's life and work.
If, when the user is on a mobile terminal, only a summary of each mail is sent so that the user can get a rough idea of its content, the user can pick out the important mails from the summaries and receive them, and leave the other mails until a wireless network or a computer is available, which effectively reduces data usage. However, existing summary generation algorithms are usually designed for long documents, and the summaries they generate are composed of sentences. Most message bodies are of limited length, resemble traditional letters in form, and contain few sentences; if an existing summarization algorithm is used to extract sentences as the summary, it may extract only a single sentence of the message body, so that other important information is ignored. Existing summarization algorithms therefore cannot extract an effective summary. In addition, a cloud system needs to extract summaries from mail in batches, and existing summary generation algorithms are relatively complex and unsuitable for this.
Summary of the invention
The technical problem to be solved by the present invention is to provide a summary generation scheme suited to e-mail.
To solve the above problem, the invention provides an e-mail summary generation method, comprising:
S101: converting the message body into a sentence sequence and performing word segmentation;
S102: extracting, from the words obtained by segmentation, the words representing names, times and places, and saving them into a keyword set;
S103: for each remaining word, calculating the weight of the word according to the frequency with which it occurs in the message body and the proportion of sentences containing it among all sentences, and saving the words whose weights exceed a predetermined weight threshold into the keyword set;
S104: using a naive Bayes classification model, calculating for each keyword in the keyword set other than the keywords representing names, times and places the probability that it becomes part of the summary, and finally outputting, as the summary, the keywords whose probabilities exceed a predetermined probability threshold together with the keywords representing names, times and places and the subject of the mail, in their order of appearance in the message body.
Optionally, before the word segmentation step, the method further comprises:
determining whether some characters in the message body have a special format and whether the proportion of the message body occupied by the specially formatted characters is less than a predetermined ratio threshold, and if so, using the specially formatted characters as a component of the summary;
after removing the specially formatted characters from the message body, performing the word segmentation step on the remaining document.
Optionally, the weight is:
$$W_f(w_i) = F(w_i) \times \log\left(\frac{S}{S_f(w_i)}\right)$$
where $W_f(w_i)$ is the calculated weight of word $w_i$, $F(w_i)$ is the frequency with which word $w_i$ occurs in the message body, $S$ is the number of sentences in the sentence sequence, and $S_f(w_i)$ is the number of sentences in the sentence sequence in which word $w_i$ occurs.
Optionally, step S104 comprises:
matching the set W, obtained by removing the keywords representing names, times and places from the keyword set, against the feature-word sets corresponding to the various document classes; if the share of keywords in W that appear in the feature-word sets of the document classes is below a proportion threshold, directly outputting, as the summary, the keywords in W together with the keywords representing names, times and places and the subject of the mail, in their order of appearance in the message body;
when no fewer than the proportion threshold of the keywords in W appear in the feature-word set of document class $C_i$, marking the mail as being of type $C_i$, and calculating the probability that each keyword not appearing in the feature-word set of class $C_i$ matches the anticipation dictionary $F_1, F_2, \ldots, F_k$:
$$P(w \in W \mid F_1, \ldots, F_k) = \frac{\left(\prod_{i=1}^{k} P(F_i \mid w \in W)\right) \times P(w \in W)}{\prod_{i=1}^{k} P(F_i)}$$
where $\prod$ is the product operator; $w$ is any keyword in W that does not appear in the feature-word set of class $C_i$; $P(F_i)$ is the probability that $F_i$ occurs in the anticipation dictionary formed by the $k$ feature words; $P(F_i \mid w \in W)$ is the probability that $F_i$ occurs given $w \in W$; and $P(w \in W)$ is a constant;
selecting the keywords whose $P(w_{j\setminus o} \in W \mid F_1, \ldots, F_k)$ exceeds the probability threshold, and outputting them as the summary, together with the keywords in the keyword set that successfully match the feature-word set of class $C_i$, the keywords representing names, times and places, and the subject of the mail, in their order of appearance in the message body.
Optionally, the method further comprises:
when the message body contains a picture, obtaining the resolution information of the terminal corresponding to the mail's delivery address, and determining whether the size of the picture, $l \times d$ pixels (length × width), exceeds the terminal's resolution $m \times n$;
if it does, compressing the picture to the terminal's resolution ratio, so that the length $l'$ and width $d'$ in pixels after compression satisfy:
$$\frac{l'}{d'} = \frac{m}{n};$$
if […], only adding to the summary a note that the body contains a picture; if […], carrying the compressed picture in the summary generated in step S104; where $a = \frac{l}{l'}$ and $b = \frac{d}{d'}$.
The invention also provides an e-mail summary generation device, comprising:
a conversion module, configured to convert the message body into a sentence sequence and perform word segmentation;
a pre-selection module, configured to extract, from the words obtained by segmentation, the words representing names, times and places, and save them into a keyword set;
a weight screening module, configured to calculate, for each remaining word, the weight of the word according to the frequency with which it occurs in the message body and the proportion of sentences containing it among all sentences, and to save the words whose weights exceed a predetermined weight threshold into the keyword set;
a probability screening module, configured to calculate, using a naive Bayes classification model, for each keyword in the keyword set other than the keywords representing names, times and places the probability that it becomes part of the summary, and finally to output, as the summary, the keywords whose probabilities exceed a predetermined probability threshold together with the keywords representing names, times and places and the subject of the mail, in their order of appearance in the message body.
Optionally, the probability screening module is further configured to determine, before the conversion module performs word segmentation, whether some characters in the message body have a special format and whether the proportion of the message body occupied by the specially formatted characters is less than a predetermined ratio threshold, and if so, to use the specially formatted characters as a component of the summary;
the conversion module performs word segmentation on the document that remains after the specially formatted characters are removed from the message body.
Optionally, the weight calculated by the weight screening module is:
$$W_f(w_i) = F(w_i) \times \log\left(\frac{S}{S_f(w_i)}\right)$$
where $W_f(w_i)$ is the calculated weight of word $w_i$, $F(w_i)$ is the frequency with which word $w_i$ occurs in the message body, $S$ is the number of sentences in the sentence sequence, and $S_f(w_i)$ is the number of sentences in the sentence sequence in which word $w_i$ occurs.
Optionally, the probability screening module comprises:
a matching unit, a judging unit, an output unit and a computing unit;
the matching unit is configured to match the set W, obtained by removing the keywords representing names, times and places from the keyword set, against the feature-word sets corresponding to the various document classes;
the judging unit is configured to: when the share of keywords in W that appear in the feature-word sets of the document classes is below a proportion threshold, instruct the output unit to directly output, as the summary, the keywords in W together with the keywords representing names, times and places and the subject of the mail, in their order of appearance in the message body; and when no fewer than the proportion threshold of the keywords in W appear in the feature-word set of document class $C_i$, mark the mail as being of type $C_i$ and instruct the computing unit to perform the calculation;
the computing unit is configured to calculate the probability that each keyword not appearing in the feature-word set of class $C_i$ matches the anticipation dictionary $F_1, F_2, \ldots, F_k$:
$$P(w \in W \mid F_1, \ldots, F_k) = \frac{\left(\prod_{i=1}^{k} P(F_i \mid w \in W)\right) \times P(w \in W)}{\prod_{i=1}^{k} P(F_i)}$$
where $\prod$ is the product operator; $w$ is any keyword in W that does not appear in the feature-word set of class $C_i$; $P(F_i)$ is the probability that $F_i$ occurs in the anticipation dictionary formed by the $k$ feature words; $P(F_i \mid w \in W)$ is the probability that $F_i$ occurs given $w \in W$; and $P(w \in W)$ is a constant;
the output unit is configured to, after the computing unit has performed the calculation, select the keywords whose $P(w_{j\setminus o} \in W \mid F_1, \ldots, F_k)$ exceeds the probability threshold, and output them as the summary, together with the keywords in the keyword set that successfully match the feature-word set of class $C_i$, the keywords representing names, times and places, and the subject of the mail, in their order of appearance in the message body.
Optionally, the device further comprises:
a terminal acquisition module, configured to obtain, when the message body contains a picture, the resolution information of the terminal corresponding to the mail's delivery address;
a picture compression module, configured to determine whether the size of the picture in the message body, $l \times d$ pixels (length × width), exceeds the terminal's resolution $m \times n$; if it does, to compress the picture to the terminal's resolution ratio, so that the length $l'$ and width $d'$ in pixels after compression satisfy $\frac{l'}{d'} = \frac{m}{n}$; if […], only a note that the body contains a picture is added to the summary; if […], the compressed picture is carried in the summary generated by the summary generation module; where $a = \frac{l}{l'}$ and $b = \frac{d}{d'}$.
The technical solution of the present invention can extract summaries from e-mails conveniently and in batches, so that users can quickly get a brief understanding of a mail when the signal is poor or the mail is large. The summary of the message body is generated by extracting and processing keywords. In a preferred scheme of the invention, pictures in the message body are compressed according to the resolution of the identified terminal, which reduces their size and makes them better suited to the terminal.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the e-mail summary generation method of Embodiment 1;
Fig. 2 is a schematic diagram of the e-mail delivery process;
Fig. 3 is an architecture diagram of the e-mail summary generation engine in the example of Embodiment 1;
Fig. 4 is a schematic flowchart of e-mail summary generation in the example of Embodiment 1;
Fig. 5 is a flowchart of body-text summary generation in the example of Embodiment 1.
Detailed description
The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of the present invention and the features in the embodiments may be combined with one another, and all such combinations fall within the protection scope of the present invention. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Embodiment 1: an e-mail summary generation method which, as shown in Fig. 1, comprises:
S101: converting the message body into a sentence sequence and performing word segmentation;
S102: extracting, from the words obtained by segmentation, the words representing names, times and places, and saving them into a keyword set;
S103: for each remaining word, calculating the weight of the word according to the frequency with which it occurs in the message body and the proportion of sentences containing it among all sentences, and saving the words whose weights exceed a predetermined weight threshold into the keyword set;
S104: using a naive Bayes classification model, calculating for each keyword in the keyword set other than the keywords representing names, times and places the probability that it becomes part of the summary, and finally outputting, as the summary, the keywords whose probabilities exceed a predetermined probability threshold together with the keywords representing names, times and places and the subject of the mail, in their order of appearance in the message body.
This embodiment takes into account that most message bodies are of limited length and that summaries may need to be extracted from mail in batches, so a fast and simple method must be chosen. It therefore adopts a feature-based method: the message body text is treated as a linear sequence of sentences and each sentence as a linear sequence of words, and the summary information is obtained by extracting keywords and calculating the weights and frequencies of words.
In this embodiment, if a keyword in the summary occurs several times in the message body, its position for output in order of appearance is the position of its first occurrence in the message body.
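The first-occurrence ordering rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `order_by_first_occurrence` is hypothetical, keywords are located with a plain substring search, and keywords that never occur in the body are dropped, which the text does not specify.

```python
def order_by_first_occurrence(keywords, body):
    """Sort keywords by the position of their first occurrence in the body.

    Keywords that never occur in the body are dropped (an assumption of
    this sketch; the source does not say how to treat them).
    """
    positions = {k: body.find(k) for k in keywords}  # -1 when absent
    present = [k for k in keywords if positions[k] >= 0]
    return sorted(present, key=lambda k: positions[k])


body = "The budget review is on Friday in Beijing; bring the budget."
print(order_by_first_occurrence({"Friday", "budget", "Beijing"}, body))
# prints ['budget', 'Friday', 'Beijing']
```

Although "budget" appears twice, only its first position is used, so it is emitted once and ahead of the other keywords.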
In this embodiment, steps S101 to S104 may be executed on the mail server or implemented in a cloud environment. For example, the summary generation and sending/receiving engine may be deployed at the server front end with its exchange architecture in the cloud, using cloud computing capacity to generate and send mail summaries in batches.
In this embodiment, existing text mining techniques may be used to convert the message body into a sentence sequence, segment the document into words, and extract the words representing names, times and places. For example, punctuation marks may be used to convert the mail text into a sentence sequence, and a given Chinese vocabulary may be used to segment the document and extract the words representing names, times and places.
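A minimal Python sketch of punctuation-based sentence splitting and vocabulary-based segmentation is given below. The text does not name a segmentation algorithm; forward maximum matching is used here purely as an assumption, and the function names, the punctuation set and the `max_len` parameter are hypothetical.

```python
import re

# Sentence-ending punctuation (an illustrative choice, not from the source).
SENTENCE_END = re.compile(r"[。！？!?；;\n]+")


def split_sentences(body):
    """Split the body text into non-empty sentences at punctuation marks."""
    return [s.strip() for s in SENTENCE_END.split(body) if s.strip()]


def segment(sentence, vocabulary, max_len=4):
    """Forward maximum matching against a given vocabulary.

    At each position the longest vocabulary word is taken; a single
    character is emitted when nothing in the vocabulary matches.
    """
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


vocab = {"明天", "开会", "准时", "参加"}
for s in split_sentences("明天开会。请准时参加！"):
    print(segment(s, vocab))
```

Extracting the words that represent names, times and places would additionally need tagged vocabularies or a named-entity recognizer, which this sketch leaves out.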
In one implementation of this embodiment, before the word segmentation step, the method may further comprise:
determining whether some characters in the message body have a special format and whether the proportion of the message body occupied by the specially formatted characters is less than a predetermined ratio threshold, and if so, using the specially formatted characters as a component of the summary;
after removing the specially formatted characters from the message body, performing the word segmentation step on the remaining document.
In this implementation, the special format may include, but is not limited to: bold, italic, underline, strikethrough, and the like.
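The special-format check might look like the following Python sketch. It assumes, beyond what the text states, that the message body is HTML and that the special formats are marked with `<b>`, `<i>`, `<u>` and `<s>` tags; nested tags are not handled, and the name `extract_emphasis` is hypothetical.

```python
import re

# Matches simple, non-nested bold/italic/underline/strikethrough spans.
SPECIAL = re.compile(r"<(b|i|u|s)>(.*?)</\1>", re.S | re.I)
ANY_TAG = re.compile(r"<[^>]+>")


def extract_emphasis(html_body, max_ratio):
    """Return (special_spans, remaining_text).

    The specially formatted spans are kept for the summary and removed
    from the text only when they make up less than max_ratio of the body.
    """
    spans = [m.group(2) for m in SPECIAL.finditer(html_body)]
    plain = ANY_TAG.sub("", html_body)
    special_len = sum(len(s) for s in spans)
    if spans and plain and special_len / len(plain) < max_ratio:
        remaining = ANY_TAG.sub("", SPECIAL.sub("", html_body))
        return spans, remaining
    return [], plain


print(extract_emphasis("Please reply by <b>Friday</b>, thanks a lot.", 0.5))
```

The ratio test keeps a body that is formatted almost entirely in bold from being copied wholesale into the summary.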
In one implementation of this embodiment, the weight may specifically be:
$$W_f(w_i) = F(w_i) \times \log\left(\frac{S}{S_f(w_i)}\right) \qquad (1)$$
where $W_f(w_i)$ is the calculated weight of word $w_i$, $F(w_i)$ is the frequency with which word $w_i$ occurs in the message body, $S$ is the number of sentences in the sentence sequence, and $S_f(w_i)$ is the number of sentences in the sentence sequence in which word $w_i$ occurs.
In this implementation, after the word weights are calculated with the above formula, the words with higher weights are selected according to the predetermined weight threshold and put into the keyword set; the predetermined weight threshold may be set empirically or obtained by experiment.
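Formula (1) can be sketched in Python as below. The sketch assumes that $F(w_i)$ is the raw occurrence count and that the logarithm is natural, neither of which the text fixes; the function names are hypothetical.

```python
import math
from collections import Counter


def word_weights(sentences):
    """Compute W_f(w) = F(w) * log(S / S_f(w)) for every word.

    sentences: list of word lists, one list per sentence.
    """
    freq = Counter(w for s in sentences for w in s)             # F(w)
    sent_count = Counter(w for s in sentences for w in set(s))  # S_f(w)
    total = len(sentences)                                      # S
    return {w: freq[w] * math.log(total / sent_count[w]) for w in freq}


def select_keywords(sentences, threshold):
    """Keep the words whose weight exceeds the weight threshold."""
    return {w for w, v in word_weights(sentences).items() if v > threshold}


sentences = [
    ["meeting", "room"],
    ["meeting", "budget"],
    ["budget", "budget", "report"],
    ["lunch"],
]
print(select_keywords(sentences, 1.5))  # prints {'budget'}
```

A word that occurs in every sentence gets weight 0 because $\log(S/S) = 0$, so very common words are filtered out regardless of how often they occur.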
Because keywords in this embodiment are extracted on the basis of word frequency and weight, some keywords that are meaningless for generating the summary may be extracted, so the keywords need to be screened again to pick out the meaningful words. At the same time, matching feature words against keywords also gives an indicative judgment of the mail's content.
Suppose C is the feature set of all mails in the database. According to the content of the documents in the mails, C can be divided into several feature subsets $C_i$, such as "computer", "administrative", "daily" and "business" (and these subsets can be divided further), each feature subset corresponding to one document type. A training document set can be built in advance (the class of each document being assigned manually); by calculating the degree of correlation between each word and the document class, the words with a high degree of correlation are selected as the feature-word set of that class.
Since this embodiment is ultimately to be applied in a real production system, which must process many mails concurrently and update the feature-word sets iteratively at short intervals, the feature selection algorithm needs low computational complexity and high speed. Feature words are therefore selected by TF-IDF (term frequency-inverse document frequency).
The TF-IDF value of a word $a$ in class $C_i$ documents is calculated as:
$$\mathrm{TFIDF}(a, C_i) = \mathrm{tf}(a, C_i) \times \log\left(\frac{N}{X_a}\right) \qquad (2)$$
where $X_a$ is the total number of mails containing $a$, and $N$ is the total number of mails in the database;
$$\mathrm{tf}(a, C_i) = \frac{\mathrm{Count}(a \mid C_i)}{\mathrm{Count}(a' \mid C_i)};$$
$\mathrm{Count}(a \mid C_i)$ is the number of times $a$ occurs in class $C_i$ documents, and $\mathrm{Count}(a' \mid C_i)$ is the total number of words in class $C_i$ documents.
After the TF-IDF value of each word in class $C_i$ documents is calculated, a threshold is set and the words whose values exceed it are selected as the feature words of class $C_i$. The feature-word sets of all classes are then merged to form the feature-word set of the whole database. The threshold needs to be set through extensive training so that the number of feature words stays within a suitable range (for the computer class, for example, not every computer term is a feature word, but too strict a threshold leaves too few words and defeats the purpose of feature-word selection).
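Formula (2) and the threshold selection can be sketched in Python as follows. The data layout (a dictionary from class name to tokenized training documents), the natural logarithm and the function name `class_feature_words` are assumptions made for this illustration.

```python
import math
from collections import Counter


def class_feature_words(docs_by_class, threshold):
    """Select feature words per class with TFIDF(a, C_i) from formula (2).

    docs_by_class: {class_name: [word list for each training mail]}.
    """
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    n_docs = len(all_docs)                                   # N
    doc_freq = Counter(w for d in all_docs for w in set(d))  # X_a
    features = {}
    for name, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)         # Count(a | C_i)
        total = sum(counts.values())                         # Count(a' | C_i)
        features[name] = {
            w for w, c in counts.items()
            if (c / total) * math.log(n_docs / doc_freq[w]) > threshold
        }
    return features


training = {
    "computer": [["server", "cpu", "meeting"], ["cpu", "disk"]],
    "administrative": [["meeting", "leave"], ["leave", "form", "meeting"]],
}
print(class_feature_words(training, 0.2))
```

In this toy data "meeting" occurs in three of the four mails, so its inverse document frequency is low and it is not selected for either class at a threshold of 0.2.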
When the system needs to process a mail, the set W obtained by calculating and screening weights according to formula (1) (that is, the set of keywords in the keyword set other than those representing names, times and places) is matched against the feature-word set. If, for example, fewer than Y% of the keywords in W appear in the feature-word sets of the document classes, the keywords in W, the special keywords (names, times, places, etc.) and the mail subject are output directly as the summary in the order in which they appear in the original mail.
When no fewer than the proportion threshold Y% of the keywords in W appear in the feature-word set of class $C_i$, feature matching is performed between the keyword set and the feature-word set of class $C_i$, and the mail is marked as being of type "$C_i$". For example, suppose W contains $j$ keywords in total, of which $o$ appear in $C_i$ ($o/j \ge Y\%$). The $k$ feature words $F_1, F_2, \ldots, F_k$ in the $C_i$ library most relevant to these $o$ feature words are chosen as the anticipation dictionary (initially $k = o^2$ may be taken, and the formula can be adjusted later; these feature words may span classes and may repeat, and their correlations can be obtained when the feature words are first divided into classes). The following formula is then used to calculate the probability that each of the other $j - o$ words matches the anticipation dictionary $F_1, F_2, \ldots, F_k$, that is:
$$P(w_{j\setminus o} \in W \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid w_{j\setminus o} \in W) \times P(w_{j\setminus o} \in W)}{P(F_1, \ldots, F_k)}$$
Because $F_1, \ldots, F_k$ are chosen to be conditionally independent, the above expression equals
$$\frac{\left(\prod_{i=1}^{k} P(F_i \mid w_{j\setminus o} \in W)\right) \times P(w_{j\setminus o} \in W)}{\prod_{i=1}^{k} P(F_i)} \qquad (3)$$
where $\prod$ is the product operator. W is the set of keywords in the keyword set other than those representing names, places and times, that is, the set of keywords selected after calculating the weights according to formula (1); $F_1, \ldots, F_k$ is the anticipation dictionary formed by feature words. Formula (3) is matched with the set W to improve the accuracy of the generated summary and to give the document type of the mail.
The left side of the formula is the probability that, given the feature words $F_1, \ldots, F_k$, one of the $j - o$ words $w_{j\setminus o}$ in W that do not match the feature-word set of class $C_i$ becomes part of the summary; the right side is the Bayes expansion used to obtain this probability, which follows from Bayes' theorem. $P(F_i)$ is the probability that $F_i$ occurs in the anticipation dictionary formed by the $k$ feature words (for example, if $F_1$ occurs once, it is $1/k$; if two entries are the same word, it is $2/k$); $P(F_i \mid w_{j\setminus o} \in W)$ is the probability that $F_i$ occurs given $w_{j\setminus o} \in W$; $P(w_{j\setminus o} \in W)$ is a constant that can be obtained during training: for example, if during training $g$ of the $j - o$ keywords that did not match the features became final summary words, then $P(w_{j\setminus o} \in W) = g/(j - o)$.
A probability threshold is set, and the keywords whose $P(w_{j\setminus o} \in W \mid F_1, \ldots, F_k)$ exceeds it are selected; these, together with the keywords in the keyword set that successfully match the feature-word set of class $C_i$, the keywords representing names, times and places, and the mail subject, are output in order of appearance.
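The arithmetic of formula (3) can be sketched in Python as below. How the conditional probabilities $P(F_i \mid w_{j\setminus o} \in W)$ are estimated is not described in the text, so the sketch takes them as inputs; the function names are hypothetical and `math.prod` requires Python 3.8 or later.

```python
import math
from collections import Counter


def dictionary_probs(anticipation):
    """P(F_i) for each entry: its number of occurrences divided by k."""
    k = len(anticipation)
    counts = Counter(anticipation)
    return [counts[f] / k for f in anticipation]


def summary_probability(cond_probs, feature_probs, prior):
    """Formula (3): prod P(F_i | w) * P(w) / prod P(F_i).

    cond_probs:    P(F_i | w in W) for i = 1..k (assumed to be given)
    feature_probs: P(F_i) for i = 1..k
    prior:         P(w in W), e.g. g / (j - o) obtained from training
    """
    return math.prod(cond_probs) * prior / math.prod(feature_probs)


# An entry that appears twice among k = 4 entries has probability 2/4.
print(dictionary_probs(["cpu", "disk", "cpu", "server"]))
# prints [0.5, 0.25, 0.5, 0.25]
```

When the estimated conditional probabilities equal the unconditional ones, the score reduces to the prior $g/(j - o)$; larger conditional probabilities raise the score and make the keyword more likely to pass the probability threshold.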
In one implementation of this embodiment, the method may further comprise:
when the message body contains a picture, obtaining the resolution information of the terminal corresponding to the mail's delivery address, and determining whether the size of the picture, $l \times d$ pixels (length × width), exceeds the terminal's resolution $m \times n$;
if it does, compressing the picture to the terminal's resolution ratio, so that the length $l'$ and width $d'$ in pixels after compression satisfy:
$$\frac{l'}{d'} = \frac{m}{n};$$
if […], only adding to the summary a note that the body contains a picture; if […], carrying the compressed picture in the summary generated in step S104; where $a = \frac{l}{l'}$ and $b = \frac{d}{d'}$.
In this implementation, the goal of summary extraction is an algorithm that is simple and fast and tolerates distortion, so Huffman coding may be, but is not limited to being, used for compression. Terminal identification technology is increasingly mature, and the mail server can easily obtain terminal information.
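The size check and the ratios $a$ and $b$ can be sketched in Python as follows. The sketch assumes that "exceeds" means either dimension is larger than the terminal's, and that the compressed size is simply the terminal resolution itself (which satisfies $l'/d' = m/n$); the conditions on $a$ and $b$ that decide whether the picture is carried are not recoverable from the text and are left to the caller. The function name is hypothetical.

```python
def fit_to_terminal(l, d, m, n):
    """Return (l2, d2, a, b) for a picture of l x d pixels on an m x n terminal.

    l2, d2 are the dimensions after compression; a = l / l2 and b = d / d2
    are the scaling ratios along each dimension.
    """
    if l <= m and d <= n:
        # The picture already fits, so it is left unchanged.
        return l, d, 1.0, 1.0
    l2, d2 = m, n  # assumption: compress to exactly the terminal resolution
    return l2, d2, l / l2, d / d2


print(fit_to_terminal(2000, 1000, 1000, 500))  # prints (1000, 500, 2.0, 2.0)
```

Unequal values of $a$ and $b$ indicate that the picture is distorted by the compression, which is presumably what the omitted conditions test.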
In one implementation of this embodiment, the method may further comprise:
when the mail contains an attachment, using the name of the attachment as a component of the summary.
Adding attachments when sending mail is very common. To address the data consumption and slow speed of receiving mail with large attachments over 2G/3G networks, this implementation makes the attachment name a component of the summary to help the user decide.
In one implementation of this embodiment, after step S104 the method may further comprise:
obtaining the default text format of the terminal corresponding to the mail's delivery address, and setting the text format of the summary to this default text format.
This implementation is equivalent to stripping the text formatting of the message body.
The embodiment is illustrated below with an example. The flow by which an e-mail travels from sender to recipient is shown in Fig. 2: the sender connects through an SMTP session to mail server A of the sender's mailbox and sends the e-mail to it; mail server A sends the e-mail over the Internet through an SMTP session to mail server B of the recipient's mailbox; mail server B delivers the e-mail to the recipient.
In this example, the architecture of the mail summary generation engine is shown in Figure 3. The engine consists of one mail server database and four queues: the to-be-loaded task queue, the summary generation queue, the update queue and the cleanup queue.
Mail server database: the database that stores mail information.
To-be-loaded task queue: holds the tasks waiting to be loaded after a batch scan of the mail server database. The number of tasks in this queue is limited; when the queue is full, no new task can be added to it.
Summary generation queue: holds the tasks that have been loaded from the to-be-loaded task queue and require summary generation. The number of tasks in this queue is limited; when the queue is full, no new task can be added to it.
Update queue: holds the tasks from the summary generation queue whose summaries have been generated. For each task in the update queue in turn, the mail server database is updated and the task is then placed in the cleanup queue.
Cleanup queue: for each task it holds in turn, the temporary files produced during summary generation and any obsolete information are cleaned up.
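The four queues can be sketched with Python's standard `queue` module. This is a single-threaded illustration only: the capacities, the task names and the helpers `try_add` and `advance` are assumptions, not details given in the patent.

```python
import queue

# The two bounded queues reject new tasks when full (capacity 2 is made up).
to_load = queue.Queue(maxsize=2)      # to-be-loaded task queue
summarizing = queue.Queue(maxsize=2)  # summary generation queue
updating = queue.Queue()              # update queue (unbounded here)
cleanup = queue.Queue()               # cleanup queue (unbounded here)


def try_add(q, task):
    """Add a task unless the queue is full; report whether it was accepted."""
    try:
        q.put_nowait(task)
        return True
    except queue.Full:
        return False


def advance(src, dst):
    """Move one task from src to dst, leaving it in src if dst is full."""
    if src.empty() or dst.full():
        return False
    dst.put_nowait(src.get_nowait())
    return True


# A batch scan finds three mails, but only two fit in the first queue.
accepted = [try_add(to_load, f"mail-{i}") for i in range(3)]
print(accepted)

# One task moves through every stage in order.
for src, dst in [(to_load, summarizing), (summarizing, updating),
                 (updating, cleanup)]:
    advance(src, dst)
print(cleanup.get_nowait())
```

Bounding the first two queues gives simple back-pressure: the scanner cannot outrun summary generation by more than the queue capacity.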
In this example, the mail summary generation flow is shown in Figure 4.
Based on the mail information in the mail server database, terminal identification is performed first. Once the terminal's information has been obtained, body summary generation and attachment summary generation are carried out separately, each according to the corresponding rules in the rule base. The status of the mail is then updated and written back to the mail server database, and the temporary files are cleaned up.
In this flow, terminal identification has two parts. The first part is identifying the model of the mobile terminal, for which several mature techniques already exist. The model of the terminal is identified from the first 8 digits of its IMEI (International Mobile Equipment Identity), and the attributes of the terminal, such as its resolution, are then obtained from a mapping table between terminal models and terminal device attributes.
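The IMEI lookup described above can be sketched as two dictionary lookups. The table contents, the model name and the function name `terminal_attributes` are hypothetical; a real deployment would load the mapping tables from its own device database.

```python
# Hypothetical mapping from the first 8 IMEI digits to a terminal model.
PREFIX_TO_MODEL = {
    "35123456": "ExamplePhone A1",
}

# Hypothetical mapping from terminal model to device attributes.
MODEL_ATTRIBUTES = {
    "ExamplePhone A1": {"resolution": (480, 800)},
}


def terminal_attributes(imei):
    """Look up device attributes from the first 8 digits of a 15-digit IMEI.

    Returns None when the model is not in the mapping tables.
    """
    digits = imei.strip()
    if len(digits) != 15 or not digits.isdigit():
        raise ValueError("expected a 15-digit IMEI")
    model = PREFIX_TO_MODEL.get(digits[:8])
    if model is None:
        return None
    return MODEL_ATTRIBUTES.get(model)


print(terminal_attributes("351234560000000"))
```

Returning None for unknown models lets the caller fall back to a default, for example skipping picture compression when the resolution is not known.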
The second part is identifying the mobile mailbox terminal. On mobile terminals, pushmail is currently the mail push service used for Internet personal mailboxes and enterprise mailbox systems: a service that actively pushes mail newly arrived at the mail server to the user's mobile terminal. Pushmail has two existing technical schemes. 1) SMS PUSH: an SMS message triggers the software client installed on the mobile terminal, which then cooperates with the server to complete the mail reception process. 2) IP PUSH: mail is received over a persistent IP connection kept open by the mailbox client software on the mobile terminal. The enterprise installs a mail forwarding server connected to its internal mail server; the mail forwarding server is connected to the mail push gateway on the operator side, and the mail push gateway is connected to user terminals over the GSM/GPRS/3G network. When new mail arrives in the enterprise mailbox, the mail forwarding server notifies the mail push gateway. Under the SMS PUSH scheme, the mail push gateway sends a background notification SMS to the mobile users who have installed the terminal software; on receiving the notification, the terminal software activates the GPRS network, fetches the mail from the mail forwarding server and notifies the user, completing the mail push process. Under the IP PUSH scheme, the mail push gateway notifies the software client on the mobile terminal over the IP data connection; the client fetches the mail from the mail forwarding server and notifies the mobile user, completing the mail push process. With either SMS PUSH or IP PUSH, once the mail forwarding server receives the PUSH command, it can obtain information such as the resolution and default text format of the mobile terminal that is the target of the PUSH, and generate a summary of the requested mail.
The workflow of body summary generation is shown in Figure 5. The message body is scanned in full to see whether it contains a picture. If it does, the picture is compressed (if, after compression, the ratio between the length and width compression factors is not within the preset range, only a notice that the body contains a picture is added). After compression, or when there is no picture, the text summary is generated from the rule base, and the text summary together with the compressed picture (or the notice) forms the body summary. The rule base may store a Chinese lexicon, among other things.
Attachment summary generation consists of forming the attachment summary from the attachment names.
At the status update, the body summary and the attachment summary together form the summary of the e-mail; the summary generation queue is updated, and the status of the task in the mail server database is updated at the same time. After the update, the cleanup queue is notified.
Embodiment 2: a mail summary generation device, comprising:
A conversion module, configured to convert the message body into a sentence sequence and perform word segmentation;
A preselection module, configured to extract, from the words obtained by segmentation, the words representing names, times and places, and save them into a keyword set;
A weight screening module, configured to calculate, for each remaining word, a weight value from the frequency with which the word occurs in the message body and from the proportion of all sentences that contain the word, and to save the words whose weight value exceeds a predetermined weight threshold into the keyword set;
A probability screening module, configured to calculate, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the summary, and finally to output as the summary, in their order of appearance in the message body, the keywords whose probability is higher than a predetermined probability threshold together with the keywords representing names, times and places and the subject of the mail.
In one implementation of this embodiment, the probability screening module may also be configured, before the conversion module performs word segmentation, to judge whether some characters in the message body have a special format and whether the proportion of the message body taken up by those characters is less than a predetermined proportion threshold, and if so, to take the characters having the special format as a component of the summary;
The conversion module performs word segmentation on the remaining document obtained after the characters having the special format are removed from the message body.
In one implementation of this embodiment, the weight value calculated by the weight screening module may be:
$$W_f(w_i) = F(w_i) \times \log\left(\frac{S}{S_f(w_i)}\right)$$
where $W_f(w_i)$ is the calculated weight value of word $i$, $F(w_i)$ is the frequency with which word $i$ occurs in the message body, $S$ is the number of sentences in the sentence sequence, and $S_f(w_i)$ is the number of sentences in the sentence sequence in which word $i$ occurs.
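The weight formula can be sketched in a few lines. This sketch assumes that the frequency F is a raw occurrence count and that the logarithm is natural; the text fixes neither choice, and the function name `word_weights` and the sample sentences are illustrative.

```python
import math
from collections import Counter


def word_weights(sentences):
    """Compute W_f(w) = F(w) * log(S / S_f(w)) for every word.

    sentences is a list of token lists, one list per sentence.
    """
    total_sentences = len(sentences)
    # F(w): how many times each word occurs in the whole body.
    frequency = Counter(tok for sent in sentences for tok in sent)
    # S_f(w): in how many sentences each word occurs at least once.
    sentence_count = Counter(tok for sent in sentences for tok in set(sent))
    return {
        word: frequency[word] * math.log(total_sentences / sentence_count[word])
        for word in frequency
    }


body = [
    ["meeting", "budget"],
    ["budget", "review"],
    ["budget", "budget", "lunch"],
    ["meeting", "room"],
]
weights = word_weights(body)
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

A word that occurs in every sentence gets weight zero because log(S/S) = 0, so very common words drop out before the threshold comparison.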
In one implementation of this embodiment, the probability screening module may specifically comprise:
A matching unit, a judging unit, an output unit and a computing unit;
The matching unit is configured to match the set W, obtained by removing from the keyword set the keywords representing names, times and places, against the feature word sets corresponding to the various classes of documents;
The judging unit is configured, when fewer than a threshold proportion of the keywords in W appear in the feature word sets corresponding to the various classes of documents, to instruct the output unit to output directly as the summary, in their order of appearance in the message body, the keywords in W together with the keywords representing names, times and places and the subject of the mail; and, when no fewer than the threshold proportion of the keywords in W appear in the feature word set corresponding to documents of class $C_i$, to tag the mail as being of type $C_i$ and instruct the computing unit to perform the calculation;
The computing unit is configured to calculate the probability that a keyword not appearing in the feature word set corresponding to documents of class $C_i$ matches the anticipation dictionary $F_1, F_2, \ldots, F_k$:
$$P(w \in W \mid F_1, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid w \in W) \cdot P(w \in W)}{\prod_{i=1}^{k} P(F_i)}$$
where $\prod$ denotes the product operation; $w$ is any keyword in W that does not appear in the feature word set corresponding to documents of class $C_i$; $P(F_i)$ is the probability that $F_i$ occurs in the anticipation dictionary formed by the $k$ feature words; $P(F_i \mid w \in W)$ is the probability that $F_i$ occurs given $w \in W$; and $P(w \in W)$ is a constant;
The output unit is configured, after the computing unit has performed the calculation, to select the keywords whose $P(w \in W \mid F_1, \ldots, F_k)$ is greater than the probability threshold, together with the keywords in the keyword set that were successfully matched against the feature word set corresponding to documents of class $C_i$ and the keywords representing names, times and places, and to output them with the subject of the mail as the summary, in their order of appearance in the message body.
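The probability formula can be evaluated directly once the individual probabilities are known. The sketch below only performs the arithmetic; the function name `summary_probability` and the numbers are made up for illustration, and how the probabilities are estimated from training documents is not covered.

```python
import math


def summary_probability(likelihoods, evidence, prior):
    """Evaluate prod(P(F_i | w in W)) * P(w in W) / prod(P(F_i)).

    likelihoods[i] is P(F_i | w in W), evidence[i] is P(F_i),
    and prior is the constant P(w in W).
    """
    if len(likelihoods) != len(evidence):
        raise ValueError("one likelihood and one evidence term per feature word")
    return math.prod(likelihoods) * prior / math.prod(evidence)


# Two feature words with illustrative probabilities.
p = summary_probability(likelihoods=[0.6, 0.5], evidence=[0.5, 0.4], prior=0.5)
print(round(p, 4))

# A keyword is kept when its value exceeds the (hypothetical) threshold.
PROBABILITY_THRESHOLD = 0.7
print(p > PROBABILITY_THRESHOLD)
```

Because the denominator treats the feature words as independent, the value is a score rather than a calibrated probability and can exceed 1 for some inputs; the sketch does not clamp it.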
In one implementation of this embodiment, the device may further comprise:
A terminal acquisition module, configured to obtain, when the message body contains a picture, the resolution information of the terminal corresponding to the delivery address of the mail;
A picture compression module, configured to judge whether the picture in the message body, of length l × width d pixels, exceeds the terminal resolution m × n; if it does, to compress the picture to the resolution ratio of the terminal, so that the compressed length and width in pixels, l' and d', satisfy $l'/d' = m/n$; if the ratio between the length and width compression factors falls outside the preset range, only a notice that the body contains a picture is added to the summary, and if it falls within the range, the compressed picture is carried in the summary generated by the summary generation module; here $a = l/l'$ and $b = d/d'$.
One of ordinary skill in the art will appreciate that all or some of the steps of the above method can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Alternatively, all or some of the steps of the above embodiments can be implemented with one or more integrated circuits. Accordingly, each module or unit in the above embodiments can be implemented in hardware or as a software function module. The present invention is not restricted to any particular combination of hardware and software.
The present invention may of course have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make corresponding changes and variations according to the present invention, and all such changes and variations shall fall within the scope of protection of the claims of the present invention.

Claims (10)

1. A mail summary generation method, comprising:
S101: converting the message body into a sentence sequence and performing word segmentation;
S102: extracting, from the words obtained by segmentation, the words representing names, times and places, and saving them into a keyword set;
S103: for each remaining word, calculating a weight value from the frequency with which the word occurs in the message body and from the proportion of all sentences that contain the word; saving the words whose weight value exceeds a predetermined weight threshold into the keyword set;
S104: calculating, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the summary, and finally outputting as the summary, in their order of appearance in the message body, the keywords whose probability is higher than a predetermined probability threshold together with the keywords representing names, times and places and the subject of the mail.
2. The method of claim 1, characterized in that, before the step of performing word segmentation, the method further comprises:
Judging whether some characters in the message body have a special format and whether the proportion of the message body taken up by the characters having the special format is less than a predetermined proportion threshold, and if so, taking the characters having the special format as a component of the summary;
After the characters having the special format are removed from the message body, performing the step of word segmentation on the remaining document.
3. The method of claim 1, characterized in that the weight value is:
$$W_f(w_i) = F(w_i) \times \log\left(\frac{S}{S_f(w_i)}\right)$$
where $W_f(w_i)$ is the calculated weight value of word $i$, $F(w_i)$ is the frequency with which word $i$ occurs in the message body, $S$ is the number of sentences in the sentence sequence, and $S_f(w_i)$ is the number of sentences in the sentence sequence in which word $i$ occurs.
4. The method of claim 1, characterized in that step S104 comprises:
Matching the set W, obtained by removing from the keyword set the keywords representing names, times and places, against the feature word sets corresponding to the various classes of documents; if fewer than a threshold proportion of the keywords in W appear in the feature word sets corresponding to the various classes of documents, directly outputting as the summary, in their order of appearance in the message body, the keywords in W together with the keywords representing names, times and places and the subject of the mail;
When no fewer than the threshold proportion of the keywords in W appear in the feature word set corresponding to documents of class $C_i$, tagging the mail as being of type $C_i$, and calculating the probability that a keyword not appearing in the feature word set corresponding to documents of class $C_i$ matches the anticipation dictionary $F_1, F_2, \ldots, F_k$:
$$P(w \in W \mid F_1, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid w \in W) \cdot P(w \in W)}{\prod_{i=1}^{k} P(F_i)}$$
where $\prod$ denotes the product operation; $w$ is any keyword in W that does not appear in the feature word set corresponding to documents of class $C_i$; $P(F_i)$ is the probability that $F_i$ occurs in the anticipation dictionary formed by the $k$ feature words; $P(F_i \mid w \in W)$ is the probability that $F_i$ occurs given $w \in W$; and $P(w \in W)$ is a constant; selecting the keywords whose $P(w \in W \mid F_1, \ldots, F_k)$ is greater than the probability threshold, together with the keywords in the keyword set that were successfully matched against the feature word set corresponding to documents of class $C_i$ and the keywords representing names, times and places, and outputting them with the subject of the mail as the summary, in their order of appearance in the message body.
5. The method of claim 1, characterized in that it further comprises:
When the message body contains a picture, obtaining the resolution information of the terminal corresponding to the delivery address of the mail, and judging whether the picture, of length l × width d pixels, exceeds the terminal resolution m × n;
If it does, compressing the picture to the resolution ratio of the terminal, so that the compressed length and width in pixels, l' and d', satisfy:
$$\frac{l'}{d'} = \frac{m}{n};$$
If the ratio between the length and width compression factors falls outside the preset range, only a notice that the body contains a picture is added to the summary; if it falls within the range, the compressed picture is carried in the summary generated in step S104; here $a = l/l'$ and $b = d/d'$.
6. A mail summary generation device, characterized in that it comprises:
A conversion module, configured to convert the message body into a sentence sequence and perform word segmentation;
A preselection module, configured to extract, from the words obtained by segmentation, the words representing names, times and places, and save them into a keyword set;
A weight screening module, configured to calculate, for each remaining word, a weight value from the frequency with which the word occurs in the message body and from the proportion of all sentences that contain the word, and to save the words whose weight value exceeds a predetermined weight threshold into the keyword set;
A probability screening module, configured to calculate, with a naive Bayes classification model, the probability that each keyword in the keyword set, other than the keywords representing names, times and places, becomes part of the summary, and finally to output as the summary, in their order of appearance in the message body, the keywords whose probability is higher than a predetermined probability threshold together with the keywords representing names, times and places and the subject of the mail.
7. The device of claim 6, characterized in that:
The probability screening module is also configured, before the conversion module performs word segmentation, to judge whether some characters in the message body have a special format and whether the proportion of the message body taken up by the characters having the special format is less than a predetermined proportion threshold, and if so, to take the characters having the special format as a component of the summary;
The conversion module performs word segmentation on the remaining document obtained after the characters having the special format are removed from the message body.
8. The device of claim 6, characterized in that the weight value calculated by the weight screening module is:
$$W_f(w_i) = F(w_i) \times \log\left(\frac{S}{S_f(w_i)}\right)$$
where $W_f(w_i)$ is the calculated weight value of word $i$, $F(w_i)$ is the frequency with which word $i$ occurs in the message body, $S$ is the number of sentences in the sentence sequence, and $S_f(w_i)$ is the number of sentences in the sentence sequence in which word $i$ occurs.
9. The device of claim 6, characterized in that the probability screening module comprises:
A matching unit, a judging unit, an output unit and a computing unit;
The matching unit is configured to match the set W, obtained by removing from the keyword set the keywords representing names, times and places, against the feature word sets corresponding to the various classes of documents;
The judging unit is configured, when fewer than a threshold proportion of the keywords in W appear in the feature word sets corresponding to the various classes of documents, to instruct the output unit to output directly as the summary, in their order of appearance in the message body, the keywords in W together with the keywords representing names, times and places and the subject of the mail; and, when no fewer than the threshold proportion of the keywords in W appear in the feature word set corresponding to documents of class $C_i$, to tag the mail as being of type $C_i$ and instruct the computing unit to perform the calculation;
The computing unit is configured to calculate the probability that a keyword not appearing in the feature word set corresponding to documents of class $C_i$ matches the anticipation dictionary $F_1, F_2, \ldots, F_k$:
$$P(w \in W \mid F_1, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid w \in W) \cdot P(w \in W)}{\prod_{i=1}^{k} P(F_i)}$$
where $\prod$ denotes the product operation; $w$ is any keyword in W that does not appear in the feature word set corresponding to documents of class $C_i$; $P(F_i)$ is the probability that $F_i$ occurs in the anticipation dictionary formed by the $k$ feature words; $P(F_i \mid w \in W)$ is the probability that $F_i$ occurs given $w \in W$; and $P(w \in W)$ is a constant;
The output unit is configured, after the computing unit has performed the calculation, to select the keywords whose $P(w \in W \mid F_1, \ldots, F_k)$ is greater than the probability threshold, together with the keywords in the keyword set that were successfully matched against the feature word set corresponding to documents of class $C_i$ and the keywords representing names, times and places, and to output them with the subject of the mail as the summary, in their order of appearance in the message body.
10. The device of claim 6, characterized in that it further comprises:
A terminal acquisition module, configured to obtain, when the message body contains a picture, the resolution information of the terminal corresponding to the delivery address of the mail;
A picture compression module, configured to judge whether the picture in the message body, of length l × width d pixels, exceeds the terminal resolution m × n; if it does, to compress the picture to the resolution ratio of the terminal, so that the compressed length and width in pixels, l' and d', satisfy $l'/d' = m/n$; if the ratio between the length and width compression factors falls outside the preset range, only a notice that the body contains a picture is added to the summary, and if it falls within the range, the compressed picture is carried in the summary generated by the summary generation module; here $a = l/l'$ and $b = d/d'$.
CN201410469526.4A 2014-09-15 2014-09-15 E-mail digest generation method and device Pending CN104182549A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410469526.4A CN104182549A (en) 2014-09-15 2014-09-15 E-mail digest generation method and device

Publications (1)

Publication Number Publication Date
CN104182549A true CN104182549A (en) 2014-12-03

Family

ID=51963588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410469526.4A Pending CN104182549A (en) 2014-09-15 2014-09-15 E-mail digest generation method and device

Country Status (1)

Country Link
CN (1) CN104182549A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN101441744A (en) * 2008-11-27 2009-05-27 北京立通无限科技有限公司 E-mail management method and system
US20130086035A1 (en) * 2011-09-30 2013-04-04 International Business Machines Corporation Method and apparatus for generating extended page snippet of search result

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786790A (en) * 2014-12-18 2016-07-20 镇江高科科技信息咨询有限公司 Device and method for generation of paper text
CN104636431B (en) * 2014-12-31 2017-12-12 南京新模式软件集成有限公司 A kind of different field documentation summary extracts automatically and the method for Automatic Optimal
CN104636431A (en) * 2014-12-31 2015-05-20 南京新模式软件集成有限公司 Automatic extraction and optimizing method for document abstracts of different fields
CN104573054A (en) * 2015-01-21 2015-04-29 杭州朗和科技有限公司 Information pushing method and equipment
CN104573054B (en) * 2015-01-21 2018-06-01 杭州朗和科技有限公司 A kind of information-pushing method and equipment
CN104639427A (en) * 2015-02-04 2015-05-20 九玉(北京)科技有限公司 Method and device for outputting E-mail information
CN104639427B (en) * 2015-02-04 2017-11-14 九玉(北京)科技有限公司 A kind of method and device for exporting e-mail messages
CN107517152A (en) * 2016-06-15 2017-12-26 李卓桓 Mail treatment service system and method
CN106209605A (en) * 2016-08-30 2016-12-07 程传旭 The processing method of adnexa and equipment in a kind of network information
CN107608946A (en) * 2017-09-30 2018-01-19 努比亚技术有限公司 Word key content extracting method and corresponding mobile terminal
CN108038189A (en) * 2017-12-11 2018-05-15 南京茂毓通软件科技有限公司 A kind of information extracting system of Email
CN112364155A (en) * 2020-11-20 2021-02-12 北京五八信息技术有限公司 Information processing method and device
CN112364155B (en) * 2020-11-20 2024-05-31 北京五八信息技术有限公司 Information processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141203
