CN108038189A - A kind of information extracting system of Email - Google Patents
A kind of information extracting system of Email
- Publication number
- CN108038189A (application CN201711307359.3A)
- Authority
- CN
- China
- Prior art keywords
- annex
- information
- extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an information extraction system for email that comprises a mail information extraction engine. The engine trains a classification model on a training set, segments the mail content into words with a word segmentation tool, and then classifies the segmented content with the trained model to obtain the category of the mail. After a user manually labels a mail with a category, the mail is automatically put back into the training set of the corresponding category. The invention can therefore classify mail content, and a user can decide from the category whether to read a mail, which saves the user's time. In addition, because manually labeled mails are automatically returned to the training set of the corresponding category, the training set is refined iteratively and the classification results increasingly match the user's needs.
Description
Technical field
The present invention relates to email information extraction, and in particular to an information extraction system for email.
Background
With the development of Internet technology, the volume of information on the network has become vast. How to identify information quickly and accurately, and how that information propagates on the Internet, have become matters of widespread concern. Email is a means of exchanging information through electronic communication systems. It is now closely tied to the Internet and has become one of the most popular Internet application services. Email evidence plays an increasingly important role in the investigation and handling of cases. Faced with massive volumes of email, how to analyze mail evidence quickly and effectively has become a focus of large-scale email analysis in the big-data era. Existing email analysis systems cannot classify mail content, so case handlers must spend more time analyzing mail content manually.
Summary of the invention
Object of the invention: the object of the present invention is to provide an information extraction system for email that can classify mail content sensibly, thereby saving case handlers time and effort.
Technical solution: the information extraction system for email according to the present invention comprises a mail information extraction engine. The engine trains a classification model on a training set, segments the mail content into words with a word segmentation tool, and finally classifies the segmented content with the trained model to obtain the category of the mail. After a user manually labels a mail with a category, the mail is automatically put back into the training set of the corresponding category.
Further, the mail information extraction engine also includes an attachment parsing module and a decompression module, and can identify encrypted attachments. When an attachment is not a compressed file, it is parsed by the attachment parsing module: if it parses normally, the attachment is judged to be unencrypted; otherwise it is judged to be encrypted. When an attachment is a compressed file, it is decompressed by the decompression module: if it decompresses normally, the attachment is judged to be unencrypted; otherwise it is judged to be encrypted. In this way, whether an attachment is encrypted can be determined effectively.
Further, the mail information extraction engine also includes a mail parsing module. The module parses the mail, and the following are extracted directly from the parsing result: the message body, the subject, the accounts and nicknames of the recipients, sender, CC recipients and BCC recipients, the sending time, whether the recipient has acknowledged receipt, the language encoding, the attachment names and the number of attachments. In this way the basic information of the mail can be extracted.
Further, the mail information extraction engine can also extract a summary of the mail: a weighted scoring standard is first drawn up, each sentence is then scored, and the several top-ranked sentences are returned as the extraction result. In this way the main information the mail conveys can be understood without reading the full content, which saves time.
Further, the mail information extraction engine also identifies the language of the message body by means of n-grams. This lets the user single out the mails written in a particular language without screening every mail, which saves considerable time.
Further, the mail information extraction engine also extracts the entity information in the mail content by a Chinese named entity recognition method based on a cascaded Markov model and character tagging.
Further, the information extraction system also includes a mail attachment information extraction engine, which includes an attachment parsing module that can extract the content of attachments.
Further, the information extraction system also includes an implicit information extraction engine. This engine includes a mail parsing module and can extract forwarding relations: the mail is parsed with the mail parsing module, and the contents of the sender, recipient and time fields are then extracted. Through the implicit information extraction engine, the user can thus obtain the time-ordered propagation of a mail and the forwarding relations within it, which improves the efficiency of finding useful information in mails.
Further, the implicit information extraction engine can extract entity relations: the entities are first determined, namely the mail login account, the mail nickname, the mail attachments and the basic mail information, and association analysis is then performed on them. Through the implicit information extraction engine, the user can thus obtain the time-ordered propagation of a mail and the entity relations within it, which improves the efficiency of finding useful information in mails.
Further, the association analysis includes the following relation models: account and nickname relation, mail and attachment relation, mail and sender relation, mail and recipient relation.
Beneficial effects: the invention discloses an information extraction system for email that can classify mail content, so that a user can decide from the category whether to read a mail, which saves the user's time. In addition, when the user manually labels a mail with a category, the mail is automatically put back into the training set of the corresponding category. This process refines the training set iteratively, so the classification results increasingly match the user's needs.
Brief description of the drawings
Fig. 1 is the block diagram of system in the specific embodiment of the invention;
Fig. 2 is that mail propagates figure in embodiment 1;
Fig. 3 is e-mail messages and graph of a relation in embodiment 1.
Detailed description
The technical solution of the invention is further described below with reference to the accompanying drawings and the specific embodiment.
This embodiment discloses an information extraction system for email comprising a mail information extraction engine, a mail attachment information extraction engine and an implicit information extraction engine, as shown in Fig. 1.
The mail information extraction engine extracts the explicit information in a mail and has the following functions:
1) Basic mail information extraction
The mail parsing module of the mail information extraction engine parses the eml file, and the following are extracted directly from the parsing result: the message body, the subject, the accounts and nicknames of the recipients, sender, CC recipients and BCC recipients, the sending time, whether the recipient has acknowledged receipt, the language encoding, the attachment names and the number of attachments. The extracted content directly reflects the basic information of the mail.
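The field extraction described above can be sketched with Python's standard `email` package. This is only an illustration under stated assumptions, not the patent's implementation: the function name `extract_basic_info`, the dictionary keys and the sample message are hypothetical, and the "acknowledged receipt" field is left out because the source does not say how it is determined.

```python
from email import message_from_string, policy
from email.utils import getaddresses

# Hypothetical sample message used only for illustration.
RAW_EML = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Cc: Carol <carol@example.com>
Subject: Harvest report
Date: Wed, 08 Jun 2016 20:27:06 +0800
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The wheat harvest finished today.
"""


def extract_basic_info(raw):
    """Parse raw eml text and return the basic fields as a dict."""
    msg = message_from_string(raw, policy=policy.default)

    def pairs(header):
        # Each entry is a (nickname, account) tuple.
        return getaddresses(msg.get_all(header, []))

    body = msg.get_body(preferencelist=("plain",))
    attachments = [part.get_filename() for part in msg.iter_attachments()]
    return {
        "subject": str(msg["Subject"]),
        "sender": pairs("From"),
        "to": pairs("To"),
        "cc": pairs("Cc"),
        "bcc": pairs("Bcc"),
        "date": str(msg["Date"]),
        "charset": body.get_content_charset() if body else None,
        "body": body.get_content() if body else "",
        "attachment_names": attachments,
        "attachment_count": len(attachments),
    }


info = extract_basic_info(RAW_EML)
```

Returning nickname and account as a pair keeps the two values together, which is convenient for the account and nickname relation used later by the implicit information extraction engine.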
2) Summary extraction
A summary of the mail is extracted as follows: a weighted scoring standard is first drawn up, each sentence is then scored, and the several top-ranked sentences are returned as the extraction result. In this way the main information the mail conveys can be understood without reading the full content, which saves time.
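A minimal sketch of this score-and-rank procedure follows. The source does not specify the scoring standard, so the two features used here (overlap with the subject line and sentence position), the weights `W_SUBJECT` and `W_POSITION`, and the function name `summarize` are all assumptions chosen for illustration. Sentence splitting relies on ASCII punctuation followed by whitespace, which would need replacing for Chinese text.

```python
import re

# Assumed weights; the source only says that a weighted standard is drawn up.
W_SUBJECT = 2.0   # weight per word shared with the subject line
W_POSITION = 1.0  # weight for appearing early in the mail


def summarize(body, subject, top_k=2):
    """Return the top_k highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
    subject_words = set(re.findall(r"\w+", subject.lower()))
    scored = []
    for idx, sent in enumerate(sentences):
        words = re.findall(r"\w+", sent.lower())
        overlap = len(subject_words.intersection(words))
        position = 1.0 / (idx + 1)
        score = W_SUBJECT * overlap + W_POSITION * position
        scored.append((score, idx, sent))
    # Rank by score (ties broken by position), then restore reading order.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_k]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]
```

For a body such as "Meeting moved to Friday. Please bring the budget report. Lunch is provided." and the subject "Budget report meeting", the second sentence scores highest because it shares two words with the subject.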
3) Language identification
The language of the message body is identified by means of n-grams. This lets the user single out the mails written in a particular language without screening every mail, which saves considerable time.
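As a toy illustration of n-gram language identification, the sketch below compares character bigram counts of a text against small per-language profiles using cosine similarity. The sample sentences, the choice of bigrams and cosine similarity, and the names `ngrams` and `identify_language` are assumptions; the source only states that n-grams are used, and a practical system would build its profiles from much larger corpora.

```python
from collections import Counter
from math import sqrt

# Tiny hypothetical reference texts; real profiles need far more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the mail for you",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist die post für dich",
    "zh": "今天的邮件已经发送请查收附件中的文件谢谢",
}


def ngrams(text, n=2):
    """Count overlapping character n-grams of the lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


PROFILES = {lang: ngrams(sample) for lang, sample in SAMPLES.items()}


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify_language(text):
    """Return the profile language most similar to the text."""
    grams = ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```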
4) Mail content classification
A classification model is trained on a training set prepared in advance. The mail content is then segmented into words with a word segmentation tool, and finally the segmented content is classified with the trained model to obtain the category of the mail. While viewing a mail in the interface, the user can also label it manually with a category, and the mail is then automatically put back into the training set of the corresponding category. This process refines the training set iteratively, so the classification results increasingly match the user's needs.
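The train, segment, classify and feedback loop above can be sketched as follows. The source names neither the classifier nor the segmentation tool, so a small multinomial naive Bayes model with add-one smoothing and a whitespace tokenizer stand in for them here; the class `MailClassifier` and its method names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def segment(text):
    # Stand-in for a real word segmentation tool (an assumption).
    return text.lower().split()


class MailClassifier:
    """Naive Bayes classifier whose training set grows with user labels."""

    def __init__(self):
        self.training_set = defaultdict(list)  # category -> list of mail texts
        self.word_counts = {}
        self.priors = {}
        self.vocab = set()

    def add_example(self, label, text):
        self.training_set[label].append(text)

    def train(self):
        self.word_counts = {
            label: Counter(w for text in texts for w in segment(text))
            for label, texts in self.training_set.items()
        }
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(len(texts) for texts in self.training_set.values())
        self.priors = {
            label: len(texts) / total_docs
            for label, texts in self.training_set.items()
        }

    def classify(self, text):
        words = segment(text)

        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.priors[label])
            for w in words:
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return score

        return max(self.word_counts, key=log_score)

    def feedback(self, label, text):
        """Put a user-labeled mail back into the training set and retrain."""
        self.add_example(label, text)
        self.train()
```

Retraining on every label keeps the sketch short. A real system would more likely batch the feedback and retrain periodically.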
5) Encrypted attachment identification
The mail information extraction engine also includes an attachment parsing module and a decompression module, and can identify encrypted attachments. When an attachment is not a compressed file, it is parsed by the attachment parsing module: if it parses normally, the attachment is judged to be unencrypted; otherwise it is judged to be encrypted. When an attachment is a compressed file, it is decompressed by the decompression module: if it decompresses normally, the attachment is judged to be unencrypted; otherwise it is judged to be encrypted. In this way, whether an attachment is encrypted can be determined effectively.
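The decision rule above can be sketched with Python's standard `zipfile` module. The sketch makes several assumptions that are not in the source: only ZIP archives are treated as compressed files, detection is by file extension, and "parsing" a non-compressed attachment is reduced to decoding it as UTF-8 text. The function name `attachment_is_encrypted` is hypothetical. As in the description, any failure to parse or decompress is reported as encryption, so a merely corrupted file would be flagged as well.

```python
import io
import zipfile


def attachment_is_encrypted(name, data):
    """Apply the parse-or-decompress heuristic to one attachment."""
    if name.lower().endswith(".zip"):
        # Compressed file: try to decompress every member.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                for member in archive.infolist():
                    # Reading a password-protected member without a
                    # password raises RuntimeError.
                    archive.read(member)
            return False
        except (RuntimeError, zipfile.BadZipFile):
            return True
    # Not a compressed file: try to parse it (here, decode as text).
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True
```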
6) Entity information extraction
The content of a mail may also contain important information. The entity information in a mail consists of the person names and place names in its content, and it is extracted by a Chinese named entity recognition method based on a cascaded Markov model and character tagging.
The mail attachment information extraction engine extracts the explicit information in the unencrypted attachments of a mail. The extracted content includes: attachment name, attachment size, number of attachments, attachment type, attachment content, attachment summary, attachment encoding, attachment language, attachment content category, and entity information (person names and place names). The mail attachment information extraction engine includes an attachment parsing module that can extract the content of attachments. The attachment-related information is extracted by the same methods as those used in the mail information extraction engine.
The implicit information extraction engine contains several relation models, which are used to mine and analyze the explicit information so that the various pieces of information in a mail are related along different dimensions. Through the implicit information extraction engine, the user can obtain the time-ordered propagation of a mail and the various relations within it, which improves the efficiency of finding useful information in mails. The implicit information extraction engine extracts forwarding relations and entity relations:
a) Forwarding relation extraction
The implicit information extraction engine includes a mail parsing module. The mail is parsed with the mail parsing module, and the contents of the sender, recipient and time fields are then extracted. Through the implicit information extraction engine, the user can thus obtain the time-ordered propagation of a mail and the forwarding relations within it, which improves the efficiency of finding useful information in mails.
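Given the sender, recipient and time fields of each related mail, the time-ordered propagation can be sketched as a list of sender-to-recipient edges sorted by sending time. The input layout (a list of dicts with `sender`, `recipients` and `date` keys) and the function name `forwarding_chain` are assumptions for illustration; the source does not describe its data structures.

```python
from email.utils import parsedate_to_datetime


def forwarding_chain(mails):
    """Return (sender, recipient, sent_at) edges ordered by sending time."""
    ordered = sorted(mails, key=lambda m: parsedate_to_datetime(m["date"]))
    edges = []
    for mail in ordered:
        sent_at = parsedate_to_datetime(mail["date"])
        for recipient in mail["recipients"]:
            edges.append((mail["sender"], recipient, sent_at))
    return edges


# Hypothetical records, deliberately listed out of order.
mails = [
    {"sender": "bob@example.com", "recipients": ["carol@example.com"],
     "date": "Thu, 09 Jun 2016 09:00:00 +0800"},
    {"sender": "alice@example.com", "recipients": ["bob@example.com"],
     "date": "Wed, 08 Jun 2016 20:27:06 +0800"},
]
edges = forwarding_chain(mails)
```

Reading the edges in order gives the propagation path: alice to bob first, then bob to carol.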
b) Entity relation extraction
The entities are first determined, namely the mail login account, the mail nickname, the mail attachments and the basic mail information, and association analysis is then performed on them. Through the implicit information extraction engine, the user can thus obtain the time-ordered propagation of a mail and the entity relations within it, which improves the efficiency of finding useful information in mails. The association analysis includes the following relation models: account and nickname relation, mail and attachment relation, mail and sender relation, mail and recipient relation.
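One simple way to represent the four relation models is as (subject, relation, object) triples, sketched below. The triple layout, the relation labels, the input dict and the function name `build_relations` are assumptions rather than anything specified in the source, and the attachment name in the example is invented.

```python
def build_relations(mail):
    """Build relation triples for one parsed mail.

    mail["sender"] and each entry of mail["recipients"] are
    (nickname, account) pairs; mail["attachments"] is a list of names.
    """
    triples = []
    # Account and nickname relation.
    for nickname, account in [mail["sender"]] + mail["recipients"]:
        if nickname:
            triples.append((account, "has_nickname", nickname))
    # Mail and attachment relation.
    for name in mail["attachments"]:
        triples.append((mail["id"], "has_attachment", name))
    # Mail and sender relation.
    triples.append((mail["id"], "sent_by", mail["sender"][1]))
    # Mail and recipient relation.
    for _, account in mail["recipients"]:
        triples.append((mail["id"], "received_by", account))
    return triples


mail = {
    "id": "20160608202706-43855.eml",
    "sender": ("Alice", "alice@example.com"),
    "recipients": [("Bob", "bob@example.com"), ("", "carol@example.com")],
    "attachments": ["report.docx"],  # hypothetical attachment name
}
relations = build_relations(mail)
```

Triples of this kind can be loaded directly into a graph structure to draw a relation diagram.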
One embodiment is described below:
Embodiment 1:
Suppose an organization needs to analyze a mail to determine its category, encryption status, origin of propagation, scope of propagation and the people involved. To meet this need, a developer exports the relevant mails and records from the organization's mail database and then applies the method described in the present invention to analyze the explicit and implicit information of the mail, thereby finding its category, encryption status, origin of propagation, scope of propagation and the people involved.
(1) Mail information extraction engine
The mail information extraction engine extracts the relevant explicit information from the mail according to the standard mail format.
(2) Mail attachment information extraction engine
The mail attachment information extraction engine analyzes the attachment information and extracts it.
(3) Implicit information extraction engine
a. Forwarding relation extraction
Extracting the mail forwarding relations makes it possible to view directly how the mail propagated over time.
b. Entity relation extraction
The entities are first determined from the ticket (record) information and from the results extracted by the mail information extraction engine and the mail attachment information extraction engine. Association analysis is then performed on these entities to obtain the entity relations.
Analysis by the mail information extraction engine and the mail attachment information extraction engine shows that this mail is an unencrypted Chinese-language mail in the agriculture category that has CC and BCC recipients and carries an agriculture-category attachment. The implicit information extraction engine extracts the entity relations and the time-ordered forwarding relations, from which the propagation path of the mail and the people involved in its propagation are constructed, as shown in the mail propagation diagram of Fig. 2. After this processing, the various items of information of the mail 20160608202706-43855.eml and the relations among its entities can be viewed directly, as shown in the mail information and relation diagram of Fig. 3.
Claims (10)
- 1. An information extraction system for email, characterized in that it comprises a mail information extraction engine; the mail information extraction engine trains a classification model on a training set, then segments the mail content into words with a word segmentation tool, and finally classifies the segmented mail content with the trained classification model to obtain the category of the mail content; after a user manually labels a mail with a category, the mail is automatically put back into the training set of the corresponding category.
- 2. The information extraction system for email according to claim 1, characterized in that the mail information extraction engine further includes an attachment parsing module and a decompression module and can identify encrypted attachments: when an attachment is not a compressed file, the attachment is parsed by the attachment parsing module, and the attachment is judged to be unencrypted if it can be parsed normally and encrypted if it cannot; when an attachment is a compressed file, the attachment is decompressed by the decompression module, and the attachment is judged to be unencrypted if it can be decompressed normally and encrypted if it cannot.
- 3. The information extraction system for email according to claim 1, characterized in that the mail information extraction engine further includes a mail parsing module; the mail parsing module parses the mail, and the following are extracted directly from the parsing result: the message body, the subject, the accounts and nicknames of the recipients, sender, CC recipients and BCC recipients, the sending time, whether the recipient has acknowledged receipt, the language encoding, the attachment names and the number of attachments.
- 4. The information extraction system for email according to claim 1, characterized in that the mail information extraction engine can also extract a summary of the mail: a weighted scoring standard is first drawn up, each sentence is then scored, and the several top-ranked sentences are returned as the extraction result.
- 5. The information extraction system for email according to claim 1, characterized in that the mail information extraction engine also identifies the language of the message body by means of n-grams.
- 6. The information extraction system for email according to claim 1, characterized in that the mail information extraction engine also extracts the entity information in the mail content by a Chinese named entity recognition method based on a cascaded Markov model and character tagging.
- 7. The information extraction system for email according to claim 1, characterized in that the information extraction system further includes a mail attachment information extraction engine; the mail attachment information extraction engine includes an attachment parsing module, and the attachment parsing module can extract the content of attachments.
- 8. The information extraction system for email according to claim 1, characterized in that the information extraction system further includes an implicit information extraction engine; the implicit information extraction engine includes a mail parsing module and can extract forwarding relations: the mail is parsed with the mail parsing module, and the contents of the sender, recipient and time fields are then extracted.
- 9. The information extraction system for email according to claim 8, characterized in that the implicit information extraction engine can extract entity relations: the entities are determined, namely the mail login account, the mail nickname, the mail attachments and the basic mail information, and association analysis is then performed on the entities.
- 10. The information extraction system for email according to claim 9, characterized in that the association analysis includes the following relation models: account and nickname relation, mail and attachment relation, mail and sender relation, mail and recipient relation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711307359.3A CN108038189A (en) | 2017-12-11 | 2017-12-11 | A kind of information extracting system of Email |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711307359.3A CN108038189A (en) | 2017-12-11 | 2017-12-11 | A kind of information extracting system of Email |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038189A (en) | 2018-05-15 |
Family
ID=62101680
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711307359.3A (Pending) CN108038189A (en) | A kind of information extracting system of Email | 2017-12-11 | 2017-12-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038189A (en) |
-
2017
- 2017-12-11 CN CN201711307359.3A patent/CN108038189A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
CN102842078A (en) * | 2012-07-18 | 2012-12-26 | 南京邮电大学 | Email forensic analyzing method based on community characteristics analysis |
CN104182549A (en) * | 2014-09-15 | 2014-12-03 | 中国联合网络通信集团有限公司 | E-mail digest generation method and device |
CN106598937A (en) * | 2015-10-16 | 2017-04-26 | 阿里巴巴集团控股有限公司 | Language recognition method and device for text and electronic equipment |
CN105871887A (en) * | 2016-05-12 | 2016-08-17 | 北京大学 | Client-side based personalized E-mail filtering system and method |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109033155A (en) * | 2018-06-13 | 2018-12-18 | 中国电子科技集团公司电子科学研究院 | Search mail content and method, device, terminal and storage medium |
CN110400123A (en) * | 2019-07-05 | 2019-11-01 | 中国平安财产保险股份有限公司 | Friend-making information popularization method, apparatus, equipment and computer readable storage medium |
CN110400123B (en) * | 2019-07-05 | 2023-06-20 | 中国平安财产保险股份有限公司 | Friend-making information popularization method, friend-making information popularization device, friend-making information popularization equipment and friend-making information popularization computer readable storage medium |
CN111047455A (en) * | 2019-12-31 | 2020-04-21 | 武汉市烽视威科技有限公司 | Personal statue method and system for mail |
CN116032509A (en) * | 2021-10-27 | 2023-04-28 | 中移系统集成有限公司 | Mail encryption and decryption method and device |
CN116308237A (en) * | 2023-05-25 | 2023-06-23 | 湖南九立供应链有限公司 | ERP mail processing method and related equipment thereof |
CN116308237B (en) * | 2023-05-25 | 2023-08-25 | 湖南九立供应链有限公司 | ERP mail processing method and related equipment thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108038189A (en) | A kind of information extracting system of Email | |
CN103150367B (en) | A kind of Sentiment orientation analytical approach of Chinese microblogging | |
CN104463552B (en) | Calendar reminding generation method and device | |
KR101716905B1 (en) | Method for calculating entity similarities | |
CN109582861B (en) | Data privacy information detection system | |
CN103729474B (en) | Method and system for recognizing forum user vest account | |
CN109753909A (en) | A kind of resume analytic method based on content piecemeal and BiLSTM model | |
CN107632968A (en) | A kind of construction method of chain of evidence relational model towards judgement document | |
CN108491388B (en) | Data set acquisition method, classification method, device, equipment and storage medium | |
CN109446404A (en) | A kind of the feeling polarities analysis method and device of network public-opinion | |
CN106776555B (en) | A kind of comment text entity recognition method and device based on word model | |
Zhang et al. | Filtering junk mail with a maximum entropy model | |
JP2010056682A (en) | E-mail receiver and method of receiving e-mail, e-mail transmitter and e-mail transmission method, mail transmission server | |
CN108009297A (en) | Text emotion analysis method and system based on natural language processing | |
CN109446299B (en) | Method and system for searching e-mail content based on event recognition | |
CN109062895A (en) | A kind of intelligent semantic processing method | |
CN111985896A (en) | Mail filtering method and device | |
Algur et al. | Sentiment analysis by identifying the speaker's polarity in Twitter data | |
CN109039874A (en) | A kind of the mail auditing method and device of Behavior-based control analysis | |
JP2004310691A (en) | Text information processor | |
CN111199208A (en) | Head portrait gender identification method and system based on deep learning framework | |
CN110110079B (en) | Social network spam user detection method | |
CN111144929A (en) | Comment object and word combined extraction method for automobile industry user generated content | |
McKeown et al. | Automatically learning cognitive status for multi-document summarization of newswire | |
CN106817297B (en) | A method of spam is identified by html tag |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180515 |