CN108038189A - A kind of information extracting system of Email - Google Patents

A kind of information extracting system of Email

Info

Publication number
CN108038189A
Authority
CN
China
Prior art keywords
mail
annex
information
extraction
email
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711307359.3A
Other languages
Chinese (zh)
Inventor
龙炳林
陆丰勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Mao Yu Tong Software Technology Co Ltd
Original Assignee
Nanjing Mao Yu Tong Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Mao Yu Tong Software Technology Co Ltd
Priority to CN201711307359.3A
Publication of CN108038189A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an information extraction system for email, comprising a mail information extraction engine. The mail information extraction engine trains a classification model on a training set, segments the mail content into words with a word segmentation tool, and finally classifies the segmented mail content with the trained classification model to obtain the category of the mail content. After a user manually labels a mail with a category, the mail is automatically added to the training set of the corresponding category. The invention can classify mail content, so the user can decide whether to read a mail by checking its category, which effectively saves the user's time. In addition, because mails that users label manually are automatically added to the training set of the corresponding category, the training set is iteratively optimized through this process, and the classification results increasingly meet the user's needs.

Description

A kind of information extracting system of Email
Technical field
The present invention relates to email information extraction, and in particular to an information extraction system for email.
Background art
With the development of Internet technology, the information on the network has become as vast as the sea. How to identify information quickly and accurately, and how it propagates on the Internet, has become a focus of attention. Email is a means of exchanging information through an electronic communication system; it is now usually tied to the Internet and has become one of the most popular Internet application services. Email evidence plays an increasingly important role in the investigation and handling of cases. Faced with massive volumes of email, how to analyze mail evidence quickly and effectively has become a focus of mass email analysis in the big data era. Existing email analysis systems cannot classify mail content, so case handlers have to spend more time analyzing mail content manually.
Summary of the invention
Purpose of the invention: the object of the present invention is to provide an information extraction system for email that can classify mail content reasonably, thereby saving the time and effort of case handlers.
Technical solution: the information extraction system for email of the present invention comprises a mail information extraction engine. The mail information extraction engine trains a classification model on a training set, then segments the mail content into words with a word segmentation tool, and finally classifies the segmented mail content with the trained classification model to obtain the category of the mail content. After a user manually labels a mail with a category, the mail is automatically added to the training set of the corresponding category.
Further, the mail information extraction engine also includes an attachment parsing module and a decompression module and can identify encrypted attachments. When an attachment is not a compressed file, it is parsed by the attachment parsing module: if it can be parsed normally, the attachment is judged to be unencrypted; if it cannot, it is judged to be encrypted. When an attachment is a compressed file, it is decompressed by the decompression module: if it can be decompressed normally, the attachment is judged to be unencrypted; if it cannot, it is judged to be encrypted. In this way, whether an attachment is encrypted can be judged effectively.
Further, the mail information extraction engine also includes a mail parsing module, which parses the mail. The following are extracted directly from the parsing result: the message body, the subject, the accounts and nicknames of the recipients, sender, Cc recipients and Bcc recipients, the sending time, whether the recipient has checked the mail, the language encoding, the attachment names and the number of attachments. In this way the basic information of the mail can be extracted.
Further, the mail information extraction engine can also extract a summary of the mail: a weighted scoring standard is first drawn up, each sentence is then scored, and finally the several top-ranked sentences are given as the extraction result. In this way the main information conveyed by the mail can be understood without reading the mail content, which saves time.
Further, the mail information extraction engine also identifies the language of the message body by n-grams. This lets the user view the mails in a particular language directly, without screening every mail, which saves a great deal of time.
Further, the mail information extraction engine also extracts the entity information in the mail content by a Chinese named entity recognition method based on a cascaded Markov model and character tagging.
Further, the information extraction system also includes a mail attachment information extraction engine. The mail attachment information extraction engine includes an attachment parsing module, which can extract the content of attachments.
Further, the information extraction system also includes an implicit information extraction engine, which includes a mail parsing module. The implicit information extraction engine can extract forwarding relations: the mail is parsed by the mail parsing module, and the contents of the sender, recipient and time fields are then extracted. In this way the user can obtain, through the analysis of the implicit information extraction engine, the time-ordered propagation process of the mail and the forwarding relations in the mail, which improves the efficiency with which the user finds useful information in the mail.
Further, the implicit information extraction engine can extract entity relations: each entity is determined, namely the mail login account, the mail nickname, the mail attachment and the basic mail information, and association analysis is then performed on the entities. In this way the user can obtain, through the analysis of the implicit information extraction engine, the time-ordered propagation process of the mail and the entity relations in the mail, which improves the efficiency with which the user finds useful information in the mail.
Further, the association analysis includes the following relational models: account and nickname relation, mail and attachment relation, mail and sender relation, mail and recipient relation.
Beneficial effect: the invention discloses an information extraction system for email that can classify mail content, so the user can decide whether to read a mail by checking its category, which effectively saves the user's time. In addition, because mails that users label manually are automatically added to the training set of the corresponding category, the training set is iteratively optimized through this process, and the classification results increasingly meet the user's needs.
Brief description of the drawings
Fig. 1 is a block diagram of the system in the specific embodiment of the invention;
Fig. 2 is the mail propagation diagram in Embodiment 1;
Fig. 3 is the mail information and relationship diagram in Embodiment 1.
Detailed description
The technical solution of the present invention is further described below with reference to the accompanying drawings and the specific embodiment.
This embodiment discloses an information extraction system for email, comprising a mail information extraction engine, a mail attachment information extraction engine and an implicit information extraction engine, as shown in Fig. 1.
The mail information extraction engine extracts the explicit information in a mail and has the following functions:
1) Basic mail information extraction
The mail parsing module in the mail information extraction engine parses the eml file. The following are extracted directly from the parsing result: the message body, the subject, the accounts and nicknames of the recipients, sender, Cc recipients and Bcc recipients, the sending time, whether the recipient has checked the mail, the language encoding, the attachment names and the number of attachments. The content extracted in this way directly reflects the basic information of the mail.
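As an illustration only, the parsing step can be sketched with Python's standard `email` package. The function name `extract_basic_info`, the dictionary keys and the sample message are assumptions made for this sketch, not names taken from the patent; the "whether the recipient has checked the mail" field is omitted because the text does not say how it is obtained.

```python
from email import message_from_string, policy
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message used only to exercise the sketch.
SAMPLE_EML = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Cc: carol@example.com\n"
    "Subject: Harvest report\n"
    "Date: Wed, 08 Jun 2016 20:27:06 +0800\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Rice yields rose this year.\n"
)

def extract_basic_info(raw_eml):
    """Parse raw eml text and return the basic fields as a dict."""
    msg = message_from_string(raw_eml, policy=policy.default)

    def addresses(header):
        # Each entry is a (nickname, account) pair; nickname may be "".
        values = [str(v) for v in msg.get_all(header, [])]
        return getaddresses(values)

    body_part = msg.get_body(preferencelist=("plain", "html"))
    attachment_names = [part.get_filename() for part in msg.iter_attachments()]
    return {
        "subject": str(msg["Subject"] or ""),
        "from": addresses("From"),
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "bcc": addresses("Bcc"),
        "sent": parsedate_to_datetime(str(msg["Date"])) if msg["Date"] else None,
        "charset": body_part.get_content_charset() if body_part else None,
        "body": body_part.get_content() if body_part else "",
        "attachment_names": attachment_names,
        "attachment_count": len(attachment_names),
    }

info = extract_basic_info(SAMPLE_EML)
```

The nickname and account are kept as a pair so that later steps can relate them to each other.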
2) Summary extraction
The summary of a mail is extracted as follows: a weighted scoring standard is first drawn up, each sentence is then scored, and finally the several top-ranked sentences are given as the extraction result. In this way the main information conveyed by the mail can be understood without reading the mail content, which saves time.
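The text does not specify the weighted scoring standard. The following minimal sketch assumes one possible standard, average word frequency plus a bonus for the first sentence, purely to show the score-then-rank flow; the function name `summarize` and the weights are hypothetical.

```python
import re
from collections import Counter

def summarize(text, top_k=2):
    """Return the top_k highest-scoring sentences in their original order."""
    # Split into sentences on Western or Chinese end punctuation.
    sentences = [s.strip() for s in re.findall(r"[^.!?。！？]+[.!?。！？]?", text)]
    sentences = [s for s in sentences if s]

    def words(sentence):
        return re.findall(r"\w+", sentence.lower())

    freq = Counter(w for s in sentences for w in words(s))

    def score(index, sentence):
        tokens = words(sentence)
        if not tokens:
            return 0.0
        average_frequency = sum(freq[t] for t in tokens) / len(tokens)
        position_bonus = 1.0 if index == 0 else 0.0  # assumed weight
        return average_frequency + position_bonus

    ranked = sorted(enumerate(sentences), key=lambda p: score(*p), reverse=True)
    # Re-sort the selected sentences by position so the summary reads in order.
    return [sentence for _, sentence in sorted(ranked[:top_k])]

summary = summarize(
    "The shipment of rice arrives Monday. Thanks. "
    "Please confirm the rice shipment and the invoice."
)
```

Any other scoring standard (keyword lists, sentence length, subject overlap) could replace `score` without changing the rest of the flow.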
3) Language identification
The language of the message body is identified by n-grams. This lets the user view the mails in a particular language directly, without screening every mail, which saves a great deal of time.
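One common way to identify a language with n-grams is to compare the character n-gram counts of the text against a profile for each language. The sketch below assumes character bigrams, cosine similarity and two toy profiles built from single sentences; none of these choices come from the patent, and a usable system would build profiles from much larger corpora.

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Count character n-grams after normalizing case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Toy language profiles (hypothetical training sentences).
PROFILES = {
    "en": ngrams("the quick brown fox reads the mail and the attachment"),
    "zh": ngrams("这是一封关于农业的邮件，请查收邮件附件"),
}

def identify_language(text):
    """Return the profile label whose n-gram counts are closest to the text."""
    sample = ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(sample, PROFILES[lang]))
```

For example, `identify_language("邮件附件已加密")` shares bigrams only with the Chinese profile and is labeled `"zh"`.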
4) Mail content classification
A classification model is trained on a training set prepared in advance. The mail content is then segmented into words with a word segmentation tool, and finally the segmented mail content is classified with the trained classification model to obtain the category of the mail content. While viewing a mail in the interface, the user can also label it manually with a category, and the mail is then automatically added to the training set of the corresponding category. The training set is iteratively optimized through this process, and the classification results increasingly meet the user's needs.
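The train, segment, classify and feedback loop can be sketched as follows. The patent names neither the classification model nor the segmentation tool, so this sketch assumes a multinomial naive Bayes classifier with add-one smoothing and a whitespace `segment` function standing in for a real Chinese word segmenter; the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def segment(text):
    # Stand-in for a word segmentation tool; Chinese text would need a
    # dedicated segmenter instead of whitespace splitting.
    return text.lower().split()

class MailClassifier:
    def __init__(self, training_set):
        # training_set maps a category label to a list of mail texts.
        self.training_set = defaultdict(list)
        for label, docs in training_set.items():
            self.training_set[label].extend(docs)
        self.train()

    def train(self):
        self.word_counts = {
            label: Counter(w for doc in docs for w in segment(doc))
            for label, docs in self.training_set.items() if docs
        }
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = sum(len(d) for d in self.training_set.values())

    def classify(self, text):
        tokens = segment(text)

        def log_prob(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            prior = math.log(len(self.training_set[label]) / self.total_docs)
            # Add-one smoothing so unseen words do not zero out a category.
            return prior + sum(
                math.log((counts[t] + 1) / (total + len(self.vocab)))
                for t in tokens
            )

        return max(self.word_counts, key=log_prob)

    def add_labeled(self, text, label):
        # User feedback: put the mail into the training set and retrain.
        self.training_set[label].append(text)
        self.train()

clf = MailClassifier({
    "agriculture": ["rice harvest and wheat yield", "fertilizer for the rice fields"],
    "finance": ["invoice payment due", "bank transfer and invoice"],
})
```

Retraining from scratch on every label keeps the sketch short; an incremental update of the counts would serve the same purpose for larger training sets.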
5) Encrypted attachment identification
The mail information extraction engine also includes an attachment parsing module and a decompression module and can identify encrypted attachments. When an attachment is not a compressed file, it is parsed by the attachment parsing module: if it can be parsed normally, the attachment is judged to be unencrypted; if it cannot, it is judged to be encrypted. When an attachment is a compressed file, it is decompressed by the decompression module: if it can be decompressed normally, the attachment is judged to be unencrypted; if it cannot, it is judged to be encrypted. In this way, whether an attachment is encrypted can be judged effectively.
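A minimal sketch of this decision rule, assuming ZIP as the only compressed format and UTF-8 text decoding as a stand-in for the attachment parsing module (both are assumptions for illustration; `parse_plain_attachment` and `is_encrypted` are hypothetical names). Python's `zipfile` raises `RuntimeError` when an encrypted member is read without a password.

```python
import io
import zipfile

def parse_plain_attachment(data):
    # Hypothetical stand-in for the attachment parsing module:
    # here "parsing" only means decoding the bytes as UTF-8 text.
    return data.decode("utf-8")

def is_encrypted(data):
    """Apply the rule: parse or decompress fails -> judged encrypted."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        try:
            with zipfile.ZipFile(buffer) as archive:
                for name in archive.namelist():
                    archive.read(name)  # fails without a password if encrypted
            return False
        except (RuntimeError, zipfile.BadZipFile):
            return True
    try:
        parse_plain_attachment(data)
        return False
    except UnicodeDecodeError:
        return True
```

Note that under this rule a corrupted attachment is also judged encrypted, since the rule only observes whether parsing or decompression succeeds.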
6) Entity information extraction
The content of a mail may also contain important information. The entity information in a mail consists of the person names and place names in the mail content. It is extracted by a Chinese named entity recognition method based on a cascaded Markov model and character tagging.
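The cascaded model itself is not detailed in the text. As a simplified illustration of the character tagging idea only, the sketch below uses a single-layer hidden Markov model decoded with the Viterbi algorithm, where each character is tagged B (begins a name), I (inside a name) or O (outside). All probabilities are invented for the example; a real system would estimate them from an annotated corpus and stack several such layers.

```python
import math

STATES = ["O", "B", "I"]
# Hypothetical start, transition and emission probabilities.
START = {"O": 0.6, "B": 0.4, "I": 1e-6}
TRANS = {
    "O": {"O": 0.7, "B": 0.3, "I": 1e-6},
    "B": {"O": 0.1, "B": 1e-6, "I": 0.9},
    "I": {"O": 0.8, "B": 0.1, "I": 0.1},
}
EMIT = {
    "O": {"在": 0.3, "发": 0.3, "信": 0.3},
    "B": {"张": 0.5, "李": 0.5},
    "I": {"伟": 0.5, "娜": 0.5},
}
UNSEEN = 1e-3  # emission probability for characters not in the table

def viterbi(chars):
    """Return the most probable tag sequence for a list of characters."""
    if not chars:
        return []

    def emit(state, char):
        return math.log(EMIT[state].get(char, UNSEEN))

    scores = [{s: math.log(START[s]) + emit(s, chars[0]) for s in STATES}]
    back = []
    for char in chars[1:]:
        row, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + emit(s, char)
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def extract_names(text):
    """Collect the character spans tagged B followed by I as names."""
    names, current = [], ""
    for char, tag in zip(text, viterbi(list(text))):
        if tag == "B":
            if current:
                names.append(current)
            current = char
        elif tag == "I" and current:
            current += char
        else:
            if current:
                names.append(current)
            current = ""
    if current:
        names.append(current)
    return names
```

Place names would be handled the same way with additional tags, for example separate B and I tags per entity type.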
The mail attachment information extraction engine extracts the explicit information in the non-encrypted attachments of a mail. The extracted content includes: attachment name, attachment size, number of attachments, attachment type, attachment content, attachment summary, attachment encoding, attachment language, attachment content category, and entity information (including person names and place names). The mail attachment information extraction engine includes an attachment parsing module, which can extract the content of attachments. The methods for extracting the attachment-related information are the same as those used in the mail information extraction engine.
The implicit information extraction engine contains multiple relational models. The explicit information is mined and analyzed through these relational models so that the various items of information in a mail are associated along different dimensions. Through the analysis of the implicit information extraction engine, the user can obtain the time-ordered propagation process of the mail and the various associations in the mail, which improves the efficiency with which the user finds useful information in the mail. The implicit information extraction engine covers the extraction of forwarding relations and of entity relations:
A) Forwarding relation extraction
The implicit information extraction engine includes a mail parsing module. The mail is parsed by the mail parsing module, and the contents of the sender, recipient and time fields are then extracted. In this way the user can obtain, through the analysis of the implicit information extraction engine, the time-ordered propagation process of the mail and the forwarding relations in the mail, which improves the efficiency with which the user finds useful information in the mail.
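Once the sender, recipient and time fields are available, the time-ordered propagation can be sketched as a list of sender-to-recipient edges sorted by sending time. The record layout (`from`, `to`, `sent`) and the function names are assumptions for this sketch.

```python
from datetime import datetime

def forwarding_edges(mails):
    """Return (sender, recipient) edges in order of sending time."""
    edges = []
    for mail in sorted(mails, key=lambda m: m["sent"]):
        edges.extend((mail["from"], recipient) for recipient in mail["to"])
    return edges

def propagation_source(mails):
    """The sender of the earliest mail is taken as the propagation source."""
    return min(mails, key=lambda m: m["sent"])["from"]

# Hypothetical records, as they might come out of the mail parsing module.
mails = [
    {"from": "b@example.com", "to": ["c@example.com"],
     "sent": datetime(2016, 6, 8, 21, 0)},
    {"from": "a@example.com", "to": ["b@example.com"],
     "sent": datetime(2016, 6, 8, 20, 27)},
]
edges = forwarding_edges(mails)
```

The resulting edge list is enough to draw a propagation diagram and to read off the propagation scope as the set of all accounts that appear in it.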
B) Entity relation extraction
Each entity is determined, namely the mail login account, the mail nickname, the mail attachment and the basic mail information, and association analysis is then performed on the entities. In this way the user can obtain, through the analysis of the implicit information extraction engine, the time-ordered propagation process of the mail and the entity relations in the mail, which improves the efficiency with which the user finds useful information in the mail. The association analysis includes the following relational models: account and nickname relation, mail and attachment relation, mail and sender relation, mail and recipient relation.
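The four relational models can be represented as typed triples. This is a minimal sketch under assumed names: the relation labels, the record layout and the function `build_relations` are hypothetical, and the sample mail is invented.

```python
def build_relations(mail):
    """Return a set of (relation, subject, object) triples for one mail."""
    relations = set()
    # Account and nickname relation, for every party that has a nickname.
    for nickname, account in [mail["from"]] + mail["to"]:
        if nickname:
            relations.add(("account_nickname", account, nickname))
    # Mail and sender relation.
    relations.add(("mail_sender", mail["id"], mail["from"][1]))
    # Mail and recipient relation.
    for _, account in mail["to"]:
        relations.add(("mail_recipient", mail["id"], account))
    # Mail and attachment relation.
    for name in mail["attachments"]:
        relations.add(("mail_attachment", mail["id"], name))
    return relations

# Hypothetical parsed mail; parties are (nickname, account) pairs.
sample_mail = {
    "id": "example.eml",
    "from": ("Alice", "alice@example.com"),
    "to": [("Bob", "bob@example.com"), ("", "carol@example.com")],
    "attachments": ["harvest.xlsx"],
}
relations = build_relations(sample_mail)
```

Merging the triples from many mails gives a graph in which an account, a nickname or an attachment name links the mails that share it.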
One embodiment is described below:
Embodiment 1:
Suppose a unit needs to analyze the information of a mail: its category, encryption status, propagation source, propagation scope and the persons involved. To meet this requirement, the developer exports the associated mails and records from the unit's mail database. The method described in the present invention can then be used to analyze the explicit and implicit information of the mail and thereby find out its category, encryption status, propagation source, propagation scope, the persons involved and other information.
(1) Mail information extraction engine
The mail information extraction engine extracts the relevant explicit information from the mail according to the standard mail format, for example:
(2) Mail attachment information extraction engine
The mail attachment information extraction engine analyzes the attachments and extracts their information.
(3) Implicit information extraction engine
A. Forwarding relation extraction
Extracting the mail forwarding relations makes it possible to view intuitively the time-ordered propagation process of the mail.
B. Entity relation extraction
First, each entity is determined from the ticket information and from the results extracted by the mail information extraction engine and the mail attachment information extraction engine; association analysis is then performed on these entities. The extracted entity relations are given below.
The analysis by the mail information extraction engine and the mail attachment information extraction engine shows that this mail is a non-encrypted Chinese email on agriculture, with Cc and Bcc recipients and an agriculture-related attachment. The implicit information extraction engine extracts the entity relations and the forwarding relations ordered by time, and builds the propagation path of the mail and the persons involved in the propagation, as shown in the mail propagation diagram of Fig. 2. After this processing, the various items of information of mail 20160608202706-43855.eml and the relations among its entities can be viewed intuitively, as shown in the mail information and relationship diagram of Fig. 3.

Claims (10)

  1. An information extraction system for email, characterized in that: it comprises a mail information extraction engine; the mail information extraction engine trains a classification model on a training set, then segments the mail content into words with a word segmentation tool, and finally classifies the segmented mail content with the trained classification model to obtain the category of the mail content; after a user manually labels a mail with a category, the mail is automatically added to the training set of the corresponding category.
  2. The information extraction system for email according to claim 1, characterized in that: the mail information extraction engine further includes an attachment parsing module and a decompression module and can identify encrypted attachments: when an attachment is not a compressed file, it is parsed by the attachment parsing module; if it can be parsed normally, the attachment is judged to be unencrypted, and if it cannot, the attachment is judged to be encrypted; when an attachment is a compressed file, it is decompressed by the decompression module; if it can be decompressed normally, the attachment is judged to be unencrypted, and if it cannot, the attachment is judged to be encrypted.
  3. The information extraction system for email according to claim 1, characterized in that: the mail information extraction engine further includes a mail parsing module, which parses the mail; the message body, the subject, the accounts and nicknames of the recipients, sender, Cc recipients and Bcc recipients, the sending time, whether the recipient has checked the mail, the language encoding, the attachment names and the number of attachments are extracted directly from the parsing result.
  4. The information extraction system for email according to claim 1, characterized in that: the mail information extraction engine can also extract a summary of the mail: a weighted scoring standard is first drawn up, each sentence is then scored, and finally the several top-ranked sentences are given as the extraction result.
  5. The information extraction system for email according to claim 1, characterized in that: the mail information extraction engine also identifies the language of the message body by n-grams.
  6. The information extraction system for email according to claim 1, characterized in that: the mail information extraction engine also extracts the entity information in the mail content by a Chinese named entity recognition method based on a cascaded Markov model and character tagging.
  7. The information extraction system for email according to claim 1, characterized in that: the information extraction system further includes a mail attachment information extraction engine; the mail attachment information extraction engine includes an attachment parsing module, which can extract the content of attachments.
  8. The information extraction system for email according to claim 1, characterized in that: the information extraction system further includes an implicit information extraction engine, which includes a mail parsing module; the implicit information extraction engine can extract forwarding relations: the mail is parsed by the mail parsing module, and the contents of the sender, recipient and time fields are then extracted.
  9. The information extraction system for email according to claim 8, characterized in that: the implicit information extraction engine can extract entity relations: each entity is determined, namely the mail login account, the mail nickname, the mail attachment and the basic mail information, and association analysis is then performed on the entities.
  10. The information extraction system for email according to claim 9, characterized in that: the association analysis includes the following relational models: account and nickname relation, mail and attachment relation, mail and sender relation, mail and recipient relation.
CN201711307359.3A 2017-12-11 2017-12-11 A kind of information extracting system of Email Pending CN108038189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711307359.3A CN108038189A (en) 2017-12-11 2017-12-11 A kind of information extracting system of Email

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711307359.3A CN108038189A (en) 2017-12-11 2017-12-11 A kind of information extracting system of Email

Publications (1)

Publication Number Publication Date
CN108038189A true CN108038189A (en) 2018-05-15

Family

ID=62101680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711307359.3A Pending CN108038189A (en) 2017-12-11 2017-12-11 A kind of information extracting system of Email

Country Status (1)

Country Link
CN (1) CN108038189A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN102842078A (en) * 2012-07-18 2012-12-26 南京邮电大学 Email forensic analyzing method based on community characteristics analysis
CN104182549A (en) * 2014-09-15 2014-12-03 中国联合网络通信集团有限公司 E-mail digest generation method and device
CN106598937A (en) * 2015-10-16 2017-04-26 阿里巴巴集团控股有限公司 Language recognition method and device for text and electronic equipment
CN105871887A (en) * 2016-05-12 2016-08-17 北京大学 Client-side based personalized E-mail filtering system and method

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033155A (en) * 2018-06-13 2018-12-18 中国电子科技集团公司电子科学研究院 Search mail content and method, device, terminal and storage medium
CN110400123A (en) * 2019-07-05 2019-11-01 中国平安财产保险股份有限公司 Friend-making information popularization method, apparatus, equipment and computer readable storage medium
CN110400123B (en) * 2019-07-05 2023-06-20 中国平安财产保险股份有限公司 Friend-making information popularization method, friend-making information popularization device, friend-making information popularization equipment and friend-making information popularization computer readable storage medium
CN111047455A (en) * 2019-12-31 2020-04-21 武汉市烽视威科技有限公司 Personal statue method and system for mail
CN116032509A (en) * 2021-10-27 2023-04-28 中移系统集成有限公司 Mail encryption and decryption method and device
CN116308237A (en) * 2023-05-25 2023-06-23 湖南九立供应链有限公司 ERP mail processing method and related equipment thereof
CN116308237B (en) * 2023-05-25 2023-08-25 湖南九立供应链有限公司 ERP mail processing method and related equipment thereof

Similar Documents

Publication Publication Date Title
CN108038189A (en) A kind of information extracting system of Email
CN103150367B (en) A kind of Sentiment orientation analytical approach of Chinese microblogging
CN104463552B (en) Calendar reminding generation method and device
KR101716905B1 (en) Method for calculating entity similarities
CN109582861B (en) Data privacy information detection system
CN103729474B (en) Method and system for recognizing forum user vest account
CN109753909A (en) A kind of resume analytic method based on content piecemeal and BiLSTM model
CN107632968A (en) A kind of construction method of chain of evidence relational model towards judgement document
CN108491388B (en) Data set acquisition method, classification method, device, equipment and storage medium
CN109446404A (en) A kind of the feeling polarities analysis method and device of network public-opinion
CN106776555B (en) A kind of comment text entity recognition method and device based on word model
Zhang et al. Filtering junk mail with a maximum entropy model
JP2010056682A (en) E-mail receiver and method of receiving e-mail, e-mail transmitter and e-mail transmission method, mail transmission server
CN108009297A (en) Text emotion analysis method and system based on natural language processing
CN109446299B (en) Method and system for searching e-mail content based on event recognition
CN109062895A (en) A kind of intelligent semantic processing method
CN111985896A (en) Mail filtering method and device
Algur et al. Sentiment analysis by identifying the speaker's polarity in Twitter data
CN109039874A (en) A kind of the mail auditing method and device of Behavior-based control analysis
JP2004310691A (en) Text information processor
CN111199208A (en) Head portrait gender identification method and system based on deep learning framework
CN110110079B (en) Social network spam user detection method
CN111144929A (en) Comment object and word combined extraction method for automobile industry user generated content
McKeown et al. Automatically learning cognitive status for multi-document summarization of newswire
CN106817297B (en) A method of spam is identified by html tag

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180515