CN110020366B - Mailbox information extraction method and device - Google Patents

Publication number
CN110020366B
Authority
CN
China
Prior art keywords
mailbox
html file
mailbox address
address
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711285206.3A
Other languages
Chinese (zh)
Other versions
CN110020366A (en)
Inventor
谢海华
罗学文
陈雪飞
佟津乐
高良才
黄肖俊
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Original Assignee
Pku Founder Information Industry Group Co ltd
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pku Founder Information Industry Group Co ltd, Peking University Founder Group Co Ltd filed Critical Pku Founder Information Industry Group Co ltd
Priority to CN201711285206.3A priority Critical patent/CN110020366B/en
Publication of CN110020366A publication Critical patent/CN110020366A/en
Application granted granted Critical
Publication of CN110020366B publication Critical patent/CN110020366B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for extracting mailbox information. The method comprises: preprocessing an HTML file of the user homepage from which a mailbox is to be extracted, wherein the preprocessing comprises removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format; for the mailbox addresses in the preprocessed HTML file, detecting whether a first mailbox address in the form of a character string exists among them; if so, detecting whether the first mailbox address satisfies a preset mailbox format, and if so, deleting irregular characters from the first mailbox address; and detecting whether the character length of the first mailbox address is within a preset length range, and if so, taking the first mailbox address as a standard mailbox address. The invention can improve the accuracy of acquiring the mailbox of a target user from a web page.

Description

Mailbox information extraction method and device
Technical Field
The invention relates to the field of information retrieval and text information processing, in particular to a method and a device for extracting mailbox information.
Background
In the field of academic big data analysis, academic profiles of scholars help to distinguish scholars with the same name and to analyze more accurately their research interests, relationship networks, influence, and so on. However, academic data is currently growing exponentially: there are more than 300 million academic papers worldwide, and the number of academic workers has reached 100 million. This makes acquiring the academic profiles of scholars difficult.
The conventional way of acquiring mailbox information is, after a scholar's homepage has been obtained, to extract the mailbox according to a preset conventional mailbox format.
However, mailboxes are presented on web pages in many forms: some use 'AT' in place of '@' and 'DOT' in place of '.', some are displayed as pictures, some are represented in the HTML source code as decimal or hexadecimal ASCII codes, and so on. Mailboxes acquired according to only one format are therefore often inaccurate.
Disclosure of Invention
The invention provides a method and a device for extracting mailbox information, which are used for improving the accuracy of acquiring a mailbox of a target user from a webpage.
The first aspect of the invention provides a mailbox information extraction method, comprising: preprocessing an HTML file of the user homepage from which a mailbox is to be extracted, wherein the preprocessing comprises removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format; for the mailbox addresses in the preprocessed HTML file, detecting whether a first mailbox address in the form of a character string exists among them; if so, detecting whether the first mailbox address satisfies a preset mailbox format, and if so, deleting irregular characters from the first mailbox address; and detecting whether the character length of the first mailbox address is within a preset length range, and if so, taking the first mailbox address as a standard mailbox address.
Another aspect of the present invention provides a mailbox information extraction apparatus, comprising: a preprocessing module, configured to preprocess an HTML file of the user homepage from which a mailbox is to be extracted, wherein the preprocessing comprises removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format; a character string detection module, configured to detect, for the mailbox addresses in the preprocessed HTML file, whether a first mailbox address in the form of a character string exists among them; a mailbox format detection module, configured to detect, if the first mailbox address exists, whether it satisfies a preset mailbox format, and if so, to delete the irregular characters from the first mailbox address; and a first standard mailbox address judgment module, configured to detect whether the character length of the current first mailbox address is within a preset length range, and if so, to take the first mailbox address as a standard mailbox address.
In the method and the device for extracting mailbox information provided by the invention, the HTML file of the user homepage from which a mailbox is to be extracted is preprocessed, all mailbox addresses in the preprocessed HTML file are acquired, and it is detected whether they satisfy preset conditions, where the preset conditions cover all current mailbox forms. If so, the irregular characters in the mailbox address are deleted, it is detected whether the mailbox address falls within the preset length range, and a mailbox address within that range is taken as the standard mailbox address. Preprocessing converts the source code in the HTML file into a single format, so that all mailbox addresses in character-string form can be acquired from the HTML file; comparing the acquired mailbox addresses with all current mailbox formats then allows every mailbox address in the HTML file to be acquired, so that the mailbox address of the target user in the web page is acquired accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
Fig. 1 is a flowchart of a mailbox information extraction method according to an embodiment of the present invention;
fig. 2 is a flowchart of a mailbox information extraction method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a mailbox information extraction method according to a third embodiment of the present invention;
fig. 4 is a flowchart of a mailbox information extraction method according to a fourth embodiment of the present invention;
fig. 5 is a flowchart of a mailbox information extraction method according to a fifth embodiment of the present invention;
fig. 6 is a structural diagram of a mailbox information extraction apparatus according to a sixth embodiment of the present invention;
fig. 7 is a structural diagram of a mailbox information extraction apparatus according to a seventh embodiment of the present invention;
fig. 8 is a structural diagram of a mailbox information extraction apparatus according to an eighth embodiment of the present invention;
fig. 9 is a structural diagram of a mailbox information extraction apparatus according to a ninth embodiment of the present invention;
fig. 10 is a structural diagram of a mailbox information extraction apparatus according to a tenth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other examples obtained based on the examples in the present invention are within the scope of the present invention.
Fig. 1 is a flowchart of a mailbox information extraction method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
101. Preprocessing an HTML file of the user homepage from which a mailbox is to be extracted, wherein the preprocessing comprises removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format.
In this embodiment, in order to accurately obtain the mailbox address in the user homepage, the HTML file of that homepage needs to be preprocessed. Specifically, to improve the efficiency of mailbox extraction, the meaningless characters and garbled characters in the HTML file can be deleted; for example, the meaningless characters include <span>, <meta>, and the like. In addition, the HTML source code may contain ASCII codes in several number systems, such as decimal and hexadecimal. If the mailbox address is extracted directly from the HTML file without processing these codes, the extracted mailbox address is incomplete. Therefore, to improve the accuracy of mailbox acquisition, all ASCII codes in the HTML file, whatever their number system, need to be converted into English letters, digits, or symbols.
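The preprocessing step can be illustrated with a minimal Python sketch. It assumes that the decimal and hexadecimal codes appear as HTML character references (such as `&#64;` or `&#x40;`) and that "garbled characters" can be approximated by control characters and the Unicode replacement character; the function name `preprocess` and the tag list are hypothetical and are not taken from the patent.

```python
import html
import re

# Hypothetical list of tags treated as "meaningless"; the text names
# <span> and <meta> only as examples.
MEANINGLESS_TAGS = ("span", "meta")


def preprocess(html_text: str) -> str:
    """Decode character references, then drop meaningless tags and garbled characters."""
    # html.unescape turns decimal (&#64;) and hexadecimal (&#x40;)
    # character references into the characters they encode.
    text = html.unescape(html_text)
    # Remove the opening and closing forms of the listed tags, keeping inner text.
    tag_pattern = r"</?(?:%s)\b[^>]*>" % "|".join(MEANINGLESS_TAGS)
    text = re.sub(tag_pattern, "", text, flags=re.IGNORECASE)
    # Assumption: control characters and U+FFFD stand in for "garbled characters".
    text = re.sub(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return text
```

With this sketch, an address split across a `<span>` and written partly as character references is restored to a single plain string before any matching is attempted.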
102. For the mailbox addresses in the preprocessed HTML file, detecting whether a first mailbox address in the form of a character string exists among them.
In this embodiment, after the HTML file has been preprocessed, it is detected whether a first mailbox address in the form of a character string exists in the HTML file. Specifically, the HTML file may be traversed with a regular expression to detect whether it contains a first mailbox address in character-string form. To improve accuracy, whether a character string is a mailbox address can be judged from the position at which it appears in the HTML file; for example, the character string following "Email" or "Mail to" can be acquired. In addition, mailbox formats in practice are diverse; for example, some mailboxes use "AT" in place of "@" and "DOT" in place of ".". Therefore, to acquire all mailbox addresses in the HTML file accurately, whether a character string is a mailbox address can also be judged from its content: for example, it is detected whether the character string contains either "@" or "AT" and either "." or "DOT", and if so, the character string is judged to be a mailbox address.
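A content-based check of this kind might be sketched in Python as follows. The regular expression is an illustrative assumption, not the expression used in the patent: it accepts "@" or a spaced or bracketed "at", and "." or a spaced or bracketed "dot". The position-based hint ("Email", "Mail to") is not covered here.

```python
import re
from typing import List

# Hypothetical separator patterns: "@" / "[at]" / " at ", and "." / "[dot]" / " dot ".
AT = r"(?:@|\s*[\[\(]\s*at\s*[\]\)]\s*|\s+at\s+)"
DOT = r"(?:\.|\s*[\[\(]\s*dot\s*[\]\)]\s*|\s+dot\s+)"

CANDIDATE = re.compile(
    rf"[A-Za-z0-9._%+-]+{AT}[A-Za-z0-9-]+(?:{DOT}[A-Za-z0-9-]+)+",
    re.IGNORECASE,
)


def find_candidates(text: str) -> List[str]:
    """Return every substring that looks like a plain or obfuscated mailbox address."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]
```

The pattern is deliberately permissive, so ordinary phrases containing "at" and "dot" can also match; the later format, name, and length checks are what narrow the candidates down.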
103. If so, detecting whether the first mailbox address satisfies a preset mailbox format, and if so, deleting the irregular characters from the first mailbox address.
In this embodiment, if the HTML file is detected to contain a mailbox address in character-string form, it is detected whether the mailbox address satisfies a preset mailbox format. For example, if the mailbox address has the form xx[at]yy.zz, where "at" and "." may be replaced by any equivalent form, the mailbox address is determined to satisfy the preset mailbox format. In addition, it is detected whether the mailbox address contains name information of the target user, where the name information may be the full spelling of the target user's name, an abbreviation of the name, or the target user's surname or given name. For example, if the target user's name is Ming Li and the character string before "@" in the mailbox contains Ming, Li, lm, or ml, the mailbox address is considered to contain the target user's name information. If any of these requirements is met, the irregular characters in the mailbox address can be deleted. In practice, a correct mailbox address contains no irregular characters; for example, characters such as "/" and spaces are irregular characters in a mailbox address. Therefore, to ensure the accuracy of the finally obtained mailbox address, the irregular characters in the mailbox need to be deleted. In practice, an irregular character library can be established in advance through collection and statistics, and the irregular characters contained in the first mailbox address are then deleted from it when the mailbox information extraction scheme is executed.
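As a rough Python sketch of this step, the code below rewrites "at"/"dot" variants to "@" and ".", checks the xx@yy.zz shape and the name information, and then strips irregular characters. The irregular-character library, the regular expressions, and all function names are assumptions made for illustration; the text only names "/" and spaces as examples of irregular characters.

```python
import re
from typing import Optional, Set

# Hypothetical irregular-character library ("/" and whitespace).
IRREGULAR_CHARS = "/ \t\r\n"
FORMAT = re.compile(r"^[^@]+@[^@]+\.[^@]+$")
AT_TOKEN = re.compile(r"\s*[\[\(]?\s*\bat\b\s*[\]\)]?\s*", re.IGNORECASE)
DOT_TOKEN = re.compile(r"\s*[\[\(]?\s*\bdot\b\s*[\]\)]?\s*", re.IGNORECASE)


def normalize_separators(candidate: str) -> str:
    """Rewrite "at"/"dot" tokens as "@" and "."."""
    if "@" not in candidate:
        candidate = AT_TOKEN.sub("@", candidate, count=1)
    return DOT_TOKEN.sub(".", candidate)


def name_variants(full_name: str) -> Set[str]:
    """For "Ming Li": ming, li, ml, lm."""
    parts = [p.lower() for p in full_name.split()]
    initials = "".join(p[0] for p in parts)
    return {v for v in set(parts) | {initials, initials[::-1]} if v}


def contains_name(address: str, full_name: str) -> bool:
    """Check whether the part before "@" contains any name variant."""
    if "@" not in address:
        return False
    local = address.split("@", 1)[0].lower()
    return any(v in local for v in name_variants(full_name))


def clean_if_valid(candidate: str, full_name: str) -> Optional[str]:
    """Return the cleaned address if either requirement is met, else None."""
    address = normalize_separators(candidate)
    if FORMAT.match(address) or contains_name(address, full_name):
        return address.translate(str.maketrans("", "", IRREGULAR_CHARS))
    return None
```

For example, `clean_if_valid("liming AT pku DOT edu DOT cn", "Ming Li")` yields `"liming@pku.edu.cn"`, while a string with no separator and no name match yields `None`.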
104. Detecting whether the character length of the first mailbox address is within a preset length range, and if so, taking the first mailbox address as a standard mailbox address.
In this embodiment, after the irregular characters in the mailbox have been deleted, the length of the mailbox address needs to be checked to ensure that the acquired address is a valid mailbox address; if the length is detected to be within a preset length range, the mailbox address can be used as a standard mailbox address. Specifically, the length range may be set by the user or may be a default range obtained by analyzing a large number of existing mailbox addresses. For example, the default range may be a character length of 5 to 80: if the character length of the mailbox address is between 5 and 80, the mailbox is determined to be the standard mailbox address.
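The length check is simple enough to show in a few lines of Python. Treating both ends of the 5 to 80 range as inclusive is an assumption, since the text does not say whether the bounds are included; the function name is hypothetical.

```python
# Default range taken from the example in the text; inclusive bounds are an assumption.
DEFAULT_MIN_LEN = 5
DEFAULT_MAX_LEN = 80


def is_standard_length(address: str,
                       min_len: int = DEFAULT_MIN_LEN,
                       max_len: int = DEFAULT_MAX_LEN) -> bool:
    """Return True when the address length falls inside the configured range."""
    return min_len <= len(address) <= max_len
```

Passing `min_len` and `max_len` explicitly corresponds to the case where the user sets the range instead of relying on the default.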
It should be noted that this mailbox acquisition method can be used to acquire a scholar's mailbox address from the scholar's homepage, and can also be used to acquire the mailbox address from the homepage of any target user; the invention is not limited in this respect.
In the mailbox information extraction method provided by this embodiment, the HTML file of the user homepage from which a mailbox is to be extracted is preprocessed, all mailbox addresses in the preprocessed HTML file are acquired, and it is detected whether they satisfy preset conditions, where the preset conditions cover all current mailbox forms. If so, the irregular characters in the mailbox address are deleted, it is detected whether the length of the mailbox address falls within the preset length range, and a mailbox address within that range is taken as the standard mailbox address. Preprocessing converts the source code in the HTML file into a single format, so that all mailbox addresses in character-string form can be acquired from the HTML file; comparing the acquired mailbox addresses with all current mailbox formats then allows every mailbox address in the HTML file to be acquired, so that the mailbox address of the target user in the web page is acquired accurately.
Fig. 2 is a flowchart of a mailbox information extraction method provided in the second embodiment of the present invention, and on the basis of the above embodiment, as shown in fig. 2, after step 102, the method further includes:
201. If the first mailbox address does not exist, detecting whether a mailbox address in the HTML file satisfies a preset first condition, wherein the first condition comprises: associated information of the target user appears within a preset distance of the position at which the mailbox address appears in the HTML file, and/or the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
In this embodiment, if the HTML file does not contain the first mailbox address, it is detected whether a mailbox address in the HTML file satisfies a preset first condition. The preset first condition specifically includes: associated information of the target user appears within a short distance of the position at which the mailbox address appears, where a short distance is defined as follows: two character strings are considered close together when they are separated by no more than three tags. For example, in str1<bq1>str2<bq2>str3<bq3>str4<bq4>str5, str1 is close to str2, str3, and str4 but far from str5. When counting the tags between two strings, if there are no valid characters (that is, characters other than blanks such as spaces, line feeds, or tabs) between two tags, the two tags are counted as one tag. For example, in str1<bq1><bq2>str2, str1 and str2 are considered to be one tag apart. And/or, the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
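The tag-distance rule can be sketched in Python as below. The tokenizer is a simplification that treats anything between "<" and ">" as a tag, and the function names are hypothetical; the three-tag threshold and the merging of adjacent tags follow the definition in the text.

```python
import re
from typing import List, Optional


def tokenize(html_text: str) -> List[Optional[str]]:
    """Split into text runs; each run of adjacent tags becomes a single None marker."""
    tokens: List[Optional[str]] = []
    for piece in re.split(r"(<[^>]*>)", html_text):
        if piece.startswith("<") and piece.endswith(">"):
            # Tags separated only by whitespace are counted as one tag.
            if not tokens or tokens[-1] is not None:
                tokens.append(None)
        elif piece.strip():
            tokens.append(piece.strip())
    return tokens


def tag_distance(html_text: str, first: str, second: str) -> Optional[int]:
    """Count the (merged) tags between the text runs containing the two strings."""
    tokens = tokenize(html_text)
    try:
        i = next(k for k, t in enumerate(tokens) if t is not None and first in t)
        j = next(k for k, t in enumerate(tokens) if t is not None and second in t)
    except StopIteration:
        return None
    low, high = sorted((i, j))
    return sum(1 for t in tokens[low:high] if t is None)


def is_close(html_text: str, first: str, second: str, max_tags: int = 3) -> bool:
    """Two strings are close when no more than max_tags tags separate them."""
    distance = tag_distance(html_text, first, second)
    return distance is not None and distance <= max_tags
```

Applied to the example string from the text, str1 and str4 are three tags apart and therefore close, whereas str1 and str5 are four tags apart and therefore not.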
202. And if the mailbox address meets the first condition, storing the mailbox address to a suspected mailbox list.
In this embodiment, if the mailbox address is detected to satisfy any item of the first condition, the mailbox address can be added to the suspected mailbox list. It should be noted that, if the HTML file does not contain the first mailbox address, a mailbox address can be taken from the suspected mailbox list as the standard mailbox address. The HTML file may contain the mailbox addresses of several users, but every mailbox address in the suspected mailbox list satisfies the first condition; that is, the character string of the mailbox address contains the associated information of the target user, or that information appears near the character string. The probability that such a mailbox address is the standard mailbox address is therefore high, which increases the accuracy of mailbox acquisition.
It should be noted that, on the basis of any of the above embodiments, the association information includes the name of the target user, or an abbreviation of the name of the target user, or a suffix of the URL of the HTML file.
Because the user name in a user's mailbox address takes many forms, for example the full spelling of the name or a combination of initials, the associated information includes the name and the abbreviation of the target user in order to improve the accuracy of mailbox acquisition; if a mailbox address contains the name or the abbreviation of the target user, it can be determined to be the target user's mailbox address. In addition, if the target user's mailbox address does not contain the user's name information, the name alone cannot determine whether the address belongs to the target user. To solve this problem, the associated information further includes the suffix of the URL of the HTML file: if the suffix of the URL is detected in the character string of the mailbox address before the spacer between the mailbox user name and the domain name, the mailbox address can be determined to be the target user's mailbox address.
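The URL-suffix check can be illustrated in Python under one explicit assumption: the text does not define "suffix of the URL", so the sketch takes it to be the last path segment with any file extension and leading "~" removed. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def url_suffix(url: str) -> str:
    """Assumed reading of "URL suffix": the last path segment, without extension or "~"."""
    path = urlparse(url).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    return last_segment.rsplit(".", 1)[0].lstrip("~").lower()


def local_part_has_url_suffix(address: str, url: str) -> bool:
    """Check whether the part before "@" contains the URL suffix."""
    suffix = url_suffix(url)
    if not suffix or "@" not in address:
        return False
    return suffix in address.split("@", 1)[0].lower()
```

Under this reading, a homepage at `https://example.edu/~lming/` would associate the address `lming@example.edu` with the target user even if the name check alone were inconclusive.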
According to the mailbox information extraction method provided by the embodiment, if it is detected that the HTML file does not contain the first mailbox address, the mailbox addresses meeting the preset first condition are stored in the suspected mailbox list, so that the mailbox addresses are acquired from the suspected mailbox list as the standard mailbox addresses, and because the mailbox addresses in the suspected mailbox list meet the first condition, the probability that the mailbox addresses are the standard mailbox addresses is high, and therefore the mailbox acquisition accuracy can be improved.
Fig. 3 is a flowchart of a mailbox information extraction method provided by a third embodiment of the present invention, on the basis of any one of the above embodiments, as shown in fig. 3, after step 102, the method includes:
301. and if the first mailbox address does not exist, detecting whether the preprocessed HTML file contains a mailbox address in a picture form.
In practice, the mailbox address of the target user is represented as a picture on some homepages. Therefore, if the HTML file is detected not to contain the first mailbox address, it can be detected whether the preprocessed HTML file contains a mailbox address in picture form.
302. If so, performing character recognition on the picture-form mailbox address and detecting whether the recognition result satisfies a preset second condition, and if so, taking the picture-form mailbox address as the standard mailbox address, wherein the second condition comprises: the recognition result contains the spacer between the mailbox user name and the domain name, and the picture name contains a preset word; and/or the recognition result contains the spacer between the mailbox user name and the domain name, and the text before the spacer contains the associated information of the target user.
In this embodiment, if the HTML file is detected to contain a mailbox address in picture form, all pictures in the HTML file may be downloaded. Because the size of a picture-form mailbox address in an HTML file is generally less than 10480, pictures smaller than 10480 may be screened out for analysis. Specifically, the screened pictures can be recognized with an OCR character recognition tool. It is then detected whether a picture satisfies the preset second condition, and if so, the picture-form mailbox address is taken as the standard mailbox address. The second condition comprises: the recognition result contains the spacer between the mailbox user name and the domain name, and the picture name contains a preset word; and/or the recognition result contains the spacer between the mailbox user name and the domain name, and the text before the spacer contains the associated information of the target user.
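The screening and the second condition might be sketched in Python as follows. The OCR step itself is left out: the sketch takes the recognized text as an input string. Interpreting the 10480 threshold as bytes, taking "@" as the spacer, and the preset words "email" and "mail" are all assumptions for illustration.

```python
from typing import Dict, Iterable, List

MAX_PICTURE_SIZE = 10480          # threshold from the text; bytes is an assumption
PRESET_WORDS = ("email", "mail")  # hypothetical preset words for picture names


def select_small(pictures: Dict[str, bytes]) -> List[str]:
    """Keep the names of pictures whose data is below the size threshold."""
    return [name for name, data in pictures.items() if len(data) < MAX_PICTURE_SIZE]


def meets_second_condition(ocr_text: str,
                           picture_name: str,
                           association_info: Iterable[str]) -> bool:
    """Spacer present, and either the picture name or the local part gives a hit."""
    if "@" not in ocr_text:
        return False
    local = ocr_text.split("@", 1)[0].lower()
    name_hit = any(word in picture_name.lower() for word in PRESET_WORDS)
    info_hit = any(info and info.lower() in local for info in association_info)
    return name_hit or info_hit
```

In use, each name returned by `select_small` would be passed to an OCR tool, and the recognized string would then be checked with `meets_second_condition`.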
According to the mailbox information extraction method provided by the embodiment, whether the picture-form mailbox address is contained in the HTML file or not is detected, if yes, whether the mailbox address meets the preset second condition or not is judged, and if yes, the mailbox address is used as the standard mailbox address, so that the accuracy of mailbox acquisition can be improved.
It should be noted that the above three embodiments can be implemented individually or in combination, which is not limited herein.
Fig. 4 is a flowchart of a mailbox information extraction method provided by a fourth embodiment of the present invention, on the basis of any one of the foregoing embodiments, as shown in fig. 4, before step 101, the method further includes:
401. and detecting whether the HTML file contains a jump instruction, and if so, taking the HTML file of the target homepage corresponding to the jump instruction as the HTML file of the user homepage of the mailbox to be extracted.
In this embodiment, after the homepage of the target user is acquired, the authenticity of the homepage needs to be determined in order to improve the accuracy of mailbox acquisition. In practice, the homepage of the target user may redirect to another page. It is therefore detected whether the HTML file contains a jump instruction, and if so, the HTML file of the target homepage corresponding to the jump instruction is used as the HTML file of the user homepage from which the mailbox is to be extracted. For example, if the HTML contains a tag of the form <meta http-equiv="Refresh" … URL … html">, the HTML at the address following URL is downloaded as the scholar's homepage HTML. If the HTML contains a statement of the form location.replace("… … html");, the HTML at the address in the parentheses is downloaded as the scholar's homepage HTML. If the HTML file is detected not to contain a jump instruction, the current HTML file is taken as the homepage of the target user.
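Jump detection can be sketched with two regular expressions in Python. The sketch assumes that the two forms in the text are an HTML meta refresh tag and a JavaScript `location.replace(...)` call; the patterns and the function name are illustrative, and downloading the target page is left out.

```python
import re
from typing import Optional

# Meta refresh: <meta http-equiv="Refresh" content="0; URL=target.html">
META_REFRESH = re.compile(
    r"<meta[^>]*http-equiv\s*=\s*[\"']?refresh[\"']?[^>]*?url\s*=\s*([^\"'>\s]+)",
    re.IGNORECASE,
)
# Script redirect: location.replace("target.html");
JS_REPLACE = re.compile(
    r"location\.replace\(\s*[\"']([^\"']+)[\"']\s*\)",
    re.IGNORECASE,
)


def find_jump_target(html_text: str) -> Optional[str]:
    """Return the address a page jumps to, or None when no jump instruction is found."""
    for pattern in (META_REFRESH, JS_REPLACE):
        match = pattern.search(html_text)
        if match:
            return match.group(1)
    return None
```

When `find_jump_target` returns `None`, the current HTML file is kept as the homepage; otherwise the returned address is the page to fetch and process instead.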
According to the mailbox information extraction method provided by the embodiment, whether the homepage of the target user jumps or not is detected, and different measures are taken according to the detection result, so that the homepage of the target user can be accurately acquired, and a basis is provided for accurately acquiring the mailbox address of the target user.
Fig. 5 is a flowchart of a mailbox information extraction method provided in the fifth embodiment of the present invention, and on the basis of any of the above embodiments, as shown in fig. 5, before step 101, the method further includes:
501. and detecting whether the HTML file contains a plurality of frame tags or not, and if so, sequentially taking the HTML file of the main page corresponding to each frame tag as the HTML file of the user main page of the mailbox to be extracted.
In this embodiment, after the homepage of the target user is acquired, in order to improve the accuracy of mailbox acquisition, the authenticity of the homepage needs to be determined. In practical application, the HTML file may include a plurality of frame tags, and each frame tag may correspond to a separate HTML file, so that in order to improve mailbox acquisition accuracy, the HTML file of the main page corresponding to each frame tag needs to be sequentially used as the HTML file of the user main page of the mailbox to be extracted, and mailbox address acquisition operation is performed on each HTML file.
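Collecting the pages referenced by frame tags can be sketched with Python's standard `html.parser` module. Including `iframe` alongside `frame` is an assumption, since the text only says "frame tags"; the class and function names are hypothetical, and fetching each page is left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the src attribute of every frame (and, by assumption, iframe) tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html_text: str, base_url: str) -> List[str]:
    """Return the absolute address of each framed page, in document order."""
    collector = FrameCollector()
    collector.feed(html_text)
    return [urljoin(base_url, src) for src in collector.sources]
```

Each address returned by `frame_urls` would then be downloaded and passed through the same extraction steps in turn.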
In the method for extracting mailbox information provided by this embodiment, the HTML files corresponding to the multiple frame tags in the current HTML file are acquired, and the HTML files corresponding to the multiple frame tags are sequentially used as the HTML files of the mailbox to be extracted to perform the operation of extracting the mailbox address, so that the mailbox address of the target user can be accurately acquired.
Fig. 6 is a structural diagram of a mailbox information extraction apparatus according to a sixth embodiment of the present invention. As shown in fig. 6, the apparatus includes:
a preprocessing module 61, configured to preprocess an HTML file of the user homepage from which a mailbox is to be extracted, where the preprocessing includes removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format;
a character string detection module 62, configured to detect, for the mailbox addresses in the preprocessed HTML file, whether a first mailbox address in the form of a character string exists among them;
a mailbox format detection module 63, configured to detect, if the first mailbox address exists, whether it satisfies a preset mailbox format, and if so, to delete the irregular characters from the first mailbox address; and
a first standard mailbox address determining module 64, configured to detect whether the character length of the current first mailbox address is within a preset length range, and if so, to take the first mailbox address as a standard mailbox address.
In this embodiment, in order to accurately obtain the mailbox address in the user homepage of the mailbox to be extracted, the preprocessing module 61 needs to preprocess the HTML file of the user homepage of the mailbox to be extracted. Specifically, in order to improve the efficiency of mailbox extraction, the meaningless characters and the garbled characters in the HTML file can be deleted, for example, the meaningless characters include < span >, < meta > and the like. In addition, because the HTML source code in the HTML file may include multiple ASCII codes in decimal, hexadecimal, and so on, if the code is not processed, the mailbox address in the HTML file is directly obtained, and the obtained mailbox address is incomplete. Therefore, in order to improve the accuracy of mailbox acquisition, all ASCII codes in different systems in an HTML file need to be converted into english letters, numbers or symbols.
After preprocessing the HTML file, the string detection module 62 detects whether a first mailbox address in the form of a string exists in the HTML file. Specifically, the HTML file may be traversed by a regular expression to detect whether the first mailbox address in the form of a character string is contained therein. In order to improve the accuracy of acquisition, whether the mailbox address is the mailbox address or not can be judged according to the position of the mailbox address appearing in the HTML file, for example, a character string after "Email" or a character string after "Mail to" can be acquired; in addition, in practical applications, mailbox formats are diversified, for example, a part of mailboxes is replaced by "AT" instead of "@" and "DOT" instead of ". once", so that in order to accurately acquire all mailbox addresses in the HTML file, whether the mailbox addresses are mailbox addresses can be judged according to the content of the character strings, for example, whether any item of "@" or "AT" is included in the character strings or not is detected, and any item of "@" or "DOT" is detected, and if yes, the mailbox addresses are judged.
If the HTML file is detected to contain a mailbox address in the form of a character string, the mailbox format detection module 63 detects whether the mailbox address meets a preset mailbox format. For example, if the mailbox address satisfies the xx[at]yy.zz format, where "at" and "." may each be replaced by any equivalent form, it is determined that the mailbox address meets the preset mailbox format. In addition, it is detected whether the mailbox address contains the name information of the target user, where the name information can be the complete spelling of the target user's name, an abbreviation of the name, or the last name or first name of the target user. For example, if the name of the target user is Ming Li and the character string before "@" in the Email contains Ming, Li, lm, or ml, the mailbox address is considered to contain the name information of the target user. If either of the above requirements is met, the irregular characters in the mailbox address can be deleted. In practical applications, a correct mailbox address does not include irregular characters; characters such as "/" and spaces are irregular characters. Therefore, in order to ensure the accuracy of the finally obtained mailbox address, the irregular characters in the mailbox address need to be deleted.
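The name check and the deletion of irregular characters might be sketched in Python as follows. The helper names normalize, name_tokens, and contains_name are hypothetical, and rewriting "at" and "dot" forms to "@" and "." before deleting "/" and whitespace is an assumption about how the cleanup could be ordered; the embodiment itself only states that irregular characters are deleted.

```python
import re


def normalize(candidate: str) -> str:
    """Rewrite at/dot forms to '@' and '.', then delete '/' and whitespace."""
    s = re.sub(r"\s*(?:[\[(]\s*at\s*[\])]|\bat\b)\s*", "@", candidate, flags=re.IGNORECASE)
    s = re.sub(r"\s*(?:[\[(]\s*dot\s*[\])]|\bdot\b)\s*", ".", s, flags=re.IGNORECASE)
    return re.sub(r"[\s/]", "", s)


def name_tokens(full_name: str) -> set[str]:
    """Name parts, initials in both orders, and full spellings in both orders."""
    parts = [p.lower() for p in full_name.split()]
    if not parts:
        return set()
    initials = "".join(p[0] for p in parts)
    tokens = set(parts) | {initials, initials[::-1], "".join(parts), "".join(reversed(parts))}
    return {t for t in tokens if t}


def contains_name(address: str, full_name: str) -> bool:
    """True if the part before '@' contains any token derived from the name."""
    local = address.split("@", 1)[0].lower()
    return any(token in local for token in name_tokens(full_name))
```

For the name Ming Li this produces the tokens ming, li, ml, and lm named in the example above, plus the concatenated spellings. Short tokens such as two-letter initials match many unrelated strings, so a practical implementation would likely weight them lower than a full-name match.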
After the irregular characters in the mailbox address are deleted, in order to ensure that the acquired mailbox address is a valid mailbox address, the first standard mailbox address determination module 64 detects the length of the mailbox address; if the length is within a preset length range, the mailbox address can be used as a standard mailbox address. Specifically, the length range may be set by a user or may be a default length range, where the default length range is obtained by analyzing a large number of existing mailbox addresses. For example, the default length range may be a character length of 5 to 80; if the character length of the mailbox address is between 5 and 80, the mailbox address is determined to be a standard mailbox address.
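The length check reduces to a single comparison, sketched below in Python. The names DEFAULT_LENGTH_RANGE and is_standard_length are hypothetical, and treating both ends of the 5 to 80 range as inclusive is an assumption, since the embodiment does not say whether the bounds themselves are allowed.

```python
# Default range taken from the example in the embodiment; bounds assumed inclusive.
DEFAULT_LENGTH_RANGE = (5, 80)


def is_standard_length(address: str, length_range: tuple[int, int] = DEFAULT_LENGTH_RANGE) -> bool:
    """True if the character length of the address lies within the given range."""
    low, high = length_range
    return low <= len(address) <= high
```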
It should be noted that the mailbox acquiring apparatus may be used for acquiring the mailbox address of the student in the homepage of the student, and may also be used for acquiring the mailbox address in the homepage of any target user, which is not limited in the present invention.
The mailbox information extraction device provided in this embodiment preprocesses the HTML file of the user homepage of the mailbox to be extracted, acquires all mailbox addresses in the preprocessed HTML file, and detects whether these mailbox addresses meet preset conditions, where the preset conditions cover all current mailbox forms. If so, the device deletes the irregular characters in the mailbox addresses, detects whether each mailbox address falls within a preset length range, and takes a mailbox address within that range as a standard mailbox address. Preprocessing converts the source code in the HTML file into a single format, so that every mailbox address present as a character string in the HTML file can be acquired. Comparing the acquired mailbox addresses with all current mailbox formats then ensures that no mailbox address in the HTML file is missed, so that the mailbox address of the target user in the webpage can be acquired accurately.
Fig. 7 is a structural diagram of a mailbox information extraction apparatus according to a seventh embodiment of the present invention, and based on the foregoing embodiment, as shown in fig. 7, the apparatus further includes:
a first condition detecting module 71, configured to detect whether a mailbox address in the HTML file meets a preset first condition if the first mailbox address does not exist, where the first condition includes: the associated information of the target user appears within a preset distance of the position where the mailbox address appears in the HTML file, and/or the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
A suspected mailbox list creating module 72, configured to store the mailbox address to a suspected mailbox list if the mailbox address satisfies the first condition.
In this embodiment, if the HTML file does not include the first mailbox address, the first condition detection module 71 detects whether a mailbox address in the HTML file meets the preset first condition. The preset first condition specifically includes either or both of the following. First, the associated information of the target user appears within a short distance of the position where the mailbox address appears. A short distance is defined as follows: two character strings are considered close when they are no more than three tags apart. For example, in str1<bq1>str2<bq2>str3<bq3>str4<bq4>str5, str1 is close to str2, str3, and str4 but far from str5. When the number of tags between two strings is counted, adjacent tags with no valid characters between them (valid characters being characters other than whitespace such as spaces, line feeds, or tabs) are counted as one tag. For example, in str1<bq1><bq2>str2, str1 and str2 are considered to be one tag apart. Second, the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
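The tag-distance rule can be sketched in Python as follows. The function names tag_distance and is_close are hypothetical, and locating each string by its first occurrence is a simplification made for the example; the threshold of three tags and the merging of tags separated only by whitespace follow the definition above.

```python
import re

# A run of one or more tags with nothing but whitespace between them.
TAG_RUN = re.compile(r"(?:<[^>]*>\s*)+")


def tag_distance(html_text: str, a: str, b: str) -> int:
    """Count the tags between the first occurrences of strings a and b."""
    i, j = html_text.find(a), html_text.find(b)
    if i < 0 or j < 0:
        raise ValueError("both strings must occur in the text")
    if i > j:
        i, j, a, b = j, i, b, a
    between = html_text[i + len(a):j]
    # Each run of adjacent tags counts as a single tag.
    return len(TAG_RUN.findall(between))


def is_close(html_text: str, a: str, b: str, max_tags: int = 3) -> bool:
    """Two strings are close when they are no more than max_tags tags apart."""
    return tag_distance(html_text, a, b) <= max_tags
```

Applied to the example string, str1 and str4 are three tags apart and therefore close, while str1 and str5 are four tags apart and therefore not.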
If it is detected that the mailbox address satisfies either item of the first condition, the suspected mailbox list creating module 72 may add the mailbox address to the suspected mailbox list. It should be noted that, if the HTML file does not include the first mailbox address, a mailbox address may be acquired from the suspected mailbox list as the standard mailbox address. The HTML file of the mailbox to be extracted may include the mailbox addresses of several users, but every mailbox address in the suspected mailbox list satisfies the first condition; that is, the character string of each such mailbox address contains the associated information of the target user, or the associated information of the target user appears near that character string. The probability that such a mailbox address is the standard mailbox address is therefore high, which increases the accuracy of mailbox acquisition.
It should be noted that, on the basis of any of the above embodiments, the association information includes the name of the target user, or an abbreviation of the name of the target user, or a suffix of the URL of the HTML file.
The user name in a user's mailbox address can take various forms, for example the full spelling of the name or a combination of its initials. Therefore, in order to improve the accuracy of mailbox acquisition, the associated information includes the name of the target user and abbreviations of that name; if the mailbox address includes the name or an abbreviation of the name of the target user, it can be determined that the mailbox address is the mailbox address of the target user. In addition, if the mailbox address of the target user does not include the name information of the user, the name alone cannot establish whether the mailbox address belongs to the target user. To solve this problem, the associated information further includes a suffix of the URL of the HTML file: if the suffix of the URL of the HTML file is detected in the character string of the mailbox address located before the spacer between the mailbox user name and the domain name, it may be determined that the mailbox address is the mailbox address of the target user.
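The URL-suffix check might be sketched in Python as follows. The embodiment does not define exactly which part of the URL counts as its suffix; this sketch assumes it is the last non-empty path segment with any file extension and leading "~" removed. The names url_suffix and local_part_has_suffix and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def url_suffix(url: str) -> str:
    """Last path segment of the URL, without extension or leading '~' (assumed meaning of suffix)."""
    path = urlparse(url).path.rstrip("/")
    last_segment = posixpath.basename(path)
    stem = posixpath.splitext(last_segment)[0]
    return stem.lstrip("~").lower()


def local_part_has_suffix(address: str, url: str) -> bool:
    """True if the part of the address before '@' contains the URL suffix."""
    suffix = url_suffix(url)
    local = address.split("@", 1)[0].lower()
    return bool(suffix) and suffix in local
```

Under this assumption a generic final segment such as index would also be treated as a suffix, so an implementation would probably need to exclude such common file names.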
The mailbox information extraction device provided in this embodiment stores, if it is detected that the HTML file does not include the first mailbox address, the mailbox address meeting the preset first condition into the suspected mailbox list, so as to subsequently acquire the mailbox address from the suspected mailbox list as the standard mailbox address.
Fig. 8 is a structural diagram of a mailbox information extraction apparatus according to an eighth embodiment of the present invention, and on the basis of any of the above embodiments, as shown in fig. 8, the apparatus further includes:
a picture-form mailbox address detection module 81, configured to detect whether the preprocessed HTML file includes a picture-form mailbox address if the first mailbox address does not exist.
A second standard mailbox address determination module 82, configured to perform text recognition on the picture-form mailbox address if the HTML file includes a picture-form mailbox address, and to detect whether the recognition result meets a preset second condition, and if so, to use the picture-form mailbox address as the standard mailbox address, where the second condition includes: the recognition result includes a spacer between the mailbox user name and the domain name, and the picture name includes a preset word; and/or the recognition result includes a spacer between the mailbox user name and the domain name, and the text before the spacer contains the associated information of the target user.
In practical applications, the mailbox address of the target user is shown as a picture on some homepages. Therefore, if it is detected that the HTML file does not include the first mailbox address, the picture-form mailbox address detection module 81 may detect whether the preprocessed HTML file includes a picture-form mailbox address.
In this embodiment, if it is detected that the HTML file includes a picture-form mailbox address, the second standard mailbox address determination module 82 may download all pictures in the HTML file. Since the storage size of every picture-form mailbox address in the HTML file is smaller than 10480, the pictures smaller than 10480 may be screened out for analysis. Specifically, the screened pictures can be recognized by an OCR character recognition tool. It is then detected whether the recognition result meets a preset second condition, and if so, the picture-form mailbox address is taken as a standard mailbox address. The second condition includes: the recognition result includes a spacer between the mailbox user name and the domain name, and the picture name includes a preset word; and/or the recognition result includes a spacer between the mailbox user name and the domain name, and the text before the spacer contains the associated information of the target user.
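The picture branch might be sketched in Python as follows, leaving the download and OCR steps outside the example. Several points are assumptions rather than statements of the embodiment: the size threshold is read as bytes, the spacer in the recognition result is taken to be "@" only, and the preset words in PRESET_WORDS are invented for illustration. The names ImgCollector, image_sources, is_small_enough, and meets_second_condition are hypothetical.

```python
from html.parser import HTMLParser

MAX_PICTURE_SIZE = 10480  # threshold from the embodiment; the unit (bytes) is assumed
PRESET_WORDS = ("email", "mail")  # hypothetical preset words for picture names


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html_text: str) -> list[str]:
    collector = ImgCollector()
    collector.feed(html_text)
    return collector.sources


def is_small_enough(picture_bytes: bytes) -> bool:
    """Keep only pictures below the size threshold for OCR."""
    return len(picture_bytes) < MAX_PICTURE_SIZE


def meets_second_condition(ocr_text: str, picture_name: str, user_tokens: set[str]) -> bool:
    """Spacer present, and either the picture name or the text before the spacer matches."""
    if "@" not in ocr_text:
        return False
    local = ocr_text.split("@", 1)[0].lower()
    name_hit = any(word in picture_name.lower() for word in PRESET_WORDS)
    user_hit = any(token in local for token in user_tokens)
    return name_hit or user_hit
```

Here ocr_text stands for whatever string the OCR tool returns for one picture, and user_tokens for the associated information of the target user, for example name spellings and initials.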
The mailbox information extraction device provided by this embodiment detects whether the HTML file includes the mailbox address in the form of a picture, and if so, determines whether the mailbox address meets a preset second condition, and if so, takes the mailbox address as a standard mailbox address, thereby improving the accuracy of mailbox acquisition.
It should be noted that the above three embodiments can be implemented individually or in combination, which is not limited in the present invention.
Fig. 9 is a structural diagram of a mailbox information extraction apparatus according to a ninth embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 9, the apparatus further includes:
a first user homepage detection module 91, configured to detect whether the HTML file includes a jump instruction, and if the HTML file includes a jump instruction, to use the HTML file of the target homepage corresponding to the jump instruction as the HTML file of the user homepage of the mailbox to be extracted.
In this embodiment, after the homepage of the target user is acquired, the authenticity of the homepage needs to be determined in order to improve the accuracy of mailbox acquisition. In practical applications, the homepage of the target user may in some cases redirect to another page. Therefore, the first user homepage detection module 91 detects whether the HTML file includes a jump instruction, and if so, takes the HTML file of the target homepage corresponding to the jump instruction as the HTML file of the user homepage of the mailbox to be extracted. For example, if the HTML contains a statement of the form <meta http-equiv="Refresh" … URL=… html">, the HTML corresponding to the address after URL is downloaded as the HTML of the learner's homepage. If the HTML contains a statement of the form location.replace("… … html");, the HTML corresponding to the address in the brackets is downloaded as the HTML of the learner's homepage. If it is detected that the HTML file does not contain a jump instruction, the current HTML file is taken as the homepage of the target user.
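Detection of the two kinds of jump instruction might be sketched in Python as follows. The regular expressions and the name find_jump_target are hypothetical, and reading the script-based form as a call to location.replace is an assumption. The sketch also assumes that the http-equiv attribute appears before the URL inside the meta tag.

```python
import re

# <meta http-equiv="Refresh" content="0; URL=target.html">
META_REFRESH = re.compile(
    r"<meta[^>]*http-equiv\s*=\s*[\"']?refresh[\"']?[^>]*?url\s*=\s*[\"']?([^\"'>\s;]+)",
    re.IGNORECASE,
)
# location.replace("target.html");  (assumed reading of the script-based form)
JS_REPLACE = re.compile(
    r"location\.replace\(\s*[\"']([^\"']+)[\"']\s*\)",
    re.IGNORECASE,
)


def find_jump_target(html_text: str):
    """Return the address a page jumps to, or None if no jump instruction is found."""
    for pattern in (META_REFRESH, JS_REPLACE):
        match = pattern.search(html_text)
        if match:
            return match.group(1)
    return None
```

A caller would download the returned address and treat the downloaded HTML as the user homepage; when None is returned, the current HTML file is used as is.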
The mailbox information extraction device provided by this embodiment detects whether the homepage of the target user jumps or not, and takes different measures according to the detection result, so that the homepage of the target user can be accurately acquired, and a basis is provided for accurately acquiring the mailbox address of the target user.
Fig. 10 is a structural diagram of a mailbox information extraction apparatus provided in a tenth embodiment of the present invention, and on the basis of any of the above embodiments, as shown in fig. 10, the apparatus further includes:
a second user homepage detection module 111, configured to detect whether the HTML file includes multiple frame tags, and if the HTML file includes multiple frame tags, to use the HTML file of the page corresponding to each frame tag, in turn, as the HTML file of the user homepage of the mailbox to be extracted.
In this embodiment, after the homepage of the target user is acquired, the second user homepage detection module 111 further needs to determine the authenticity of the homepage in order to improve the accuracy of mailbox acquisition. In practical applications, the HTML file may include several frame tags, and each frame tag may correspond to a separate HTML file. Therefore, in order to improve the accuracy of mailbox acquisition, the HTML file of the page corresponding to each frame tag is used in turn as the HTML file of the user homepage of the mailbox to be extracted, and the mailbox address acquisition operation is performed on each of these HTML files.
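Collecting the pages behind the frame tags might be sketched in Python as follows. The names FrameCollector and frame_urls are hypothetical, resolving relative addresses against the homepage URL with urljoin is an assumption about how the separate HTML files would be located, and treating <iframe> like <frame> goes beyond what the embodiment states.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# The embodiment speaks of frame tags; including iframe is an assumption.
FRAME_TAGS = ("frame", "iframe")


class FrameCollector(HTMLParser):
    """Collect the src attribute of every frame tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in FRAME_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html_text: str, page_url: str) -> list[str]:
    """Absolute addresses of the HTML files referenced by the frame tags."""
    collector = FrameCollector()
    collector.feed(html_text)
    return [urljoin(page_url, src) for src in collector.sources]
```

Each returned address would then be downloaded and passed through the same extraction steps as an ordinary homepage.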
The mailbox information extraction device provided by this embodiment can accurately acquire the mailbox address of the target user by acquiring the HTML files corresponding to the plurality of frame tags in the current HTML file and sequentially using the HTML files corresponding to the plurality of frame tags as the HTML files of the mailbox to be extracted to perform the operation of mailbox address extraction.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A mailbox information extraction method is characterized by comprising the following steps:
preprocessing an HTML file of a user homepage of a mailbox to be extracted, wherein the preprocessing comprises the steps of removing meaningless characters and messy code characters in the HTML file and uniformly converting the format of the HTML file into a preset target format;
detecting whether a first mailbox address containing a character string exists in the mailbox address or not aiming at the mailbox address in the preprocessed HTML file;
if yes, detecting whether the first mailbox address meets a preset mailbox format, and if yes, deleting irregular characters in the first mailbox address;
detecting whether the character length of the first mailbox address is in a preset length range or not, and if so, taking the first mailbox address as a standard mailbox address;
after detecting whether a first mailbox address containing a character string exists in the mailbox addresses, the method further comprises the following steps:
if the first mailbox address does not exist, detecting whether the mailbox address in the HTML file meets a preset first condition, wherein the first condition comprises that: the mailbox address has associated information of a target user within a preset distance around the position where the HTML file appears, and/or a character string in the mailbox address before a spacer between a mailbox user name and a domain name contains the associated information of the target user;
and if the mailbox address meets the first condition, storing the mailbox address to a suspected mailbox list.
2. The method according to claim 1, wherein after detecting whether the processed HTML file contains a mailbox address in the form of a character string, the method further comprises:
if the first mailbox address does not exist, detecting whether the preprocessed HTML file contains a mailbox address in a picture form;
if yes, performing character recognition on the mailbox address in the picture form, and detecting whether a recognition result meets a preset second condition, if yes, taking the mailbox address in the picture form as the standard mailbox address, wherein the second condition comprises that: the identification result comprises a spacer between the mailbox user name and the domain name, and the picture name comprises a preset word; and/or the identification result comprises a spacer between the mailbox user name and the domain name, and the spacer comprises the association information of the target user before.
3. The method according to claim 1, wherein before preprocessing the HTML file of the user homepage of the mailbox to be extracted, the method further comprises:
and detecting whether the HTML file contains a jump instruction, and if so, taking the HTML file of the target homepage corresponding to the jump instruction as the HTML file of the user homepage of the mailbox to be extracted.
4. The method according to claim 1, wherein before preprocessing the HTML file of the user homepage of the mailbox to be extracted, the method further comprises:
and detecting whether the HTML file contains a plurality of frame tags or not, and if so, sequentially taking the HTML file of the main page corresponding to each frame tag as the HTML file of the user main page of the mailbox to be extracted.
5. The method of claim 1, wherein the association information comprises a name of the target user or an abbreviation of the name of the target user or a suffix of the URL of the HTML file.
6. A mailbox information extraction apparatus characterized by comprising:
a preprocessing module, configured to preprocess an HTML file of a user homepage of a mailbox to be extracted, wherein the preprocessing comprises removing meaningless characters and messy code characters in the HTML file and uniformly converting the format of the HTML file into a preset target format;
the character string detection module is used for detecting whether a first mailbox address containing a character string exists in the mailbox address or not aiming at the mailbox address in the preprocessed HTML file;
the mailbox format detection module is used for detecting whether the first mailbox address meets a preset mailbox format or not if the first mailbox address exists, and deleting the irregular characters in the first mailbox address if the first mailbox address meets the preset mailbox format;
the first standard mailbox address judgment module is used for detecting whether the character length of the current first mailbox address is within a preset length range, and if so, taking the first mailbox address as a standard mailbox address;
a first condition detection module, configured to detect whether a mailbox address in the HTML file meets a preset first condition if the first mailbox address does not exist, where the first condition includes: the mailbox address has associated information of a target user within a preset distance around the position where the HTML file appears, and/or a character string in the mailbox address before a spacer between a mailbox user name and a domain name contains the associated information of the target user;
and the suspected mailbox list establishing module is used for storing the mailbox address to a suspected mailbox list if the mailbox address meets the first condition.
7. The apparatus of claim 6, further comprising:
the picture-form mailbox address detection module is used for detecting whether the preprocessed HTML file contains a picture-form mailbox address or not if the first mailbox address does not exist;
a second standard mailbox address determination module, configured to perform character recognition on the picture-form mailbox address if the preprocessed HTML file contains a picture-form mailbox address, and to detect whether the recognition result meets a preset second condition, and if so, to use the picture-form mailbox address as the standard mailbox address, where the second condition includes: the recognition result comprises a spacer between the mailbox user name and the domain name, and the picture name comprises a preset word; and/or the recognition result comprises a spacer between the mailbox user name and the domain name, and the text before the spacer comprises the associated information of the target user.
8. The apparatus of claim 6, further comprising:
and the first user homepage detection module is used for detecting whether the HTML file contains a jump instruction, and if the HTML file contains the jump instruction, the HTML file of a target homepage corresponding to the jump instruction is used as the HTML file of the user homepage of the mailbox to be extracted.
CN201711285206.3A 2017-12-07 2017-12-07 Mailbox information extraction method and device Expired - Fee Related CN110020366B (en)


Publications (2)

Publication Number Publication Date
CN110020366A CN110020366A (en) 2019-07-16
CN110020366B true CN110020366B (en) 2021-06-15

Family

ID=67186903


Country Status (1)

Country Link
CN (1) CN110020366B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111147361B (en) * 2019-12-30 2022-06-07 论客科技(广州)有限公司 Method, device and storage medium for adding mailbox account

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001052021A (en) * 1999-08-12 2001-02-23 Bigbang Technology Ltd Information retrieval system used in the internet and method for constructing the system
CN101980156A (en) * 2010-11-22 2011-02-23 上海合合信息科技发展有限公司 Method for automatically extracting email address and creating new email
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics
CN103049845A (en) * 2013-01-22 2013-04-17 广州多益网络科技有限公司 Management method and device for electronic mail box
CN107247790A (en) * 2017-06-16 2017-10-13 北京小米移动软件有限公司 The method and apparatus of newly-built mail

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004179946A (en) * 2002-11-27 2004-06-24 Nec Corp Method and system for issuing mail address and server device
CN101299729B (en) * 2008-06-25 2011-05-11 哈尔滨工程大学 Method for judging rubbish mail based on topological action
CN103490980B (en) * 2013-09-04 2017-07-28 盈世信息科技(北京)有限公司 The extracting method and its device of number in a kind of Email
CN103944810B (en) * 2014-05-06 2017-02-15 厦门大学 Spam e-mail intention recognition system
CN106209724A (en) * 2015-04-29 2016-12-07 福建天晴数码有限公司 A kind of invalid addresses of items of mail filter method and device
CN106021304A (en) * 2016-05-05 2016-10-12 乐视控股(北京)有限公司 Webpage address correcting method and system
CN106027369A (en) * 2016-05-09 2016-10-12 哈尔滨工程大学 Email address characteristic oriented email address matching method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于搜索引擎的邮箱地址自动提取系统开发;刘冉;《中国优秀硕士学位论文全文数据库 信息科技辑》;20140315(第03期);I138-1207 第13-44页 *

Also Published As

Publication number Publication date
CN110020366A (en) 2019-07-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230609

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210615