Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present invention. All other embodiments obtained on the basis of the embodiments of the present invention fall within the scope of the present invention.
Fig. 1 is a flowchart of a mailbox information extraction method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
101. Preprocessing an HTML file of the user homepage from which a mailbox is to be extracted, wherein the preprocessing comprises removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format.
In this embodiment, to accurately obtain the mailbox address from the user homepage, the HTML file of that homepage is first preprocessed. Specifically, to improve extraction efficiency, meaningless characters and garbled characters in the HTML file can be deleted; the meaningless characters include, for example, tags such as <span> and <meta>. In addition, the HTML source code may encode characters as ASCII codes in several number bases, such as decimal and hexadecimal. If a mailbox address is extracted directly without processing these codes, the extracted address is incomplete. Therefore, to improve extraction accuracy, all ASCII codes in the HTML file, whatever their base, are converted into the corresponding English letters, digits, or symbols.
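The preprocessing described above could be sketched as follows. This is only an illustration under assumptions: the tag list `NOISE_TAGS`, the function name `preprocess`, and the choice of control characters treated as garbled are hypothetical, and the "ASCII codes" are assumed to be HTML numeric character references such as `&#64;` and `&#x40;`.

```python
import html
import re

# Hypothetical list of tags treated as "meaningless" for extraction purposes.
NOISE_TAGS = ("span", "meta")

# Matches opening and closing forms of the noise tags, e.g. <span ...> and </span>.
_NOISE_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(NOISE_TAGS), re.IGNORECASE)

# Assumption: control characters and U+FFFD stand in for "garbled characters".
_GARBLED_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]")


def preprocess(html_text: str) -> str:
    """Remove noise tags, decode character references, and drop garbled characters."""
    cleaned = _NOISE_TAG_RE.sub("", html_text)
    # html.unescape decodes decimal (&#64;) and hexadecimal (&#x40;) references.
    cleaned = html.unescape(cleaned)
    return _GARBLED_RE.sub("", cleaned)
```

For instance, `preprocess("<span>li&#64;example&#x2e;com</span>")` yields a plain `li@example.com` string that a later pattern match can pick up.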
102. For the mailbox addresses in the preprocessed HTML file, detecting whether a first mailbox address in character-string form exists among them.
In this embodiment, after the HTML file is preprocessed, it is detected whether a first mailbox address in character-string form exists in the HTML file. Specifically, the HTML file may be traversed with a regular expression to detect whether it contains such an address. To improve accuracy, whether a character string is a mailbox address can be judged from the position at which it appears in the HTML file; for example, the character string following "Email" or "Mail to" can be extracted. In addition, mailbox formats vary in practice: some mailbox addresses use "AT" in place of "@" and "DOT" in place of ".". Therefore, to capture all mailbox addresses in the HTML file, whether a character string is a mailbox address can also be judged from its content: if the character string contains either "@" or "AT" and also contains either "." or "DOT", it is judged to be a mailbox address.
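A content-based detection of this kind might look like the sketch below. The regular expression is an assumption rather than the pattern used in the embodiment, and the names `find_candidates` and `canonicalize` are hypothetical. The pattern is deliberately lenient: matching is case-insensitive, so a plain word "at" between two words also qualifies and may produce false positives that later steps would have to filter.

```python
import re

# "@", a bracketed [at]/(at), or the bare word AT; likewise for "." and DOT.
AT = r"\s*(?:@|[\[(]\s*at\s*[\])]|\bat\b)\s*"
DOT = r"\s*(?:\.|[\[(]\s*dot\s*[\])]|\bdot\b)\s*"

CANDIDATE = re.compile(
    rf"[A-Za-z0-9._%+-]+{AT}[A-Za-z0-9-]+(?:{DOT}[A-Za-z0-9-]+)+",
    re.IGNORECASE,
)


def find_candidates(text: str) -> list[str]:
    """Return every substring that looks like a plain or obfuscated address."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]


def canonicalize(candidate: str) -> str:
    """Rewrite the first AT separator as '@' and every DOT separator as '.'."""
    with_at = re.sub(AT, "@", candidate, count=1, flags=re.IGNORECASE)
    return re.sub(DOT, ".", with_at, flags=re.IGNORECASE)
```

The position-based check (strings following "Email" or "Mail to") is not shown; it could be layered on top by searching only the text after those keywords.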
103. If yes, detecting whether the first mailbox address meets a preset mailbox format, and if yes, deleting the irregular characters in the first mailbox address.
In this embodiment, if the HTML file is found to contain a mailbox address in character-string form, it is detected whether the mailbox address satisfies a preset mailbox format. For example, if the mailbox address has the form xx[at]yy.zz, where "at" and "." may be replaced by any equivalent form, the mailbox address is determined to satisfy the preset mailbox format. In addition, it is detected whether the mailbox address contains name information of the target user, where the name information may be the full spelling of the target user's name, an abbreviation of the name, or the last name or first name. For example, if the target user's name is Ming Li and the character string before "@" in the mailbox address contains Ming, Li, lm, or ml, the mailbox address is considered to contain the name information of the target user. If either requirement is met, the irregular characters in the mailbox address can be deleted. In practice, a correct mailbox address contains no irregular characters; for example, characters such as "/" and spaces are irregular characters. Therefore, to ensure the accuracy of the final mailbox address, the irregular characters must be deleted. An irregular-character library can be established in advance through collection and statistics, and when the mailbox information extraction scheme is executed, any irregular characters from the library that appear in the first mailbox address are deleted from it.
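The checks in this step can be illustrated with a minimal sketch. Everything named here is an assumption made for the example: the separator patterns, the contents of `IRREGULAR_CHARS` (the text only names "/" and spaces), and the helper names. The name check uses plain substring matching, as the Ming Li example suggests, so short keys such as "li" match liberally.

```python
import re

# Separators between user name and domain: "@", [at]/(at), or " at ".
AT_SEP = r"\s*(?:@|[\[(]\s*at\s*[\])]|\s+at\s+)\s*"
DOT_SEP = r"\s*(?:\.|[\[(]\s*dot\s*[\])]|\s+dot\s+)\s*"
FORMAT = re.compile(rf"^\S+{AT_SEP}\S+{DOT_SEP}\S+$", re.IGNORECASE)

# Hypothetical irregular-character library: slash, backslash, space, tab.
IRREGULAR_CHARS = frozenset("/\\ \t")


def matches_format(addr: str) -> bool:
    """Check the xx[at]yy.zz shape, allowing substitute forms of 'at' and '.'."""
    return FORMAT.match(addr) is not None


def name_variants(full_name: str) -> set[str]:
    """Name parts plus initials in both orders, e.g. ming, li, ml, lm."""
    parts = full_name.lower().split()
    initials = "".join(p[0] for p in parts)
    return {v for v in (*parts, initials, initials[::-1]) if v}


def contains_name_info(addr: str, full_name: str) -> bool:
    """True if the part before the at-separator contains any name variant."""
    local = re.split(AT_SEP, addr, maxsplit=1, flags=re.IGNORECASE)[0].lower()
    return any(v in local for v in name_variants(full_name))


def strip_irregular(addr: str) -> str:
    """Delete every character listed in the irregular-character library."""
    return "".join(c for c in addr if c not in IRREGULAR_CHARS)
```

An address that passes either `matches_format` or `contains_name_info` would then be cleaned with `strip_irregular` before the length check.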
104. Detecting whether the character length of the first mailbox address is within a preset length range, and if so, taking the first mailbox address as a standard mailbox address.
In this embodiment, after deleting the irregular characters in the mailbox, in order to ensure that the obtained mailbox address is a valid mailbox address, the length of the mailbox needs to be detected, and if it is detected that the length of the mailbox address is within a preset length range, the mailbox address can be used as a standard mailbox address. Specifically, the length range may be set by a user, or may be a default length range, where the default length range is obtained according to analysis of a plurality of current mailbox addresses, for example, the default length range may be that the character length of the email is 5-80, and if the character length of the mailbox address is between 5-80, the mailbox is determined to be the standard mailbox address.
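As a small illustration of the length check, the sketch below uses the default range of 5 to 80 characters from the example above. Treating both bounds as inclusive is an assumption, as are the names `DEFAULT_LENGTH_RANGE` and `is_standard_length`.

```python
# Default range from the example in the text; inclusive bounds are assumed.
DEFAULT_LENGTH_RANGE = (5, 80)


def is_standard_length(addr: str, length_range: tuple[int, int] = DEFAULT_LENGTH_RANGE) -> bool:
    """Return True when the address length lies within the configured range."""
    low, high = length_range
    return low <= len(addr) <= high
```

A user-defined range can be supplied through the `length_range` argument.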
It should be noted that the mailbox acquisition method can be used to obtain a student's mailbox address from the student's homepage, and can also be used to obtain a mailbox address from the homepage of any target user; the present invention is not limited in this respect.
In the mailbox information extraction method provided by this embodiment, the HTML file of the user homepage is preprocessed, all mailbox addresses in the preprocessed HTML file are acquired, and it is detected whether the mailbox addresses satisfy preset conditions, where the preset conditions cover all current mailbox forms. If so, the irregular characters in the mailbox addresses are deleted, it is detected whether the length of each mailbox address falls within a preset length range, and the mailbox addresses within that range are taken as standard mailbox addresses. Preprocessing converts the source code in the HTML file into a uniform format so that the mailbox addresses in character-string form can be acquired; comparing the acquired addresses against all current mailbox formats then captures all mailbox addresses in the HTML file, so that the mailbox address of the target user in the web page can be acquired accurately.
Fig. 2 is a flowchart of a mailbox information extraction method provided in the second embodiment of the present invention, and on the basis of the above embodiment, as shown in fig. 2, after step 102, the method further includes:
201. If the first mailbox address does not exist, detecting whether a mailbox address in the HTML file satisfies a preset first condition, wherein the first condition comprises: associated information of the target user appears within a preset distance of the position where the mailbox address appears in the HTML file, and/or the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
In this embodiment, if the HTML file does not contain the first mailbox address, it is detected whether a mailbox address in the HTML file satisfies a preset first condition. The preset first condition specifically includes: the associated information of the target user appears within a short distance of the position where the mailbox address appears, where short distance is defined as follows: two character strings are considered close when they are separated by no more than three tags. For example, in str1<bq1>str2<bq2>str3<bq3>str4<bq4>str5, str1 is close to str2, str3, and str4, and far from str5. When the number of separating tags is counted, two tags with no valid characters between them (valid characters being characters other than whitespace such as spaces, line feeds, or tabs) are counted as one tag. For example, in str1<bq1><bq2>str2, str1 and str2 are considered one tag apart. And/or, the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
202. If the mailbox address satisfies the first condition, storing the mailbox address in a suspected mailbox list.
In this embodiment, if the mailbox address is found to satisfy any item of the first condition, the mailbox address may be added to the suspected mailbox list. If the HTML file does not contain the first mailbox address, a mailbox address can then be taken from the suspected mailbox list as the standard mailbox address. The HTML file may contain mailbox addresses of several users, but every address in the suspected mailbox list satisfies the first condition; that is, the character string of the address contains the associated information of the target user, or such information appears near the character string. The probability that such an address is the standard mailbox address is therefore high, which improves the accuracy of mailbox acquisition.
It should be noted that, on the basis of any of the above embodiments, the association information includes the name of the target user, or an abbreviation of the name of the target user, or a suffix of the URL of the HTML file.
The user name in a mailbox address can take various forms, for example the full spelling of the name or a combination of initials. To improve the accuracy of mailbox acquisition, the associated information therefore includes the name of the target user and its abbreviation; if the mailbox address contains the name or the abbreviation, the mailbox address can be determined to be that of the target user. In addition, if the mailbox address of the target user does not contain the user's name information, the name alone cannot establish whether the address belongs to the target user. To solve this problem, the associated information further includes the suffix of the URL of the HTML file: if the character string of the mailbox address before the spacer between the mailbox user name and the domain name contains the suffix of the URL of the HTML file, the mailbox address can be determined to be that of the target user.
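A possible sketch of the association check follows. The text does not define the URL suffix precisely, so this example assumes it is the last path segment with any file extension and leading "~" removed; that reading, and the names `url_suffix` and `has_association`, are assumptions. The spacer is assumed to be "@".

```python
from urllib.parse import urlparse


def url_suffix(page_url: str) -> str:
    """Last path segment without extension or leading '~' (an assumed reading)."""
    path = urlparse(page_url).path.rstrip("/")
    last = path.rsplit("/", 1)[-1]
    stem = last.rsplit(".", 1)[0] if "." in last else last
    return stem.lstrip("~").lower()


def has_association(addr: str, full_name: str, page_url: str) -> bool:
    """True if the user-name part contains a name part, initials, or the URL suffix."""
    local = addr.split("@", 1)[0].lower()
    parts = full_name.lower().split()
    initials = "".join(p[0] for p in parts)
    keys = {*parts, initials, initials[::-1], url_suffix(page_url)}
    # Discard empty keys, which would otherwise match every string.
    return any(key in local for key in keys if key)
```

Under this reading, a page at `http://example.edu/~wxyz/` would associate the address `wxyz@example.edu` with the target user even though the name does not appear in it.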
According to the mailbox information extraction method provided by the embodiment, if it is detected that the HTML file does not contain the first mailbox address, the mailbox addresses meeting the preset first condition are stored in the suspected mailbox list, so that the mailbox addresses are acquired from the suspected mailbox list as the standard mailbox addresses, and because the mailbox addresses in the suspected mailbox list meet the first condition, the probability that the mailbox addresses are the standard mailbox addresses is high, and therefore the mailbox acquisition accuracy can be improved.
Fig. 3 is a flowchart of a mailbox information extraction method provided by a third embodiment of the present invention, on the basis of any one of the above embodiments, as shown in fig. 3, after step 102, the method includes:
301. If the first mailbox address does not exist, detecting whether the preprocessed HTML file contains a mailbox address in picture form.
In practice, the mailbox address of the target user is shown as a picture on some homepages. Therefore, if the HTML file is found not to contain the first mailbox address, it can be detected whether the preprocessed HTML file contains a mailbox address in picture form.
302. If so, performing character recognition on the picture-form mailbox address and detecting whether the recognition result satisfies a preset second condition, and if so, taking the picture-form mailbox address as the standard mailbox address, wherein the second condition comprises: the recognition result contains a spacer between the mailbox user name and the domain name, and the picture name contains a preset word; and/or the recognition result contains a spacer between the mailbox user name and the domain name, and the character string before the spacer contains the associated information of the target user.
In this embodiment, if the HTML file is found to contain a picture-form mailbox address, all pictures in the HTML file may be downloaded. Because the size of a picture-form mailbox address in an HTML file is less than 10480, pictures smaller than 10480 may be screened out for analysis. Specifically, the screened pictures can be recognized with an OCR character recognition tool. It is then detected whether the recognition result satisfies a preset second condition, and if so, the picture-form mailbox address is taken as the standard mailbox address. The second condition comprises: the recognition result contains a spacer between the mailbox user name and the domain name, and the picture name contains a preset word; and/or the recognition result contains a spacer between the mailbox user name and the domain name, and the character string before the spacer contains the associated information of the target user.
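The screening and the second condition might be sketched as follows. The OCR step itself is left out: the recognized text is taken as an input. The unit of the 10480 threshold is not stated in the text and is assumed here to be bytes; the preset words in `PRESET_WORDS`, the use of "@" as the spacer, and both function names are likewise assumptions.

```python
# Threshold from the text; the unit is assumed to be bytes.
MAX_PICTURE_SIZE = 10480

# Hypothetical preset words expected in the picture file name.
PRESET_WORDS = ("email", "mail")


def small_pictures(pictures: dict[str, bytes]) -> dict[str, bytes]:
    """Keep only the pictures whose size is below the threshold."""
    return {name: data for name, data in pictures.items() if len(data) < MAX_PICTURE_SIZE}


def meets_second_condition(recognized: str, picture_name: str, association_info: list[str]) -> bool:
    """Apply the second condition to OCR output, assuming '@' is the spacer."""
    if "@" not in recognized:
        return False
    # Branch 1: the picture name contains a preset word.
    name_hit = any(word in picture_name.lower() for word in PRESET_WORDS)
    # Branch 2: the text before the spacer contains associated information.
    local = recognized.split("@", 1)[0].lower()
    assoc_hit = any(info.lower() in local for info in association_info if info)
    return name_hit or assoc_hit
```

In a full pipeline, `recognized` would come from whichever OCR tool is used on each picture returned by `small_pictures`.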
According to the mailbox information extraction method provided by the embodiment, whether the picture-form mailbox address is contained in the HTML file or not is detected, if yes, whether the mailbox address meets the preset second condition or not is judged, and if yes, the mailbox address is used as the standard mailbox address, so that the accuracy of mailbox acquisition can be improved.
It should be noted that the above three embodiments can be implemented individually or in combination; the present invention is not limited in this respect.
Fig. 4 is a flowchart of a mailbox information extraction method provided by a fourth embodiment of the present invention, on the basis of any one of the foregoing embodiments, as shown in fig. 4, before step 101, the method further includes:
401. Detecting whether the HTML file contains a jump instruction, and if so, taking the HTML file of the target homepage corresponding to the jump instruction as the HTML file of the user homepage from which the mailbox is to be extracted.
In this embodiment, after the homepage of the target user is acquired, the authenticity of the homepage needs to be determined to improve the accuracy of mailbox acquisition. In practice, the homepage of the target user may redirect to another page. It is therefore detected whether the HTML file contains a jump instruction, and if so, the HTML file of the target homepage corresponding to the jump instruction is taken as the HTML file of the user homepage from which the mailbox is to be extracted. For example, if the HTML contains a tag of the form <meta http-equiv="Refresh" … URL=….html>, the HTML at the address following URL is downloaded as the HTML of the learner's homepage. If the HTML contains a statement of the form location.replace("…….html");, the HTML at the address in the parentheses is downloaded as the HTML of the learner's homepage. If the HTML file is found to contain no jump instruction, the current HTML file is taken as the homepage of the target user.
According to the mailbox information extraction method provided by the embodiment, whether the homepage of the target user jumps or not is detected, and different measures are taken according to the detection result, so that the homepage of the target user can be accurately acquired, and a basis is provided for accurately acquiring the mailbox address of the target user.
Fig. 5 is a flowchart of a mailbox information extraction method provided in the fifth embodiment of the present invention, and on the basis of any of the above embodiments, as shown in fig. 5, before step 101, the method further includes:
501. Detecting whether the HTML file contains a plurality of frame tags, and if so, sequentially taking the HTML file of the page corresponding to each frame tag as the HTML file of the user homepage from which the mailbox is to be extracted.
In this embodiment, after the homepage of the target user is acquired, in order to improve the accuracy of mailbox acquisition, the authenticity of the homepage needs to be determined. In practical application, the HTML file may include a plurality of frame tags, and each frame tag may correspond to a separate HTML file, so that in order to improve mailbox acquisition accuracy, the HTML file of the main page corresponding to each frame tag needs to be sequentially used as the HTML file of the user main page of the mailbox to be extracted, and mailbox address acquisition operation is performed on each HTML file.
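Collecting the pages behind frame tags might be done as in the sketch below, which uses the standard-library HTML parser. Treating `<iframe>` the same as `<frame>` is an assumption, and the names `FrameCollector` and `frame_urls` are hypothetical; each returned URL would then be downloaded and passed through the extraction steps in turn.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the src attribute of every frame-like tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(page_url: str, html_text: str) -> list[str]:
    """Return absolute URLs of the pages referenced by frame tags."""
    collector = FrameCollector()
    collector.feed(html_text)
    return [urljoin(page_url, src) for src in collector.sources]
```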
In the method for extracting mailbox information provided by this embodiment, the HTML files corresponding to the multiple frame tags in the current HTML file are acquired, and the HTML files corresponding to the multiple frame tags are sequentially used as the HTML files of the mailbox to be extracted to perform the operation of extracting the mailbox address, so that the mailbox address of the target user can be accurately acquired.
Fig. 6 is a structural diagram of a mailbox information extraction apparatus according to a sixth embodiment of the present invention, and as shown in fig. 6, the illustrated apparatus includes:
A preprocessing module 61, configured to preprocess an HTML file of the user homepage from which a mailbox is to be extracted, where the preprocessing includes removing meaningless characters and garbled characters from the HTML file and uniformly converting the format of the HTML file into a preset target format.
A character string detection module 62, configured to detect, for the mailbox addresses in the preprocessed HTML file, whether a first mailbox address in character-string form exists among them.
A mailbox format detection module 63, configured to detect, if the first mailbox address exists, whether the first mailbox address satisfies a preset mailbox format, and to delete the irregular characters in the first mailbox address if it does.
A first standard mailbox address determining module 64, configured to detect whether the character length of the first mailbox address is within a preset length range, and if so, take the first mailbox address as a standard mailbox address.
In this embodiment, to accurately obtain the mailbox address from the user homepage, the preprocessing module 61 preprocesses the HTML file of that homepage. Specifically, to improve extraction efficiency, meaningless characters and garbled characters in the HTML file can be deleted; the meaningless characters include, for example, tags such as <span> and <meta>. In addition, the HTML source code may encode characters as ASCII codes in several number bases, such as decimal and hexadecimal. If a mailbox address is extracted directly without processing these codes, the extracted address is incomplete. Therefore, to improve extraction accuracy, all ASCII codes in the HTML file, whatever their base, are converted into the corresponding English letters, digits, or symbols.
After the HTML file is preprocessed, the character string detection module 62 detects whether a first mailbox address in character-string form exists in the HTML file. Specifically, the HTML file may be traversed with a regular expression to detect whether it contains such an address. To improve accuracy, whether a character string is a mailbox address can be judged from the position at which it appears in the HTML file; for example, the character string following "Email" or "Mail to" can be extracted. In addition, mailbox formats vary in practice: some mailbox addresses use "AT" in place of "@" and "DOT" in place of ".". Therefore, to capture all mailbox addresses in the HTML file, whether a character string is a mailbox address can also be judged from its content: if the character string contains either "@" or "AT" and also contains either "." or "DOT", it is judged to be a mailbox address.
If the HTML file is found to contain a mailbox address in character-string form, the mailbox format detection module 63 detects whether the mailbox address satisfies a preset mailbox format. For example, if the mailbox address has the form xx[at]yy.zz, where "at" and "." may be replaced by any equivalent form, the mailbox address is determined to satisfy the preset mailbox format. In addition, it is detected whether the mailbox address contains name information of the target user, where the name information may be the full spelling of the target user's name, an abbreviation of the name, or the last name or first name. For example, if the target user's name is Ming Li and the character string before "@" in the mailbox address contains Ming, Li, lm, or ml, the mailbox address is considered to contain the name information of the target user. If either requirement is met, the irregular characters in the mailbox address can be deleted. In practice, a correct mailbox address contains no irregular characters; for example, characters such as "/" and spaces are irregular characters. Therefore, to ensure the accuracy of the final mailbox address, the irregular characters must be deleted.
After the irregular characters in the mailbox are deleted, in order to ensure that the acquired mailbox address is a valid mailbox address, the first standard mailbox address determination module 64 needs to detect the length of the mailbox, and if the length of the mailbox address is detected to be within a preset length range, the mailbox address can be used as a standard mailbox address. Specifically, the length range may be set by a user, or may be a default length range, where the default length range is obtained according to analysis of a plurality of current mailbox addresses, for example, the default length range may be that the character length of the email is 5-80, and if the character length of the mailbox address is between 5-80, the mailbox is determined to be the standard mailbox address.
It should be noted that the mailbox acquiring apparatus may be used for acquiring the mailbox address of the student in the homepage of the student, and may also be used for acquiring the mailbox address in the homepage of any target user, which is not limited in the present invention.
The mailbox information extraction device provided in this embodiment preprocesses the HTML file of the user homepage, acquires all mailbox addresses in the preprocessed HTML file, and detects whether the mailbox addresses satisfy preset conditions, where the preset conditions cover all current mailbox forms. If so, the device deletes the irregular characters in the mailbox addresses, detects whether the length of each mailbox address falls within a preset length range, and takes the mailbox addresses within that range as standard mailbox addresses. Preprocessing converts the source code in the HTML file into a uniform format so that the mailbox addresses in character-string form can be acquired; comparing the acquired addresses against all current mailbox formats then captures all mailbox addresses in the HTML file, so that the mailbox address of the target user in the web page can be acquired accurately.
Fig. 7 is a structural diagram of a mailbox information extraction apparatus according to a seventh embodiment of the present invention, and based on the foregoing embodiment, as shown in fig. 7, the apparatus further includes:
A first condition detecting module 71, configured to detect, if the first mailbox address does not exist, whether a mailbox address in the HTML file satisfies a preset first condition, where the first condition includes: associated information of the target user appears within a preset distance of the position where the mailbox address appears in the HTML file, and/or the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
A suspected mailbox list creating module 72, configured to store the mailbox address to a suspected mailbox list if the mailbox address satisfies the first condition.
In this embodiment, if the HTML file does not contain the first mailbox address, the first condition detecting module 71 detects whether a mailbox address in the HTML file satisfies a preset first condition. The preset first condition specifically includes: the associated information of the target user appears within a short distance of the position where the mailbox address appears, where short distance is defined as follows: two character strings are considered close when they are separated by no more than three tags. For example, in str1<bq1>str2<bq2>str3<bq3>str4<bq4>str5, str1 is close to str2, str3, and str4, and far from str5. When the number of separating tags is counted, two tags with no valid characters between them (valid characters being characters other than whitespace such as spaces, line feeds, or tabs) are counted as one tag. For example, in str1<bq1><bq2>str2, str1 and str2 are considered one tag apart. And/or, the character string in the mailbox address before the spacer between the mailbox user name and the domain name contains the associated information of the target user.
If the mailbox address is found to satisfy any item of the first condition, the suspected mailbox list creating module 72 may add the mailbox address to the suspected mailbox list. If the HTML file does not contain the first mailbox address, a mailbox address can then be taken from the suspected mailbox list as the standard mailbox address. The HTML file may contain mailbox addresses of several users, but every address in the suspected mailbox list satisfies the first condition; that is, the character string of the address contains the associated information of the target user, or such information appears near the character string. The probability that such an address is the standard mailbox address is therefore high, which improves the accuracy of mailbox acquisition.
It should be noted that, on the basis of any of the above embodiments, the association information includes the name of the target user, or an abbreviation of the name of the target user, or a suffix of the URL of the HTML file.
The user name in a mailbox address can take various forms, for example the full spelling of the name or a combination of initials. To improve the accuracy of mailbox acquisition, the associated information therefore includes the name of the target user and its abbreviation; if the mailbox address contains the name or the abbreviation, the mailbox address can be determined to be that of the target user. In addition, if the mailbox address of the target user does not contain the user's name information, the name alone cannot establish whether the address belongs to the target user. To solve this problem, the associated information further includes the suffix of the URL of the HTML file: if the character string of the mailbox address before the spacer between the mailbox user name and the domain name contains the suffix of the URL of the HTML file, the mailbox address can be determined to be that of the target user.
The mailbox information extraction device provided in this embodiment stores, if it is detected that the HTML file does not include the first mailbox address, the mailbox address meeting the preset first condition into the suspected mailbox list, so as to subsequently acquire the mailbox address from the suspected mailbox list as the standard mailbox address.
Fig. 8 is a structural diagram of a mailbox information extraction apparatus according to an eighth embodiment of the present invention, and on the basis of any of the above embodiments, as shown in fig. 8, the apparatus includes:
A picture-form mailbox address detection module 81, configured to detect whether the preprocessed HTML file contains a picture-form mailbox address if the first mailbox address does not exist.
A second standard mailbox address determination module 82, configured to, if a picture-form mailbox address exists, perform character recognition on the picture-form mailbox address and detect whether the recognition result satisfies a preset second condition, and if so, take the picture-form mailbox address as the standard mailbox address, where the second condition includes: the recognition result contains a spacer between the mailbox user name and the domain name, and the picture name contains a preset word; and/or the recognition result contains a spacer between the mailbox user name and the domain name, and the character string before the spacer contains the associated information of the target user.
In practice, the mailbox address of the target user is shown as a picture on some homepages. Therefore, if the HTML file is found not to contain the first mailbox address, the picture-form mailbox address detection module 81 may detect whether the preprocessed HTML file contains a picture-form mailbox address.
In this embodiment, if the HTML file is found to contain a picture-form mailbox address, the second standard mailbox address determination module 82 may download all pictures in the HTML file. Because the size of a picture-form mailbox address in an HTML file is less than 10480, pictures smaller than 10480 may be screened out for analysis. Specifically, the screened pictures can be recognized with an OCR character recognition tool. It is then detected whether the recognition result satisfies a preset second condition, and if so, the picture-form mailbox address is taken as the standard mailbox address. The second condition comprises: the recognition result contains a spacer between the mailbox user name and the domain name, and the picture name contains a preset word; and/or the recognition result contains a spacer between the mailbox user name and the domain name, and the character string before the spacer contains the associated information of the target user.
The mailbox information extraction device provided by this embodiment detects whether the HTML file includes the mailbox address in the form of a picture, and if so, determines whether the mailbox address meets a preset second condition, and if so, takes the mailbox address as a standard mailbox address, thereby improving the accuracy of mailbox acquisition.
It should be noted that the above three embodiments can be implemented individually or in combination, and the present invention is not limited in this respect.
Fig. 9 is a structural diagram of a mailbox information extraction apparatus according to a ninth embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 9, the apparatus further includes:
the first user homepage detection module 91 is configured to detect whether the HTML file includes a jump instruction, and if the HTML file includes the jump instruction, the HTML file of the target homepage corresponding to the jump instruction is used as the HTML file of the user homepage of the mailbox to be extracted.
In this embodiment, after the homepage of the target user is acquired, the authenticity of the homepage needs to be determined in order to improve the accuracy of mailbox acquisition. In practical applications, the homepage of the target user may in some cases jump to another page. Therefore, the first user homepage detection module 91 detects whether the HTML file includes a jump instruction, and if so, takes the HTML file of the target homepage corresponding to the jump instruction as the HTML file of the user homepage of the mailbox to be extracted. For example, if the HTML contains a tag of the form <meta http-equiv="Refresh" content="…; URL=… .html">, the HTML corresponding to the address after URL is downloaded as the HTML of the target user's homepage. If the HTML contains a statement of the form location.replace("… .html");, the HTML corresponding to the address in the brackets is downloaded as the HTML of the target user's homepage. If it is detected that the HTML file does not contain a jump instruction, the current HTML file is taken as the homepage of the target user.
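As a minimal sketch of the jump-instruction detection described above, the following uses only the Python standard library. The function name `find_jump_target`, the `base_url` parameter, and the regular expressions are assumptions made for illustration; the embodiment does not prescribe how the two forms are parsed. Downloading the resulting address is left to the caller.

```python
import re
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _MetaRefreshParser(HTMLParser):
    """Records the URL of the first <meta http-equiv="Refresh"> tag, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.target: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "<seconds>; URL=<address>".
        m = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                      attr.get("content", ""), re.IGNORECASE)
        if m:
            self.target = m.group(1).strip()


# Script-based jump of the form location.replace("<address>").
_JS_REDIRECT = re.compile(r"location\.replace\(\s*['\"]([^'\"]+)['\"]\s*\)")


def find_jump_target(html: str, base_url: str) -> Optional[str]:
    """Return the absolute address the page jumps to, or None if it does not jump."""
    parser = _MetaRefreshParser()
    parser.feed(html)
    target = parser.target
    if target is None:
        m = _JS_REDIRECT.search(html)
        target = m.group(1) if m else None
    return urljoin(base_url, target) if target else None
```

When `find_jump_target` returns `None`, the current HTML file would be kept as the homepage of the target user, matching the fallback in the paragraph above.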
The mailbox information extraction device provided by this embodiment detects whether the homepage of the target user jumps or not, and takes different measures according to the detection result, so that the homepage of the target user can be accurately acquired, and a basis is provided for accurately acquiring the mailbox address of the target user.
Fig. 10 is a structural diagram of a mailbox information extraction apparatus provided in a tenth embodiment of the present invention, and on the basis of any of the above embodiments, as shown in fig. 10, the apparatus further includes:
the second user homepage detection module 111 is configured to detect whether the HTML file includes multiple frame tags, and if the HTML file includes multiple frame tags, sequentially use the HTML file of the homepage corresponding to each frame tag as the HTML file of the user homepage of the mailbox to be extracted.
In this embodiment, after the homepage of the target user is acquired, the second user homepage detection module 111 further needs to determine the authenticity of the homepage in order to improve the accuracy of mailbox acquisition. In practical applications, the HTML file may include a plurality of frame tags, and each frame tag may correspond to a separate HTML file. Therefore, the HTML file of the homepage corresponding to each frame tag needs to be used in turn as the HTML file of the user homepage of the mailbox to be extracted, and the mailbox address acquisition operation is performed on each of these HTML files.
The mailbox information extraction device provided by this embodiment acquires the HTML files corresponding to the plurality of frame tags in the current HTML file and performs the mailbox address extraction operation on each of them in turn, so that the mailbox address of the target user can be accurately acquired.
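The frame handling described above might be sketched as follows, again using only the Python standard library. The function name `frame_page_urls` and the `base_url` parameter are hypothetical, and treating `iframe` tags the same way as `frame` tags is an assumption of this sketch rather than something stated in the embodiment.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _FrameCollector(HTMLParser):
    """Collects the src attribute of every frame tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Assumption: <iframe> is handled like <frame>.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_page_urls(html: str, base_url: str) -> List[str]:
    """Return the absolute address of the page behind each frame tag."""
    collector = _FrameCollector()
    collector.feed(html)
    return [urljoin(base_url, src) for src in collector.sources]
```

Each returned address would then be downloaded and passed, one after another, through the same mailbox address extraction steps as an ordinary homepage.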
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.