
CN112287198A - Spam short message detection method based on crawler technology

Info

Publication number: CN112287198A
Application number: CN202011173377.9A
Authority: CN
Inventors: 汤增丰; 张长根
Original assignee: Shanghai Winnerlook Information Technology Co ltd
Current assignee: Shanghai Winnerlook Information Technology Co ltd
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2021-01-29
Anticipated expiration: 2040-10-28
Also published as: CN112287198B

Abstract

The invention provides a spam short message detection method based on a crawler technology, which comprises the following steps: extracting an effective URL address in the short message to be detected through a preset regular expression; acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module; extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information; analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address; segmenting word information of an HTML page to obtain each corresponding word in the word information of the HTML page, and judging whether the short message to be detected is a spam short message or not by utilizing the proportion of each word in spam information.

Description

Spam short message detection method based on crawler technology

Technical Field

The invention relates to the field of communication and network search, and particularly provides a spam message detection method based on a crawler technology.

Background

With the continuous development of mobile internet technology, commercial short messages are used more and more widely as the promotional medium over which operators have the strongest control. Commercial short messages offer rapid propagation, a high delivery rate, advanced terminals, little advertising delay, a high read rate for advertising content, the ability to segment target customers and markets, and low cost. These advantages are often exploited by unscrupulous merchants and criminals, who use short messages for illegal marketing, malicious fraud, and the promotion of pornography, gambling, and drugs. This causes serious social harm and severely damages the healthy development of commercial short messages.

In existing technical schemes, spam detection mainly analyzes the sending and receiving numbers and the short message content. It does not further analyze and identify the URL address information contained in the content, so the success rate of spam identification is low.

Disclosure of Invention

The invention provides a crawler-based spam short message detection method. It addresses the case in which a spam message is disguised as a normal message and the URL address it contains points to an illegal or spam information page, a case that existing methods cannot identify effectively. The invention adopts the following technical scheme:

the invention provides a spam short message detection method based on a crawler technology, which comprises the following steps:

extracting an effective URL address in the short message to be detected through a preset regular expression;

acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module;

extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information;

analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address;

segmenting word information of an HTML page to obtain each corresponding word in the word information of the HTML page, and judging whether the short message to be detected is a spam short message or not by utilizing the proportion of each word in spam information.

Further, in the process of acquiring the HTML page pointed to by the URL, if the HTTP request is redirected multiple times, the crawler module continues until it obtains an HTML page that does not automatically jump to other HTML pages; that HTML page is the final page visible to the user.

Further, the regular expression is:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

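wherein the regular expression consists of the prefix http[s]?:// followed by the repeated group (?:...)+, which matches letters, digits, the listed punctuation classes, and percent-encoded bytes.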

Further, the method for acquiring the URL address and the HTML page text information of the final HTML page pointed by the URL address by using the crawler module comprises the following steps:

sending an HTTP request pointing to the URL address to a response header in a crawler mode;

when receiving a first type of response sent back by the response header, the crawler module continuously initiates an HTTP request to a new URL address contained in the response header; wherein, the first type response refers to the occurrence of HTTP redirection message;

when a second type of response sent back by the response header is received, the crawler module simulates the behavior of a browser, executes a Javascript and initiates a new HTTP request to a URL address specified in the Javascript; the second type of response refers to the appearance of page information containing Javascript;

and the crawler module continuously processes the first type response and the second type response until a third type response is received, wherein the third type response is a final response, and the received corresponding information comprises HTML page content displayed to the terminal user.

Further, the crawler module comprises:

the request sending module is used for sending an HTTP request pointing to the URL address to the response head in a crawler mode;

the first-class response processing module is used for continuously initiating an HTTP request to a new URL address contained in the response head when receiving the first-class response sent back by the response head; wherein, the first type response refers to the occurrence of HTTP redirection message;

the second type response processing module simulates the behavior of a browser when receiving a second type response sent back by the response header, executes the Javascript and initiates a new HTTP request to the URL address specified in the Javascript; the second type of response refers to the appearance of page information containing Javascript;

and the HTML information acquisition module is used for controlling the first-class response processing module and the second-class response processing module to respectively carry out continuous processing on the first-class response and the second-class response until a third-class response is received, wherein the third-class response is a final response, and the received corresponding information contains HTML page content displayed to the terminal user.

Further, after the request sending module sends a first HTTP request to the response header for the short message to be detected, if the time for the response header to return the response information exceeds a preset time threshold, the request sending module sends a continuous HTTP request according to a request sending time interval, where the request sending time interval is determined by the following formula:

where ΔT represents the request sending time interval; m represents the total number of short messages to be detected by the crawler module; ΔT₀ represents the initial default value of the request sending time interval; β₁, β₂ and β₃ represent time interval adjustment coefficients, where β₁ ranges from 0.581 to 0.673, β₂ ranges from 0.424 to 0.537, and β₃ ranges from 0.615 to 0.736; T_i represents, when the crawler module detects the i-th short message to be detected, the interval between the request sending module sending the first HTTP request to the response head and obtaining the response from the response head; T_max represents the maximum of these intervals over the m short messages detected by the crawler module; and T_min represents the minimum of these intervals over the m short messages detected by the crawler module.

Further, the determining whether the short message to be detected is a spam short message by using the domain name information includes:

inquiring ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is effective filing information or not;

and when the ICP filing information is invalid filing information, judging that the short message to be detected is a spam short message.

Further, determining whether the short message to be detected is a spam short message according to the IP address comprises:

inquiring attribution information of the IP address according to the IP address;

judging, according to the attribution information, whether the attribution corresponding to the IP address is mainland China;

and when the attribution corresponding to the IP address is not mainland China, judging that the short message to be detected is a spam short message.

Further, the step of determining whether the short message to be detected is a spam short message by using the proportion of each word in the spam message comprises the following steps:

inquiring the proportion of each term in the junk information;

weighting and summing the proportion of all words in the junk information to obtain a weighted sum value K;

judging whether the weighted sum value K is greater than a preset threshold;

and when the weighted sum value K is greater than the preset threshold, judging that the short message to be detected is a spam short message.

Further, the weighted sum value K is obtained by the following formula:

wherein W₁, W₂, ..., W_n represent the proportions, in the spam information, of words 1 to n in the text information of the HTML page; C₁, C₂, ..., C_n represent the number of times words 1 to n appear in the text information of the HTML page; and the proportion of each word in the spam information is a preset value.

The invention has the beneficial effects that:

The spam message detection method based on crawler technology can effectively and quickly identify spam messages that are disguised as normal messages, which improves the efficiency, accuracy, and success rate of spam message identification. In addition, because the short message to be detected is checked through several kinds of information (domain name filing, IP address attribution, and the wording of the page), the number of spam short message detection and identification failures is reduced.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described implementations are only some embodiments of the present invention, rather than all embodiments. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

It is noted that the terms first, second and the like in the description and in the claims, and in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The spam message detection method based on crawler technology provided by the embodiment of the invention is shown in FIG. 1 and comprises the following steps:

S1, extracting an effective URL address in the short message to be detected through a preset regular expression;

S2, acquiring the URL address and HTML page text information of the final HTML page pointed to by the URL address by using a crawler module; in the process of acquiring the HTML page pointed to by the URL, if the HTTP request is redirected multiple times, the crawler module continues until it obtains an HTML page that does not automatically jump to other HTML pages, wherein that HTML page is the final page visible to the user;

S3, extracting domain name information in the URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message by utilizing the domain name information;

S4, resolving the domain name information by a DNS resolution method, acquiring the IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message according to the IP address;

S5, segmenting the text information of the HTML page to obtain each word in the text information of the HTML page, and judging whether the short message to be detected is a spam short message by utilizing the proportion of each word in spam information.

The working principle of the technical scheme is as follows. First, when the short message to be detected contains a valid URL address, the URL address is extracted through a preset regular expression for matching URLs; when the short message contains no URL address, it is handled by other spam short message detection methods. Then, the crawler module acquires the URL address and the HTML page text information of the final HTML page pointed to by the URL address. Next, the domain name information in the URL address corresponding to the HTML page is extracted through a preset regular expression for acquiring the domain name, and the domain name information is used to judge whether the short message to be detected is a spam short message. The domain name information is then resolved through DNS to obtain the corresponding IP address, and the IP address is used to judge whether the short message to be detected is a spam short message. Finally, the text information of the HTML page is segmented into words, and the proportion of each word in spam information is used to judge whether the short message to be detected is a spam short message. The word segmentation proceeds as follows: the information content is first encoded according to the Universal Character Set (UCS), and word segmentation is then performed on the characters in their horizontal, left-to-right order.

The effect of the above technical scheme is as follows: the spam message detection method based on crawler technology provided by this embodiment can effectively and quickly identify spam messages that are disguised as normal messages, which improves the efficiency, accuracy, and success rate of spam message identification.

In an embodiment of the present invention, the regular expression is:

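http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

wherein the regular expression consists of the prefix http[s]?:// followed by the repeated group (?:...)+. For example, a valid URL address in a short message necessarily takes the form of http:// or https:// followed by a domain name, or by an IP address and port. The regular expression matches such an address through the http[s]?:// fragment, followed by characters from the classes [a-zA-Z], [0-9], [$-_@.&+], and [!*,], or percent-encoded bytes.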

The effect of the above technical scheme is as follows: the regular expression improves the accuracy and efficiency with which effective URL addresses are extracted from the short message to be detected, and avoids missed URL addresses. The accuracy of spam message detection is thereby improved and its error rate reduced.
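As an illustration only, the extraction step can be sketched in Python with the standard `re` module. The pattern is the one quoted above, written with ASCII punctuation; the function name `extract_urls` and the sample message are assumptions made for this sketch and do not come from the patent text.

```python
import re
from typing import List

# The URL pattern quoted in this document, written with ASCII punctuation.
# Every group is non-capturing, so findall() returns whole matches.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_urls(message: str) -> List[str]:
    """Return every substring of the short message that matches URL_PATTERN."""
    return URL_PATTERN.findall(message)


if __name__ == "__main__":
    sample = "Claim your prize at https://example.com/a?id=1 today"
    print(extract_urls(sample))
```

Note that the class `[$-_@.&+]` contains the range from `$` to `_`, which is why characters such as `/`, `?`, and `=` are matched even though they are not listed individually. A message that yields an empty list would be handed to other detection methods, as described in the working principle.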

In an embodiment of the present invention, obtaining, by using a crawler module, a URL address and HTML page text information of a final HTML page to which the URL address points includes:

S201, sending an HTTP request pointing to the URL address to a response header in a crawler mode;

S202, when a first type of response sent back by the response header is received, the crawler module continuously initiates an HTTP request to a new URL address contained in the response header; wherein the first type of response refers to the occurrence of an HTTP redirection message;

S203, when a second type of response sent back by the response header is received, the crawler module simulates the behavior of a browser, executes the Javascript script, and initiates a new HTTP request to the URL address specified in the Javascript script; wherein the second type of response refers to the appearance of page information containing a Javascript script;

S204, the crawler module continuously processes the first type of response and the second type of response until a third type of response is received, wherein the third type of response is a final response, and the received response information contains the HTML page content displayed to the terminal user.

Wherein the crawler module comprises:

The working principle of the technical scheme is as follows: the URL address and HTML page text information of the final HTML page pointed to by the URL address in step S2 are acquired by crawler technology. When the crawler sends an HTTP request directed to the URL address, it may receive three types of response. The first type is an HTTP redirect message; on receiving it, the crawler continues by initiating an HTTP request to the new URL address contained in the response header. The second type is page information containing a Javascript script; in this case the crawler simulates browser behavior, executes the script, and initiates a new HTTP request to the URL address specified in the script. Both the first and second types trigger a new HTTP request, and this continues until a third-type response is received. The third type is the final response message, which contains the HTML page content presented to the end user.

The effect of the above technical scheme is as follows: the information acquisition efficiency and the information acquisition accuracy of the URL address and the HTML page text information are effectively improved. The problem that the URL address and the HTML page text information are not accurately obtained due to problems of HTTP redirection, page information containing Javascript scripts and the like is solved. Meanwhile, the URL address and the HTML page text information of the final HTML page can be effectively and quickly acquired through the method, and the problem that the acquired URL address and the HTML page text information do not belong to the HTML page content displayed to the terminal user is effectively prevented.
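The three response types can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the patent text: the names `Response`, `resolve_final_page`, and `JS_REDIRECT` are hypothetical; the network layer is injected as a `fetch` callable so the logic can be exercised offline; and the second-type response is recognized with a narrow regular expression for `location = "..."` assignments instead of actually executing the script, which a real crawler would do with a browser engine.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class Response:
    """Hypothetical, simplified view of an HTTP response."""
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


# Assumption: script-driven navigation is detected by pattern matching only.
JS_REDIRECT = re.compile(
    r"""(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def resolve_final_page(url, fetch, max_hops=10):
    """Follow first- and second-type responses until a final page is reached.

    fetch(url) must return a Response. Returns (final_url, html_text).
    """
    for _ in range(max_hops):
        response = fetch(url)
        # First type: HTTP redirect carrying the new address in Location.
        # (Header lookup here is case-sensitive, a simplification.)
        location = response.headers.get("Location")
        if 300 <= response.status < 400 and location:
            url = urljoin(url, location)
            continue
        # Second type: page whose script navigates to another address.
        match = JS_REDIRECT.search(response.body)
        if match:
            url = urljoin(url, match.group(1))
            continue
        # Third type: the page finally shown to the end user.
        return url, response.body
    raise RuntimeError("too many hops while resolving the final page")


if __name__ == "__main__":
    pages = {
        "http://short.example/x": Response(302, {"Location": "http://mid.example/go"}),
        "http://mid.example/go": Response(
            200, {}, '<script>window.location.href = "/landing";</script>'
        ),
        "http://mid.example/landing": Response(
            200, {}, "<html><body>Win a prize</body></html>"
        ),
    }
    print(resolve_final_page("http://short.example/x", pages.__getitem__))
```

The `max_hops` limit is a design choice added for the sketch so that a redirect loop cannot keep the crawler running indefinitely.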

In an embodiment of the present invention, after the request sending module sends a first HTTP request to the response head for the short message to be detected, if the time for the response head to return the response information exceeds a preset time threshold, the request sending module sends a continuous HTTP request according to a sending request time interval, where the sending request time interval is determined by the following formula:

where ΔT represents the request sending time interval; m represents the total number of short messages to be detected by the crawler module; ΔT₀ represents the initial default value of the request sending time interval; β₁, β₂ and β₃ represent time interval adjustment coefficients, where β₁ ranges from 0.581 to 0.673, β₂ ranges from 0.424 to 0.537, and β₃ ranges from 0.615 to 0.736; T_i represents, when the crawler module detects the i-th short message to be detected, the interval between the request sending module sending the first HTTP request to the response head and obtaining the response from the response head; T_max represents the maximum of these intervals over the m short messages detected by the crawler module; and T_min represents the minimum of these intervals over the m short messages detected by the crawler module.

The working principle of the technical scheme is as follows: after the request sending module sends a first HTTP request to the response head for the short message to be detected, if the time taken by the response head to return the response information exceeds a preset time threshold, the request sending module resends the HTTP request to the response head repeatedly at the request sending time interval. In this embodiment, the request sending time interval is obtained from parameters recorded while the crawler module detects the m short messages to be detected: the interval between each first HTTP request and the corresponding response from the response head, together with the maximum and minimum of those intervals.

The effect of the above technical scheme is as follows: the efficiency with which the crawler obtains the URL address and HTML page text information of the final HTML page is improved. The scheme avoids the spam detection flow stalling, or its efficiency dropping, when the response head fails to receive a request and therefore returns no response. The request sending time interval obtained through the formula also improves the timeliness of HTTP request sending. Because the interval is determined from historical timing data gathered during short message detection, it is better matched to the operation of the crawler module. This improves the stability of the crawler module and avoids the instability caused by an interval that is too short and a request frequency that is therefore too high.
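The resend behaviour can be sketched as follows. The interval formula itself is not reproduced in this text, so the sketch does not attempt to compute ΔT: it only gathers the quantities the symbol definitions name (m, T_max, T_min) and accepts an already-computed interval as a parameter. The names `summarize_response_times` and `send_with_retry`, the `max_attempts` limit, and the convention that `send()` returns `None` on a timeout are all assumptions made for illustration.

```python
import time


def summarize_response_times(samples):
    """Return (m, T_max, T_min) for the recorded first-request response times."""
    return len(samples), max(samples), min(samples)


def send_with_retry(send, interval, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a response, waiting `interval` seconds
    between attempts.

    send() is assumed to return None when no response arrived within the
    preset time threshold. Returns the response, or None if every attempt
    timed out.
    """
    for attempt in range(max_attempts):
        response = send()
        if response is not None:
            return response
        if attempt < max_attempts - 1:
            sleep(interval)  # interval plays the role of the computed ΔT
    return None


if __name__ == "__main__":
    attempts = []

    def flaky_send():
        attempts.append(1)
        return "ok" if len(attempts) == 3 else None

    print(send_with_retry(flaky_send, interval=0.01, max_attempts=5))
    print(summarize_response_times([0.2, 1.5, 0.7]))
```

Passing `sleep` as a parameter keeps the waiting behaviour replaceable, which makes the retry logic easy to exercise without real delays.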

In an embodiment of the present invention, the determining whether the short message to be detected is a spam short message by using domain name information includes:

S301, querying the ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is valid filing information;

S302, when the ICP filing information is invalid filing information, judging that the short message to be detected is a spam short message.

The working principle of the technical scheme is as follows: the domain name contained in the URL address information of the HTML page is extracted through a preset regular expression for matching domain names. The filing information for the extracted domain name is then queried from the government service platform of the Ministry of Industry and Information Technology, or from a service platform provided by another legally authorized organization. When no filing information can be found, or the filing information found is identified as invalid or illegal, the short message to be detected is judged to be a spam short message.

The effect of the above technical scheme is as follows: detecting the short messages by checking the ICP filing information improves the efficiency, accuracy, and success rate of spam short message detection and reduces the number of detection and identification failures.
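A minimal Python sketch of the domain name check follows. The patent does not give the regular expression used to acquire the domain name, so `DOMAIN_PATTERN` is a hypothetical stand-in. The filing query is represented by an injected `lookup_icp` callable, and the record shape (a dict with a `"valid"` key, or `None` when nothing is found) is likewise an assumption; no real filing service API is implied.

```python
import re

# Hypothetical pattern: the host part after the scheme, up to the first
# "/", ":", "?" or "#". Userinfo (user@host) is not handled.
DOMAIN_PATTERN = re.compile(r"^https?://([^/:?#]+)", re.IGNORECASE)


def extract_domain(url):
    """Return the lower-cased host of the URL, or None if it does not match."""
    match = DOMAIN_PATTERN.match(url)
    return match.group(1).lower() if match else None


def is_spam_by_icp(url, lookup_icp):
    """Judge a message as spam when its final URL has no valid ICP filing.

    lookup_icp(domain) is assumed to return a filing record (a dict with a
    boolean "valid" entry) or None when no filing can be found.
    """
    domain = extract_domain(url)
    if domain is None:
        raise ValueError("not an http(s) URL: %r" % url)
    record = lookup_icp(domain)
    return record is None or not record.get("valid", False)


if __name__ == "__main__":
    filings = {"shop.example.cn": {"valid": True}}
    print(is_spam_by_icp("https://shop.example.cn/item", filings.get))
    print(is_spam_by_icp("http://unknown.example.com/", filings.get))
```

Both a missing record and a record marked invalid lead to the spam judgment, matching the two cases named in the working principle.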

In one embodiment of the present invention, determining whether the short message to be detected is a spam short message by using the IP address includes:

S401, querying the attribution information of the IP address according to the IP address;

S402, judging, according to the attribution information, whether the attribution corresponding to the IP address is mainland China;

S403, when the attribution corresponding to the IP address is not mainland China, judging that the short message to be detected is a spam short message.

The working principle of the technical scheme is as follows: the IP address corresponding to the domain name is obtained through a DNS resolution program, the attribution information of the IP address is queried, and whether the attribution is mainland China is judged; if it is not, the short message to be detected is judged to be a spam short message.

The effect of the above technical scheme is as follows: the short message to be detected is identified through the detection of the IP address attribution, the efficiency, the accuracy and the success rate of spam short message detection can be effectively improved, and the number of spam short message detection and identification failures is effectively reduced.
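The IP address check can be sketched in Python as below. DNS resolution uses the standard library call `socket.gethostbyname`, which returns one IPv4 address for a host name. The attribution query is an injected `lookup_region` callable, and the region label `"CN-mainland"` is a hypothetical value chosen for this sketch; the patent does not name a particular attribution database or label format.

```python
import socket

MAINLAND_CHINA = "CN-mainland"  # hypothetical label returned by lookup_region


def is_spam_by_ip_location(domain, lookup_region, resolve=socket.gethostbyname):
    """Judge a message as spam when its domain resolves outside mainland China.

    resolve(domain) returns an IP address string; lookup_region(ip) is
    assumed to return a region label for that address.
    """
    ip_address = resolve(domain)
    region = lookup_region(ip_address)
    return region != MAINLAND_CHINA


if __name__ == "__main__":
    # Offline demonstration with stand-in DNS and attribution tables.
    fake_dns = {"a.example": "192.0.2.10", "b.example": "198.51.100.7"}
    fake_geo = {"192.0.2.10": "CN-mainland", "198.51.100.7": "elsewhere"}
    for name in fake_dns:
        print(name, is_spam_by_ip_location(name, fake_geo.get, fake_dns.__getitem__))
```

An address whose attribution is unknown is treated the same as one outside mainland China in this sketch; whether that is the intended behaviour is not stated in the text.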

In an embodiment of the present invention, determining whether the short message to be detected is a spam short message by using the proportion of each word in spam information includes:

S501, querying the proportion of each word in the spam information;

S502, carrying out weighted summation on the proportions of all words in the spam information to obtain a weighted sum value K;

S503, judging whether the weighted sum value K is greater than a preset threshold;

S504, when the weighted sum value K is greater than the preset threshold, judging that the short message to be detected is a spam short message.

Wherein the weighted summation value K is obtained by the following formula:

wherein W₁, W₂, ..., W_n represent the proportions, in the spam information, of words 1 to n in the text information of the HTML page; C₁, C₂, ..., C_n represent the number of times words 1 to n appear in the text information of the HTML page; and the proportion of each word in the spam information is a manually preset value whose default is zero.

The working principle of the technical scheme is as follows: the text of the HTML page is segmented to obtain n words, the proportion of each word in spam information is queried, and the proportions of all words are weighted and summed to obtain K. If K is greater than a preset value, the short message to be detected is judged to be a spam short message. The proportion of each word in spam information is a preset value.

The effect of the above technical scheme is as follows: the short messages to be detected are detected through the weighted sum value of each word of the HTML page characters in the spam information, so that the efficiency, the accuracy and the success rate of spam short message detection can be effectively improved, and the number of spam short message detection and identification failures is effectively reduced. Meanwhile, the weighted sum value of each word aiming at the HTML page characters, which is obtained through the formula, can effectively improve the determination accuracy of the weight of the word in the junk information, and further improve the accuracy of junk short message detection through the word.
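The formula for K is not reproduced in this text. As a sketch only, the Python code below assumes the simplest form consistent with the symbol definitions, K = W₁·C₁ + W₂·C₂ + ... + W_n·C_n, that is, each word's preset proportion multiplied by its occurrence count; the actual formula may differ. The function names, the sample weights, and the pre-split word list are assumptions; the word segmentation step itself is outside the sketch.

```python
from collections import Counter


def weighted_sum(words, weights):
    """Assumed form of K: sum over distinct words of W_i * C_i.

    words   -- the segmented words of the HTML page text
    weights -- preset proportion W_i per word; unknown words default to zero
    """
    counts = Counter(words)  # C_i: occurrences of each word on the page
    return sum(weights.get(word, 0.0) * count for word, count in counts.items())


def is_spam_by_words(words, weights, threshold):
    """Judge the message as spam when K exceeds the preset threshold."""
    return weighted_sum(words, weights) > threshold


if __name__ == "__main__":
    page_words = ["free", "prize", "hello", "free"]
    preset = {"free": 0.5, "prize": 1.0}
    print(weighted_sum(page_words, preset))
    print(is_spam_by_words(page_words, preset, threshold=1.5))
```

Words without a preset proportion contribute nothing to K, which follows the statement that the default proportion is zero.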

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A spam message detection method based on a crawler technology is characterized by comprising the following steps:

2. The method according to claim 1, wherein, in the process of acquiring the HTML page pointed to by the URL, if the HTTP request is redirected multiple times, the crawler module continues until it obtains an HTML page that does not automatically jump to other HTML pages, the HTML page being the final page visible to the user.

3. The method of claim 1, wherein the regular expression is:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

4. The method of claim 1, wherein the obtaining, by a crawler module, the URL address and HTML page text information of the final HTML page pointed by the URL address comprises:

5. The method of claim 1, wherein the crawler module comprises:

6. The method according to claim 5, wherein after the request sending module sends a first HTTP request to the response header for the short message to be tested, if the time for the response header to return the response information exceeds a preset time threshold, the request sending module sends a continuous HTTP request according to a sending request time interval, where the sending request time interval is determined by the following formula:

where ΔT represents the request sending time interval; m represents the total number of short messages to be detected by the crawler module; ΔT₀ represents the initial default value of the request sending time interval; β₁, β₂ and β₃ represent time interval adjustment coefficients, where β₁ ranges from 0.581 to 0.673, β₂ ranges from 0.424 to 0.537, and β₃ ranges from 0.615 to 0.736; T_i represents, when the crawler module detects the i-th short message to be detected, the interval between the request sending module sending the first HTTP request to the response head and obtaining the response from the response head; T_max represents the maximum of these intervals over the m short messages detected by the crawler module; and T_min represents the minimum of these intervals over the m short messages detected by the crawler module.

7. The method of claim 1, wherein the determining whether the short message to be detected is a spam short message by using the domain name information comprises:

8. The method of claim 1, wherein determining whether the short message to be detected is a spam short message according to the IP address comprises:

9. The method of claim 1, wherein determining whether the short message to be detected is a spam short message by using the proportion of each word in spam information comprises:

querying the proportion of each word in the spam information;

10. The method of claim 9, wherein the weighted sum K is obtained by the following formula:

wherein W₁, W₂, ..., W_n represent the proportions, in the spam information, of words 1 to n in the text information of the HTML page; and C₁, C₂, ..., C_n represent the number of times words 1 to n appear in the text information of the HTML page.