CN112287198A - Spam short message detection method based on crawler technology - Google Patents

Spam short message detection method based on crawler technology

Info

Publication number
CN112287198A
CN112287198A CN202011173377.9A CN202011173377A CN112287198A CN 112287198 A CN112287198 A CN 112287198A CN 202011173377 A CN202011173377 A CN 202011173377A CN 112287198 A CN112287198 A CN 112287198A
Authority
CN
China
Prior art keywords
response
information
short message
detected
spam
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011173377.9A
Other languages
Chinese (zh)
Other versions
CN112287198B (en)
Inventor
汤增丰
张长根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Winnerlook Information Technology Co ltd
Original Assignee
Shanghai Winnerlook Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Winnerlook Information Technology Co ltd
Priority to CN202011173377.9A
Publication of CN112287198A
Application granted
Publication of CN112287198B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a spam short message detection method based on crawler technology, comprising the following steps: extracting a valid URL address from the short message to be detected through a preset regular expression; using a crawler module to acquire the URL address and the HTML page text information of the final HTML page to which the URL address points; extracting the domain name information from the URL address of the HTML page through a preset regular expression for acquiring domain names, and judging from the domain name information whether the short message to be detected is a spam short message; resolving the domain name information through DNS to acquire the corresponding IP address, and judging from the IP address whether the short message to be detected is a spam short message; and segmenting the text information of the HTML page into words, and judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information.

Description

Spam short message detection method based on crawler technology
Technical Field
The invention relates to the field of communication and network search, and particularly provides a spam message detection method based on a crawler technology.
Background
With the continuous development of mobile internet technology, commercial short messages are used more and more widely as the publicity medium over which operators have the strongest control. Commercial short messages offer rapid propagation, a high delivery rate, a high-end audience, little advertising lag, a high lookup rate for advertisement content, the ability to segment target customers and markets, and low cost. These advantages are often exploited by unscrupulous merchants and criminals, who use short messages for illegal marketing, malicious fraud, and the promotion of pornography, gambling and drugs. This causes a very bad social impact and seriously damages the healthy development of commercial short messages.
In existing technical schemes, spam short message detection mainly analyzes the sender and recipient numbers and the short message content, and does not further analyze and identify the URL address information contained in the content, so the success rate of spam short message identification is low.
Disclosure of Invention
The invention provides a spam short message detection method based on crawler technology, which addresses the problem that a spam short message disguised as a normal short message, whose URL address points to an illegal or spam information page, cannot be effectively identified. The invention adopts the following technical scheme:
the invention provides a spam short message detection method based on a crawler technology, which comprises the following steps:
extracting an effective URL address in the short message to be detected through a preset regular expression;
acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module;
extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information;
analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address;
segmenting the text information of the HTML page into words, and judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information.
Further, in the process of acquiring the HTML page pointed to by the URL address, if the HTTP request is redirected multiple times, the crawler module follows the redirects until the HTML page no longer jumps automatically to another HTML page; that page is the final page visible to the user.
Further, the regular expression is:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
wherein the regular expression matches the http[s]?:// prefix followed by one or more of the permitted URL characters.
Further, the method for acquiring the URL address and the HTML page text information of the final HTML page pointed by the URL address by using the crawler module comprises the following steps:
sending, by the crawler module, an HTTP request for the URL address to the responding server;
when a first type of response is received from the responding server, the crawler module initiates a further HTTP request to the new URL address contained in the response header; wherein the first type of response is an HTTP redirection message;
when a second type of response is received from the responding server, the crawler module simulates the behavior of a browser, executes the JavaScript script, and initiates a new HTTP request to the URL address specified in the script; wherein the second type of response is page information containing a JavaScript script;
and the crawler module continues to process first-type and second-type responses until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
Further, the crawler module comprises:
the request sending module, which sends an HTTP request for the URL address to the responding server;
the first-type response processing module, which initiates a further HTTP request to the new URL address contained in the response header when a first type of response is received from the responding server; wherein the first type of response is an HTTP redirection message;
the second-type response processing module, which, when a second type of response is received from the responding server, simulates the behavior of a browser, executes the JavaScript script, and initiates a new HTTP request to the URL address specified in the script; wherein the second type of response is page information containing a JavaScript script;
and the HTML information acquisition module, which controls the first-type and second-type response processing modules to keep processing first-type and second-type responses respectively until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
Further, after the request sending module sends the first HTTP request to the responding server for a short message to be detected, if the time taken for the response information to be returned exceeds a preset time threshold, the request sending module sends repeated HTTP requests at a request sending time interval, where the request sending time interval is determined by the following formula:
Figure BDA0002747999950000021
where ΔT represents the request sending time interval; m represents the total number of short messages to be detected by the crawler module; ΔT_0 represents the initial default value of the request sending time interval; β_1, β_2 and β_3 represent time interval adjustment coefficients, where β_1 ranges from 0.581 to 0.673, β_2 ranges from 0.424 to 0.537, and β_3 ranges from 0.615 to 0.736; T_i represents, for the i-th short message detected by the crawler module, the interval between the request sending module sending the first HTTP request to the responding server and the response being obtained; T_max represents the maximum of these intervals over the m short messages detected by the crawler module; and T_min represents the minimum of these intervals over the m short messages detected by the crawler module.
Further, the determining whether the short message to be detected is a spam short message by using the domain name information includes:
inquiring ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is effective filing information or not;
and when the ICP filing information is invalid filing information, judging that the short message to be detected is a spam short message.
Further, determining whether the short message to be detected is a spam short message according to the IP address comprises:
querying the attribution information of the IP address;
judging from the attribution information whether the IP address is located in mainland China;
and when the IP address is not located in mainland China, judging that the short message to be detected is a spam short message.
Further, the step of determining whether the short message to be detected is a spam short message by using the proportion of each word in the spam message comprises the following steps:
looking up the proportion of each word in spam information;
computing a weighted sum of the proportions of all words in spam information to obtain a weighted sum value K;
judging whether the weighted summation value K is greater than a preset weighted summation value;
and when the weighted sum value K is greater than the preset value of weighted sum, judging that the short message to be detected is a spam short message.
Further, the weighted sum value K is obtained by the following formula:
Figure BDA0002747999950000031
wherein W_1, W_2, ..., W_n represent the proportions in spam information of words 1 to n in the text information of the HTML page; C_1, C_2, ..., C_n represent the number of times words 1 to n appear in the text information of the HTML page; and the proportion of each word in spam information is a preset value.
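The formula image for K is not reproduced in this text, so its exact form is unknown. As a minimal sketch only, the following Python code assumes the simplest form consistent with the definitions above, K = W_1*C_1 + ... + W_n*C_n. The names weighted_sum, is_spam_by_words, spam_weight and threshold are hypothetical and do not appear in the patent.

```python
from collections import Counter


def weighted_sum(words, spam_weight):
    """Assumed form of K: sum over distinct words of W_i * C_i.

    words:       list of words segmented from the HTML page text
    spam_weight: dict mapping a word to its preset proportion W_i in spam
                 information (words missing from the dict count as 0)
    """
    counts = Counter(words)  # C_i: occurrences of each word on the page
    return sum(spam_weight.get(word, 0.0) * count for word, count in counts.items())


def is_spam_by_words(words, spam_weight, threshold):
    """Judge spam when K is strictly greater than the preset value."""
    return weighted_sum(words, spam_weight) > threshold
```

If the actual formula normalizes K, for example by the total word count, only weighted_sum would need to change.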
The invention has the beneficial effects that:
the spam message detection method based on the crawler technology can effectively and quickly identify spam messages under the condition that the spam messages are disguised as normal messages, greatly improves the efficiency and accuracy of spam message identification, and effectively avoids the condition that the spam messages cannot be effectively identified after being disguised. The success rate of identifying and detecting the spam messages is improved. Meanwhile, the short messages to be detected are detected through various information detection ways, so that the efficiency, accuracy and success rate of spam short message detection can be effectively improved, and the number of spam short message detection and identification failures is effectively reduced.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, rather than all embodiments. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It is noted that the terms first, second and the like in the description and in the claims, and in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The spam message detection method based on the crawler technology provided by the embodiment of the invention is shown in figure 1, and comprises the following steps:
S1, extracting a valid URL address from the short message to be detected through a preset regular expression;
S2, acquiring the URL address and the HTML page text information of the final HTML page pointed to by the URL address by using a crawler module; in the process of acquiring the HTML page pointed to by the URL address, if the HTTP request is redirected multiple times, the crawler module follows the redirects until the HTML page no longer jumps automatically to another HTML page, that page being the final page visible to the user;
S3, extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information;
S4, resolving the domain name information through DNS, acquiring the IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message according to the IP address;
S5, segmenting the text information of the HTML page into words, and judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information.
The working principle of the technical scheme is as follows: firstly, when the short message to be detected contains a valid URL address, extracting the URL address in the short message to be detected through a preset regular expression for matching the URL, and when the short message does not contain the URL address, processing by adopting other spam short message detection methods. Then, acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module; subsequently, extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information; then, analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address; and finally, segmenting word information of an HTML page to obtain each corresponding word in the word information of the HTML page, and judging whether the short message to be detected is a spam short message or not by utilizing the proportion of each word in the spam information. The word segmentation process of the character information of the HTML page comprises the following steps: firstly, the information content in the short message is coded according to a Universal Character Set (UCS), and then word segmentation is performed according to a horizontal arrangement mode of characters from left to right.
The effect of the above technical scheme is as follows: the spam message detection method based on the crawler technology provided by the embodiment can effectively and quickly identify spam messages under the condition that the spam messages are disguised as normal messages, so that the efficiency and the accuracy of spam message identification are improved to a great extent, and the condition that the spam messages cannot be effectively identified after being disguised is effectively avoided. The success rate of identifying and detecting the spam messages is improved.
In an embodiment of the present invention, the regular expression is:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
wherein the regular expression matches the http[s]?:// prefix followed by one or more of the permitted URL characters. For example, a valid URL address in a short message is necessarily a domain name or information of the form http:// + IP:port; with the above regular expression, it can be matched through the http[s]?:// fragment or the [$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]) fragment.
The effect of the above technical scheme is as follows: through the regular expression, the extraction accuracy and efficiency of the effective URL address in the short message to be detected can be improved, and the problem that the URL address is extracted in a missing mode can be effectively avoided. Therefore, the spam message detection accuracy is improved, and the spam message detection error rate is effectively reduced.
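As an illustration of the extraction step, the following Python sketch applies the regular expression given above with the standard re module. The function name extract_urls and the sample messages are hypothetical.

```python
import re

# The URL pattern quoted in the text, used unchanged.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_urls(message):
    """Return every URL address found in a short message, in order."""
    return URL_PATTERN.findall(message)


if __name__ == "__main__":
    print(extract_urls("Your parcel is waiting: https://t.example/abc123 reply T to stop"))
    print(extract_urls("No link in this message"))
```

Because the character class [$-_@.&+] is a range from $ to _, it also covers characters such as / : ? = and the period, so a URL immediately followed by sentence punctuation will have that punctuation included in the match.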
In an embodiment of the present invention, obtaining, by using a crawler module, a URL address and HTML page text information of a final HTML page to which the URL address points includes:
S201, sending, by the crawler module, an HTTP request for the URL address to the responding server;
S202, when a first type of response is received from the responding server, the crawler module initiates a further HTTP request to the new URL address contained in the response header; wherein the first type of response is an HTTP redirection message;
S203, when a second type of response is received from the responding server, the crawler module simulates the behavior of a browser, executes the JavaScript script, and initiates a new HTTP request to the URL address specified in the script; wherein the second type of response is page information containing a JavaScript script;
S204, the crawler module continues to process first-type and second-type responses until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
Wherein the crawler module comprises:
the request sending module, which sends an HTTP request for the URL address to the responding server;
the first-type response processing module, which initiates a further HTTP request to the new URL address contained in the response header when a first type of response is received from the responding server; wherein the first type of response is an HTTP redirection message;
the second-type response processing module, which, when a second type of response is received from the responding server, simulates the behavior of a browser, executes the JavaScript script, and initiates a new HTTP request to the URL address specified in the script; wherein the second type of response is page information containing a JavaScript script;
and the HTML information acquisition module, which controls the first-type and second-type response processing modules to keep processing first-type and second-type responses respectively until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
The working principle of the technical scheme is as follows: the URL address and the HTML page text information of the final HTML page pointed to by the URL address in step S2 are acquired using crawler technology. When the crawler sends an HTTP request for the URL address, it may receive three types of response. The first type is an HTTP redirection message; on receiving it, the crawler initiates a further HTTP request to the new URL address contained in the response header. The second type is page information containing a JavaScript script; the crawler then simulates browser behavior, executes the script, and initiates a new HTTP request to the URL address specified in the script. First-type and second-type responses each trigger a new HTTP request, until a third-type response is received. The third type is the final response message, which contains the HTML page content presented to the end user.
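The three response types can be sketched as a loop. The Python code below is a simplified illustration, not the patented crawler module: instead of executing JavaScript in a simulated browser, it only recognizes literal assignments to window.location with a regular expression, and the network call is passed in as a hypothetical fetch function so that the logic can be exercised offline. The names resolve_final_page, fetch, JS_REDIRECT and max_hops are assumptions.

```python
import re
from urllib.parse import urljoin

# Simplified stand-in for "executing the script": it only catches literal
# assignments such as  window.location.href = "http://...";
JS_REDIRECT = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def resolve_final_page(url, fetch, max_hops=10):
    """Follow first-type and second-type responses until a final response.

    fetch(url) must return a tuple (status_code, headers_dict, body_text).
    Returns (final_url, html_text).
    """
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        # First type: HTTP redirection, new URL in the Location header.
        if 300 <= status < 400 and "Location" in headers:
            url = urljoin(url, headers["Location"])
            continue
        # Second type: page whose script sends the browser elsewhere.
        match = JS_REDIRECT.search(body)
        if match:
            url = urljoin(url, match.group(1))
            continue
        # Third type: final page shown to the end user.
        return url, body
    raise RuntimeError("too many redirects")
```

A production crawler would more likely drive a headless browser so that arbitrary scripts are actually executed; the max_hops limit guards against redirect loops.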
The effect of the above technical scheme is as follows: the information acquisition efficiency and the information acquisition accuracy of the URL address and the HTML page text information are effectively improved. The problem that the URL address and the HTML page text information are not accurately obtained due to problems of HTTP redirection, page information containing Javascript scripts and the like is solved. Meanwhile, the URL address and the HTML page text information of the final HTML page can be effectively and quickly acquired through the method, and the problem that the acquired URL address and the HTML page text information do not belong to the HTML page content displayed to the terminal user is effectively prevented.
In an embodiment of the present invention, after the request sending module sends the first HTTP request to the responding server for a short message to be detected, if the time taken for the response information to be returned exceeds a preset time threshold, the request sending module sends repeated HTTP requests at a request sending time interval, where the request sending time interval is determined by the following formula:
Figure BDA0002747999950000071
where ΔT represents the request sending time interval; m represents the total number of short messages to be detected by the crawler module; ΔT_0 represents the initial default value of the request sending time interval; β_1, β_2 and β_3 represent time interval adjustment coefficients, where β_1 ranges from 0.581 to 0.673, β_2 ranges from 0.424 to 0.537, and β_3 ranges from 0.615 to 0.736; T_i represents, for the i-th short message detected by the crawler module, the interval between the request sending module sending the first HTTP request to the responding server and the response being obtained; T_max represents the maximum of these intervals over the m short messages detected by the crawler module; and T_min represents the minimum of these intervals over the m short messages detected by the crawler module.
The working principle of the technical scheme is as follows: after the request sending module sends the first HTTP request to the responding server for a short message to be detected, if the time taken for the response information to be returned exceeds a preset time threshold, the request sending module resends the HTTP request to the responding server repeatedly at the request sending time interval. In this embodiment, the request sending time interval is derived from the response interval durations observed after the first HTTP request for each of the m short messages detected by the crawler module, including the maximum and minimum of those durations.
The effect of the above technical scheme is as follows: the efficiency with which the crawler obtains the URL address and the HTML page text information of the final HTML page is effectively improved. It avoids the spam short message detection flow stalling, or detection efficiency dropping, when the responding server fails to receive a request and therefore returns no response. The request sending time interval obtained through the formula also improves the timeliness of HTTP request sending. Because the interval is determined from historical timing data gathered during short message detection, it matches the operation of the crawler module more closely and improves the module's stability, avoiding the unstable operation that results when too short an interval makes the request frequency too high.
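The retry behavior can be sketched in Python as follows. Because the formula image for ΔT is not reproduced in this text, the sketch does not compute the interval: it only gathers the inputs the formula is described as using (m, T_max, T_min) and takes ΔT as a parameter supplied by the caller. The names response_time_stats, send_with_retries, send and max_attempts are hypothetical.

```python
import time


def response_time_stats(durations):
    """Return (m, T_max, T_min) for the observed response intervals T_i."""
    return len(durations), max(durations), min(durations)


def send_with_retries(send, timeout, interval, max_attempts=3, sleep=time.sleep):
    """Resend a request at a fixed interval until a response arrives.

    send(timeout) is expected to return None when no response arrives
    within the preset time threshold. interval is the request sending
    time interval (delta T), supplied by the caller.
    """
    for attempt in range(max_attempts):
        response = send(timeout)
        if response is not None:
            return response
        if attempt < max_attempts - 1:
            sleep(interval)  # wait delta T before the next request
    return None
```

Passing sleep as a parameter is a design choice that keeps the function testable without real delays.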
In an embodiment of the present invention, the determining whether the short message to be detected is a spam short message by using domain name information includes:
S301, querying the ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is valid filing information;
S302, when the ICP filing information is invalid filing information, judging that the short message to be detected is a spam short message.
The working principle of the technical scheme is as follows: the domain name contained in the URL address of the HTML page is extracted through a preset regular expression for matching domain names, and the filing information for the extracted domain name is queried from the government service platform of the Ministry of Industry and Information Technology or from service platforms provided by other legally authorized organizations. When no filing information can be found, or the filing information found is identified as invalid or illegal, the short message to be detected is judged to be a spam short message.
The effect of the above technical scheme is as follows: detecting the short message by checking filing information effectively improves the efficiency, accuracy and success rate of spam short message detection, and effectively reduces the number of detection and identification failures.
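A minimal Python sketch of the filing check follows. It swaps in urllib.parse.urlsplit for the preset domain-name regular expression, which the text does not quote, and it takes the filing query as a hypothetical lookup_icp function, since no query interface is specified. The record format, a dict with a "valid" key, is likewise an assumption.

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the lower-cased host part of a URL, without the port."""
    return urlsplit(url).hostname


def is_spam_by_icp(url, lookup_icp):
    """Judge spam when no filing is found or the filing is not valid.

    lookup_icp(domain) is a caller-supplied function (hypothetical) that
    returns None when no filing exists, or a dict with a "valid" flag.
    """
    record = lookup_icp(extract_domain(url))
    return record is None or not record.get("valid", False)
```

In use, lookup_icp would wrap whatever filing query service the deployment is authorized to call.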
In one embodiment of the present invention, determining whether the short message to be detected is a spam short message by using the IP address includes:
S401, querying the attribution information of the IP address according to the IP address;
S402, judging, according to the attribution information, whether the attribution corresponding to the IP address is mainland China;
S403, when the attribution corresponding to the IP address is not mainland China, judging that the short message to be detected is a spam short message.
The working principle of the technical scheme is as follows: the IP address corresponding to the domain name is acquired through a DNS resolution program, the attribution information of the IP address is queried, and whether the attribution is mainland China is judged; if it is not, the short message to be detected is judged to be a spam short message.
The effect of the above technical scheme is as follows: identifying the short message to be detected by checking the attribution of the IP address improves the efficiency, accuracy, and success rate of spam short message detection and reduces the number of detection and identification failures.
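Steps S401 to S403 can be sketched as below. The patent does not name a DNS library or an IP attribution database, so this sketch assumes the system resolver via `socket.gethostbyname` and a caller-supplied `locate` function; the region label `"CN-mainland"` and the handling of unknown addresses as non-mainland are likewise assumptions made for illustration.

```python
import socket


def resolve_ip(domain):
    """Resolve a domain name to an IPv4 address with the system resolver."""
    return socket.gethostbyname(domain)


def is_spam_by_ip_location(domain, locate, resolve=resolve_ip):
    """S401-S403: spam when the IP attribution is not mainland China.

    `locate` is a hypothetical callable mapping an IP string to a region
    label; an unknown address (None) is treated as non-mainland here.
    """
    ip = resolve(domain)
    region = locate(ip)
    return region != "CN-mainland"


# Usage with stand-in data (documentation-only IP ranges, no network needed):
attribution_table = {"192.0.2.10": "CN-mainland", "203.0.113.7": "other"}
verdict = is_spam_by_ip_location(
    "landing.example",
    attribution_table.get,
    resolve=lambda _domain: "203.0.113.7",
)
```

Passing the resolver and the attribution lookup as parameters keeps the judgment rule separate from the data sources, which the patent leaves open.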
In an embodiment of the present invention, determining whether the short message to be detected is a spam short message by using the proportion of each word in spam information includes:
S501, querying the proportion of each word in spam information;
S502, performing a weighted summation over the proportions of all words in spam information to obtain a weighted sum value K;
S503, judging whether the weighted sum value K is greater than a preset weighted sum value;
S504, when the weighted sum value K is greater than the preset weighted sum value, judging that the short message to be detected is a spam short message.
The weighted sum value K is obtained by the following formula:
$$K = \sum_{i=1}^{n} W_i \cdot C_i$$
where W_1, W_2, ..., W_n represent the proportions in spam information of words 1 to n in the text information of the HTML page, and C_1, C_2, ..., C_n represent the number of times words 1 to n appear in the text information of the HTML page. The proportion of each word in spam information is a manually preset value and defaults to zero.
The working principle of the technical scheme is as follows: the text of the HTML page is segmented to obtain n words; the proportion of each word in spam information, which is a preset value, is queried; the proportions of all words are weighted and summed to obtain K; and K is compared with a preset value. If K is greater than the preset value, the short message to be detected is judged to be a spam short message.
The effect of the above technical scheme is as follows: detecting the short message to be detected through the weighted sum of the spam proportions of the words in the HTML page text improves the efficiency, accuracy, and success rate of spam short message detection and reduces the number of detection and identification failures. In addition, the weighted sum value obtained through the formula improves the accuracy with which the weight of the words in spam information is determined, which further improves the accuracy of word-based spam short message detection.
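Steps S501 to S504 can be sketched as follows, assuming that K is the sum over all distinct words of the word's spam proportion multiplied by its occurrence count, as the definitions of W and C suggest. The word segmentation itself is outside this sketch: the function takes an already segmented word list, and the weights and threshold in the usage lines are made-up illustrative values.

```python
from collections import Counter


def weighted_sum(words, weights):
    """Compute K as the sum of W_i * C_i over the distinct words.

    `weights` maps a word to its preset spam proportion W_i; words that are
    not listed default to zero, matching the default stated in the text.
    """
    counts = Counter(words)  # C_i: occurrences of each word in the page text
    return sum(weights.get(word, 0.0) * count for word, count in counts.items())


def is_spam_by_words(words, weights, threshold):
    """S503/S504: spam when K exceeds the preset weighted sum value."""
    return weighted_sum(words, weights) > threshold


# Usage with illustrative values: "free" appears twice, "prize" once.
page_words = ["free", "prize", "click", "free"]
spam_weights = {"free": 0.5, "prize": 0.25}
k = weighted_sum(page_words, spam_weights)  # 0.5 * 2 + 0.25 * 1
```

Counting occurrences first and multiplying once per distinct word gives the same result as adding the weight on every occurrence, while mirroring the W and C terms of the definition directly.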
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A spam short message detection method based on crawler technology, characterized by comprising the following steps:
extracting a valid URL address from the short message to be detected through a preset regular expression;
acquiring, by using a crawler module, the URL address of the final HTML page pointed to by the URL address and the text information of that HTML page;
extracting the domain name information from the URL address corresponding to the HTML page through a preset regular expression for acquiring domain names, and judging whether the short message to be detected is a spam short message by using the domain name information;
resolving the domain name information by a DNS resolution method to acquire the IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message according to the IP address;
segmenting the text information of the HTML page to obtain each word in the text information of the HTML page, and judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information.
2. The method according to claim 1, wherein, if the HTTP request is redirected multiple times while the crawler module acquires the HTML page pointed to by the URL address, the HTML page is the final page visible to the user, which no longer jumps automatically to other HTML pages.
3. The method of claim 1, wherein the regular expression is:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
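The regular expression of claim 3 can be tried out as in the sketch below, which assumes Python's `re` module and a made-up sample message; the helper name `extract_urls` is hypothetical. Note that the character class `[$-_@.&+]` contains the range from `$` to `_`, which also covers characters such as `/`, `:`, `?`, and `=`, so paths and query strings are included in a match.

```python
import re

# The pattern from claim 3, copied verbatim.
URL_RE = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_urls(sms_text):
    """Return every URL-like substring of a short message, in order."""
    # The pattern has only non-capturing groups, so findall yields full matches.
    return URL_RE.findall(sms_text)


# Usage with a made-up message; the match stops at the first whitespace.
sample = "Your parcel is held, visit http://t.example/abc123 to reschedule"
urls = extract_urls(sample)
```

Because commas and closing parentheses are also accepted by the pattern, punctuation that directly follows a URL in a message is captured as part of it; a caller may want to strip trailing punctuation before issuing requests.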
4. The method of claim 1, wherein acquiring, by using the crawler module, the URL address of the final HTML page pointed to by the URL address and the text information of that HTML page comprises:
sending, in crawler mode, an HTTP request pointing to the URL address to a response header;
when a first type of response sent back by the response header is received, the crawler module continues by initiating an HTTP request to the new URL address contained in the response header, wherein the first type of response is an HTTP redirection message;
when a second type of response sent back by the response header is received, the crawler module simulates the behavior of a browser, executes the JavaScript, and initiates a new HTTP request to the URL address specified in the JavaScript, wherein the second type of response is page information containing JavaScript;
and the crawler module continues to process first-type and second-type responses until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
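The three response types of claim 4 can be illustrated with the simplified loop below. It is a sketch under stated assumptions, not the claimed crawler: it does not execute JavaScript as the claim describes, but only recognizes a literal `window.location = "..."` or `location.href = "..."` assignment with a regular expression (a headless browser would be needed for real script execution). The `fetch` callable, its `(status, headers, body)` return shape, and the `max_hops` safety limit are all hypothetical additions.

```python
import re
from urllib.parse import urljoin

# Assumption: only a literal location assignment is treated as a JS redirect.
JS_REDIRECT_RE = re.compile(
    r"""(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def resolve_final_page(url, fetch, max_hops=10):
    """Follow redirects until a page with no further jump is reached.

    `fetch(url)` is a caller-supplied function returning
    (status_code, headers_dict, body_text).
    """
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        # First type of response: HTTP redirection carrying a new URL.
        if 300 <= status < 400 and "Location" in headers:
            url = urljoin(url, headers["Location"])
            continue
        # Second type of response: page whose script points to another URL.
        match = JS_REDIRECT_RE.search(body)
        if match:
            url = urljoin(url, match.group(1))
            continue
        # Third type of response: the final page shown to the end user.
        return url, body
    raise RuntimeError("too many redirects")


# Usage with canned responses instead of real network traffic.
pages = {
    "http://s.example/a": (302, {"Location": "http://mid.example/b"}, ""),
    "http://mid.example/b": (200, {}, '<script>window.location = "/landing";</script>'),
    "http://mid.example/landing": (200, {}, "<html><body>Win a prize</body></html>"),
}
final_url, final_html = resolve_final_page("http://s.example/a", pages.__getitem__)
```

The returned URL and body correspond to the two outputs the method needs from the crawler: the URL address of the final HTML page and the page text that is later segmented into words.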
5. The method of claim 1, wherein the crawler module comprises:
a request sending module, used for sending, in crawler mode, an HTTP request pointing to the URL address to the response header;
a first-type response processing module, used for continuing by initiating an HTTP request to the new URL address contained in the response header when a first type of response sent back by the response header is received, wherein the first type of response is an HTTP redirection message;
a second-type response processing module, used for simulating the behavior of a browser when a second type of response sent back by the response header is received, executing the JavaScript, and initiating a new HTTP request to the URL address specified in the JavaScript, wherein the second type of response is page information containing JavaScript;
and an HTML information acquisition module, used for controlling the first-type response processing module and the second-type response processing module to continue processing first-type and second-type responses respectively until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
6. The method according to claim 5, wherein after the request sending module sends the first HTTP request to the response header for the short message to be detected, if the time taken for the response header to return response information exceeds a preset time threshold, the request sending module continues to send HTTP requests at a sending request time interval, where the sending request time interval is determined by the following formula:
Figure FDA0002747999940000021
where ΔT represents the sending request time interval; m represents the total number of short messages to be detected by the crawler module; ΔT_0 represents the initial default value of the sending request time interval; β_1, β_2 and β_3 represent time interval adjustment coefficients, where β_1 ranges from 0.581 to 0.673, β_2 ranges from 0.424 to 0.537, and β_3 ranges from 0.615 to 0.736; T_i represents, when the crawler module detects the i-th short message to be detected, the interval between the request sending module sending the first HTTP request to the response header and the response header returning a response; T_max represents the maximum of this interval over the m short messages detected by the crawler module; and T_min represents the minimum of this interval over the m short messages detected by the crawler module.
7. The method of claim 1, wherein determining whether the short message to be detected is a spam short message by using the domain name information comprises:
querying the ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is valid filing information;
and when the ICP filing information is invalid filing information, judging that the short message to be detected is a spam short message.
8. The method of claim 1, wherein determining whether the short message to be detected is a spam short message according to the IP address comprises:
querying the attribution information of the IP address according to the IP address;
judging, according to the attribution information, whether the attribution corresponding to the IP address is mainland China;
and when the attribution corresponding to the IP address is not mainland China, judging that the short message to be detected is a spam short message.
9. The method of claim 1, wherein determining whether the short message to be detected is a spam short message by using the proportion of each word in spam information comprises:
querying the proportion of each word in spam information;
performing a weighted summation over the proportions of all words in spam information to obtain a weighted sum value K;
judging whether the weighted sum value K is greater than a preset weighted sum value;
and when the weighted sum value K is greater than the preset weighted sum value, judging that the short message to be detected is a spam short message.
10. The method of claim 9, wherein the weighted sum K is obtained by the following formula:
$$K = \sum_{i=1}^{n} W_i \cdot C_i$$
where W_1, W_2, ..., W_n represent the proportions in spam information of words 1 to n in the text information of the HTML page, and C_1, C_2, ..., C_n represent the number of times words 1 to n appear in the text information of the HTML page.
CN202011173377.9A 2020-10-28 2020-10-28 Junk short message detection method based on crawler technology Active CN112287198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011173377.9A CN112287198B (en) 2020-10-28 2020-10-28 Junk short message detection method based on crawler technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011173377.9A CN112287198B (en) 2020-10-28 2020-10-28 Junk short message detection method based on crawler technology

Publications (2)

Publication Number Publication Date
CN112287198A true CN112287198A (en) 2021-01-29
CN112287198B CN112287198B (en) 2023-12-01

Family

ID=74372913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011173377.9A Active CN112287198B (en) 2020-10-28 2020-10-28 Junk short message detection method based on crawler technology

Country Status (1)

Country Link
CN (1) CN112287198B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821754A (en) * 2021-09-18 2021-12-21 上海观安信息技术股份有限公司 Sensitive data interface crawler identification method and device
CN115623485A (en) * 2022-12-20 2023-01-17 杭州孝道科技有限公司 Short message bombing detection method, system, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110159895A1 (en) * 2009-12-30 2011-06-30 Research In Motion Limited Method and system for allowing varied functionality based on multiple transmissions
CN105956175A (en) * 2016-05-24 2016-09-21 考拉征信服务有限公司 Webpage content crawling method and device
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial fishing webpage detection method based on Web page characteristics
CN111083705A (en) * 2019-12-10 2020-04-28 平安国际智慧城市科技股份有限公司 Group-sending fraud short message detection method, device, server and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821754A (en) * 2021-09-18 2021-12-21 上海观安信息技术股份有限公司 Sensitive data interface crawler identification method and device
CN113821754B (en) * 2021-09-18 2024-08-16 上海观安信息技术股份有限公司 Method and device for identifying crawler of sensitive data interface
CN115623485A (en) * 2022-12-20 2023-01-17 杭州孝道科技有限公司 Short message bombing detection method, system, server and storage medium

Also Published As

Publication number Publication date
CN112287198B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN111401416B (en) Abnormal website identification method and device and abnormal countermeasure identification method
CN109274632B (en) Website identification method and device
US9123027B2 (en) Social engineering protection appliance
CN101504673B (en) Method and system for recognizing doubtful fake website
CN104954372B (en) A kind of evidence obtaining of fishing website and verification method and system
CN109302434B (en) Prompt message pushing method and device, service platform and storage medium
KR102355973B1 (en) Apparatus and method for detecting smishing message
CN105119909B (en) A kind of counterfeit website detection method and system based on page visual similarity
CN102638448A (en) Method for judging phishing websites based on non-content analysis
US20090055928A1 (en) Method and apparatus for providing phishing and pharming alerts
CN103297270A (en) Application type recognition method and network equipment
CN102647408A (en) Method for judging phishing website based on content analysis
CN112287198B (en) Junk short message detection method based on crawler technology
CN111147489B (en) Link camouflage-oriented fishfork attack mail discovery method and device
CN109547426B (en) Service response method and server
CN106446113A (en) Mobile big data analysis method and device
CN108449368A (en) A kind of application layer attack detection method, device and electronic equipment
CN107426136B (en) Network attack identification method and device
CN107979845A (en) The indicating risk method and apparatus of wireless access point
JP4564916B2 (en) Phishing fraud countermeasure method, terminal, server and program
Sampat et al. Detection of phishing website using machine learning
CN108804501A (en) A kind of method and device of detection effective information
US20160285905A1 (en) System and method for detecting mobile cyber incident
CN113709748B (en) Method for identifying virus short message based on sending behavior and website characteristics
CN114301711B (en) Anti-riot brushing method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant