CN112287198A - Spam short message detection method based on crawler technology
- Publication number
- CN112287198A (application number CN202011173377.9A)
- Authority
- CN
- China
- Prior art keywords
- response
- information
- short message
- detected
- spam
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/04—Real-time or near real-time messaging, e.g. instant messaging [IM]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a spam short message detection method based on a crawler technology, which comprises the following steps: extracting an effective URL address in the short message to be detected through a preset regular expression; acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module; extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information; analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address; segmenting word information of an HTML page to obtain each corresponding word in the word information of the HTML page, and judging whether the short message to be detected is a spam short message or not by utilizing the proportion of each word in spam information.
Description
Technical Field
The invention relates to the field of communication and network search, and particularly provides a spam message detection method based on a crawler technology.
Background
With the continuous development of mobile internet technology, commercial short messages are used more and more widely as the promotional medium that operators can control most tightly. Commercial short messages offer rapid propagation, a high delivery rate, advanced terminals, little advertising delay, a high read rate for advertising content, the ability to segment target customers and markets, and low cost. These advantages are often exploited by unscrupulous merchants and lawbreakers, who use short messages for illegal marketing, malicious fraud, and the promotion of pornography, gambling, and drugs. This causes serious social harm and severely damages the healthy development of commercial short messages.
In existing technical schemes, spam short message detection mainly analyzes the sending and receiving numbers and the short message content, and does not further analyze and identify the URL address information contained in that content, so the success rate of spam identification is low.
Disclosure of Invention
The invention provides a spam short message detection method based on crawler technology. It addresses the problem that a spam short message disguised as a normal message, whose URL address points to an illegal or spam information page, cannot be effectively identified. The invention adopts the following technical scheme:
the invention provides a spam short message detection method based on a crawler technology, which comprises the following steps:
extracting an effective URL address in the short message to be detected through a preset regular expression;
acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module;
extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information;
analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address;
segmenting word information of an HTML page to obtain each corresponding word in the word information of the HTML page, and judging whether the short message to be detected is a spam short message or not by utilizing the proportion of each word in spam information.
Further, if the HTTP request is redirected multiple times while the HTML page pointed to by the URL is being acquired, the crawler module follows the jumps until it reaches an HTML page that does not automatically jump to another HTML page, that page being the final page visible to the user.
Further, the regular expression is:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
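wherein the regular expression consists of the fragment http[s]?:// followed by the fragment (?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+.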
Further, the method for acquiring the URL address and the HTML page text information of the final HTML page pointed by the URL address by using the crawler module comprises the following steps:
sending, by the crawler, an HTTP request for the URL address to the responding end;
when a first-type response is returned by the responding end, the crawler module initiates a further HTTP request to the new URL address contained in the response header, wherein a first-type response is an HTTP redirection message;
when a second-type response is returned by the responding end, the crawler module simulates browser behavior, executes the JavaScript, and initiates a new HTTP request to the URL address specified in the JavaScript, wherein a second-type response is page information containing JavaScript;
and the crawler module keeps processing first-type and second-type responses until a third-type response is received, wherein the third-type response is the final response, and the received response information contains the HTML page content displayed to the end user.
Further, the crawler module comprises:
a request sending module, used for sending, by the crawler, an HTTP request for the URL address to the responding end;
a first-type response processing module, used for initiating a further HTTP request to the new URL address contained in the response header when a first-type response is returned by the responding end, wherein a first-type response is an HTTP redirection message;
a second-type response processing module, used for simulating browser behavior when a second-type response is returned by the responding end, executing the JavaScript, and initiating a new HTTP request to the URL address specified in the JavaScript, wherein a second-type response is page information containing JavaScript;
and an HTML information acquisition module, used for controlling the first-type and second-type response processing modules to keep processing first-type and second-type responses respectively until a third-type response is received, wherein the third-type response is the final response, and the received response information contains the HTML page content displayed to the end user.
Further, after the request sending module sends the first HTTP request to the responding end for the short message to be detected, if the time taken by the responding end to return response information exceeds a preset time threshold, the request sending module sends further HTTP requests at a request sending time interval, which is determined by the following formula:
where $\Delta T$ represents the request sending time interval; $m$ represents the total number of short messages to be detected by the crawler module; $\Delta T_0$ represents the initial default value of the request sending time interval; $\beta_1$, $\beta_2$ and $\beta_3$ represent time interval adjustment coefficients, with $\beta_1$ in the range 0.581-0.673, $\beta_2$ in the range 0.424-0.537, and $\beta_3$ in the range 0.615-0.736; $T_i$ represents the interval between the request sending module sending the first HTTP request to the responding end and obtaining its response when the crawler module detects the $i$-th short message to be detected; $T_{\max}$ and $T_{\min}$ represent, respectively, the maximum and the minimum of that interval over the $m$ short messages detected by the crawler module.
Further, the determining whether the short message to be detected is a spam short message by using the domain name information includes:
querying the ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is valid;
and when the ICP filing information is invalid, judging that the short message to be detected is a spam short message.
Further, determining whether the short message to be detected is a spam short message according to the IP address comprises:
querying the location information of the IP address;
judging, according to the location information, whether the location corresponding to the IP address is in mainland China;
and when the location corresponding to the IP address is not in mainland China, judging that the short message to be detected is a spam short message.
Further, the step of determining whether the short message to be detected is a spam short message by using the proportion of each word in the spam message comprises the following steps:
querying the proportion of each word in spam information;
weighting and summing the proportions of all words in spam information to obtain a weighted sum value K;
judging whether the weighted sum value K is greater than a preset weighted sum threshold;
and when the weighted sum value K is greater than the preset weighted sum threshold, judging that the short message to be detected is a spam short message.
Further, the weighted sum value K is obtained by the following formula:
wherein $W_1, W_2, \ldots, W_n$ represent the proportions in spam information of words 1 to n of the HTML page text; $C_1, C_2, \ldots, C_n$ represent the number of times words 1 to n appear in the HTML page text; and the proportion of each word in spam information is a preset value.
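The formula for K is not reproduced in this text. As a minimal sketch only, the following assumes the simplest reading of the definitions above, namely K = W_1·C_1 + W_2·C_2 + … + W_n·C_n, with words that have no preset proportion contributing zero. The names `weighted_sum`, `is_spam_by_words`, `spam_weights` and `threshold` are hypothetical and do not come from the patent.

```python
from collections import Counter
from typing import Dict, Iterable


def weighted_sum(words: Iterable[str], spam_weights: Dict[str, float]) -> float:
    """Assumed form of K: sum over distinct words of (preset proportion W_i) * (count C_i)."""
    counts = Counter(words)  # C_i: occurrences of each word in the page text
    # Words without a preset proportion are treated as weight 0 (an assumption).
    return sum(spam_weights.get(word, 0.0) * count for word, count in counts.items())


def is_spam_by_words(words: Iterable[str], spam_weights: Dict[str, float], threshold: float) -> bool:
    """Judge the message as spam when K is strictly greater than the preset threshold."""
    return weighted_sum(words, spam_weights) > threshold


# Illustrative data only: segmented page words and made-up preset proportions.
page_words = ["casino", "bonus", "casino", "hello"]
weights = {"casino": 0.5, "bonus": 0.25}
k_value = weighted_sum(page_words, weights)  # 0.5 * 2 + 0.25 * 1
```

The comparison is strict, following the wording "greater than the preset value"; a page whose K equals the threshold is not flagged in this sketch.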
The invention has the beneficial effects that:
the spam message detection method based on the crawler technology can effectively and quickly identify spam messages under the condition that the spam messages are disguised as normal messages, greatly improves the efficiency and accuracy of spam message identification, and effectively avoids the condition that the spam messages cannot be effectively identified after being disguised. The success rate of identifying and detecting the spam messages is improved. Meanwhile, the short messages to be detected are detected through various information detection ways, so that the efficiency, accuracy and success rate of spam short message detection can be effectively improved, and the number of spam short message detection and identification failures is effectively reduced.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is noted that the terms first, second and the like in the description and in the claims, and in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The spam message detection method based on the crawler technology provided by the embodiment of the invention is shown in figure 1, and comprises the following steps:
S1, extracting a valid URL address from the short message to be detected through a preset regular expression;
S2, acquiring the URL address and HTML page text information of the final HTML page pointed to by the URL address by using a crawler module; if the HTTP request is redirected multiple times while the HTML page pointed to by the URL is being acquired, the crawler module follows the jumps until it reaches an HTML page that does not automatically jump to another HTML page, that page being the final page visible to the user;
S3, extracting the domain name information from the URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message by using the domain name information;
S4, resolving the domain name information through DNS resolution to acquire the IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message according to the IP address;
S5, segmenting the text information of the HTML page into words, and judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information.
The working principle of the technical scheme is as follows: firstly, when the short message to be detected contains a valid URL address, extracting the URL address in the short message to be detected through a preset regular expression for matching the URL, and when the short message does not contain the URL address, processing by adopting other spam short message detection methods. Then, acquiring a URL address and HTML page text information of a final HTML page pointed by the URL address by using a crawler module; subsequently, extracting domain name information in a URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message or not by utilizing the domain name information; then, analyzing the domain name information by a DNS analysis method, acquiring an IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message or not according to the IP address; and finally, segmenting word information of an HTML page to obtain each corresponding word in the word information of the HTML page, and judging whether the short message to be detected is a spam short message or not by utilizing the proportion of each word in the spam information. The word segmentation process of the character information of the HTML page comprises the following steps: firstly, the information content in the short message is coded according to a Universal Character Set (UCS), and then word segmentation is performed according to a horizontal arrangement mode of characters from left to right.
The effect of the above technical scheme is as follows: the spam message detection method based on the crawler technology provided by the embodiment can effectively and quickly identify spam messages under the condition that the spam messages are disguised as normal messages, so that the efficiency and the accuracy of spam message identification are improved to a great extent, and the condition that the spam messages cannot be effectively identified after being disguised is effectively avoided. The success rate of identifying and detecting the spam messages is improved.
In an embodiment of the present invention, the regular expression is:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
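wherein the regular expression consists of the fragment http[s]?:// followed by the fragment (?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+. For example, a valid URL address in a short message necessarily takes the form of http(s):// followed by a domain name, or by an IP address and port; with the regular expression above, such an address is matched by the http[s]?:// fragment and then by the [$-_@.&+], [!*\(\),] and (?:%[0-9a-fA-F][0-9a-fA-F]) fragments.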
The effect of the above technical scheme is as follows: through the regular expression, the extraction accuracy and efficiency of the effective URL address in the short message to be detected can be improved, and the problem that the URL address is extracted in a missing mode can be effectively avoided. Therefore, the spam message detection accuracy is improved, and the spam message detection error rate is effectively reduced.
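The extraction step can be sketched with Python's standard `re` module, using the regular expression exactly as given above. The function name `extract_urls` and the sample message are illustrative assumptions, not part of the patent.

```python
import re
from typing import List

# The regular expression from the description, unchanged.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_urls(message: str) -> List[str]:
    """Return every substring of the short message matched by the URL pattern."""
    # All groups in the pattern are non-capturing, so findall yields whole matches.
    return URL_PATTERN.findall(message)


# Illustrative message with one domain-name URL and one IP-and-port URL.
sample = "Win a prize at http://example.com/a?b=1 now, or https://10.0.0.1:8080/x%20y today"
found = extract_urls(sample)
```

Note that the character class `[$-_@.&+]` contains the range from `$` to `_`, which covers characters such as `/`, `:`, `?` and `=`, and also `.` and `,`. Punctuation that directly follows a URL in the message text is therefore included in the match, and a caller may want to trim it.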
In an embodiment of the present invention, obtaining, by using a crawler module, a URL address and HTML page text information of a final HTML page to which the URL address points includes:
S201, sending, by the crawler, an HTTP request for the URL address to the responding end;
S202, when a first-type response is returned by the responding end, the crawler module initiates a further HTTP request to the new URL address contained in the response header, wherein a first-type response is an HTTP redirection message;
S203, when a second-type response is returned by the responding end, the crawler module simulates browser behavior, executes the JavaScript, and initiates a new HTTP request to the URL address specified in the JavaScript, wherein a second-type response is page information containing JavaScript;
S204, the crawler module keeps processing first-type and second-type responses until a third-type response is received, wherein the third-type response is the final response, and the received response information contains the HTML page content displayed to the end user.
Wherein the crawler module comprises:
a request sending module, used for sending, by the crawler, an HTTP request for the URL address to the responding end;
a first-type response processing module, used for initiating a further HTTP request to the new URL address contained in the response header when a first-type response is returned by the responding end, wherein a first-type response is an HTTP redirection message;
a second-type response processing module, used for simulating browser behavior when a second-type response is returned by the responding end, executing the JavaScript, and initiating a new HTTP request to the URL address specified in the JavaScript, wherein a second-type response is page information containing JavaScript;
and an HTML information acquisition module, used for controlling the first-type and second-type response processing modules to keep processing first-type and second-type responses respectively until a third-type response is received, wherein the third-type response is the final response, and the received response information contains the HTML page content displayed to the end user.
The working principle of the technical scheme is as follows: the URL address and HTML page text information of the final HTML page pointed to by the URL address in step S2 are acquired by using crawler technology. When the crawler sends an HTTP request for the URL address, three types of response may be received. The first type of response is an HTTP redirect message; upon receiving such a response, the crawler initiates a further HTTP request to the new URL address contained in the response header. The second type of response is page information containing JavaScript; in this case the crawler simulates browser behavior, executes the script, and initiates a new HTTP request to the URL address specified in the script. The first and second types of response each trigger the crawler to send a new HTTP request, until a third-type response is received. The third type of response is the final response message, which contains the HTML page content presented to the end user.
The effect of the above technical scheme is as follows: the information acquisition efficiency and the information acquisition accuracy of the URL address and the HTML page text information are effectively improved. The problem that the URL address and the HTML page text information are not accurately obtained due to problems of HTTP redirection, page information containing Javascript scripts and the like is solved. Meanwhile, the URL address and the HTML page text information of the final HTML page can be effectively and quickly acquired through the method, and the problem that the acquired URL address and the HTML page text information do not belong to the HTML page content displayed to the terminal user is effectively prevented.
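The loop over the three response types can be sketched as follows. This is an illustration under stated assumptions, not the patented crawler: the `fetch` callable, the `max_hops` limit and the name `resolve_final_page` are hypothetical. In place of executing JavaScript in a simulated browser, the sketch only recognizes a literal `location = "..."` or `location.href = "..."` assignment, which is a deliberate simplification.

```python
import re
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

Response = Tuple[int, Dict[str, str], str]  # (status code, headers, body)

# Simplification: detect a literal location assignment instead of running the script.
JS_REDIRECT = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def resolve_final_page(url: str, fetch: Callable[[str], Response], max_hops: int = 10) -> Tuple[str, str]:
    """Follow first-type and second-type responses until a final page is reached."""
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        # First type: HTTP redirection, new address carried in the response header.
        if 300 <= status < 400 and "Location" in headers:
            url = urljoin(url, headers["Location"])
            continue
        # Second type: page whose script sends the browser to another address.
        match = JS_REDIRECT.search(body)
        if match:
            url = urljoin(url, match.group(1))
            continue
        # Third type: final response, the page shown to the end user.
        return url, body
    raise RuntimeError("too many redirects")


# Illustrative in-memory "network" so the sketch runs without real HTTP traffic.
pages: Dict[str, Response] = {
    "http://a.example/s": (302, {"Location": "http://b.example/jump"}, ""),
    "http://b.example/jump": (200, {}, "<script>window.location.href = '/landing';</script>"),
    "http://b.example/landing": (200, {}, "<html><body>Welcome</body></html>"),
}
final_url, final_html = resolve_final_page("http://a.example/s", pages.__getitem__)
```

Passing `fetch` in as a parameter keeps the control flow separate from the HTTP client; a real deployment would supply a function that performs the request without following redirects automatically, so that each hop stays visible to the loop.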
In an embodiment of the present invention, after the request sending module sends the first HTTP request to the responding end for the short message to be detected, if the time taken by the responding end to return response information exceeds a preset time threshold, the request sending module sends further HTTP requests at a request sending time interval, which is determined by the following formula:
where $\Delta T$ represents the request sending time interval; $m$ represents the total number of short messages to be detected by the crawler module; $\Delta T_0$ represents the initial default value of the request sending time interval; $\beta_1$, $\beta_2$ and $\beta_3$ represent time interval adjustment coefficients, with $\beta_1$ in the range 0.581-0.673, $\beta_2$ in the range 0.424-0.537, and $\beta_3$ in the range 0.615-0.736; $T_i$ represents the interval between the request sending module sending the first HTTP request to the responding end and obtaining its response when the crawler module detects the $i$-th short message to be detected; $T_{\max}$ and $T_{\min}$ represent, respectively, the maximum and the minimum of that interval over the $m$ short messages detected by the crawler module.
The working principle of the technical scheme is as follows: after the request sending module sends the first HTTP request to the responding end for the short message to be detected, if the time taken by the responding end to return response information exceeds a preset time threshold, the request sending module sends HTTP requests to the responding end again, repeatedly, at the request sending time interval. In this embodiment, the request sending time interval is obtained from parameters including the interval between sending the first HTTP request and obtaining the response of the responding end, and the maximum and minimum of that interval over the m short messages detected by the crawler module.
The effect of the above technical scheme is as follows: the efficiency with which the crawler obtains the URL address and HTML page text information of the final HTML page pointed to by the URL address is effectively improved. The scheme prevents the spam short message detection flow from stalling, or detection efficiency from dropping, when the responding end fails to receive a request and returns no response. Meanwhile, the request sending time interval obtained by the formula improves the timeliness of HTTP request sending. Because the interval is determined from historical timing data gathered during short message detection, it matches the operation of the crawler module more closely and improves the crawler module's stability, avoiding unstable operation caused by an excessively short interval and hence an excessively high request frequency.
In an embodiment of the present invention, the determining whether the short message to be detected is a spam short message by using domain name information includes:
S301, querying the ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is valid;
S302, when the ICP filing information is invalid, judging that the short message to be detected is a spam short message.
The working principle of the technical scheme is as follows: the domain name contained in the URL address of the HTML page is extracted through a preset regular expression for matching domain names. The filing information for the extracted domain name is then queried from the government service platform of the Ministry of Industry and Information Technology, or from a service platform provided by another legally authorized organization. When no filing information can be found, or the filing information found is identified as invalid or illegal, the short message to be detected is judged to be a spam short message.
The effect of the above technical scheme is as follows: detecting the short messages by checking their filing information effectively improves the efficiency, accuracy and success rate of spam short message detection, and effectively reduces the number of detection and identification failures.
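The filing check can be sketched as below. The patent's regular expression for extracting the domain name is not given, so this sketch swaps in the standard library's `urllib.parse.urlsplit` instead. The filing query is represented by a caller-supplied `lookup_icp` function, and the record layout with a `"valid"` key is purely hypothetical; no real filing-platform API is implied.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit


def extract_domain(url: str) -> Optional[str]:
    """Return the lower-cased host part of a URL, without port or credentials."""
    return urlsplit(url).hostname


def is_spam_by_icp(url: str, lookup_icp: Callable[[str], Optional[Dict[str, bool]]]) -> bool:
    """Judge as spam when no filing record is found or the record is not valid."""
    domain = extract_domain(url)
    if domain is None:
        raise ValueError(f"no host in URL: {url!r}")
    record = lookup_icp(domain)  # hypothetical query against a filing platform
    if record is None:
        return True  # no filing information could be found
    return not record.get("valid", False)  # filing exists but is invalid


# Illustrative stand-in for the filing platform.
registry = {"example.cn": {"valid": True}, "expired.cn": {"valid": False}}
```

For example, `is_spam_by_icp("http://expired.cn/", registry.get)` flags the message because the filing record exists but is marked invalid.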
In one embodiment of the present invention, determining whether the short message to be detected is a spam short message by using the IP address includes:
S401, querying the attribution information of the IP address according to the IP address;
S402, judging, according to the attribution information, whether the attribution corresponding to the IP address is mainland China;
S403, when the attribution corresponding to the IP address is not mainland China, judging that the short message to be detected is a spam short message.
The working principle of the technical scheme is as follows: the IP address corresponding to the domain name is acquired through a DNS resolution program, the attribution information of the IP address is queried, and whether the attribution is mainland China is judged; if it is not, the short message to be detected is judged to be a spam short message.
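Steps S401 to S403 can be sketched as follows, under stated assumptions. `socket.gethostbyname` is the standard-library resolver call; the attribution lookup `lookup_region`, the region label `"CN-mainland"`, and the function names are hypothetical, because the patent does not name an attribution database.

```python
import socket
from typing import Callable, Optional


def resolve_ip(domain: str) -> str:
    """Resolve a domain name to an IPv4 address with the system resolver."""
    return socket.gethostbyname(domain)


def is_spam_by_ip(domain: str,
                  lookup_region: Callable[[str], Optional[str]],
                  resolve: Callable[[str], str] = resolve_ip,
                  home_region: str = "CN-mainland") -> bool:
    """Judge a domain as spam when its IP is not attributed to home_region.

    lookup_region is an assumed callback that maps an IP address to a
    region label; the label format is illustrative only.
    """
    ip = resolve(domain)
    return lookup_region(ip) != home_region


# Illustrative stand-ins for DNS and the attribution database
# (addresses taken from the documentation ranges).
regions = {"198.51.100.7": "CN-mainland", "203.0.113.9": "US"}
print(is_spam_by_ip("a.example", regions.get,
                    resolve=lambda d: "203.0.113.9"))
```

An IP address that is absent from the attribution table is treated as not attributed to the home region in this sketch; a production system might prefer to treat that case as undetermined.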
The effect of the above technical scheme is as follows: identifying the short message to be detected by checking the attribution of the IP address effectively improves the efficiency, accuracy, and success rate of spam short message detection, and effectively reduces the number of detection and identification failures.
In an embodiment of the present invention, determining whether the short message to be detected is a spam short message by using the proportion of each word in spam information includes:
S501, querying the proportion of each word in spam information;
S502, performing a weighted summation over the proportions of all words in spam information to obtain a weighted sum value K;
S503, judging whether the weighted sum value K is greater than a preset weighted sum value;
S504, when the weighted sum value K is greater than the preset weighted sum value, judging that the short message to be detected is a spam short message.
The weighted sum value K is obtained by the following formula:
$$K = \sum_{i=1}^{n} W_i C_i$$
where $W_1, W_2, \ldots, W_n$ represent the proportions in spam information of words 1 to n in the text information of the HTML page, and $C_1, C_2, \ldots, C_n$ represent the number of times words 1 to n appear in the text information of the HTML page. The proportion of each word in spam information is a manually preset value, with a default of zero.
The working principle of the technical scheme is as follows: the text of the HTML page is segmented to obtain n words, the proportion of each word in spam information is queried, and a weighted summation over the proportions of all words is performed to obtain K. Whether K is greater than a preset value is then judged; if it is, the short message to be detected is judged to be a spam short message. The proportion of each word in spam information is a preset value.
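The word-based check can be sketched as follows, assuming that K is the sum over distinct words of proportion times occurrence count, as the definitions of W and C suggest. The function names, the sample weights, and the threshold are illustrative; word segmentation is assumed to have been done already by a separate tool.

```python
from collections import Counter
from typing import Iterable, Mapping


def weighted_sum(words: Iterable[str], weights: Mapping[str, float]) -> float:
    """Compute K as the sum of weight * occurrence count over distinct words.

    Words absent from the weight table default to a weight of zero.
    """
    counts = Counter(words)
    return sum(weights.get(word, 0.0) * count for word, count in counts.items())


def is_spam_by_words(words: Iterable[str],
                     weights: Mapping[str, float],
                     threshold: float) -> bool:
    """Judge the page text as spam when K exceeds the preset threshold."""
    return weighted_sum(words, weights) > threshold


# Illustrative data: already-segmented page text and manually preset weights.
weights = {"free": 0.5, "prize": 1.0}
words = ["free", "prize", "free", "hello"]
print(weighted_sum(words, weights))           # 0.5 * 2 + 1.0 * 1 + 0 = 2.0
print(is_spam_by_words(words, weights, 1.5))  # True, since 2.0 > 1.5
```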
The effect of the above technical scheme is as follows: detecting the short message to be detected through the weighted sum, over the words of the HTML page text, of their proportions in spam information effectively improves the efficiency, accuracy, and success rate of spam short message detection, and effectively reduces the number of detection and identification failures. In addition, the weighted sum obtained through the formula reflects more accurately how much weight the words of the HTML page text carry in spam information, which further improves the accuracy of word-based spam short message detection.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. A spam short message detection method based on crawler technology, characterized by comprising the following steps:
extracting a valid URL address from the short message to be detected through a preset regular expression;
using a crawler module to acquire the URL address and the HTML page text information of the final HTML page pointed to by the URL address;
extracting domain name information from the URL address corresponding to the HTML page through a preset regular expression for acquiring the domain name, and judging whether the short message to be detected is a spam short message by using the domain name information;
resolving the domain name information by a DNS resolution method, acquiring the IP address corresponding to the domain name information, and judging whether the short message to be detected is a spam short message according to the IP address;
performing word segmentation on the text information of the HTML page to obtain each word in the text information of the HTML page, and judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information.
2. The method according to claim 1, wherein, in the process of acquiring the HTML page pointed to by the URL address, if the HTTP request is redirected multiple times, the crawler module continues until it reaches an HTML page that does not automatically jump to another HTML page, this HTML page being the final page visible to the user.
3. The method of claim 1, wherein the regular expression is:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
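As an illustration, the expression of claim 3 can be applied with Python's standard `re` module to pull URL addresses out of a short message. The function name and the sample message are hypothetical.

```python
import re
from typing import List

# The regular expression of claim 3, unchanged.
URL_RE = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def extract_urls(sms: str) -> List[str]:
    """Return every substring of the short message that matches URL_RE."""
    # The pattern contains only non-capturing groups, so findall
    # returns the full matches.
    return URL_RE.findall(sms)


print(extract_urls("Win now http://t.example/abc?x=1 reply STOP"))
```

Note that the character range `$-_` inside the expression covers most ASCII punctuation, so a full stop or comma that directly follows a URL in the message is captured as part of the match; a caller may want to trim such trailing punctuation.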
4. The method of claim 1, wherein using the crawler module to acquire the URL address and the HTML page text information of the final HTML page pointed to by the URL address comprises:
sending, in crawler mode, an HTTP request pointing to the URL address to the response header;
when a first type of response sent back by the response header is received, the crawler module continues by initiating an HTTP request to the new URL address contained in the response header, wherein the first type of response is an HTTP redirection message;
when a second type of response sent back by the response header is received, the crawler module simulates the behavior of a browser, executes the Javascript, and initiates a new HTTP request to the URL address specified in the Javascript, wherein the second type of response is page information containing Javascript;
the crawler module continues to process responses of the first type and the second type until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
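The three response types can be sketched as a loop, under loudly stated assumptions. The claim has the crawler execute the Javascript as a browser would; this sketch does not do that and instead substitutes a regular expression that recognizes only simple `location` assignments. The `fetch` callback, the `(status, headers, body)` response tuple, the `max_hops` limit, and all names are hypothetical.

```python
import re
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)

# Simplification: instead of executing Javascript, detect assignments such as
# window.location.href = "..." and read the target URL from the string.
JS_REDIRECT_RE = re.compile(
    r"""(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def resolve_final_page(url: str,
                       fetch: Callable[[str], Response],
                       max_hops: int = 10) -> Tuple[str, str]:
    """Follow redirects until a final page is reached; return (url, body)."""
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        # First type: HTTP redirection carrying a new URL in the header.
        if 300 <= status < 400 and "Location" in headers:
            url = urljoin(url, headers["Location"])
            continue
        # Second type: page whose script jumps to another URL.
        match = JS_REDIRECT_RE.search(body)
        if match:
            url = urljoin(url, match.group(1))
            continue
        # Third type: final response shown to the end user.
        return url, body
    raise RuntimeError("too many redirects")


# Illustrative in-memory pages standing in for real HTTP responses.
pages = {
    "http://s.example/a": (302, {"Location": "http://s.example/b"}, ""),
    "http://s.example/b": (200, {}, '<script>window.location.href = "/final";</script>'),
    "http://s.example/final": (200, {}, "<html><body>Hello</body></html>"),
}
print(resolve_final_page("http://s.example/a", pages.__getitem__))
```

The hop limit is an addition of this sketch that guards against redirect loops; the claim itself states no such limit.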
5. The method of claim 1, wherein the crawler module comprises:
a request sending module, used for sending, in crawler mode, an HTTP request pointing to the URL address to the response header;
a first-type response processing module, used for continuing to initiate an HTTP request to the new URL address contained in the response header when a first type of response sent back by the response header is received, wherein the first type of response is an HTTP redirection message;
a second-type response processing module, used for simulating the behavior of a browser when a second type of response sent back by the response header is received, executing the Javascript, and initiating a new HTTP request to the URL address specified in the Javascript, wherein the second type of response is page information containing Javascript;
an HTML information acquisition module, used for controlling the first-type response processing module and the second-type response processing module to continue processing responses of the first type and the second type respectively until a third type of response is received, wherein the third type of response is the final response, and the received response information contains the HTML page content displayed to the end user.
6. The method according to claim 5, wherein after the request sending module sends a first HTTP request to the response header for the short message to be detected, if the time taken by the response header to return response information exceeds a preset time threshold, the request sending module sends successive HTTP requests at a sending request time interval, the sending request time interval being determined by the following formula:
where $\Delta T$ represents the sending request time interval; $m$ represents the total number of short messages to be detected by the crawler module; $\Delta T_0$ represents the initial default value of the sending request time interval; $\beta_1$, $\beta_2$ and $\beta_3$ represent time interval adjustment coefficients, with $\beta_1$ ranging from 0.581 to 0.673, $\beta_2$ ranging from 0.424 to 0.537, and $\beta_3$ ranging from 0.615 to 0.736; $T_i$ represents, for the i-th short message to be detected by the crawler module, the interval between the request sending module sending the first HTTP request to the response header and obtaining a response from the response header; $T_{max}$ represents the maximum value of that interval over the m short messages to be detected by the crawler module; and $T_{min}$ represents the minimum value of that interval over the m short messages to be detected by the crawler module.
7. The method of claim 1, wherein judging whether the short message to be detected is a spam short message by using the domain name information comprises:
querying the ICP filing information corresponding to the domain name information, and judging whether the ICP filing information is valid filing information;
when the ICP filing information is invalid filing information, judging that the short message to be detected is a spam short message.
8. The method of claim 1, wherein judging whether the short message to be detected is a spam short message according to the IP address comprises:
querying the attribution information of the IP address according to the IP address;
judging, according to the attribution information, whether the attribution corresponding to the IP address is mainland China;
when the attribution corresponding to the IP address is not mainland China, judging that the short message to be detected is a spam short message.
9. The method of claim 1, wherein judging whether the short message to be detected is a spam short message by using the proportion of each word in spam information comprises:
querying the proportion of each word in spam information;
performing a weighted summation over the proportions of all words in spam information to obtain a weighted sum value K;
judging whether the weighted sum value K is greater than a preset weighted sum value;
when the weighted sum value K is greater than the preset weighted sum value, judging that the short message to be detected is a spam short message.
10. The method of claim 9, wherein the weighted sum value K is obtained by the following formula:
$$K = \sum_{i=1}^{n} W_i C_i$$
where $W_1, W_2, \ldots, W_n$ represent the proportions in spam information of words 1 to n in the text information of the HTML page, and $C_1, C_2, \ldots, C_n$ represent the number of times words 1 to n appear in the text information of the HTML page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011173377.9A CN112287198B (en) | 2020-10-28 | 2020-10-28 | Junk short message detection method based on crawler technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011173377.9A CN112287198B (en) | 2020-10-28 | 2020-10-28 | Junk short message detection method based on crawler technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112287198A (en) | 2021-01-29 |
CN112287198B (en) | 2023-12-01 |
Family
ID=74372913
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011173377.9A Active CN112287198B (en) | 2020-10-28 | 2020-10-28 | Junk short message detection method based on crawler technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287198B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113821754A (en) * | 2021-09-18 | 2021-12-21 | 上海观安信息技术股份有限公司 | Sensitive data interface crawler identification method and device |
CN115623485A (en) * | 2022-12-20 | 2023-01-17 | 杭州孝道科技有限公司 | Short message bombing detection method, system, server and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110159895A1 (en) * | 2009-12-30 | 2011-06-30 | Research In Motion Limited | Method and system for allowing varied functionality based on multiple transmissions |
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
CN106453351A (en) * | 2016-10-31 | 2017-02-22 | 重庆邮电大学 | Financial fishing webpage detection method based on Web page characteristics |
CN111083705A (en) * | 2019-12-10 | 2020-04-28 | 平安国际智慧城市科技股份有限公司 | Group-sending fraud short message detection method, device, server and storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113821754A (en) * | 2021-09-18 | 2021-12-21 | 上海观安信息技术股份有限公司 | Sensitive data interface crawler identification method and device |
CN113821754B (en) * | 2021-09-18 | 2024-08-16 | 上海观安信息技术股份有限公司 | Method and device for identifying crawler of sensitive data interface |
CN115623485A (en) * | 2022-12-20 | 2023-01-17 | 杭州孝道科技有限公司 | Short message bombing detection method, system, server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112287198B (en) | 2023-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111401416B (en) | Abnormal website identification method and device and abnormal countermeasure identification method | |
CN109274632B (en) | Website identification method and device | |
US9123027B2 (en) | Social engineering protection appliance | |
CN101504673B (en) | Method and system for recognizing doubtful fake website | |
CN104954372B (en) | A kind of evidence obtaining of fishing website and verification method and system | |
CN109302434B (en) | Prompt message pushing method and device, service platform and storage medium | |
KR102355973B1 (en) | Apparatus and method for detecting smishing message | |
CN105119909B (en) | A kind of counterfeit website detection method and system based on page visual similarity | |
CN102638448A (en) | Method for judging phishing websites based on non-content analysis | |
US20090055928A1 (en) | Method and apparatus for providing phishing and pharming alerts | |
CN103297270A (en) | Application type recognition method and network equipment | |
CN102647408A (en) | Method for judging phishing website based on content analysis | |
CN112287198B (en) | Junk short message detection method based on crawler technology | |
CN111147489B (en) | Link camouflage-oriented fishfork attack mail discovery method and device | |
CN109547426B (en) | Service response method and server | |
CN106446113A (en) | Mobile big data analysis method and device | |
CN108449368A (en) | A kind of application layer attack detection method, device and electronic equipment | |
CN107426136B (en) | Network attack identification method and device | |
CN107979845A (en) | The indicating risk method and apparatus of wireless access point | |
JP4564916B2 (en) | Phishing fraud countermeasure method, terminal, server and program | |
Sampat et al. | Detection of phishing website using machine learning | |
CN108804501A (en) | A kind of method and device of detection effective information | |
US20160285905A1 (en) | System and method for detecting mobile cyber incident | |
CN113709748B (en) | Method for identifying virus short message based on sending behavior and website characteristics | |
CN114301711B (en) | Anti-riot brushing method, device, equipment, storage medium and computer program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||