CN106383862B - Illegal short message detection method and system - Google Patents

Info

Publication number
CN106383862B
Authority
CN
China
Prior art keywords
illegal
webpage
link
short message
judging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610799866.2A
Other languages
Chinese (zh)
Other versions
CN106383862A (en
Inventor
肖耿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Clouds Network Technology Co Ltd
Original Assignee
Hangzhou Clouds Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Clouds Network Technology Co Ltd
Priority to CN201610799866.2A
Publication of CN106383862A
Application granted
Publication of CN106383862B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for detecting illegal short messages, comprising: obtaining a link in the short message content and obtaining the webpage pointed to by the link; judging whether the link is an illegal link according to the result of illegal-keyword filtering on the text content of the webpage; and, if the short message contains an illegal link, judging the short message to be an illegal short message. The invention also discloses a system for detecting illegal short messages, comprising a link acquisition module, an illegal keyword filtering module and a judging module, wherein the link acquisition module is used for acquiring a link in the short message content and acquiring the webpage pointed to by the link; the illegal keyword filtering module is used for judging whether the link is an illegal link according to the result of illegal-keyword filtering on the text content of the webpage acquired by the link acquisition module; and the judging module is used for judging, according to the result from the illegal keyword filtering module, that a short message containing an illegal link is an illegal short message. The disclosed technical scheme detects the content behind links in short messages and effectively improves the interception success rate for illegal short messages.

Description

Illegal short message detection method and system
Technical Field
The invention relates to the technical field of communication, in particular to an illegal short message detection method and a system for implementing the method.
Background
The short message service is an important component of mobile phone communication services. Although the share of personal communication carried by short messages has fallen under the impact of mobile social applications, the service is still widely used because of its particular advantages as a channel for mass-texting promotion. A promotional mass text always carries the information the sender wants to convey, such as a product name or a link, in the expectation that the recipient will view the product through the link, which brings benefits to the sender.
As the service provider, the short message sending platform is responsible for checking the content of mass texts and ensuring that it contains nothing that violates laws and regulations, such as gambling or pornography. Existing methods for detecting and monitoring illegal short messages fall roughly into two types. The first is detection by the short message sending operator: illegal short messages are screened out and their sending is intercepted, either by manual checking or by illegal-keyword filtering of the message content. Detection on the operator side can in principle stop illegal short messages at the source, but to avoid interception a merchant can add a link to the short message that points directly to the promotion webpage, so that no illegal words appear in the message text and interception is easily evaded. The second is keyword filtering on the mobile phone: application software and an illegal-word library on the handset filter the received short messages and block those containing illegal content. Because of the huge differences in performance and short message traffic between a mobile phone and a server, this method is difficult to apply to illegal short message detection on a short message sending platform.
Disclosure of Invention
In the prior art, a short message sending platform has difficulty detecting illegal content behind the links in a short message and therefore cannot completely block the sending of illegal short messages. To overcome this defect, the invention provides a method and a system for detecting illegal short messages that can detect the link content of the short messages to be sent and effectively improve the interception success rate for illegal short messages.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention relates to an illegal short message detection method, which specifically comprises the following steps: acquiring a link in the short message content, and acquiring the webpage pointed to by the link; judging whether the link is an illegal link according to the result of illegal-keyword filtering on the text content of the webpage; and if the short message contains an illegal link, judging the short message to be an illegal short message.
Preferably, the step of acquiring the link in the short message content further includes: acquiring the entire content of the short message, and extracting the links in the short message content by regular expression matching.
Preferably, the step of judging whether the link is an illegal link according to the result of illegal-keyword filtering on the text content of the webpage further includes: parsing the webpage elements, extracting the text content, and marking the webpage element source of each part of the text content; performing word segmentation on the text content to obtain segmented phrases, matching the segmented phrases against the illegal keywords in a preset illegal keyword library, and identifying the illegal phrases among the segmented phrases; assigning preset weighting coefficients to the illegal phrases according to their webpage element sources, and calculating the weighted word frequency of the illegal phrases in the text content of the webpage; when the weighted word frequency of the illegal phrases exceeds a preset threshold, judging the webpage to be an illegal webpage; and if the webpage pointed to by the link is an illegal webpage, judging the link to be an illegal link.
Preferably, the webpage elements comprise text without hyperlinks and text with hyperlinks, and the weighting coefficient of an illegal phrase originating from text without hyperlinks is smaller than that of an illegal phrase originating from text with hyperlinks.
Preferably, the webpage elements comprise pictures without hyperlinks and pictures with hyperlinks, and the weighting coefficient of an illegal phrase originating from a picture without a hyperlink is smaller than that of an illegal phrase originating from a picture with a hyperlink; the step of parsing the webpage elements, extracting the text content and marking the webpage element source of each part of the text content further comprises: acquiring the pictures in the webpage, and distinguishing pictures without hyperlinks from pictures with hyperlinks; recognizing and extracting the text content in the pictures without hyperlinks by optical character recognition, and marking the webpage element source of that text content as a picture without a hyperlink; and recognizing and extracting the text content in the pictures with hyperlinks by optical character recognition, and marking the webpage element source of that text content as a picture with a hyperlink.
The invention also provides an illegal short message detection system, which comprises:
the link acquisition module is used for acquiring a link in the short message content and acquiring a webpage pointed by the link;
the illegal keyword filtering module is used for judging whether the link is an illegal link according to the filtering result of the illegal keyword of the text content in the webpage acquired by the link acquisition module;
and the judging module is used for judging that the short message contains the illegal link according to the judging result of the illegal keyword filtering module, and then judging that the short message is the illegal short message.
Preferably, the illegal keyword filtering module specifically includes:
the character analysis unit is used for analyzing the webpage elements and extracting character contents;
the source marking unit is used for marking the webpage element sources of each part of the text content extracted by the text analysis unit;
the word segmentation unit is used for carrying out word segmentation processing on the text content extracted by the text analysis unit to obtain word segmentation phrases;
the illegal phrase identification unit is used for matching the segmented phrases obtained by the segmentation unit with illegal keywords in a preset illegal keyword library and identifying the illegal phrases in the segmented phrases;
the calculation unit is used for giving preset weighting coefficients to the illegal phrases according to different webpage element sources, and weighting and calculating the weighted word frequency of the illegal phrases in the text content of the webpage;
the link judging unit is used for judging the webpage to be an illegal webpage when the weighted word frequency of the illegal phrases exceeds a preset threshold; and if the webpage pointed to by the link is an illegal webpage, judging the link to be an illegal link.
Preferably, the webpage elements comprise text without hyperlinks and text with hyperlinks, and the weighting coefficient of an illegal phrase originating from text without hyperlinks is smaller than that of an illegal phrase originating from text with hyperlinks.
Preferably, the webpage elements comprise pictures without hyperlinks and pictures with hyperlinks, and the weighting coefficient of an illegal phrase originating from a picture without a hyperlink is smaller than that of an illegal phrase originating from a picture with a hyperlink; the character analysis unit comprises an optical character recognition subunit used for recognizing and extracting the text content in the pictures without hyperlinks and the pictures with hyperlinks in the webpage.
In the disclosed method for detecting illegal short messages, a link is extracted from the short message, the webpage pointed to by the link is accessed, and illegal-keyword filtering is applied to the text content of the webpage to judge whether the webpage contains illegal content and hence whether the link is an illegal link; if the short message contains an illegal link, it is judged to be an illegal short message and corresponding operations such as interception are carried out. The objects of illegal-keyword filtering include both the plain text content of the webpage and the text in pictures, and different weighting coefficients are applied when calculating the word frequency of illegal phrases depending on whether the content carries a hyperlink, so that the legality of the webpage the link points to is judged more reasonably, in line with user habits. The invention also discloses a system for detecting illegal short messages, which acquires the link in the short message content and the webpage it points to through the link acquisition module, applies illegal-keyword filtering to the webpage content through the illegal keyword filtering module to judge whether the webpage is an illegal webpage, and detects and intercepts short messages containing illegal links. Unlike the prior art, this technical scheme can detect the content behind links in short messages, which ensures the interception accuracy for illegal short messages and prevents merchants from evading interception by adding a link in order to earn illegal profits.
Drawings
Fig. 1 is a schematic diagram of an illegal short message detection system according to an embodiment of the present invention.
Fig. 2 is a first schematic diagram of an illegal keyword filtering module according to an embodiment of the present invention.
Fig. 3 is a second schematic diagram of the illegal keyword filtering module according to the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The invention discloses an illegal short message detection method and an illegal short message detection system, in which a link in a short message is extracted and the webpage pointed to by the link is accessed; illegal-keyword filtering is applied to the text content of the webpage to judge whether the webpage contains illegal content and hence whether the link is an illegal link; and if the short message contains an illegal link, it is judged to be an illegal short message and corresponding operations such as interception are performed. Unlike the prior art, this technical scheme can detect the content behind links in short messages, which ensures the interception accuracy for illegal short messages and prevents merchants from evading interception by adding a link in order to earn illegal profits.
Specific embodiments of the illegal short message detection method are as follows:
Example 1: An illegal short message detection method, which specifically comprises the following steps:
S101, obtaining the link in the short message content and obtaining the webpage pointed to by the link.
Specifically, the entire content of the short message is obtained, and the links in it are extracted by regular expression matching. A regular expression is a single pattern string that describes and matches a set of strings conforming to a given syntactic rule. In this step, the blank spaces and meaningless punctuation in the short message content are deleted to obtain the text content of the short message, and the links in the text content are identified by a preset regular expression.
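By way of illustration only, the link extraction in step S101 might be sketched as follows in Python. The pattern `LINK_PATTERN` and the function name `extract_links` are assumptions introduced here, not taken from the patent, which only states that a preset regular expression is used. The sketch removes whitespace only (not punctuation) and assumes the text around the link is non-ASCII, for example Chinese, so that deleting spaces does not merge the link with neighbouring words.

```python
import re

# Hypothetical preset pattern: an http(s) scheme followed by ASCII characters
# that may appear in a URL. The patent does not specify the actual expression.
LINK_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#@!$&'*+,;=%]+")


def extract_links(message: str) -> list[str]:
    """Return the links found in a short message, in order of appearance."""
    # Delete whitespace first, as the step describes; a link that was broken
    # up by spaces is rejoined before matching.
    compact = re.sub(r"\s+", "", message)
    return LINK_PATTERN.findall(compact)


links = extract_links("优惠活动 http://example.com/pro mo?id=7 退订回T")
print(links)
```

Because the character class is ASCII-only, the match stops at the first Chinese character after the link; messages written in a Latin script would need a different cleaning rule.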
S102, judging whether the link is an illegal link according to the filtering result of the illegal keyword of the text content in the webpage.
Preferably, this step further comprises: parsing the webpage elements, extracting the text content, and marking the webpage element source of each part of the text content; performing word segmentation on the text content to obtain segmented phrases, matching the segmented phrases against the illegal keywords in a preset illegal keyword library, and identifying the illegal phrases among the segmented phrases; assigning preset weighting coefficients to the illegal phrases according to their webpage element sources, and calculating the weighted word frequency of the illegal phrases in the text content of the webpage; when the weighted word frequency of the illegal phrases exceeds a preset threshold, judging the webpage to be an illegal webpage; and if the webpage pointed to by the link is an illegal webpage, judging the link to be an illegal link.
S103, if the short message contains the illegal link, the short message is judged to be an illegal short message. Given the limits on short message content, a link that a merchant adds to a short message is generally the main content being promoted, so the short message can be judged to be an illegal short message as soon as the link is judged to be an illegal link.
In the disclosed method for detecting illegal short messages, a link is extracted from the short message, the webpage pointed to by the link is accessed, and illegal-keyword filtering is applied to the text content of the webpage to judge whether the webpage contains illegal content and hence whether the link is an illegal link; if the short message contains an illegal link, it is judged to be an illegal short message and corresponding operations such as interception are carried out. The objects of illegal-keyword filtering include both the plain text content of the webpage and the text in pictures, and different weighting coefficients are applied when calculating the word frequency of illegal phrases depending on whether the content carries a hyperlink, so that the legality of the webpage the link points to is judged more reasonably, in line with user habits.
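The weighted word frequency test of step S102 can be sketched as below. This is a minimal sketch under stated assumptions: the keyword set, the weights and the threshold are hypothetical values (the patent only says they are preset), a simple `\w+` tokenizer stands in for real word segmentation, and the threshold is compared against the total over all illegal phrases, which is one possible reading of the text.

```python
import re
from collections import Counter

# Hypothetical presets; the patent does not give concrete values.
ILLEGAL_KEYWORDS = {"casino", "jackpot"}
SOURCE_WEIGHTS = {"plain_text": 1.0, "linked_text": 2.0}
THRESHOLD = 3.0


def weighted_frequency(fragments):
    """fragments: iterable of (text, source) pairs.

    Returns the weighted count of each illegal keyword, where every
    occurrence contributes the weight of the element it came from.
    """
    totals = Counter()
    for text, source in fragments:
        # Stand-in for word segmentation: lowercase word tokens.
        for token in re.findall(r"\w+", text.lower()):
            if token in ILLEGAL_KEYWORDS:
                totals[token] += SOURCE_WEIGHTS[source]
    return totals


def is_illegal_page(fragments) -> bool:
    """Judge the page illegal when the total weighted frequency exceeds the threshold."""
    return sum(weighted_frequency(fragments).values()) > THRESHOLD


page = [
    ("Win the jackpot today", "plain_text"),   # contributes 1.0
    ("casino bonus", "linked_text"),           # contributes 2.0
    ("about us", "plain_text"),                # contributes nothing
]
print(is_illegal_page(page))
print(is_illegal_page(page + [("jackpot", "linked_text")]))
```

With these values the first page totals exactly 3.0 and is not flagged, since the text requires the frequency to exceed the threshold; one more hyperlinked occurrence raises the total to 5.0 and the page is flagged.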
Example 2: a violation short message detection method specifically comprises the following steps:
s201, acquiring all contents of the short message, and extracting links in the short message contents by using a regular expression matching method.
S202, analyzing webpage elements, extracting text contents, and marking webpage element sources of the text contents of all parts; the webpage elements comprise characters without hyperlinks and characters with hyperlinks.
S203, word segmentation is carried out on the text content to obtain word segmentation phrases, the word segmentation phrases are matched with the violation keywords in the preset violation keyword library, and the violation phrases in the word segmentation phrases are identified.
S204, according to preset weighting coefficients given to illegal phrases by different webpage element sources, weighting and calculating the weighted word frequency of the illegal phrases in the text content of the webpage; preferably, the weighting factor of the violation phrase originating from the un-hyperlinked text is smaller than the weighting factor of the violation phrase originating from the hyperlinked text.
S205, when the weighted word frequency of the illegal phrase exceeds a preset threshold value, judging the webpage to be the illegal webpage;
s206, if the webpage pointed by the connection is the illegal webpage, the connection is judged to be the illegal link.
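As an illustrative sketch of step S202, the source marking for text with and without hyperlinks could be done with Python's standard `html.parser` module. The class name `SourceMarkingParser`, the function `mark_sources` and the labels `plain_text` and `linked_text` are hypothetical; the patent does not prescribe a parser. Here an `<a>` element counts as a hyperlink only if it has an `href` attribute, and script and style content is skipped.

```python
from html.parser import HTMLParser


class SourceMarkingParser(HTMLParser):
    """Collects (text, source) pairs; source is 'linked_text' or 'plain_text'."""

    def __init__(self):
        super().__init__()
        self.anchor_stack = []  # one bool per open <a>: does it carry an href?
        self.skip_depth = 0     # greater than 0 while inside <script> or <style>
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_stack.append(any(name == "href" for name, _ in attrs))
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_stack:
            self.anchor_stack.pop()
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        source = "linked_text" if any(self.anchor_stack) else "plain_text"
        self.fragments.append((text, source))


def mark_sources(html: str):
    """Return the visible text fragments of a page with their element source."""
    parser = SourceMarkingParser()
    parser.feed(html)
    parser.close()
    return parser.fragments


fragments = mark_sources(
    '<p>Welcome</p><a href="/play">Play now</a><script>var x = 1;</script>'
)
print(fragments)
```

The resulting pairs are the input that the word segmentation and weighting steps (S203 and S204) would consume.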
Example 3: a violation short message detection method specifically comprises the following steps:
s301, all the contents of the short message are obtained, and the links in the short message contents are extracted by using a regular expression matching method.
S302, analyzing the webpage elements, extracting the text content, and marking the webpage element sources of each part of the text content.
S303, acquiring pictures in the webpage, and distinguishing the pictures without hyperlinks from the pictures with hyperlinks.
S304, recognizing and extracting the text content in the picture without the hyperlink by using an optical character recognition technology, and marking the webpage element source of the text content as the picture without the hyperlink; and identifying and extracting the text content in the picture with the hyperlink by using an optical character recognition technology, and marking the source of the webpage element of the text content as the picture with the hyperlink.
S305, performing word segmentation processing on the text content to obtain word segmentation phrases, matching the word segmentation phrases with violation keywords in a preset violation keyword library, and identifying the violation phrases in the word segmentation phrases.
S306, according to different webpage element sources, giving preset weighting coefficients to the illegal phrases, and carrying out weighting calculation on the weighted word frequency of the illegal phrases in the text content of the webpage; preferably, the weighting factor of the violation phrase originating from the un-hyperlinked text is smaller than the weighting factor of the violation phrase originating from the hyperlinked text.
S307, when the weighted word frequency of the illegal phrase exceeds a preset threshold value, the webpage is judged to be the illegal webpage;
s308, if the webpage pointed by the connection is the illegal webpage, the connection is judged to be the illegal link.
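Steps S303 and S304 might be sketched as follows. The sketch only shows how pictures are separated by whether they sit inside a hyperlink and how their recognized text is labelled; the optical character recognition itself is passed in as a callable named `ocr`, which is a placeholder for whatever OCR engine is used and is not specified by the patent. The class name, function name and labels are likewise assumptions.

```python
from html.parser import HTMLParser


class PictureCollector(HTMLParser):
    """Collects (src, source) pairs for <img> elements in a page."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # number of open <a href=...> elements
        self.pictures = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.link_depth += 1
        elif tag == "img" and attributes.get("src"):
            label = "linked_picture" if self.link_depth else "plain_picture"
            self.pictures.append((attributes["src"], label))

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1


def picture_fragments(html: str, ocr):
    """Return (recognized text, source) pairs for every picture in the page.

    `ocr` is a hypothetical callable mapping a picture source to the text
    recognized in that picture.
    """
    collector = PictureCollector()
    collector.feed(html)
    collector.close()
    return [(ocr(src), label) for src, label in collector.pictures]


# A dictionary lookup stands in for a real OCR engine in this example.
fake_ocr = {"banner.png": "lucky draw", "button.png": "bet now"}.get
result = picture_fragments(
    '<img src="banner.png"><a href="/bet"><img src="button.png"></a>', fake_ocr
)
print(result)
```

Keeping the OCR engine behind a plain callable lets the labelling logic be exercised without image files or a network connection.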
Example 4: Fig. 1 is a schematic diagram of an illegal short message detection system according to the present invention. As shown in the figure, the illegal short message detection system specifically includes: a link acquisition module, an illegal keyword filtering module and a judging module.
The link acquisition module is used for acquiring a link in the short message content and acquiring a webpage pointed by the link;
and the illegal keyword filtering module is used for judging whether the link is an illegal link according to the filtering result of the illegal keyword of the text content in the webpage acquired by the link acquisition module.
Preferably, the illegal keyword filtering module includes: the character analysis unit, used for parsing the webpage elements and extracting the text content; the source marking unit, used for marking the webpage element source of each part of the text content extracted by the character analysis unit; the word segmentation unit, used for performing word segmentation on the text content extracted by the character analysis unit to obtain segmented phrases; the illegal phrase identification unit, used for matching the segmented phrases obtained by the word segmentation unit against the illegal keywords in a preset illegal keyword library and identifying the illegal phrases among the segmented phrases; the calculation unit, used for assigning preset weighting coefficients to the illegal phrases according to their webpage element sources and calculating the weighted word frequency of the illegal phrases in the text content of the webpage; and the link judging unit, used for judging the webpage to be an illegal webpage when the weighted word frequency of the illegal phrases exceeds a preset threshold, and for judging the link to be an illegal link if the webpage pointed to by the link is an illegal webpage.
And the judging module is used for judging that the short message contains the illegal link according to the judging result of the illegal keyword filtering module, and then judging that the short message is the illegal short message.
The invention also discloses a system for detecting illegal short messages, which acquires the link in the short message content and the webpage it points to through the link acquisition module, applies illegal-keyword filtering to the webpage content through the illegal keyword filtering module to judge whether the webpage is an illegal webpage, and detects and intercepts short messages containing illegal links. Unlike the prior art, this technical scheme can detect the content behind links in short messages, which ensures the interception accuracy for illegal short messages and prevents merchants from evading interception by adding a link in order to earn illegal profits.
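One way to picture how the three modules cooperate is the following Python sketch. All class and function names are hypothetical, the page fetcher is injected as a callable so that no network access is needed, and the filtering module uses a plain unweighted keyword count for brevity rather than the weighted word frequency described for the preferred embodiment.

```python
import re
from typing import Callable, Dict, Iterable, Set

# Hypothetical link pattern; the patent only refers to a preset regular expression.
LINK_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#@!$&'*+,;=%]+")


class LinkAcquisitionModule:
    """Extracts links from a message and fetches the pages they point to."""

    def __init__(self, fetch: Callable[[str], str]):
        self.fetch = fetch  # hypothetical fetcher: URL in, page text out

    def acquire(self, message: str) -> Dict[str, str]:
        return {link: self.fetch(link) for link in LINK_PATTERN.findall(message)}


class IllegalKeywordFilterModule:
    """Judges a link illegal when its page has more keyword hits than the threshold."""

    def __init__(self, keywords: Set[str], threshold: int):
        self.keywords = keywords
        self.threshold = threshold

    def is_illegal_link(self, page_text: str) -> bool:
        text = page_text.lower()
        return sum(text.count(word) for word in self.keywords) > self.threshold


class JudgingModule:
    """Judges the message illegal if any of its links is illegal."""

    def is_illegal_message(self, link_verdicts: Iterable[bool]) -> bool:
        return any(link_verdicts)


def detect(message, acquisition, keyword_filter, judge) -> bool:
    """Run a message through the three modules in order."""
    pages = acquisition.acquire(message)
    return judge.is_illegal_message(
        keyword_filter.is_illegal_link(text) for text in pages.values()
    )


# Stand-in pages instead of real HTTP requests.
fake_pages = {
    "http://example.com/a": "casino casino jackpot",
    "http://example.org/b": "weather report",
}
acquisition = LinkAcquisitionModule(fetch=lambda url: fake_pages.get(url, ""))
keyword_filter = IllegalKeywordFilterModule({"casino", "jackpot"}, threshold=2)
judge = JudgingModule()

flagged = detect("Click http://example.com/a today", acquisition, keyword_filter, judge)
clean = detect("Forecast: http://example.org/b", acquisition, keyword_filter, judge)
print(flagged, clean)
```

Injecting the fetcher also makes it straightforward to add timeouts, caching or redirect handling on the sending platform without touching the filtering or judging logic.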
Example 5: As shown in fig. 1, a system for detecting illegal short messages specifically includes: a link acquisition module, an illegal keyword filtering module and a judging module.
The link acquisition module is used for acquiring a link in the short message content and acquiring a webpage pointed by the link;
as shown in fig. 2, the illegal keyword filtering module includes:
the character analysis unit is used for analyzing the webpage elements and extracting character contents;
the source marking unit is used for marking the webpage element sources of each part of the text content extracted by the text analysis unit;
the word segmentation unit is used for carrying out word segmentation processing on the text content extracted by the text analysis unit to obtain word segmentation phrases;
the illegal phrase identification unit is used for matching the segmented phrases obtained by the segmentation unit with illegal keywords in a preset illegal keyword library and identifying the illegal phrases in the segmented phrases;
the calculation unit is used for giving preset weighting coefficients to the illegal phrases according to different webpage element sources, and weighting and calculating the weighted word frequency of the illegal phrases in the text content of the webpage;
the link judging unit is used for judging the webpage to be an illegal webpage when the weighted word frequency of the illegal phrases exceeds a preset threshold; and if the webpage pointed to by the link is an illegal webpage, judging the link to be an illegal link.
Preferably, the webpage elements comprise text without hyperlinks and text with hyperlinks, and the weighting coefficient of an illegal phrase originating from text without hyperlinks is smaller than that of an illegal phrase originating from text with hyperlinks.
And the judging module is used for judging that the short message contains the illegal link according to the judging result of the illegal keyword filtering module, and then judging that the short message is the illegal short message.
This scheme further refines the illegal keyword filtering module of the illegal short message detection system: the source marking unit marks whether the webpage element source of each piece of extracted webpage text carries a hyperlink, and the calculation unit assigns different weighting coefficients according to the webpage element source of each illegal phrase and calculates the weighted word frequency of the illegal keywords, which serves as the parameter for judging whether the webpage content is illegal. Because text with a hyperlink jumps to another page when clicked, illegal content in such text is given a higher weight; calculating the weighted word frequency of illegal phrases in this way greatly improves the accuracy of illegal-content detection on webpage text.
Example 6: as shown in fig. 1, a system for detecting an illegal short message specifically includes: the system comprises a link acquisition module, an illegal keyword filtering module and a judgment module.
The link acquisition module is used for acquiring a link in the short message content and acquiring a webpage pointed by the link;
as shown in fig. 3, the illegal keyword filtering module further includes:
the character analysis unit is used for parsing the webpage elements and extracting the text content; the character analysis unit comprises an optical character recognition subunit used for recognizing and extracting the text content in the pictures without hyperlinks and the pictures with hyperlinks in the webpage;
the source marking unit is used for marking the webpage element source of each part of the text content extracted by the character analysis unit;
the word segmentation unit is used for performing word segmentation on the text content extracted by the character analysis unit to obtain segmented phrases;
the illegal phrase identification unit is used for matching the segmented phrases obtained by the word segmentation unit against the illegal keywords in a preset illegal keyword library and identifying the illegal phrases among the segmented phrases;
the calculation unit is used for assigning preset weighting coefficients to the illegal phrases according to their webpage element sources, and calculating the weighted word frequency of the illegal phrases in the text content of the webpage;
Preferably, the webpage elements comprise pictures without hyperlinks and pictures with hyperlinks, and the weighting coefficient of an illegal phrase originating from a picture without a hyperlink is smaller than that of an illegal phrase originating from a picture with a hyperlink.
the link judging unit is used for judging the webpage to be an illegal webpage when the weighted word frequency of the illegal phrases exceeds a preset threshold; and if the webpage pointed to by the link is an illegal webpage, judging the link to be an illegal link.
The judging module is used for judging, according to the result from the illegal keyword filtering module, that a short message containing an illegal link is an illegal short message.
According to this scheme, the illegal keyword filtering module of the illegal short message detection system is further refined. The source marking unit extends the marking of webpage element sources to the pictures in the webpage, the optical character recognition subunit recognizes and extracts the text in those pictures, and pictures with hyperlinks are distinguished from pictures without hyperlinks. The calculation unit assigns different weighting coefficients according to the webpage element source of each illegal phrase and calculates the weighted word frequency of the illegal phrases, which serves as the parameter for judging whether the webpage content is illegal. Because pictures in a webpage are more visual and eye-catching, illegal content in them is both more likely and more influential, so text from pictures is assigned a higher weight; because a picture with a hyperlink jumps to another page when clicked, illegal content in its text is assigned the highest weight. Weighting the word frequency of the illegal phrases in this way improves the accuracy of detecting illegal text content in the webpage.
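The weighted word frequency described above can be illustrated with a minimal Python sketch. The text fixes only the ordering of the weighting coefficients (a source without a hyperlink weighs less than the same source with a hyperlink, and pictures weigh more than plain text). The numeric weights, the threshold, the source labels, and the normalization by the total number of segmented phrases are all assumptions made for illustration, not values taken from the patent.

```python
# Hypothetical weighting coefficients per webpage element source.
# Only the ordering follows the description; the numbers are assumptions.
SOURCE_WEIGHTS = {
    "text": 1.0,
    "linked_text": 1.5,
    "picture": 2.0,
    "linked_picture": 3.0,
}


def weighted_word_frequency(phrases, illegal_keywords, weights=SOURCE_WEIGHTS):
    """Compute an assumed form of the weighted word frequency.

    phrases is a list of (phrase, source) pairs from word segmentation.
    The result is the sum of the source weights of all illegal occurrences
    divided by the total number of segmented phrases.
    """
    if not phrases:
        return 0.0
    weighted_hits = sum(
        weights[source]
        for phrase, source in phrases
        if phrase in illegal_keywords
    )
    return weighted_hits / len(phrases)


def is_illegal_page(phrases, illegal_keywords, threshold=0.3):
    """Judge the page illegal when the weighted frequency exceeds the threshold."""
    return weighted_word_frequency(phrases, illegal_keywords) > threshold


if __name__ == "__main__":
    page = [
        ("free", "text"),
        ("lottery", "linked_picture"),
        ("prize", "text"),
        ("call", "text"),
        ("now", "text"),
    ]
    keywords = {"lottery", "prize"}
    print(weighted_word_frequency(page, keywords))
    print(is_illegal_page(page, keywords))
```

In this example the hit inside a hyperlinked picture contributes three times as much as the hit in plain text, so a single keyword in a clickable banner can push a page over the threshold that several plain-text hits would be needed to reach.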

Claims (7)

1. An illegal short message detection method, characterized by comprising the following steps:
acquiring a link in the short message content, and acquiring the webpage to which the link points;
judging whether the link is an illegal link according to the result of illegal keyword filtering on the text content in the webpage, specifically comprising: parsing the webpage elements, extracting the text content, and marking the webpage element source of each part of the text content; performing word segmentation on the text content to obtain segmented phrases, matching the segmented phrases against the illegal keywords in a preset illegal keyword library, and identifying the illegal phrases among the segmented phrases; calculating the weighted word frequency of the illegal phrases in the text content of the webpage according to the preset weighting coefficients assigned to the illegal phrases by their different webpage element sources; judging the webpage to be an illegal webpage when the weighted word frequency of the illegal phrases exceeds a preset threshold; and judging the link to be an illegal link if the webpage to which the link points is an illegal webpage;
and if the short message contains an illegal link, judging the short message to be an illegal short message.
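The word segmentation and keyword matching steps of claim 1 can be sketched as follows. The claim does not name a segmentation algorithm, so this sketch swaps in forward maximum matching, a simple dictionary-based technique for Chinese text. The function names, the `vocabulary` dictionary, and the `max_len` limit are hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching segmentation (an assumed technique).

    At each position, take the longest substring of at most max_len
    characters that appears in the vocabulary. If none matches, emit a
    single character and move on.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


def find_illegal_phrases(words, illegal_keywords):
    """Return every segmented phrase that appears in the keyword library."""
    return [word for word in words if word in illegal_keywords]


if __name__ == "__main__":
    # Hypothetical dictionary: "免费" (free), "中奖" (win a prize),
    # "点击" (click), "领取" (claim).
    vocabulary = {"免费", "中奖", "点击", "领取"}
    words = segment("恭喜中奖点击领取", vocabulary)
    print(words)
    print(find_illegal_phrases(words, {"中奖"}))
```

A production system would more likely rely on a trained segmenter; the dictionary-based version is shown only because it is self-contained and makes the matching step easy to follow.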
2. The method as claimed in claim 1, wherein the step of acquiring the link in the short message content further comprises:
acquiring the full content of the short message, and extracting the links in the short message content by regular expression matching.
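A minimal sketch of the regular expression matching in claim 2 is shown below. The claim does not give the expression itself, so the pattern here is an assumption: it accepts an `http://`, `https://`, or `www.` prefix followed by ASCII URL characters, which lets a link be separated from Chinese text that follows it without a space. The names `URL_PATTERN` and `extract_links` are hypothetical.

```python
import re

# Assumed pattern: a scheme or "www." prefix followed by ASCII URL characters.
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)[A-Za-z0-9\-._~:/?#@!$&*+,;=%]+",
    re.IGNORECASE,
)


def extract_links(sms_text):
    """Return the distinct links found in a short message, in order of appearance."""
    links = []
    for match in URL_PATTERN.finditer(sms_text):
        # Drop sentence punctuation that was swallowed at the end of the match.
        link = match.group(0).rstrip(".,;:!?")
        if link not in links:
            links.append(link)
    return links


if __name__ == "__main__":
    print(extract_links("恭喜中奖，点击http://t.example.com/a1b2领取"))
    print(extract_links("Visit www.example.org. Or http://example.com/prize?id=1, now"))
```

Restricting the match to ASCII characters is a design choice for messages that mix Chinese text and links; it would miss internationalized domain names, which a real deployment might need to handle separately.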
3. The method as claimed in claim 1, wherein the webpage elements include text without hyperlinks and text with hyperlinks, and the weighting coefficient of an illegal phrase originating from text without a hyperlink is smaller than that of an illegal phrase originating from text with a hyperlink.
4. The method according to claim 3, wherein the webpage elements include pictures without hyperlinks and pictures with hyperlinks, and the weighting coefficient of an illegal phrase originating from a picture without a hyperlink is smaller than that of an illegal phrase originating from a picture with a hyperlink;
the step of parsing the webpage elements, extracting the text content, and marking the webpage element source of each part of the text content further comprises:
acquiring the pictures in the webpage, and distinguishing the pictures without hyperlinks from the pictures with hyperlinks;
recognizing and extracting the text content in the pictures without hyperlinks by optical character recognition, and marking the webpage element source of that text content as a picture without a hyperlink;
and recognizing and extracting the text content in the pictures with hyperlinks by optical character recognition, and marking the webpage element source of that text content as a picture with a hyperlink.
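The source marking steps of claims 3 and 4 can be sketched with Python's standard `html.parser` module. The sketch labels each piece of text and each picture according to whether it sits inside an `<a href=...>` element. Optical character recognition is outside the standard library, so `ocr` is a hypothetical caller-supplied function that maps a picture address to its recognized text. The class name, function name, and source labels are assumptions.

```python
from html.parser import HTMLParser


class SourceLabelingParser(HTMLParser):
    """Collect text and pictures, each labeled with its webpage element source."""

    def __init__(self):
        super().__init__()
        self.anchor_stack = []   # one bool per open <a>: True if it has an href
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.text_parts = []     # (text, "text" or "linked_text")
        self.pictures = []       # (src, "picture" or "linked_picture")

    def _in_link(self):
        return any(self.anchor_stack)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.anchor_stack.append(bool(attrs.get("href")))
        elif tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "img" and attrs.get("src"):
            source = "linked_picture" if self._in_link() else "picture"
            self.pictures.append((attrs["src"], source))

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_stack:
            self.anchor_stack.pop()
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            source = "linked_text" if self._in_link() else "text"
            self.text_parts.append((text, source))


def extract_labeled_text(html, ocr=None):
    """Return (text, source) pairs; picture text comes from the supplied ocr callable."""
    parser = SourceLabelingParser()
    parser.feed(html)
    parser.close()
    labeled = list(parser.text_parts)
    if ocr is not None:
        for src, source in parser.pictures:
            recognized = ocr(src)
            if recognized:
                labeled.append((recognized, source))
    return labeled
```

For example, calling `extract_labeled_text(page_html, ocr=my_ocr)` with a function `my_ocr` that wraps an OCR engine would yield pairs ready for word segmentation, with each pair already carrying the source label that selects its weighting coefficient.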
5. An illegal short message detection system, characterized by comprising:
a link acquisition module, used for acquiring a link in the short message content and acquiring the webpage to which the link points;
an illegal keyword filtering module, used for judging whether the link is an illegal link according to the result of illegal keyword filtering on the text content in the webpage acquired by the link acquisition module;
the illegal keyword filtering module comprises: a text analysis unit, used for parsing the webpage elements and extracting the text content; a source marking unit, used for marking the webpage element source of each part of the text content extracted by the text analysis unit; a word segmentation unit, used for performing word segmentation on the text content extracted by the text analysis unit to obtain segmented phrases; an illegal phrase identification unit, used for matching the segmented phrases obtained by the word segmentation unit against the illegal keywords in a preset illegal keyword library and identifying the illegal phrases among the segmented phrases; a calculation unit, used for calculating the weighted word frequency of the illegal phrases in the text content of the webpage according to the preset weighting coefficients assigned to the illegal phrases by their different webpage element sources; and a link judging unit, used for judging the webpage to be an illegal webpage when the weighted word frequency of the illegal phrases exceeds a preset threshold, and for judging the link to be an illegal link if the webpage to which the link points is an illegal webpage;
and a judging module, used for judging, according to the judgment result of the illegal keyword filtering module, that the short message is an illegal short message if it contains an illegal link.
6. The system of claim 5, wherein the webpage elements include text without hyperlinks and text with hyperlinks, and the weighting coefficient of an illegal phrase originating from text without a hyperlink is smaller than that of an illegal phrase originating from text with a hyperlink.
7. The system of claim 6, wherein the webpage elements include pictures without hyperlinks and pictures with hyperlinks, and the weighting coefficient of an illegal phrase originating from a picture without a hyperlink is smaller than that of an illegal phrase originating from a picture with a hyperlink; the text analysis unit comprises an optical character recognition subunit used for recognizing and extracting the text content in the pictures without hyperlinks and the pictures with hyperlinks in the webpage.
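How the modules of claims 5 to 7 fit together can be sketched as a small Python class. This is an illustration of the module boundaries only: the class name `DetectionSystem` and its three injected callables are hypothetical, and each callable stands in for a module whose internals are described in the claims.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DetectionSystem:
    # Link acquisition module, part 1: find the links in a short message.
    extract_links: Callable[[str], Iterable[str]]
    # Link acquisition module, part 2: fetch the webpage a link points to.
    fetch_page: Callable[[str], str]
    # Illegal keyword filtering module: decide whether a webpage is illegal.
    is_illegal_page: Callable[[str], bool]

    def is_illegal_message(self, sms_text: str) -> bool:
        """Judging module: one illegal link makes the whole message illegal."""
        return any(
            self.is_illegal_page(self.fetch_page(link))
            for link in self.extract_links(sms_text)
        )


if __name__ == "__main__":
    # Stand-in implementations so the sketch runs without network access.
    pages = {"http://bad.example": "win the lottery", "http://ok.example": "weather"}
    system = DetectionSystem(
        extract_links=lambda sms: [w for w in sms.split() if w.startswith("http")],
        fetch_page=lambda link: pages.get(link, ""),
        is_illegal_page=lambda html: "lottery" in html,
    )
    print(system.is_illegal_message("see http://ok.example and http://bad.example"))
    print(system.is_illegal_message("see http://ok.example"))
```

Passing the modules in as callables keeps the judging logic independent of how pages are fetched or filtered, which also makes it possible to test the decision rule with stand-in functions as done here.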
CN201610799866.2A 2016-08-31 2016-08-31 Illegal short message detection method and system Active CN106383862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610799866.2A CN106383862B (en) 2016-08-31 2016-08-31 Illegal short message detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610799866.2A CN106383862B (en) 2016-08-31 2016-08-31 Illegal short message detection method and system

Publications (2)

Publication Number Publication Date
CN106383862A CN106383862A (en) 2017-02-08
CN106383862B true CN106383862B (en) 2019-12-31

Family

ID=57938012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610799866.2A Active CN106383862B (en) 2016-08-31 2016-08-31 Illegal short message detection method and system

Country Status (1)

Country Link
CN (1) CN106383862B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960952A (en) * 2017-05-24 2018-12-07 阿里巴巴集团控股有限公司 A kind of detection method and device of violated information
CN107992578B (en) * 2017-12-06 2019-11-22 山西睿信智达传媒科技股份有限公司 The database automatic testing method in objectionable video source
CN110110577B (en) * 2019-01-22 2020-11-10 口碑(上海)信息技术有限公司 Method and device for identifying dish name, storage medium and electronic device
CN111597805B (en) * 2020-05-21 2021-01-05 上海创蓝文化传播有限公司 Method and device for auditing short message text links based on deep learning
CN113032658A (en) * 2021-02-25 2021-06-25 未鲲(上海)科技服务有限公司 Illegal word detection method, device and equipment and computer-readable storage medium
CN115408420B (en) * 2022-09-02 2023-08-01 自然资源部地图技术审查中心 Method and apparatus for automatically filtering map notes and points of interest using a computer

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
US20150309981A1 (en) * 2014-04-28 2015-10-29 Elwha Llc Methods, systems, and devices for outcome prediction of text submission to network based on corpora analysis
KR102150624B1 (en) * 2014-07-01 2020-09-01 삼성전자 주식회사 Method and apparatus for notifying smishing
CN105205090A (en) * 2015-05-29 2015-12-30 湖南大学 Web page text classification algorithm research based on web page link analysis and support vector machine
CN105335354A (en) * 2015-12-09 2016-02-17 中国联合网络通信集团有限公司 Cheat information recognition method and device

Also Published As

Publication number Publication date
CN106383862A (en) 2017-02-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant