WO2015054993A1 - Spam processing method and apparatus - Google Patents
Spam processing method and apparatus
- Publication number
- WO2015054993A1 (PCT/CN2014/074924)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- spam
- information
- seed
- determining
- content
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/12—Detection or prevention of fraud
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/12—Messaging; Mailboxes; Announcements
- H04W4/14—Short messaging services, e.g. short message services [SMS] or unstructured supplementary service data [USSD]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W88/00—Devices specially adapted for wireless communication networks, e.g. terminals, base stations or access point devices
- H04W88/18—Service support devices; Network management devices
- H04W88/184—Messaging devices, e.g. message centre
Definitions
- A short message service (SMS) is a method of sending and receiving short text information over a mobile communication network. Messages are handled by the Short Message Service Center (SMSC).
- SMSC Short Message Service Center
- GSM Global System for Mobile Communication
- CDMA Code Division Multiple Access
- PHS Personal Handyphone System
- WCDMA Wideband Code Division Multiple Access
- CDMA2000 Code Division Multiple Access 2000
- TD-SCDMA Time Division Synchronous Code Division Multiple Access
- As 3G networks develop rapidly, the short message service has broad development prospects and has become one of the services that mobile phone users use most frequently. Merchants increasingly favor it as a convenient, low-cost advertising channel. This brings a new problem: how to filter spam messages more efficiently.
- A related-art patent application, titled "a short message service system and a method for implementing short message filtering", sets spam filtering conditions in the short message center, authenticates any message that satisfies those conditions, and controls delivery of the short message according to the authentication result.
- In this way, real-time monitoring and real-time filtering of spam messages can be realized.
- Existing spam short message monitoring strategies mainly rely on traffic threshold rules, content keyword matching rules, destination number continuity, message delivery state, and the like. Rule-based monitoring of this kind is easily identified and circumvented by spammers.
- Spam messaging increasingly tends to be gang-like, low-frequency per number, and varied in content: hundreds of thousands of numbers take part in one spam campaign, each number sends only a small number of messages, and the content sent varies.
- Traditional traffic thresholds, content keyword matching, destination number characteristics, and similar rules have difficulty identifying such spam messages.
- The present invention provides a spam processing method and apparatus, to at least solve the problem that the related art cannot intercept gang-style spam as a whole.
- According to one aspect, a spam processing method is provided, including: acquiring a spam seed; taking the spam seed as the starting point and a predetermined bill file set as the crawler processing body, performing iterative crawling in which information content is used to crawl spam calling numbers and spam calling numbers are used to crawl information content; and determining that the calling number of information having a direct or indirect crawler-network relationship with the spam seed is a spam number, and/or determining that information having a direct or indirect crawler-network relationship with the spam seed is spam.
- Preferably, acquiring the spam seed includes at least one of: providing the spam seed from spam detected by the spam monitoring system; providing the spam seed from information that the short message center obtains from the information bill file; and providing the spam seed from spam that users have complained about.
- Preferably, determining that the calling number of information having a direct or indirect crawler-network relationship with the spam seed is a spam number includes: dividing the calling numbers of such information into a spam calling number gang; and determining, according to the spam calling number gang, that those calling numbers are spam numbers.
- Preferably, determining according to the spam calling number gang that the calling numbers are spam numbers includes: sorting the numbers in the spam calling number gang; obtaining the number of consecutive numbers within a predetermined interval after sorting; determining whether the number of consecutive numbers exceeds a first predetermined threshold; and, if the determination result is yes, determining that the calling numbers are spam numbers.
- Preferably, determining that information having a direct or indirect crawler-network relationship with the spam seed is spam includes: dividing such information into a spam content gang; and determining, according to the spam content gang, that the information is spam.
- Preferably, determining according to the spam content gang that the information is spam is done in at least one of the following manners: obtaining a similarity value between a piece of information in the spam content gang and the spam seed as the ratio of the number of characters they have in common to the maximum message length, and determining the information to be spam if the similarity value exceeds a second predetermined threshold; determining the number of times the information in the spam content gang was sent, and determining the information to be spam if that number exceeds a third predetermined threshold; determining the number of calling numbers that participated in sending the information in the spam content gang, and determining the information to be spam if that number exceeds a fourth predetermined threshold.
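The two count-based checks in this summary (number of sends and number of participating calling numbers) can be illustrated with a minimal Python sketch. It assumes bill records are available as `(calling_number, content)` pairs and that a content gang is a set of message texts. The function names, the record layout, and the choice to combine the two checks with a logical OR (following the "at least one of" wording) are assumptions made for illustration, not details given in the patent.

```python
def gang_counts(cdrs, gang_contents):
    """Return (number of sends, number of distinct callers) for a content gang.

    cdrs: iterable of (calling_number, content) pairs (assumed record layout).
    gang_contents: set of message texts that belong to the gang.
    """
    sends = 0
    callers = set()
    for caller, content in cdrs:
        if content in gang_contents:
            sends += 1
            callers.add(caller)
    return sends, len(callers)


def exceeds_gang_thresholds(cdrs, gang_contents, send_threshold, caller_threshold):
    """True if either count exceeds its threshold.

    send_threshold plays the role of the "third predetermined threshold" and
    caller_threshold the "fourth"; combining them with OR is an assumption.
    """
    sends, callers = gang_counts(cdrs, gang_contents)
    return sends > send_threshold or callers > caller_threshold
```

Keeping the counting separate from the decision makes it easy to log the raw counts for a manual reviewer when the automatic rule does not fire.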
- According to another aspect, a spam processing apparatus is provided, including: an obtaining module, configured to acquire a spam seed; a processing module, configured to take the spam seed as the starting point and a predetermined bill file set as the crawler processing body, and to perform iterative crawling in which information content is used to crawl spam calling numbers and spam calling numbers are used to crawl information content; a first determining module, configured to determine that the calling number of information having a direct or indirect crawler-network relationship with the spam seed is a spam number; and/or a second determining module, configured to determine that information having a direct or indirect crawler-network relationship with the spam seed is spam.
- Preferably, the obtaining module includes at least one of: a first providing unit, configured to provide the spam seed from spam detected by the spam monitoring system; a second providing unit, configured to provide the spam seed from information that the short message center obtains from the information bill file; and a third providing unit, configured to provide the spam seed from spam that users have complained about.
- Preferably, the first determining module includes: a first dividing unit, configured to divide the calling numbers of information having a direct or indirect crawler-network relationship with the spam seed into a spam calling number gang; and a first determining unit, configured to determine, according to the spam calling number gang, that those calling numbers are spam numbers.
- Preferably, the first determining unit includes: a sorting subunit, configured to sort the numbers in the spam calling number gang; an obtaining subunit, configured to obtain the number of consecutive numbers within a predetermined interval after sorting; a judging subunit, configured to determine whether the number of consecutive numbers exceeds a first predetermined threshold; and a first determining subunit, configured to determine that the calling numbers are spam numbers if the determination result is yes.
- Preferably, the second determining module includes: a second dividing unit, configured to divide information having a direct or indirect crawler-network relationship with the spam seed into a spam content gang; and a second determining unit, configured to determine, according to the spam content gang, that the information is spam.
- Preferably, the second determining unit includes at least one of: a second determining subunit, configured to obtain a similarity value between a piece of information in the spam content gang and the spam seed as the ratio of the number of characters they have in common to the maximum message length, and to determine the information to be spam if the similarity value exceeds a second predetermined threshold; a third determining subunit, configured to determine the number of times the information in the spam content gang was sent, and to determine the information to be spam if that number exceeds a third predetermined threshold; and a fourth determining subunit, configured to determine the number of calling numbers that participated in sending the information in the spam content gang, and to determine the information to be spam if that number exceeds a fourth predetermined threshold.
- According to the present invention, a spam seed is acquired; taking the spam seed as the starting point and a predetermined bill file set as the crawler processing body, iterative crawling is performed in which information content is used to crawl spam calling numbers and spam calling numbers are used to crawl information content; and the calling number of information having a direct or indirect crawler-network relationship with the spam seed is determined to be a spam number, and/or such information is determined to be spam. This solves the problem that the related art cannot intercept gang-style spam as a whole, makes it possible to effectively identify spam calling number gangs and spam content gangs, and greatly improves the effect of spam management.
- FIG. 1 is a flowchart of a spam processing method according to an embodiment of the present invention
- FIG. 2 is a block diagram showing the structure of a spam processing apparatus according to an embodiment of the present invention
- FIG. 3 is a block diagram showing a preferred structure of the obtaining module 22 in the spam processing apparatus according to an embodiment of the present invention
- FIG. 4 is a block diagram showing a preferred structure of the first determining module 26 in the spam processing apparatus according to an embodiment of the present invention
- FIG. 5 is a block diagram showing a preferred structure of the first determining unit 44 in the first determining module 26 in the spam processing apparatus according to an embodiment of the present invention
- FIG. 6 is a block diagram showing a preferred structure of the second determining module 28 in the spam processing apparatus according to an embodiment of the present invention
- FIG. 7 is a block diagram showing a preferred structure of the second determining unit 64 in the second determining module 28 in the spam processing apparatus according to an embodiment of the present invention
- FIG. 8 is a system architecture diagram of spam crawling processing according to a preferred embodiment of the present invention
- FIG. 9 is a schematic diagram of spam crawling processing according to a preferred embodiment of the present invention
- FIG. 10 is a logic flow diagram of a crawler iterative process according to a preferred embodiment of the present invention
- FIG. 1 is a flowchart of a spam processing method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
- Step S102: acquire a spam seed.
- Step S104: taking the spam seed as the starting point and the predetermined bill file set as the crawler processing body, perform iterative crawling in which information content is used to crawl spam calling numbers and spam calling numbers are used to crawl information content.
- Step S106: determine that the calling number of information having a direct or indirect crawler-network relationship with the spam seed is a spam number, and/or determine that information having a direct or indirect crawler-network relationship with the spam seed is spam.
- Through the above steps, iterative crawling is performed starting from spam content or spam calling numbers. The related art applies only simple traffic threshold rules and content keyword matching rules to spam and cannot effectively identify gang-style spam operations. The above steps not only solve the problem that the related art cannot intercept gang-style spam as a whole, but also make it possible to effectively identify spam calling number gangs and spam content gangs.
- It should be noted that the spam seed may be acquired in multiple ways. At least one of the following may be used: the spam seed is provided from spam detected by the spam monitoring system; the spam seed is provided from information that the short message center obtains from the information bill (CDR) file; the spam seed is provided from spam that users have complained about.
- The two determinations, namely that the calling number of information having a direct or indirect crawler-network relationship with the spam seed is a spam number, and that information having such a relationship is spam, may each be made in multiple ways. Relatively simple processing methods for the two determinations are described below.
- Determining that the calling number is a spam number may proceed as follows: first, the calling numbers of information having a direct or indirect crawler-network relationship with the spam seed are divided into a spam calling number gang; then, according to the spam calling number gang, those calling numbers are determined to be spam numbers. The second step may include: sorting the numbers in the spam calling number gang; obtaining the number of consecutive numbers within a predetermined interval after sorting; determining whether the number of consecutive numbers exceeds a first predetermined threshold; and, if the determination result is yes, determining that the calling numbers are spam numbers.
- Determining that the information is spam may proceed as follows: first, the information having a direct or indirect crawler-network relationship with the spam seed is divided into a spam content gang; then, according to the spam content gang, that information is determined to be spam.
- The determination according to the spam content gang may also be made in a variety of ways, for example by at least one of the following: obtaining a similarity value between a piece of information in the spam content gang and the spam seed as the ratio of the number of characters they have in common to the maximum message length, and determining the information to be spam if the similarity value exceeds a second predetermined threshold; determining the number of times the information in the spam content gang was sent, and determining the information to be spam if that number exceeds a third predetermined threshold; determining the number of calling numbers that participated in sending the information in the spam content gang, and determining the information to be spam if that number exceeds a fourth predetermined threshold.
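- This embodiment also provides a spam processing apparatus, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated.
- As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function.
- FIG. 2 is a structural block diagram of a spam processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes an obtaining module 22, a processing module 24, and a first determining module 26 and/or a second determining module 28. The apparatus is described below.
- The obtaining module 22 is configured to acquire the spam seed.
- The processing module 24 is connected to the obtaining module 22 and is configured to take the spam seed as the starting point and the predetermined bill file set as the crawler processing body, and to perform iterative crawling in which information content is used to crawl spam calling numbers and spam calling numbers are used to crawl information content.
- The first determining module 26 is configured to determine that the calling number of information having a direct or indirect crawler-network relationship with the spam seed is a spam number; the second determining module 28 is configured to determine that information having a direct or indirect crawler-network relationship with the spam seed is spam.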
- FIG. 3 is a block diagram showing a preferred structure of the obtaining module 22 in the spam processing apparatus according to the embodiment of the present invention. As shown in FIG. 3, the obtaining module 22 includes a first providing unit 32, a second providing unit 34, and a third providing unit 36. The obtaining module 22 is described below.
- The first providing unit 32 is configured to provide the spam seed from spam detected by the spam monitoring system; the second providing unit 34 is configured to provide the spam seed from information that the short message center obtains from the information bill file; the third providing unit 36 is configured to provide the spam seed from spam that users have complained about.
- FIG. 4 is a block diagram of a preferred structure of the first determining module 26 in the spam processing apparatus according to the embodiment of the present invention. As shown in FIG. 4, the first determining module 26 includes a first dividing unit 42 and a first determining unit 44. The first determining module 26 is described below.
- The first dividing unit 42 is configured to divide the calling numbers of information having a direct or indirect crawler-network relationship with the spam seed into a spam calling number gang; the first determining unit 44 is connected to the first dividing unit 42 and is configured to determine, according to the spam calling number gang, that those calling numbers are spam numbers.
- FIG. 5 is a block diagram showing a preferred structure of the first determining unit 44 in the first determining module 26 in the spam processing apparatus according to the embodiment of the present invention. As shown in FIG. 5, the first determining unit 44 includes a sorting subunit 52, an obtaining subunit 54, a judging subunit 56, and a first determining subunit 58. The first determining unit 44 is described below.
- The sorting subunit 52 is configured to sort the numbers in the spam calling number gang; the obtaining subunit 54 is connected to the sorting subunit 52 and is configured to obtain the number of consecutive numbers within the predetermined interval after sorting; the judging subunit 56 is connected to the obtaining subunit 54 and is configured to determine whether the number of consecutive numbers exceeds a first predetermined threshold; the first determining subunit 58 is connected to the judging subunit 56 and is configured to determine, if the determination result is yes, that the calling numbers are spam numbers.
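The sort, count, and compare pipeline of these subunits can be sketched in Python as follows. This is one possible reading, not the patent's definitive algorithm: it assumes calling numbers are digit strings, treats two sorted neighbours as "consecutive" when they differ by at most `max_gap`, and takes the longest such run as the "number of consecutive numbers". The names `max_gap` and `min_run` are hypothetical; they correspond roughly to the interval difference Dm and the threshold Hc that the rule description later in this document mentions.

```python
def longest_close_run(numbers, max_gap):
    """Length of the longest run of sorted numbers whose neighbours differ by <= max_gap."""
    ordered = sorted({int(n) for n in numbers})  # sorting step; duplicates dropped
    if not ordered:
        return 0
    best = run = 1
    for prev, cur in zip(ordered, ordered[1:]):
        # Extend the current run if the neighbour is close enough, else restart it.
        run = run + 1 if cur - prev <= max_gap else 1
        best = max(best, run)
    return best


def has_consecutive_feature(numbers, max_gap, min_run):
    """True if the run of near-consecutive numbers exceeds the threshold min_run."""
    return longest_close_run(numbers, max_gap) > min_run
```

For example, with `max_gap=2` the numbers `13800000001`, `13800000002`, and `13800000004` form a run of three, while an unrelated number such as `13900000000` does not extend it.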
- FIG. 6 is a block diagram showing a preferred structure of the second determining module 28 in the spam processing apparatus according to the embodiment of the present invention. As shown in FIG. 6, the second determining module 28 includes a second dividing unit 62 and a second determining unit 64. The second determining module 28 is described below.
- The second dividing unit 62 is configured to divide the information having a direct or indirect crawler-network relationship with the spam seed into a spam content gang; the second determining unit 64 is connected to the second dividing unit 62 and is configured to determine, according to the spam content gang, that the information is spam.
- FIG. 7 is a block diagram showing a preferred structure of the second determining unit 64 in the second determining module 28 in the spam processing apparatus according to the embodiment of the present invention. As shown in FIG. 7, the second determining unit 64 includes at least one of the following: a second determining subunit 72, a third determining subunit 74, and a fourth determining subunit 76. The second determining unit 64 is described below.
- The second determining subunit 72 is configured to obtain a similarity value between a piece of information in the spam content gang and the spam seed as the ratio of the number of characters they have in common to the maximum message length, and to determine the information to be spam if the similarity value exceeds the second predetermined threshold.
- The third determining subunit 74 is configured to determine the number of times the information in the spam content gang was sent, and to determine the information to be spam if that number exceeds the third predetermined threshold.
- The fourth determining subunit 76 is configured to determine the number of calling numbers that participated in sending the information in the spam content gang, and to determine the information to be spam if that number exceeds the fourth predetermined threshold.
- In related-art spam short message management, monitoring technologies based on traffic thresholds and keyword rules have become relatively mature. Spammers circumvent these rules by spreading a campaign over a group of numbers and sending at low frequency from each number.
- This embodiment provides an effective method for identifying spam messages that are sent by gangs at low frequency.
- The spam analysis and recognition method is a crawler-based spam recognition method, that is, a crawler recognition technique that iterates repeatedly between spam calling numbers and spam messages.
- The real-time monitoring system can identify some spam messages through its various monitoring strategies, and further spam messages can be obtained from the operator's manual complaint platform, short message manual review stations, and the like; alternatively, a rough spam seed collection can be generated from suspected spam messages. Using these spam messages as seeds, a set of seed message content lists is generated. Then, taking each spam message in the seed list as a starting point and the short message history bill file set for a certain period as the crawler processing body, the crawler alternately uses message content to crawl spam calling numbers, uses the spam calling numbers to crawl short message content, uses that content to crawl further spam calling numbers, and so on, layer by layer, until every short message that has a direct or indirect crawler-network relationship with the seed spam message content has been crawled out.
- According to the crawler-network relationship, the calling numbers of the crawled-out spam messages are identified as one spam calling number gang, and all spam messages with direct or indirect connections are identified as one spam message content gang; multiple gangs can eventually be identified. The spam calling number gangs and spam message content gangs are then evaluated and reviewed, either automatically on the basis of rules or manually at the maintenance center.
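The layer-by-layer crawl described above amounts to a breadth-first traversal of a bipartite graph whose nodes are calling numbers and message contents. The following Python sketch illustrates the idea under simplifying assumptions: bill records are `(calling_number, content)` pairs, two messages are linked only when their text is identical (the patent notes that real spam content varies, so a production system would need fuzzier matching), and the function name `crawl` and its parameters are hypothetical.

```python
from collections import defaultdict


def crawl(cdrs, seed_contents=(), seed_callers=()):
    """Return (calling number gang, content gang) reachable from the seeds.

    cdrs: iterable of (calling_number, content) pairs (assumed record layout).
    """
    by_content = defaultdict(set)  # content -> callers that sent it
    by_caller = defaultdict(set)   # caller  -> contents it sent
    for caller, content in cdrs:
        by_content[content].add(caller)
        by_caller[caller].add(content)

    callers, contents = set(seed_callers), set(seed_contents)
    frontier_callers, frontier_contents = set(callers), set(contents)
    while frontier_callers or frontier_contents:
        # Content -> calling number step.
        next_callers = set()
        for content in frontier_contents:
            next_callers |= by_content.get(content, set())
        next_callers -= callers
        # Calling number -> content step.
        next_contents = set()
        for caller in frontier_callers:
            next_contents |= by_caller.get(caller, set())
        next_contents -= contents
        callers |= next_callers
        contents |= next_contents
        frontier_callers, frontier_contents = next_callers, next_contents
    return callers, contents


# Three numbers jointly send six messages A-F; "A" is the seed.
records = [("n1", "A"), ("n1", "B"), ("n2", "B"), ("n2", "C"), ("n2", "D"),
           ("n3", "D"), ("n3", "E"), ("n3", "F"), ("n4", "hello")]
number_gang, content_gang = crawl(records, seed_contents={"A"})
```

Each loop iteration only expands the nodes discovered in the previous layer, so the crawl stops as soon as a layer yields nothing new; the unrelated sender `n4` is never reached from seed `A`.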
- The crawler-based spam short message identification method proposed in this embodiment and the preferred embodiments is an after-the-fact monitoring method based on bill files.
- With this scheme it is possible to identify short messages sent by gangs at low frequency, that is, group-to-group spam messages, and to recognize calling number gangs and spam content gangs, which can greatly improve the effect of spam message management.
- The system implementing this solution is independent of the existing real-time monitoring subsystem and does not affect short message delivery or the real-time monitoring message flow.
- The present invention does not limit the message type or network type, and can analyze the short message services of wireless communication networks such as Global System for Mobile Communication (GSM), Code Division Multiple Access (CDMA), and Personal Handyphone System (PHS).
- Preferred embodiments of the present invention will now be described with reference to the accompanying drawings.
- As shown in FIG. 8, the system includes: a spam real-time monitoring system 8, a short message center 11, a manual auditing platform (or operator spam short message complaint platform) 9, a spam crawler analysis and mining system 10, an operation and maintenance subsystem (or operation and maintenance station) 7, and a home location register (HLR) 6.
- The spam crawler analysis and mining system 10 is the core processing module of the system. One of its inputs is the short message history bill, which can be (1) provided by the spam real-time monitoring system 8 or (2) obtained directly from the short message center 11.
- The other input is spam short messages, which are provided by the manual auditing platform 9. The manual auditing platform 9 is a third-party maintenance platform built by the operator, and it sends spam messages to the mining system 10.
- The operation and maintenance station 7 evaluates and audits the mined gang numbers and gang messages: the spam crawler analysis and mining system 10 sends the gang numbers and gang messages it has mined to station 7, and after review station 7 sends the confirmed gang numbers and gang spam message content to the spam real-time monitoring system 8 for blacklisting, content keyword updates, and the like.
- HLR 6: the mining system sends the gang numbers that send spam messages to the HLR, which blacklists them so that their short messages are intercepted. This module is optional.
- Interface 1 is the short message history CDR input interface of the crawler mining system.
- Another interface is the spam short message seed sample input interface.
- Interface 3 is both a short message history CDR input interface of the crawler mining system (history CDRs may instead be supplied through interface 1, in which case this interface does not provide them) and the input interface for spam short message seed samples, that is, spam detected by the real-time monitoring system.
- Another interface sends the gang numbers that send spam messages and the gang message content to the real-time monitoring system for blacklisting; the short message content is also sent to the real-time system as a reference for keyword rule configuration.
- Another interface sends spam messages to the mining system 10.
- Another interface sends the gang numbers and gang message content obtained by analysis to the operation and maintenance station for audit and evaluation.
- In this solution each of these interfaces is implemented over FTP, but the interfaces are not limited to this mode.
- Evaluation and audit: when a suspicious short message number is used as the seed number for crawling, normal short messages may be crawled out as well. In this case the spam calling number gang and the spam message content gang need to be evaluated and reviewed. The audit can be processed automatically on the basis of rules or sent to the maintenance center for manual review.
- This scheme can adopt automatic processing based on the following rules:
- (1) Continuity detection of the member numbers of a calling number gang. The numbers in the gang are sorted and the differences between adjacent numbers are calculated. Given a minimum interval difference Dm for adjacent numbers and a minimum consecutive number threshold Hc, if the count of consecutive numbers within Dm exceeds Hc, the calling number gang is considered to have the consecutive number feature. Once this feature is met, the gang is determined to be a valid spam messaging gang.
- (2) Similarity detection of the short message content in a spam message content gang. The scheme uses the ratio of the number of common characters between two messages to the maximum message length to decide whether they are similar. A threshold S is set; for example, S can be set to 0.7, meaning that two messages are considered similar in content if 70% of their characters are the same.
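The similarity rule can be made concrete with a short Python sketch. The text does not say how "the number of common characters" is counted; the sketch assumes a multiset intersection of characters (each character counted as many times as it appears in both messages), which is only one possible reading. The function names, the default threshold of 0.7 taken from the example value of S, and treating a score equal to the threshold as similar are likewise assumptions.

```python
from collections import Counter


def similarity(a: str, b: str) -> float:
    """Shared-character count divided by the length of the longer message."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty messages: define similarity as 0
    # Counter & Counter keeps, per character, the smaller of the two counts.
    common = sum((Counter(a) & Counter(b)).values())
    return common / longest


def is_similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """True if the similarity reaches the threshold S (boundary handling assumed)."""
    return similarity(a, b) >= threshold
```

Because this reading ignores character order, it is cheap to compute but can overrate messages that are anagrams of each other; an order-aware measure such as the longest common subsequence would be a stricter alternative.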
- FIG. 9 is a schematic diagram of spam crawling processing according to a preferred embodiment of the present invention. As shown in FIG. 9, the processing is described by taking spam short messages as an example. Three spam calling numbers together send six kinds of spam messages (MessageA to MessageF), and each number takes part in sending some of them.
- MessageA (message A) is a spam message that a user reported to the complaint platform.
- The crawler system uses MessageA as the seed.
- 10 is a logic flow diagram of a crawler iterative process according to a preferred embodiment of the present invention. As shown in FIG. 10, the crawler process iteration is divided into two main iterative processes: crawling out the short message content by the calling number and crawling the calling number out of the short message content. .
- the input is divided into three types, spam message content, spam caller number, and suspicious spam caller number, among which "spam message content” is used to generate content seed as the starting point of the crawler; among them “spam caller number” Or “suspicious spam caller number” is used to generate the calling number seed as the starting point for the crawler.
- taking spam message content, suspicious calling numbers, and blacklist numbers as example starting sources, the spam content, suspicious numbers, or blacklist numbers are first written to the to-be-crawled list.
- a spam SMS crawler analysis and mining system is implemented.
- test results show that gang-style, low-frequency message sending (that is, group-to-group spam sending) can be identified, that calling-number gangs and spam-content gangs can be recognized, and that the effectiveness of spam SMS management can be greatly improved.
- the above modules or steps of the present invention can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices.
- they may be implemented by program code executable by the computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from the one given here.
- the spam information processing method and apparatus provided by the embodiments of the present invention have the following beneficial effects: they solve the problem in the related art that the spam of an entire gang cannot be intercepted, so that spam calling-number gangs and spam-content gangs can be effectively identified and the effectiveness of spam management is greatly improved.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Transfer Between Computers (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
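The present invention provides a spam information processing method and apparatus. The method includes: acquiring a spam information seed; taking the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, performing iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content; and determining that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number, and/or determining that a message having a direct or indirect crawler-web relationship with the spam information seed is spam. The invention solves the problem in the related art that the spam of an entire gang cannot be intercepted, and thereby makes it possible to effectively identify spam calling-number gangs and spam-content gangs and to greatly improve the effectiveness of spam management.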
Description
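Spam information processing method and apparatus

Technical Field

The present invention relates to the field of communications, and in particular to a spam information processing method and apparatus.

Background

The short message service is a way of sending and receiving brief text messages over mobile communication networks. Messages are received, stored, and forwarded by the Short Message Service Center (SMSC). The service is widely used in all mobile communication networks: Global System for Mobile Communication (GSM), Code Division Multiple Access (CDMA), Personal Handyphone System (PHS), and 3G networks such as Wideband Code Division Multiple Access (WCDMA), CDMA2000, and Time Division-Synchronous Code Division Multiple Access (TD-SCDMA). It has developed rapidly, has broad prospects, and has become one of the services most frequently used by mobile phone users. Many merchants also increasingly favor it as a convenient and inexpensive advertising channel. This, however, raises a new problem: how to filter spam short messages more efficiently.

The patent document with application number CN200510086930, entitled "A short message service system and method for implementing short message filtering", proposes setting spam filtering conditions in the short message center, authenticating the messages that meet the conditions, and controlling delivery of the short messages according to the authentication result. This enables real-time monitoring and real-time filtering of spam SMS.

In addition, spam SMS monitoring strategies in the related art mainly rely on traffic threshold rules, content keyword matching rules, destination number continuity, message delivery status, and the like. Rule-based monitoring is easily recognized and circumvented by spam senders. Moreover, spam sending now tends to be gang-based, low-frequency per number, and variable in content: hundreds or thousands of numbers take part in sending one kind of spam, each number sends only a few messages, and the content varies. Conventional approaches based on traffic thresholds, content keyword matching, destination number features, and so on have difficulty identifying such spam effectively; they can usually intercept only part of it and can hardly identify and intercept the whole gang. The related art therefore has the problem that the spam of an entire gang cannot be intercepted.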
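Summary

The present invention provides a spam information processing method and apparatus, to at least solve the problem in the related art that the spam of an entire gang cannot be intercepted.

According to one aspect of the present invention, a spam information processing method is provided, including: acquiring a spam information seed; taking the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, performing iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content; and determining that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number, and/or determining that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.

Preferably, acquiring the spam information seed includes at least one of the following: the spam information seed is provided by spam detected by a spam monitoring system; the spam information seed is provided by information in message bill record files obtained from the short message center; the spam information seed is provided by spam reported in user complaints.

Preferably, determining that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number includes: dividing the message calling numbers having a direct or indirect crawler-web relationship with the spam information seed into a spam calling-number gang set; and determining, according to the spam calling-number gang set, that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number.

Preferably, determining, according to the spam calling-number gang set, that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number includes: sorting the numbers in the spam calling-number gang set; obtaining the count of consecutive numbers within a predetermined interval after sorting; judging whether the count of consecutive numbers exceeds a first predetermined threshold; and, if so, determining that the message calling number is the spam number.

Preferably, determining that the message having a direct or indirect crawler-web relationship with the spam information seed is spam includes: dividing the messages having a direct or indirect crawler-web relationship with the spam information seed into a spam content gang set; and determining, according to the spam content gang set, that the message having a direct or indirect crawler-web relationship with the spam information seed is spam.

Preferably, determining, according to the spam content gang set, that the message having a direct or indirect crawler-web relationship with the spam information seed is spam is done in at least one of the following ways: obtaining a similarity value between the message and the spam information seed as the ratio of the number of characters common to the message in the spam content gang set and the spam information seed to the maximum message length, and determining that the message is spam when the similarity value exceeds a second predetermined threshold; judging the send count of the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and determining that the message is spam when the send count exceeds a third predetermined threshold; judging the number of calling numbers participating in sending the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and determining that the message is spam when the number of participating calling numbers exceeds a fourth predetermined threshold.

According to another aspect of the present invention, a spam information processing apparatus is provided, including: an acquisition module, configured to acquire a spam information seed; a processing module, configured to take the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, and to perform iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content; a first determination module, configured to determine that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number; and/or a second determination module, configured to determine that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.

Preferably, the acquisition module includes at least one of the following: a first providing unit, configured so that spam detected by the spam monitoring system provides the spam information seed; a second providing unit, configured so that information in message bill record files obtained from the short message center provides the spam information seed; a third providing unit, configured so that spam reported in user complaints provides the spam information seed.

Preferably, the first determination module includes: a first dividing unit, configured to divide the message calling numbers having a direct or indirect crawler-web relationship with the spam information seed into a spam calling-number gang set; and a first decision unit, configured to determine, according to the spam calling-number gang set, that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number.

Preferably, the first decision unit includes: a sorting subunit, configured to sort the numbers in the spam calling-number gang set; an obtaining subunit, configured to obtain the count of consecutive numbers within a predetermined interval after sorting; a judging subunit, configured to judge whether the count of consecutive numbers exceeds a first predetermined threshold; and a first determining subunit, configured to determine, if so, that the message calling number is the spam number.

Preferably, the second determination module includes: a second dividing unit, configured to divide the messages having a direct or indirect crawler-web relationship with the spam information seed into a spam content gang set; and a second decision unit, configured to determine, according to the spam content gang set, that the message having a direct or indirect crawler-web relationship with the spam information seed is spam.

Preferably, the second decision unit includes at least one of the following: a second determining subunit, configured to obtain a similarity value between the message and the spam information seed as the ratio of the number of characters common to the message in the spam content gang set and the spam information seed to the maximum message length, and to determine that the message is spam when the similarity value exceeds a second predetermined threshold; a third determining subunit, configured to judge the send count of the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and to determine that the message is spam when the send count exceeds a third predetermined threshold; a fourth determining subunit, configured to judge the number of calling numbers participating in sending the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and to determine that the message is spam when the number of participating calling numbers exceeds a fourth predetermined threshold.

Through the present invention, a spam information seed is acquired; taking the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, iterative crawling is performed in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content; and a message calling number having a direct or indirect crawler-web relationship with the spam information seed is determined to be a spam number, and/or a message having a direct or indirect crawler-web relationship with the spam information seed is determined to be spam. This solves the problem in the related art that the spam of an entire gang cannot be intercepted, and thereby makes it possible to effectively identify spam calling-number gangs and spam-content gangs and to greatly improve the effectiveness of spam management.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description explain the invention and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of the spam information processing method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of the spam information processing apparatus according to an embodiment of the present invention;
FIG. 3 is a preferred structural block diagram of the acquisition module 22 in the spam information processing apparatus according to an embodiment of the present invention;
FIG. 4 is a preferred structural block diagram of the first determination module 26 in the spam information processing apparatus according to an embodiment of the present invention;
FIG. 5 is a preferred structural block diagram of the first decision unit 44 in the first determination module 26 of the spam information processing apparatus according to an embodiment of the present invention;
FIG. 6 is a preferred structural block diagram of the second determination module 28 in the spam information processing apparatus according to an embodiment of the present invention;
FIG. 7 is a preferred structural block diagram of the second decision unit 64 in the second determination module 28 of the spam information processing apparatus according to an embodiment of the present invention;
FIG. 8 is a system architecture diagram of spam crawler processing according to a preferred embodiment of the present invention;
FIG. 9 is a schematic diagram of spam crawler processing according to a preferred embodiment of the present invention;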
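FIG. 10 is a logic flow diagram of the crawler iteration according to a preferred embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the drawings and in conjunction with embodiments. Note that, where no conflict arises, the embodiments in this application and the features in them may be combined with one another.

This embodiment provides a spam information processing method. FIG. 1 is a flowchart of the spam information processing method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:

Step S102: acquire a spam information seed;

Step S104: taking the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, perform iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content;

Step S106: determine that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number, and/or determine that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.

Through the above steps, iterative crawling is performed from spam content or from spam calling numbers. The related art applies only simple traffic threshold rules and content keyword matching rules to spam and cannot effectively identify gang-based spam activity. By comparison, these steps solve the problem in the related art that the spam of an entire gang cannot be intercepted, and make it possible to effectively identify spam calling-number gangs and spam-content gangs, greatly improving the effectiveness of spam management.

Note that the spam information seed may be acquired in several ways. For example, it may be acquired in at least one of the following ways: the seed is provided by spam detected by a spam monitoring system; the seed is provided by information in message bill record files obtained from the short message center; the seed is provided by spam reported in user complaints.

Determining that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number, and/or determining that a message having a direct or indirect crawler-web relationship with the spam information seed is spam, may also be handled in the relatively simple ways below. The two processing steps are described separately.

Determining that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number may be handled as follows: first, divide the message calling numbers having a direct or indirect crawler-web relationship with the seed into a spam calling-number gang set; then, determine according to the spam calling-number gang set that the message calling number is a spam number. This determination may include: sorting the numbers in the spam calling-number gang set; obtaining the count of consecutive numbers within a predetermined interval after sorting; judging whether the count of consecutive numbers exceeds a first predetermined threshold; and, if so, determining that the message calling number is a spam number.

Determining that a message having a direct or indirect crawler-web relationship with the spam information seed is spam may be handled as follows: divide the messages having a direct or indirect crawler-web relationship with the seed into a spam content gang set; then determine according to the spam content gang set that the message is spam. This determination may also be made in several ways, for example in at least one of the following: obtaining a similarity value between the message and the seed as the ratio of the number of characters common to the message in the spam content gang set and the seed to the maximum message length, and determining that the message is spam when the similarity value exceeds a second predetermined threshold; judging the send count of the message in the spam content gang set that has a direct or indirect crawler-web relationship with the seed, and determining that the message is spam when the send count exceeds a third predetermined threshold; judging the number of calling numbers participating in sending the message in the spam content gang set that has a direct or indirect crawler-web relationship with the seed, and determining that the message is spam when the number of participating calling numbers exceeds a fourth predetermined threshold.

This embodiment also provides a spam information processing apparatus, which implements the above embodiments and preferred implementations; what has already been explained is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

FIG. 2 is a structural block diagram of the spam information processing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes an acquisition module 22, a processing module 24, a first determination module 26, and/or a second determination module 28. The apparatus is described below.

The acquisition module 22 is configured to acquire a spam information seed. The processing module 24 is connected to the acquisition module 22 and is configured to take the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, and to perform iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content. The first determination module 26 is connected to the processing module 24 and is configured to determine that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number; and/or the second determination module 28 is connected to the processing module 24 and is configured to determine that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.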
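As a rough illustration of how a set of bill record files can serve as the body crawled in step S104, the sketch below indexes records in both directions and performs one content-to-number-to-content round. It is a minimal sketch, not the patented implementation: the `(caller, content)` record shape, the function names `build_indexes` and `one_round`, and the sample numbers are assumptions made for illustration.

```python
from collections import defaultdict


def build_indexes(records):
    """Index bill records in both directions.

    records: iterable of (caller_number, message_content) pairs
    (an assumed, simplified view of one bill record).
    """
    callers_by_content = defaultdict(set)
    contents_by_caller = defaultdict(set)
    for caller, content in records:
        callers_by_content[content].add(caller)
        contents_by_caller[caller].add(content)
    return callers_by_content, contents_by_caller


def one_round(seed_content, callers_by_content, contents_by_caller):
    """Crawl callers from a seed content, then contents from those callers."""
    callers = set(callers_by_content.get(seed_content, ()))
    contents = set()
    for caller in callers:
        contents |= contents_by_caller.get(caller, set())
    return callers, contents


# Hypothetical sample data, not taken from the source.
SAMPLE_RECORDS = [
    ("13800000001", "win a prize"),
    ("13800000001", "cheap loans"),
    ("13800000002", "cheap loans"),
    ("13900000009", "see you at 8"),
]
```

Repeating `one_round` on every newly found content until nothing new appears gives the iterative crawl; the second sender of "cheap loans" would be reached in the next round.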
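FIG. 3 is a preferred structural block diagram of the acquisition module 22 in the spam information processing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the acquisition module 22 includes a first providing unit 32, a second providing unit 34, and a third providing unit 36. The acquisition module 22 is described below.

The first providing unit 32 is configured so that spam detected by the spam monitoring system provides the spam information seed; the second providing unit 34 is configured so that information in message bill record files obtained from the short message center provides the spam information seed; the third providing unit 36 is configured so that spam reported in user complaints provides the spam information seed.

FIG. 4 is a preferred structural block diagram of the first determination module 26 in the spam information processing apparatus according to an embodiment of the present invention. As shown in FIG. 4, the first determination module 26 includes a first dividing unit 42 and a first decision unit 44. The first determination module 26 is described below.

The first dividing unit 42 is configured to divide the message calling numbers having a direct or indirect crawler-web relationship with the spam information seed into a spam calling-number gang set. The first decision unit 44 is connected to the first dividing unit 42 and is configured to determine, according to the spam calling-number gang set, that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number.

FIG. 5 is a preferred structural block diagram of the first decision unit 44 in the first determination module 26 of the spam information processing apparatus according to an embodiment of the present invention. As shown in FIG. 5, the first decision unit 44 includes a sorting subunit 52, an obtaining subunit 54, a judging subunit 56, and a first determining subunit 58. The first decision unit 44 is described below.

The sorting subunit 52 is configured to sort the numbers in the spam calling-number gang set. The obtaining subunit 54 is connected to the sorting subunit 52 and is configured to obtain the count of consecutive numbers within a predetermined interval after sorting. The judging subunit 56 is connected to the obtaining subunit 54 and is configured to judge whether the count of consecutive numbers exceeds a first predetermined threshold. The first determining subunit 58 is connected to the judging subunit 56 and is configured to determine, if so, that the message calling number is a spam number.

FIG. 6 is a preferred structural block diagram of the second determination module 28 in the spam information processing apparatus according to an embodiment of the present invention. As shown in FIG. 6, the second determination module 28 includes a second dividing unit 62 and a second decision unit 64. The second determination module 28 is described below.

The second dividing unit 62 is configured to divide the messages having a direct or indirect crawler-web relationship with the spam information seed into a spam content gang set. The second decision unit 64 is connected to the second dividing unit 62 and is configured to determine, according to the spam content gang set, that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.

FIG. 7 is a preferred structural block diagram of the second decision unit 64 in the second determination module 28 of the spam information processing apparatus according to an embodiment of the present invention. As shown in FIG. 7, the second decision unit 64 includes at least one of the following: a second determining subunit 72, a third determining subunit 74, and a fourth determining subunit 76. The second decision unit 64 is described below.

The second determining subunit 72 is configured to obtain a similarity value between the message and the spam information seed as the ratio of the number of characters common to the message in the spam content gang set and the seed to the maximum message length, and to determine that the message is spam when the similarity value exceeds a second predetermined threshold. The third determining subunit 74 is configured to judge the send count of the message in the spam content gang set that has a direct or indirect crawler-web relationship with the seed, and to determine that the message is spam when the send count exceeds a third predetermined threshold. The fourth determining subunit 76 is configured to judge the number of calling numbers participating in sending the message in the spam content gang set that has a direct or indirect crawler-web relationship with the seed, and to determine that the message is spam when the number of participating calling numbers exceeds a fourth predetermined threshold.

As spam SMS management in the related art has developed, monitoring techniques based on traffic thresholds and keyword rules have become relatively mature. To bypass these rules, spam senders tend to use groups of numbers, with each number sending at low frequency. Based on the group-to-group character of spam sending and on spam monitoring results, this embodiment provides an effective method for monitoring and identifying gang-based, low-frequency spam SMS. The method is a crawler-based spam SMS identification method, that is, a crawler identification technique that iterates repeatedly between spam calling numbers and spam messages.

The real-time monitoring system can identify some spam SMS through its various monitoring strategies, the mobile manual complaint platform can supply some spam SMS, and the manual SMS review desk and the like can yield some confirmed spam SMS. Alternatively, a rough spam seed set can be generated from suspected messages. With these spam messages as seeds, a list set of seed message contents is generated. Then, taking each spam message in the seed list set as a starting point and the set of historical SMS bill record files for a certain period as the body to be crawled, the system in turn crawls spam calling numbers from message content, crawls message content from spam calling numbers, crawls spam calling numbers from message content again, and so on, iterating layer by layer until the messages having a direct or indirect crawler-web relationship with the seed content have been crawled out.

Next, according to the crawler-web relationship, all spam calling numbers that are directly or indirectly connected are identified as one spam calling-number gang set, and all spam contents that are directly or indirectly connected are identified as one spam content gang set. In the end, several gang sets may be identified.

Next, the calling-number gang sets and the spam content gang sets are evaluated and reviewed. The review may be processed automatically by rules or sent to the maintenance center for manual review. The review may jointly consider the size of the calling-number gang set, the continuity of the member numbers of the calling-number gang set, whether the message contents in the spam content gang set are similar (spam is generally sent with added noise, so similarity between content bodies can indicate whether messages are spam), the send count of each kind of message in the spam content gang set, and the total send count, to further judge whether the messages are spam.

Then the confirmed calling-number gang set is sent, as a blacklist gang set, to the real-time monitoring system, the Home Location Register (HLR), or the short message center to serve as blacklist numbers. The content list of the confirmed spam content gang set is sent to the real-time monitoring system or the operation and maintenance center as a spam sample set and as a reference set for content keyword identification.

Note that the crawler-based spam SMS identification method proposed in this embodiment and its preferred implementations is an after-the-fact monitoring method based on bill records. With this scheme, gang-based low-frequency sending (that is, group-to-group spam sending) can be identified, calling-number gangs and spam-content gangs can be recognized, and the effectiveness of spam SMS management can be greatly improved.

In addition, the system implementing the scheme is independent of the existing real-time monitoring subsystem and has no effect on message delivery or on the real-time monitoring message flow. Moreover, the present invention does not restrict the message type or network type and can analyze the SMS services of wireless communication networks such as Global System for Mobile Communication (GSM), Code Division Multiple Access (CDMA), and Personal Handyphone System (PHS).

Preferred implementations of the present invention are described below with reference to the drawings.

FIG. 8 is a system architecture diagram of spam crawler processing according to a preferred embodiment of the present invention. As shown in FIG. 8, the system includes: a spam SMS real-time monitoring system 8, a short message center 11, a manual review platform (also called the operator's spam SMS complaint platform) 9, a spam SMS crawler analysis and mining system 10, an operation and maintenance subsystem (also called the operation and maintenance console) 7, and a home subscriber server HLR 6, among others.

The spam SMS crawler analysis and mining system 10 is the core processing module of the system. One of its inputs is the historical short message bill records, which 1) can be provided by the spam SMS real-time monitoring system 8, or 2) can be obtained directly from the short message center 11 as SMS bill record files. Its other input is spam short messages: 1) provided by the manual review platform 9, a third-party maintenance platform built for the operator; when a mobile user receives spam SMS, the user can complain to this platform, which sends the spam to the mining system 10; 2) spam detected in real time by the spam SMS real-time monitoring system 8, which is sent to the spam SMS crawler analysis and mining system 10.

The operation and maintenance console 7 evaluates and reviews the mined gang numbers and gang messages. The spam SMS crawler analysis and mining system 10 sends the gang numbers and gang messages it has mined to console 7, which then sends the confirmed gang numbers and gang spam contents to the spam SMS real-time monitoring system 8 for blacklisting, content keyword updates, and so on.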
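HLR 6: the spam-sending gang numbers mined by the mining system are sent to this system for blacklisting so that their short messages are intercepted. This module is optional. Interface description:
Interface 1 is the historical short message bill record input interface of the crawler mining system. This scheme implements it with an FTP interface, but is not limited to that approach;
Interface 2 is the spam SMS seed sample input interface. This scheme implements it with an FTP interface, but is not limited to that approach;
Interface 3 is the historical short message bill record input interface of the crawler mining system (historical bill records may instead be input through interface 1; if interface 1 is used, this interface does not provide historical bill records) and the spam SMS seed sample (spam detected by the real-time monitoring system) input interface. This scheme implements it with an FTP interface, but is not limited to that approach;
Interface 4 is the interface through which spam-sending gang numbers and gang message contents are sent to the real-time monitoring system for blacklisting; the message contents are also sent to the real-time system as a reference for keyword rule configuration. This scheme implements it with an FTP interface, but is not limited to that approach;
Interface 5 is the interface through which the initial gang numbers and gang message contents mined by the mining system 10 are sent to the operation and maintenance console for review and evaluation. This scheme implements it with an FTP interface, but is not limited to that approach;
Interface 12 is the interface through which spam-sending gang numbers are sent to the short message center for blacklisting. This scheme implements it with an FTP interface, but is not limited to that approach;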
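Interface 13 is the interface through which spam-sending gang numbers are sent to the HLR for blacklisting. This scheme implements it with an FTP interface, but is not limited to that approach.

Evaluation and review:

When suspicious SMS numbers are used as seed numbers for crawling, normal messages may be crawled out as well, so the calling-number gang sets and the spam content gang sets need to be evaluated and reviewed. The review may be processed automatically by rules or sent to the maintenance center for manual review. This scheme may use automatic processing based on the following rules:

(1) Continuity detection on the member numbers of the calling-number gang set: sort the numbers in the gang and compute the gaps between numbers; set a minimum gap Dm between adjacent numbers and a minimum consecutive-number threshold Hc. If the count of consecutive numbers within Dm exceeds Hc, the calling-number gang set is considered to have the consecutive-number feature. Once this feature is met, the gang is judged to be a valid spam-sending gang.
(2) Similarity detection on the message contents in the spam content gang set: this scheme determines similarity from the ratio of the number of characters common to two messages to the maximum message length. A threshold S is set; for example, S may be set to 0.7, meaning that two messages in which 70% of the characters are the same are considered similar content.
(3) Compute the send count of each kind of message in the spam content gang set and set a threshold Mc. If any message has a send count greater than Mc, the group is considered to have the high-volume sending feature.
(4) Compute the number of calling numbers participating in each kind of message in the spam content gang set and set a threshold Cc. If any message has more than Cc participating calling numbers, the group is considered to have the group participation feature.

When features (2) + (3), (2) + (4), or (3) + (4) occur together, the gang is judged to be a spam-sending gang.

FIG. 9 is a schematic diagram of spam crawler processing according to a preferred embodiment of the present invention. As shown in FIG. 9, spam SMS is taken as the example here. Three calling numbers send spam and together take part in sending six kinds of spam SMS, MessageA to MessageF (messages A to F); each user takes part in sending some of them. MessageA (message A) is the spam reported by a user to the delivery platform. The crawler system takes MessageA as the seed. It first crawls out, from the content of MessageA, the two users USER1 and USER2 who took part in sending that message; then, with these two users as seeds, it crawls out five new spam messages, MessageB to MessageF; then, taking these new spam messages one by one as seeds, it crawls out another participating spam sender, USER3.

FIG. 10 is a logic flow diagram of the crawler iteration according to a preferred embodiment of the present invention. As shown in FIG. 10, the crawler iteration is divided into two main iterative flows: crawling message content from calling numbers, and crawling calling numbers from message content. The input is of three kinds: spam SMS content, spam SMS calling numbers, and suspicious spam SMS calling numbers. "Spam SMS content" is used to generate content seeds as a starting point of the crawler; "spam SMS calling number" or "suspicious spam SMS calling number" is used to generate calling-number seeds as a starting point of the crawler. During crawling, a to-be-crawled HASH and an already-crawled HASH are maintained to add and remove seeds and to detect collisions among crawled results. The two main iterative flows are described below with FIG. 10 as the example.

Step S1002: process according to the starting source type. Taking spam SMS content, suspicious calling numbers, and blacklist numbers as examples: first write the spam content, suspicious numbers, or blacklist numbers to the to-be-crawled number list; then set NewSeed to the number of entries; then set the processing type to crawling message content by number (i.e., HM -> NR);

Step S1004: judge whether NewSeed is greater than 0; if so, go to step S1006 and end the flow; otherwise go to step S1008;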
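Step S1006: end the flow;

Step S1008: judge the crawler processing type. If the type is crawling message content by number, go to step S1010; if the type is crawling numbers by message content, go to step S1012;

Step S1010: crawl message content by number: set NewSeed to 0; judge whether the to-be-crawled number list is empty; if so, change the crawler type to crawling numbers by message content (i.e., NR -> HM); if not, find the content list for the number (look up the number file (FILE_HM) with the number as the key (KEY)); insert the number into the crawled-number HASH and delete it from the to-be-crawled list; with the content as the KEY, check whether it exists in the crawled-content HASH; if so, return to the step of judging whether the to-be-crawled number list is empty; if not, add 1 to NewSeed and insert the content into the to-be-crawled content HASH list;

Step S1012: crawl numbers by message content: set NewSeed to 0; judge whether the to-be-crawled content list is empty; if so, change the crawler type to crawling message content by number (i.e., HM -> NR); if not, find the number list for the message content (look up the number file (FILE_HM) with the message as the key (KEY)); insert the content into the crawled-content HASH and delete it from the to-be-crawled list; with each number as the KEY, check whether it exists in the crawled-number HASH; if so, return to the step of judging whether the to-be-crawled content list is empty; if not, add 1 to NewSeed and insert the number into the to-be-crawled number HASH list.

Based on the crawler-based spam SMS identification method proposed in the above embodiments and preferred implementations, a spam SMS crawler analysis and mining system was implemented. Test results show that gang-based low-frequency sending (that is, group-to-group spam sending) can be identified, that calling-number gangs and spam-content gangs can be recognized, and that the effectiveness of spam SMS management can be greatly improved.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device. In some cases, the steps shown or described can be performed in an order different from the one given here, or they can each be made into individual integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall fall within its scope of protection.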
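The alternating iteration of FIG. 10, applied to the FIG. 9 scenario, might be sketched as follows. This is a hedged sketch under stated assumptions, not the implementation tested in the source: Python sets stand in for the to-be-crawled and already-crawled HASH structures, the loop runs until both to-be-crawled sets are empty instead of tracking the NewSeed counter, and the assignment of messages to USER1, USER2, and USER3 is hypothetical, since the source only says that each user sends some of MessageA to MessageF.

```python
from collections import defaultdict

CALLER_TO_CONTENT = "HM->NR"  # crawl message content by number
CONTENT_TO_CALLER = "NR->HM"  # crawl numbers by message content


def crawl(records, seed_contents=(), seed_callers=()):
    """Return (callers, contents) reachable from the seeds.

    records: iterable of (caller, content) pairs from the bill records.
    """
    callers_of = defaultdict(set)
    contents_of = defaultdict(set)
    for caller, content in records:
        callers_of[content].add(caller)
        contents_of[caller].add(content)

    pending_callers, crawled_callers = set(seed_callers), set()
    pending_contents, crawled_contents = set(seed_contents), set()
    mode = CALLER_TO_CONTENT

    while pending_callers or pending_contents:
        if mode == CALLER_TO_CONTENT:
            while pending_callers:
                caller = pending_callers.pop()
                crawled_callers.add(caller)
                for content in contents_of[caller]:
                    # Collision check against both content sets.
                    if content not in crawled_contents:
                        pending_contents.add(content)
            mode = CONTENT_TO_CALLER
        else:
            while pending_contents:
                content = pending_contents.pop()
                crawled_contents.add(content)
                for caller in callers_of[content]:
                    if caller not in crawled_callers:
                        pending_callers.add(caller)
            mode = CALLER_TO_CONTENT
    return crawled_callers, crawled_contents


# Hypothetical split of MessageA..MessageF among the three users.
FIG9_RECORDS = [
    ("USER1", "MessageA"), ("USER1", "MessageB"), ("USER1", "MessageC"),
    ("USER2", "MessageA"), ("USER2", "MessageD"), ("USER2", "MessageE"),
    ("USER2", "MessageF"), ("USER3", "MessageC"), ("USER3", "MessageF"),
    ("USER9", "unrelated message"),
]
```

With these sample records, crawling from `MessageA` reaches all three users and all six messages, while the unrelated record is left out.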
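Industrial Applicability

As described above, the spam information processing method and apparatus provided by the embodiments of the present invention have the following beneficial effects: they solve the problem in the related art that the spam of an entire gang cannot be intercepted, and thereby make it possible to effectively identify spam calling-number gangs and spam-content gangs and to greatly improve the effectiveness of spam management.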
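The automatic review rules (1) to (4) from the evaluation and review part of the description could be sketched roughly as below. The sketch rests on several assumptions that the source does not settle: "consecutive numbers within Dm" is read as the longest run of sorted numbers whose adjacent gaps are at most Dm, "common characters" is read as a multiset intersection of characters, rule (2) is applied to any pair of distinct contents in the group, and "exceeds" is read as strictly greater than. All function and parameter names are hypothetical; only the threshold symbols Dm, Hc, S, Mc, and Cc and the example value S = 0.7 come from the text.

```python
from collections import Counter
from itertools import combinations


def has_consecutive_numbers(numbers, dm, hc):
    """Rule (1): longest run with adjacent gaps <= dm is longer than hc."""
    values = sorted({int(n) for n in numbers})
    if not values:
        return False
    longest = run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur - prev <= dm else 1
        longest = max(longest, run)
    return longest > hc


def similarity(a, b):
    """Rule (2): shared characters divided by the longer message length."""
    if not a or not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / max(len(a), len(b))


def is_spam_gang(records, *, dm, hc, mc, cc, s=0.7):
    """records: (caller, content) rows attributed to one candidate gang."""
    records = list(records)
    if has_consecutive_numbers({caller for caller, _ in records}, dm, hc):
        return True  # rule (1) alone is sufficient
    sends = Counter(content for _, content in records)
    participants = {}
    for caller, content in records:
        participants.setdefault(content, set()).add(caller)
    similar = any(similarity(a, b) > s for a, b in combinations(sends, 2))
    high_volume = any(count > mc for count in sends.values())           # rule (3)
    group_sent = any(len(who) > cc for who in participants.values())    # rule (4)
    # (2)+(3), (2)+(4), or (3)+(4): at least two of the three features.
    return similar + high_volume + group_sent >= 2
```

The thresholds are left as required keyword arguments because the source gives no values for Dm, Hc, Mc, or Cc; they would need tuning against real bill records.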
Claims
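1. A spam information processing method, including: acquiring a spam information seed;
taking the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, performing iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content; determining that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number, and/or determining that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.
2. The method according to claim 1, wherein acquiring the spam information seed includes at least one of the following: the spam information seed is provided by spam detected by a spam monitoring system; the spam information seed is provided by information in message bill record files obtained from the short message center; the spam information seed is provided by spam reported in user complaints.
3. The method according to claim 1, wherein determining that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number includes: dividing the message calling numbers having a direct or indirect crawler-web relationship with the spam information seed into a spam calling-number gang set;
determining, according to the spam calling-number gang set, that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number.
4. The method according to claim 3, wherein determining, according to the spam calling-number gang set, that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number includes:
sorting the numbers in the spam calling-number gang set; obtaining the count of consecutive numbers within a predetermined interval after sorting;
judging whether the count of consecutive numbers exceeds a first predetermined threshold;
if so, determining that the message calling number is the spam number.
5. The method according to claim 1, wherein determining that the message having a direct or indirect crawler-web relationship with the spam information seed is spam includes: dividing the messages having a direct or indirect crawler-web relationship with the spam information seed into a spam content gang set; determining, according to the spam content gang set, that the message having a direct or indirect crawler-web relationship with the spam information seed is spam.
6. The method according to claim 5, wherein determining, according to the spam content gang set, that the message having a direct or indirect crawler-web relationship with the spam information seed is spam is done in at least one of the following ways:
obtaining a similarity value between the message and the spam information seed as the ratio of the number of characters common to the message in the spam content gang set and the spam information seed to the maximum message length, and determining that the message is spam when the similarity value exceeds a second predetermined threshold;
judging the send count of the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and determining that the message is spam when the send count exceeds a third predetermined threshold;
judging the number of calling numbers participating in sending the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and determining that the message is spam when the number of participating calling numbers exceeds a fourth predetermined threshold.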
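7. A spam information processing apparatus, including: an acquisition module, configured to acquire a spam information seed;
a processing module, configured to take the spam information seed as the starting point and a predetermined set of bill record files as the body to be crawled, and to perform iterative crawling in which message content is used to crawl spam calling numbers and spam calling numbers are used to crawl message content; a first determination module, configured to determine that a message calling number having a direct or indirect crawler-web relationship with the spam information seed is a spam number; and/or a second determination module, configured to determine that a message having a direct or indirect crawler-web relationship with the spam information seed is spam.
8. The apparatus according to claim 7, wherein the acquisition module includes at least one of the following: a first providing unit, configured so that spam detected by the spam monitoring system provides the spam information seed;
a second providing unit, configured so that information in message bill record files obtained from the short message center provides the spam information seed;
a third providing unit, configured so that spam reported in user complaints provides the spam information seed.
9. The apparatus according to claim 7, wherein the first determination module includes: a first dividing unit, configured to divide the message calling numbers having a direct or indirect crawler-web relationship with the spam information seed into a spam calling-number gang set; a first decision unit, configured to determine, according to the spam calling-number gang set, that the message calling number having a direct or indirect crawler-web relationship with the spam information seed is the spam number.
10. The apparatus according to claim 9, wherein the first decision unit includes: a sorting subunit, configured to sort the numbers in the spam calling-number gang set; an obtaining subunit, configured to obtain the count of consecutive numbers within a predetermined interval after sorting; a judging subunit, configured to judge whether the count of consecutive numbers exceeds a first predetermined threshold; a first determining subunit, configured to determine, if so, that the message calling number is the spam number.
11. The apparatus according to claim 7, wherein the second determination module includes: a second dividing unit, configured to divide the messages having a direct or indirect crawler-web relationship with the spam information seed into a spam content gang set;
a second decision unit, configured to determine, according to the spam content gang set, that the message having a direct or indirect crawler-web relationship with the spam information seed is spam.
12. The apparatus according to claim 11, wherein the second decision unit includes at least one of the following: a second determining subunit, configured to obtain a similarity value between the message and the spam information seed as the ratio of the number of characters common to the message in the spam content gang set and the spam information seed to the maximum message length, and to determine that the message is spam when the similarity value exceeds a second predetermined threshold;
a third determining subunit, configured to judge the send count of the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and to determine that the message is spam when the send count exceeds a third predetermined threshold;
a fourth determining subunit, configured to judge the number of calling numbers participating in sending the message in the spam content gang set that has a direct or indirect crawler-web relationship with the spam information seed, and to determine that the message is spam when the number of participating calling numbers exceeds a fourth predetermined threshold.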
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310493826.1 | 2013-10-18 | ||
CN201310493826 | 2013-10-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015054993A1 true WO2015054993A1 (zh) | 2015-04-23 |
Family
ID=52827625
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/074924 WO2015054993A1 (zh) | 2013-10-18 | 2014-04-08 | Spam information processing method and apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104581729B (zh) |
WO (1) | WO2015054993A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10291774B2 (en) | 2015-07-13 | 2019-05-14 | Xiaomi Inc. | Method, device, and system for determining spam caller phone number |
CN109816404A (zh) * | 2019-01-28 | 2019-05-28 | 天津市国瑞数码安全系统股份有限公司 | Telecom fraud gang clustering method and telecom fraud gang clustering system based on the DBSCAN algorithm |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102200388B1 (ko) | 2014-06-23 | 2021-01-07 | 엘지디스플레이 주식회사 | White organic light-emitting device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101083802A (zh) * | 2007-07-18 | 2007-12-05 | 中兴通讯股份有限公司 | Method for implementing short message monitoring |
CN101257671A (zh) * | 2007-07-06 | 2008-09-03 | 浙江大学 | Content-based real-time filtering method for large-scale spam SMS |
CN101389085A (zh) * | 2008-10-14 | 2009-03-18 | 中国联合通信有限公司 | System and method for identifying spam short messages based on sending behavior |
CN101959145A (zh) * | 2009-07-13 | 2011-01-26 | 中国移动通信集团江苏有限公司 | Method, apparatus, and system for identifying spam information in mobile communication |
US8412779B1 (en) * | 2004-12-21 | 2013-04-02 | Trend Micro Incorporated | Blocking of unsolicited messages in text messaging networks |
CN103139730A (zh) * | 2011-11-23 | 2013-06-05 | 上海粱江通信系统股份有限公司 | Method for identifying low-frequency spam SMS sending by a large number of numbers |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147669A1 (en) * | 2006-12-14 | 2008-06-19 | Microsoft Corporation | Detecting web spam from changes to links of web sites |
CN102724355A (zh) * | 2012-05-04 | 2012-10-10 | 北京百纳威尔科技有限公司 | Spam information processing method and mobile phone terminal |
CN103150374B (zh) * | 2013-03-11 | 2017-02-08 | 中国科学院信息工程研究所 | Method and system for identifying abnormal microblog users |
-
2014
- 2014-04-08 WO PCT/CN2014/074924 patent/WO2015054993A1/zh active Application Filing
- 2014-09-26 CN CN201410504998.9A patent/CN104581729B/zh active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8412779B1 (en) * | 2004-12-21 | 2013-04-02 | Trend Micro Incorporated | Blocking of unsolicited messages in text messaging networks |
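CN101257671A (zh) * | 2007-07-06 | 2008-09-03 | 浙江大学 | Content-based real-time filtering method for large-scale spam SMS |
CN101083802A (zh) * | 2007-07-18 | 2007-12-05 | 中兴通讯股份有限公司 | Method for implementing short message monitoring |
CN101389085A (zh) * | 2008-10-14 | 2009-03-18 | 中国联合通信有限公司 | System and method for identifying spam short messages based on sending behavior |
CN101959145A (zh) * | 2009-07-13 | 2011-01-26 | 中国移动通信集团江苏有限公司 | Method, apparatus, and system for identifying spam information in mobile communication |
CN103139730A (zh) * | 2011-11-23 | 2013-06-05 | 上海粱江通信系统股份有限公司 | Method for identifying low-frequency spam SMS sending by a large number of numbers |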
Non-Patent Citations (1)
Title |
---|
SHEN CHAO.: "Application of Data Mining in Short Message Spam Filtering", JOURNAL OF UNIVERSITY OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, vol. 38, 20 November 2009 (2009-11-20) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10291774B2 (en) | 2015-07-13 | 2019-05-14 | Xiaomi Inc. | Method, device, and system for determining spam caller phone number |
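CN109816404A (zh) * | 2019-01-28 | 2019-05-28 | 天津市国瑞数码安全系统股份有限公司 | Telecom fraud gang clustering method and telecom fraud gang clustering system based on the DBSCAN algorithm |
CN109816404B (zh) * | 2019-01-28 | 2023-04-07 | 天津市国瑞数码安全系统股份有限公司 | Telecom fraud gang clustering method and telecom fraud gang clustering system based on the DBSCAN algorithm |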
Also Published As
Publication number | Publication date |
---|---|
CN104581729B (zh) | 2019-07-09 |
CN104581729A (zh) | 2015-04-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
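CN109600752B (zh) | Method and apparatus for deep clustering fraud detection | |
WO2016197675A1 (zh) | Method and apparatus for identifying nuisance calls | |
US9094325B2 (en) | Reputation based message analysis | |
EP2959707B1 (en) | Network security system and method | |
KR100935052B1 (ko) | Apparatus and method for managing content exchange in a wireless device | |
CN102209326B (zh) | Method and system for detecting malicious behavior based on the smartphone radio interface layer | |
CN104735272B (zh) | Method and system for intercepting nuisance calls | |
US9572004B2 (en) | System and method for fast accurate detection of SMS spam numbers via monitoring grey phone space | |
CN109698885B (zh) | Call request processing method and apparatus, network-side server, and computer storage medium | |
CN103581909B (zh) | Method and apparatus for locating suspected mobile phone malware | |
WO2011076984A1 (en) | Apparatus, method and computer-readable storage medium for determining application protocol elements as different types of lawful interception content | |
CN104853357B (zh) | Method and system for automatically identifying and triggering fraud numbers | |
JP5363342B2 (ja) | System and method for filtering cellular telephone messages | |
WO2011160328A1 (zh) | Communication monitoring method and apparatus | |
KR101586595B1 (ko) | Apparatus and method for performing predictive lawful interception in group calls | |
US20120220271A1 (en) | System and method for selective monitoring of mobile communication terminals based on speech key-phrases | |
WO2015054993A1 (zh) | Spam information processing method and apparatus | |
US9100831B2 (en) | Disabling mobile devices that originate message service spam | |
CN102932753A (zh) | Method for intercepting spam MMS on a link of a multimedia system | |
WO2015189380A1 (en) | Method and apparatus for detecting and filtering undesirable phone calls | |
CN110798379B (zh) | VoIP signaling gateway identification method, apparatus, and readable storage medium | |
CN107371141B (zh) | Spam information monitoring method, apparatus, and communication system | |
CN111131626B (zh) | Group harmful call detection method and apparatus based on a stream data graph, and readable medium | |
WO2017084405A1 (zh) | Short message supervision method and apparatus | |
CN109104702B (zh) | Information interception method, apparatus, and storage medium |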
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 14853787 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 14853787 Country of ref document: EP Kind code of ref document: A1 |