WO2012019386A1 - Method and system for monitoring spam short messages - Google Patents
Method and system for monitoring spam short messages
- Publication number
- WO2012019386A1 (PCT/CN2010/078516)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- short message
- sender
- spam
- detecting
- predetermined
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/58—Message adaptation for wireless communication
Definitions
- the present invention relates to short message services in the field of mobile communications, and in particular, to a spam short message monitoring system and method based on sender behavior.
- The main mechanism currently used against spam short messages is filtering, which can be divided into black and white list filtering, traffic-based filtering, and keyword-based content filtering.
- The blacklist-based filtering method compiles the calling numbers of known spam senders into a blacklist and deploys it in the short message center or the short message gateway, which then rejects short messages from blacklisted calling numbers.
- Blacklist interception can be applied to a number segment or to an individual number.
- A whitelisted calling number is not blocked in any way.
- The traffic-based filtering method counts the number of short messages a user sends in a certain period of time. When that number exceeds a preset threshold, the user is manually or automatically added to the blacklist.
- The keyword-based content filtering method performs keyword matching on the short message content. Once a keyword is hit, the sending number is added to the blacklist. Both traffic-based filtering and keyword-based content filtering have their own drawbacks.
- The traffic-based approach is easily evaded by "sending a small number of messages from each of many phones". In addition, now that many mobile terminals support group sending, holiday greetings are easily misclassified as spam in large numbers, which raises the user complaint rate.
- Keyword-based methods can be circumvented by means of "homophones", "typos", "structural splits", and "changes".
- Operators have deployed a large number of spam monitoring systems.
- Two important indicators are used to evaluate the monitoring effect of a spam monitoring system: precision and recall.
- Precision is the proportion of actual spam senders in the list of detected spam senders.
- Recall is the ratio of the number of spam senders detected to the actual number of spam senders on the network.
- A good spam monitoring system has both high precision and high recall.
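The two indicators can be illustrated with a short calculation. This is a minimal sketch; the function name `precision_recall` and the sample phone numbers are hypothetical and do not come from the patent.

```python
def precision_recall(detected, actual_spammers):
    """Return (precision, recall) for a set of detected sender numbers."""
    detected = set(detected)
    actual_spammers = set(actual_spammers)
    true_positives = len(detected & actual_spammers)
    # Precision: share of detected senders that really are spam senders.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of real spam senders that were detected.
    recall = true_positives / len(actual_spammers) if actual_spammers else 0.0
    return precision, recall


# Hypothetical data: 4 senders flagged, 5 real spam senders, 3 in common.
detected = {"13800000001", "13800000002", "13800000003", "13800000004"}
actual = {"13800000001", "13800000002", "13800000003",
          "13800000005", "13800000006"}
precision, recall = precision_recall(detected, actual)
print(precision, recall)
```

With these sample sets, three of the four flagged senders are real spam senders and three of the five real spam senders are found.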
- The indicators achieved by the monitoring systems that operators have deployed, whether based on the traditional techniques above or on improvements to them, are not ideal, and operators have to rely on substantial human resources to assist in inspecting spam short messages. How to improve the precision and recall of spam short message monitoring has therefore become an urgent problem.
- The present invention provides a method for monitoring spam short messages, the method comprising: if a short message sender is detected as a spam sender according to a predetermined rule, adding the short message sender to a blacklist in order to monitor spam short messages.
- The predetermined rule includes: if the timing feature of the short messages sent by the short message sender within a predetermined time period matches a predetermined timing feature, the short message sender is designated a spam sender; or if, within the predetermined time period, the ratio of the number of pairs having mutual communication records to the total number of pairwise combinations among the short message sender and all recipients of its short messages is less than a predetermined value, the short message sender is designated a spam sender; or if both of these conditions hold, the short message sender is designated a spam sender.
- The method further includes: extracting historical short message records of known spam senders, and training the predetermined timing feature from the frequency features with which the known spam senders send short messages in those records; and/or connecting the nodes that have mutual communication records in the historical short message records with edges to construct a social relationship network graph of the known spam senders and all recipients of their short messages, and training the predetermined value from the ratio of the number of edges to the total number of edges that would exist if all nodes were connected to one another.
- Before detecting whether the short message sender is a spam sender, the method further includes: detecting that the number of short messages sent by the short message sender per unit time exceeds a threshold.
- The step of detecting the short message sender as a spam sender according to the predetermined rule includes: detecting online the short message CDRs of the short message sender in the current time period, and if the timing feature of the short messages sent by the short message sender matches the predetermined timing feature, determining that the short message sender is a spam sender; or detecting the short message CDRs of the short message sender in the current time period, and if the ratio of the number of pairs having mutual communication records to the total number of pairwise combinations among the short message sender and all recipients of its short messages is less than the predetermined value, determining that the short message sender is a spam sender; or detecting online the short message CDRs of the short message sender in the current time period, and if both of these conditions hold, determining that the short message sender is a spam sender.
- The method further includes: extracting the short message CDRs of the short message sender in the current period of time, and preprocessing them.
- The method further includes: detecting that the short message sender is on neither the blacklist nor the whitelist.
- The present invention provides a system for spam monitoring, the system comprising a detecting module and a monitoring module. The detecting module is configured to: if a short message sender is detected as a spam sender according to a predetermined rule, add the short message sender to a blacklist, and then send the blacklist to the monitoring module.
- The monitoring module is configured to: monitor spam short messages according to the blacklist.
- The predetermined rule at least includes: if the timing feature of the short messages sent by the short message sender within a predetermined time period matches a predetermined timing feature, the short message sender is designated a spam sender; or if, within the predetermined time period, the ratio of the number of pairs having mutual communication records to the total number of pairwise combinations among the short message sender and all recipients of its short messages is less than a predetermined value, the short message sender is designated a spam sender; or if both of these conditions hold, the short message sender is designated a spam sender.
- The system further includes a training module, configured to: extract historical short message records of known spam senders, train the predetermined timing feature from the frequency features with which the known spam senders send short messages in those records, and then send the predetermined timing feature to the detecting module; and/or connect the nodes that have mutual communication records in the historical short message records with edges to construct a social relationship network graph of the known spam senders and all recipients of their short messages, train the predetermined value from the ratio of the number of edges to the total number of edges that would exist if all nodes were connected to one another, and then send the predetermined value to the detecting module.
- The detecting module includes an online detecting module, configured to: detect the short message CDRs of the short message sender in the current period of time, and if the timing feature of the short messages sent by the short message sender matches the predetermined timing feature, determine that the short message sender is a spam sender; or detect online the short message CDRs of the short message sender in the current period of time, and if the ratio of the number of pairs having mutual communication records to the total number of pairwise combinations among the short message sender and all recipients of its short messages is less than the predetermined value, determine that the short message sender is a spam sender; or detect the short message CDRs of the short message sender in the current period of time, and if both of these conditions hold, determine that the short message sender is a spam sender.
- The online detecting module is further configured to: before detecting whether the short message sender is a spam sender, detect that the number of short messages sent by the short message sender per unit time exceeds a threshold.
- The system further includes a CDR pre-processing module, configured to: extract the short message CDRs of the short message sender in the current period of time, and preprocess them.
- The detecting module is further configured to: before detecting the short message sender as a spam sender according to the predetermined rule, detect that the short message sender is on neither the blacklist nor the whitelist.
- Traditional content-based spam monitoring systems achieve neither ideal precision nor ideal recall in spam filtering, and because the content of every short message must be scanned, the system resource overhead is large.
- The method and system for spam monitoring provided by the present invention monitor spam short messages based on the temporal and spatial characteristics of sender behavior. This gives high precision and recall and raises the cost of evasion for spam senders, and because short message content does not need to be scanned, system performance is also greatly improved.
- FIG. 1 is a schematic diagram of a spam short message monitoring system of the present invention
- FIG. 2 is a flowchart of a spam short message monitoring method according to the present invention
- FIG. 3 is a schematic diagram of a spam short message monitoring system according to an embodiment of the present invention
- FIG. 4 is a flowchart of a method for monitoring spam messages according to an embodiment of the present invention
- FIG. 5 is a flowchart of training on spam sender behavior according to an embodiment of the present invention
- FIG. 6 is a flowchart of online detection according to an embodiment of the present invention.
- The behavior of short message senders has certain temporal and spatial characteristics. For example, many spam senders use group sending to distribute commercial advertisements, and the frequency characteristics of their sending times differ markedly from those of ordinary senders. Machine group sending often runs at a fixed frequency, for example with a constant time interval between messages, whereas ordinary senders send at no fixed frequency and show little regularity.
- A normal short message sender has stable and unique social network characteristics, and these relationships are relatively private, whereas the social relationship network reflected by a spam sender is chaotic and unstable. Everyone has a fixed social circle, most of the people they send short messages to are within it, and each person's social circle, and hence social network, is different. The recipients of spam short messages, by contrast, often have no relationship with one another. To circumvent monitoring based on social networks, spam senders would have to obtain everyone's social network, which is difficult precisely because each network is unique and private: we usually do not know what other people's social networks look like.
- The invention uses the differences between spam senders and normal senders in temporal characteristics and/or spatial characteristics to monitor spam short messages.
- Timing features and social network features are extracted, a measurement model of the timing features and social relationship network of spam senders is trained, and the model is used to measure the probability that a short message sender is a spam sender.
- Training this measurement model means obtaining a list of known spam senders, analyzing the temporal and spatial characteristics of that group, extracting the features they share in timing and in social network structure, and expressing those features as parameter values that serve as a reference for checking whether other short message senders are spam senders.
- The timing feature model is a set of frequency characteristic parameters for sending short messages, extracted from the historical short message records of spam senders. For example, the intervals between the messages sent within a certain period follow a certain rule: if a spam sender sends one short message every second, the feature is a time interval of 1 second.
- Some low-frequency spam senders may deliberately set a longer sending interval to avoid monitoring. However, as long as the messages are group-sent by machine, the sending intervals will always show a certain regularity.
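The regularity of sending intervals can be sketched as follows. The patent does not specify how regularity is measured; this sketch assumes a coefficient-of-variation test on the gaps between consecutive submissions, and the names `has_fixed_interval`, `max_cv`, and `min_messages` are hypothetical.

```python
from statistics import mean, pstdev


def has_fixed_interval(send_times, max_cv=0.1, min_messages=5):
    """Return True when the gaps between submissions are nearly constant.

    send_times: submission timestamps in seconds, sorted ascending.
    max_cv: assumed upper bound on stddev/mean of the gaps.
    """
    if len(send_times) < min_messages:
        return False  # too few messages to judge regularity
    gaps = [later - earlier
            for earlier, later in zip(send_times, send_times[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return True  # every message in the same instant: machine-like
    return pstdev(gaps) / average_gap <= max_cv


fast_machine = [0, 1, 2, 3, 4, 5]             # one message per second
slow_machine = [0, 60, 120, 181, 240, 300]    # about one per minute
human = [0, 40, 45, 600, 610, 3000]           # irregular gaps
print(has_fixed_interval(fast_machine),
      has_fixed_interval(slow_machine),
      has_fixed_interval(human))
```

Because the test looks at the relative spread of the gaps rather than their size, a sender who deliberately slows down but keeps a steady interval is still flagged.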
- The social relationship network feature is the spatial feature model.
- The social relationships among a spam sender and the recipients of its short messages are relatively distant; that is, there are few communication records between them.
- A pair of users has a mutual communication record when, for example, one has sent a short message and the other has replied.
- Among the sender and the recipients of spam short messages, the proportion of pairs that have such a social relationship is generally small.
- A social network graph containing the short message sender and all recipients of its short messages can be constructed from the historical short message records: the sender and each recipient are treated as one node each, and nodes that have a communication record are connected by an edge.
- The node aggregation degree parameter calculated from the graph can be measured by the ratio of the number of edges actually present in the graph to the total number of edges that would exist if every pair of nodes were connected.
- More edges in the graph mean a higher degree of node aggregation, and the degree of node aggregation is usually low in the social network graph constructed for a spam sender.
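The node aggregation degree can be sketched as below. This is an illustration under stated assumptions: a pair of nodes counts as connected only when messages exist in both directions, and the function name `aggregation_degree` and the `(from_number, to_number)` record layout are hypothetical.

```python
from itertools import combinations


def aggregation_degree(sender, records):
    """Ratio of mutually communicating pairs to all possible node pairs.

    records: iterable of (from_number, to_number) tuples.
    Nodes are the sender plus every recipient of the sender's messages.
    """
    records = set(records)
    nodes = {sender} | {to for frm, to in records if frm == sender}
    total_pairs = len(nodes) * (len(nodes) - 1) // 2
    if total_pairs == 0:
        return 0.0
    # Assumption: an edge exists when both directions are on record.
    mutual_pairs = sum(
        1 for a, b in combinations(sorted(nodes), 2)
        if (a, b) in records and (b, a) in records
    )
    return mutual_pairs / total_pairs


normal = [("S", "A"), ("S", "B"), ("S", "C"),
          ("A", "S"), ("A", "B"), ("B", "A")]
spam = [("X", "A"), ("X", "B"), ("X", "C"), ("X", "D")]
print(aggregation_degree("S", normal), aggregation_degree("X", spam))
```

In the sample data the normal sender's graph has 2 connected pairs out of 6 possible, while the spam sender's graph has none out of 10.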
- Spam senders can be divided into high-frequency senders and low-frequency senders. High-frequency senders are more harmful because they send a large volume of spam short messages in a short period of time. Low-frequency senders do not generate a large volume of spam in a short period and so cause little harm in the short term.
- The spam monitoring system therefore needs to detect high-frequency senders within a short time and low-frequency senders within a certain longer period.
- The present invention employs a combination of online detection and offline detection.
- Online detection targets high-frequency senders and examines data from the current period, so it is highly timely. Offline detection examines data over a longer period (such as one week) and supplements online detection by finding low-frequency spam senders that online detection cannot detect.
- The training process includes extracting the senders' timing features and social relationship network features, performing cluster analysis, deriving the sending rules of spam senders statistically, and finally generating a model file containing the parameters of those rules.
- During detection, the timing features and social relationship network features of the senders in the real-time short message CDRs are likewise extracted, and the similarity between the sample and the model file is calculated to determine whether the sender is a spam sender.
- The training process is adaptive: the system periodically collects CDRs for training and adjusts the template library.
- First, the black and white lists are checked. If the short message sender is on the blacklist or the whitelist, that user is skipped directly. The blacklist contains users already identified as spam senders or specific users whom the operator has barred from sending short messages, so detecting them again serves no purpose.
- The purpose of spam monitoring is to find spam senders and add them to the blacklist; a sender already on the blacklist needs no further checking.
- A whitelist user is usually a user the operator has exempted from monitoring. Whatever short messages a whitelist user sends, the spam monitoring system may not treat that user as a spam sender, so monitoring whitelist users serves no purpose either.
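The list check can be sketched as a simple pre-filter. The function name `senders_to_check` and the sample numbers are hypothetical.

```python
def senders_to_check(senders, blacklist, whitelist):
    """Drop senders that are already on the blacklist or the whitelist."""
    skip = set(blacklist) | set(whitelist)
    return [sender for sender in senders if sender not in skip]


remaining = senders_to_check(
    ["13800000001", "13800000002", "13800000003"],
    blacklist={"13800000001"},   # already identified, no need to re-check
    whitelist={"13800000003"},   # exempt from monitoring
)
print(remaining)
```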
- FIG. 1 is a schematic diagram of a spam short message monitoring system according to the present invention.
- The spam short message monitoring system of the present invention mainly includes a detecting module and a monitoring module. The detecting module is configured to: if a short message sender is detected as a spam sender according to a predetermined rule, add the short message sender to a blacklist, and then send the blacklist to the monitoring module.
- The monitoring module is configured to: monitor spam short messages according to the blacklist.
- The predetermined rule at least includes: if the timing feature of the short messages sent by the short message sender within a predetermined time period matches a predetermined timing feature, for example the time interval between short messages sent per unit time is fixed, the short message sender is designated a spam sender; or if, within the predetermined time period, the ratio of the number of pairs having mutual communication records to the total number of pairwise combinations among the short message sender and all recipients of its short messages is less than a predetermined value, for example less than 10%, the short message sender is designated a spam sender; or if both of these conditions are detected, the short message sender is designated a spam sender.
- The spam monitoring system of the present invention may further include a training module, configured to: extract historical short message records of known spam senders, train the predetermined timing feature from the frequency features with which the known spam senders send short messages in those records, and then send the predetermined timing feature to the detecting module; or connect the nodes that have mutual communication records in the historical short message records with edges to construct a social relationship network graph of the known spam senders and all recipients of their short messages, train the predetermined value from the ratio of the number of edges to the total number of edges that would exist if all nodes were connected to one another, and then send the predetermined value to the detecting module.
- The spam short message monitoring system of the present invention can train different timing feature models and spatial feature models for different operators.
- Step 10: Detect, according to a predetermined rule, whether a short message sender is a spam sender; if so, execute step 20; if not, repeat step 10.
- Step 20: Add the short message sender to the blacklist in order to monitor spam short messages.
- The predetermined rule includes: if the timing feature of the short messages sent by the short message sender within the predetermined time period matches a predetermined timing feature, for example the time interval between short messages sent within the predetermined time period is fixed, the short message sender is designated a spam sender; or if, within the predetermined time period, the ratio of the number of pairs having mutual communication records to the total number of pairwise combinations among the short message sender and all recipients of its short messages is less than a predetermined value, the short message sender is designated a spam sender; or if both of these conditions hold, the short message sender is designated a spam sender.
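The three alternatives of the predetermined rule can be sketched as one decision function. The function name `is_spam_sender`, the `mode` parameter, and the default threshold are hypothetical; the 10% value follows the example given elsewhere in the description, not a fixed requirement.

```python
def is_spam_sender(timing_matches, pair_ratio,
                   ratio_threshold=0.10, mode="both"):
    """Apply one of the three rule alternatives.

    timing_matches: True if the sender's timing feature matches the
        predetermined timing feature (for example, a fixed interval).
    pair_ratio: mutually communicating pairs / all pairwise combinations
        among the sender and its recipients.
    mode: "timing", "spatial", or "both" (assumed labels).
    """
    spatial_matches = pair_ratio < ratio_threshold
    if mode == "timing":
        return timing_matches
    if mode == "spatial":
        return spatial_matches
    if mode == "both":
        return timing_matches and spatial_matches
    raise ValueError(f"unknown mode: {mode!r}")


print(is_spam_sender(True, 0.02))                   # both conditions hold
print(is_spam_sender(True, 0.50))                   # regular but well connected
print(is_spam_sender(False, 0.02, mode="spatial"))  # spatial rule only
```

Requiring both conditions trades some recall for precision, while either single-condition mode flags more senders.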
- The spam short message monitoring method monitors spam short messages based on the timing features and/or spatial features of spam senders, so as to improve the precision and recall of spam monitoring.
- The following steps may also be included: extracting historical short message records of known spam senders, and training the predetermined timing feature from the frequency features with which the known spam senders send short messages in those records; and/or connecting the nodes that have mutual communication records in the historical short message records with edges to construct a social relationship network graph of the known spam senders and all recipients of their short messages, and training the predetermined value from the ratio of the number of edges to the total number of edges that would exist if all nodes were connected to one another.
- FIG. 3 is a schematic diagram of a spam short message monitoring system according to an embodiment of the present invention.
- The spam monitoring system of the embodiment includes: a CDR pre-processing module, a training module, a manual labeling module, a detecting module, and a black and white list management module.
- The CDR pre-processing module is configured to: pre-process the short message center CDRs, which includes removing duplicate records, removing non-point-to-point short messages, removing CDRs of non-target operators, extracting useful fields, converting the format to the system's internal format, and storing the result in the database.
- Some CDR records are retry records generated after a submission failed for system reasons; such records are counted as a single short message.
- Some short message records are sent to users by the operator's customer service system rather than by users; these need no monitoring and are removed.
- The operator monitors only its own users. When users of other operators send short messages to the operator's users, the short message center also generates CDR records, and such records need no monitoring.
- A CDR contains many fields, but spam monitoring uses only a few of them, so only the useful fields are extracted. In addition, the CDR must be converted into a format the system can recognize internally.
- the CDR pre-processing module can obtain the original CDR of the SMS center through a File Transfer Protocol (FTP).
- The training module is configured to: train on the historical CDRs of known spam senders to generate model files for spam detection.
- The manual labeling module is configured to: correctly label the user category of each candidate user who may be a spam sender before the spam sender model is trained, so that the trained model file conforms more accurately to the regular characteristics of spam senders.
- The detecting module in this embodiment may include an online timing detection module, an online space detection module, and an offline space detection module. The online timing detection module is configured to: detect the timing features of short message senders online and export a blacklist.
- The online space detection module is configured to: detect the social network features of short message senders online and export a blacklist.
- The offline space detection module is configured to: detect the social network features of short message senders offline and export a blacklist.
- The black and white list management module is configured to: merge the blacklists of the three detection modules above and synchronize the result to the BOSS, and obtain the black and white lists from the BOSS and synchronize them to the detecting module.
- The black and white lists can also be synchronized between the black and white list management module and the BOSS by FTP.
- FIG. 4 is a flowchart of a method for monitoring spam messages according to an embodiment of the present invention.
- The specific process includes the following steps: Step 201: Acquire the original CDRs of the short message center and preprocess them.
- The pre-processing performed by the CDR pre-processing module includes: removing duplicate records, removing non-point-to-point short messages, removing CDRs of non-target operators, extracting useful fields, converting the format to the system's internal format, and sorting by short message submission time. The useful fields extracted include: message identification (id), sender number, recipient number, short message submission time, short message length, and short message content.
- The CDR pre-processing module sends the pre-processed short message CDRs to the detecting module.
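The pre-processing steps can be sketched as a single pass over the raw records. Everything about the record layout here is an assumption for illustration: CDRs are modeled as dictionaries, the key names in `FIELDS`, the `p2p` flag, and identifying the target operator by number prefix are all hypothetical.

```python
# Hypothetical names for the six useful fields listed in the description.
FIELDS = ("msg_id", "sender", "recipient",
          "submit_time", "length", "content")


def preprocess(cdrs, target_prefixes):
    """Deduplicate, filter, trim, and time-sort raw CDR dictionaries."""
    seen_ids = set()
    cleaned = []
    for cdr in cdrs:
        if cdr["msg_id"] in seen_ids:
            continue  # a retried submission counts as one short message
        seen_ids.add(cdr["msg_id"])
        if not cdr.get("p2p", True):
            continue  # not point-to-point, e.g. sent by customer service
        if not cdr["sender"].startswith(tuple(target_prefixes)):
            continue  # sender does not belong to the target operator
        cleaned.append({key: cdr[key] for key in FIELDS})
    cleaned.sort(key=lambda record: record["submit_time"])
    return cleaned


raw = [
    {"msg_id": "m2", "sender": "13900000001", "recipient": "13900000002",
     "submit_time": 20, "length": 5, "content": "hello", "p2p": True},
    {"msg_id": "m1", "sender": "13900000001", "recipient": "13900000003",
     "submit_time": 10, "length": 2, "content": "hi", "p2p": True},
    {"msg_id": "m1", "sender": "13900000001", "recipient": "13900000003",
     "submit_time": 11, "length": 2, "content": "hi", "p2p": True},
    {"msg_id": "m3", "sender": "10086", "recipient": "13900000001",
     "submit_time": 5, "length": 6, "content": "notice", "p2p": False},
    {"msg_id": "m4", "sender": "13300000009", "recipient": "13900000001",
     "submit_time": 7, "length": 3, "content": "hey", "p2p": True},
]
print([record["msg_id"] for record in preprocess(raw, ["139"])])
```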
- Step 202: The detecting module scans the pre-processed CDRs one by one and records only the submission time, the sender number, and the recipient number.
- Step 203: The detecting module filters each record against the black and white lists. If the user is on the blacklist or the whitelist, the user is ignored directly.
- Step 204: Perform detection based on the short message sender's timing features and/or spatial features, using the model files generated by the training module.
- online detection may be performed, and offline detection may also be performed.
- the online detection may detect the timing characteristics of the short message sender, and may also detect the spatial characteristics of the short message sender.
- Offline detection generally detects the spatial characteristics of a short message sender over a historical period of time.
- the online timing detection module, online space detection module and offline space detection module can be operated in parallel or separately.
- The online timing detection module and the online space detection module analyze the characteristics of the short messages a user sends during the current period of time.
- Offline space detection generally analyzes the social relationship network features of a user over a longer period of time (for example, one week).
- Step 205: Add the detected spam senders to the blacklist. If the three detection modules above operate in parallel, each generates its blacklist independently.
- The black and white list management module merges the blacklists exported by the three detection modules to obtain the final blacklist.
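Merging the independently generated blacklists amounts to a set union. A minimal sketch, in which the function name `merge_blacklists` and the sample numbers are hypothetical:

```python
def merge_blacklists(*blacklists):
    """Union the blacklists exported by the detection modules."""
    return sorted(set().union(*blacklists))


online_timing = {"13800000001", "13800000002"}
online_space = {"13800000002", "13800000003"}
offline_space = {"13800000004"}
final_blacklist = merge_blacklists(online_timing, online_space,
                                   offline_space)
print(final_blacklist)
```

A sender flagged by more than one module appears only once in the merged list.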
- the three detection modules can detect spam senders from three angles.
- Step 206 The black and white list management module synchronizes the blacklist to the BOSS.
- BOSS will provide the blacklist to the control module of the SMS center.
- the SMS center will first check whether the sender is on the blacklist. If it is on the blacklist, the user is prohibited from sending the SMS.
- FIG. 5 is a flowchart of training on spam sender behavior according to an embodiment of the present invention. As shown in FIG. 5, the specific process includes the following steps:
- Step 301: Extract the CDRs of a historical period, preprocess them, and store them in the database.
- Step 302: Use an existing empirical model to obtain an initial candidate training set of senders considered to be spam senders.
- the existing empirical model refers to a set of parameters obtained by analyzing the timing characteristics and spatial characteristic rules of spam makers in the operator's historical bill data.
- Step 303: Evaluate the size of the training set. If the training set is not large enough, meaning that it contains few spam senders, a model file trained on it has little statistical significance, and the process returns to step 301 to obtain more CDRs for training. If the training set is considered large enough, proceed to step 304.
- Step 304: Manually label the training set. Using the annotation tool provided by the manual labeling module, view the short messages sent by each user in the training set, and classify and label the users according to manual judgment.
- Manual classification usually determines whether a user has sent spam by checking the content of the short messages the user sent; the criterion for what counts as spam is generally set in combination with the operator's requirements. Manual classification usually divides users into four categories: normal short message senders, spam senders, mixed short message senders, and other short message senders. Mixed short message senders send both normal short messages and spam short messages; other short message senders typically send garbled messages or messages abusing the operator.
- Step 305 Extract a historical CDR of the spam sender according to the labeling result, and train the time series feature and the space feature.
- The timing feature can be converted into frequency-domain information. The extracted spatial feature parameters can include the number of short messages sent, the number of short messages received, the number of recipients who replied, the number of pairs of recipients that have mutual communication records, and the like.
- The spatial feature model can be trained on the number of replies, that is, on the number of pairs that have mutual communication records.
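As a rough illustration of the spatial feature parameters listed above, the sketch below counts a few of them from a list of (sender, recipient) records. The record layout and the name `spatial_features` are assumptions made for this example; the patent does not prescribe a data format.

```python
def spatial_features(records, user):
    """Count simple spatial features for `user`.

    `records` is a hypothetical list of (sender, recipient) tuples,
    one per short message CDR.
    """
    sent = [r for s, r in records if s == user]
    received = [s for s, r in records if r == user]
    recipients = set(sent)
    # Recipients who also sent at least one message back to `user`.
    repliers = recipients & set(received)
    return {
        "sent": len(sent),
        "received": len(received),
        "recipients": len(recipients),
        "repliers": len(repliers),
    }


records = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A"), ("E", "A")]
features = spatial_features(records, "A")
```

Here user A sent three messages and only one of the three recipients (B) replied; a sender whose replier count stays near zero while the sent count is large would look spam-like on this feature.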
- Step 306: Determine the sending patterns of spam senders through frequency-domain analysis and social network analysis, and generate a timing-feature-based model file and a spatial-feature-based model file, respectively.
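One possible reading of the frequency-domain analysis is sketched below: send times are binned into a count series, a plain discrete Fourier transform is applied, and the strongest non-zero frequency is reported as a sending period. The binning, the function name `dominant_period`, and the choice of a naive DFT are all assumptions for illustration; the patent does not specify the transform or its parameters.

```python
import cmath


def dominant_period(send_times, window, bin_size=1.0):
    """Return the period (in seconds) of the strongest periodic component
    in the send times, or None if the series has no periodic component.

    Naive O(n^2) DFT; adequate for a short illustrative window.
    """
    n = int(window / bin_size)
    counts = [0] * n
    for t in send_times:
        idx = int(t / bin_size)
        if 0 <= idx < n:
            counts[idx] += 1
    avg = sum(counts) / n
    centered = [c - avg for c in counts]  # remove the zero-frequency term

    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(centered))
        # Keep the lowest frequency among (near-)equal peaks, so that
        # harmonics of the true period are not preferred.
        if abs(coeff) > best_mag + 1e-9:
            best_k, best_mag = k, abs(coeff)
    return None if best_k == 0 else n * bin_size / best_k


# A sender that submits one message every 4 seconds for 32 seconds.
period = dominant_period(range(0, 32, 4), window=32)
```

For the machine-like sender in the example the sketch reports a period of 4 seconds, which is the kind of parameter a timing-feature model file could store.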
- Step 307: Synchronize the generated model files to the detection module.
- Model files can be flexibly adjusted according to each operator's requirements for precision and recall. For example, if the operator wants higher recall, users labeled as mixed SMS senders are also treated as spam senders during training; if the operator wants higher precision, only users labeled as spam senders are used for training.
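The precision/recall trade-off described above amounts to choosing which manual labels count as positive training examples. A minimal sketch, assuming the four label names `normal`, `spam`, `mixed`, and `other` and a hypothetical helper `training_users`:

```python
def training_users(labels, prefer="precision"):
    """Select the users to train the spam model on.

    `labels` maps a user to one of 'normal', 'spam', 'mixed', 'other'
    (hypothetical label names for the four manual categories).
    """
    if prefer not in ("precision", "recall"):
        raise ValueError("prefer must be 'precision' or 'recall'")
    # Higher recall: mixed senders are also treated as spam senders.
    wanted = {"spam"} if prefer == "precision" else {"spam", "mixed"}
    return sorted(user for user, label in labels.items() if label in wanted)


labels = {"u1": "spam", "u2": "mixed", "u3": "normal", "u4": "other"}
strict = training_users(labels, prefer="precision")
broad = training_users(labels, prefer="recall")
```

With these labels, `strict` contains only `u1`, while `broad` contains `u1` and `u2`.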
- FIG. 6 is a flowchart of online detection according to an embodiment of the present invention.
- The process includes the following steps: Step 401: Scan the pre-processed CDRs one by one, recording only the submission time and the sender and recipient numbers of each short message.
- Step 402: Check whether the trigger condition for online detection is met. If it is, proceed to step 403 and start the online detection algorithm; otherwise, return to step 401 and continue scanning CDRs. For example, the online detection algorithm is started when the number of short messages sent by a user per unit time exceeds a threshold; the threshold can be adjusted according to actual detection results.
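The trigger in step 402 can be pictured as a per-sender sliding-window counter. The sketch below is one way to do it under stated assumptions: CDRs arrive in time order, times are in seconds, and the class name `SendRateTrigger` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class SendRateTrigger:
    """Fire when a sender submits more than `threshold` messages
    within the last `window_seconds` seconds."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.recent = defaultdict(deque)  # sender -> recent submit times

    def observe(self, sender, submit_time):
        """Record one scanned CDR and report whether detection should start."""
        times = self.recent[sender]
        times.append(submit_time)
        # Drop submissions that have fallen out of the window.
        while times and times[0] <= submit_time - self.window:
            times.popleft()
        return len(times) > self.threshold


trigger = SendRateTrigger(window_seconds=60, threshold=3)
fired = [trigger.observe("A", t) for t in (0, 10, 20, 30, 100)]
```

In this run the trigger fires only on the fourth message, when four submissions fall inside one 60-second window; by the fifth message the earlier ones have expired. Raising or lowering `threshold` corresponds to the adjustment mentioned in step 402.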
- Step 403: Extract the timing features and spatial features of the real-time short message sender.
- Step 404: After determining the timing features and spatial features of the short message sender, compare them with the trained model files to determine whether the sender is a spam sender.
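The comparison in step 404 is not spelled out in the text, so the following is only a sketch of one plausible decision rule: a timing hit when the sender's period is close to the trained period, a spatial hit when the mutual-pair ratio is below the trained value, combined with "or" or "and". The dictionary keys, the tolerance, and the name `is_spam_sender` are assumptions.

```python
def is_spam_sender(features, model, require_both=False):
    """Compare extracted features with trained model parameters.

    `features` and `model` are hypothetical dictionaries; the patent does
    not define the model file format or the similarity measure.
    """
    timing_hit = (
        features["period"] is not None
        and abs(features["period"] - model["period"]) <= model["period_tolerance"]
    )
    spatial_hit = features["mutual_pair_ratio"] < model["max_mutual_pair_ratio"]
    if require_both:
        return timing_hit and spatial_hit
    return timing_hit or spatial_hit


model = {"period": 1.0, "period_tolerance": 0.1, "max_mutual_pair_ratio": 0.10}
regular_but_social = {"period": 1.05, "mutual_pair_ratio": 0.50}
irregular_and_social = {"period": 5.0, "mutual_pair_ratio": 0.50}

loose = is_spam_sender(regular_but_social, model)
strict = is_spam_sender(regular_but_social, model, require_both=True)
clean = is_spam_sender(irregular_and_social, model)
```

Requiring both hits trades recall for precision, which mirrors the "and/or" wording of the predetermined rule.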
- The spam short message monitoring method and system provided by the present invention monitor spam based on the timing and spatial characteristics of sender behavior. They achieve high precision and recall, raise the cost of evasion for spam senders, and, because message content does not need to be scanned, greatly improve system performance.
Abstract
A method for monitoring spam short messages is disclosed, which includes: detecting, according to preset rules, whether a short message sender is a spam sender and, if so, adding the sender to a blacklist for spam monitoring. The preset rules at least include: within a preset time period, if the timing characteristic of the sender's messages matches a preset timing characteristic, and/or if the ratio of the number of pairs that have mutual communication records, among the sender and all of its recipients, to the total number of pairs is smaller than a preset value, the sender is judged to be a spam sender. A system for monitoring spam short messages is also disclosed. The invention achieves higher precision and recall, thereby raising the cost of evasion for spam senders, and it does not need to scan the content of short messages, which greatly improves system performance.
Description
Method and system for monitoring spam short messages
TECHNICAL FIELD
The present invention relates to short message services in the field of mobile communications, and in particular to a spam short message monitoring system and method based on sender behavior characteristics.
BACKGROUND
According to statistics, the number of mobile phone users in China has exceeded 600 million, and on average more than 650 million short messages are sent every day. With the spread of mobile phones and the rapid development of short message services, people enjoy fast and convenient communication, but they also face an ever-growing flood of spam short messages. The root cause of spam is that sending short messages costs very little while the advertising return is very high. Spam short messages not only burden operators' networks but also seriously harm users' interests and have a serious adverse social impact. Outside China, spam is governed mainly through legislation and advanced technical means of identifying and handling fraudulent messages and handsets, together with a full set of advanced technical means for combating mobile phone crime. In China, spam prevention and control is led mainly by operators, who usually take technical and administrative measures; legislation is still lacking.

The spam monitoring technologies in common use today rely mainly on spam filtering mechanisms. In principle, these can be divided into blacklist/whitelist filtering, traffic-based filtering, and keyword-based content filtering. Blacklist-based filtering compiles the calling numbers of known spam senders into a blacklist and deploys it in the short message center or short message gateway, which then rejects short messages from the blacklisted calling numbers. A blacklist can block by number segment or by individual number. Whitelisted calling numbers are not blocked in any way. Traffic-based filtering counts the number of messages a user sends in bulk within a given period; when the count exceeds a preset threshold, the user is added to the blacklist manually or automatically. Keyword-based content filtering searches short message content for keywords; on a hit, the sending number is added to the blacklist.

Both traffic-based filtering and keyword-based content filtering have drawbacks. Traffic-based filtering is easily evaded by sending a small number of messages from each of many handsets. In addition, now that many handsets support group sending, it tends to misclassify large numbers of holiday greeting messages, which drives up user complaints. Keyword-based methods can be circumvented with homophones, deliberate misspellings, splitting characters into components, and word substitution.

Operators have deployed a large number of spam monitoring systems. Two important indicators are used to evaluate how well such a system works: precision and recall. Precision is the proportion of actual spam senders among the senders that the system flags as spam senders; recall is the proportion of the actual spam senders in the network that the system detects. Clearly, a good spam monitoring system has both high precision and high recall. The systems that operators have deployed, based on the traditional technologies above or on improvements to them, fall short on both indicators, so operators have to rely on a large amount of manual effort to help check for spam. How to improve the precision and recall of spam detection has therefore become an urgent problem.
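The two indicators defined above can be computed directly from the set of flagged senders and the set of actual spam senders. A minimal sketch; the function name `precision_recall` and the example sets are illustrative only:

```python
def precision_recall(flagged, actual):
    """Return (precision, recall) for a set of flagged senders,
    given the set of senders that really are spam senders."""
    flagged, actual = set(flagged), set(actual)
    true_positives = len(flagged & actual)
    # Precision: share of flagged senders that are real spam senders.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: share of real spam senders that were flagged.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


p, r = precision_recall(flagged={"A", "B", "C", "D"}, actual={"A", "B", "E"})
```

In this example two of the four flagged senders are real spam senders (precision 0.5), and two of the three real spam senders were found (recall about 0.67).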
SUMMARY OF THE INVENTION
The technical problem to be solved by the present invention is to provide a method and system for monitoring spam short messages that improve the precision and recall of spam detection.

To solve this problem, the present invention provides a method for monitoring spam short messages, the method comprising: if a short message sender is detected to be a spam sender according to a predetermined rule, adding the sender to a blacklist and monitoring spam short messages accordingly. The predetermined rule at least includes:

if the timing feature of the short messages sent by the sender within a predetermined time period matches a predetermined timing feature, the sender is deemed a spam sender; or

if, within a predetermined time period, the ratio of the number of pairs that have mutual communication records, among the sender and all recipients of its messages, to the total number of pairwise combinations is less than a predetermined value, the sender is deemed a spam sender; or

if the timing feature of the short messages sent by the sender within a predetermined time period matches the predetermined timing feature, and the above ratio within the predetermined time period is less than the predetermined value, the sender is deemed a spam sender.

Before the step of detecting that the sender is a spam sender according to the predetermined rule, the method further includes: extracting the historical short message records of known spam senders; training the predetermined timing feature from the frequency characteristics with which known spam senders send messages, as obtained from the historical records; and/or building a social network graph of a known spam sender and all recipients of its messages by connecting with an edge each pair of nodes that have mutual communication records in the historical records, and training the predetermined value from the ratio of the number of edges to the total number of edges that would connect every pair of nodes.

Before the step of detecting that the sender is a spam sender according to the predetermined rule, the method further includes: detecting that the number of short messages sent by the sender per unit time exceeds a threshold.

The step of detecting that the sender is a spam sender according to the predetermined rule includes: checking online the sender's short message CDRs for the current period, and determining that the sender is a spam sender if the timing feature of the sender's messages matches the predetermined timing feature; or if the ratio of the number of pairs that have mutual communication records, among the sender and all recipients of its messages, to the total number of pairwise combinations is less than the predetermined value; or if both conditions hold.

Before the step of detecting that the sender is a spam sender according to the predetermined rule, the method further includes: extracting the sender's short message CDRs for the current period; and pre-processing the CDRs.

Before the step of detecting that the sender is a spam sender according to the predetermined rule, the method further includes: detecting that the sender is on neither the blacklist nor the whitelist.

To solve the above technical problem, the present invention also provides a system for monitoring spam short messages, the system comprising: a detection module configured to: if a short message sender is detected to be a spam sender according to a predetermined rule, add the sender to a blacklist and then send the blacklist to a monitoring module; and
a monitoring module configured to monitor spam short messages according to the blacklist. The predetermined rule at least includes:

if the timing feature of the short messages sent by the sender within a predetermined time period is detected to match a predetermined timing feature, the sender is deemed a spam sender; or

if, within a predetermined time period, the ratio of the number of pairs that have mutual communication records, among the sender and all recipients of its messages, to the total number of pairwise combinations is detected to be less than a predetermined value, the sender is deemed a spam sender; or

if both of the above conditions hold, the sender is deemed a spam sender.

The system further includes a training module configured to: extract the historical short message records of known spam senders; train the predetermined timing feature from the frequency characteristics with which known spam senders send messages, as obtained from the historical records, and then send the predetermined timing feature to the detection module; and/or build a social network graph of a known spam sender and all recipients of its messages by connecting with an edge each pair of nodes that have mutual communication records in the historical records, train the predetermined value from the ratio of the number of edges to the total number of edges that would connect every pair of nodes, and then send the predetermined value to the detection module.

The detection module includes an online detection module configured to: check online the sender's short message CDRs for the current period, and determine that the sender is a spam sender if the timing feature of the sender's messages matches the predetermined timing feature; or if the ratio of the number of pairs that have mutual communication records, among the sender and all recipients of its messages, to the total number of pairwise combinations is less than the predetermined value; or if both conditions hold.

The online detection module is further configured to: before detecting whether the sender is a spam sender, detect that the number of short messages sent by the sender per unit time exceeds a threshold.

The system further includes a bill pre-processing module configured to: extract the sender's short message CDRs for the current period and pre-process them.

The detection module is further configured to: before detecting that the sender is a spam sender according to the predetermined rule, detect that the sender is on neither the blacklist nor the whitelist.
Traditional content-based spam monitoring systems perform poorly on both precision and recall, and they must scan message content, which consumes substantial system resources. The method and system provided by the present invention monitor spam based on the timing and spatial characteristics of sender behavior. They achieve high precision and recall, raise the cost of evasion for spam senders, and, because message content does not need to be scanned, greatly improve system performance.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the spam short message monitoring system of the present invention;
FIG. 2 is a flowchart of the spam short message monitoring method of the present invention;
FIG. 3 is a schematic diagram of a spam short message monitoring system according to an embodiment of the present invention;
FIG. 4 is a flowchart of a spam short message monitoring method according to an embodiment of the present invention;
FIG. 5 is a flowchart of training the behavioral characteristics of spam senders according to an embodiment of the present invention;
FIG. 6 is a flowchart of online detection according to an embodiment of the present invention.
PREFERRED EMBODIMENTS OF THE INVENTION
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in this application and the features in them may be combined with one another arbitrarily.

The behavior of short message senders has certain temporal and spatial characteristics. For example, many spam senders use machines to send commercial advertisements in bulk, and the frequency characteristics of their sending timing differ markedly from those of ordinary senders. Machine bulk sending tends to have a fixed frequency, for example a constant interval between messages, whereas ordinary senders send at irregular frequencies with little regularity.

Likewise, in terms of spatial characteristics, a normal sender has a stable and distinctive social network that is relatively hidden from others, whereas the social network exhibited by a spam sender is chaotic and unstable. Everyone has a relatively fixed social circle; most recipients of a normal sender's messages are within that circle, and each person's social circle, that is, each person's social network, is different. The recipients of spam, by contrast, usually have no relationship with one another. To evade monitoring based on social networks, a spam sender would have to obtain each person's social network, and precisely because each person's network is unique, this is very hard to do. Put simply, "relatively hidden" means that we usually do not know what other people's social networks look like; a spam sender who sends in bulk would have to obtain the social networks of many people, which is harder still.

The present invention exploits the differences between spam senders and normal senders in temporal and/or spatial characteristics to monitor spam. By analyzing the temporal and spatial characteristics of spam senders, timing features and social network features are extracted, a metric model of the spam sender's timing features and social network is trained, and the model is used to measure the probability that a short message sender is a spam sender.
Training the metric model of the spam sender's timing features and social network works as follows: given a list of known spam senders, their temporal and spatial characteristics are analyzed, the features they share in timing and in their social networks are extracted, and these are expressed as parameter values that serve as a reference for checking whether other senders are spam senders.

The timing feature model is a set of frequency-characteristic parameters for sending messages, obtained by training on and analyzing the historical short message records of spam senders. For example, the intervals between the messages sent within a given period follow a certain pattern: if a spam sender sends one message every second, the feature exhibited is an interval of 1 second. Some low-frequency spam senders may deliberately lengthen the interval to evade monitoring, but as long as the messages are sent in bulk by machine, the intervals always show some regularity.

The social network feature (that is, the spatial feature model) is reflected in the short message communication records between the sender and the recipients over a period of time. The social relationships among the recipients of spam are distant, that is, they have few communication records with one another. The closeness of the social relationship between a sender and all of its recipients can be measured by the ratio of the number of pairs that have mutual communication records among all recipients and the sender (for example, two users who have messaged and replied to each other form one pair) to the total number of pairwise combinations of all recipients and the sender. For spam, this ratio is generally very small.

A social network graph containing the sender and all recipients can be built from the historical short message records: the sender and each recipient are treated as nodes, and nodes that have communication records with each other are connected by an edge. A node clustering parameter can then be computed from the graph, measured as the ratio of the number of edges actually present to the total number of edges that would connect every pair of nodes. More edges mean a higher degree of clustering; the social network graph built for a spam sender usually has a low degree of clustering.

Spam senders are either high-frequency or low-frequency senders. High-frequency senders send large amounts of spam in a short time and therefore do more harm; low-frequency senders do not produce large amounts of spam in a short time and do not cause harm in the short term. To handle both cases, the spam monitoring system must detect high-frequency senders within a short time and low-frequency senders within a certain period. To meet this requirement, the present invention combines online detection with offline detection. Online detection targets high-frequency senders, examines data from the current period, and is highly timely. Offline detection examines data over a longer period (for example, one week) and, as a supplement to online detection, can catch low-frequency spam senders that online detection cannot find.

To implement spam detection based on timing and spatial features, the sending records of spam senders in the historical CDRs for a certain period are first used as a training set for offline training, which yields the timing features and the social network metric model of spam senders. Training includes extracting the senders' timing features and social network features, performing cluster analysis, deriving the sending patterns of spam senders statistically, and finally generating a model file that contains the parameters of those patterns. During spam detection, the timing features and social network features of the senders of real-time messages are extracted in the same way, and the similarity between the sample and the model file is computed to determine whether the sender is a spam sender. The training process is adaptive: the system periodically fetches CDRs for training and adjusts the template library.
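The edge ratio described above (edges actually present divided by the edges that would connect every pair of nodes) can be sketched as follows. The record layout, the function name `mutual_pair_ratio`, and the reading of "mutual communication" as messages in both directions are assumptions for illustration.

```python
from itertools import combinations


def mutual_pair_ratio(records, sender):
    """Ratio of node pairs with two-way communication to all node pairs,
    over the graph formed by `sender` and everyone it sent messages to.

    `records` is a hypothetical list of (sender, recipient) tuples.
    """
    sent = set(records)
    nodes = {sender} | {r for s, r in records if s == sender}
    if len(nodes) < 2:
        return 0.0  # no pairs to measure
    pairs = list(combinations(sorted(nodes), 2))
    # An edge exists when each side has messaged the other.
    edges = sum(1 for a, b in pairs if (a, b) in sent and (b, a) in sent)
    return edges / len(pairs)


records = [("S", "A"), ("S", "B"), ("S", "C"),
           ("A", "S"), ("A", "B"), ("B", "A")]
ratio = mutual_pair_ratio(records, "S")

# A bulk sender whose ten recipients never reply and never talk to each other.
bulk = [("X", f"r{i}") for i in range(10)]
bulk_ratio = mutual_pair_ratio(bulk, "X")
```

In the first example two of the six possible pairs have two-way traffic, giving a ratio of one third; the bulk sender's graph has no edges at all, which is the low-clustering pattern the text attributes to spam senders.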
When the system performs spam detection, it first checks the blacklist and the whitelist; if the sender is on either list, that user is skipped. The blacklist contains users who have already been identified as spam senders or specific users whom the operator has barred from sending short messages. Checking them again is pointless: the purpose of spam monitoring is to find spam senders and add them to the blacklist, and users already on the blacklist need no further detection. Likewise, whitelisted users are usually users that the operator has designated as not monitored; whatever messages they send, the spam monitoring system may not treat them as spam senders, so monitoring them is also pointless. Next, detection based on timing features and/or spatial features can be performed, with online detection and offline detection running in parallel. Finally, the blacklists produced by the different detection methods can be merged by taking their union, and the resulting blacklist is synchronized to the Business Operation Support System (BOSS).
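The overall flow described above (skip listed users, run the detectors, take the union of their results) can be sketched as below. The detectors are stand-in callables and `run_detection` is a hypothetical name; synchronization to the BOSS is not shown.

```python
def run_detection(senders, blacklist, whitelist, detectors):
    """Return the updated blacklist after one detection pass.

    `detectors` is a list of hypothetical callables, each taking a sender
    and returning True if that method flags the sender as a spam sender.
    """
    # Users already on either list are skipped entirely.
    candidates = [s for s in senders
                  if s not in blacklist and s not in whitelist]
    flagged = set()
    for detect in detectors:
        # Union of the blacklists produced by each detection method.
        flagged |= {s for s in candidates if detect(s)}
    return set(blacklist) | flagged


def online(sender):
    return sender in {"B", "C"}   # stand-in for online detection


def offline(sender):
    return sender in {"D"}        # stand-in for offline detection


updated = run_detection(
    senders=["A", "B", "C", "D"],
    blacklist={"A"},
    whitelist={"B"},
    detectors=[online, offline],
)
```

Sender B is never flagged here even though the online stand-in would match it, because whitelisted users are skipped before any detector runs.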
For a better understanding of the present invention, it is further described below with reference to the drawings and specific embodiments. FIG. 1 is a schematic diagram of the spam short message monitoring system of the present invention. As shown in FIG. 1, the system mainly includes a detection module and a monitoring module.
The detection module is configured to: if a short message sender is detected to be a spam sender according to a predetermined rule, add the short message sender to a blacklist, and then send the blacklist to the monitoring module.
The monitoring module is configured to: monitor spam short messages according to the blacklist.
The predetermined rule includes at least:
if the time-series feature of the short messages sent by the short message sender within a predetermined time period is detected to be a predetermined time-series feature, for example the interval between short messages sent within a unit of time is constant, the short message sender is designated a spam sender; or
if it is detected that, within a predetermined time period, the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is less than a predetermined value, for example less than 10%, the short message sender is designated a spam sender; or
if the time-series feature of the short messages sent by the short message sender within a predetermined time period is detected to be a predetermined time-series feature, and the above ratio within the predetermined time period is detected to be less than the predetermined value, the short message sender is designated a spam sender.
In this way, the spam short message monitoring system of the present invention can monitor spam short messages according to the time-series features and/or spatial features of spam senders, improving the precision and recall of spam detection.
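The spatial rule can be made concrete with a short sketch. It assumes, as an interpretation not spelled out in the text, that a pair "has mutual communication records" when messages have gone in both directions between the two numbers; the function names and the `(sender, recipient)` record shape are hypothetical, and the 10% default is the example value given above.

```python
from itertools import combinations
from typing import Iterable, Optional, Tuple

Record = Tuple[str, str]  # (sender number, recipient number), hypothetical shape


def mutual_pair_ratio(sender: str, records: Iterable[Record]) -> Optional[float]:
    """Pairs with two-way communication divided by all pairwise combinations.

    The node set is the sender plus every recipient of the sender's messages.
    Returns None when the sender has no recipients in the records.
    """
    links = set(records)
    recipients = {dst for src, dst in links if src == sender and dst != sender}
    if not recipients:
        return None
    nodes = sorted(recipients | {sender})
    total_pairs = len(nodes) * (len(nodes) - 1) // 2
    # Assumption: "mutual" means a record exists in both directions.
    mutual_pairs = sum(1 for a, b in combinations(nodes, 2)
                       if (a, b) in links and (b, a) in links)
    return mutual_pairs / total_pairs


def is_spatial_spam_sender(sender: str, records: Iterable[Record],
                           threshold: float = 0.10) -> bool:
    """Apply the spatial rule: flag the sender when the ratio is below the threshold."""
    ratio = mutual_pair_ratio(sender, records)
    return ratio is not None and ratio < threshold
```

A sender who messages twenty strangers and receives no replies yields a ratio of 0 and is flagged, while a small group of friends who all message one another yields a ratio of 1.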
Further, the spam short message monitoring system of the present invention may also include a training module configured to: extract the historical short message records of known spam senders, train the predetermined time-series feature from the frequency characteristics with which the known spam senders send short messages in those records, and then send the predetermined time-series feature to the detection module; or construct a social-network graph between a known spam sender and all recipients of its short messages by connecting with an edge every two nodes that have mutual communication records in the historical short message records, train the predetermined value from the ratio of the number of such edges to the total number of edges that would connect every pair of nodes, and then send the predetermined value to the detection module.
In this way, the spam short message monitoring system of the present invention can train different time-series feature models and spatial feature models for different operators.
FIG. 2 is a flowchart of the spam short message monitoring method of the present invention. As shown in FIG. 2, the method includes the following steps:
Step 10: detect, according to a predetermined rule, whether a short message sender is a spam sender; if so, perform step 20; if not, repeat step 10.
Step 20: add the short message sender to a blacklist and monitor spam short messages.
The predetermined rule includes at least:
if the time-series feature of the short messages sent by the short message sender within a predetermined time period is a predetermined time-series feature, for example the interval between short messages sent within the predetermined time period is constant, the short message sender is designated a spam sender; or
if, within a predetermined time period, the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is less than a predetermined value, the short message sender is designated a spam sender; or
if the time-series feature of the short messages sent by the short message sender within a predetermined time period is a predetermined time-series feature, and the above ratio within the predetermined time period is less than the predetermined value, the short message sender is designated a spam sender.
In this way, the spam short message monitoring method of the present invention can monitor spam short messages based on the time-series features and/or spatial features of spam senders, improving the precision and recall of spam detection.
Preferably, the following steps may also be included before step 10:
extracting the historical short message records of known spam senders;
training the predetermined time-series feature from the frequency characteristics with which the known spam senders send short messages in the historical short message records; and/or
constructing a social-network graph between a known spam sender and all recipients of its short messages by connecting with an edge every two nodes that have mutual communication records in the historical short message records, and training the predetermined value from the ratio of the number of such edges to the total number of edges that would connect every pair of nodes.
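The example time-series feature named above, a constant interval between messages, can be checked with a few lines of code. This is only a sketch under stated assumptions: "constant" is approximated by a small coefficient of variation of the gaps between consecutive submission times, and the function name `has_fixed_interval` and the parameters `max_cv` and `min_messages` are hypothetical values that a real system would obtain from training.

```python
from statistics import mean, pstdev
from typing import Sequence


def has_fixed_interval(submit_times: Sequence[float],
                       max_cv: float = 0.05,
                       min_messages: int = 5) -> bool:
    """Return True if the gaps between consecutive messages are nearly constant.

    submit_times are submission timestamps in seconds for one sender within
    the predetermined time period.
    """
    # At least two messages are needed to form one interval.
    if len(submit_times) < max(min_messages, 2):
        return False
    times = sorted(submit_times)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    average = mean(intervals)
    if average == 0:
        # Every message submitted at the same instant: treated here as machine-like.
        return True
    # Coefficient of variation: spread of the intervals relative to their mean.
    return pstdev(intervals) / average <= max_cv
```

A sender submitting one message every two seconds passes this check, whereas the irregular gaps typical of a person typing messages by hand do not.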
The present invention is described in detail below through specific embodiments. FIG. 3 is a schematic diagram of the spam short message monitoring system according to an embodiment of the present invention. As shown in FIG. 3, the spam short message monitoring system of this embodiment includes: a CDR preprocessing module, a training module, a manual labeling module, a detection module, and a blacklist/whitelist management module.
The CDR preprocessing module is configured to preprocess the CDRs of the short message center, mainly by: removing duplicate records, removing non-point-to-point short messages, removing CDRs of non-target operators, extracting useful fields, converting the records to the system's internal format, and storing them in the database.
Some CDR records are retries after a transmission failed for system reasons; such records must be treated as a single short message. Some records are short messages that the operator's customer service system broadcasts to users rather than messages sent by users; they need no monitoring and are removed. The operator monitors only its own subscribers; when a subscriber of another operator sends a short message to one of the operator's subscribers, the short message center also generates a CDR, and such records need no monitoring either. A CDR contains many fields, but spam monitoring uses only a few of them, so only the useful fields are extracted. In addition, the CDRs must be converted into a format that the system can recognize internally. The CDR preprocessing module can obtain the original CDRs of the short message center via the File Transfer Protocol (FTP).
The training module is configured to: train on the historical CDRs of known spam senders and generate the model files used for spam detection.
The manual labeling module is configured to: before the spam sender model is trained, correctly label the user category of each candidate user who may be a spam sender, so that the model files produced by training match the regular characteristics of spam senders more accurately.
The detection module in this embodiment may include the following modules.
The online time-series detection module is configured to: detect the time-series features of short message senders online and derive a blacklist.
The online spatial detection module is configured to: detect the social-network features of short message senders online and derive a blacklist.
The offline spatial detection module is configured to: detect the social-network features of short message senders offline and derive a blacklist.
The blacklist/whitelist management module is configured to: take the union of the blacklists derived by the three detection modules above, synchronize the result to the BOSS, obtain the blacklist and whitelist from the BOSS, and synchronize them to the detection module. The blacklist/whitelist management module and the BOSS can also synchronize the blacklist and whitelist via FTP.
FIG. 4 is a flowchart of the spam short message monitoring method according to an embodiment of the present invention. As shown in FIG. 4, the process includes the following steps:
Step 201: obtain the original CDRs of the short message center and preprocess them.
Preprocessing by the CDR preprocessing module includes: removing duplicate records, removing non-point-to-point short messages, removing CDRs of non-target operators, extracting useful fields, converting the records to the system's internal format, and sorting them by short message submission time. The useful fields extracted are: message identification (id), sender number, recipient number, submission time, message length, and message content.
The CDR preprocessing module then sends the preprocessed short message CDRs to the detection module.
Step 202: the detection module scans the preprocessed CDRs one by one, recording only the submission time, the sender number, and the recipient number.
Step 203: the detection module filters each record against the blacklist and whitelist. If the user is on either list, the user is ignored.
Step 204: using the model files generated by the training module, perform detection based on the time-series features and/or spatial features of the short message sender.
In this embodiment, both online detection and offline detection are possible. Online detection can examine the time-series features of a short message sender and can also examine the sender's spatial features. Offline detection generally examines the spatial features of a short message sender over a historical period.
The online time-series detection module, the online spatial detection module, and the offline spatial detection module can operate in parallel or individually.
The online time-series detection module and the online spatial detection module analyze the characteristics of the short messages that scanned users sent during the current period. Offline spatial detection usually analyzes a user's social-network features over a longer historical period (for example, one week).
Step 205: add the detected spam senders to the blacklist.
If the three detection modules operate in parallel, each produces its own blacklist, and the blacklist/whitelist management module takes the union of the three blacklists to obtain the final blacklist.
The three detection modules detect spam senders from three angles. According to the detection results, most of the blacklist entries found by the three methods are the same. The methods are used in parallel so that they complement one another: a small number of spam senders may be caught by some methods and missed by others. For example, low-frequency spam senders are hard to catch with online detection but can be caught offline. In addition, using the three methods in parallel raises the evasion cost for spam senders.
Step 206: the blacklist/whitelist management module synchronizes the blacklist to the BOSS.
The BOSS provides the blacklist to the monitoring module of the short message center. When sending a short message, the short message center first checks whether the sender is on the blacklist; if so, that user is prohibited from sending short messages.
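The preprocessing of step 201 can be sketched as below. The raw record layout is not given in the text, so the dictionary keys (`msg_id`, `is_p2p`, and so on), the `Cdr` type, and the `is_own_subscriber` predicate are all hypothetical; the sketch also assumes that a retried transmission reuses the message id of the original record.

```python
from typing import Any, Callable, Dict, Iterable, List, NamedTuple


class Cdr(NamedTuple):
    """Internal record holding only the six useful fields listed in step 201."""
    msg_id: str
    sender: str
    recipient: str
    submit_time: float
    length: int
    content: str


def preprocess(raw_records: Iterable[Dict[str, Any]],
               is_own_subscriber: Callable[[str], bool]) -> List[Cdr]:
    """Deduplicate, filter, project, and sort raw short message center records."""
    seen_ids = set()
    cleaned: List[Cdr] = []
    for raw in raw_records:
        msg_id = str(raw["msg_id"])
        # Assumption: a retry carries the same message id and counts once.
        if msg_id in seen_ids:
            continue
        seen_ids.add(msg_id)
        # Drop non-point-to-point traffic such as operator broadcasts.
        if not raw.get("is_p2p", False):
            continue
        # Drop records whose sender belongs to another operator.
        if not is_own_subscriber(str(raw["sender"])):
            continue
        # Keep only the useful fields, converted to the internal format.
        cleaned.append(Cdr(msg_id, str(raw["sender"]), str(raw["recipient"]),
                           float(raw["submit_time"]), int(raw["length"]),
                           str(raw["content"])))
    # Sort by submission time, as the detection module scans in time order.
    cleaned.sort(key=lambda record: record.submit_time)
    return cleaned
```

How an operator recognizes its own subscribers (number ranges, a subscriber database lookup) is left to the caller through `is_own_subscriber`.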
FIG. 5 is a flowchart of training the behavioral characteristics of spam senders according to an embodiment of the present invention. As shown in FIG. 5, the process includes the following steps:
Step 301: extract the historical CDRs of a given period, preprocess them, and store them in the database.
Step 302: using an existing empirical model, obtain a preliminary candidate training set of users considered likely to be spam senders.
The existing empirical model is a set of parameters obtained by analyzing the time-series and spatial regularities of spam senders in the operator's historical CDR data.
Step 303: evaluate the size of the training set. If the training set is too small, meaning it contains few spam senders, the model files trained from it have little statistical significance, and the process must return to step 301 to obtain more CDRs and train again. If the training set is considered large enough, proceed to step 304.
Step 304: label the training set manually. Using the labeling tool provided by the manual labeling module, view the short messages sent by each user in the training set and classify the users by manual judgment.
Manual classification usually examines the content of the short messages a user has sent to decide whether the user has sent spam; the criteria for spam usually also take the operator's requirements into account.
Manual classification usually divides users into four categories: normal senders, spam senders, mixed senders, and other senders. A mixed sender has sent both normal short messages and spam; the messages of other senders are usually garbled text or greeting messages broadcast by the operator.
Step 305: according to the labeling results, extract the historical CDRs of the spam senders and train the time-series features and spatial features.
The time-series features can be converted into frequency-domain information. The extracted spatial feature parameters can include: the number of short messages sent, the number of short messages received, the number of recipients who replied, and the number of pairs of recipients having mutual communication records. The spatial feature model can be trained from the number of replies, that is, the number of pairs having mutual communication records.
Step 306: through frequency-domain analysis and social-network analysis, determine the sending regularities of spam senders, and generate a model file based on time-series features and a model file based on spatial features.
Step 307: synchronize the generated model files to the detection module.
The model files can be adjusted flexibly to suit different operators' requirements for precision and recall. For example, if the operator wants higher recall, users labeled as mixed senders are treated as spam senders during training; if the operator wants higher precision, only users labeled as spam senders are used for training.
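Step 305 mentions converting time-series features into frequency-domain information without naming a transform. One possible reading, offered only as an assumption, is to bin a sender's submission times into a count series and take a discrete Fourier transform; a strong peak then reveals a regular sending rhythm. The function names below are hypothetical, and a deliberately naive DFT is used so the sketch needs only the standard library.

```python
import cmath
from typing import List, Sequence


def send_counts(submit_times: Sequence[float], bin_seconds: float,
                n_bins: int) -> List[int]:
    """Count messages per time bin, starting at the earliest submission."""
    counts = [0] * n_bins
    if not submit_times:
        return counts
    start = min(submit_times)
    for t in submit_times:
        index = int((t - start) // bin_seconds)
        if index < n_bins:
            counts[index] += 1
    return counts


def magnitude_spectrum(counts: Sequence[int]) -> List[float]:
    """Naive O(n^2) DFT magnitudes for frequency indices 0 .. n // 2."""
    n = len(counts)
    return [abs(sum(c * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, c in enumerate(counts)))
            for k in range(n // 2 + 1)]


def dominant_period_bins(counts: Sequence[int]) -> float:
    """Period, in bins, of the strongest non-constant frequency component."""
    if len(counts) < 2:
        raise ValueError("need at least two bins")
    spectrum = magnitude_spectrum(counts)
    # Skip index 0, which only reflects the total message count.
    peak = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    return len(counts) / peak
```

For a sender who submits one message every three seconds, binning at one second over 63 bins puts the only non-zero peak at the frequency index corresponding to a period of three bins. A strictly periodic series also produces harmonics when they fall within the analyzed range, so a production system would need to handle those rather than simply taking the maximum.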
FIG. 6 is a flowchart of online detection according to an embodiment of the present invention. As shown in FIG. 6, the process includes the following steps:
Step 401: scan the preprocessed CDRs one by one, recording only the submission time and the numbers of the short message sender and recipient.
Step 402: evaluate the trigger condition for online detection. Only when the trigger condition is met does the process proceed to step 403 and start the online detection algorithm; otherwise it returns to step 401 and continues scanning CDRs.
For example, the online detection algorithm is started when the number of short messages a user sends per unit time exceeds a threshold; this threshold can be adjusted according to actual detection conditions.
Step 403: extract the time-series features and spatial features of the real-time short message sender.
Step 404: after the time-series features and spatial features of the short message sender are determined, compare them with the trained model files to judge whether the sender is a spam sender.
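The trigger check of step 402 can be sketched as a per-sender sliding window over submission times. The class name `OnlineTrigger` and its parameters are hypothetical, and the sketch assumes records arrive in submission-time order, which the preprocessing step provides by sorting.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class OnlineTrigger:
    """Decide when a sender's recent volume justifies running online detection."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds  # length of the "unit time"
        self.threshold = threshold            # adjustable message-count threshold
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, sender: str, submit_time: float) -> bool:
        """Record one scanned CDR; return True if the sender exceeds the threshold."""
        times = self._recent[sender]
        times.append(submit_time)
        # Discard submissions that have fallen out of the window.
        while times and submit_time - times[0] >= self.window_seconds:
            times.popleft()
        return len(times) > self.threshold
```

With `OnlineTrigger(60, 3)`, a sender's fourth message within one minute returns True, at which point the feature extraction of step 403 would begin for that sender.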
Those of ordinary skill in the art will understand that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software functional module. The present invention is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present invention. The present invention may of course have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art may make corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the claims appended to the present invention.
Industrial Applicability
The method and system for monitoring spam short messages provided by the present invention monitor spam based on the time-series and spatial features of sender behavior. They achieve high precision and recall and raise the evasion cost for spam senders. Because the content of short messages does not need to be scanned, system performance is also greatly improved.
Claims
1. A method for monitoring spam short messages, the method comprising: if a short message sender is detected to be a spam sender according to a predetermined rule, adding the short message sender to a blacklist and monitoring spam short messages, wherein the predetermined rule includes at least: if the time-series feature of the short messages sent by the short message sender within a predetermined time period is a predetermined time-series feature, designating the short message sender a spam sender; or if, within a predetermined time period, the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is less than a predetermined value, designating the short message sender a spam sender; or if the time-series feature of the short messages sent by the short message sender within a predetermined time period is a predetermined time-series feature, and, within the predetermined time period, the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is less than a predetermined value, designating the short message sender a spam sender.
2. The method of claim 1, wherein, before the step of detecting that the short message sender is a spam sender according to the predetermined rule, the method further comprises: extracting the historical short message records of known spam senders; training the predetermined time-series feature from the frequency characteristics with which the known spam senders send short messages in the historical short message records; and/or constructing a social-network graph between a known spam sender and all recipients of its short messages by connecting with an edge every two nodes that have mutual communication records in the historical short message records, and training the predetermined value from the ratio of the number of such edges to the total number of edges that would connect every pair of nodes.
3. The method of claim 1, wherein, before the step of detecting that the short message sender is a spam sender according to the predetermined rule, the method further comprises: detecting that the number of short messages sent by the short message sender per unit time exceeds a threshold.
4. The method of claim 3, wherein the step of detecting that the short message sender is a spam sender according to the predetermined rule comprises: detecting online the short message CDRs of the short message sender within the current period, and if the time-series feature of the short messages sent by the short message sender is detected to be the predetermined time-series feature, determining that the short message sender is a spam sender; or detecting online the short message CDRs of the short message sender within the current period, and if the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is detected to be less than the predetermined value, determining that the short message sender is a spam sender; or detecting online the short message CDRs of the short message sender within the current period, and if the time-series feature of the short messages sent by the short message sender is detected to be the predetermined time-series feature, and the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is detected to be less than the predetermined value, determining that the short message sender is a spam sender.
5. The method of claim 4, wherein, before the step of detecting that the short message sender is a spam sender according to the predetermined rule, the method further comprises: extracting the short message CDRs of the short message sender within the current period; and preprocessing the short message CDRs.
6. The method of any one of claims 1 to 5, wherein, before the step of detecting that the short message sender is a spam sender according to the predetermined rule, the method further comprises: detecting that the short message sender is on neither the blacklist nor the whitelist.
7. A system for monitoring spam short messages, the system comprising: a detection module configured to: if a short message sender is detected to be a spam sender according to a predetermined rule, add the short message sender to a blacklist, and then send the blacklist to a monitoring module; and the monitoring module, configured to: monitor spam short messages according to the blacklist; wherein the predetermined rule includes at least: if the time-series feature of the short messages sent by the short message sender within a predetermined time period is detected to be a predetermined time-series feature, designating the short message sender a spam sender; or if it is detected that, within a predetermined time period, the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is less than a predetermined value, designating the short message sender a spam sender; or if the time-series feature of the short messages sent by the short message sender within a predetermined time period is a predetermined time-series feature, and, within the predetermined time period, the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is less than a predetermined value, designating the short message sender a spam sender.
8. The system of claim 7, further comprising a training module configured to: extract the historical short message records of known spam senders, train the predetermined time-series feature from the frequency characteristics with which the known spam senders send short messages in the historical short message records, and then send the predetermined time-series feature to the detection module; and/or construct a social-network graph between a known spam sender and all recipients of its short messages by connecting with an edge every two nodes that have mutual communication records in the historical short message records, train the predetermined value from the ratio of the number of such edges to the total number of edges that would connect every pair of nodes, and then send the predetermined value to the detection module.
9. The system of claim 7, wherein the detection module comprises an online detection module configured to: detect online the short message CDRs of the short message sender within the current period, and if the time-series feature of the short messages sent by the short message sender is detected to be the predetermined time-series feature, determine that the short message sender is a spam sender; or detect online the short message CDRs of the short message sender within the current period, and if the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is detected to be less than the predetermined value, determine that the short message sender is a spam sender; or detect online the short message CDRs of the short message sender within the current period, and if the time-series feature of the short messages sent by the short message sender is detected to be the predetermined time-series feature, and the ratio of the number of pairs having mutual communication records, among the short message sender and all recipients of its short messages, to the total number of pairwise combinations among them is detected to be less than the predetermined value, determine that the short message sender is a spam sender.
10. The system of claim 9, wherein the online detection module is further configured to: before detecting whether the short message sender is a spam sender, detect that the number of short messages sent by the short message sender per unit time exceeds a threshold.
11. The system of claim 9, further comprising a CDR preprocessing module configured to: extract, for the current period of time, the short message sender's …
12. The system of any one of claims 7 to 11, wherein the detection module is further configured to: before detecting, according to the predetermined rule, whether the short message sender is a spam sender, detect that the short message sender is on neither the blacklist nor the whitelist.
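The checks recited in claims 9, 10, and 12 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the `(source, destination)` record layout, and the threshold parameters are hypothetical, and the timing-feature test is omitted because the claims do not specify its form. Two assumptions are worth flagging: a pair of nodes is treated as having "mutual communication records" if either party messaged the other, and a blacklisted sender is treated as a spam sender without further analysis.

```python
from itertools import combinations


def linked_pair_ratio(sender, records):
    """Share of node pairs, among the sender and its recipients, that exchanged messages.

    records: iterable of (source, destination) number pairs taken from CDRs.
    """
    records = list(records)
    recipients = {dst for src, dst in records if src == sender}
    nodes = recipients | {sender}
    if len(nodes) < 2:
        return 1.0  # assumption: with no recipients there is nothing to judge
    # Assumption: a pair counts as linked if either side messaged the other.
    linked = {frozenset(pair) for pair in records if pair[0] != pair[1]}
    pairs = list(combinations(nodes, 2))
    hits = sum(1 for a, b in pairs if frozenset((a, b)) in linked)
    return hits / len(pairs)


def is_spam_sender(sender, records, count_in_unit_time, *, rate_threshold,
                   ratio_threshold, blacklist=frozenset(), whitelist=frozenset()):
    """Apply the list check, the volume pre-filter, and the ratio test in order."""
    # Claim 12: only senders on neither list reach the behavioral check.
    if sender in blacklist:
        return True  # assumption: blacklisted senders are treated as spam senders
    if sender in whitelist:
        return False
    # Claim 10: only senders whose volume exceeds the threshold are examined.
    if count_in_unit_time <= rate_threshold:
        return False
    # Claim 9, second alternative: a sparse relationship graph marks a spam sender.
    return linked_pair_ratio(sender, records) < ratio_threshold


if __name__ == "__main__":
    # A sender who messages nine recipients that never message each other.
    cdrs = [("S", r) for r in "ABCDEFGHI"]
    print(linked_pair_ratio("S", cdrs))
    print(is_spam_sender("S", cdrs, 9, rate_threshold=5, ratio_threshold=0.3))
```

Enumerating every pair is quadratic in the number of recipients, which is acceptable for a sketch; a production system would more likely count linked pairs directly from the records.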
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102527552A CN101909261A (en) | 2010-08-10 | 2010-08-10 | Method and system for monitoring spam |
CN201010252755.2 | 2010-08-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012019386A1 (en) | 2012-02-16 |
Family
ID=43264550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2010/078516 WO2012019386A1 (en) | 2010-08-10 | 2010-11-08 | Method and system for monitoring spam short messages |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN101909261A (en) |
WO (1) | WO2012019386A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105323763A (en) * | 2014-06-27 | 2016-02-10 | 中国移动通信集团湖南有限公司 | Method and apparatus for identifying spam messages |
CN118474682A (en) * | 2024-07-15 | 2024-08-09 | 浙江三子智联科技有限公司 | Service short message monitoring method and system based on big data |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102231873A (en) * | 2011-06-22 | 2011-11-02 | 中兴通讯股份有限公司 | Method and system for monitoring garbage message and monitor processing apparatus |
CN102231874A (en) * | 2011-06-23 | 2011-11-02 | 中兴通讯股份有限公司 | Short message processing method, device and system |
CN102890688B (en) * | 2011-07-22 | 2018-01-02 | 深圳市世纪光速信息技术有限公司 | A kind of automatic detection method and device for submitting content |
CN103996130B (en) * | 2014-04-29 | 2016-04-27 | 北京京东尚科信息技术有限公司 | A kind of information on commodity comment filter method and system |
CN105744493B (en) * | 2014-12-08 | 2019-09-10 | 中国移动通信集团河北有限公司 | A kind of information identifying method and device |
CN105119910A (en) * | 2015-07-23 | 2015-12-02 | 浙江大学 | Template-based online social network rubbish information real-time detecting method |
CN106559761A (en) * | 2015-09-28 | 2017-04-05 | 中国移动通信集团公司 | A kind of information processing method and terminal, server |
CN105704689A (en) * | 2016-01-12 | 2016-06-22 | 深圳市深讯数据科技股份有限公司 | Big data acquisition and analysis method and system of short message behaviors |
CN106506329A (en) * | 2016-10-20 | 2017-03-15 | 北京小米移动软件有限公司 | Delete the method and device of end-user listening data information |
CN108306811B (en) * | 2017-02-06 | 2021-03-26 | 腾讯科技(深圳)有限公司 | Message processing method and device |
CN107872772B (en) * | 2017-12-19 | 2021-02-26 | 北京奇虎科技有限公司 | Method and device for detecting fraud short messages |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1696619A1 (en) * | 2003-06-23 | 2006-08-30 | Microsoft Corporation | Method and device for spam detection |
CN101188580A (en) * | 2007-12-05 | 2008-05-28 | 中国联合通信有限公司 | A real time spam filtering method and system |
CN101299729A (en) * | 2008-06-25 | 2008-11-05 | 哈尔滨工程大学 | Method for judging rubbish mail based on topological action |
CN101686444A (en) * | 2008-09-28 | 2010-03-31 | 国际商业机器公司 | System and method for detecting spam SMS sender number in real time |
- 2010-08-10: CN, application CN2010102527552A (CN101909261A), active, Pending
- 2010-11-08: WO, application PCT/CN2010/078516 (WO2012019386A1), active, Application Filing
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105323763A (en) * | 2014-06-27 | 2016-02-10 | 中国移动通信集团湖南有限公司 | Method and apparatus for identifying spam messages |
CN105323763B (en) * | 2014-06-27 | 2019-03-05 | 中国移动通信集团湖南有限公司 | A kind of recognition methods of junk short message and device |
CN118474682A (en) * | 2024-07-15 | 2024-08-09 | 浙江三子智联科技有限公司 | Service short message monitoring method and system based on big data |
Also Published As
Publication number | Publication date |
---|---|
CN101909261A (en) | 2010-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2012019386A1 (en) | Method and system for monitoring spam short messages | |
CN101472245B (en) | Method and apparatus for intercepting rubbish short message | |
CN108881265B (en) | Network attack detection method and system based on artificial intelligence | |
WO2011153744A1 (en) | Method and system for monitoring spam short message | |
CN108683687B (en) | Network attack identification method and system | |
KR101186743B1 (en) | Detection of unwanted messages spam | |
CN101257671B (en) | Method for real time filtering large scale rubbish SMS based on content | |
CN105228143B (en) | A kind of refuse messages discrimination method, device and terminal | |
US20120110672A1 (en) | Systems and methods for classification of messaging entities | |
CN102790752A (en) | Fraud information filtering system and method on basis of feature identification | |
CN108471429A (en) | A kind of network attack alarm method and system | |
CN108881263A (en) | A kind of network attack result detection method and system | |
CN111752973B (en) | System and method for generating heuristic rules for identifying spam emails | |
CN201491020U (en) | Event classification and rule tree-based association analysis device | |
CN103442014A (en) | Method and system for automatic detection of suspected counterfeit websites | |
CA2977807C (en) | Technique for detecting suspicious electronic messages | |
CN108183888A (en) | A kind of social engineering Network Intrusion path detection method based on random forests algorithm | |
CN110519150A (en) | Mail-detection method, apparatus, equipment, system and computer readable storage medium | |
Janabi et al. | Convolutional neural network based algorithm for early warning proactive system security in software defined networks | |
US20180351897A1 (en) | A method and device for spam sms detection | |
CN104091122A (en) | Detection system of malicious data in mobile internet | |
CN108011805A (en) | Method, apparatus, intermediate server and the car networking system of message screening | |
WO2012151929A1 (en) | Method and device for monitoring short message | |
Wang et al. | A behavior-based SMS antispam system | |
CN108322354B (en) | Method and device for identifying running-stealing flow account |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 10855807; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 10855807; Country of ref document: EP; Kind code of ref document: A1 |