WO2016177069A1 - Management method, device, spam short message monitoring system and computer storage medium - Google Patents

Management method, device, spam short message monitoring system and computer storage medium

Info

Publication number
WO2016177069A1
Authority
WO
WIPO (PCT)
Prior art keywords
short message
spam
keyword
sample
message
Prior art date
Application number
PCT/CN2016/075548
Other languages
English (en)
French (fr)
Inventor
李冠军
侯振强
于思亮
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016177069A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud
    • H04W12/128 Anti-malware arrangements, e.g. protection against SMS fraud or mobile malware
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W88/00 Devices specially adapted for wireless communication networks, e.g. terminals, base stations or access point devices
    • H04W88/18 Service support devices; Network management devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/12 Messaging; Mailboxes; Announcements
    • H04W4/14 Short messaging services, e.g. short message services [SMS] or unstructured supplementary service data [USSD]

Definitions

  • The invention relates to the field of spam short message monitoring, and in particular to a management method, a device, a spam short message monitoring system and a computer storage medium.
  • In the prior art, short messages are analyzed by a spam short message monitoring system, which filters out spam messages to improve the user experience.
  • The existing spam monitoring system analyzes and filters short message content using keywords, such as "invoicing" and "transfer", that the operator's operation and maintenance personnel provide based on experience. While screening out spam messages, this approach inevitably also removes some users' normal short messages, which is the mis-blocking problem.
  • Having operation and maintenance personnel provide the keywords is also labor-intensive, and some spam messages will inevitably be missed. That is, the prior-art keyword policy provided by operation and maintenance personnel cannot meet users' ever-increasing usage requirements.
  • The embodiments of the invention provide a management method, a device, a spam short message monitoring system and a computer storage medium, so as to solve the problem that the existing manually provided keyword policy cannot meet users' ever-increasing usage requirements.
  • An embodiment of the invention provides a keyword policy management method for a spam short message monitoring system, which comprises: acquiring the keyword policy of the spam short message monitoring system; performing evaluation and optimization processing on the keyword policy based on a short message sample library, and processing the keyword policy according to the processing result; and sending the evaluated and optimized keyword policy to the spam short message monitoring system.
  • The evaluation and optimization processing includes: simulating normal short message traffic based on the short message sample library, and performing, for each keyword in the keyword policy, at least one of spam mis-blocking optimization, missed-spam optimization, and spam interception efficiency optimization.
  • The spam mis-blocking optimization includes: predicting the precision and the recall of each keyword in the keyword policy, comparing the predicted result with the optimization target, and managing the keyword according to the comparison result.
  • Managing the keywords according to the comparison result includes: deleting keywords with poor prediction results, flagging keywords with mediocre prediction results for suggested handling, and retaining keywords with good prediction results.
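The per-keyword evaluation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the substring matching rule, the function names `evaluate_keyword` and `manage_keyword`, and the thresholds `delete_below` and `retain_from` are all assumptions, since the text does not say how "poor", "general" and "good" results are delimited.

```python
from dataclasses import dataclass


@dataclass
class KeywordStats:
    keyword: str
    precision: float  # share of matched samples that are spam
    recall: float     # share of spam samples that are matched


def evaluate_keyword(keyword, spam_samples, normal_samples):
    """Predict precision and recall of one keyword against the sample library."""
    hit_spam = sum(keyword in s for s in spam_samples)
    hit_normal = sum(keyword in s for s in normal_samples)
    hits = hit_spam + hit_normal
    precision = hit_spam / hits if hits else 0.0
    recall = hit_spam / len(spam_samples) if spam_samples else 0.0
    return KeywordStats(keyword, precision, recall)


def manage_keyword(stats, delete_below=0.6, retain_from=0.9):
    """Map predicted precision to an action (thresholds are assumed values)."""
    if stats.precision < delete_below:
        return "delete"    # poor result: mostly hits normal messages
    if stats.precision < retain_from:
        return "suggest"   # mediocre result: flag for suggested handling
    return "retain"        # good result


spam = ["cheap invoice service call now", "invoice for sale", "win a prize today"]
normal = ["please transfer the file to me", "see you today",
          "invoice attached for your order"]

decisions = {}
for kw in ["invoice", "today", "prize"]:
    decisions[kw] = manage_keyword(evaluate_keyword(kw, spam, normal))
```

A real optimization target would probably combine precision and recall; only precision drives the decision here to keep the sketch short.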
  • The missed-spam optimization includes: determining the library of spam messages that were not intercepted among the simulated short messages, computing interception keywords for that library, and adding the interception keywords to the keyword policy.
  • The spam interception efficiency optimization includes, for each keyword: determining whether a duplicate keyword exists and, if so, deleting it; determining whether an intersecting keyword exists and, if so, combining and sorting them; and determining whether a keyword that can be merged with it exists and, if so, merging them.
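One possible reading of the interception efficiency optimization is sketched below. The text does not define "intersecting" or "mergeable" keywords, so this example makes an assumption: under substring matching, a keyword that contains a shorter keyword already in the policy is redundant and can be folded into it. The function name `simplify_policy` is hypothetical.

```python
def simplify_policy(keywords):
    """Remove duplicate and redundant keywords from a policy.

    Assumption: keywords are matched as substrings, so a longer keyword that
    contains a shorter one never intercepts anything the shorter one misses.
    """
    # Step 1: drop exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(keywords))

    # Step 2: visit keywords from shortest to longest and keep only those
    # not already covered by a shorter kept keyword.
    kept = []
    for kw in sorted(unique, key=len):
        if not any(short in kw for short in kept):
            kept.append(kw)

    # Step 3: restore the original policy order for the surviving keywords.
    kept_set = set(kept)
    return [kw for kw in unique if kw in kept_set]


policy = simplify_policy(["invoice", "fake invoice", "invoice", "prize"])
```

Fewer keywords mean fewer comparisons per message, which is the efficiency gain this step is after.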
  • The method further includes: repeating the evaluation and optimization processing on the optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
  • The method further comprises: obtaining spam short message samples and normal short message samples from the spam short message monitoring system and the complaint platform, and establishing the short message sample library from the spam short message samples and the normal short message samples.
  • Establishing the short message sample library from the spam short message samples and the normal short message samples comprises: adding the spam short message samples and the normal short message samples directly to the trusted sample library of the short message sample library; and, according to the trusted sample library, classifying and reviewing the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform, and storing them in the short message sample library.
  • Classifying and reviewing, according to the trusted sample library, the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform includes: classifying and reviewing each to-be-detected short message according to the similarity between its fingerprint signature and the fingerprint signatures of the spam short message samples and the normal short message samples.
  • Classifying and reviewing the to-be-detected short message includes: extracting the spam fingerprint signature of each short message from the spam short message samples and comparing the fingerprint signature of the to-be-detected short message with the spam fingerprint signatures, and if they are similar, classifying the to-be-detected short message as spam; and extracting the normal fingerprint signature of each short message from the normal short message samples and comparing the fingerprint signature of the to-be-detected short message with the normal fingerprint signatures, and if they are similar, classifying the to-be-detected short message as a normal short message.
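The fingerprint comparison can be illustrated with a small sketch. The text does not say how a fingerprint signature is computed or how similarity is measured, so the choices here are assumptions: the fingerprint is the set of character 3-grams of the cleaned text, similarity is the Jaccard index, and two messages count as similar at a hypothetical threshold of 0.8.

```python
import re


def fingerprint(text, n=3):
    """Assumed fingerprint: set of character n-grams of the cleaned text."""
    cleaned = re.sub(r"[\W_]+", "", text.lower())  # drop spaces and punctuation
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def similarity(fp_a, fp_b):
    """Jaccard index of two fingerprints (0.0 when either is empty)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def classify_by_fingerprint(message, spam_samples, normal_samples, threshold=0.8):
    """Return "spam", "normal", or None when no sample is similar enough."""
    fp = fingerprint(message)
    if any(similarity(fp, fingerprint(s)) >= threshold for s in spam_samples):
        return "spam"
    if any(similarity(fp, fingerprint(s)) >= threshold for s in normal_samples):
        return "normal"
    return None


spam_lib = ["Cheap invoices, call 555-0100 now!"]
normal_lib = ["See you at lunch tomorrow"]

label_a = classify_by_fingerprint("cheap invoices call 555 0100 now", spam_lib, normal_lib)
label_b = classify_by_fingerprint("See you at lunch tomorrow.", spam_lib, normal_lib)
label_c = classify_by_fingerprint("Quarterly report is attached", spam_lib, normal_lib)
```

A production system would more likely precompute and index the sample fingerprints rather than recomputing them for every message as this sketch does.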
  • Classifying and reviewing the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform further includes: learning from the trusted sample library to generate a spam short message classifier, and using the classifier to classify and review the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform.
  • Learning from the trusted sample library to generate the spam short message classifier comprises: extracting a batch of spam short message samples from the spam short message samples and a batch of normal short message samples from the normal short message samples; preprocessing the extracted short message content samples; performing Chinese word segmentation on the preprocessed short message content to generate the word segments of each short message; and computing, in turn, the weight of each word segment in the spam short message samples and its weight in the normal short message samples.
  • An embodiment of the present invention further provides a computer storage medium that stores executable instructions for performing the foregoing method.
  • An embodiment of the present invention provides a keyword policy management apparatus for a spam short message monitoring system, which includes: an obtaining module, configured to acquire the keyword policy of the spam short message monitoring system; a processing module, configured to perform evaluation and optimization processing on the keyword policy based on a short message sample library, and to process the keyword policy according to the processing result; and a sending module, configured to send the evaluated and optimized keyword policy to the spam short message monitoring system.
  • The processing module is configured to simulate normal short message traffic based on the short message sample library and to perform, for each keyword in the keyword policy, at least one of spam mis-blocking optimization, missed-spam optimization, and spam interception efficiency optimization.
  • The processing module is configured to predict the precision and the recall of each keyword in the keyword policy, compare the predicted result with the optimization target, and manage the keyword according to the comparison result.
  • The processing module is configured to delete keywords with poor prediction results, flag keywords with mediocre prediction results for suggested handling, and retain keywords with good prediction results.
  • The processing module is configured to determine the library of spam messages that were not intercepted among the simulated short messages, compute interception keywords for that library, and add the interception keywords to the keyword policy.
  • The processing module is configured, for each keyword, to determine whether a duplicate keyword exists and, if so, delete it; to determine whether an intersecting keyword exists and, if so, combine and sort them; and to determine whether a keyword that can be merged with it exists and, if so, merge them.
  • The processing module is further configured to repeat the evaluation and optimization processing on the optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
  • The apparatus further includes an establishing module, configured to obtain spam short message samples and normal short message samples from the spam short message monitoring system and the complaint platform, and to establish the short message sample library from the spam short message samples and the normal short message samples.
  • The establishing module is configured to add the spam short message samples and the normal short message samples directly to the trusted sample library of the short message sample library, to classify and review, according to the trusted sample library, the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform, and to store them in the short message sample library.
  • The establishing module is configured to classify and review each to-be-detected short message according to the similarity between its fingerprint signature and the fingerprint signatures of the spam short message samples and the normal short message samples.
  • The establishing module is configured to extract the spam fingerprint signature of each short message from the spam short message samples and compare the fingerprint signature of the to-be-detected short message with the spam fingerprint signatures, and if they are similar, classify the to-be-detected short message as spam; and to extract the normal fingerprint signature of each short message from the normal short message samples and compare the fingerprint signature of the to-be-detected short message with the normal fingerprint signatures, and if they are similar, classify the to-be-detected short message as a normal short message.
  • The establishing module is configured to learn from the trusted sample library to generate the spam short message classifier, and to use the classifier to classify and review the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform.
  • The establishing module is configured to extract a batch of spam short message samples from the spam short message samples and a batch of normal short message samples from the normal short message samples; preprocess the extracted short message content samples; perform Chinese word segmentation on the preprocessed short message content to generate the word segments of each short message; and compute, in turn, the weight of each word segment in the spam short message samples and its weight in the normal short message samples.
  • An embodiment of the present invention provides a spam short message monitoring system, which uses the management device provided by the embodiments of the present invention to manage its keyword policy.
  • The embodiments of the invention provide a new management method in which the keyword policy is evaluated and optimized against the short message sample library without manual intervention. The keyword policy is thus optimized and managed automatically, which makes the policy more complete and the interception more accurate. This solves the problem that the existing manually provided keyword policy cannot meet users' ever-increasing usage requirements, and improves the user experience.
  • FIG. 1 is a schematic structural diagram of a management apparatus according to a first embodiment of the present invention.
  • FIG. 2 is a flowchart of a management method according to a second embodiment of the present invention.
  • FIG. 3 is a flowchart of a management method according to a third embodiment of the present invention.
  • FIG. 4 is a schematic diagram of short message fingerprint recognition in a third embodiment of the present invention.
  • FIG. 1 is a schematic structural diagram of a management apparatus according to a first embodiment of the present invention. As shown in FIG. 1, in this embodiment, the management apparatus 1 provided by the present invention includes:
  • the obtaining module 11, configured to obtain the keyword policy of the spam short message monitoring system;
  • the processing module 12, configured to perform evaluation and optimization processing on the keyword policy based on the short message sample library, and to process the keyword policy according to the processing result;
  • the sending module 13, configured to send the evaluated and optimized keyword policy to the spam short message monitoring system.
  • The processing module 12 in the foregoing embodiment is configured to simulate normal short message traffic based on the short message sample library and to perform, for each keyword in the keyword policy, at least one of spam mis-blocking optimization, missed-spam optimization, and spam interception efficiency optimization.
  • The processing module 12 in the foregoing embodiment is configured to predict the precision and the recall of each keyword in the keyword policy, compare the predicted result with the optimization target, and manage the keyword according to the comparison result.
  • The processing module 12 in the foregoing embodiment is configured to delete keywords with poor prediction results, flag keywords with mediocre prediction results for suggested handling, and retain keywords with good prediction results.
  • The processing module 12 in the foregoing embodiment is configured to determine the library of spam messages that were not intercepted among the simulated short messages, compute interception keywords for that library, and add the interception keywords to the keyword policy.
  • The processing module 12 in the foregoing embodiment is configured, for each keyword, to determine whether a duplicate keyword exists and, if so, delete it; to determine whether an intersecting keyword exists and, if so, combine and sort them; and to determine whether a keyword that can be merged with it exists and, if so, merge them.
  • The processing module 12 in the foregoing embodiment is further configured to repeat the evaluation and optimization processing on the optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
  • The management apparatus in the foregoing embodiment further includes an establishing module 14, configured to obtain spam short message samples and normal short message samples from the spam short message monitoring system and the complaint platform, and to establish the short message sample library from the spam short message samples and the normal short message samples.
  • The establishing module 14 in the foregoing embodiment is configured to add the spam short message samples and the normal short message samples directly to the trusted sample library of the short message sample library, to classify and review, according to the trusted sample library, the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform, and to store them in the short message sample library.
  • The establishing module 14 in the foregoing embodiment is configured to classify and review each to-be-detected short message according to the similarity between its fingerprint signature and the fingerprint signatures of the spam short message samples and the normal short message samples.
  • The establishing module 14 in the foregoing embodiment is configured to extract the spam fingerprint signature of each short message from the spam short message samples and compare the fingerprint signature of the to-be-detected short message with the spam fingerprint signatures, and if they are similar, classify the to-be-detected short message as spam; and to extract the normal fingerprint signature of each short message from the normal short message samples and compare the fingerprint signature of the to-be-detected short message with the normal fingerprint signatures, and if they are similar, classify the to-be-detected short message as a normal short message.
  • The establishing module 14 in the foregoing embodiment is configured to learn from the trusted sample library to generate the spam short message classifier, and to use the classifier to classify and review the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform.
  • The establishing module 14 in the foregoing embodiment is configured to extract a batch of spam short message samples from the spam short message samples and a batch of normal short message samples from the normal short message samples; preprocess the extracted short message content samples; perform Chinese word segmentation on the preprocessed short message content to generate the word segments of each short message; and compute, in turn, the weight of each word segment in the spam short message samples and its weight in the normal short message samples.
  • An embodiment of the present invention provides a spam short message monitoring system, which uses the management device 1 provided by the embodiments of the present invention to manage its keyword policy.
  • FIG. 2 is a flowchart of a management method according to a second embodiment of the present invention. As shown in FIG. 2, in the embodiment, the management method provided by the present invention includes the following steps:
  • S201: Acquire the keyword policy of the spam short message monitoring system;
  • S202: Perform evaluation and optimization processing on the keyword policy based on the short message sample library, and process the keyword policy according to the processing result;
  • S203: Send the evaluated and optimized keyword policy to the spam short message monitoring system.
  • The evaluation and optimization processing in the foregoing embodiment includes: simulating normal short message traffic based on the short message sample library, and performing, for each keyword in the keyword policy, at least one of spam mis-blocking optimization, missed-spam optimization, and spam interception efficiency optimization.
  • The spam mis-blocking optimization in the foregoing embodiment includes: predicting the precision and the recall of each keyword in the keyword policy, comparing the predicted result with the optimization target, and managing the keyword according to the comparison result.
  • Managing the keyword according to the comparison result in the foregoing embodiment includes: deleting keywords with poor prediction results, flagging keywords with mediocre prediction results for suggested handling, and retaining keywords with good prediction results.
  • The missed-spam optimization in the foregoing embodiment includes: determining the library of spam messages that were not intercepted among the simulated short messages, computing interception keywords for that library, and adding the interception keywords to the keyword policy.
  • The spam interception efficiency optimization in the foregoing embodiment includes, for each keyword: determining whether a duplicate keyword exists and, if so, deleting it; determining whether an intersecting keyword exists and, if so, combining and sorting them; and determining whether a keyword that can be merged with it exists and, if so, merging them.
  • The method in the foregoing embodiment further includes: repeating the evaluation and optimization processing on the optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
  • The method in the foregoing embodiment further includes: obtaining spam short message samples and normal short message samples from the spam short message monitoring system and the complaint platform, and establishing the short message sample library from the spam short message samples and the normal short message samples.
  • Establishing the short message sample library in the foregoing embodiment includes: adding the spam short message samples and the normal short message samples directly to the trusted sample library of the short message sample library; and, according to the trusted sample library, classifying and reviewing the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform, and storing them in the short message sample library.
  • Classifying and reviewing the to-be-detected short messages according to the trusted sample library in the foregoing embodiment includes: classifying and reviewing each to-be-detected short message according to the similarity between its fingerprint signature and the fingerprint signatures of the spam short message samples and the normal short message samples.
  • Classifying and reviewing the to-be-detected short message includes: extracting the spam fingerprint signature of each short message from the spam short message samples and comparing the fingerprint signature of the to-be-detected short message with the spam fingerprint signatures, and if they are similar, classifying the to-be-detected short message as spam; and extracting the normal fingerprint signature of each short message from the normal short message samples and comparing the fingerprint signature of the to-be-detected short message with the normal fingerprint signatures, and if they are similar, classifying the to-be-detected short message as a normal short message.
  • Classifying and reviewing the to-be-detected short messages according to the trusted sample library in the foregoing embodiment further includes: learning from the trusted sample library to generate the spam short message classifier, and using the classifier to classify and review the to-be-detected short messages synchronized from the spam short message monitoring system and the complaint platform.
  • Learning from the trusted sample library to generate the spam short message classifier in the foregoing embodiment comprises: extracting a batch of spam short message samples from the spam short message samples and a batch of normal short message samples from the normal short message samples; preprocessing the extracted short message content samples; performing Chinese word segmentation on the preprocessed short message content to generate the word segments of each short message; and computing, in turn, the weight of each word segment in the spam short message samples and its weight in the normal short message samples.
  • Depending on the scenario, the short messages referred to in the embodiments of the present invention include SMS messages, multimedia messages, broadcast messages, emails, and the like.
  • FIG. 3 is a flowchart of a management method according to a third embodiment of the present invention. As shown in FIG. 3, in the embodiment, the management method provided by the present invention includes the following steps:
  • S301: The management device synchronizes data with the spam short message monitoring system and the complaint platform.
  • The embodiment provides data synchronization interfaces between the management device and the spam short message monitoring system and the complaint platform:
  • The IF1 interface receives spam short message samples and normal short message samples from the spam short message monitoring system and the complaint platform. Through automatic review these form a trusted spam sample library and a trusted normal sample library, whose samples are the basis for evaluation and optimization;
  • The IF2 interface receives, from the spam short message monitoring system, the keyword policy that is to be evaluated and optimized before formal deployment;
  • The IF3 interface synchronizes the optimized keyword policy to the spam short message monitoring system for formal deployment.
  • S302: The management device establishes the short message sample library.
  • The management device adds the spam messages (marked by users or reported through complaints) and the normal messages obtained through synchronization to the trusted sample library of the short message sample library.
  • The learning and training of a Naive Bayes classifier is taken as an example for explanation.
  • The specific process is as follows:
  • Preprocess the extracted short message content samples, including but not limited to rejecting short content, such as content of fewer than 10 words, and noise processing, such as deleting spaces, punctuation and other special characters;
  • A Naive Bayes classifier is obtained through the above learning and training.
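The training steps named in this document (preprocessing, word segmentation, per-class word-segment weights) can be sketched as below. Several details are assumptions: the "10 words" limit is applied as a character count, whitespace splitting stands in for Chinese word segmentation (which would normally need a dedicated segmenter), whitespace is therefore kept during noise removal, and the weight uses add-one smoothing. The names `preprocess`, `tokenize`, `train` and `token_weight` are hypothetical.

```python
import re
from collections import Counter


def preprocess(text, min_length=10):
    """Reject very short content, then strip punctuation (noise processing)."""
    if len(text) < min_length:  # assumed to count characters
        return None
    return re.sub(r"[^\w\s]+", "", text.lower())


def tokenize(text):
    """Stand-in for Chinese word segmentation: split on whitespace."""
    return text.split()


def train(spam_samples, normal_samples):
    """Count, per class, how many samples contain each word segment."""
    model = {}
    for label, samples in (("spam", spam_samples), ("normal", normal_samples)):
        cleaned = [c for c in map(preprocess, samples) if c is not None]
        docs = [tokenize(c) for c in cleaned]
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        model[label] = {"n_docs": len(docs), "doc_freq": doc_freq}
    return model


def token_weight(model, label, token):
    """Smoothed estimate of P(token | class); add-one smoothing is assumed."""
    cls = model[label]
    return (cls["doc_freq"][token] + 1) / (cls["n_docs"] + 2)


model = train(
    spam_samples=["cheap invoice call now", "invoice for sale today", "hi"],
    normal_samples=["see you at lunch today", "please send the invoice"],
)
w_spam = token_weight(model, "spam", "invoice")
w_normal = token_weight(model, "normal", "invoice")
```

Smoothing keeps a word segment that was never seen in one class from forcing that class's probability to zero during classification.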
  • The core idea of Naive Bayes short message classification is to compute the probability that the short message to be detected is a normal short message and the probability that it is spam. If the probability that the short message belongs to the spam class, P(C0|D), is greater than the probability that it belongs to the normal class, P(C1|D), the short message is classified as spam. By Bayes' rule, P(Cj|D) = P(D|Cj) P(Cj) / P(D), where D denotes the content of the short message to be detected and:
  • C0: spam message class;
  • C1: normal message class;
  • P(C0) and P(C1) are the global probabilities of spam messages and normal messages, which can be obtained statistically;
  • P(C0) is the number of spam samples divided by (the number of spam samples + the number of normal short message samples); P(C1) is the number of normal short message samples divided by (the number of spam samples + the number of normal short message samples);
  • The content of the short message is expressed as a vector of word segments, and the word segments are regarded as independent of each other;
  • P(D|Cj) can therefore be expressed as the product of the conditional probabilities of each word segment under class Cj, that is, the product over t of P(Wt|Cj). P(Wt|C0) is the probability that the word segment Wt appears in the spam message class, and P(Wt|C1) is correspondingly the probability that the word segment Wt appears in the normal short message class.
  • For example, suppose the ratio of the number of spam samples to the number of normal short message samples used by a Naive Bayes classifier is 5:95, that is, P(C0) is equal to 0.05 and P(C1) is equal to 0.95.
  • In this example, the probability that the to-be-detected short message belongs to the spam class is 4.58 times the probability that it belongs to the normal class, so the message is classified as spam.
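The comparison of the two class probabilities can be worked through in code. The priors 0.05 and 0.95 come from the example above; the per-word-segment probabilities below are invented for illustration only and do not reproduce the 4.58 ratio, whose inputs are not given in this text. The function name `posterior_ratio` is hypothetical.

```python
import math


def posterior_ratio(tokens, p_spam, p_normal, prior_spam=0.05, prior_normal=0.95):
    """Return P(spam | message) / P(normal | message) under Naive Bayes.

    The shared denominator P(message) cancels in the ratio, and the sums are
    done in log space to avoid underflow on long messages.
    """
    log_spam = math.log(prior_spam) + sum(math.log(p_spam[t]) for t in tokens)
    log_normal = math.log(prior_normal) + sum(math.log(p_normal[t]) for t in tokens)
    return math.exp(log_spam - log_normal)


# Hypothetical conditional probabilities P(Wt | C0) and P(Wt | C1).
p_spam = {"invoice": 0.6, "cheap": 0.5}
p_normal = {"invoice": 0.05, "cheap": 0.1}

ratio = posterior_ratio(["invoice", "cheap"], p_spam, p_normal)
label = "spam" if ratio > 1 else "normal"
```

With these made-up values the ratio is 0.015 / 0.00475, roughly 3.16, so the message would be labeled spam even though the spam prior is only 5%.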
  • The Naive Bayes multi-classifier extracts N groups of samples from the spam short message sample library and the normal short message sample library, each group comprising a batch of spam short message samples and a batch of normal short message samples. N defaults to 30. Each group of samples is used to train one classifier. When a short message to be detected is identified, every classifier performs detection and scoring; when more than half of the classifiers identify the message as spam, it is considered spam. Introducing this scoring mechanism improves accuracy.
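The "more than half" rule can be sketched as a simple vote. The classifiers here are toy stand-ins (plain functions returning `True` for spam) rather than trained Naive Bayes models, and the name `is_spam_by_vote` is hypothetical.

```python
def is_spam_by_vote(message, classifiers):
    """Return True when strictly more than half of the classifiers say spam."""
    spam_votes = sum(1 for classify in classifiers if classify(message))
    return spam_votes > len(classifiers) / 2


# Three toy classifiers standing in for the N (default 30) trained ones.
classifiers = [
    lambda m: "invoice" in m,
    lambda m: "prize" in m,
    lambda m: "cheap" in m,
]

vote_a = is_spam_by_vote("cheap invoice", classifiers)   # 2 of 3 vote spam
vote_b = is_spam_by_vote("claim your prize", classifiers)  # 1 of 3 votes spam
```

Because the comparison is strict, an even split (for example 15 of 30) does not count as spam.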
  • The type of a short message is first identified by its fingerprint; a short message that the fingerprint cannot identify is then identified by the classifier; and a short message that still cannot be identified is discarded.
  • The spam fingerprint signature of each short message is extracted from the spam short message sample library, and the fingerprint signature of the short message to be detected is compared with the spam fingerprint signatures; if they are similar, the short message to be detected is classified as spam. Similarly, the normal fingerprint signature of each short message is extracted from the normal short message sample library, and the fingerprint signature of the short message to be detected is compared with the normal fingerprint signatures; if they are similar, the short message to be detected is classified as a normal short message.
  • Preprocess the content of the short message, including but not limited to noise processing, such as deleting special characters such as spaces and punctuation marks;
  • the present embodiment performs an automatic review on the short message to be classified (non-user reported, which may be misplaced) intercepted by the spam short message monitoring system, and the process description is as follows:
  • the trusted sample in the external sample is already manually marked spam or normal text message, such as the sample of the manual review and the complaint platform in the spam SMS monitoring system, so the fingerprint SMS sample library and the normal SMS sample library are directly entered according to the mark;
  • Non-trusted samples in the external sample such as the spam message detected by the spam SMS monitoring system, need to be automatically reviewed by the spam message classifier;
  • An untrusted sample first enters the fingerprint-signature classifier stage. When the classifier identifies it as a normal message, it enters the normal SMS sample library; when the classifier identifies it as spam, it enters the spam SMS sample library; when the classifier cannot identify it, it enters the naive Bayes classifier stage;
  • The naive Bayes classifier examines the untrusted sample. When the classifier identifies it as a normal message, it enters the normal SMS sample library; when the classifier identifies it as spam, it enters the spam SMS sample library; when the classifier cannot identify it, it is discarded directly.
  • Based on the short message sample library, this embodiment also provides a keyword policy extraction mechanism.
  • The main process is described as follows:
  • The extracted message content samples are preprocessed, including but not limited to removing messages whose content is too short (for example fewer than 10 characters) and noise handling such as deleting spaces, punctuation marks and other special characters;
  • The token feature vector Dx is reduced in dimension: the M feature values with the highest probability are selected, each of which must exceed a threshold K. If fewer than L feature values pass the probability filter, the token feature vector Dx is discarded. The result is the following weight feature vector of dimension M:
  • This vector is the candidate keyword set of the sample
  • After the short message sample library is updated, the process returns to step S303 for classifier learning and training.
  • The optimized policy goes back to step 3 for pre-evaluation; pre-evaluation and intelligent optimization form an iterative loop until the optimization target is reached or the maximum number of iterations is reached.
  • X1, X2, Y1 and Y2 are all configurable, with N1 < N2, X1 < X2 and Y1 < Y2; the contribution of a rule is the number of spam samples that the rule hits.
  • The false-interception optimization method is:
  • The missed-interception optimization method is:
  • Efficiency optimization improves efficiency for keyword combination policies that degrade performance, including:
  • The keyword policy is evaluated and optimized against the short message sample library without manual intervention, which realizes automatic optimization management of the keyword policy, makes the keyword policy more complete and interception more accurate, solves the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements, and improves the user experience.
  • The modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described can be performed in an order different from the one given here, or they can be fabricated as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.
  • The foregoing embodiments of the present invention can be applied to the field of spam SMS monitoring; they solve the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements, and improve the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Transfer Between Computers (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

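The present invention provides a management method, a management apparatus, a spam SMS monitoring system and a computer storage medium. The method includes: obtaining the keyword policy of the spam SMS monitoring system; evaluating and optimizing the keyword policy based on a short message sample library, and processing the keyword policy according to the result; and sending the evaluated and optimized keyword policy to the spam SMS monitoring system. With the present invention, the keyword policy is evaluated and optimized against the short message sample library without manual intervention, which realizes automatic optimization management of the keyword policy, makes the keyword policy more complete and interception more accurate, solves the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements, and improves the user experience.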

Description

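Management method and apparatus, spam SMS monitoring system, and computer storage medium
Technical Field
The present invention relates to the field of spam SMS monitoring, and in particular to a management method, a management apparatus, a spam SMS monitoring system and a computer storage medium.
Background
Spam short messages are increasingly frequent and seriously disturb users' daily lives. To reduce them, the prior art uses a spam SMS monitoring system to analyze short messages and filter out spam, so as to improve the user experience.
Existing spam SMS monitoring systems analyze and filter message content using keywords that the operator's operation and maintenance staff supply from experience, such as "issue invoice" or "transfer money". While this removes spam, it inevitably also removes some users' normal messages, that is, it causes false interceptions. In addition, supplying keywords by hand is labor-intensive, and some spam inevitably slips through as missed interceptions. In other words, keyword policies supplied by operation and maintenance staff cannot meet users' ever-growing usage requirements.
Therefore, providing a management method capable of managing keyword policies is a technical problem that those skilled in the art urgently need to solve.
Summary
Embodiments of the present invention provide a management method, a management apparatus, a spam SMS monitoring system and a computer storage medium, to solve the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements.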
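An embodiment of the present invention provides a method for managing the keyword policy of a spam SMS monitoring system, including: obtaining the keyword policy of the spam SMS monitoring system; evaluating and optimizing the keyword policy based on a short message sample library, and processing the keyword policy according to the result; and sending the evaluated and optimized keyword policy to the spam SMS monitoring system.
Further, the evaluation and optimization includes: simulating ordinary short messages based on the short message sample library and performing, for each keyword in the keyword policy, at least one of spam false-interception optimization, spam missed-interception optimization and spam interception-efficiency optimization.
Further, the spam false-interception optimization includes: predicting the precision and recall of each keyword in the keyword policy, comparing the prediction results with the optimization target, and managing the keywords according to the comparison result.
Further, managing the keywords according to the comparison result includes: deleting keywords with poor prediction results, recommending handling of keywords with mediocre prediction results, and retaining keywords with good prediction results.
Further, the spam missed-interception optimization includes: determining the set of spam messages among the ordinary short messages that were not intercepted, computing interception keywords for that set, and adding the interception keywords to the keyword policy.
Further, the spam interception-efficiency optimization includes: for each keyword, determining whether a duplicate keyword exists and, if so, deleting it; determining whether an overlapping keyword exists and, if so, combining and reorganizing them; and determining whether a mergeable keyword exists and, if so, merging them.
Further, the method also includes: re-evaluating and re-optimizing the evaluated and optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
Further, the method also includes: obtaining spam samples and normal samples from the spam SMS monitoring system and the complaint platform, and building the short message sample library from the spam samples and normal samples.
Further, building the short message sample library from the spam samples and normal samples includes: adding the spam samples and normal samples directly to the trusted sample library of the short message sample library, classifying and reviewing, according to the trusted sample library, the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform, and storing them in the short message sample library.
Further, classifying and reviewing the to-be-detected short messages according to the trusted sample library includes: classifying and reviewing the to-be-detected short messages according to the similarity between their fingerprint signatures and the fingerprint signatures of the spam samples and normal samples.
Further, classifying and reviewing the to-be-detected short messages according to this fingerprint similarity includes: extracting a spam fingerprint signature from the content of each message in the spam samples, comparing the fingerprint signature of a to-be-detected short message with the spam fingerprint signature and, if the two are similar, classifying the message as spam; and extracting a normal fingerprint signature from the content of each message in the normal samples, comparing the fingerprint signature of the to-be-detected short message with the normal fingerprint signature and, if the two are similar, classifying the message as normal.
Further, classifying and reviewing the to-be-detected short messages according to the trusted sample library also includes: learning from the trusted sample library to generate a spam message classifier, and using the spam message classifier to classify and review the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform.
Further, learning from the trusted sample library to generate the spam message classifier includes: drawing a batch of spam samples from the spam samples and a batch of normal samples from the normal samples; preprocessing the drawn message content samples; performing Chinese word segmentation on the preprocessed content to generate the tokens of each message; and computing, for each token in turn, its weight in the spam samples and its weight in the normal samples.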
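An embodiment of the present invention also provides a computer storage medium that stores executable instructions for performing the above method.
An embodiment of the present invention provides an apparatus for managing the keyword policy of a spam SMS monitoring system, including: an obtaining module configured to obtain the keyword policy of the spam SMS monitoring system; a processing module configured to evaluate and optimize the keyword policy based on a short message sample library and to process the keyword policy according to the result; and a sending module configured to send the evaluated and optimized keyword policy to the spam SMS monitoring system.
Further, the processing module is configured to simulate ordinary short messages based on the short message sample library and to perform, for each keyword in the keyword policy, at least one of spam false-interception optimization, spam missed-interception optimization and spam interception-efficiency optimization.
Further, the processing module is configured to predict the precision and recall of each keyword in the keyword policy, compare the prediction results with the optimization target, and manage the keywords according to the comparison result.
Further, the processing module is configured to delete keywords with poor prediction results, recommend handling of keywords with mediocre prediction results, and retain keywords with good prediction results.
Further, the processing module is configured to determine the set of spam messages among the ordinary short messages that were not intercepted, compute interception keywords for that set, and add the interception keywords to the keyword policy.
Further, the processing module is configured to, for each keyword, determine whether a duplicate keyword exists and, if so, delete it; determine whether an overlapping keyword exists and, if so, combine and reorganize them; and determine whether a mergeable keyword exists and, if so, merge them.
Further, the processing module is also configured to re-evaluate and re-optimize the evaluated and optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
Further, the apparatus also includes an establishing module configured to obtain spam samples and normal samples from the spam SMS monitoring system and the complaint platform, and to build the short message sample library from the spam samples and normal samples.
Further, the establishing module is configured to add the spam samples and normal samples directly to the trusted sample library of the short message sample library, to classify and review, according to the trusted sample library, the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform, and to store them in the short message sample library.
Further, the establishing module is configured to classify and review the to-be-detected short messages according to the similarity between their fingerprint signatures and the fingerprint signatures of the spam samples and normal samples.
Further, the establishing module is configured to extract a spam fingerprint signature from the content of each message in the spam samples, compare the fingerprint signature of a to-be-detected short message with the spam fingerprint signature and, if the two are similar, classify the message as spam; and to extract a normal fingerprint signature from the content of each message in the normal samples, compare the fingerprint signature of the to-be-detected short message with the normal fingerprint signature and, if the two are similar, classify the message as normal.
Further, the establishing module is configured to learn from the trusted sample library to generate a spam message classifier, and to use the spam message classifier to classify and review the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform.
Further, the establishing module is configured to draw a batch of spam samples from the spam samples and a batch of normal samples from the normal samples; preprocess the drawn message content samples; perform Chinese word segmentation on the preprocessed content to generate the tokens of each message; and compute, for each token in turn, its weight in the spam samples and its weight in the normal samples.
An embodiment of the present invention provides a spam SMS monitoring system that uses the management apparatus provided by the embodiments of the present invention to manage its keyword policy.
Beneficial effects of the embodiments of the present invention:
The embodiments of the present invention provide a new management method in which the keyword policy is evaluated and optimized against the short message sample library without manual intervention. This realizes automatic optimization management of the keyword policy, makes the keyword policy more complete and interception more accurate, solves the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements, and improves the user experience.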
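Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the management apparatus provided by the first embodiment of the present invention;
FIG. 2 is a flowchart of the management method provided by the second embodiment of the present invention;
FIG. 3 is a flowchart of the management method provided by the third embodiment of the present invention;
FIG. 4 is a schematic diagram of short message fingerprint identification in the third embodiment of the present invention.
Detailed Description
The present invention is further explained below through specific embodiments with reference to the accompanying drawings.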
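First embodiment:
FIG. 1 is a schematic structural diagram of the management apparatus provided by the first embodiment of the present invention. As shown in FIG. 1, in this embodiment the management apparatus 1 includes:
an obtaining module 11, configured to obtain the keyword policy of the spam SMS monitoring system;
a processing module 12, configured to evaluate and optimize the keyword policy based on a short message sample library and to process the keyword policy according to the result;
a sending module 13, configured to send the evaluated and optimized keyword policy to the spam SMS monitoring system.
In some embodiments, the processing module 12 is configured to simulate ordinary short messages based on the short message sample library and to perform, for each keyword in the keyword policy, at least one of spam false-interception optimization, spam missed-interception optimization and spam interception-efficiency optimization.
In some embodiments, the processing module 12 is configured to predict the precision and recall of each keyword in the keyword policy, compare the prediction results with the optimization target, and manage the keywords according to the comparison result.
In some embodiments, the processing module 12 is configured to delete keywords with poor prediction results, recommend handling of keywords with mediocre prediction results, and retain keywords with good prediction results.
In some embodiments, the processing module 12 is configured to determine the set of spam messages among the ordinary short messages that were not intercepted, compute interception keywords for that set, and add the interception keywords to the keyword policy.
In some embodiments, the processing module 12 is configured to, for each keyword, determine whether a duplicate keyword exists and, if so, delete it; determine whether an overlapping keyword exists and, if so, combine and reorganize them; and determine whether a mergeable keyword exists and, if so, merge them.
In some embodiments, the processing module 12 is also configured to re-evaluate and re-optimize the evaluated and optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
In some embodiments, as shown in FIG. 1, the management apparatus also includes an establishing module 14, configured to obtain spam samples and normal samples from the spam SMS monitoring system and the complaint platform, and to build the short message sample library from the spam samples and normal samples.
In some embodiments, the establishing module 14 is configured to add the spam samples and normal samples directly to the trusted sample library of the short message sample library, to classify and review, according to the trusted sample library, the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform, and to store them in the short message sample library.
In some embodiments, the establishing module 14 is configured to classify and review the to-be-detected short messages according to the similarity between their fingerprint signatures and the fingerprint signatures of the spam samples and normal samples.
In some embodiments, the establishing module 14 is configured to extract a spam fingerprint signature from the content of each message in the spam samples, compare the fingerprint signature of a to-be-detected short message with the spam fingerprint signature and, if the two are similar, classify the message as spam; and to extract a normal fingerprint signature from the content of each message in the normal samples, compare the fingerprint signature of the to-be-detected short message with the normal fingerprint signature and, if the two are similar, classify the message as normal.
In some embodiments, the establishing module 14 is configured to learn from the trusted sample library to generate a spam message classifier, and to use the spam message classifier to classify and review the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform.
In some embodiments, the establishing module 14 is configured to draw a batch of spam samples from the spam samples and a batch of normal samples from the normal samples; preprocess the drawn message content samples; perform Chinese word segmentation on the preprocessed content to generate the tokens of each message; and compute, for each token in turn, its weight in the spam samples and its weight in the normal samples.
Correspondingly, an embodiment of the present invention provides a spam SMS monitoring system that uses the management apparatus 1 provided by the embodiments of the present invention to manage its keyword policy.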
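Second embodiment:
FIG. 2 is a flowchart of the management method provided by the second embodiment of the present invention. As shown in FIG. 2, in this embodiment the management method includes the following steps:
S201: Obtain the keyword policy of the spam SMS monitoring system;
S202: Evaluate and optimize the keyword policy based on a short message sample library, and process the keyword policy according to the result;
S203: Send the evaluated and optimized keyword policy to the spam SMS monitoring system.
In some embodiments, the evaluation and optimization includes: simulating ordinary short messages based on the short message sample library and performing, for each keyword in the keyword policy, at least one of spam false-interception optimization, spam missed-interception optimization and spam interception-efficiency optimization.
In some embodiments, the spam false-interception optimization includes: predicting the precision and recall of each keyword in the keyword policy, comparing the prediction results with the optimization target, and managing the keywords according to the comparison result.
In some embodiments, managing the keywords according to the comparison result includes: deleting keywords with poor prediction results, recommending handling of keywords with mediocre prediction results, and retaining keywords with good prediction results.
In some embodiments, the spam missed-interception optimization includes: determining the set of spam messages among the ordinary short messages that were not intercepted, computing interception keywords for that set, and adding the interception keywords to the keyword policy.
In some embodiments, the spam interception-efficiency optimization includes: for each keyword, determining whether a duplicate keyword exists and, if so, deleting it; determining whether an overlapping keyword exists and, if so, combining and reorganizing them; and determining whether a mergeable keyword exists and, if so, merging them.
In some embodiments, the method also includes: re-evaluating and re-optimizing the evaluated and optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
In some embodiments, the method also includes: obtaining spam samples and normal samples from the spam SMS monitoring system and the complaint platform, and building the short message sample library from the spam samples and normal samples.
In some embodiments, building the short message sample library from the spam samples and normal samples includes: adding the spam samples and normal samples directly to the trusted sample library of the short message sample library, classifying and reviewing, according to the trusted sample library, the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform, and storing them in the short message sample library.
In some embodiments, classifying and reviewing the to-be-detected short messages according to the trusted sample library includes: classifying and reviewing the to-be-detected short messages according to the similarity between their fingerprint signatures and the fingerprint signatures of the spam samples and normal samples.
In some embodiments, classifying and reviewing the to-be-detected short messages according to this fingerprint similarity includes: extracting a spam fingerprint signature from the content of each message in the spam samples, comparing the fingerprint signature of a to-be-detected short message with the spam fingerprint signature and, if the two are similar, classifying the message as spam; and extracting a normal fingerprint signature from the content of each message in the normal samples, comparing the fingerprint signature of the to-be-detected short message with the normal fingerprint signature and, if the two are similar, classifying the message as normal.
In some embodiments, classifying and reviewing the to-be-detected short messages according to the trusted sample library also includes: learning from the trusted sample library to generate a spam message classifier, and using the spam message classifier to classify and review the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform.
In some embodiments, learning from the trusted sample library to generate the spam message classifier includes: drawing a batch of spam samples from the spam samples and a batch of normal samples from the normal samples; preprocessing the drawn message content samples; performing Chinese word segmentation on the preprocessed content to generate the tokens of each message; and computing, for each token in turn, its weight in the spam samples and its weight in the normal samples.
The short messages referred to in the embodiments of the present invention include common short messages, multimedia messages, broadcast messages, e-mail and similar information.
The embodiments of the present invention are further explained below with reference to a specific application scenario.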
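Third embodiment:
FIG. 3 is a flowchart of the management method provided by the third embodiment of the present invention. As shown in FIG. 3, in this embodiment the management method includes the following steps:
S301: The management apparatus synchronizes data with the spam SMS monitoring system and the complaint platform.
In this embodiment, data synchronization interfaces exist between the management apparatus and the spam SMS monitoring system and between the management apparatus and the complaint platform. Specifically, interface IF1 receives spam and normal short message samples from the spam SMS monitoring system and the complaint platform; after automatic review these form the trusted spam SMS sample library and normal SMS sample library, and the samples in these libraries are the basis of evaluation and optimization. Interface IF2 receives from the spam SMS monitoring system the keyword policy that is to be evaluated and optimized before formal deployment. Interface IF3 synchronizes the evaluated and optimized keyword policy back to the spam SMS monitoring system for formal deployment.
S302: The management apparatus builds the short message sample library.
The management apparatus adds the spam messages (marked by users or reported through complaints) and the normal messages among the synchronized messages to the trusted sample library within the short message sample library.
S303: Learning and training of the spam message classifier.
This embodiment takes the learning and training of a naive Bayes classifier as an example. The process is as follows:
1) Draw a batch of spam samples from the spam SMS sample library and a batch of normal samples from the normal SMS sample library; P(C0) = (number of spam samples) / (number of spam samples + number of normal samples), and P(C1) = (number of normal samples) / (number of spam samples + number of normal samples);
2) Preprocess the drawn message content samples, including but not limited to removing messages whose content is too short (for example fewer than 10 characters) and noise handling such as deleting spaces, punctuation marks and other special characters;
3) Perform Chinese word segmentation on the preprocessed content to generate the token feature vector Dx of the message, Dx = {W1, W2, W3, W4, ..., Wn}, where n is the total number of tokens in the message content and Wt is a token; the order of the tokens is irrelevant, that is, a unigram vector model is used;
4) Take the tokens from Dx one by one and compute the weight of each. The weight of Wt in the spam samples is P(Wt|C0) = (number of spam samples containing the token) / (number of spam samples), and the weight of Wt in the normal samples is P(Wt|C1) = (number of normal samples containing the token) / (number of normal samples);
The learning and training above yields one naive Bayes classifier.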
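Steps 1) and 4) of the training process amount to counting. The following is a minimal sketch, assuming the messages have already been preprocessed and segmented into token lists (the text does not name a segmentation tool); the function name `train_naive_bayes` and its return layout are illustrative and do not come from the original.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def train_naive_bayes(
    spam: List[Sequence[str]], normal: List[Sequence[str]]
) -> Tuple[Tuple[float, float], Dict[str, Tuple[float, float]]]:
    """Return ((P(C0), P(C1)), {token: (P(W|C0), P(W|C1))})."""
    n_spam, n_normal = len(spam), len(normal)
    total = n_spam + n_normal
    # Step 1: class priors from the sample counts.
    priors = (n_spam / total, n_normal / total)
    # Step 4: count each token once per message, i.e. the number of
    # samples that contain the token.
    spam_df = Counter(t for msg in spam for t in set(msg))
    normal_df = Counter(t for msg in normal for t in set(msg))
    weights = {
        t: (spam_df[t] / n_spam, normal_df[t] / n_normal)
        for t in set(spam_df) | set(normal_df)
    }
    return priors, weights
```

A token that never appears in one class gets weight 0 for that class, exactly as the stated ratio implies; a practical classifier would probably smooth these counts, which the text does not discuss.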
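The core idea of naive Bayes short message classification is to compute the probabilities that the message under detection is a normal message and that it is a spam message. If the probability that it is spam, P(C0|Dx), is greater than the probability that it is normal, P(C1|Dx), the message is considered spam; otherwise it is considered normal.
Naive Bayes classification can therefore be reduced to computing the following: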
Figure PCTCN2016075548-appb-000001
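Here C0 denotes the spam class and C1 the normal class. P(C0) and P(C1) are the global probabilities of spam and normal messages and can be obtained statistically: P(C0) is the ratio of the number of spam samples to (number of spam samples + number of normal samples), and P(C1) is the ratio of the number of normal samples to the same total.
The message content is represented as a token vector and the tokens are treated as mutually independent, so P(Dx|Cj) can be expressed as the product of the conditional probabilities of the individual tokens under class Cj. Accordingly, P(Wt|C0) is the probability that token Wt appears in the spam class, and P(Wt|C1) is the probability that it appears in the normal class.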
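The original formula is available only as a figure. Under the definitions in this passage and the standard naive Bayes factorization, the comparison of the two posteriors can be written as the ratio below; this is a reconstruction from the surrounding text, not the patent's own typesetting.

```latex
\frac{P(C_0 \mid D_x)}{P(C_1 \mid D_x)}
  = \frac{P(C_0)\,P(D_x \mid C_0)}{P(C_1)\,P(D_x \mid C_1)}
  = \frac{P(C_0)}{P(C_1)} \prod_{t=1}^{n} \frac{P(W_t \mid C_0)}{P(W_t \mid C_1)}
```

The message is treated as spam when this ratio is greater than 1.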
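The use of the naive Bayes classifier is illustrated below with an example.
Suppose that for a naive Bayes classifier the ratio of spam samples to normal samples is 5:95, that is, P(C0) equals 0.05 and P(C1) equals 0.95.
The content of the message under detection, "现有发票可开联系林燕" (roughly "invoices available, can be issued, contact Lin Yan"), is segmented:
Dx = {现有, 发票, 开, 联系, 林, 燕}
The weights of these tokens in the classifier are as follows:
Token P(Wi|C0) P(Wi|C1)
现有 0.016846 0.006351
发票 0.027553 0.003003
开 0.012857 0.018764
联系 0.010556 0.007387
林 0.000485 0.000295
燕 0.000402 0.000382
Therefore, by the naive Bayes formula, P(C0|Dx)/P(C1|Dx)
= (0.05/0.95)*(0.016846/0.006351)*(0.027553/0.003003)*(0.012857/0.018764)*(0.010556/0.007387)
*(0.000485/0.000295)*(0.000402/0.000382)
= 4.58
The probability that this message is spam is 4.58 times the probability that it is normal, so the message is classified as spam.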
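The worked example can be checked numerically. The sketch below multiplies the prior ratio by the per-token weight ratios, with the six weight rows assigned to the tokens in the order in which Dx lists them; `spam_odds` is an illustrative name. Evaluated as written, with the 0.05/0.95 prior, the product comes to about 2.17, while the printed value 4.58 corresponds to a 0.10/0.90 prior. In both cases the ratio exceeds 1, so the spam verdict is unchanged.

```python
from math import prod

# (P(W|C0), P(W|C1)) for each token, copied from the example's weight table.
WEIGHTS = {
    "现有": (0.016846, 0.006351),
    "发票": (0.027553, 0.003003),
    "开": (0.012857, 0.018764),
    "联系": (0.010556, 0.007387),
    "林": (0.000485, 0.000295),
    "燕": (0.000402, 0.000382),
}


def spam_odds(tokens, p_spam, weights=WEIGHTS):
    """Return P(C0|Dx) / P(C1|Dx) for the given tokens and spam prior."""
    prior_ratio = p_spam / (1.0 - p_spam)
    return prior_ratio * prod(weights[t][0] / weights[t][1] for t in tokens)


tokens = ["现有", "发票", "开", "联系", "林", "燕"]
odds_5_95 = spam_odds(tokens, 0.05)   # prior stated in the example
odds_10_90 = spam_odds(tokens, 0.10)  # prior that reproduces 4.58
```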
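To improve accuracy, this embodiment introduces the concept of a naive Bayes multi-classifier: N groups of samples are drawn from the spam SMS sample library and the normal SMS sample library, each group containing one batch of spam samples and one batch of normal samples, with N defaulting to 30. Each group trains one classifier. When a message is to be identified, every classifier checks and scores it; the message is considered spam when more than half of the classifiers identify it as spam. Introducing this scoring mechanism effectively improves accuracy.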
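The multi-classifier is a majority vote over classifiers trained on different sample groups. A minimal sketch follows, assuming a caller-supplied `train_one` callback that turns one spam batch and one normal batch into a classifier; the callback, the `batch_size` default and the fixed random seed are assumptions made for illustration, since the text gives only the default of 30 groups.

```python
import random
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[str]], bool]  # True means "spam"


def train_ensemble(spam_lib, normal_lib, train_one,
                   n_groups=30, batch_size=1000, seed=0) -> List[Classifier]:
    """Train one classifier per randomly drawn group of samples."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(n_groups):
        spam_batch = rng.sample(spam_lib, min(batch_size, len(spam_lib)))
        normal_batch = rng.sample(normal_lib, min(batch_size, len(normal_lib)))
        classifiers.append(train_one(spam_batch, normal_batch))
    return classifiers


def is_spam_by_vote(classifiers: List[Classifier], tokens: Sequence[str]) -> bool:
    """Spam only when strictly more than half of the classifiers say so."""
    votes = sum(1 for classify in classifiers if classify(tokens))
    return votes > len(classifiers) / 2
```

A tie counts as "not spam" here because the text requires more than half of the classifiers to agree.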
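S304: Update the short message sample library.
In this step the type of a message is first identified by fingerprint; messages that cannot be identified this way are identified by the classifier; messages that still cannot be identified are discarded. Specifically:
The core idea of fingerprint-signature identification is as follows. A spam fingerprint signature is extracted from the content of each message in the spam SMS sample library, and the fingerprint signature of the message under detection is compared with it; if the two are similar, the message is classified as spam. Likewise, a normal fingerprint signature is extracted from the content of each message in the normal SMS sample library and compared with the signature of the message under detection; if the two are similar, the message is classified as normal.
As shown in FIG. 4, the fingerprint signature is extracted as follows:
1) Preprocess the message content, including but not limited to noise handling such as deleting spaces, punctuation marks and other special characters;
2) Slice the preprocessed content with a slice size of 3 to obtain the token vector Dx, Dx = {W1, W2, W3, W4, ..., Wi}, where i is the total number of tokens in the message content; the tokens are unordered and arranged randomly;
3) Using N groups of HASH functions, compute HASH values for all tokens in Dx in turn and take the minimum HASH value for each token, which yields the HASH feature vector, that is, the fingerprint signature Dy, Dy = {H1, H2, H3, H4, ..., Hi}.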
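The extraction steps resemble MinHash over 3-character shingles. The sketch below uses the conventional MinHash formulation, in which each of the N hash functions contributes the minimum value it produces over all shingles; this may differ in detail from the wording of step 3). The salted BLAKE2b hash family, the default of 16 hash functions and the function names are assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, size=3):
    """Strip whitespace and punctuation, then cut overlapping slices."""
    cleaned = re.sub(r"[\W_]+", "", text)
    if len(cleaned) < size:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + size] for i in range(len(cleaned) - size + 1)}


def _hash(shingle, seed):
    # One member of the hash family: BLAKE2b salted with the seed.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=16, size=3):
    """Return one minimum hash value per hash function."""
    grams = shingles(text, size)
    if not grams:
        return []
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]
```

Because noise characters are removed first, two messages that differ only in spacing or punctuation produce the same signature.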
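Fingerprint signatures are compared for similarity as follows:
1) Extract the fingerprint signatures Di of the samples from the spam samples in turn, where i is the total number of spam fingerprint signatures, and divide each fingerprint signature in Di into b bands (buckets), each band having r rows (bucket capacity);
2) Extract the fingerprint signatures Dj of the samples from the normal samples in turn, where j is the total number of normal fingerprint signatures, and divide each fingerprint signature in Dj into b bands (buckets), each band having r rows (bucket capacity);
3) Extract the fingerprint signature D1 from the message under detection and divide D1 into b bands (buckets) of r rows each (bucket capacity). If some band of D1 and some band of Di fall into the same bucket, the two messages are similar and the message under detection is spam; if some band of D1 and some band of Dj fall into the same bucket, the two messages are similar and the message under detection is normal.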
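The banding comparison can be sketched as a small locality-sensitive-hashing index. Two choices below are assumptions not stated in the text: bands are compared only at the same band position, and a message whose bands collide with both spam and normal signatures is left undecided. The class and method names are illustrative.

```python
from collections import defaultdict


def bands(signature, b, r):
    """Split a signature of length b*r into b bands of r rows each."""
    if len(signature) != b * r:
        raise ValueError("signature length must equal b * r")
    return [tuple(signature[i * r:(i + 1) * r]) for i in range(b)]


class FingerprintIndex:
    def __init__(self, b, r):
        self.b, self.r = b, r
        self.buckets = defaultdict(set)  # (band position, band) -> labels

    def add(self, signature, label):
        """Register a library signature under 'spam' or 'normal'."""
        for position, band in enumerate(bands(signature, self.b, self.r)):
            self.buckets[(position, band)].add(label)

    def classify(self, signature):
        """Return 'spam', 'normal', or None when undecided."""
        labels = set()
        for position, band in enumerate(bands(signature, self.b, self.r)):
            labels |= self.buckets.get((position, band), set())
        return next(iter(labels)) if len(labels) == 1 else None
```

Messages that come back as `None` are the ones the process hands on to the naive Bayes stage.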
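To update the short message sample library, this embodiment performs an automatic review of the to-be-classified messages intercepted by the spam SMS monitoring system (these are not user-reported and may include false interceptions). The process is as follows:
1) Receive spam and normal short message samples from the spam SMS monitoring system and the complaint platform;
2) Trusted samples among the external samples have already been manually marked as spam or normal, for example the manually reviewed samples in the spam SMS monitoring system and the complaint-platform samples, so they enter the spam SMS sample library or the normal SMS sample library directly according to their mark;
3) Untrusted samples among the external samples, for example suspected spam identified by machine in the spam SMS monitoring system, must be reviewed automatically by the spam message classifier;
4) An untrusted sample first enters the fingerprint-signature classifier stage. When the classifier identifies it as a normal message, it enters the normal SMS sample library; when the classifier identifies it as spam, it enters the spam SMS sample library; when the classifier cannot identify it, it enters the naive Bayes classifier stage;
5) The naive Bayes classifier examines the untrusted sample. When the classifier identifies it as a normal message, it enters the normal SMS sample library; when the classifier identifies it as spam, it enters the spam SMS sample library; when the classifier cannot identify it, it is discarded directly.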
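Steps 4) and 5) form a two-stage cascade. A minimal sketch follows, assuming each stage is a callable that returns "spam", "normal" or `None` when it cannot decide; the function name and the list-based libraries are illustrative.

```python
from typing import Callable, Iterable, List, Optional

Label = Optional[str]  # "spam", "normal", or None when undecided


def auto_review(sample, stages: Iterable[Callable[[object], Label]],
                spam_lib: List, normal_lib: List) -> Label:
    """Run the stages in order; file the sample on the first verdict."""
    for classify in stages:
        label = classify(sample)
        if label == "spam":
            spam_lib.append(sample)
            return label
        if label == "normal":
            normal_lib.append(sample)
            return label
    return None  # no stage could decide: the sample is discarded
```

In the flow described here, `stages` would hold the fingerprint-signature classifier followed by the naive Bayes classifier.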
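Based on the short message sample library, this embodiment also provides a keyword policy extraction mechanism. The main process is as follows:
1) Based on naive Bayes short message classification, take all spam samples from the spam SMS sample library and all normal samples from the normal SMS sample library; P(C0) = (total number of spam samples) / (total number of spam samples + total number of normal samples), and P(C1) = (total number of normal samples) / (total number of spam samples + total number of normal samples);
2) Preprocess the drawn message content samples, including but not limited to removing messages whose content is too short (for example fewer than 10 characters) and noise handling such as deleting spaces, punctuation marks and other special characters;
3) Perform Chinese word segmentation on the preprocessed content to generate the token feature vector Dx of the message, Dx = {W1, W2, W3, W4, ..., Wn}, where n is the total number of tokens in the message content and Wt is a token; the order of the tokens is irrelevant, that is, a unigram vector model is used;
4) Take the tokens from Dx one by one and compute the weight of each. The weight of Wt in the spam samples is P(Wt|C0) = (number of spam samples containing the token) / (total number of spam samples), and the weight of Wt in the normal samples is P(Wt|C1) = (number of normal samples containing the token) / (total number of normal samples). This yields the naive Bayes classifier;
5) For each Dx obtained from the spam SMS sample library, use the naive Bayes classifier to compute the probability that each token in Dx belongs to spam, giving Wx; sort the tokens in Wx by probability in descending order to obtain Wx = {E1, E2, E3, E4, ..., En}, where E1 ≥ E2 ≥ E3 ≥ ... ≥ En;
6) Based on the probability values Wx, reduce the dimension of the token feature vector Dx: select the M feature values with the highest probability, each of which must exceed a threshold K. If fewer than L feature values pass the probability filter, discard this token feature vector Dx. The result is the following weight feature vector of dimension M:
Wx = {W1, W2, W3, ..., WM},
Take the tokens corresponding to these probability values to obtain the candidate token feature vector of the spam sample:
Dx' = {T1, T2, T3, T4, ..., TM}
This vector is the candidate keyword set of the sample;
7) Combine the candidate keywords with the AND (&) relation into a keyword rule, that is, (T1)&(T2)&…&(TM); each candidate keyword rule therefore corresponds to one sample in the spam SMS sample library.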
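Steps 5) to 7) can be sketched as a filter over per-token scores. The text does not give the formula for "the probability that a token belongs to spam"; the sketch below assumes the single-token Bayes posterior P(C0|W) = P(W|C0)P(C0) / (P(W|C0)P(C0) + P(W|C1)P(C1)). The function name and the default values of M, K and L are illustrative.

```python
def candidate_rule(tokens, weights, p_spam, m=3, k=0.5, l=2):
    """Build '(T1)&(T2)&...' from one spam sample, or None if discarded.

    weights maps token -> (P(W|C0), P(W|C1)); p_spam is P(C0).
    """
    scored = []
    for token in set(tokens):
        w_spam, w_normal = weights.get(token, (0.0, 0.0))
        numerator = w_spam * p_spam
        denominator = numerator + w_normal * (1.0 - p_spam)
        if denominator > 0:
            scored.append((numerator / denominator, token))
    # Highest probability first; ties broken by token for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    kept = [token for prob, token in scored if prob > k][:m]
    if len(kept) < l:  # too few strong tokens: drop this sample's vector
        return None
    return "&".join(f"({token})" for token in kept)
```

With the weights of the earlier worked example, a prior of 0.05 and K = 0.1, only 发票 and 现有 pass the threshold, giving the rule (发票)&(现有).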
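After the short message sample library is updated, the process returns to step S303 for classifier learning and training.
S305: Optimize the keyword policy of the spam SMS monitoring system.
The workflow of this step is as follows:
1) Receive from the spam SMS monitoring system the keyword policy to be evaluated and optimized before formal deployment;
2) Pre-evaluation simulates and reproduces the environment of the spam SMS monitoring system and loads the keyword policy to be evaluated;
3) Pre-evaluation uses the spam samples and normal samples in the sample library to simulate ordinary short messages and sends them to the pre-evaluation environment for testing. During pre-evaluation analysis, the effectiveness of the keyword policy under evaluation is examined: the samples caught by each keyword rule are compared with each sample's own label (spam or normal), and precision, recall and similar metrics are analyzed;
4) The prediction results are compared with the optimization target. If the target is not reached, intelligent optimization begins; if it is reached, optimization is complete and the policy is saved;
5) Intelligent optimization analyzes the actual value of each rule according to the pre-evaluation results and optimizes from the perspectives of missed and false interceptions: it finds ineffective policies, merges duplicate policies, analyzes the blind spots of the existing policy and introduces new keyword policies;
6) The optimized policy goes back to step 3 for pre-evaluation; pre-evaluation and intelligent optimization form an iterative loop until the optimization target is reached or the maximum number of iterations is reached.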
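The evaluate-then-optimize loop can be sketched as follows, assuming a rule is a tuple of keywords that must all occur in a message, a sample is a set of tokens, and `improve` is a caller-supplied optimization step. The function names, the target values and the policy-level (rather than per-rule) metrics are simplifications made for illustration.

```python
def rule_hits(rule, tokens):
    """A rule fires when every one of its keywords occurs in the message."""
    return all(keyword in tokens for keyword in rule)


def evaluate(policy, spam, normal):
    """Return (precision, recall) of the whole policy on labelled samples."""
    caught_spam = sum(any(rule_hits(r, s) for r in policy) for s in spam)
    caught_normal = sum(any(rule_hits(r, s) for r in policy) for s in normal)
    caught = caught_spam + caught_normal
    precision = caught_spam / caught if caught else 0.0
    recall = caught_spam / len(spam) if spam else 0.0
    return precision, recall


def optimize(policy, spam, normal, improve, target=(0.95, 0.95), max_rounds=10):
    """Alternate pre-evaluation and optimization until the target or the cap."""
    for _ in range(max_rounds):
        precision, recall = evaluate(policy, spam, normal)
        if precision >= target[0] and recall >= target[1]:
            break
        policy = improve(policy, spam, normal)
    return policy
```

False interceptions lower precision and missed interceptions lower recall, which is why the two optimization passes described next target those two metrics separately.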
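Specifically, the false-interception analysis optimizes according to configured conditions. The criteria are:
(1) Rules whose precision is less than or equal to X1 and whose contribution is less than or equal to Y1 are deleted;
(2) Rules whose precision is less than or equal to X2 and whose contribution is less than or equal to Y2 undergo false-interception optimization;
(3) Overall execution-efficiency optimization is performed and the new policy is output.
Here X1, X2, Y1 and Y2 are all configurable, with N1 < N2, X1 < X2 and Y1 < Y2; the contribution of a rule is the number of spam samples that the rule hits.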
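The two thresholds define a three-way triage of each rule. A minimal sketch, with an illustrative function name and with the threshold values left to the caller, since the text only says that they are configurable:

```python
def triage(precision, contribution, x1, y1, x2, y2):
    """Classify one rule as 'delete', 'optimize' or 'keep'.

    Assumes x1 < x2 and y1 < y2, so the delete test is the stricter one
    and must be checked first.
    """
    if precision <= x1 and contribution <= y1:
        return "delete"
    if precision <= x2 and contribution <= y2:
        return "optimize"
    return "keep"
```

For example, with X1 = 0.3, Y1 = 5, X2 = 0.7 and Y2 = 50 (values chosen only for illustration), a rule with precision 0.2 that hits 3 spam samples is deleted, while one with precision 0.5 that hits 20 is sent to false-interception optimization.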
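The false-interception optimization method is:
1) Use the naive Bayes classifier to compute, for each keyword in the rule, the probability that it indicates spam;
2) Sort the keywords in the keyword rule by probability and delete individual keywords with low probability.
The missed-interception optimization method is:
(1) Obtain the set of missed samples from the pre-evaluation results; this set is a subset of the spam SMS sample library;
(2) Because the sample management module has already extracted a candidate keyword rule for every sample in the spam SMS sample library, it is only necessary to analyze the missed samples and find the candidate keyword rule corresponding to each;
(3) Add the candidate keyword rules to the policy;
(4) Perform overall execution-efficiency optimization and output the new policy.
Efficiency optimization improves efficiency for keyword combination policies that degrade performance, including:
(1) Analyze whether containment relations exist between the phrases inside a single keyword rule, and give optimization suggestions;
(2) Analyze the interleaving, overlap and containment relations among multiple keyword rules, and cluster similar policies.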
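One simple instance of the cross-rule analysis is removing duplicated and subsumed rules. For AND-combined rules, any message matched by (T1)&(T2) is also matched by (T1), so the longer rule adds no hits once the shorter one is present. The sketch below applies only that observation; it is an illustration and does not attempt the clustering of similar policies mentioned in the text.

```python
def prune_redundant(rules):
    """Drop duplicate rules and rules subsumed by a more general rule.

    Each rule is an iterable of keywords that must all be present.
    """
    unique = {frozenset(rule) for rule in rules}
    kept = []
    # Visit shorter (more general) rules first so they can subsume longer ones.
    for rule in sorted(unique, key=lambda r: (len(r), sorted(r))):
        if not any(general <= rule for general in kept):
            kept.append(rule)
    return kept
```

Pruning in this way leaves the set of intercepted messages unchanged while reducing the number of rules the monitoring system has to evaluate.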
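In summary, implementing the present invention has at least the following beneficial effects:
The keyword policy is evaluated and optimized against the short message sample library without manual intervention, which realizes automatic optimization management of the keyword policy, makes the keyword policy more complete and interception more accurate, solves the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements, and improves the user experience.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described can be performed in an order different from the one given here, or they can be fabricated as individual integrated circuit modules, or several of the modules or steps can be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Industrial Applicability
The embodiments of the present invention described above can be applied to the field of spam SMS monitoring; they solve the problem that manually provided keyword policies cannot meet users' ever-growing usage requirements, and improve the user experience.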

Claims (28)

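  1. A method for managing the keyword policy of a spam SMS monitoring system, comprising:
    obtaining the keyword policy of the spam SMS monitoring system;
    evaluating and optimizing the keyword policy based on a short message sample library, and processing the keyword policy according to the result;
    sending the evaluated and optimized keyword policy to the spam SMS monitoring system.
  2. The management method of claim 1, wherein the evaluation and optimization comprises: simulating ordinary short messages based on the short message sample library and performing, for each keyword in the keyword policy, at least one of spam false-interception optimization, spam missed-interception optimization and spam interception-efficiency optimization.
  3. The management method of claim 2, wherein the spam false-interception optimization comprises: predicting the precision and recall of each keyword in the keyword policy, comparing the prediction results with the optimization target, and managing the keywords according to the comparison result.
  4. The management method of claim 3, wherein managing the keywords according to the comparison result comprises: deleting keywords with poor prediction results, recommending handling of keywords with mediocre prediction results, and retaining keywords with good prediction results.
  5. The management method of claim 2, wherein the spam missed-interception optimization comprises: determining the set of spam messages among the ordinary short messages that were not intercepted, computing interception keywords for that set, and adding the interception keywords to the keyword policy.
  6. The management method of claim 2, wherein the spam interception-efficiency optimization comprises: for each keyword, determining whether a duplicate keyword exists and, if so, deleting it; determining whether an overlapping keyword exists and, if so, combining and reorganizing them; and determining whether a mergeable keyword exists and, if so, merging them.
  7. The management method of claim 1, further comprising: re-evaluating and re-optimizing the evaluated and optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
  8. The management method of any one of claims 1 to 7, further comprising: obtaining spam samples and normal samples from the spam SMS monitoring system and a complaint platform, and building the short message sample library from the spam samples and normal samples.
  9. The management method of claim 8, wherein building the short message sample library from the spam samples and normal samples comprises: adding the spam samples and normal samples directly to the trusted sample library of the short message sample library, classifying and reviewing, according to the trusted sample library, the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform, and storing them in the short message sample library.
  10. The management method of claim 9, wherein classifying and reviewing the to-be-detected short messages according to the trusted sample library comprises: classifying and reviewing the to-be-detected short messages according to the similarity between their fingerprint signatures and the fingerprint signatures of the spam samples and normal samples.
  11. The management method of claim 10, wherein classifying and reviewing the to-be-detected short messages according to said similarity comprises: extracting a spam fingerprint signature from the content of each message in the spam samples, comparing the fingerprint signature of a to-be-detected short message with the spam fingerprint signature and, if the two are similar, classifying the message as spam; and extracting a normal fingerprint signature from the content of each message in the normal samples, comparing the fingerprint signature of the to-be-detected short message with the normal fingerprint signature and, if the two are similar, classifying the message as normal.
  12. The management method of claim 9, wherein classifying and reviewing the to-be-detected short messages according to the trusted sample library further comprises: learning from the trusted sample library to generate a spam message classifier, and using the spam message classifier to classify and review the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform.
  13. The management method of claim 12, wherein learning from the trusted sample library to generate the spam message classifier comprises: drawing a batch of spam samples from the spam samples and a batch of normal samples from the normal samples; preprocessing the drawn message content samples; performing Chinese word segmentation on the preprocessed content to generate the tokens of each message; and computing, for each token in turn, its weight in the spam samples and its weight in the normal samples.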
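  14. An apparatus for managing the keyword policy of a spam SMS monitoring system, comprising:
    an obtaining module, configured to obtain the keyword policy of the spam SMS monitoring system;
    a processing module, configured to evaluate and optimize the keyword policy based on a short message sample library and to process the keyword policy according to the result;
    a sending module, configured to send the evaluated and optimized keyword policy to the spam SMS monitoring system.
  15. The management apparatus of claim 14, wherein the processing module is configured to simulate ordinary short messages based on the short message sample library and to perform, for each keyword in the keyword policy, at least one of spam false-interception optimization, spam missed-interception optimization and spam interception-efficiency optimization.
  16. The management apparatus of claim 15, wherein the processing module is configured to predict the precision and recall of each keyword in the keyword policy, compare the prediction results with the optimization target, and manage the keywords according to the comparison result.
  17. The management apparatus of claim 16, wherein the processing module is configured to delete keywords with poor prediction results, recommend handling of keywords with mediocre prediction results, and retain keywords with good prediction results.
  18. The management apparatus of claim 15, wherein the processing module is configured to determine the set of spam messages among the ordinary short messages that were not intercepted, compute interception keywords for that set, and add the interception keywords to the keyword policy.
  19. The management apparatus of claim 15, wherein the processing module is configured to, for each keyword, determine whether a duplicate keyword exists and, if so, delete it; determine whether an overlapping keyword exists and, if so, combine and reorganize them; and determine whether a mergeable keyword exists and, if so, merge them.
  20. The management apparatus of claim 14, wherein the processing module is further configured to re-evaluate and re-optimize the evaluated and optimized keyword policy until the optimization target is reached or a predetermined number of iterations is reached.
  21. The management apparatus of any one of claims 14 to 20, further comprising an establishing module configured to obtain spam samples and normal samples from the spam SMS monitoring system and a complaint platform, and to build the short message sample library from the spam samples and normal samples.
  22. The management apparatus of claim 21, wherein the establishing module is configured to add the spam samples and normal samples directly to the trusted sample library of the short message sample library, to classify and review, according to the trusted sample library, the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform, and to store them in the short message sample library.
  23. The management apparatus of claim 22, wherein the establishing module is configured to classify and review the to-be-detected short messages according to the similarity between their fingerprint signatures and the fingerprint signatures of the spam samples and normal samples.
  24. The management apparatus of claim 23, wherein the establishing module is configured to extract a spam fingerprint signature from the content of each message in the spam samples, compare the fingerprint signature of a to-be-detected short message with the spam fingerprint signature and, if the two are similar, classify the message as spam; and to extract a normal fingerprint signature from the content of each message in the normal samples, compare the fingerprint signature of the to-be-detected short message with the normal fingerprint signature and, if the two are similar, classify the message as normal.
  25. The management apparatus of claim 22, wherein the establishing module is configured to learn from the trusted sample library to generate a spam message classifier, and to use the spam message classifier to classify and review the to-be-detected short messages synchronized from the spam SMS monitoring system and the complaint platform.
  26. The management apparatus of claim 25, wherein the establishing module is configured to draw a batch of spam samples from the spam samples and a batch of normal samples from the normal samples; preprocess the drawn message content samples; perform Chinese word segmentation on the preprocessed content to generate the tokens of each message; and compute, for each token in turn, its weight in the spam samples and its weight in the normal samples.
  27. A spam SMS monitoring system that uses the management apparatus of any one of claims 14 to 26 to manage its keyword policy.
  28. A computer storage medium storing executable instructions for performing the method of any one of claims 1 to 13.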
PCT/CN2016/075548 2015-07-20 2016-03-03 一种管理方法、装置、垃圾短信监控系统及计算机存储介质 WO2016177069A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510427184.4 2015-07-20
CN201510427184.4A CN106376002B (zh) 2015-07-20 2015-07-20 一种管理方法及装置、垃圾短信监控系统

Publications (1)

Publication Number Publication Date
WO2016177069A1 true WO2016177069A1 (zh) 2016-11-10

Family

ID=57218096

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/075548 WO2016177069A1 (zh) 2015-07-20 2016-03-03 一种管理方法、装置、垃圾短信监控系统及计算机存储介质

Country Status (2)

Country Link
CN (1) CN106376002B (zh)
WO (1) WO2016177069A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108810829A (zh) * 2018-04-19 2018-11-13 北京奇安信科技有限公司 一种彩信拦截处理方法及装置
CN109800435A (zh) * 2019-01-29 2019-05-24 北京金山数字娱乐科技有限公司 一种语言模型的训练方法及装置
CN110309446A (zh) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 文本内容快速去重方法、装置、计算机设备及存储介质
CN113316153A (zh) * 2020-04-02 2021-08-27 阿里巴巴集团控股有限公司 一种短信息检验方法、装置和系统
CN114466314A (zh) * 2022-01-29 2022-05-10 重庆华唐云树科技有限公司 一种基于基站定位的固定人群手机号筛查方法
CN116089669A (zh) * 2023-03-09 2023-05-09 数影星球(杭州)科技有限公司 一种基于浏览器的网站上传拦截方式与系统

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109413595B (zh) * 2017-08-17 2020-09-25 中国移动通信集团公司 一种垃圾短信的识别方法、装置及存储介质
CN109408795B (zh) * 2017-08-17 2022-04-15 中国移动通信集团公司 一种文本识别方法、设备、计算机可读存储介质及装置
CN109819125A (zh) * 2017-11-20 2019-05-28 中兴通讯股份有限公司 一种限制电信诈骗的方法及装置
CN111970651A (zh) * 2020-08-18 2020-11-20 珠海格力电器股份有限公司 一种短消息处理方法、装置、电子设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040083270A1 (en) * 2002-10-23 2004-04-29 David Heckerman Method and system for identifying junk e-mail
CN101447984A (zh) * 2008-11-28 2009-06-03 电子科技大学 一种自反馈垃圾信息过滤方法
CN101790142A (zh) * 2010-03-11 2010-07-28 上海粱江通信系统股份有限公司 结合短信内容和发送频次识别垃圾短信源的系统与方法
CN101908055A (zh) * 2010-03-05 2010-12-08 黑龙江工程学院 一种优化lam%的信息分类阈值的设定方法及使用该方法的信息过滤系统

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101257671B (zh) * 2007-07-06 2010-12-08 浙江大学 基于内容的大规模垃圾短信实时过滤方法
CN101184259B (zh) * 2007-11-01 2010-06-23 浙江大学 垃圾短信中的关键词自动学习及更新方法
CN102857921B (zh) * 2011-06-30 2016-03-30 国际商业机器公司 判断垃圾信息发送者的方法及装置
CN102982048B (zh) * 2011-09-07 2017-08-01 百度在线网络技术(北京)有限公司 一种用于评估垃圾信息挖掘规则的方法与设备
CN103166932A (zh) * 2011-12-15 2013-06-19 上海粱江通信系统股份有限公司 识别并治理利用大量短信实施DDoS的系统及方法
CN103473492B (zh) * 2013-09-05 2016-11-02 北京百纳威尔科技有限公司 权限识别方法和用户终端
CN103634473B (zh) * 2013-12-05 2016-03-23 南京理工大学连云港研究院 基于朴素贝叶斯分类的手机垃圾短信过滤方法与系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040083270A1 (en) * 2002-10-23 2004-04-29 David Heckerman Method and system for identifying junk e-mail
CN101447984A (zh) * 2008-11-28 2009-06-03 电子科技大学 一种自反馈垃圾信息过滤方法
CN101908055A (zh) * 2010-03-05 2010-12-08 黑龙江工程学院 一种优化lam%的信息分类阈值的设定方法及使用该方法的信息过滤系统
CN101790142A (zh) * 2010-03-11 2010-07-28 上海粱江通信系统股份有限公司 结合短信内容和发送频次识别垃圾短信源的系统与方法

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108810829A (zh) * 2018-04-19 2018-11-13 北京奇安信科技有限公司 一种彩信拦截处理方法及装置
CN109800435A (zh) * 2019-01-29 2019-05-24 北京金山数字娱乐科技有限公司 一种语言模型的训练方法及装置
CN110309446A (zh) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 文本内容快速去重方法、装置、计算机设备及存储介质
CN113316153A (zh) * 2020-04-02 2021-08-27 阿里巴巴集团控股有限公司 一种短信息检验方法、装置和系统
CN113316153B (zh) * 2020-04-02 2024-03-26 阿里巴巴集团控股有限公司 一种短信息检验方法、装置和系统
CN114466314A (zh) * 2022-01-29 2022-05-10 重庆华唐云树科技有限公司 一种基于基站定位的固定人群手机号筛查方法
CN114466314B (zh) * 2022-01-29 2024-04-02 重庆华唐云树科技有限公司 一种基于基站定位的固定人群手机号筛查方法
CN116089669A (zh) * 2023-03-09 2023-05-09 数影星球(杭州)科技有限公司 一种基于浏览器的网站上传拦截方式与系统
CN116089669B (zh) * 2023-03-09 2023-10-03 数影星球(杭州)科技有限公司 一种基于浏览器的网站上传拦截方式与系统

Also Published As

Publication number Publication date
CN106376002B (zh) 2021-10-12
CN106376002A (zh) 2017-02-01

Similar Documents

Publication Publication Date Title
WO2016177069A1 (zh) 一种管理方法、装置、垃圾短信监控系统及计算机存储介质
US20230259621A1 (en) Stacking-ensemble-based apt organization identification method and system, and storage medium
CN110443274B (zh) 异常检测方法、装置、计算机设备及存储介质
Stamatatos et al. Clustering by authorship within and across documents
US8527436B2 (en) Automated parsing of e-mail messages
CN111045847B (zh) 事件审计方法、装置、终端设备以及存储介质
US20220004878A1 (en) Systems and methods for synthetic document and data generation
US20210216443A1 (en) Automatic parameter value resolution for api evaluation
CN104834940A (zh) 一种基于支持向量机的医疗影像检查疾病分类方法
Probierz et al. Rapid detection of fake news based on machine learning methods
US11481707B2 (en) Risk prediction system and operation method thereof
CN111143842A (zh) 一种恶意代码检测方法及系统
CN112036168B (zh) 事件主体识别模型优化方法、装置、设备及可读存储介质
EP3920067A1 (en) Method and system for machine learning model testing and preventive measure recommendation
CN111177367B (zh) 案件分类方法、分类模型训练方法及相关产品
CN112001170A (zh) 一种识别经过变形的敏感词的方法和系统
Aghaei et al. Ensemble classifier for misuse detection using N-gram feature vectors through operating system call traces
CN114896305A (zh) 一种基于大数据技术的智慧互联网安全平台
CN109783633A (zh) 数据分析服务流程模型推荐方法
CN110889451B (zh) 事件审计方法、装置、终端设备以及存储介质
CN115473726A (zh) 一种识别域名的方法及装置
CN115982706A (zh) 基于api调用序列行为多视角融合的恶意软件检测方法
Alzhrani et al. Automated us diplomatic cables security classification: Topic model pruning vs. classification based on clusters
KR20120059935A (ko) 문서분류장치 및 그것의 문서분류방법
CN113282686B (zh) 一种不平衡样本的关联规则确定方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16789063

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16789063

Country of ref document: EP

Kind code of ref document: A1