CN112597282B - Management method applied to short message data security - Google Patents

Management method applied to short message data security

Info

Publication number
CN112597282B
Authority
CN
China
Prior art keywords
short message
classification
message data
short
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110092470.5A
Other languages
Chinese (zh)
Other versions
CN112597282A (en)
Inventor
曾永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chengliye Technology Development Co ltd
Original Assignee
Shenzhen Chengliye Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chengliye Technology Development Co ltd
Priority to CN202110092470.5A
Publication of CN112597282A
Application granted
Publication of CN112597282B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides a management method for short message data security, applied to a short message data security management system. The system comprises a short message data security management remote server and a user terminal. A short message data personalized classification engine runs on the user terminal and uses data that is stored on the user terminal and kept synchronized with the remote server. Building on a short message classification algorithm based on text content and machine learning, the invention designs a two-layer classification model that combines centralized monitoring and filtering at the server side with personalized selection by the short message receiver, and thereby achieves high-precision intelligent short message classification.

Description

Management method applied to short message data security
[ technical field ]
The invention relates to the technical field of data security management, in particular to a management method applied to short message data security.
[ background of the invention ]
Short messages are widely used as an important means of communication in daily life. Mobile phone short messages are not only a popular communication tool but also a channel for spreading various kinds of harmful information; the flood of such messages burdens the network and has a serious negative social impact. As short messages are sent in large batches and across many categories, fraudulent messages produced by unscrupulous merchants become mixed in with legitimate ones.
At present, spam short message filtering techniques include blacklist and whitelist filtering, filtering by message length and traffic threshold, and artificial intelligence filtering based on text classification algorithms. Each method has its own advantages and disadvantages, and none filters spam short messages well. How to improve spam filtering and the security of short message data is therefore a technical problem that urgently needs to be solved.
[ summary of the invention ]
The invention provides a management method applied to short message data security, which aims to solve one or more of the above-mentioned technical problems.
The technical scheme adopted by the application is as follows:
a management method for short message data security, applied to a short message data security management system, wherein the system comprises a short message data security management remote server and a user terminal, a short message data personalized classification engine runs on the user terminal, and the engine uses data that is stored on the user terminal and kept synchronized with the remote server;
the management method comprises the following steps:
step 1, after receiving a short message, the short message data security management remote server preprocesses the short message text, sends the preprocessed text to a public classifier, and performs first-layer filtering on the short message according to a public feature library; spam short messages that fail the first-layer filtering are blocked, and short messages that pass it are sent to a personalized classifier;
step 2, the personalized classifier applies Bayesian classification according to a personalized feature library to perform second-layer personalized classification on the short messages that passed the first layer; the remote server sends the classification result to the receiver in a notification short message asking whether to receive the short message; if the receiver chooses to receive it, the short message is forwarded to the receiver's user terminal, otherwise the remote server blocks it;
step 3, after the receiver's user terminal receives a short message that has passed the two-layer classification filtering, the personalized classification engine on the terminal calls a word segmentation processing module and a Bayesian classification module according to the classification feature library on the terminal, performs a first, preliminary classification of the short message, and presents it to the receiver;
step 4, the receiver determines the category of the short message according to his or her own needs, thereby performing a second, personalized classification;
step 5, the category from the second personalized classification, the sender number, the sending time and the receiver number of the short message are fed back to the remote server through the network;
step 6, the remote server receives the user feedback, calls a Bayesian training module and updates the personalized feature library;
step 7, the receiver's user terminal periodically downloads the updated personalized feature library from the remote server through the network and uses it to classify newly received short messages.
Further, the short message text preprocessing operation specifically includes the following steps:
step 101, read the short message into memory, use an integer variable to record the numeric character code of each character read, and read the first character;
step 102, check the numeric value of the character read: if it falls within the coding range of Chinese characters in the Chinese character set, append the character to a string variable; otherwise, append the character to the string variable followed by a space;
step 103, return to step 101 and repeat until all characters of the short message have been read, at which point the preprocessing operation ends.
Further, the Bayesian classification specifically comprises the following steps:
step 201, read in the training sample short messages and count the number of short messages in each category;
step 202, read in a word segmentation dictionary and segment the training sample short messages to obtain each term and its document frequency (DF) value;
step 203, following the feature vector selection method, sort the terms by DF value in descending order and select the first 50 as feature words to form the feature vector;
step 204, read in the training sample short messages and train the Bayesian classifier;
step 205, read in the short message to be classified, classify it with the trained Bayesian classifier and output the classification result.
Further, the calculation steps of the Bayesian classifier comprise:
step 301, after text word segmentation, a vector space model is applied to express each data sample short message as an n-dimensional feature vector $X = (w_1, w_2, w_3, \ldots, w_n)$, where $w_i$ is the absolute word frequency;
step 302, suppose there are m classes $C_1, \ldots, C_m$; given a sample X to be classified, compute the probability $P(C_i \mid X)$ that X belongs to class $C_i$, and assign X to the class with the largest $P(C_i \mid X)$; that is, Bayesian classification assigns the unknown sample X to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for all $1 \le j \le m$, $j \ne i$; the class that maximizes $P(C_i \mid X)$ is called the maximum a posteriori hypothesis; according to Bayes' theorem, $P(C_i \mid X) = P(X \mid C_i) P(C_i) / P(X)$, and since $P(X)$ is constant for all classes, only $P(X \mid C_i) P(C_i)$ needs to be maximized;
step 303, first compute $P(C_i) = s_i / s$, where $s_i$ is the number of training samples in class $C_i$ and s is the total number of training samples;
step 304, then compute $P(X \mid C_i)$; for a data set with many attributes, computing $P(X \mid C_i)$ directly is expensive, so the attributes are assumed to be mutually independent given the class, and thus
$$P(X \mid C_i) = \prod_{k=1}^{n} P(w_k \mid C_i)$$
where
$$P(w_1 \mid C_i), P(w_2 \mid C_i), \ldots, P(w_n \mid C_i)$$
can be calculated from the training samples;
step 305, to classify an unknown sample X, compute $P(X \mid C_i) P(C_i)$ for each class $C_i$; X is assigned to the class $C_i$ for which $P(X \mid C_i) P(C_i)$ is largest.
Furthermore, the short message data security management remote server comprises a feature library maintenance and updating module and a short message content processing module;
the feature library maintenance and updating module is used for maintaining and updating the public feature library and the personalized feature library;
the short message content processing module comprises a short message preprocessing module, a word segmentation processing module and a feature extraction module, wherein the short message preprocessing module preprocesses the short message text, the word segmentation processing module segments the short messages into words, and the feature extraction module extracts the length features, frequency features, rule features and text features of the short messages.
Furthermore, the maintenance and updating takes two forms: the first is a training and learning mode, in which receiving the short message classification information fed back by a user terminal triggers a machine training algorithm that performs machine learning and updates the personalized feature library; the second is manual updating of the public feature library maintained on the remote server.
Furthermore, the public feature library is shared by all users and comprises a blacklist and whitelist filtering feature library and a keyword library, both of which can be updated manually at any time.
Further, the personalized feature library is private to each user and is keyed by the user's mobile phone number; the system establishes a personalized classifier for each user and generates two tables, a personalized classification category table and a category feature table, which respectively store the user's personalized classification categories and the feature words of each category.
Further, the user terminal transmits the personalized classification feedback to the server through the network, the feedback comprising the short message category, whether the short message is spam, the sender number, the sending time and the receiver number.
Furthermore, the public feature library is maintained manually at the server, taking into account the short message classification information fed back by user terminals, and the blacklist, whitelist and keyword list are updated periodically or at any time; for the personalized feature library, the server receives the short message classification information fed back by the user terminal and performs incremental learning: receiving user feedback automatically triggers the machine training algorithm, which performs machine learning and updates the personalized feature library.
The embodiments of the application achieve the following technical effect: building on a short message classification algorithm based on text content and machine learning, the invention designs a two-layer classification model that combines centralized monitoring and filtering at the server side with personalized selection by the short message receiver, and thereby achieves high-precision intelligent short message classification.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and those skilled in the art can also obtain other drawings according to the drawings without inventive labor.
FIG. 1 is a flow chart of a management method of the present invention.
[ detailed description of the embodiments ]
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The method is applied to a short message data security management system. The system uses a short message classification algorithm based on text content and machine learning, together with a two-layer classification model that combines centralized filtering and classification at the short message data security management remote server with personalized selection by the short message receiver, to achieve high-precision intelligent classification and management of short messages.
The short message data security management system comprises a short message data security management remote server and a classification engine on the user terminal: the core classification algorithm is deployed on the remote server, and a personalized classification engine is deployed on the user terminal.
The remote server collects classified short message information from user terminals and performs machine learning on it. After machine learning, the feature libraries (the public feature library and the personalized feature library) are updated, the user terminal downloads the updated feature library from the remote server, and the personalized classification engine on the user terminal adapts to the updated feature library.
The short message data personalized classification engine runs on the user terminal as a background service and uses data that is stored on the user terminal and kept synchronized with the remote server, namely the personalized short message feature library. The engine has the following functions:
(1) the user decides how short messages are classified according to personal preference. When a short message arrives, the engine calls the word segmentation module and the Bayesian classification module on the user terminal to classify it according to the personalized short message feature library;
(2) the category assigned to the short message, the sender number, the sending time, the receiver number and other information are fed back to the remote server through the network;
(3) the updated personalized feature library is periodically downloaded from the remote server through the network and is used to classify the next short message that reaches the terminal.
The short message data security management remote server implements a distributed short message classification learning mode and comprises a feature library maintenance and updating module and a short message content processing module;
the feature library maintenance and updating module is used for maintaining and updating the public feature library and the personalized feature library;
the maintenance and updating takes two forms: the first is a training and learning mode, in which receiving the short message classification information fed back by a user terminal triggers a machine training algorithm that performs machine learning and updates the personalized feature library; the second is manual updating of the public feature library maintained on the remote server;
the short message content processing module comprises a short message preprocessing module, a word segmentation processing module and a feature extraction module;
the short message preprocessing module preprocesses the short message text;
the word segmentation processing module segments the short messages into words;
the feature extraction module extracts the length features, frequency features, rule features and text features of the short messages;
FIG. 1 is a schematic flow chart of the management method of the present invention, which includes the following steps:
step 1, after receiving a short message, the short message data security management remote server preprocesses the short message text, sends the preprocessed text to a public classifier, and performs first-layer filtering on the short message according to a public feature library; spam short messages that fail the first-layer filtering are blocked, and short messages that pass it are sent to a personalized classifier;
step 2, the personalized classifier applies Bayesian classification according to a personalized feature library to perform second-layer personalized classification on the short messages that passed the first layer; the remote server sends the classification result to the receiver in a notification short message asking whether to receive the short message; if the receiver chooses to receive it, the short message is forwarded to the receiver's user terminal, otherwise the remote server blocks it;
step 3, after the receiver's user terminal receives a short message that has passed the two-layer classification filtering, the personalized classification engine on the terminal calls a word segmentation processing module and a Bayesian classification module according to the classification feature library on the terminal, performs a first, preliminary classification of the short message, and presents it to the receiver;
step 4, the receiver determines the category of the short message according to his or her own needs, thereby performing a second, personalized classification;
step 5, the category from the second personalized classification, the sender number, the sending time and the receiver number of the short message are fed back to the remote server through the network;
step 6, the remote server receives the user feedback, calls a Bayesian training module and updates the personalized feature library;
step 7, the receiver's user terminal periodically downloads the updated personalized feature library from the remote server through the network and uses it to classify newly received short messages.
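The server-side part of this flow (steps 1 and 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `PublicFeatureLibrary`, `first_layer_pass` and `route_message` are hypothetical, the rule that a whitelisted sender bypasses the keyword check is an assumption, and the personalized classifier and the receiver's accept/reject choice are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class PublicFeatureLibrary:
    """Hypothetical container for the shared blacklist, whitelist and keywords."""
    blacklist: Set[str] = field(default_factory=set)
    whitelist: Set[str] = field(default_factory=set)
    keywords: Set[str] = field(default_factory=set)


def first_layer_pass(sender: str, text: str, lib: PublicFeatureLibrary) -> bool:
    """Return True if the message passes the first-layer public filter."""
    if sender in lib.whitelist:      # assumption: whitelist takes precedence
        return True
    if sender in lib.blacklist:
        return False
    return not any(keyword in text for keyword in lib.keywords)


def route_message(sender: str, text: str, lib: PublicFeatureLibrary,
                  personal_classify: Callable[[str], str],
                  user_accepts: Callable[[str], bool]) -> str:
    """Apply layer 1, then layer 2, then the receiver's choice."""
    if not first_layer_pass(sender, text, lib):
        return "blocked_layer1"
    category = personal_classify(text)   # second-layer personalized class
    if user_accepts(category):           # receiver is asked whether to receive
        return "forwarded:" + category
    return "blocked_layer2"
```

A call such as `route_message("123", "stock news", lib, classify, accepts)` returns a short status string, which keeps the two blocking points of the method easy to tell apart.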
The short message text preprocessing operation specifically comprises the following steps:
step 101, read the short message into memory, use an integer variable to record the numeric character code of each character read, and read the first character;
step 102, check the numeric value of the character read; if it falls within the coding range of Chinese characters in the Chinese character set, such as 19800-;
step 103, return to step 101 and repeat until all characters of the short message have been read, at which point the preprocessing operation ends.
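The preprocessing loop of steps 101 to 103 might look like the sketch below. Several points are assumptions rather than statements from the text: the exact numeric range is not recoverable here, so the Unicode CJK Unified Ideographs block (U+4E00 to U+9FFF) stands in for it; the handling of non-Chinese characters (keep the character and add a space after it) is one reading of the step; and the function name `preprocess` is hypothetical.

```python
# Assumption: the Unicode CJK Unified Ideographs block stands in for the
# "Chinese character coding range" named in step 102.
CJK_START, CJK_END = 0x4E00, 0x9FFF


def preprocess(message: str) -> str:
    """Scan the message one character at a time (steps 101 to 103)."""
    out = []
    for ch in message:                 # step 103: repeat until all characters are read
        code = ord(ch)                 # step 101: integer code of the character
        if CJK_START <= code <= CJK_END:
            out.append(ch)             # step 102: Chinese character, keep as-is
        else:
            out.append(ch + " ")       # step 102: other character, add a space
    return "".join(out)
```

The added spaces separate digits, Latin letters and punctuation from the surrounding Chinese text, which simplifies the later word segmentation step.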
The Bayesian classification specifically comprises the following steps:
step 201, read in the training sample short messages and count the number of short messages in each category;
step 202, read in a word segmentation dictionary and segment the training sample short messages to obtain each term and its document frequency (DF) value;
step 203, following the feature vector selection method, sort the terms by DF value in descending order and select the first 50 as feature words to form the feature vector;
step 204, read in the training sample short messages and train the Bayesian classifier;
step 205, read in the short message to be classified, classify it with the trained Bayesian classifier and output the classification result.
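Steps 202 and 203 (document frequency and selection of the top feature words) can be illustrated with a short sketch. The function names are hypothetical, the input is assumed to be already segmented into tokens, and ties in DF are broken alphabetically, which is an assumption the text does not address.

```python
from collections import Counter
from typing import Iterable, List


def document_frequency(tokenized_docs: Iterable[List[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))   # each document counts a term at most once
    return df


def select_features(tokenized_docs: List[List[str]], top_n: int = 50) -> List[str]:
    """Return the top_n terms by descending DF (ties broken alphabetically)."""
    df = document_frequency(tokenized_docs)
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

With `top_n=50` this matches the feature vector size named in step 203; a smaller value is convenient for experiments on toy data.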
The calculation steps of the Bayesian classifier comprise:
step 301, after text word segmentation, a vector space model is applied to express each data sample short message as an n-dimensional feature vector $X = (w_1, w_2, w_3, \ldots, w_n)$, where $w_i$ is the absolute word frequency;
step 302, suppose there are m classes $C_1, \ldots, C_m$; given a sample X to be classified, compute the probability $P(C_i \mid X)$ that X belongs to class $C_i$, and assign X to the class with the largest $P(C_i \mid X)$; that is, Bayesian classification assigns the unknown sample X to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for all $1 \le j \le m$, $j \ne i$; the class that maximizes $P(C_i \mid X)$ is called the maximum a posteriori hypothesis; according to Bayes' theorem, $P(C_i \mid X) = P(X \mid C_i) P(C_i) / P(X)$, and since $P(X)$ is constant for all classes, only $P(X \mid C_i) P(C_i)$ needs to be maximized;
step 303, first compute $P(C_i) = s_i / s$, where $s_i$ is the number of training samples in class $C_i$ and s is the total number of training samples;
step 304, then compute $P(X \mid C_i)$; for a data set with many attributes, computing $P(X \mid C_i)$ directly is expensive, so the attributes are assumed to be mutually independent given the class, and thus
$$P(X \mid C_i) = \prod_{k=1}^{n} P(w_k \mid C_i)$$
where
$$P(w_1 \mid C_i), P(w_2 \mid C_i), \ldots, P(w_n \mid C_i)$$
can be calculated from the training samples;
step 305, to classify an unknown sample X, compute $P(X \mid C_i) P(C_i)$ for each class $C_i$; X is assigned to the class $C_i$ for which $P(X \mid C_i) P(C_i)$ is largest.
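The calculation in steps 301 to 305 corresponds to a naive Bayes classifier, sketched below. Two choices are assumptions added for numerical robustness and are not stated in the text: scores are accumulated as logarithms, and add-one (Laplace) smoothing is applied so that an unseen feature word does not produce a zero probability. The class name `NaiveBayes` and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, List


class NaiveBayes:
    def __init__(self, features: Iterable[str]):
        self.features = set(features)             # the selected feature words
        self.class_counts = Counter()             # s_i: training samples per class
        self.word_counts = defaultdict(Counter)   # class -> feature word counts

    def train(self, tokens: List[str], label: str) -> None:
        """Add one labeled sample; can be called again for later feedback."""
        self.class_counts[label] += 1
        for t in tokens:
            if t in self.features:
                self.word_counts[label][t] += 1

    def log_score(self, tokens: List[str], label: str) -> float:
        """log P(C_i) + sum over feature words of log P(w_k | C_i)."""
        s = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / s)   # step 303: P(C_i) = s_i / s
        total = sum(self.word_counts[label].values())
        v = len(self.features)
        for t in tokens:
            if t in self.features:
                # step 304 with add-one smoothing (assumption)
                score += math.log((self.word_counts[label][t] + 1) / (total + v))
        return score

    def classify(self, tokens: List[str]) -> str:
        """Step 305: pick the class with the largest P(X | C_i) P(C_i)."""
        return max(self.class_counts, key=lambda c: self.log_score(tokens, c))
```

Because `train` only increments counts, the same object supports the incremental update that user feedback triggers later in the method.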
For each newly received short message, the short message data security management remote server first performs the text preprocessing described above and, based on the preprocessing result and the public feature library, judges whether the message is spam. If it is, the server blocks and filters it; otherwise the server sends it to the receiver's user terminal, where the short message text is processed again.
To classify short messages intelligently, their content must first be understood, and understanding Chinese text requires segmenting it into words. Word segmentation is performed on the preprocessed short message text at both the server side and the client side. Current word segmentation algorithms fall into two main types. The first is mechanical word segmentation, which is generally dictionary-based and segments text by matching the Chinese character strings in a document one by one against the words in a word list. The second is comprehension-based word segmentation, which uses the grammatical and semantic knowledge of Chinese and requires building a segmentation database, a knowledge base and an inference base. Because comprehension-based segmentation is still far from mature in areas such as semantic and grammatical analysis, most existing word segmentation systems use the mechanical method.
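As an illustration of dictionary-based mechanical segmentation, the sketch below uses forward maximum matching, a common variant of the approach. The text does not say which matching strategy the system uses, so the choice of forward maximum matching, the function name `forward_max_match` and the `max_len` parameter are all assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)   # a single character is the fallback
                i += size
                break
    return tokens
```

For example, with the dictionary `{"短信", "分类", "垃圾短信"}` the string `"垃圾短信分类"` is split into `["垃圾短信", "分类"]`, because the four-character entry is tried before the two-character one.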
Because the words used in short message text are mostly common words, and given the server's real-time processing requirements and the limited storage and processing capacity of the user terminal, the word segmentation dictionary is simplified and compressed to reduce its size. The final dictionary contains about 40,000 entries and occupies about 215M of storage space, which fully meets the word segmentation needs of both the server side and the user terminal.
The system maintains two feature libraries at the server side: a public feature library and a personalized feature library. The public feature library is shared by all users; it comprises a blacklist and whitelist filtering feature library and a keyword library, and can be updated manually at any time. The personalized feature library is private to each user and is keyed by the user's mobile phone number. The system establishes a personalized classifier for each user and generates two tables, a personalized classification category table and a category feature table, which respectively store the user's personalized classification categories (such as real estate, automobile and stock) and the 50 feature words of each category. The user terminal transmits the personalized classification feedback to the server through the network, including the short message category, whether the short message is spam, the sender number, the sending time and the receiver number.
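One possible in-memory shape for the per-user personalized feature library is sketched below. The two "tables" are modeled as a list and a dictionary; the names `PersonalFeatureLibrary`, `set_features`, `libraries` and `get_library` are hypothetical, and a real deployment would presumably use database tables rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonalFeatureLibrary:
    phone_number: str                                   # key: the user's mobile number
    categories: List[str] = field(default_factory=list)                    # category table
    category_features: Dict[str, List[str]] = field(default_factory=dict)  # category feature table

    def set_features(self, category: str, words: List[str], limit: int = 50) -> None:
        """Store at most `limit` feature words for a category (50 in the text)."""
        if category not in self.categories:
            self.categories.append(category)
        self.category_features[category] = list(words)[:limit]


# Hypothetical server-side registry: one library per user, keyed by phone number.
libraries: Dict[str, PersonalFeatureLibrary] = {}


def get_library(phone_number: str) -> PersonalFeatureLibrary:
    """Return the user's library, creating an empty one on first access."""
    if phone_number not in libraries:
        libraries[phone_number] = PersonalFeatureLibrary(phone_number)
    return libraries[phone_number]
```

Keying the registry by phone number mirrors the statement that the mobile phone number serves as the keyword for each user's private library.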
The public feature library is maintained manually at the server, taking into account the short message classification information fed back by user terminals, and the blacklist, whitelist and keyword list are updated periodically or at any time. For the personalized feature library, the server receives the short message classification information fed back by the user terminal and performs incremental learning: receiving user feedback automatically triggers the machine training algorithm, which performs machine learning and updates the personalized feature library.
The system runs two Bayesian classifiers on the short message data security management remote server. One classifies short messages in real time: when a user's short message reaches the short message center, the server calls it to classify the message. The other runs as a background service that periodically polls each user, trains on the information the user has fed back, and updates that user's personalized feature library. The system also runs a Bayesian classifier on the user terminal, which classifies short messages that reach the terminal and then prompts the user to make a selection.
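The background polling service can be sketched as a queue of pending feedback per user that is drained on each polling pass. Everything named here is hypothetical (`Feedback`, `submit`, `poll_and_update`, the `tokens` field), and the "training" is reduced to updating per-category word counts, which is the state a naive Bayes classifier would need; the actual training module is not specified at this level of detail.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    receiver: str        # receiver number
    sender: str          # sender number
    sent_at: str         # sending time
    category: str        # category chosen by the user
    tokens: List[str]    # assumption: segmented words of the message


pending = defaultdict(list)                              # receiver -> unprocessed feedback
word_counts = defaultdict(lambda: defaultdict(Counter))  # receiver -> category -> word counts


def submit(fb: Feedback) -> None:
    """Queue feedback that arrived from a user terminal."""
    pending[fb.receiver].append(fb)


def poll_and_update() -> int:
    """One polling pass: fold all queued feedback into each user's counts."""
    processed = 0
    for receiver, items in pending.items():
        for fb in items:
            word_counts[receiver][fb.category].update(fb.tokens)
            processed += 1
        items.clear()
    return processed
```

Calling `poll_and_update()` from a timer would give the periodic behavior described above; it returns the number of feedback records consumed, so an idle pass returns 0.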
The management method of the application was tested. A public feature library was established from the blacklist, whitelist and keywords provided by an operator, and test short messages were sent through the short message data security management system. First-layer filtering at the server side filtered out 472 spam short messages. On inspection these comprised 205 pornographic messages, 45 fraud messages, 80 rumor messages, 130 public safety messages and 12 advertisement messages, which matches the manual labeling of the spam messages exactly and indicates that the first-layer public feature library filters spam short messages well.
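As a quick consistency check, the per-category counts reported for the first-layer test can be summed and compared with the stated total. The dictionary keys are simply labels for the categories named in the text.

```python
# Per-category counts of spam filtered by the first layer, as reported above.
layer1_counts = {
    "pornographic": 205,
    "fraud": 45,
    "rumor": 80,
    "public safety": 130,
    "advertisement": 12,
}

total = sum(layer1_counts.values())
print(total)  # prints 472, matching the stated number of filtered messages
```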
In the test of second-layer personalized classification performance, three categories were used: automobile, real estate and stock. Messages of these three categories were selected from the test data as test samples: 200 automobile messages, 300 real estate messages and 500 stock messages. Performance was measured by classification accuracy, which exceeded 90 percent for each of the three categories.
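Per-category classification accuracy, the metric named above, can be computed as the share of messages of each true category that were assigned to that category. The sketch below uses a handful of made-up labels purely to show the calculation; they are not the patent's test data.

```python
from collections import Counter


def accuracy_by_class(true_labels, predicted_labels):
    """Return, for each true category, the fraction of its messages classified correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Illustrative labels only.
true_labels = ["automobile", "automobile", "stock", "stock", "real estate"]
predicted = ["automobile", "stock", "stock", "stock", "real estate"]
per_class = accuracy_by_class(true_labels, predicted)
```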
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via ROM. When loaded and executed, the computer program may carry out one or more of the steps of the method described above.
The functions described above in this disclosure may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A management method applied to short message data security is characterized in that the method is applied to a short message data security management system, the short message data security management system comprises a short message data security management remote server and a user terminal, a short message data personalized classification engine runs on the user terminal, and the short message data personalized classification engine uses data which are stored in the user terminal and are updated synchronously with the short message data security management remote server;
the management method comprises the following steps:
step 1, after receiving a short message, the short message data security management remote server performs the short message text preprocessing operation, sends the preprocessed short message text to a public classifier, performs first-layer filtering on the short message according to a public feature library, and blocks spam short messages that do not pass the first-layer filtering; short messages that pass the first-layer filtering are sent to a personalized classifier;
step 2, after the short messages that pass the first-layer filtering are sent to the personalized classifier, Bayesian classification is applied according to a personalized feature library to perform second-layer personalized classification, and the short message data security management remote server sends the classification result as a short message to ask the receiver user whether to receive short messages under that classification; if the receiver user chooses to receive, the short message is forwarded to the user terminal of the receiver user; otherwise, the short message data security management remote server blocks the short message;
step 3, after receiving a short message that has passed the two-layer classification filtering, the user terminal of the receiver user calls a word segmentation processing module and a Bayesian classification module according to the personalized classification engine and the classification feature library on the user terminal, performs a first, preliminary classification of the short message, and presents it to the receiver user;
step 4, the receiver user determines the classification category of the short message according to the user's own needs, performing a second personalized classification of the short message;
step 5, the second personalized classification category, the sender's number, the sending time and the receiver's number of the short message are fed back to the short message data security management remote server through the network;
step 6, the short message data security management remote server receives the user feedback information, calls a Bayesian training module and updates the personalized feature library;
step 7, the receiver user periodically downloads the updated personalized feature library from the short message data security management remote server through the network, and uses the updated personalized feature library to classify and judge newly received short messages.
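The server-side part of the steps above (first-layer public filtering followed by second-layer personalized classification and the user's accept-or-block choice) could be sketched as follows. Several things here are assumptions rather than the claimed method: the function names, giving the white list precedence over the black list, substring keyword matching, and modeling the user's choice as a set of accepted categories.

```python
def first_layer_passes(sender, text, whitelist, blacklist, spam_keywords):
    """Public filter: white list wins, then black list, then keyword match."""
    if sender in whitelist:
        return True
    if sender in blacklist:
        return False
    return not any(keyword in text for keyword in spam_keywords)


def route_message(sender, text, public_lib, classify, accepted_categories):
    """Return (action, category) for one message passing through both layers.

    classify:            callable standing in for the personalized Bayesian classifier
    accepted_categories: categories the receiver user has chosen to receive
    """
    if not first_layer_passes(sender, text, **public_lib):
        return ("blocked", None)          # stopped by the public feature library
    category = classify(text)             # second-layer personalized classification
    if category in accepted_categories:
        return ("forwarded", category)
    return ("blocked", category)          # user declined this category


# Toy data: numbers, keywords and the stand-in classifier are placeholders.
public_lib = {"whitelist": {"10086"}, "blacklist": {"95000"}, "spam_keywords": ["lottery"]}
toy_classify = lambda text: "stock" if "fund" in text else "automobile"
result = route_message("13800000000", "fund news", public_lib, toy_classify, {"stock"})
```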
2. The management method according to claim 1, wherein the short message text preprocessing operation specifically comprises the following steps:
step 101, reading a short message into a memory, recording an ASCII code corresponding to each read character by using an integer variable, and reading a first character;
step 102, judging the numerical range of the read-in character: if it falls within the Chinese character coding range of a Chinese character set, adding the read-in character to a character string variable; otherwise, adding a space to the character string variable in place of the read-in character;
step 103, returning to step 101 until all characters of the short message have been read in, at which point the preprocessing operation ends.
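The preprocessing steps above can be sketched as a character-by-character loop. The claim does not name the character set; this sketch assumes Unicode text and treats the CJK Unified Ideographs block (U+4E00 to U+9FFF) as the Chinese character coding range, and it reads the "otherwise" branch as replacing the character with a space.

```python
CJK_START = 0x4E00  # assumed lower bound of the Chinese character range
CJK_END = 0x9FFF    # assumed upper bound of the Chinese character range


def preprocess(message: str) -> str:
    """Keep Chinese characters; replace every other character with a space."""
    kept = []
    for ch in message:
        code = ord(ch)  # integer code of the character just read
        if CJK_START <= code <= CJK_END:
            kept.append(ch)
        else:
            kept.append(" ")
    return "".join(kept)


# Punctuation and digits become spaces, which later serve as token boundaries.
cleaned = preprocess("买房,8折!")
```

A deployment using a legacy encoding such as GB2312 would test byte ranges instead; the loop structure stays the same.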
3. The management method according to claim 1, wherein the bayesian classification comprises the following steps:
step 201, reading in training sample short messages, and counting the number of various short messages;
step 202, reading in a word segmentation dictionary, and performing word segmentation processing on a training sample short message to obtain each entry and a corresponding document frequency DF value;
step 203, according to a feature vector selection method, sorting the entries by document frequency (DF) value from largest to smallest and selecting the first 50 feature words to form the feature vector;
step 204, reading in a training sample short message, and training a Bayesian classifier;
step 205, reading in the short message to be classified, identifying it with the trained Bayesian classifier, and giving a classification result.
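The feature selection in steps 202 and 203 can be illustrated by counting, for each entry, how many training messages contain it and keeping the highest-ranked entries. The sketch assumes the messages have already been segmented into word lists; the function name and the alphabetical tie-break are assumptions added to make the ranking deterministic.

```python
from collections import Counter


def select_features(segmented_messages, top_n=50):
    """Rank words by document frequency (DF) and return the top_n as the feature list."""
    document_frequency = Counter()
    for words in segmented_messages:
        # set() ensures a word is counted once per message, however often it repeats
        document_frequency.update(set(words))
    ranked = sorted(document_frequency.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


messages = [["stock", "fund", "stock"], ["stock", "car"], ["car", "loan"]]
top_two = select_features(messages, top_n=2)
```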
4. The method for managing according to claim 1, wherein the step of calculating by the bayesian classification module comprises:
step 301, after text word segmentation, applying a vector space model to express the data sample short message as an n-dimensional feature vector $X = (w_1, w_2, w_3, \ldots, w_n)$, wherein $w_i$ is the absolute word frequency;
step 302, given a total of m classes $C_1, \ldots, C_m$ and a sample X to be classified, calculating the probability $P(C_i \mid X)$ that X belongs to class $C_i$, X finally being assigned to the class $C_i$ with the largest $P(C_i \mid X)$; that is, Bayesian classification assigns the unknown sample X to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for all $1 \le j \le m$, $j \ne i$; the class $C_i$ that maximizes $P(C_i \mid X)$ is called the maximum a posteriori hypothesis; according to Bayes' theorem, $P(C_i \mid X) = \dfrac{P(X \mid C_i)\,P(C_i)}{P(X)}$, and since $P(X)$ is constant for all classes, only $P(X \mid C_i)\,P(C_i)$ needs to be maximized;
step 303, first calculating $P(C_i) = s_i / s$, wherein $s_i$ is the number of training samples in class $C_i$ and $s$ is the total number of training samples;
step 304, then calculating $P(X \mid C_i)$; given a dataset with multiple attributes, $P(X \mid C_i)$ is computed on the assumption that the attributes are conditionally independent of one another given the class, thus
$$P(X \mid C_i) = \prod_{k=1}^{n} P(w_k \mid C_i)$$
wherein $P(w_k \mid C_i)$ can be calculated from the training samples;
step 305, classifying the unknown sample X by calculating $P(X \mid C_i)\,P(C_i)$ for each class $C_i$, X being assigned to the class $C_i$ with the largest value of $P(X \mid C_i)\,P(C_i)$.
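The calculation in steps 301 to 305 corresponds to a naive Bayes classifier over word counts. The sketch below is one possible realization, not the patented implementation: it works with log-probabilities to avoid numerical underflow and applies add-one (Laplace) smoothing so that unseen words do not zero out a class, neither of which is stated in the claim. Function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (words, category). Returns the counts the classifier needs."""
    class_counts = Counter()            # s_i for each class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocabulary = set()
    for words, category in samples:
        class_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, model):
    """Return the class maximizing log P(C_i) + sum_k log P(w_k | C_i)."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())  # s, the total number of training samples
    best, best_score = None, -math.inf
    for category, s_i in class_counts.items():
        score = math.log(s_i / total)   # prior P(C_i) = s_i / s
        denominator = sum(word_counts[category].values()) + len(vocabulary)
        for word in words:
            if word in vocabulary:      # words never seen in training are ignored
                score += math.log((word_counts[category][word] + 1) / denominator)
        if score > best_score:
            best, best_score = category, score
    return best


model = train([
    (["stock", "fund"], "stock"),
    (["stock", "rise"], "stock"),
    (["car", "loan"], "automobile"),
])
label = classify(["stock", "fund"], model)
```

Dropping $P(X)$ is what allows the scores to be compared directly across classes, as step 302 notes.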
5. The management method of claim 1, wherein the short message data security management remote server comprises a feature library maintenance and update module and a short message content processing module;
the characteristic library maintenance and updating module is used for maintaining and updating the public characteristic library and the personalized characteristic library;
the short message content processing module comprises a short message preprocessing module, a word segmentation processing module and a feature extraction module, wherein the short message preprocessing module is used for preprocessing a short message text; the word segmentation processing module is used for segmenting words of the short messages; the characteristic extraction module is used for extracting the short message length characteristic, the frequency characteristic, the rule characteristic and the text characteristic information.
6. The management method according to claim 5, wherein the maintenance update includes two modes, the first mode is a training learning mode, and after receiving the short message classification information fed back by the user terminal, a machine training algorithm is triggered to perform machine learning, and the personalized feature library is updated; the second is to maintain a public characteristic library in a short message data security management remote server, and manually update the public characteristic library.
7. The management method according to claim 1, wherein the public feature library is common to all users, and its black and white list filtering feature library and keyword library are manually set and updated at any time.
8. The management method according to claim 1, wherein the personalized feature library is private to each user, and the system establishes a personalized classifier for each user by using the mobile phone number of the user as the key, and generates two tables, a personalized classification category table and a category feature table, used to store the categories of the user's personalized classification.
9. The management method according to claim 1, wherein the user terminal transmits the personalized classification information to be fed back to the server through the network, the information including the classification category of the short message, whether the short message is spam, the sender's number, the sending time, and the receiver's number.
10. The management method according to claim 8 or 9, wherein the public feature library is maintained manually on the server side, taking into account the short message classification information fed back by user terminals, and the black and white lists and the keyword list are updated periodically or at any time; the personalized feature library is updated by incremental learning: after the server receives the short message classification information fed back by a user terminal, it automatically triggers a machine training algorithm, performs machine learning, and updates the personalized feature library.
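The feedback information listed in claim 9 can be pictured as a small record sent from the terminal to the server. The sketch below is an assumption about representation only: the class name, field names, JSON encoding and ISO 8601 timestamp format are not specified by the patent, and the phone numbers are placeholders.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Feedback:
    """Hypothetical feedback record carrying the fields named in claim 9."""
    category: str          # classification category chosen by the user
    is_spam: bool          # whether the user marked the message as spam
    sender_number: str     # number of the message sender
    sent_at: str           # sending time; ISO 8601 format is an assumption
    receiver_number: str   # number of the receiver


def encode(feedback: Feedback) -> str:
    """Serialize the record for transmission; JSON is an illustrative choice."""
    return json.dumps(asdict(feedback), ensure_ascii=False)


payload = encode(Feedback("stock", False, "13800000000", "2021-01-24T10:00:00", "13900000000"))
```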
CN202110092470.5A 2021-01-24 2021-01-24 Management method applied to short message data security Active CN112597282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110092470.5A CN112597282B (en) 2021-01-24 2021-01-24 Management method applied to short message data security

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110092470.5A CN112597282B (en) 2021-01-24 2021-01-24 Management method applied to short message data security

Publications (2)

Publication Number Publication Date
CN112597282A CN112597282A (en) 2021-04-02
CN112597282B true CN112597282B (en) 2021-06-11

Family

ID=75207409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110092470.5A Active CN112597282B (en) 2021-01-24 2021-01-24 Management method applied to short message data security

Country Status (1)

Country Link
CN (1) CN112597282B (en)


Also Published As

Publication number Publication date
CN112597282A (en) 2021-04-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Management Method Applied to SMS Data Security

Effective date of registration: 20230329

Granted publication date: 20210611

Pledgee: Shenzhen Branch of Huishang Bank Co.,Ltd.

Pledgor: Shenzhen chengliye Technology Development Co.,Ltd.

Registration number: Y2023980036803
