CN112597282B

CN112597282B - Management method applied to short message data security

Info

Publication number: CN112597282B
Application number: CN202110092470.5A
Authority: CN
Inventors: 曾永明
Original assignee: Shenzhen Chengliye Technology Development Co ltd
Current assignee: Shenzhen Chengliye Technology Development Co ltd
Priority date: 2021-01-24
Filing date: 2021-01-24
Publication date: 2021-06-11
Anticipated expiration: 2041-01-24
Also published as: CN112597282A

Abstract

The invention provides a management method applied to short message data security, which is applied to a short message data security management system. The system comprises a short message data security management remote server and a user terminal; a short message data personalized classification engine runs on the user terminal and uses data that is stored in the user terminal and synchronously updated with the remote server. Using a short message classification algorithm based on text content and machine learning, the invention designs a two-layer classification model that combines centralized monitoring and filtering at the server side with personalized selection by the short message receiver, and realizes high-precision intelligent short message classification.

Description

Management method applied to short message data security

[ technical field ]

The invention relates to the technical field of data security management, in particular to a management method applied to short message data security.

[ background of the invention ]

Short messages are already widely used as an important means of communication in daily life. Mobile phone short messages are not only a popular communication tool but also a channel for spreading all kinds of harmful messages; the flood of such messages burdens the network and causes serious social harm. As short messages are sent in large batches and across many categories, fraudulent content produced by unscrupulous merchants is mixed in among them.

At present, spam short message filtering technologies include black and white list filtering, filtering by message length and traffic threshold, and artificial intelligence filtering using text classification algorithms. Each method has advantages and disadvantages, and none filters spam short messages well enough. How to improve the filtering of spam short messages and the security of short message data has become a technical problem that urgently needs to be solved.

[ summary of the invention ]

The invention provides a management method applied to short message data security, which aims to solve one or more of the above-mentioned technical problems.

The technical scheme adopted by the application is as follows:

a management method applied to the safety of short message data is applied to a short message data safety management system, the short message data safety management system comprises a short message data safety management remote server and a user terminal, a short message data personalized classification engine runs on the user terminal, and the short message data personalized classification engine uses data which is stored in the user terminal and is synchronously updated with the short message data safety management remote server;

the management method comprises the following steps:

step 1, after receiving a short message, a short message data security management remote server performs short message text preprocessing operation, sends the preprocessed short message text into a public classifier, performs first-layer filtering on the short message according to a public feature library, and shields junk short messages which do not pass through the first-layer filtering; sending the short messages filtered by the first layer to an individual classifier;

step 2, after the short messages that pass the first-layer filtering are sent into the individual classifier, Bayesian classification is applied according to the individual feature library to carry out the second-layer individual classification, and the short message data safety management remote server sends a short message carrying the classification result to the receiving party user, asking whether to receive the short message; if the receiving party user chooses to receive it, the short message is forwarded to the user terminal of the receiving party user, otherwise the short message data safety management remote server shields the short message;

step 3, after the receiving party user receives the short messages that have passed the two-layer classification filtering, the personalized classification engine on the user terminal of the receiving party user calls the word segmentation processing module and the Bayesian classification module according to the classification feature library, performs a first, preliminary classification of the short messages, and presents them to the receiving party user;

step 4, the receiver user determines the classification type of the short messages according to the own requirements, and performs second individual classification on the short messages;

step 5, feeding back the second individual classification category, the number of the information sender, the sending time and the number information of the receiver of the short message to the short message data safety management remote server through the network;

step 6, the short message data safety management remote server receives user feedback information, calls a Bayesian training module and updates a personalized feature library;

step 7, the receiver user regularly downloads the updated personalized feature library from the short message data safety management remote server through the network, and classifies and judges newly received short messages using the updated personalized feature library.
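The server-side part of this flow (steps 1 and 2) can be illustrated with a minimal Python sketch. It is an illustration only, not the patented implementation: the function `route_message` and its three callback parameters are hypothetical names introduced here, and the real classifiers are replaced by caller-supplied functions.

```python
from typing import Callable, Optional


def route_message(
    text: str,
    public_filter: Callable[[str], bool],
    personal_classifier: Callable[[str], str],
    user_accepts: Callable[[str], bool],
) -> Optional[str]:
    """Return the category if the message is forwarded, or None if it is shielded."""
    # Layer 1 (step 1): the public classifier shields spam for all users.
    if not public_filter(text):
        return None
    # Layer 2 (step 2): personalized classification of the surviving message.
    category = personal_classifier(text)
    # The receiver is told the category and decides whether to receive it.
    if not user_accepts(category):
        return None
    return category
```

For example, with a first-layer filter that rejects texts containing an assumed keyword and a receiver who only accepts one category, a message is forwarded only when it passes both layers.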

Further, the short message text preprocessing operation specifically includes the following steps:

step 101, reading a short message into a memory, recording an ASCII code corresponding to each read character by using an integer variable, and reading a first character;

step 102, judging a numerical range of a read-in character, if the numerical range is in a Chinese character coding range in a Chinese character set, adding the read-in character into a character string variable, otherwise, taking the read-in character as the character string variable and adding a space;

step 103, returning to step 101 until all characters of the short message have been read in, whereupon the preprocessing operation ends.
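One possible reading of steps 101 to 103 is sketched below in Python. Two points are assumptions rather than statements from the text: the "Chinese character coding range" is taken to be the Unicode CJK Unified Ideographs block U+4E00 to U+9FFF, and the "otherwise" branch of step 102 is interpreted as replacing each non-Chinese character with a space. The function name `preprocess` is hypothetical.

```python
def preprocess(message: str) -> str:
    """Keep Chinese characters; replace every other character with a space."""
    # Assumed range: Unicode CJK Unified Ideographs, U+4E00..U+9FFF.
    cjk_start, cjk_end = 0x4E00, 0x9FFF
    kept = []
    for ch in message:           # steps 101 and 103: read characters one by one
        code = ord(ch)           # integer code of the character just read
        if cjk_start <= code <= cjk_end:
            kept.append(ch)      # step 102: Chinese character, keep it
        else:
            kept.append(" ")     # step 102 (assumed reading): substitute a space
    return "".join(kept)
```

Under these assumptions, punctuation, digits and Latin letters all become separators, so the later word segmentation step only sees runs of Chinese characters.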

Further, the bayesian classification specifically comprises the following steps:

step 201, reading in training sample short messages, and counting the number of various short messages;

step 202, reading in a word segmentation dictionary, and performing word segmentation processing on a training sample short message to obtain each entry and a corresponding document frequency DF value;

step 203, according to a feature vector selection method, sorting the entries by document frequency DF value from large to small and selecting the first 50 feature words to form the feature vector;

step 204, reading in a training sample short message, and training a Bayesian classifier;

step 205, reading in the short message to be classified, identifying it with the trained Bayesian classifier, and giving a classification result.
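The feature selection of steps 202 and 203 can be sketched as follows. This is a minimal illustration that assumes the training short messages have already been segmented into token lists; the function name `select_features` and the alphabetical tie-break are choices made here, not details given in the text.

```python
from collections import Counter
from typing import Iterable, List


def select_features(docs: Iterable[List[str]], top_k: int = 50) -> List[str]:
    """Return the top_k entries ranked by document frequency (DF)."""
    df = Counter()
    for tokens in docs:
        # DF counts an entry once per short message, however often it occurs.
        df.update(set(tokens))
    # Sort by DF from large to small; ties are broken alphabetically.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]
```

With `top_k` left at 50, the returned list corresponds to the 50 feature words of step 203.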

Further, the calculating step of the bayesian classifier comprises:

step 301, after text word segmentation, a vector space model is applied to express the data sample short message as an n-dimensional feature vector $X = (w_1, w_2, w_3, \ldots, w_n)$, where $w_i$ is the absolute word frequency;

step 302, suppose there are m classes $C_1, \ldots, C_m$ in total. Given a sample X to be classified, calculate the probability $P(C_i \mid X)$ that X belongs to class $C_i$; X is finally assigned to the class $C_i$ with the largest $P(C_i \mid X)$. Bayesian classification assigns the unknown sample X to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for all $1 \le j \le m$, $j \ne i$, that is, $P(C_i \mid X)$ is maximized. The class with the largest $P(C_i \mid X)$ is called the maximum a posteriori hypothesis. According to Bayes' theorem, $P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$; since $P(X)$ is constant for all classes, only $P(X \mid C_i)\,P(C_i)$ needs to be maximized;

step 303, first calculate $P(C_i) = s_i / s$, where $s_i$ is the number of samples in class $C_i$ and $s$ is the total number of training samples;

step 304, then calculate $P(X \mid C_i)$. For a data set with multiple attributes, $P(X \mid C_i)$ is computed under the assumption that the attributes are independent of each other, thus

$P(X \mid C_i) = \prod_{k=1}^{n} P(w_k \mid C_i)$

where $P(w_k \mid C_i)$ can be calculated from the training samples;

step 305, to classify the unknown sample X, calculate $P(X \mid C_i)\,P(C_i)$ for each class $C_i$; X is assigned to the class $C_i$ for which $P(X \mid C_i)\,P(C_i)$ is largest.
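Steps 301 to 305 above can be illustrated with a small naive Bayes sketch in Python. Several details are assumptions added for the sketch and are not stated in the text: add-one (Laplace) smoothing of the per-word probabilities, summing logarithms instead of multiplying probabilities, and the class name `NaiveBayes`.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


class NaiveBayes:
    def __init__(self, features: Sequence[str]):
        self.features = set(features)               # selected feature words
        self.class_counts: Counter = Counter()      # s_i per class
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)

    def train(self, samples: Sequence[Tuple[List[str], str]]) -> None:
        for tokens, label in samples:
            self.class_counts[label] += 1
            self.word_counts[label].update(t for t in tokens if t in self.features)

    def classify(self, tokens: List[str]) -> Optional[str]:
        s = sum(self.class_counts.values())         # total training samples
        # Step 301: absolute word frequencies of the feature words in X.
        x = Counter(t for t in tokens if t in self.features)
        best, best_score = None, -math.inf
        for c, s_i in self.class_counts.items():
            total = sum(self.word_counts[c].values())
            score = math.log(s_i / s)               # step 303: log P(C_i)
            for w, freq in x.items():
                # Step 304 with assumed add-one smoothing: P(w | C_i).
                p = (self.word_counts[c][w] + 1) / (total + len(self.features))
                score += freq * math.log(p)
            if score > best_score:                  # step 305: keep the maximum
                best, best_score = c, score
        return best
```

Working in log space leaves the arg max of $P(X \mid C_i)\,P(C_i)$ unchanged while avoiding numeric underflow when many small probabilities are multiplied.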

Furthermore, the short message data security management remote server comprises a feature library maintenance updating module and a short message content processing module;

the characteristic library maintenance and updating module is used for maintaining and updating the public characteristic library and the personalized characteristic library;

the short message content processing module comprises a short message preprocessing module, a word segmentation processing module and a feature extraction module, wherein the short message preprocessing module is used for preprocessing a short message text; the word segmentation processing module is used for segmenting words of the short messages; the characteristic extraction module is used for extracting the short message length characteristic, the frequency characteristic, the rule characteristic and the text characteristic information.

Furthermore, the maintenance and updating comprises two modes, the first mode is a training learning mode, and after short message classification information fed back by the user terminal is received, a machine training algorithm is triggered to perform machine learning, and the personalized feature library is updated; the second is to maintain a public characteristic library in a short message data security management remote server, and manually update the public characteristic library.

Furthermore, the public feature library is shared by all users; it contains a black and white list filtering feature library and a keyword library, both of which are manually updated at any time.

Further, the personalized feature library is private to each user; with the user's mobile phone number as the key, the system establishes a personalized classifier for each user and generates two tables, a personalized classification category table and a category feature table, which respectively store the user's personalized classification categories and the feature words of each category.
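A first-layer check against such a public feature library might look like the following Python sketch. The class name `PublicFeatureLibrary`, its field names and the order of precedence (white list first, then black list, then keywords) are assumptions made for illustration; the text does not specify how the three lists are combined.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PublicFeatureLibrary:
    whitelist: Set[str] = field(default_factory=set)  # trusted sender numbers
    blacklist: Set[str] = field(default_factory=set)  # blocked sender numbers
    keywords: Set[str] = field(default_factory=set)   # spam keywords

    def passes(self, sender: str, text: str) -> bool:
        """Return True if the message passes the first-layer filter."""
        if sender in self.whitelist:      # assumed: white list wins outright
            return True
        if sender in self.blacklist:
            return False
        # Otherwise shield the message if any spam keyword occurs in the text.
        return not any(k in text for k in self.keywords)
```

Because the library is shared by all users, a single instance would be consulted for every incoming message before any per-user classifier runs.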

Further, the user terminal transmits the fed back individual classification information to the server terminal through the network, wherein the individual classification information comprises short message classification category, whether the short messages are junk short messages or not, information sender number, sending time and receiver number information.
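The feedback record listed above can be modeled as a small data structure. The sketch below is illustrative only: the field names, the ISO-style timestamp string and the use of JSON as the wire format are assumptions, since the text names the fields but not their encoding.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Feedback:
    category: str          # short message classification category
    is_spam: bool          # whether the user marked it as a junk short message
    sender_number: str     # information sender number
    sent_at: str           # sending time (format assumed)
    receiver_number: str   # receiver number


def encode_feedback(fb: Feedback) -> str:
    """Serialize one feedback record for upload to the server (JSON assumed)."""
    return json.dumps(asdict(fb), ensure_ascii=False, sort_keys=True)
```

Keeping the receiver number in the record lets the server locate the matching personalized feature library for the update.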

Furthermore, the public feature library is maintained manually at the server side according to the short message classification information fed back by the user terminal, and the black and white list and the keyword list are updated regularly or at any time; the personalized feature library is updated by incremental learning: when the server side receives short message classification information fed back by the user terminal, a machine training algorithm is triggered automatically, machine learning is performed, and the personalized feature library is updated.

Through the embodiment of the application, the following technical effects can be obtained: using a short message classification algorithm based on text content and machine learning, the invention designs a two-layer classification model that combines centralized monitoring and filtering at the server side with personalized selection by the short message receiver, and realizes high-precision intelligent short message classification.
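Incremental learning of a personalized feature library can be sketched as per-user counts that are updated one feedback item at a time, without retraining from scratch. The class `PersonalFeatureLibrary` below is hypothetical, and ranking feature words by accumulated word count is a simplification assumed here for brevity.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List


class PersonalFeatureLibrary:
    """Per-user, per-category word counts updated incrementally."""

    def __init__(self, top_k: int = 50):
        self.top_k = top_k
        # user phone number -> category -> word counts
        self.counts: DefaultDict[str, DefaultDict[str, Counter]] = defaultdict(
            lambda: defaultdict(Counter)
        )

    def learn(self, user: str, category: str, tokens: List[str]) -> None:
        # Each feedback item only adds to the existing counts.
        self.counts[user][category].update(tokens)

    def feature_words(self, user: str, category: str) -> List[str]:
        ranked = sorted(
            self.counts[user][category].items(), key=lambda kv: (-kv[1], kv[0])
        )
        return [w for w, _ in ranked[: self.top_k]]
```

The user terminal would then download the output of `feature_words` for each of its categories when it synchronizes with the server.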

[ description of the drawings ]

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and those skilled in the art can also obtain other drawings according to the drawings without inventive labor.

FIG. 1 is a flow chart of a management method of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The method is applied to a short message data security management system, the system adopts a short message classification algorithm based on text content and machine learning, and a two-layer classification model based on personalized selection of a short message receiver and centralized filtering and classification of a short message data security management remote server side is adopted, so that high-precision short message intelligent classification and management are realized.

The short message data security management system comprises a short message data security management remote server and a user terminal classification engine, wherein a core classification algorithm is deployed at the short message data security management remote server, and a personalized classification engine is arranged at a user terminal;

the short message data safety management remote server collects the classified short message information from the user terminal and then performs machine learning on the short message information. And after machine learning, updating the feature libraries (a public feature library and an individualized feature library), downloading the updated feature libraries from the short message data security management remote server by the user terminal, and carrying out adaptive adjustment by an individualized classification engine on the user terminal according to the updated feature libraries.

The short message data personalized classification engine runs on the user terminal and runs in a background service mode, and the short message data personalized classification engine uses data which is stored in the user terminal and is synchronously updated with a short message data safety management remote server, namely a personalized short message feature library. The short message data personalized classification engine has the following functions:

(1) the user can decide the classification of the short messages according to the preference of the user. When the short message arrives, calling a word segmentation module and a Bayesian classification module on the user terminal according to the personalized short message feature library to classify the short message;

(2) the classified category of the short message, information sender number, sending time, receiver number and other information are fed back to the short message data safety management remote server through a network;

(3) and downloading the updated personalized feature library from the short message data safety management remote server regularly through a network, and classifying and judging by adopting the updated feature library when the next short message reaches the terminal.

The short message data security management remote server realizes a distributed short message classification learning mode and comprises a feature library maintenance updating module and a short message content processing module;

the maintenance and updating comprises two modes: the first is a training and learning mode, in which, after short message classification information fed back by a user terminal is received, a machine training algorithm is triggered to perform machine learning and the personalized feature library is updated; the second is that a public feature library is maintained in the short message data security management remote server and is manually updated;

the short message content processing module comprises a short message preprocessing module, a word segmentation processing module and a feature extraction module;

the short message preprocessing module is used for preprocessing the short message text;

the word segmentation processing module is used for segmenting words of the short messages;

the characteristic extraction module is used for extracting the short message length characteristic, the frequency characteristic, the rule characteristic and the text characteristic information;

fig. 1 is a schematic flow chart of a management method of the present invention, which includes the following steps:

The short message text preprocessing operation specifically comprises the following steps:

step 102, judging the numerical range of the read-in character, if the numerical range is in the Chinese character coding range in the Chinese character set, such as 19800-;

The Bayesian classification specifically comprises the following steps:

The calculation step of the Bayesian classifier comprises the following steps:

step 302, suppose there are m classes $C_1, \ldots, C_m$ in total. Given a sample X to be classified, calculate the probability $P(C_i \mid X)$ that X belongs to class $C_i$; X is finally assigned to the class $C_i$ with the largest $P(C_i \mid X)$. Bayesian classification assigns the unknown sample X to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$ for all $1 \le j \le m$, $j \ne i$, thus maximizing $P(C_i \mid X)$. The class with the largest $P(C_i \mid X)$ is called the maximum a posteriori hypothesis. According to Bayes' theorem, $P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$; since $P(X)$ is constant for all classes, only $P(X \mid C_i)\,P(C_i)$ needs to be maximized;

where the per-feature conditional probabilities $P(w_k \mid C_i)$ can be calculated from the training samples;

For a newly received short message, the short message data safety management remote server first performs the text preprocessing described above and, from the preprocessing result and the public feature library, judges whether the message is a junk short message. If it is, the message is shielded and filtered; otherwise it is sent to the user terminal of the receiving party user, where short message text processing is performed again.

To realize intelligent classification of short messages, the content of the short messages must first be understood, and for Chinese text this requires word segmentation. The preprocessed short message text is segmented at both the server side and the client side. Current word segmentation algorithms fall into two main types. One is mechanical word segmentation, which is generally based on a word segmentation dictionary and segments by matching the Chinese character strings in a document against the words in the word list one by one. The other is comprehension-based word segmentation, which uses grammatical and semantic knowledge of Chinese and requires a word segmentation database, a knowledge base and an inference base to be established. Because comprehension-based word segmentation is still far from mature in semantic analysis, grammatical analysis and related areas, most existing word segmentation systems adopt mechanical word segmentation.

Considering that short message text mostly uses common words, that the server side requires real-time short message processing, and that the user terminal has limited storage and processing capacity, the word dictionary is simplified and compressed to reduce its size. The final word segmentation dictionary contains about 40,000 entries and occupies about 215M of storage space, which fully meets the word segmentation requirements of both the server side and the user terminal.
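Forward maximum matching is one common form of dictionary-based mechanical word segmentation, and a minimal Python sketch of it follows. The text does not say which matching strategy the system uses, so the choice of forward maximum matching, the function name `forward_max_match` and the maximum word length of 4 are all assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by greedily taking the longest dictionary word at each position."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

A smaller dictionary, such as the compressed one described in this section, keeps the lookup set small enough for a user terminal while leaving the matching loop unchanged.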

The system maintains two feature libraries at the server side: a public feature library and a personalized feature library. The public feature library is shared by all users; it contains a black and white list filtering feature library and a keyword library, and is manually updated at any time. The personalized feature library is private to each user. With the user's mobile phone number as the key, the system establishes a personalized classifier for each user and generates two tables, a personalized classification category table and a category feature table, which store respectively the user's personalized classification categories (such as a house category, an automobile category and a stock category) and the 50 feature words contained in each category. The user terminal transmits the fed-back personalized classification information to the server side through the network, including the classification category of the short message, whether it is a junk short message, the sender number, the sending time and the receiver number.

The public feature library is maintained manually at the server side according to the short message classification information fed back by the user terminal, and the black and white list and the keyword list are updated regularly or at any time. The personalized feature library is updated by incremental learning: when the server side receives short message classification information fed back by the user terminal, a machine training algorithm is triggered automatically, machine learning is performed, and the personalized feature library is updated.

The system is provided with two Bayesian classifiers at the short message data safety management remote server. One classifies short messages in real time: when a user's short message reaches the short message center, the server calls it to classify and judge the message. The other runs as a background service program that polls each user regularly, performs training and learning according to the information fed back by the user, and updates each user's personalized feature library. The system also starts a Bayesian classifier on the user terminal, which classifies and judges the short messages reaching the user terminal and then prompts the user to make a selection and judgment.

The management method of the application was tested. A public feature library was established from the black and white list and keywords provided by an operator, and test short messages were sent to the short message data security management system. First-layer filtering at the server side filtered out 472 spam short messages. On inspection, these comprised 205 pornographic messages, 45 fraud messages, 80 rumor messages, 130 public safety messages and 12 advertisement messages, which is fully consistent with the number of manually labeled spam messages and indicates that the first-layer public feature library filters spam short messages well.

In the test of second-layer personalized classification performance, the test short messages were classified into 3 categories: automobile, house property and stock. Short messages of these 3 categories were selected from all test data as test samples: 200 automobile, 300 house property and 500 stock messages. The second-layer personalized classification performance was measured by classification accuracy, which exceeded 90 percent for each of the automobile, house property and stock categories.

In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via ROM. When the computer program is loaded and executed, it may carry out one or more of the steps of the method described above.

The functions described above in this disclosure may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A management method applied to short message data security is characterized in that the method is applied to a short message data security management system, the short message data security management system comprises a short message data security management remote server and a user terminal, a short message data personalized classification engine runs on the user terminal, and the short message data personalized classification engine uses data which are stored in the user terminal and are updated synchronously with the short message data security management remote server;

the management method comprises the following steps:

step 2, after the short messages that pass the first-layer filtering are sent into the individual classifier, Bayesian classification is applied according to the individual feature library to carry out the second-layer individual classification, and the short message data safety management remote server sends a short message carrying the classification result to the receiver user, asking whether to receive the short messages under that classification; if the receiver user chooses to receive them, the short message is forwarded to the user terminal of the receiver user, otherwise the short message data safety management remote server shields the short message;

2. The management method according to claim 1, wherein the short message text preprocessing operation specifically comprises the following steps:

3. The management method according to claim 1, wherein the bayesian classification comprises the following steps:

4. The method for managing according to claim 1, wherein the step of calculating by the bayesian classification module comprises:

where the per-feature conditional probabilities $P(w_k \mid C_i)$ can be calculated from the training samples;

5. The management method of claim 1, wherein the short message data security management remote server comprises a feature library maintenance and update module and a short message content processing module;

6. The management method according to claim 5, wherein the maintenance update includes two modes, the first mode is a training learning mode, and after receiving the short message classification information fed back by the user terminal, a machine training algorithm is triggered to perform machine learning, and the personalized feature library is updated; the second is to maintain a public characteristic library in a short message data security management remote server, and manually update the public characteristic library.

7. The management method of claim 1, wherein the public feature library is shared by all users and contains a black and white list filtering feature library and a keyword library, both of which are manually updated at any time.

8. The management method according to claim 1, wherein the personalized feature library is private to each user; with the user's mobile phone number as the key, the system establishes a personalized classifier for each user and generates two tables, a personalized classification category table and a category feature table, which respectively store the user's personalized classification categories and the feature words of each category.

9. The management method according to claim 1, wherein the user terminal transmits the fed back individual classification information to the server terminal through the network, including classification type of the short message, whether the short message is spam, number of the sender of the message, sending time, and number information of the receiver.

10. The management method according to claim 8 or 9, wherein the public feature library is maintained manually by the server and according to the short message classification information fed back by the user terminal, and the black-and-white list and the keyword list are updated periodically or at any time; the personalized feature library receives the short message classification information fed back by the client terminal through the server terminal, performs incremental learning, automatically triggers a machine training algorithm after receiving the short message fed back by the user, performs machine learning, and updates the personalized feature library.