CN113591994A - Terminal behavior prediction method based on automatic labeling

Info

Publication number: CN113591994A
Application number: CN202110884609.XA
Authority: CN
Inventors: 张宁波; 严雅洁
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2021-11-02
Anticipated expiration: 2041-08-03
Also published as: CN113591994B

Abstract

The invention discloses a terminal behavior prediction method based on automatic labeling, which comprises five steps: data preprocessing, frequent behavior pattern mining, behavior pattern clustering, behavior recognition and behavior prediction. The method automatically labels the operation data sequence without human intervention, which solves the problem that terminal behavior prediction models in current Internet of Things scenarios cannot perform behavior labeling automatically. The method achieves high behavior recognition and behavior prediction accuracy, saves much of the time and labor cost required by the behavior labeling process, integrates terminal behavior recognition and behavior prediction in the Internet of Things environment, and makes the terminal behavior prediction model more intelligent.

Description

Terminal behavior prediction method based on automatic labeling

Technical Field

The invention relates to the technical field of networks, in particular to a terminal behavior prediction method based on automatic labeling.

Background

In recent years, Internet of Things (IoT) technology has developed rapidly and has greatly improved daily life. The number of intelligent terminal devices has increased markedly, and an intelligent Internet of Everything is the inevitable trend of future IoT development. In the LTE-A network, call detail records (CDR) in the core network store, in real time, the call, short message and data service information of human-to-human (H2H) communication, including the User Equipment Identity (UE ID), the base station location, the direction and type of the communication (SMS or voice call), the data traffic, and the like. Hidden predictable information can be extracted from the CDR data to predict the future behavior of a terminal, so that a network operator can prepare a coping strategy in advance and improve its service efficiency. Similarly, in the 5G network, the core network stores event log (EDR) data of IoT terminals in real time. The EDR data includes information such as the UE ID, the terminal operation sequence, the operation execution time, the operation duration, and the physical resources occupied. From these data, the access behavior of an IoT terminal can be predicted.

Existing terminal behavior prediction models obtain a terminal behavior sequence by manually labeling the terminal operation sequence with behaviors, and this sequence is then used to build the prediction model. The labeling process requires human intervention, consumes a large amount of time and labor, and limits practical application.

The modeling process of the conventional terminal behavior prediction model includes the following steps.

Step 1: preprocessing the EDR data of the terminal: abnormal feature data are processed to obtain EDR data that can be labeled with behaviors.

Step 2: manual behavior labeling: a group of consecutive operation events corresponds to one behavior of the terminal. Researchers label the behaviors manually, converting the terminal operation sequence into the corresponding terminal behavior sequence used for terminal behavior prediction.

Step 3: behavior prediction: the behavior of the terminal at the next moment is predicted by a prediction model from the labeled historical terminal behavior data and the current terminal behavior.

Existing terminal behavior prediction models require the operation-event-level data to be manually labeled at the behavior level. The prediction model therefore depends on human intervention, which hinders making it intelligent and limits its practical application. In addition, when the volume of terminal data is very large, the workload and time cost of labeling and verifying terminal behaviors increase significantly.

Disclosure of Invention

The invention aims to provide a terminal behavior prediction method based on automatic labeling that achieves a high-accuracy terminal behavior prediction model, labels behaviors automatically without human intervention, reduces time and labor costs, and further improves the intelligence and practicability of the terminal behavior prediction model.

To achieve the above purpose, the invention provides the following technical solution:

The invention provides a terminal behavior prediction method based on automatic labeling, which comprises the following steps:

S1, data preprocessing: acquiring current behavior data of a terminal, numbering the terminal operation sequence, filtering infrequent operation events out of the terminal operation sequence data, and renumbering the processed operation data;

S2, frequent behavior pattern mining: performing frequent behavior pattern mining on the operation data processed in step S1, and stopping the iteration when no new behavior pattern is mined, so that the behavior pattern sequence satisfies the minimum description length principle;

S3, behavior pattern clustering: clustering the frequent behavior patterns mined in step S2 to obtain the cluster centers and the class to which each behavior pattern belongs;

S4, behavior recognition: performing behavior recognition on the clustering result with a hidden Markov model (HMM), and labeling to obtain the current behavior and the historical behavior of the terminal;

S5, behavior prediction: inputting the current terminal behavior into a trained prediction model to obtain the predicted terminal behavior at the next moment, wherein the behavior prediction model is obtained by training a neural-network-based prediction model on training samples, and the training samples comprise the historical behavior of the terminal.

Further, in step S1, the current behavior data of the terminal includes EDR data of the terminal and log information, and the current behavior of the terminal is obtained by automatic labeling based on the EDR data of the terminal.

Further, the EDR data of the terminal comprises at least one of the following: UE ID, terminal operation sequence, operation execution time, operation duration and occupied physical resource information.

Further, the method in step S2 specifically includes:

S201, using a sliding window to find non-repeating general behavior patterns of length L: setting the initial iteration number to 1 and the sliding window size to L, searching for behavior patterns of length L, merging repeated behavior patterns, and taking the merged behavior patterns as the initial general behavior patterns;

S202, judging whether a behavior pattern of length L+1 is a variant of a general behavior pattern of length L or a new general behavior pattern: comparing the similarity between the behavior pattern of length L+1 and the general behavior pattern of length L, the similarity being measured by the edit distance; if the similarity is greater than a given threshold, the behavior pattern of length L+1 is considered a variant of the general behavior pattern of length L; otherwise, it is regarded as a new general behavior pattern of length L+1; the general behavior patterns and their corresponding variants are stored in a dictionary;

S203, judging by the minimum description length principle whether a general behavior pattern needs pruning, pruning the mined general behavior patterns that do not conform to the minimum description length principle together with their variants, and stopping the iteration when no further general behavior pattern can be found.

Further, the behavior pattern clustering method in step S3 is: initially selecting the cluster centers at random, and iteratively updating the cluster centers according to the edit distance until convergence.

Further, the behavior recognition in step S4 corresponds to the decoding problem of the HMM, which is solved using the Viterbi algorithm.

Further, the terminal operation sequence is used as an observation sequence, and the clustered terminal behavior mode is used as a hidden state, so that parameters required by the Viterbi algorithm are calculated, wherein the parameters comprise an observation probability matrix, an initial state probability matrix and a state transition probability matrix.

Further, the initial state probability matrix is calculated as follows: for each class, the total number of occurrences of all behavior patterns in that class is divided by the total number of occurrences of all behavior patterns in all classes.

Further, the state transition probability matrix is calculated as follows: during behavior pattern extraction, the start position and end position of each behavior pattern in the operation sequence data are marked and recorded; for each behavior pattern in one class, the recorded start and end indices are compared with the start and end indices of each behavior pattern in the other classes; if the index ranges have no inclusion relationship, the number of state transitions is incremented by 1; the number of state transitions of each class is then divided by the total number of state transitions to obtain the transition probability from each class to every other class.

Further, the observation probability matrix is calculated by dividing the total number of occurrences of each operation by the total number of occurrences of all operations in each class.

Compared with the prior art, the invention has the beneficial effects that:

The terminal behavior prediction method based on automatic labeling comprises five steps: data preprocessing, frequent behavior pattern mining, behavior pattern clustering, behavior recognition and behavior prediction. It combines the behavior recognition model with the behavior prediction model and labels the operation data sequence automatically without manual intervention, which solves the problem that terminal behavior prediction models in current Internet of Things scenarios cannot label behaviors automatically. The method achieves high behavior recognition and behavior prediction accuracy, saves much of the time and labor cost of the behavior labeling process, integrates terminal behavior recognition and behavior prediction in the Internet of Things environment, and makes the terminal behavior prediction model more intelligent.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them.

Fig. 1 is a flowchart of a terminal behavior prediction method based on automatic annotation according to an embodiment of the present invention.

Fig. 2 is a flowchart of frequent behavior pattern mining according to an embodiment of the present invention.

Detailed Description

For a better understanding of the present solution, the method of the present invention is described in detail below with reference to the accompanying drawings.

The terminal behavior prediction method based on automatic labeling provided by the invention combines a behavior recognition model with a behavior prediction model and mainly comprises five steps: data preprocessing, frequent behavior pattern mining, behavior pattern clustering, behavior recognition and behavior prediction, as shown in Fig. 1. The steps are as follows:

Step 1: data preprocessing: the terminal operation sequence is first numbered, infrequent operation events are then filtered out of the sequence, and the processed operation data are renumbered. Assuming that the terminal operation data contain X operation events, they are first numbered randomly from 0 to X-1. With the frequency threshold set to f, the f·X operation events whose occurrence frequency exceeds the threshold are retained, the (1-f)·X operation events whose frequency is below the threshold are eliminated, and the retained f·X operation events are renumbered.
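The preprocessing in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess` is hypothetical, and f is interpreted, following the f·X wording above, as the fraction of distinct operation events to retain.

```python
from collections import Counter

def preprocess(ops, f):
    """Keep the most frequent fraction f of distinct operation events and renumber them.

    Assumption: f is a fraction in (0, 1]; ties in frequency are broken by
    first appearance, which is how Counter.most_common orders equal counts.
    """
    counts = Counter(ops)                       # occurrences of each event
    keep_n = max(1, int(f * len(counts)))       # number of distinct events to retain
    kept = [op for op, _ in counts.most_common(keep_n)]
    new_id = {op: i for i, op in enumerate(sorted(kept))}  # renumber 0..keep_n-1
    filtered = [new_id[op] for op in ops if op in new_id]  # drop infrequent events
    return filtered, new_id
```

For the sequence `[3, 5, 6, 7, 6, 7, 6, 9]` with `f = 0.5`, only events 6 and 7 are retained and they are renumbered 0 and 1.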

Step 2: frequent behavior pattern mining: frequent behavior pattern mining is performed on the operation data processed in step 1; the flow of this step is shown in Fig. 2.

The first step uses a sliding window to find non-repeating general behavior patterns of length L. Specifically, the initial iteration number is 1 and the sliding window size is set to L (the initial size is L = 2); behavior patterns of length L are searched for, and after repeated patterns are merged, the result serves as the initial set of general behavior patterns. Assume the operation data sequence is [3,5,6,7,6,7, …] with length N. With the initial window size L = 2, sliding-window extraction yields N-L+1 behavior patterns, namely [3,5], [5,6], [6,7], [7,6], [6,7], …, and the repeated behavior pattern [6,7] is merged so that only one copy is kept.
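The sliding-window extraction can be sketched as below. The function name `sliding_window_patterns` is hypothetical; recording the start index of every occurrence is an addition made here because later steps refer to the start and end positions of each pattern.

```python
def sliding_window_patterns(seq, L=2):
    """Return each distinct length-L pattern with the start indices where it occurs."""
    patterns = {}
    for i in range(len(seq) - L + 1):           # N - L + 1 windows in total
        window = tuple(seq[i:i + L])
        patterns.setdefault(window, []).append(i)  # repeated patterns merge into one key
    return patterns
```

Applied to `[3, 5, 6, 7, 6, 7]`, the five windows collapse to four distinct patterns, with `(6, 7)` recorded at start indices 2 and 4.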

The second step judges whether a behavior pattern of length L+1 is a variant of a general behavior pattern of length L or a new general behavior pattern; this step is a merging process. The similarity between the behavior pattern of length L+1 and the general behavior pattern of length L is compared, the similarity being measured by the edit distance. If the similarity is greater than a given threshold, the behavior pattern of length L+1 is considered a variant of the general behavior pattern of length L; otherwise, it is considered a new general behavior pattern of length L+1. A general behavior pattern may have many variants. For ease of comparison and efficient lookup, each general behavior pattern and its variants are stored in a dictionary. For example, assume the similarity threshold is 0.6, the extracted general behavior pattern of length L (L = 2) is [6,7], and the mined behavior pattern of length L+1 (that is, 3) is [5,6,7]. The similarity between the two is greater than the threshold, so [5,6,7] is considered a variant of the general behavior pattern [6,7], and the general behavior pattern and its variant are stored in the dictionary.
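The variant test can be illustrated with a short sketch. The text does not state how the edit distance is converted into a similarity; the normalization used here, 1 - distance / max(length), is an assumption, chosen because it reproduces the example above ([6,7] and [5,6,7] have edit distance 1 and similarity of about 0.67, which exceeds 0.6). The names `similarity` and `assign_pattern` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y))) # replace, or match at no cost
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Assumed normalization: the source only says similarity is measured by edit distance.
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def assign_pattern(candidate, patterns, threshold=0.6):
    """Store candidate as a variant of the first similar general pattern, else as a new one.

    patterns maps each general pattern (tuple) to the list of its variants.
    """
    for general in patterns:
        if similarity(general, candidate) > threshold:
            patterns[general].append(candidate)
            return general
    patterns[candidate] = []
    return candidate
```

Using a dictionary keyed by the general pattern keeps each pattern and its variants together, which matches the storage scheme described in the text.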

The third step is the pruning operation, performed at the end of the iteration, in which the minimum description length principle is used to decide whether a general behavior pattern needs pruning. Specifically, the mined general behavior patterns that do not conform to the minimum description length principle, together with their variants, are pruned, so that frequent behavior patterns are found to the greatest possible extent. Pruning greatly reduces the redundancy of the behavior patterns. The iteration stops when no further general behavior pattern is found. Assume that step 2 yields M general terminal behavior patterns; these patterns and their variants are stored in a dictionary.
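The text does not spell out how the description length is computed. One common formulation, assumed here purely for illustration, measures the length of the pattern definition plus the length of the sequence after every occurrence of the pattern is replaced by a single symbol, and keeps the pattern only if this total is shorter than the original sequence. The function names and the placeholder symbol `"P"` are hypothetical.

```python
def compress(seq, pattern, symbol="P"):
    """Replace each non-overlapping occurrence of pattern in seq with one symbol."""
    out, i, L = [], 0, len(pattern)
    while i < len(seq):
        if tuple(seq[i:i + L]) == tuple(pattern):
            out.append(symbol)
            i += L
        else:
            out.append(seq[i])
            i += 1
    return out

def description_length(seq, pattern):
    # Assumed measure: size of the pattern definition plus size of the compressed data.
    return len(pattern) + len(compress(seq, pattern))

def satisfies_mdl(seq, pattern):
    """True if describing seq with the pattern is shorter than seq alone."""
    return description_length(seq, pattern) < len(seq)
```

Under this assumed measure, a pattern that occurs only once never pays for its own definition and would be pruned, while a pattern that repeats often enough is kept.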

Step 3: behavior pattern clustering: the frequent behavior patterns mined in step 2 are clustered, and the class of each behavior pattern (that is, which behavior it represents) is obtained from the clustering result.

The behavior patterns are first preprocessed. A terminal behavior pattern obtained by frequent pattern mining is composed of operation events, whereas in the clustering algorithm a pattern consists of states. A state thus corresponds to an event of the pattern, but may also carry additional information such as the operation duration, the occupied physical resources, and the type of the operation event. All consecutive states corresponding to the same operation are merged into one extended state. For example, if an operation is triggered several times in succession and no other operation event interrupts the sequence, the repeated operation events are merged into one operation event with a longer duration, and the duration (the number of repeated triggers) is recorded as a state attribute. After this processing, the operation event sequence is converted into an extended state sequence. The representation of a behavior pattern becomes simpler and more compact, and it is easier to compare whether two behavior patterns are similar, which reduces the computational complexity.
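Merging consecutive repeats into extended states amounts to run-length encoding. The sketch below is one way to express it; the function name `to_extended_states` and the `(operation, repeat_count)` tuple layout are assumptions made for illustration.

```python
from itertools import groupby

def to_extended_states(events):
    """Collapse runs of the same operation into (operation, repeat_count) states."""
    return [(op, sum(1 for _ in run)) for op, run in groupby(events)]
```

For example, the event sequence `[6, 6, 6, 7, 5, 5]` becomes three extended states, with the repeat count standing in for the duration attribute.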

The clustering of behavior patterns is illustrated with the K-means method, although the invention is not limited to it. To calculate the similarity between two behaviors, the distance between two extended state sequences must be defined. The operation sequence and the extended state sequence are not numerical sequences but category sequences: the values in the sequence denote categories, not positions in space. Common numerical distance measures therefore cannot be used to measure the similarity between two behavior sequences, and the edit distance is adopted instead.

The edit distance is mainly used to compare the similarity between two character strings. It is the minimum number of edit operations required to transform one string into the other; the larger the distance, the more the strings differ. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. By this definition, the edit distance is suitable for measuring the distance between extended state sequences (category sequences).

The behavior patterns mined in step 2 are clustered as follows: the cluster centers are initially selected at random and are then updated iteratively according to the edit distance until convergence. After clustering, the cluster centers and the class of each behavior pattern are obtained. For example, assume that M terminal behavior patterns are mined in step 2 and that they belong to 5 classes of terminal behavior. Clustering then yields 5 cluster centers and the class of each behavior pattern (class numbers 1 to 5), for example [2,3,1,5,5,4, …], where the number at each position is the class of the behavior pattern at that position.
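A K-means style loop over category sequences might look like the sketch below. Because a mean of sequences is not defined, each center is updated to the cluster member with the smallest total edit distance to the other members (a medoid); this update rule is an assumption, since the text only says the centers are updated according to the edit distance. The function names and the fixed `seed` are hypothetical, and classes are numbered from 0 rather than 1.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def nearest_center(patterns, centers):
    """Index of the closest center for every pattern."""
    return [min(range(len(centers)), key=lambda c: edit_distance(p, centers[c]))
            for p in patterns]

def cluster(patterns, k, seed=0, max_iter=100):
    """Cluster sequences with randomly chosen initial centers and medoid updates."""
    rng = random.Random(seed)
    centers = rng.sample(patterns, k)            # random initial centers
    for _ in range(max_iter):
        labels = nearest_center(patterns, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(patterns, labels) if l == c]
            if not members:                      # keep the old center for an empty cluster
                new_centers.append(centers[c])
                continue
            # Medoid: the member with the smallest total distance to its cluster.
            new_centers.append(min(members,
                                   key=lambda m: sum(edit_distance(m, o) for o in members)))
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return centers, nearest_center(patterns, centers)
```

The returned label list plays the role of the class list such as [2,3,1,5,5,4, …] in the text, one class index per behavior pattern.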

Step 4: behavior recognition: behavior recognition of the terminal is performed using a hidden Markov model (HMM).

Behavior recognition corresponds to the decoding problem of the HMM, which is solved with the Viterbi algorithm. The Viterbi algorithm uses dynamic programming to find the path with the highest probability (the optimal path), where a path corresponds to a hidden state sequence of the HMM. In the HMM, the terminal operation sequence is regarded as the observation sequence, and the clustering result is regarded as the hidden states.
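For reference, a compact Viterbi decoder is sketched below. It assumes states and observations are integer indices, `pi` is the initial state probability vector, `A` the state transition matrix and `B` the observation matrix; these parameter names are illustrative. Plain probabilities are multiplied for readability, so a practical version for long sequences would work with log probabilities to avoid underflow.

```python
def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for obs, given HMM parameters pi, A, B."""
    n = len(pi)
    # score[s]: probability of the best path that ends in state s at the current time
    score = [pi[s] * B[s][obs[0]] for s in range(n)]
    back = []                                    # best predecessor of each state, per step
    for o in obs[1:]:
        prev_best = [max(range(n), key=lambda r: score[r] * A[r][s]) for s in range(n)]
        score = [score[prev_best[s]] * A[prev_best[s]][s] * B[s][o] for s in range(n)]
        back.append(prev_best)
    # Trace the optimal path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for prev_best in reversed(back):
        state = prev_best[state]
        path.append(state)
    return path[::-1]
```

In the setting of this method, `obs` would be the renumbered terminal operation sequence and the decoded path would be the sequence of behavior classes used as labels.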

When the terminal historical data are labeled with terminal behaviors, the frequent behavior pattern mining process does not need to know the length range of the behavior sequences in advance. Behavior patterns of different lengths are mined through continued iteration so that the behavior pattern sequence satisfies the minimum description length principle; the behavior patterns are pruned to remove redundancy, and the iteration stops when no new behavior pattern is mined.

Through clustering, the class of each behavior pattern is known, and the observation probability matrix, the initial state probability matrix and the state transition probability matrix can then be calculated.

Initial state probability calculation: after clustering, every mined terminal behavior pattern is assigned to a class, so each class of behavior contains several behavior patterns. For each class (each cluster), the initial state probability is defined as the number of behavior patterns in that class divided by the total number of behavior patterns in all classes.
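The initial state probabilities follow directly from the cluster labels, as in this small sketch. The function name `initial_probs` is hypothetical, and classes are indexed from 0.

```python
from collections import Counter

def initial_probs(labels, k):
    """Share of behavior patterns that fall in each of the k classes."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in range(k)]
```

With labels `[0, 1, 1, 2]` and three classes, the result is `[0.25, 0.5, 0.25]`.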

Transition probability calculation: this calculation is more involved. During behavior pattern extraction, the start position and end position (indices) of each behavior pattern in the operation data are marked and recorded, and the class of each behavior pattern is known from the clustering result. For each behavior pattern in one class, the recorded start and end indices are compared with the start and end indices of each behavior pattern in the other classes. If the index ranges have no inclusion relationship, a state transition is considered to exist, and the number of state transitions is incremented by 1. The number of state transitions of each class is then divided by the total number of state transitions to obtain the transition probability from each class to every other class. Assuming that step 3 yields 5 classes of behavior, A, B, C, D and E, the transition probability from each class to every other class and to itself is calculated. For class A, the transition probabilities A->A, A->B, A->C, A->D and A->E need to be calculated.
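One possible reading of this counting rule is sketched below. Two points are assumptions rather than statements from the text: a transition from one occurrence to another is counted only when the second starts later (the text gives no direction), and each row is normalized by its own total so that the probabilities leaving a class sum to 1. The function name and the `(start, end, cls)` tuple layout are hypothetical.

```python
def transition_matrix(occurrences, k):
    """Estimate class-to-class transition probabilities from pattern index ranges.

    occurrences: list of (start, end, cls) tuples, one per pattern occurrence.
    """
    counts = [[0] * k for _ in range(k)]
    for i, (s1, e1, c1) in enumerate(occurrences):
        for j, (s2, e2, c2) in enumerate(occurrences):
            if i == j:
                continue
            # Inclusion: one index range lies entirely inside the other.
            contained = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
            # Assumed direction: count c1 -> c2 only if the second occurrence starts later.
            if not contained and s2 > s1:
                counts[c1][c2] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumed fallback: a class with no outgoing transitions gets a uniform row.
        matrix.append([x / total if total else 1 / k for x in row])
    return matrix
```

The diagonal entries correspond to the self-transitions such as A->A mentioned in the text.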

Observation probability calculation: the terminal operation sequence is taken as the observation sequence. The number of occurrences of each operation event in the terminal operation data is counted first; assuming the data contain X operation events, the occurrences of each of the X operation events are counted. The observation probability is then obtained by dividing the number of occurrences of each operation by the total number of occurrences of all operations.
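A per-class version of this count, matching the "in each class" wording used elsewhere in the document, can be sketched as follows. Counting operations inside the patterns assigned to each class is an interpretation, and the function name `observation_matrix` is hypothetical; operations are assumed to be renumbered 0 to n_ops-1.

```python
def observation_matrix(patterns, labels, k, n_ops):
    """Row c gives the relative frequency of each operation within class c."""
    counts = [[0] * n_ops for _ in range(k)]
    for pattern, cls in zip(patterns, labels):
        for op in pattern:
            counts[cls][op] += 1                 # occurrences of op inside class cls
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([x / total if total else 0.0 for x in row])
    return matrix
```

Each non-empty row sums to 1, so the result can be used directly as the observation (emission) matrix of the HMM.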

Step 5: behavior prediction: after frequent behavior pattern mining, behavior pattern clustering and behavior recognition have been applied to the operation sequence data, the operation data are automatically labeled with the corresponding terminal behaviors, with no manual labeling or verification. This yields the current behavior and the historical behavior of the terminal; the historical behavior is obtained by labeling the terminal's historical data.

A neural-network-based behavior prediction model of the terminal is then built on the labeled terminal behavior data. Neural-network-based prediction models can model time series data effectively and give predictions of higher accuracy. A Long Short-Term Memory (LSTM) network is taken as an example, although the method is not limited to it. An LSTM-based prediction model can effectively model time series data with long-term dependencies. Because terminal behavior data form a long, time-varying sequence with long-term dependencies, and the prediction of a terminal behavior depends on previous behaviors, the LSTM network is suitable for predictive modeling of labeled terminal behavior data. With this method, the overall accuracy of automatic terminal behavior labeling can reach 89.3%, and the top-2 accuracy of terminal behavior prediction can reach 92.37%.
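The network itself is not reproduced here, but the surrounding data handling can be illustrated. The sketch below shows one plausible way to turn a labeled behavior sequence into training samples for a next-behavior predictor and to compute a top-k accuracy such as the top-2 figure quoted above. The window-based sample layout, the function names, and the per-class score format are assumptions, not details given in the text.

```python
def make_samples(behaviors, window):
    """Pair each run of `window` past behaviors with the behavior that follows it."""
    return [(behaviors[i:i + window], behaviors[i + window])
            for i in range(len(behaviors) - window)]

def top_k_accuracy(score_rows, targets, k=2):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for scores, target in zip(score_rows, targets):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)
```

Here `score_rows` stands for whatever per-class scores the trained model outputs for each sample; a prediction counts as correct under top-2 accuracy if the true next behavior is one of the two classes with the highest scores.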

The above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may be modified, or some of their technical features may be replaced by equivalents, and that such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A terminal behavior prediction method based on automatic labeling is characterized by comprising the following steps:

2. The method for predicting terminal behavior based on automatic annotation of claim 1, wherein the terminal current behavior data in step S1 includes terminal EDR data and log information, and the terminal current behavior is obtained by automatic annotation according to the terminal EDR data.

3. The method for predicting terminal behavior based on automatic labeling according to claim 2, wherein the EDR data of the terminal comprises at least one of the following: UE ID, terminal operation sequence, operation execution time, operation duration and occupied physical resource information.

4. The method for predicting terminal behavior based on automatic annotation according to claim 1, wherein the method in step S2 specifically comprises:

5. The method for predicting terminal behavior based on automatic labeling of claim 1, wherein the method for clustering behavior patterns in step S3 comprises: initially selecting the cluster centers at random, and iteratively updating the cluster centers according to the edit distance until convergence.

6. The automatic labeling-based terminal behavior prediction method of claim 1, wherein the behavior recognition in step S4 corresponds to the decoding problem of the HMM, and the decoding problem of the HMM is solved by using the Viterbi algorithm.

7. The method of claim 6, wherein the terminal operation sequence is used as an observation sequence, and the clustered terminal behavior pattern is used as a hidden state, so as to calculate parameters required by the Viterbi algorithm, including an observation probability matrix, an initial state probability matrix and a state transition probability matrix.

8. The method for predicting terminal behavior based on automatic labeling according to claim 7, wherein the initial state probability matrix is calculated by: the total number of occurrences of all behavioral patterns in this class is divided by the total number of occurrences of all behavioral patterns in all classes.

9. The method for predicting terminal behavior based on automatic labeling according to claim 7, wherein the method for calculating the state transition probability matrix is as follows: in the course of the behavior pattern extraction, marking and recording the starting position and the ending position of each behavior pattern corresponding to the operation sequence data, comparing the recorded starting and ending subscripts with the starting and ending subscripts of each behavior pattern in other classes for each behavior pattern in one class, if the subscripts do not have an inclusion relationship, adding 1 to the number of transition states, and then dividing the number of transition states of each class by the total number of transition states to obtain the transition probability from each class to each other class.

10. The method of claim 7, wherein the observation probability matrix is calculated by dividing the total occurrence count of each operation by the total occurrence count of all operations in each class.