CN113378892A

CN113378892A - Multi-sequence comparison classification method based on mobile phone app use behavior data

Info

Publication number: CN113378892A
Application number: CN202110554096.6A
Authority: CN
Inventors: 陆艺; 李嘉晨; 马卫卫; 周建成
Original assignee: Nanjing Guangpu Information Technology Co ltd
Current assignee: Nanjing Guangpu Information Technology Co ltd
Priority date: 2021-05-20
Filing date: 2021-05-20
Publication date: 2021-09-10
Anticipated expiration: 2041-05-20
Also published as: CN113378892B

Abstract

The invention discloses a multi-sequence comparison and classification method based on mobile phone app usage behavior data, comprising the following steps: step 1, collecting app usage behavior data of a plurality of mobile phone users to form user behavior sequences; step 2, performing global matching on the user behavior sequences and constructing a distance matrix; step 3, building a tree from the distance matrix with the unweighted pair group method with arithmetic mean (UPGMA), and classifying accordingly. The method borrows from the phylogenetic analysis of biological sequences: it uses a global matching algorithm (the Needleman-Wunsch algorithm) to compute the distances between sequences and builds a tree for classification, so that crowd behaviors can be classified well even when the behavior values are weakly periodic and highly volatile.

Description

Multi-sequence comparison classification method based on mobile phone app use behavior data

Technical Field

The invention belongs to the technical field of big data classification, and particularly relates to an analysis method for classifying user behaviors of a mobile phone.

Background

Crowd classification is a technique for classifying attribute data or time-series data of groups of people and extracting what they have in common and where they differ, so that the members of one class are as similar as possible and the attribute characteristics of people with the same behavior can be mined further.

At present, user classification has great commercial value in precision marketing, user profiling and similar applications, and can greatly improve the conversion rate of marketing campaigns. With the spread of big data technology, the data infrastructure of many industries has matured and people's activities are recorded more completely, which lays a foundation for analysis based on people's activity data.

Traditional crowd classification methods fall into two types. The first is static classification: the classification is independent of the time sequence, and people are labeled along different attributes and dimensions and then divided according to those labels. This approach is good at combining multiple dimensions to separate people. The second classifies people by the similarity of their time-series behavior, using similarity measures such as DTW (dynamic time warping). It reflects well how similar people's behavior is within a given period, but it classifies poorly when the differences between behaviors are small, or when periodicity is weak and randomness is high.

Disclosure of Invention

The invention aims to provide a multi-sequence comparison and classification method based on mobile phone app usage behavior data. The method borrows from the phylogenetic analysis of biological sequences: it uses a global matching algorithm (the Needleman-Wunsch algorithm) to compute the distances between sequences and builds a tree for classification, so that crowd behaviors can be classified well even when the behavior values are weakly periodic and highly volatile.

In order to achieve the above purpose, the solution of the invention is:

a multi-sequence comparison classification method based on mobile phone app use behavior data comprises the following steps:

step 1, collecting app use behavior data of a plurality of mobile phone users to form a user behavior sequence;

step 2, carrying out global matching on the user behavior sequence and constructing a distance matrix;

step 3, building a tree from the distance matrix with the unweighted pair group method with arithmetic mean (UPGMA), and classifying accordingly.

In step 1, the specific method for processing the collected app usage behavior data of a plurality of mobile phone users to form a user behavior sequence is as follows:

step 11, collecting app usage behavior data of a plurality of mobile phone users, user(p): [(X1, t1), (X2, t2), …], where p denotes the p-th user, X1, X2, … denote the 1st, 2nd, … apps used, and t1, t2, … denote the usage durations of the corresponding apps;

step 12, deleting as noise the app usage records whose usage duration is less than a threshold, and extracting only the apps, in order, to obtain a new sequence as the user behavior sequence user(p): [X01, X02, …].

The specific process of the step 2 is as follows:

step 21, obtaining the matching with the highest score for the user behavior sequences by using the dynamic-programming-based Needleman-Wunsch algorithm;

step 22, comparing the mobile phone users pairwise by using the global matching algorithm, and then adding a third sequence on the basis of the pairwise comparison results to perform multi-sequence comparison;

the LCS for multi-sequence comparison is defined as follows:

for the aligned combination obtained from the sequences (u1, u2, u3, ..., un), $[A1_1 A2_1 A3_1 \ldots An_1\ \ A1_2 A2_2 A3_2 \ldots An_2\ \ A1_3 A2_3 A3_3 \ldots An_3\ \ \ldots\ \ A1_i A2_i A3_i \ldots An_i]$, in which the i-th position combines the characters $A1_i, A2_i, \ldots, An_i$ of the n aligned sequences, the LCS calculation with another independent sequence $c_1 c_2 c_3 c_4 \ldots c_j$ is defined as follows:

$$LCS(i, j) = LCS(A1_1 A2_1 A3_1 \ldots An_1\ \ A1_2 A2_2 A3_2 \ldots An_2\ \ \ldots\ \ A1_i A2_i A3_i \ldots An_i,\ \ c_1 c_2 c_3 c_4 \ldots c_j)$$

where $0 \le i \le N$ and $0 \le j \le M$.

For $1 \le i \le N$ and $1 \le j \le M$, the formula is as follows:

LCS(i，j)＝

step 23, calculating, from the global matching result, a similarity matrix between every two sequences, from which the distance matrix is obtained.

The specific process of the step 21 is as follows:

step 211, defining the following two sequences:

$A = a_1 a_2 a_3 a_4 \ldots a_N$, where A is composed of the N characters $a_1, a_2, a_3, a_4, \ldots, a_N$ and has length N;

$B = b_1 b_2 b_3 b_4 \ldots b_M$, where B is composed of the M characters $b_1, b_2, b_3, b_4, \ldots, b_M$ and has length M;

step 212, letting LCS(A, B) denote the longest common subsequence of string A and string B, and defining $LCS(i, j) = LCS(a_1 a_2 a_3 a_4 \ldots a_i,\ b_1 b_2 b_3 b_4 \ldots b_j)$, where $0 \le i \le N$ and $0 \le j \le M$.

In step 22, when the sequences are compared pairwise, a gap character '-' is first inserted at the head of each of the two sequences; it matches no character at any position and scores 0. The calculation results of each row are then obtained, which gives the length of the longest common subsequence. The positions of the matched characters are then determined by backtracking, and the matched characters are written out along the backtracking path, adding characters from the upper left corner to the lower right corner, which gives the aligned sequences of the two users.

When the matched positions are determined by backtracking, the procedure starts at the lower right corner of the matrix and backtracks cell by cell according to the following rules: first, if $a_i = b_j$, backtrack to the upper-left cell; if $a_i \ne b_j$, backtrack to whichever of the upper-left, upper and left cells has the largest value, and if several cells share the maximum value, take them in the priority order upper-left, upper, left; second, if the current cell is in the first row of the matrix, backtrack to the cell on its left; third, if the current cell is in the first column of the matrix, backtrack to the cell above it.

The specific rules for writing out the matched characters along the backtracking path, adding characters from the upper left corner to the lower right corner, are as follows:

if the path enters a cell diagonally, from its upper-left neighbor, $a_i$ is added to the matching string of the user corresponding to $a_i$, and $b_j$ is added to the matching string of the user corresponding to $b_j$;

if the path enters a cell vertically, from the cell above, $a_i$ is added to the matching string of the user corresponding to $a_i$, and a gap "-" is added to the matching string of the user corresponding to $b_j$;

if the path enters a cell horizontally, from the cell on its left, a gap "-" is added to the matching string of the user corresponding to $a_i$, and $b_j$ is added to the matching string of the user corresponding to $b_j$.

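In the above step 23, the similarity between user 1 and user 2 is defined as the ratio of the length of their longest common subsequence to the length of their matched sequence.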

the specific content of the step 3 is as follows:

step 31, merging the two sequences with the shortest distance in the distance matrix into a composite sequence group (AB), updating the distance matrix, and again merging the two sequences with the shortest distance in the updated distance matrix, until all the sequences are gathered into one class;

step 32, determining a tree structure according to the merging sequence;

and step 33, realizing classification according to the tree structure.

After adopting the scheme, the invention has the following characteristics:

(1) the invention takes the similarity of consecutive behaviors into account: through global comparison, people whose behavior sequences are not identical overall but contain similar segments of consecutive behaviors are clustered together, so that users with the same behavior habits can be clustered and classified;

(2) in real production and daily life, people's behaviors vary endlessly, and not everyone can be expected to follow the same trend; people can be judged to belong to the same class as long as they share enough similar behavior combinations. Therefore, unlike the DTW method, this method does not require the whole behavior sequences to be similar: sequences are placed in the same class as long as they share enough similar segments of consecutive behaviors, which makes the classification better suited to practical needs.

Drawings

FIG. 1 is a flow chart of the present invention;

fig. 2 is a schematic diagram of a tree structure obtained by the embodiment of the present invention.

Detailed Description

As shown in fig. 1, the present invention provides a multiple sequence comparison classification method based on mobile phone app usage behavior data, including the following steps:

(1) collecting app use data of a large number of mobile phone users;

(2) globally matching the user behavior sequence and constructing a distance matrix;

(3) building a tree from the distance matrix with the unweighted pair group method with arithmetic mean (UPGMA), and classifying accordingly.

To more particularly describe the process of the present invention, the following detailed description of specific embodiments of the present invention is provided.

Part 1: Basic data processing

1. Collect the app usage behavior data of the mobile phone users, comprising the order of use and the duration of use:

user1: [(WeChat, 300 seconds), (QQ, 2 seconds), (WeChat, 30 seconds), (Taobao, 20 seconds), (Honor of Kings, 2000 seconds), (Douyin, 1850 seconds), (WeChat, 30 seconds), (Taobao, 300 seconds)];

user2: [(WeChat, 300 seconds), (Taobao, 20 seconds), (WeChat, 2 seconds), (Taobao, 30 seconds), (Honor of Kings, 2000 seconds), (WeChat, 1 second), (Honor of Kings, 2000 seconds), (Douyin, 1850 seconds), (Taobao, 300 seconds), (WeChat, 30 seconds), (Honor of Kings, 2000 seconds), (WeChat, 1 second), (Honor of Kings, 2000 seconds), (Taobao, 3000 seconds)];

user3: [(Honor of Kings, 3000 seconds), (Douyin, 2000 seconds), (WeChat, 300 seconds)];

user4: [(Douyin, 3000 seconds), (Honor of Kings, 2000 seconds), (WeChat, 20 seconds), (QQ, 2 seconds), (WeChat, 200 seconds)];

2. Remove app usage records shorter than 5 seconds as noise and disregard the usage durations in subsequent processing, which gives the following sequences:

user1: [WeChat, WeChat, Taobao, Honor of Kings, Douyin, WeChat, Taobao];

user2: [WeChat, Taobao, Taobao, Honor of Kings, Honor of Kings, Douyin, Taobao, WeChat, Honor of Kings, Honor of Kings, Taobao];

user3: [Honor of Kings, Douyin, WeChat];

user4: [Douyin, Honor of Kings, WeChat, WeChat];
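The noise-removal step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `to_behavior_sequence` and the parameter `min_seconds` are hypothetical, the English app names are renderings of the apps in the embodiment, and the 5-second threshold is the one used in this embodiment.

```python
from typing import List, Tuple


def to_behavior_sequence(events: List[Tuple[str, float]],
                         min_seconds: float = 5.0) -> List[str]:
    """Drop records shorter than min_seconds and keep only the app names, in order."""
    return [app for app, seconds in events if seconds >= min_seconds]


# Raw records of user1 and user4 from the embodiment (durations in seconds).
user1_events = [("WeChat", 300), ("QQ", 2), ("WeChat", 30), ("Taobao", 20),
                ("Honor of Kings", 2000), ("Douyin", 1850),
                ("WeChat", 30), ("Taobao", 300)]
user4_events = [("Douyin", 3000), ("Honor of Kings", 2000),
                ("WeChat", 20), ("QQ", 2), ("WeChat", 200)]

print(to_behavior_sequence(user1_events))
print(to_behavior_sequence(user4_events))
```

Only the order of the remaining apps survives this step; the durations are used solely to decide which records count as noise.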

Part 2: Calculating the distance matrix by global sequence comparison

1. Description of the dynamic-programming-based Needleman-Wunsch algorithm:

The main idea of the method is to insert gaps so that the two rows contain as many identical character pairs as possible. A MATCH at corresponding positions scores 1; a MISMATCH or a GAP scores nothing. Ideally the two sequences match perfectly and the score is highest (any mismatch lowers it). In the worst case every character is aligned only against a gap, and the aligned sequence reaches its maximum length M + N (where M and N are the lengths of the two sequences); that is, the two sequences are completely different and cannot be matched by inserting gaps. To improve computational efficiency, dynamic programming is used to solve the global optimization problem: the longest common subsequence of A and B is computed first, and backtracking from its length then finds the matching with the highest score.

1) Define the following two sequences:

$A = a_1 a_2 a_3 a_4 \ldots a_N$, composed of N characters, of length N;

$B = b_1 b_2 b_3 b_4 \ldots b_M$, composed of M characters, of length M.

2) LCS(A, B) denotes the longest common subsequence of string A and string B. For example, if string A is kitten and string B is sitting, their longest common subsequence is ittn, of length 4. The characters of the longest common subsequence need not occur consecutively, but their order of occurrence must be the same. Define $LCS(i, j) = LCS(a_1 a_2 a_3 a_4 \ldots a_i,\ b_1 b_2 b_3 b_4 \ldots b_j)$, where $0 \le i \le N$ and $0 \le j \le M$;

if $a_i = b_j$, then $LCS(i, j) = LCS(i-1, j-1) + 1$;

if $a_i \ne b_j$, then $LCS(i, j) = \max(LCS(i-1, j-1),\ LCS(i-1, j),\ LCS(i, j-1))$.
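The recurrence above can be sketched in a few lines. This is an illustrative sketch only; the function name `lcs_matrix` is hypothetical, and row 0 and column 0 of the table are assumed to hold 0, corresponding to an empty prefix.

```python
def lcs_matrix(a, b):
    """Fill the LCS table for sequences a and b using the recurrence in the text."""
    n, m = len(a), len(b)
    # Row 0 and column 0 correspond to an empty prefix and stay at 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j - 1],
                                  table[i - 1][j],
                                  table[i][j - 1])
    return table


table = lcs_matrix("kitten", "sitting")
print(table[-1][-1])  # 4, the length of "ittn"
```

The function accepts any sequences of comparable items, so lists of app names work as well as strings.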

2. The method is illustrated with the actual sequences. First, the sequences of the 4 users are compared pairwise using the global matching algorithm.

user1: [WeChat, WeChat, Taobao, Honor of Kings, Douyin, WeChat, Taobao];

user2: [WeChat, Taobao, Taobao, Honor of Kings, Honor of Kings, Douyin, Taobao, WeChat, Honor of Kings, Honor of Kings, Taobao];

So that the LCS can be calculated for the first character of each sequence by the method described above, a gap character '-' is inserted at the head of both sequences; it matches no character at any position and scores 0.

1) The calculation result of the first row of the LCS matrix is as follows:

By analogy, the calculation results of the remaining rows are:

Thus LCS(A, B) = 6, i.e. the length of the longest common subsequence is 6.

2) Determine the positions of the matched characters by backtracking:

i. Start at the lower right corner of the matrix.

ii. Backtrack cell by cell:

If $a_i = b_j$, backtrack to the upper-left cell; if $a_i \ne b_j$, backtrack to whichever of the upper-left, upper and left cells has the largest value, and if several cells share the maximum value, take them in the priority order upper-left, upper, left.

If the current cell is in the first row of the matrix, backtrack to the cell on its left.

If the current cell is in the first column of the matrix, backtrack to the cell above it.

The backtracking results are as follows:

3) Write out the matched characters along the backtracking path, adding characters from the upper left corner to the lower right corner:

if the path enters a cell diagonally, from its upper-left neighbor, add $a_i$ to the matching string of user1 and $b_j$ to the matching string of user2;

if the path enters a cell vertically, from the cell above, add $a_i$ to the matching string of user1 and a gap "-" to the matching string of user2;

if the path enters a cell horizontally, from the cell on its left, add a gap "-" to the matching string of user1 and $b_j$ to the matching string of user2;

in this embodiment, the backtracking of user1 and user2 gives the following sequences:

user1: [WeChat, WeChat, Taobao, -, Honor of Kings, Douyin, -, WeChat, -, -, Taobao];

user2: [WeChat, Taobao, Taobao, Honor of Kings, Honor of Kings, Douyin, Taobao, WeChat, Honor of Kings, Honor of Kings, Taobao];

4) following the above procedure, (user1, user3), (user2, user3) were compared separately.
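The table filling, backtracking and write-out rules can be combined into one sketch. The function name `align` is hypothetical, and the tie-breaking order (upper-left, then upper, then left) is the priority order stated in the backtracking rules; the sequences are those obtained for user1 and user2 by applying the 5-second filter to the raw records of the embodiment.

```python
def align(a, b, gap="-"):
    """Globally align a and b; return (row_a, row_b, lcs_length)."""
    n, m = len(a), len(b)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j - 1], t[i - 1][j], t[i][j - 1])

    row_a, row_b = [], []
    i, j = n, m                       # start at the lower right corner
    while i > 0 or j > 0:
        if i == 0:
            move = "left"             # first row: only left is possible
        elif j == 0:
            move = "up"               # first column: only up is possible
        elif a[i - 1] == b[j - 1]:
            move = "diag"
        else:
            best = max(t[i - 1][j - 1], t[i - 1][j], t[i][j - 1])
            if t[i - 1][j - 1] == best:      # priority: upper-left, upper, left
                move = "diag"
            elif t[i - 1][j] == best:
                move = "up"
            else:
                move = "left"
        if move == "diag":
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    # The path was collected backwards, so reverse it to read left to right.
    return row_a[::-1], row_b[::-1], t[n][m]


user1 = ["WeChat", "WeChat", "Taobao", "Honor of Kings", "Douyin", "WeChat", "Taobao"]
user2 = ["WeChat", "Taobao", "Taobao", "Honor of Kings", "Honor of Kings", "Douyin",
         "Taobao", "WeChat", "Honor of Kings", "Honor of Kings", "Taobao"]
row1, row2, length = align(user1, user2)
print(length, len(row1))  # 6 11
print(row1)
```

Note that a diagonal step on a mismatch places two different apps in the same column, which is why the aligned length here is 11 rather than 12.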

3. A third sequence is added step by step on the basis of the pairwise comparison results to perform multi-sequence comparison.

1) The sequence matching method of the previous step gives the match between every two sequences. The ratio of the length of the longest common subsequence of two sequences to the length of their matched sequence is taken as the similarity between them. For example, the longest common subsequence of user1 and user2 has length 6 and the matched sequence has length 11, so the similarity is 6/11.

2) In this way, a similarity matrix between every two sequences is obtained.

3) The pair of compared sequences with the highest overall similarity is selected as the initial set; the sequence closest to the set is then selected and aligned with it, and so on, one sequence at a time. In this example, the initial set with the highest similarity is (user1, user2); user3 is the next closest to this set, followed by user4. The sequences are added to the set in this order for sequence matching.

The algorithm for multi-sequence matching is as follows: the combination of the characters at corresponding positions of the already matched sequences in the set is treated as a single character at the new position and is matched against the newly added sequence.

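The LCS for multi-sequence comparison is defined as follows:

$$LCS(i, j) = LCS(A1_1 A2_1 A3_1 \ldots An_1\ \ A1_2 A2_2 A3_2 \ldots An_2\ \ \ldots\ \ A1_i A2_i A3_i \ldots An_i,\ \ c_1 c_2 c_3 c_4 \ldots c_j)$$

where $0 \le i \le N$ and $0 \le j \le M$.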

For example, (user1, user2) is represented as:

[WeChat+WeChat, WeChat+Taobao, Taobao+Taobao, -+Honor of Kings, Honor of Kings+Honor of Kings, Douyin+Douyin, -+Taobao, WeChat+WeChat, -+Honor of Kings, -+Honor of Kings, Taobao+Taobao]. If a character of user3 is the same as the character of one of user1 and user2 at the corresponding position, one point is scored; if it is the same as the characters of both user1 and user2 at that position, two points are scored.

The LCS matrix of (user1, user2) and user3 in this embodiment is traced back according to the same rules as follows:

The matching result obtained is:

user1+user2:

[WeChat+WeChat, WeChat+Taobao, Taobao+Taobao, -+Honor of Kings, Honor of Kings+Honor of Kings, Douyin+Douyin, -+Taobao, WeChat+WeChat, -+Honor of Kings, -+Honor of Kings, Taobao+Taobao]

user3:

[-, -, -, -, Honor of Kings, Douyin, -, WeChat, -, -, -]
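The scoring of a new sequence against an already aligned set can be sketched as below. The exact recurrence is not reproduced in the text, so this sketch assumes one: each aligned column scores one point per member sequence whose character equals the new character, and the table takes the maximum over the diagonal, upper and left neighbors. The names `column_score` and `profile_lcs` are hypothetical.

```python
def column_score(column, char):
    """Points for placing char in this column: one per member sequence that matches."""
    return sum(1 for entry in column if entry == char)


def profile_lcs(columns, seq):
    """Best total score of seq against the aligned columns (assumed recurrence)."""
    n, m = len(columns), len(seq)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t[i][j] = max(t[i - 1][j - 1] + column_score(columns[i - 1], seq[j - 1]),
                          t[i - 1][j],
                          t[i][j - 1])
    return t[n][m]


# Aligned (user1, user2) columns from the embodiment; "-" marks a gap.
H, D, W, T = "Honor of Kings", "Douyin", "WeChat", "Taobao"
columns = [(W, W), (W, T), (T, T), ("-", H), (H, H), (D, D),
           ("-", T), (W, W), ("-", H), ("-", H), (T, T)]
user3 = [H, D, W]
print(profile_lcs(columns, user3))  # 6: two points each for H+H, D+D and W+W
```

Under this assumption, user3 scores highest when its three apps fall on the columns where user1 and user2 agree, which is consistent with the one-point and two-point scoring described above.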

4) User4 is then added and the process is repeated to obtain the matching matrix and the matching result of all the sequences.

[Aligned rows of user1 to user4; the individual entries are not recoverable from the source text.]

5) A similarity matrix between every two sequences is calculated according to the global matching result:
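As a sketch of how one entry of the similarity matrix could be computed from a pair of aligned rows: the similarity is the number of matching columns divided by the aligned length, and the number of differing columns is taken here as the distance, following the description of the distance matrix as the count of differing characters. The function names are hypothetical, and the aligned rows are those of user1 and user2 in the embodiment.

```python
from fractions import Fraction


def similarity(row_a, row_b, gap="-"):
    """Matching columns divided by the aligned length."""
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return Fraction(matches, len(row_a))


def distance(row_a, row_b, gap="-"):
    """Number of aligned columns that do not match (assumed distance measure)."""
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return len(row_a) - matches


H, D, W, T = "Honor of Kings", "Douyin", "WeChat", "Taobao"
aligned_user1 = [W, W, T, "-", H, D, "-", W, "-", "-", T]
aligned_user2 = [W, T, T, H, H, D, T, W, H, H, T]

print(similarity(aligned_user1, aligned_user2))  # 6/11
print(distance(aligned_user1, aligned_user2))    # 5
```

Repeating this for every pair of users fills the symmetric similarity matrix, and the distance matrix follows from it in the same way.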

Part 3: Building a tree from the similarity matrix of the global sequence matching result with the unweighted pair group method with arithmetic mean (UPGMA) to obtain the sequence classification

1. Description of the UPGMA procedure

The UPGMA algorithm first merges the two closest sequences into a composite sequence group. After merging, the distance matrix must be updated by calculating the distance between the new group (AB) and the other sequences C and D, for example:

$$d_{(AB)C} = \tfrac{1}{2}\,(d_{AC} + d_{BC}); \qquad d_{(AB)D} = \tfrac{1}{2}\,(d_{AD} + d_{BD});$$

The closest sequences in the new matrix are then merged into a composite sequence group again, and these steps are repeated until all the sequences are grouped into one class.
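The merging procedure can be sketched as follows. The function name `upgma` and the input format are hypothetical, and the distances in the usage example are made-up values for illustration, not figures from the embodiment. The update weights each group by its size, as UPGMA does in general; for two single sequences A and B this reduces to the formula above, $\tfrac{1}{2}(d_{AC} + d_{BC})$.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return (tree, merges); dist maps frozenset({x, y}) to a distance."""
    d = dict(dist)
    clusters = {name: (name, 1) for name in labels}   # label -> (subtree, size)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current groups.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        if clusters[a][1] < clusters[b][1]:            # list the larger group first
            a, b = b, a
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Size-weighted average distance from the merged group to every other group.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
        merges.append((a, b))
    tree, _size = next(iter(clusters.values()))
    return tree, merges


# Hypothetical distances between four users, for illustration only.
example = {
    frozenset(("user1", "user2")): 5, frozenset(("user1", "user3")): 8,
    frozenset(("user2", "user3")): 8, frozenset(("user1", "user4")): 9,
    frozenset(("user2", "user4")): 9, frozenset(("user3", "user4")): 9,
}
tree, merges = upgma(["user1", "user2", "user3", "user4"], example)
print(tree)  # ((('user1', 'user2'), 'user3'), 'user4')
```

The returned merge order is what determines the tree; the nested tuple is simply one way of writing it down.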

2. The specific calculation steps are illustrated by way of example

A distance matrix is obtained on the basis of the global comparison algorithm. The distance matrix is the complement of the similarity matrix: the similarity matrix counts the identical characters, and the distance matrix counts the differing characters:

The sequences user1 and user2 are grouped into the same class, and a new distance matrix is then calculated:

The smallest distance is now between user1+user2 and user3.

The sequences user1+user2 and user3 are grouped into the same class, and a new distance matrix is then calculated:

3. Determine the tree structure from the merging order

The merging order of the previous step is: user1 and user2 are merged first; user3 is merged next; and user4 is added last. The resulting tree structure is as follows, and is also shown in FIG. 2:

(((user1, user2), user3), user4)

Therefore, when the original sequences are divided into 3 classes: user1 and user2 form the first class; user3 is the second class; user4 is the third class;

when the original sequences are divided into 2 classes: user1, user2 and user3 form the first class; user4 is the second class;

when the original sequences are divided into 1 class: user1, user2, user3 and user4 all belong to the same class.
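Reading classes off the tree amounts to replaying the merges and stopping once the desired number of groups remains. A minimal sketch, assuming the merge order of this embodiment; the function name `cut_into_classes` and the way merges are written (as pairs of member lists) are hypothetical.

```python
def cut_into_classes(labels, merges, k):
    """Replay the first len(labels) - k merges and return the remaining groups."""
    groups = [frozenset([name]) for name in labels]
    for left, right in merges[: len(labels) - k]:
        left, right = frozenset(left), frozenset(right)
        # Replace the two merged groups by their union.
        groups = [g for g in groups if g != left and g != right] + [left | right]
    return sorted(sorted(g) for g in groups)


labels = ["user1", "user2", "user3", "user4"]
# Merge order from the embodiment: (user1, user2), then user3, then user4.
merges = [(["user1"], ["user2"]),
          (["user1", "user2"], ["user3"]),
          (["user1", "user2", "user3"], ["user4"])]

for k in (3, 2, 1):
    print(k, cut_into_classes(labels, merges, k))
```

Choosing k is the only decision left to the analyst; the grouping for each k is fixed by the merge order.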

It should be noted that a class in the present invention expresses people's mobile phone usage habits well: it reflects not merely how the time spent on each app differs, but the switching relationships between apps. Without counting the number of switches between apps or the number of times each app is opened, people with the same pattern of spending time can be grouped together more accurately. In addition, the classification does not depend on the setting of thresholds and the like, which avoids human intervention.

For example, in this embodiment all users use WeChat, Douyin and Honor of Kings, but their usage habits are not exactly the same. The first class (user1, user2) tends to open the apps in combination many times, whereas user3 and user4 open them only a few times. If the classification were done by setting a threshold, a suitable threshold would be hard to find; with the classification method provided by the invention, people's shared habits can be extracted accurately and a more accurate classification obtained.

The above embodiment only illustrates the technical idea of the present invention and does not limit its scope of protection; any modification made to the technical scheme in accordance with the technical idea of the present invention falls within the scope of protection of the present invention.

Claims

1. A multi-sequence comparison classification method based on mobile phone app use behavior data is characterized by comprising the following steps:

2. The method of claim 1, wherein: in the step 1, the specific method for processing the collected app usage behavior data of a plurality of mobile phone users to form the user behavior sequence is as follows:

step 11, collecting app usage behavior data of a plurality of mobile phone users, user(p): [(X1, t1), (X2, t2), …], where p denotes the p-th user, X1, X2, … denote the 1st, 2nd, … apps used, and t1, t2, … denote the usage durations of the corresponding apps;

step 12, deleting as noise the app usage records whose usage duration is less than a threshold, and extracting only the apps, in order, to obtain a new sequence as the user behavior sequence user(p): [X01, X02, …].

3. The method of claim 1, wherein: the specific process of the step 2 is as follows:

the LCS for multi-sequence comparison is defined as follows:

for the aligned combination obtained from the sequences (u1, u2, u3, ..., un), $[A1_1 A2_1 A3_1 \ldots An_1\ \ A1_2 A2_2 A3_2 \ldots An_2\ \ A1_3 A2_3 A3_3 \ldots An_3\ \ \ldots\ \ A1_i A2_i A3_i \ldots An_i]$, the LCS calculation with another independent sequence is defined as follows:

$$LCS(i, j) = LCS(A1_1 A2_1 A3_1 \ldots An_1\ \ A1_2 A2_2 A3_2 \ldots An_2\ \ \ldots\ \ A1_i A2_i A3_i \ldots An_i,\ \ c_1 c_2 c_3 c_4 \ldots c_j)$$

where $0 \le i \le N$ and $0 \le j \le M$.

4. The method of claim 3, wherein: the specific process of step 21 is:

step 211, defining the following two sequences:

$A = a_1 a_2 a_3 a_4 \ldots a_N$, where A is composed of the N characters $a_1, a_2, a_3, a_4, \ldots, a_N$ and has length N;

$B = b_1 b_2 b_3 b_4 \ldots b_M$, where B is composed of the M characters $b_1, b_2, b_3, b_4, \ldots, b_M$ and has length M;

step 212, letting LCS(A, B) denote the longest common subsequence of string A and string B, and defining $LCS(i, j) = LCS(a_1 a_2 a_3 a_4 \ldots a_i,\ b_1 b_2 b_3 b_4 \ldots b_j)$, where $0 \le i \le N$ and $0 \le j \le M$.

5. The method of claim 4, wherein: in the step 22, when the sequences are compared pairwise, a gap character '-' is first inserted at the head of each of the two sequences; it matches no character at any position and scores 0; the calculation results of each row are then obtained, which gives the length of the longest common subsequence; the positions of the matched characters are then determined by backtracking, and the matched characters are written out along the backtracking path, adding characters from the upper left corner to the lower right corner, which gives the aligned sequences of the two users.

6. The method of claim 5, wherein: when the matched positions are determined by backtracking, the procedure starts at the lower right corner of the matrix and backtracks cell by cell according to the following rules: first, if $a_i = b_j$, backtrack to the upper-left cell; if $a_i \ne b_j$, backtrack to whichever of the upper-left, upper and left cells has the largest value, and if several cells share the maximum value, take them in the priority order upper-left, upper, left; second, if the current cell is in the first row of the matrix, backtrack to the cell on its left; third, if the current cell is in the first column of the matrix, backtrack to the cell above it.

7. The method of claim 5, wherein: the specific rules for writing out the matched characters along the backtracking path, adding characters from the upper left corner to the lower right corner, are as follows:

if the path enters a cell vertically, from the cell above, $a_i$ is added to the matching string of the user corresponding to $a_i$, and a gap "-" is added to the matching string of the user corresponding to $b_j$;

if the path enters a cell horizontally, from the cell on its left, a gap "-" is added to the matching string of the user corresponding to $a_i$, and $b_j$ is added to the matching string of the user corresponding to $b_j$.

8. The method of claim 4, wherein: in step 23, the similarity between the users 1 and 2 is defined as:

9. The method of claim 1, wherein: the specific content of the step 3 is as follows:

step 32, determining a tree structure according to the merging sequence;

and step 33, realizing classification according to the tree structure.