CN113378892A - Multi-sequence comparison classification method based on mobile phone app use behavior data - Google Patents
- Publication number
- CN113378892A (application number CN202110554096.6)
- Authority
- CN
- China
- Prior art keywords
- sequence
- equal
- sequences
- user
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-sequence comparison classification method based on mobile phone app usage behavior data, comprising the following steps: step 1, collecting app usage behavior data of a plurality of mobile phone users to form user behavior sequences; step 2, globally matching the user behavior sequences and constructing a distance matrix; and step 3, building a tree from the distance matrix by the unweighted pair group method with arithmetic mean (UPGMA), thereby performing classification. The method borrows from the phylogenetic analysis of biological sequences: it uses a global matching algorithm (the Needleman-Wunsch algorithm) to compute the distances between sequences and builds a tree for classification, so that crowd behaviors can be classified well even when the behavior data is weakly periodic and highly volatile.
Description
Technical Field
The invention belongs to the technical field of big data classification, and particularly relates to an analysis method for classifying user behaviors of a mobile phone.
Background
Crowd classification is a technique that classifies the attribute data or time-series data of groups of people and extracts what they have in common and where they differ, so that the members of one class are as similar as possible and the attribute characteristics of people with the same behavior can be mined further.
At present, user classification has great commercial value in areas such as precision marketing and user profiling, and can greatly improve the conversion rate of marketing campaigns. With the spread of big data technology, the data infrastructure of many industries has matured and people's activities are recorded more completely, which lays a foundation for analysis based on activity data.
Traditional crowd classification methods fall into two types. The first is static classification: it is independent of time order, and people are labeled along different attributes and dimensions and divided according to those labels. This approach combines multiple dimensions well when separating groups. The second classifies people by their behavior over time, applying similarity measures such as DTW (dynamic time warping) to the time series. It reflects the similarity of behavior within a set time period well, but it performs poorly when the differences between behaviors are small, or when the behavior is weakly periodic and highly random.
Disclosure of Invention
The invention aims to provide a multi-sequence comparison classification method based on mobile phone app usage behavior data. The method borrows from the phylogenetic analysis of biological sequences: it uses a global matching algorithm (the Needleman-Wunsch algorithm) to compute the distances between sequences and builds a tree for classification, so that crowd behaviors can be classified well even when the behavior data is weakly periodic and highly volatile.
In order to achieve the above purpose, the solution of the invention is:
A multi-sequence comparison classification method based on mobile phone app usage behavior data comprises the following steps:
Step 1, collecting app usage behavior data of a plurality of mobile phone users to form user behavior sequences;
Step 2, globally matching the user behavior sequences and constructing a distance matrix;
Step 3, building a tree from the distance matrix by the unweighted pair group method with arithmetic mean (UPGMA), thereby performing classification.
In step 1, the collected app usage behavior data of the plurality of mobile phone users is processed into user behavior sequences as follows:
Step 11, collecting app usage behavior data of a plurality of mobile phone users, user(p): [(X1, t1), (X2, t2), …], where p denotes the p-th user, X1, X2, … denote the 1st, 2nd, … app used, and t1, t2, … denote the usage duration of the corresponding app;
Step 12, deleting as noise the app usage records whose usage duration is less than a threshold, and extracting only the apps, in order, to obtain a new sequence as the user behavior sequence user(p): [X01, X02, …].
The specific process of the step 2 is as follows:
Step 21, obtaining the highest-scoring matching of the user behavior sequences with the dynamic-programming-based Needleman-Wunsch algorithm;
Step 22, comparing the mobile phone users pairwise with the global matching algorithm, then adding a third sequence on the basis of the pairwise results to perform multi-sequence comparison;
The LCS for multi-sequence matching is defined as follows:
For the aligned sequence combination obtained from $(u_1, u_2, u_3, \dots, u_n)$, written column by column as $[A^{1}_{1} A^{2}_{1} A^{3}_{1} \dots A^{n}_{1}\; A^{1}_{2} A^{2}_{2} A^{3}_{2} \dots A^{n}_{2}\; A^{1}_{3} A^{2}_{3} A^{3}_{3} \dots A^{n}_{3}\; \dots\; A^{1}_{i} A^{2}_{i} A^{3}_{i} \dots A^{n}_{i}]$, where $A^{k}_{i}$ is the character of $u_k$ at aligned position $i$, the LCS with another independent sequence is defined as:
$\mathrm{LCS}(i, j) = \mathrm{LCS}(A^{1}_{1} A^{2}_{1} A^{3}_{1} \dots A^{n}_{1}\; A^{1}_{2} A^{2}_{2} A^{3}_{2} \dots A^{n}_{2}\; \dots\; A^{1}_{i} A^{2}_{i} A^{3}_{i} \dots A^{n}_{i},\; c_1 c_2 c_3 c_4 \dots c_j)$, where $0 \le i \le N$ and $0 \le j \le M$.
For $1 \le i \le N$ and $1 \le j \le M$, the formula is as follows:
$\mathrm{LCS}(i, j) =$ [formula not reproduced in the source text]
Step 23, calculating the similarity matrix between every two sequences from the global matching result, which gives the distance matrix.
The specific process of step 21 is as follows:
Step 211, defining the following two sequences:
$A = a_1 a_2 a_3 a_4 \dots a_N$, where A consists of the N characters $a_1, a_2, a_3, a_4, \dots, a_N$ and has length N;
$B = b_1 b_2 b_3 b_4 \dots b_M$, where B consists of the M characters $b_1, b_2, b_3, b_4, \dots, b_M$ and has length M;
Step 212, LCS(A, B) denotes the longest common subsequence of string A and string B, and $\mathrm{LCS}(i, j) = \mathrm{LCS}(a_1 a_2 a_3 a_4 \dots a_i,\; b_1 b_2 b_3 b_4 \dots b_j)$, where $0 \le i \le N$ and $0 \le j \le M$.
In step 22, when two sequences are compared, a space character '-' is first inserted at the head of each of the two sequences; the space character matches no character at any position and scores 0. The results of every row are then calculated, giving the length of the longest common subsequence. The positions of the matched characters are then determined by backtracking, and the matched characters are written out along the backtracking path, from the upper-left corner to the lower-right corner, giving the aligned result of the two sequences.
When the matched positions are determined by backtracking, the procedure starts at the lower-right corner of the matrix and backtracks cell by cell according to the following rules: first, if $a_i = b_j$, backtrack to the upper-left cell; otherwise backtrack to the cell with the largest value among the upper-left, upper, and left cells, and if several cells share the largest value, the priority order is upper-left, upper, left; second, if the current cell is in the first row of the matrix, backtrack to the cell on the left; third, if the current cell is in the first column of the matrix, backtrack to the cell above.
The matched characters are written out along the backtracking path, from the upper-left corner to the lower-right corner, according to the following rules:
If the path moves to the lower-right cell, $a_i$ is added to the matching string of its user and $b_j$ is added to the matching string of its user;
If the path moves to the cell directly below, $a_i$ is added to the matching string of its user and the gap '-' is added to the matching string of the user corresponding to $b_j$;
If the path moves to the cell on the right, the gap '-' is added to the matching string of the user corresponding to $a_i$ and $b_j$ is added to the matching string of its user.
The specific content of step 3 is as follows:
Step 31, merging the two sequences with the shortest distance in the distance matrix into a composite sequence group (AB), updating the distance matrix, and again merging the two closest sequences in the updated distance matrix, until all sequences are gathered into one class;
Step 32, determining the tree structure from the merging order;
Step 33, performing classification according to the tree structure.
With the above scheme, the invention has the following characteristics:
(1) The invention takes the order of consecutive behaviors into account: through global comparison, people whose behavior sequences are not identical overall but share similar segments are clustered together, so that users with the same behavior habits can be grouped and classified;
(2) In real life, people's behaviors vary endlessly; not everyone can be expected to follow the same trend, and people can be judged to belong to the same class as long as they share enough similar behavior combinations. Therefore, compared with the DTW method, this method does not require the whole behavior sequences to be similar; sequences fall into the same class as long as they contain enough similar segments of consecutive behaviors, which makes the classification better suited to practical needs.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the tree structure obtained by the embodiment of the present invention.
Detailed Description
As shown in FIG. 1, the present invention provides a multi-sequence comparison classification method based on mobile phone app usage behavior data, including the following steps:
(1) collecting app usage data of a large number of mobile phone users;
(2) globally matching the user behavior sequences and constructing a distance matrix;
(3) building a tree from the distance matrix by the unweighted pair group method with arithmetic mean (UPGMA), thereby performing classification.
To more particularly describe the process of the present invention, the following detailed description of specific embodiments of the present invention is provided.
Part 1: basic data processing
1. Collect the app usage behavior data of mobile phone users, including the order of use and the usage durations:
user1: [(WeChat, 300 seconds), (QQ, 2 seconds), (WeChat, 30 seconds), (Taobao, 20 seconds), (Honor of Kings, 2000 seconds), (Douyin, 1850 seconds), (WeChat, 30 seconds), (Taobao, 300 seconds)];
user2: [(WeChat, 300 seconds), (Taobao, 20 seconds), (WeChat, 2 seconds), (Taobao, 30 seconds), (Honor of Kings, 2000 seconds), (WeChat, 1 second), (Honor of Kings, 2000 seconds), (Douyin, 1850 seconds), (Taobao, 300 seconds), (WeChat, 30 seconds), (Honor of Kings, 2000 seconds), (WeChat, 1 second), (Honor of Kings, 2000 seconds), (Taobao, 3000 seconds)];
user3: [(Honor of Kings, 3000 seconds), (Douyin, 2000 seconds), (WeChat, 300 seconds)];
user4: [(Douyin, 3000 seconds), (Honor of Kings, 2000 seconds), (WeChat, 20 seconds), (QQ, 2 seconds), (WeChat, 200 seconds)];
2. Remove as noise the app usage records with a usage duration of less than 5 seconds and disregard the usage durations in subsequent processing, which gives the following sequences:
user1: [WeChat, WeChat, Taobao, Honor of Kings, Douyin, WeChat, Taobao];
user2: [WeChat, Taobao, Taobao, Honor of Kings, Honor of Kings, Douyin, Taobao, WeChat, Honor of Kings, Honor of Kings, Taobao];
user3: [Honor of Kings, Douyin, WeChat];
user4: [Douyin, Honor of Kings, WeChat, WeChat];
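The noise-removal step can be sketched in a few lines of Python. This is only an illustration: the function name `to_behavior_sequence` and the parameter `min_seconds` are hypothetical, the 5-second threshold is the one used in this embodiment, and the app names are written in English as WeChat, QQ, Taobao, Honor of Kings and Douyin.

```python
# Hypothetical helper, not part of the patent text: drops short usage records
# (noise) and keeps only the app names, preserving their order.
def to_behavior_sequence(events, min_seconds=5):
    """events is a list of (app_name, duration_in_seconds) pairs."""
    return [app for app, seconds in events if seconds >= min_seconds]


# Raw usage data of user1 from the embodiment.
user1_events = [
    ("WeChat", 300), ("QQ", 2), ("WeChat", 30), ("Taobao", 20),
    ("Honor of Kings", 2000), ("Douyin", 1850), ("WeChat", 30), ("Taobao", 300),
]

user1 = to_behavior_sequence(user1_events)
print(user1)
```

Only the 2-second QQ record falls below the threshold, so user1 keeps seven entries, including the two consecutive WeChat entries.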
Part 2: calculating the distance matrix by global sequence comparison
1. Description of the dynamic-programming-based Needleman-Wunsch algorithm:
The main idea of the method is to insert gaps so that the upper and lower rows contain as many identical character pairs as possible. A matching position (MATCH) scores 1, and a mismatch (MISMATCH) or a gap (GAP) scores nothing. Ideally the two sequences match completely and the score is highest (any mismatch lowers the score). In the worst case every character is aligned only with a gap and the aligned length reaches its maximum of M + N (M and N being the lengths of the two sequences), meaning the two sequences are completely different and cannot be matched by inserting gaps. To improve computing efficiency, dynamic programming is used to solve this global optimization problem: the longest common subsequence of A and B is computed first, and backtracking from its length then finds the highest-scoring matching.
1) Define the following two sequences:
$A = a_1 a_2 a_3 a_4 \dots a_N$, where A consists of the N characters $a_1, a_2, a_3, a_4, \dots, a_N$ and has length N;
$B = b_1 b_2 b_3 b_4 \dots b_M$, where B consists of the M characters $b_1, b_2, b_3, b_4, \dots, b_M$ and has length M;
2) LCS(A, B) denotes the longest common subsequence of string A and string B. For example, if string A is kitten and string B is sitting, their longest common subsequence is ittn, of length 4. The characters of the longest common subsequence need not be contiguous, but their order must be preserved. Define $\mathrm{LCS}(i, j) = \mathrm{LCS}(a_1 a_2 a_3 a_4 \dots a_i,\; b_1 b_2 b_3 b_4 \dots b_j)$, where $0 \le i \le N$ and $0 \le j \le M$;
For $1 \le i \le N$ and $1 \le j \le M$, the formula is as follows:
If $a_i = b_j$, then $\mathrm{LCS}(i, j) = \mathrm{LCS}(i-1, j-1) + 1$;
If $a_i \ne b_j$, then $\mathrm{LCS}(i, j) = \max(\mathrm{LCS}(i-1, j-1),\, \mathrm{LCS}(i-1, j),\, \mathrm{LCS}(i, j-1))$.
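The recurrence above can be sketched as a small table-filling routine. The function name `lcs_table` is hypothetical, and row 0 and column 0 are assumed to hold the value 0, which corresponds to the leading space character that scores 0 in the embodiment.

```python
# Minimal sketch of the LCS recurrence; names are illustrative only.
def lcs_table(a, b):
    """Return the (len(a)+1) x (len(b)+1) table of LCS(i, j) values."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 / column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # a_i = b_j: extend the match found for the two prefixes
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # a_i != b_j: carry over the best neighbouring value
                table[i][j] = max(table[i - 1][j - 1],
                                  table[i - 1][j],
                                  table[i][j - 1])
    return table


# The kitten / sitting example from the text: the bottom-right cell is 4.
assert lcs_table("kitten", "sitting")[-1][-1] == 4
```

The bottom-right cell of the table is LCS(N, M), the length of the longest common subsequence of the two full sequences. Because the routine only compares elements for equality, it works on lists of app names as well as on strings.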
2. The method is illustrated with actual sequences: first, the four users' sequences are compared pairwise with the global matching algorithm.
user1: [WeChat, WeChat, Taobao, Honor of Kings, Douyin, WeChat, Taobao];
user2: [WeChat, Taobao, Taobao, Honor of Kings, Honor of Kings, Douyin, Taobao, WeChat, Honor of Kings, Honor of Kings, Taobao];
To make it easy to compute the LCS at every position, including the first character of each sequence, a space character '-' is inserted at the head of both sequences; it matches no character at any position and scores 0.
1) The calculation result of the first row of the LCS matrix is as follows:
By analogy, the calculation results of the remaining rows are:
So LCS(A, B) = 6, that is, the length of the longest common subsequence is 6.
2) Determine the matched positions by backtracking:
i. Start at the lower-right corner of the matrix.
ii. Backtrack cell by cell:
If $a_i = b_j$, backtrack to the upper-left cell; otherwise backtrack to the cell with the largest value among the upper-left, upper, and left cells. If several cells share the largest value, the priority order is upper-left, upper, left.
If the current cell is in the first row of the matrix, backtrack to the cell on the left.
If the current cell is in the first column of the matrix, backtrack to the cell above.
The backtracking results are as follows:
3) Write out the matched characters along the backtracking path, from the upper-left corner to the lower-right corner:
If the path moves to the lower-right cell, add $a_i$ to the matching string of user1 and $b_j$ to the matching string of user2;
If the path moves to the cell directly below, add $a_i$ to the matching string of user1 and the gap '-' to the matching string of user2;
If the path moves to the cell on the right, add the gap '-' to the matching string of user1 and $b_j$ to the matching string of user2;
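The backtracking and write-out rules can be sketched as follows. This is one reading of the rules, not a definitive implementation: the function name `global_align` is hypothetical, rows of the matrix are assumed to correspond to the first sequence and columns to the second, and the exact gap placement it produces depends on the tie-breaking order and is not claimed to reproduce the figures of the embodiment.

```python
# Sketch of table filling plus backtracking; names are illustrative only.
def global_align(a, b, gap="-"):
    n, m = len(a), len(b)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j - 1], t[i - 1][j], t[i][j - 1])

    row_a, row_b = [], []
    i, j = n, m                      # start at the lower-right corner
    while i > 0 or j > 0:
        if i == 0:                   # first row: only "left" is possible
            move = "left"
        elif j == 0:                 # first column: only "up" is possible
            move = "up"
        elif a[i - 1] == b[j - 1]:   # match: go to the upper-left cell
            move = "diag"
        else:                        # largest neighbour, priority diag > up > left
            best = max(t[i - 1][j - 1], t[i - 1][j], t[i][j - 1])
            if t[i - 1][j - 1] == best:
                move = "diag"
            elif t[i - 1][j] == best:
                move = "up"
            else:
                move = "left"
        if move == "diag":
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":           # a_i is paired with a gap
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:                        # b_j is paired with a gap
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    # The path was collected from the end, so reverse it before returning.
    return row_a[::-1], row_b[::-1], t[n][m]


row_a, row_b, lcs_len = global_align("kitten", "sitting")
print("".join(row_a))
print("".join(row_b))
print(lcs_len)
```

Both returned rows have the same length, removing the gaps from each row gives back the original sequence, and the number of columns in which the two rows agree equals the LCS length.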
In this embodiment, the aligned result of user1 and user2 after backtracking is:
4) Following the above procedure, (user1, user3) and (user2, user3) are compared in the same way.
3. On the basis of the pairwise results, a third sequence is added step by step to perform multi-sequence comparison.
1) The sequence matching method of the previous step gives the matching between every two sequences. The ratio of the longest common subsequence length of two sequences to the length of their matched (aligned) sequence is taken as the similarity between them. For example, the longest common subsequence length of user1 and user2 is 6 and the aligned sequence length is 11, so the similarity is 6/11.
2) In this way, the similarity matrix between every two sequences is obtained.
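As a small illustration of this similarity measure, the sketch below counts the columns in which two aligned rows agree and divides by the aligned length. The function name `alignment_similarity` is hypothetical, and the two rows are the 11-column alignment of user1 and user2 given in this embodiment, with the app names written in English.

```python
# Illustrative only: similarity = matching columns / aligned length.
def alignment_similarity(row_a, row_b, gap="-"):
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return matches, len(row_a), matches / len(row_a)


user1_row = ["WeChat", "WeChat", "Taobao", "-", "Honor of Kings", "Douyin",
             "-", "WeChat", "-", "-", "Taobao"]
user2_row = ["WeChat", "Taobao", "Taobao", "Honor of Kings", "Honor of Kings",
             "Douyin", "Taobao", "WeChat", "Honor of Kings", "Honor of Kings",
             "Taobao"]

matches, length, sim = alignment_similarity(user1_row, user2_row)
print(matches, length, round(sim, 3))
```

For these two rows the count is 6 matching columns out of 11, which is the 6/11 ratio of the example.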
3) The pair of aligned sequences with the highest overall similarity is selected as the initial set; the sequence closest to this set is then selected and matched against it, and the calculation proceeds step by step in this way. In this example, the initial set with the highest similarity is (user1, user2); user3 is closest to this set, followed by user4. The sequences are added to the set in this order for sequence matching.
The algorithm for multi-sequence matching is as follows: the combination of the characters at corresponding positions of the already-matched sequences in the set is treated as a single character at the new position and matched against the newly added sequence.
The LCS for multi-sequence matching is defined as follows:
For the aligned sequence combination obtained from $(u_1, u_2, u_3, \dots, u_n)$, written column by column as $[A^{1}_{1} A^{2}_{1} A^{3}_{1} \dots A^{n}_{1}\; A^{1}_{2} A^{2}_{2} A^{3}_{2} \dots A^{n}_{2}\; A^{1}_{3} A^{2}_{3} A^{3}_{3} \dots A^{n}_{3}\; \dots\; A^{1}_{i} A^{2}_{i} A^{3}_{i} \dots A^{n}_{i}]$, where $A^{k}_{i}$ is the character of $u_k$ at aligned position $i$, the LCS with another independent sequence is defined as:
$\mathrm{LCS}(i, j) = \mathrm{LCS}(A^{1}_{1} A^{2}_{1} A^{3}_{1} \dots A^{n}_{1}\; A^{1}_{2} A^{2}_{2} A^{3}_{2} \dots A^{n}_{2}\; \dots\; A^{1}_{i} A^{2}_{i} A^{3}_{i} \dots A^{n}_{i},\; c_1 c_2 c_3 c_4 \dots c_j)$, where $0 \le i \le N$ and $0 \le j \le M$.
For $1 \le i \le N$ and $1 \le j \le M$, the formula is as follows:
For example, (user1, user2) is represented as:
[WeChat+WeChat, WeChat+Taobao, Taobao+Taobao, -+Honor of Kings, Honor of Kings+Honor of Kings, Douyin+Douyin, -+Taobao, WeChat+WeChat, -+Honor of Kings, -+Honor of Kings, Taobao+Taobao]. If a character of user3 equals the character at the corresponding position of one of user1 and user2, it scores one point; if it equals the characters of both user1 and user2 at that position, it scores two points.
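The scoring rule for matching a new sequence against an already-aligned set can be sketched as below. Because the recurrence for the multi-sequence case is not reproduced in the text, the dynamic program here is an assumption: it reuses the shape of the pairwise recurrence and replaces the "+1" by the number of sequences in the column that carry the same character. The names `column_score` and `profile_score` are hypothetical.

```python
# Assumed scoring, following the one-point / two-point rule in the text:
# a column scores one point per aligned sequence whose character equals ch.
def column_score(column, ch, gap="-"):
    return sum(1 for x in column if x == ch and x != gap)


def profile_score(columns, seq):
    """Best total score of aligning seq against the aligned columns."""
    n, m = len(columns), len(seq)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + column_score(columns[i - 1], seq[j - 1]),
                s[i - 1][j],
                s[i][j - 1],
            )
    return s[n][m]


user1_row = ["WeChat", "WeChat", "Taobao", "-", "Honor of Kings", "Douyin",
             "-", "WeChat", "-", "-", "Taobao"]
user2_row = ["WeChat", "Taobao", "Taobao", "Honor of Kings", "Honor of Kings",
             "Douyin", "Taobao", "WeChat", "Honor of Kings", "Honor of Kings",
             "Taobao"]
columns = list(zip(user1_row, user2_row))   # one tuple per aligned position
user3 = ["Honor of Kings", "Douyin", "WeChat"]

print(profile_score(columns, user3))
```

Under this assumed scoring, each of the three apps of user3 can be placed on a column where both user1 and user2 show the same app, so the best total is 2 + 2 + 2 = 6.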
The LCS matrix of (user1, user2) and user3 in this embodiment is traced back according to the same rules as follows:
The matching result is:
user1+user2:
[WeChat+WeChat, WeChat+Taobao, Taobao+Taobao, -+Honor of Kings, Honor of Kings+Honor of Kings, Douyin+Douyin, -+Taobao, WeChat+WeChat, -+Honor of Kings, -+Honor of Kings, Taobao+Taobao]
user3:
[-, -, -, -, Honor of Kings, Douyin, -, WeChat, -, -, -]
4) User4 is then added and the process is repeated, giving the matching matrix and the matching result of all sequences.
[Aligned sequences of user1, user2, user3 and user4; the individual entries are not recoverable from the source text.]
5) The similarity matrix between every two sequences is calculated from the global matching result:
Part 3: building a tree from the similarity matrix of the global sequence matching result by the unweighted pair group method with arithmetic mean (UPGMA) to obtain the sequence classification
1. Description of the UPGMA process
The UPGMA algorithm first merges the two closest sequences into a composite sequence group. After the merge, the distance matrix must be updated by calculating the distances between the new group (AB) and the other sequences, for example C and D:
$d_{(AB)C} = \frac{1}{2}(d_{AC} + d_{BC})$; $d_{(AB)D} = \frac{1}{2}(d_{AD} + d_{BD})$;
The two closest sequences in the new matrix are then merged into a composite sequence group again, and the steps are repeated until all sequences are gathered into one class.
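The merging loop can be sketched as follows. Several points are assumptions rather than statements of the patent: the function name `upgma` is hypothetical; the distance values in `example_dist` are made up for illustration, since the distance matrices of the embodiment are not reproduced in the text; and distances to a merged group are averaged with weights equal to the group sizes, which is the usual UPGMA update and reduces to the formula above when A and B are single sequences.

```python
from itertools import combinations


def upgma(labels, dist):
    """dist maps (x, y) label pairs to distances; returns (tree, merge order)."""
    trees = {name: name for name in labels}   # cluster id -> nested tuple
    sizes = {name: 1 for name in labels}      # cluster id -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(trees) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(trees), 2),
                   key=lambda pair: d[frozenset(pair)])
        new = a + "+" + b
        # Size-weighted average distance from the new group to every other one.
        for c in trees:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((a, b))
    return next(iter(trees.values())), merges


# Hypothetical distances, chosen only so that the merge order matches the text.
example_dist = {
    ("user1", "user2"): 5, ("user1", "user3"): 8, ("user2", "user3"): 8,
    ("user1", "user4"): 10, ("user2", "user4"): 10, ("user3", "user4"): 9,
}
tree, merges = upgma(["user1", "user2", "user3", "user4"], example_dist)
print(tree)
print(merges)
```

With these illustrative distances, user1 and user2 are merged first, user3 joins next, and user4 is added last, which is the merge order described for the embodiment.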
2. The specific calculation steps are illustrated by an example
The distance matrix is obtained from the global matching algorithm. The distance matrix is the complement of the similarity matrix: the similarity matrix counts the identical characters, and the distance matrix counts the differing characters:
The sequences user1 and user2 are grouped into one class, and a new distance matrix is then calculated:
The smallest distance is between user1+user2 and user3.
The sequences user1+user2 and user3 are grouped into one class, and a new distance matrix is then calculated:
3. Determine the tree structure from the merging order above
The merging order of the previous step is: user1 and user2 are merged first; user3 is merged next; and user4 is added last. The resulting tree structure is as follows, and can be read together with FIG. 2:
(((user1, user2), user3), user4)
Therefore, when the original sequences are divided into 3 classes: user1 and user2 form the first class; user3 is the second class; user4 is the third class;
When the original sequences are divided into 2 classes: user1, user2 and user3 form the first class; user4 is the second class;
When the original sequences are divided into 1 class: user1, user2, user3 and user4 all belong to the same class.
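Reading the classes off the merge order can be sketched as follows: applying only the first (number of sequences minus k) merges leaves exactly k classes. The function name `classes_from_merges` and the "+"-joined group identifiers are hypothetical conventions chosen for this sketch.

```python
# Illustrative only: replay the earliest merges until k groups remain.
def classes_from_merges(leaves, merges, k):
    groups = {leaf: [leaf] for leaf in leaves}
    for a, b in merges[: len(leaves) - k]:
        # The merged group is stored under the joined identifier "a+b".
        groups[a + "+" + b] = groups.pop(a) + groups.pop(b)
    return sorted(groups.values())


leaves = ["user1", "user2", "user3", "user4"]
# Merge order described in the text: (user1, user2), then user3, then user4.
merges = [("user1", "user2"),
          ("user1+user2", "user3"),
          ("user1+user2+user3", "user4")]

for k in (3, 2, 1):
    print(k, classes_from_merges(leaves, merges, k))
```

For k = 3 this gives {user1, user2}, {user3} and {user4}; for k = 2 it gives {user1, user2, user3} and {user4}; for k = 1 all four users fall into one class.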
It should be noted that a class in the present invention expresses people's mobile phone usage habits: not simply how their time is split across the individual apps, but the switching relationships between apps. Without counting the number of switches between apps or how often each app is opened, people with the same pattern of spending time can be grouped together more accurately. Moreover, the invention does not depend on setting thresholds for the classification, which avoids human intervention.
For example, in this embodiment all users use WeChat, Douyin, and Honor of Kings, but their usage habits are not exactly the same. The first class (user1, user2) tends to open the apps repeatedly in combination, whereas user3 and user4 open them only a few times. If the classification were done by setting a threshold, a suitable threshold would be hard to find; with the classification method of the invention, people's shared habits can be extracted accurately and the classification is more precise.
The above embodiments only illustrate the technical idea of the present invention and do not limit its protection scope; any modification made on the basis of the technical scheme in accordance with the technical idea of the present invention falls within the protection scope of the present invention.
Claims (9)
1. A multi-sequence comparison classification method based on mobile phone app usage behavior data, characterized by comprising the following steps:
Step 1, collecting app usage behavior data of a plurality of mobile phone users to form user behavior sequences;
Step 2, globally matching the user behavior sequences and constructing a distance matrix;
Step 3, building a tree from the distance matrix by the unweighted pair group method with arithmetic mean (UPGMA), thereby performing classification.
2. The method of claim 1, wherein: in step 1, the collected app usage behavior data of the plurality of mobile phone users is processed into user behavior sequences as follows:
Step 11, collecting app usage behavior data of a plurality of mobile phone users, user(p): [(X1, t1), (X2, t2), …], where p denotes the p-th user, X1, X2, … denote the 1st, 2nd, … app used, and t1, t2, … denote the usage duration of the corresponding app;
Step 12, deleting as noise the app usage records whose usage duration is less than a threshold, and extracting only the apps, in order, to obtain a new sequence as the user behavior sequence user(p): [X01, X02, …].
3. The method of claim 1, wherein: the specific process of step 2 is as follows:
Step 21, obtaining the highest-scoring matching of the user behavior sequences with the dynamic-programming-based Needleman-Wunsch algorithm;
Step 22, comparing the mobile phone users pairwise with the global matching algorithm, then adding a third sequence on the basis of the pairwise results to perform multi-sequence comparison;
The LCS for multi-sequence matching is defined as follows:
For the aligned sequence combination obtained from $(u_1, u_2, u_3, \dots, u_n)$, written column by column as $[A^{1}_{1} A^{2}_{1} A^{3}_{1} \dots A^{n}_{1}\; A^{1}_{2} A^{2}_{2} A^{3}_{2} \dots A^{n}_{2}\; A^{1}_{3} A^{2}_{3} A^{3}_{3} \dots A^{n}_{3}\; \dots\; A^{1}_{i} A^{2}_{i} A^{3}_{i} \dots A^{n}_{i}]$, the LCS with another independent sequence is defined as:
$\mathrm{LCS}(i, j) = \mathrm{LCS}(A^{1}_{1} A^{2}_{1} A^{3}_{1} \dots A^{n}_{1}\; A^{1}_{2} A^{2}_{2} A^{3}_{2} \dots A^{n}_{2}\; \dots\; A^{1}_{i} A^{2}_{i} A^{3}_{i} \dots A^{n}_{i},\; c_1 c_2 c_3 c_4 \dots c_j)$, where $0 \le i \le N$ and $0 \le j \le M$.
For $1 \le i \le N$ and $1 \le j \le M$, the formula is as follows:
Step 23, calculating the similarity matrix between every two sequences from the global matching result, which gives the distance matrix.
4. The method of claim 3, wherein: the specific process of step 21 is:
Step 211, defining the following two sequences:
$A = a_1 a_2 a_3 a_4 \dots a_N$, where A consists of the N characters $a_1, a_2, a_3, a_4, \dots, a_N$ and has length N;
$B = b_1 b_2 b_3 b_4 \dots b_M$, where B consists of the M characters $b_1, b_2, b_3, b_4, \dots, b_M$ and has length M;
Step 212, LCS(A, B) denotes the longest common subsequence of string A and string B, and $\mathrm{LCS}(i, j) = \mathrm{LCS}(a_1 a_2 a_3 a_4 \dots a_i,\; b_1 b_2 b_3 b_4 \dots b_j)$, where $0 \le i \le N$ and $0 \le j \le M$.
5. The method of claim 4, wherein: in the step 22, when the sequences are compared with each other, a space character '-' is inserted into the head of each of the two sequences, and the space character '-' is not matched with the character at each position, and the score is 0; further obtaining the calculation results of each row, thereby obtaining the length of the maximum common substring; and then, determining the positions of the matched substrings through backtracking, writing out matched characters according to a backtracking path, and adding characters from the upper left corner to the lower right corner, thereby obtaining sequence results of two sequences subjected to backtracking.
6. The method of claim 5, wherein: when the positions of the matched substrings are determined by backtracking, backtracking starts at the cell in the lower right corner of the matrix and then proceeds cell by cell according to the following rules: first, if $a_i = b_j$; where cells with the same maximum value exist, priority follows the order upper left corner, top, left; second, if the current cell is in the first row of the matrix, backtrack to the cell on the left; third, if the current cell is in the first column of the matrix, backtrack to the cell above.
7. The method of claim 5, wherein: the specific rules for writing out the matched characters along the backtracking path, adding characters from the upper left corner to the lower right corner, are as follows:
if the backtracking path passes through the cell at the lower right, $a_i$ is added to the matching string of its corresponding user and $b_j$ is added to the matching string of its corresponding user;
if the backtracking path passes through the cell vertically below, $a_i$ is added to the matching string of its corresponding user and the gap "-" is added to the matching string of the user corresponding to $b_j$;
if the backtracking path passes through the cell to the right, the gap "-" is added to the matching string of the user corresponding to $a_i$ and $b_j$ is added to the matching string of its corresponding user.
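The backtracking rules of claims 6 and 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: it assumes the standard LCS score table, assumes that a match ($a_i = b_j$) moves to the upper-left cell, and breaks ties in the order upper left, top, left. The names `lcs_table` and `align` are hypothetical.

```python
def lcs_table(a, b):
    """Standard LCS score table; row 0 / column 0 are the '-' placeholder."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def align(a, b, gap="-"):
    """Backtrack from the lower-right cell and return two gapped strings."""
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i == 0:
            move = "left"                     # first row: only left is possible
        elif j == 0:
            move = "up"                       # first column: only up is possible
        else:
            diag = table[i - 1][j - 1]
            up, left = table[i - 1][j], table[i][j - 1]
            if a[i - 1] == b[j - 1] or diag >= max(up, left):
                move = "diag"                 # ASSUMED: match goes diagonal
            elif up >= left:
                move = "up"                   # tie priority: top before left
            else:
                move = "left"
        if move == "diag":                    # both characters are written
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":                    # a_i against a gap
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:                                 # gap against b_j
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    # Characters were collected from lower right to upper left, so reverse.
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `align("ABC", "AC")` returns `("ABC", "A-C")`: the two outputs have equal length, and removing the gaps recovers the original sequences.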
9. The method of claim 1, wherein: the specific content of step 3 is:
step 31, combining the two sequences with the shortest distance in the distance matrix into a composite sequence group (AB), updating the distance matrix, and again combining the two sequences with the shortest distance in the updated distance matrix, until all sequences are gathered into one class;
step 32, determining a tree structure according to the merging order;
and step 33, performing classification according to the tree structure.
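Steps 31 and 32 describe an agglomerative merge loop, which can be sketched as below. The text does not say how the distance matrix is updated after a merge, so this example assumes average linkage (the mean of all member-to-member distances); that choice, the nested-tuple tree representation, and the name `build_tree` are assumptions made for illustration only.

```python
def build_tree(labels, dist):
    """Repeatedly merge the two closest clusters; return (tree, merge order).

    labels: one name per sequence (at least one); dist: symmetric matrix of
    pairwise distances. ASSUMPTION: cluster-to-cluster distance is the
    average of member distances.
    """
    # cluster id -> (subtree, indices of the original sequences it contains)
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    merges = []
    next_id = len(labels)

    def cluster_distance(x, y):
        mx, my = clusters[x][1], clusters[y][1]
        return sum(dist[p][q] for p in mx for q in my) / (len(mx) * len(my))

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = ((p, q) for k, p in enumerate(ids) for q in ids[k + 1:])
        x, y = min(pairs, key=lambda pq: cluster_distance(*pq))
        subtree = (clusters[x][0], clusters[y][0])
        members = clusters[x][1] + clusters[y][1]
        merges.append(subtree)            # the merge order defines the tree
        del clusters[x], clusters[y]
        clusters[next_id] = (subtree, members)
        next_id += 1

    (tree, _), = clusters.values()
    return tree, merges
```

With three users where `u1` and `u2` are closest, the first merge is `("u1", "u2")` and the final tree nests that pair alongside `u3`. Classification as in step 33 would then read groups off this tree, for instance by stopping the merges once a chosen number of clusters remains.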
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110554096.6A CN113378892B (en) | 2021-05-20 | 2021-05-20 | Multi-sequence comparison classification method based on mobile phone app usage behavior data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113378892A (en) | 2021-09-10 |
CN113378892B (en) | 2024-07-09 |
Family
ID=77571419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110554096.6A (Active, CN113378892B) | Multi-sequence comparison classification method based on mobile phone app usage behavior data | 2021-05-20 | 2021-05-20 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378892B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0654755A1 (en) * | 1993-11-23 | 1995-05-24 | International Business Machines Corporation | A system and method for automatic handwriting recognition with a writer-independent chirographic label alphabet |
US20110224984A1 (en) * | 2010-03-11 | 2011-09-15 | Telefonica, S.A. | Fast Partial Pattern Matching System and Method |
US8787700B1 (en) * | 2011-11-30 | 2014-07-22 | Google Inc. | Automatic pose estimation from uncalibrated unordered spherical panoramas |
CN107045673A (en) * | 2017-03-31 | 2017-08-15 | 杭州电子科技大学 | Public bicycles changes in flow rate amount Forecasting Methodology based on heap Model Fusion |
CN107704868A (en) * | 2017-08-29 | 2018-02-16 | 重庆邮电大学 | Tenant group clustering method based on Mobile solution usage behavior |
CN108306879A (en) * | 2018-01-30 | 2018-07-20 | 福建师范大学 | The real-time abnormal localization method of distribution based on Web session streams |
CN108710836A (en) * | 2018-05-04 | 2018-10-26 | 南京邮电大学 | A kind of lip detecting and read method based on cascade nature extraction |
CN111768306A (en) * | 2020-06-23 | 2020-10-13 | 中国工商银行股份有限公司 | Risk identification method and system based on intelligent data analysis |
CN112039196A (en) * | 2020-04-22 | 2020-12-04 | 广东电网有限责任公司 | Power monitoring system private protocol analysis method based on protocol reverse engineering |
CN112800109A (en) * | 2021-01-21 | 2021-05-14 | 蜜兔(杭州)网络科技有限公司 | Information mining method and system |
Non-Patent Citations (1)
Title |
---|
Tang Yurong: "Sequence alignment algorithms in bioinformatics", Computer Engineering and Applications, no. 29 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113836370A (en) * | 2021-11-25 | 2021-12-24 | 上海观安信息技术股份有限公司 | User group classification method and device, storage medium and computer equipment |
CN113836370B (en) * | 2021-11-25 | 2022-03-01 | 上海观安信息技术股份有限公司 | User group classification method and device, storage medium and computer equipment |
CN114298863A (en) * | 2022-03-11 | 2022-04-08 | 浙江万胜智能科技股份有限公司 | Data acquisition method and system of intelligent meter reading terminal |
Also Published As
Publication number | Publication date |
---|---|
CN113378892B (en) | 2024-07-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||