CN113469244B

CN113469244B - Small-population (niche) app classification system

Info

Publication number: CN113469244B
Application number: CN202110733806.1A
Authority: CN
Inventors: 方毅; 周琦; 吕繁荣; 李正; 孙勇韬; 王志豪
Original assignee: Hangzhou Yunshen Technology Co ltd
Current assignee: Hangzhou Yunshen Technology Co ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-07-04
Anticipated expiration: 2041-06-30
Also published as: CN113469244A

Abstract

The invention relates to a small-population app classification system that implements the following steps: step S1, generating an input feature vector based on each small-population app feature information record in a first database, and inputting the input feature vector into an app classification model to obtain a class label for each small-population app, thereby generating M classes of small-population apps; step S2, inputting the initial vector of each small-population app into an app target vector generation model to generate a target vector corresponding to each small-population app; step S3, obtaining a center vector corresponding to each class of small-population apps, obtaining an intra-class distance and an inter-class distance for each class based on the center vectors and the target vectors of all small-population apps in the class, and changing the class labels of all small-population apps in every class whose ratio of intra-class distance to inter-class distance is smaller than a preset ratio to an (M+1)-th class. The invention improves the classification accuracy of small-population apps.

Description

Small-population app classification system

Technical Field

The invention relates to the field of computer technology, and in particular to a small-population app classification system.

Background

With the rapid development of technology, the number of apps is growing rapidly. In many application scenarios, analysis must be performed on one or more classes of apps, which requires a large number of apps to be classified accurately. Existing app classification methods mostly train a classification model directly on app feature information such as the app name and package name, and use that model to classify apps. By installation volume, apps can be divided into mass apps and small-population (niche) apps. Small-population apps make up a large share of all apps and come in many varieties, so their total data volume is very large, while the number of samples for each individual small-population app is small. Classifying small-population apps with the existing methods therefore yields low accuracy. How to improve the classification accuracy of small-population apps is thus a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a small-population app classification system that improves the classification accuracy of small-population apps.

According to a first aspect of the present invention, there is provided a small-population app classification system comprising a pre-built first database and second database, a pre-trained app classification model and app target vector generation model, a memory storing a computer program, and a processor. The first database stores small-population app feature information records, each comprising a small-population app id and a corresponding plurality of app feature information; the second database stores a small-population app sequence table and a small-population app initial vector mapping table, wherein the small-population app sequence table comprises one or more of a small-population app installation sequence table, a small-population app uninstallation sequence table and a small-population app active sequence table; the app classification model is obtained by training on the feature information that first sample small-population apps have in the first database; the app target vector generation model is obtained by training based on the small-population app sequence records corresponding to sample user ids in the second database and the small-population app initial vector mapping table, wherein a small-population app is an app whose installation amount is smaller than a preset installation amount;

when the processor is executing the computer program, the following steps are implemented:

Step S1, generating an input feature vector based on each small-population app feature information record in the first database, and inputting the input feature vector into the app classification model to obtain a class label for each small-population app, thereby generating M classes of small-population apps;

Step S2, inputting the initial vector of each small-population app into the app target vector generation model to generate a target vector corresponding to each small-population app;

Step S3, obtaining a center vector corresponding to each class of small-population apps, obtaining an intra-class distance and an inter-class distance for each class based on the center vectors and the target vectors of all small-population apps in the class, and changing the class labels of all small-population apps in every class whose ratio of intra-class distance to inter-class distance is smaller than a preset ratio to the (M+1)-th class.

Compared with the prior art, the invention has obvious advantages and beneficial effects. By means of the above technical scheme, the small-population app classification system provided by the invention achieves considerable technical progress and practicality, has wide industrial application value, and has at least the following advantages:

The invention uses the app classification model to coarsely classify small-population apps, and then calibrates the coarse classification result based on the target vectors generated for the small-population apps by the app target vector generation model, thereby improving the accuracy of small-population app classification.

The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with this specification, and so that the above and other objects, features and advantages of the invention become more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

Fig. 1 is a schematic diagram of a small-population app classification system according to an embodiment of the present invention.

Detailed Description

To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, a specific implementation of a small-population app classification system according to the invention and its effects are described in detail below with reference to the accompanying drawings and preferred embodiments.

Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Embodiment 1

The first embodiment provides a small-population app classification system which, as shown in Fig. 1, comprises a pre-built first database and second database, a pre-trained app classification model and app target vector generation model, a memory storing a computer program, and a processor. The first database stores small-population app feature information records, each comprising a small-population app id and a corresponding plurality of app feature information; as an example, the app feature information includes app package name information. The second database stores a small-population app sequence table and a small-population app initial vector mapping table. The small-population app sequence table comprises one or more of a small-population app installation sequence table, a small-population app uninstallation sequence table and a small-population app active sequence table. The small-population app initial vector mapping table stores an initial vector for each small-population app, and the initial vector can be obtained through random initialization. The app classification model is obtained by training on the feature information that first sample small-population apps have in the first database; it can be understood that the first sample small-population apps are small-population apps of known class. The app target vector generation model is obtained by training based on the small-population app sequence records corresponding to sample user ids in the second database and the small-population app initial vector mapping table, wherein a small-population app is an app whose installation amount is smaller than a preset installation amount; when the processor executes the computer program, the following steps are implemented:

Step S1, generating an input feature vector based on each small-population app feature information record in the first database, and inputting the input feature vector into the app classification model to obtain a class label for each small-population app, thereby generating M classes of small-population apps, wherein M is a positive integer;

The feature information of the first sample small-population apps is constructed from the first database, and the app classification model is trained on this feature information and the corresponding class information. The training itself can directly use an existing app classification model training method and is not described again here. Because the classification result obtained with the app classification model alone is not highly accurate, the embodiment of the invention introduces a target vector for each small-population app and uses it to further check and correct the classification result produced by the app classification model, thereby improving the accuracy of small-population app classification.

It should be noted that the app target vector generation model learns the sequential relations among small-population apps in the small-population app sequences. Constructing samples from these sequences greatly increases the number of reliable samples and improves the training accuracy of the target vector generation model, which in turn improves the accuracy of the target vectors and of the classification obtained by calibrating with them.

As an embodiment, the center vector of a class may be obtained by computing the mean of each feature over the target vectors of all small-population apps in that class. The intra-class distance is computed as follows: obtain the second distance between the target vector of every small-population app in the class and the center vector of the class, and take the maximum of these second distances as the intra-class distance of the class. The inter-class distance is computed as follows: obtain the third distance between the center vector of the class and the center vector of each other class, and take the minimum of these third distances as the inter-class distance of the class. A class whose ratio of intra-class distance to inter-class distance is smaller than the preset ratio is regarded as a class of low classification accuracy and reliability, so all small-population apps in it are assigned to a further class outside the M classes; the labels of small-population apps in classes whose ratio is not smaller than the preset ratio are left unchanged. In this way the accurately and reliably classified classes are screened out, which improves the accuracy of the small-population app classification result. It should be noted that the preset ratio is set according to the specific classification accuracy requirement.
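The center-vector and distance computations described above could be sketched as follows. This is a minimal illustration rather than the patented implementation: the Euclidean metric, the function names `center` and `calibrate_labels`, and the dictionary layout are assumptions, and the comparison follows the wording literally (intra-class distance relative to inter-class distance, compared against the preset ratio).

```python
import math
from collections import defaultdict


def center(vectors):
    """Feature-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def calibrate_labels(labels, targets, preset_ratio, other_label):
    """Relabel whole classes according to step S3.

    labels:  app id -> class label from step S1
    targets: app id -> target vector from step S2
    Euclidean distance is an assumption; the text does not name a metric.
    """
    members = defaultdict(list)
    for app_id, label in labels.items():
        members[label].append(app_id)
    if len(members) < 2:
        raise ValueError("inter-class distance needs at least two classes")

    centers = {c: center([targets[a] for a in apps]) for c, apps in members.items()}
    new_labels = dict(labels)
    for c, apps in members.items():
        # Intra-class distance: largest member-to-center distance.
        intra = max(math.dist(targets[a], centers[c]) for a in apps)
        # Inter-class distance: smallest distance to another class center.
        inter = min(math.dist(centers[c], centers[o]) for o in centers if o != c)
        # Literal reading: intra / inter < preset_ratio (written without division).
        if intra < preset_ratio * inter:
            for a in apps:
                new_labels[a] = other_label
    return new_labels
```

Passing `other_label` explicitly keeps the (M+1)-th class independent of how the M original labels happen to be numbered.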

Embodiment 2

In the first embodiment, the classes of low accuracy and reliability produced by the app classification model are filtered out, and the small-population apps in the filtered classes are relabeled as the (M+1)-th class. To further improve the accuracy of the classification result, each remaining class can be analyzed further based on the target vector of every small-population app in it, so that wrongly classified small-population apps within each class are also filtered out and assigned to the (M+1)-th class. Specifically, after step S3 is executed, the method further includes:

Step S4, determining each small-population app class whose ratio of intra-class distance to inter-class distance is greater than or equal to the preset ratio as a class to be processed; for each class to be processed, setting an initial radius and increasing the radius by a preset radius increment step, and acquiring the small-population app density and small-population app recall rate of the class at each radius; determining, based on the distributions of density and recall rate over the different radii, the small-population apps that actually belong to the class; and changing the labels of the remaining small-population apps in the class to the (M+1)-th class.

As an embodiment, the step S4 may further include:

Step S41, obtaining the small-population app recall rate and the small-population app density of each class to be processed at different radii based on the following formula:

R = xδ

where rec denotes the small-population app recall rate at the current radius, density denotes the small-population app density at the current radius, N denotes the number of small-population apps in the class to be processed, R denotes the current radius value, n takes values starting from 1, n = 1, 2, 3, …, dist_i denotes the indicator value, x_i denotes the target vector of the i-th small-population app in the class to be processed, and the last symbol denotes the center vector of the class to be processed;

Step S42, taking the radius of the class to be processed as the abscissa, plotting the small-population app recall rate as the ordinate to obtain a first curve, and plotting the small-population app density as the ordinate to obtain a second curve;

Step S43, acquiring a target radius based on the first curve and the second curve, and determining the small-population apps within the target radius of the class to be processed as the small-population apps that actually belong to the class.
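The two curves of steps S41 and S42 could be tabulated along the lines below. The formulas for rec and density are not reproduced in this text, so the definitions used here are assumptions made purely for illustration: recall is taken as the fraction of the class's apps within the current radius of the center vector, density as the number of such apps divided by the radius, and distance as Euclidean. The function name `radius_curves` is hypothetical.

```python
import math


def radius_curves(targets, center_vec, initial_radius, step, num_steps):
    """Return (radius, recall, density) triples for one class to be processed.

    Assumed definitions (not taken from the source formulas):
      recall  = apps within the radius / all apps in the class
      density = apps within the radius / radius
    """
    distances = [math.dist(x, center_vec) for x in targets]
    total = len(distances)
    curve = []
    for k in range(num_steps):
        radius = initial_radius + k * step
        inside = sum(1 for d in distances if d <= radius)
        curve.append((radius, inside / total, inside / radius))
    return curve


# Points for the first curve (radius vs. recall) and second curve (radius vs. density).
points = radius_curves([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]], [0.0, 0.0], 1.0, 1.0, 3)
first_curve = [(r, rec) for r, rec, _ in points]
second_curve = [(r, dens) for r, _, dens in points]
```

Selecting the target radius from these curves (elbow point or threshold-based) is left out, since the text defers to existing elbow-point methods.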

Further, the step S43 may further include:

Step S431, acquiring a preset recall threshold, and determining whether an elbow point exists on the segment of the second curve whose abscissas are those at which the first curve is greater than or equal to the recall threshold; if so, taking the radius value corresponding to the elbow point as the target radius; otherwise, executing step S432;

The elbow point of the segment of the second curve may be determined directly by an existing elbow point determination method, which is not described here.

Step S432, performing a phase operation on the first curve and the second curve based on the recall threshold to determine the target radius.

The second embodiment further improves the accuracy of the classification result of each class on the basis of the first embodiment, thereby improving the accuracy of the small-population app classification result.

Embodiment 3

It should be noted that the technical details of the third embodiment may be carried out on the basis of the first embodiment or the second embodiment. To further improve the accuracy of the classification result, the class of each small-population app in the (M+1)-th class is determined further; when the processor executes the computer program, the following steps are also implemented:

Step S01, determining each small-population app class whose ratio of intra-class distance to inter-class distance in step S3 is greater than or equal to the preset ratio as a class to be processed;

It can be understood that the classes filtered out in step S3 are classes whose division results are inaccurate and unreliable. Therefore, when further determining the class of the small-population apps in the (M+1)-th class, only the classes whose ratio of intra-class distance to inter-class distance is greater than or equal to the preset ratio are used as classes to be processed, which improves the accuracy of that determination.

Step S02, obtaining the first distance between the target vector of each small-population app in the (M+1)-th class and the center vector of each class to be processed, obtaining the minimum first distance, and comparing it with a preset first distance threshold; if the minimum first distance is smaller than the first distance threshold, changing the class label of the small-population app to the label of the class to be processed corresponding to the minimum first distance.

It can be understood that, through steps S01-S02, small-population apps that were wrongly assigned to the (M+1)-th class can be corrected back to their proper class, further improving the accuracy of the small-population app classification result.
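Steps S01-S02 amount to a thresholded nearest-center reassignment, which might look like the sketch below. The function name `reassign_other_class`, the dictionary layout and the Euclidean metric are assumptions for illustration; `centers` is expected to hold only the center vectors of the classes to be processed.

```python
import math


def reassign_other_class(labels, targets, centers, other_label, distance_threshold):
    """Move apps out of the (M+1)-th class when a class center is close enough.

    labels:  app id -> current class label
    targets: app id -> target vector
    centers: class label -> center vector, classes to be processed only
    """
    new_labels = dict(labels)
    for app_id, label in labels.items():
        if label != other_label:
            continue  # only apps currently in the (M+1)-th class are revisited
        # Minimum first distance and the class it belongs to.
        nearest, distance = min(
            ((c, math.dist(targets[app_id], vec)) for c, vec in centers.items()),
            key=lambda item: item[1],
        )
        if distance < distance_threshold:
            new_labels[app_id] = nearest
    return new_labels
```

Apps whose nearest center is farther than the threshold simply keep the (M+1)-th label.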

The technical details of the first embodiment, the second embodiment and the third embodiment can be realized by the following embodiments.

The installation volumes of mass apps and small-population apps differ by orders of magnitude, so the preset installation amount can be determined from the installation amounts of apps, and mass apps and small-population apps can then be divided based on it. As an embodiment, the processor, when executing the computer program, further implements the following step:

Step S100, obtaining an app installation amount distribution map based on the installation amounts of the full set of apps, the full set comprising mass apps and small-population apps, and determining the installation amount corresponding to the inflection point at which the distribution map dips as the preset installation amount.

As an embodiment, the processor, when executing the computer program, further implements the following steps to construct the target vector generation model:

Step S10, selecting a second sample small-population app from the small-population app sequence table corresponding to a sample user;

Step S20, based on a preset time window, selecting from the small-population app sequence table a window sequence containing the second sample small-population app as a positive sample sequence;

Step S30, randomly extracting small-population apps from the first database and combining them with the second sample small-population app to construct a negative sample sequence, wherein the positive sample sequence and the negative sample sequence contain equal numbers of small-population apps;

It should be noted that, because the number of small-population apps is huge, randomly extracted small-population apps are unlikely to coincide with those of a positive sample sequence, so random extraction satisfies the requirement for constructing negative sample sequences and is efficient. To further improve the accuracy of negative sample construction, in another embodiment, step S30 may first determine, from the positive sample sequences generated from the sample user's sequence, the small-population apps adjacent to the second sample small-population app, and then randomly extract from the first database small-population apps other than those adjacent apps to construct the negative sample sequence together with the second sample small-population app.
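Steps S10 to S30 could be sketched as below. Several simplifications are assumed for illustration: the preset time window is treated as a fixed count of consecutive apps, the sample app is placed at a random position in the negative sequence (the text does not say where it goes), and the names `positive_windows` and `negative_sequence` are hypothetical.

```python
import random


def positive_windows(sequence, sample_app, window_size):
    """All length-`window_size` windows of a user's time-ordered app ids
    that contain the sample app (positive sample sequences, step S20)."""
    windows = []
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]
        if sample_app in window:
            windows.append(window)
    return windows


def negative_sequence(sample_app, all_app_ids, window_size, exclude=(), rng=random):
    """One negative sample sequence (step S30): the sample app plus randomly
    drawn apps, with the same length as a positive sequence.

    `exclude` may hold the apps adjacent to the sample app in positive
    sequences, matching the alternative embodiment of step S30.
    """
    candidates = [a for a in all_app_ids if a != sample_app and a not in exclude]
    drawn = rng.sample(candidates, window_size - 1)
    position = rng.randrange(window_size)  # assumed: random slot for the sample app
    return drawn[:position] + [sample_app] + drawn[position:]
```

For example, `positive_windows(["a", "b", "c", "d"], "b", 2)` yields the two windows `["a", "b"]` and `["b", "c"]`.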

Step S40, constructing a sample input vector for each positive sample sequence and each negative sample sequence based on the small-population app initial vector mapping table;

As an embodiment, the initial vector of a small-population app is a 1×m-dimensional vector, the number of small-population apps in a positive sample sequence of the preset time window is n, and step S40 further includes:

Step S401, converting each small-population app id into its corresponding initial vector in the order of the small-population apps in each sample sequence, the initial vector of each app id forming one row of the input feature vector, so that an n×m-dimensional vector is finally obtained.
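A small sketch of step S401 follows, under the assumption that the initial vector mapping table is a plain dictionary filled by random initialization. The helper names `init_vectors` and `sequence_to_matrix` and the uniform range are illustrative choices, not taken from the source.

```python
import random


def init_vectors(app_ids, m, rng=random):
    """Randomly initialised 1 x m vector per app id (assumed uniform in [-1, 1])."""
    return {app_id: [rng.uniform(-1.0, 1.0) for _ in range(m)] for app_id in app_ids}


def sequence_to_matrix(sequence, initial_vectors):
    """Stack the initial vectors of a sample sequence row by row,
    preserving the order of the apps, to obtain an n x m input."""
    return [list(initial_vectors[app_id]) for app_id in sequence]


table = init_vectors(["a", "b", "c"], m=4, rng=random.Random(0))
matrix = sequence_to_matrix(["c", "a"], table)  # n = 2 rows, m = 4 columns
```

Row j of the result is the initial vector of the j-th app in the sequence, which is what lets step S22 later read target vectors back by row.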

Step S50, training a preset target vector generation model framework based on the sample labels of the positive and negative samples and the sample input vectors of the positive and negative sample sequences, to generate the target vector generation model.

As an embodiment, the target vector generation model framework is a multi-layer neural network model. Each small-population app in a positive or negative sample sequence corresponds to an independent input channel, and the number of input channels equals the preset window size: if the number of small-population apps in a positive sample sequence of the preset time window is n, the multi-layer neural network model has n independent input channels, whose order matches the order of the small-population apps in the sample sequence. Each neural network layer is configured with a corresponding first weight value, which is the model parameter to be updated when training the target vector generation model. The last neural network layer comprises two neurons; correspondingly, the positive sample label is 10 and the negative sample label is 01. Step S50 includes:

Step S501, inputting the positive and negative sample data of the current batch into the target vector generation model framework, each sample yielding a pair of probability prediction values;

Step S502, obtaining the current loss function value based on the probability prediction values and sample labels of all samples of the current batch, and determining whether the current loss function value meets a preset model training end condition; if so, executing step S504; otherwise, executing step S503;

As an embodiment, the model training end condition is that the loss function value is smaller than a preset first loss threshold, or that the loss function value is smaller than a preset second loss threshold and remains unchanged, the first loss threshold being smaller than the second loss threshold.
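The end condition can be expressed as a small predicate over the history of loss values. How "remains unchanged" is detected is not specified, so the `patience` and `tolerance` parameters below are assumptions, as is the function name `training_finished`.

```python
def training_finished(loss_history, first_threshold, second_threshold,
                      patience=3, tolerance=1e-6):
    """Return True when the model training end condition is met.

    Condition A: current loss < first_threshold.
    Condition B: current loss < second_threshold and the loss has stayed
                 within `tolerance` over the last `patience` batches
                 (assumed reading of "remains unchanged").
    first_threshold is expected to be smaller than second_threshold.
    """
    current = loss_history[-1]
    if current < first_threshold:
        return True
    recent = loss_history[-patience:]
    plateau = len(recent) == patience and max(recent) - min(recent) <= tolerance
    return current < second_threshold and plateau
```

With thresholds 0.1 and 0.3, a history ending in 0.05 stops training immediately, while a history of `[0.2, 0.2, 0.2]` stops it through the plateau branch.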

Step S503, obtaining a current tuning parameter value based on the current loss function value, updating the first weight value of each neural network layer based on the current tuning parameter value, taking the positive and negative sample data of the next batch as those of the current batch, and returning to step S501;

Step S504, taking the input channels of the current target vector generation model framework as the input and the vector generated by the layer preceding the last neural network layer as the output, thereby generating the target vector generation model.

As an embodiment, the step S2 includes:

Step S21, inputting the initial vectors of n small-population apps into the app target vector generation model in a preset order to generate an n×m-dimensional output vector;

Step S22, determining the j-th row of the n×m-dimensional output vector as the target vector of the j-th small-population app in the preset order, where j takes values from 1 to n.

It can be understood that the target vectors of n small-population apps can be obtained simultaneously through the target vector generation model. It can also be understood that, to obtain the target vector of a given target small-population app, the target small-population app and (n-1) randomly extracted small-population apps may be input into the app target vector generation model, and the vector at the position of the target small-population app in the output vector is determined as its target vector.

It should be noted that, in the embodiment of the invention, both the app target vector generation model and the app classification model are trained on data corresponding to sample small-population apps. Because the data of small-population apps are of essentially the same order of magnitude and do not differ greatly in quantity, the trained models are highly accurate and reliable, which further improves the accuracy of small-population app classification.

The small-population app sequence table comprises any one or more of a small-population app installation sequence table, a small-population app uninstallation sequence table and a small-population app active sequence table. It can be understood that the more types are included, the higher the accuracy but the larger the computation; the fewer types are included, the smaller the computation but the lower the accuracy, so the selection can be made according to specific application requirements. When different sequence combinations are selected, the sample data generated from different sequences correspond to different loss weights, subject to the requirement that the loss weight of the installation sequence is larger than that of the uninstallation sequence, and the loss weight of the uninstallation sequence is larger than that of the active sequence. As an embodiment, the small-population app sequence table comprises a small-population app installation sequence table, a small-population app uninstallation sequence table and a small-population app active sequence table, wherein the small-population app installation sequence table stores small-population app installation sequence records, each comprising a user id, small-population app ids arranged in order of installation time, and the installation time of each small-population app id; the small-population app uninstallation sequence table stores small-population app uninstallation sequence records, each comprising a user id, small-population app ids arranged in order of uninstallation time, and the uninstallation time of each small-population app id; the small-population app active sequence table stores small-population app active sequence records, each comprising a user id, small-population app ids arranged in order of active time, and the active time of each small-population app id;

in step S20, a positive sample sequence obtained from the small-population app installation sequence table is a first positive sample sequence, a positive sample sequence obtained from the small-population app uninstallation sequence table is a second positive sample sequence, and a positive sample sequence obtained from the small-population app active sequence table is a third positive sample sequence; a first loss weight, a second loss weight and a third loss weight are set for the first, second and third positive sample sequences respectively, wherein the first loss weight > the second loss weight > the third loss weight;

in step S502, the current loss function value is obtained based on the probability prediction values and sample labels of all samples of the current batch, together with the first, second and third loss weights.
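One way to fold the three loss weights into the batch loss of step S502 is a weighted cross-entropy, sketched below. The text does not name the loss function, so cross-entropy is an assumption, as are the concrete weight values, the weighted-average normalisation, and the name `weighted_batch_loss`; only the ordering install > uninstall > active comes from the description. Labels use the two-neuron encoding, with (1, 0) for a positive sample and (0, 1) for a negative one.

```python
import math

# Hypothetical weight values; only their ordering is taken from the description.
LOSS_WEIGHTS = {"install": 1.0, "uninstall": 0.6, "active": 0.3}


def weighted_batch_loss(predictions, labels, sources, weights=LOSS_WEIGHTS, eps=1e-12):
    """Weighted mean cross-entropy over one batch.

    predictions: list of (p_positive, p_negative) probability pairs
    labels:      list of (1, 0) or (0, 1) sample labels
    sources:     sequence type each sample came from ("install", ...)
    """
    total = 0.0
    weight_sum = 0.0
    for probs, label, source in zip(predictions, labels, sources):
        # Cross-entropy of one sample; eps guards against log(0).
        cross_entropy = -sum(t * math.log(max(p, eps)) for t, p in zip(label, probs))
        total += weights[source] * cross_entropy
        weight_sum += weights[source]
    return total / weight_sum
```

Samples built from installation sequences then contribute more to the loss than those built from uninstallation or active sequences, reflecting the required ordering of the weights.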

The foregoing describes only preferred embodiments of the present invention and does not limit the invention in any form; any modifications, equivalent changes and adaptations made to the above embodiments without departing from the scope of the technical solution of the invention still fall within that scope.

Claims

1. A small-population app classification system, characterized in that

the system comprises a pre-built first database and second database, a pre-trained app classification model and app target vector generation model, a memory storing a computer program, and a processor, wherein the first database stores small-population app feature information records, each comprising a small-population app id and a corresponding plurality of app feature information; the second database stores a small-population app sequence table and a small-population app initial vector mapping table, wherein the small-population app sequence table comprises one or more of a small-population app installation sequence table, a small-population app uninstallation sequence table and a small-population app active sequence table; the app classification model is obtained by training on the feature information that first sample small-population apps have in the first database; the app target vector generation model is obtained by training based on the small-population app sequence records corresponding to sample user ids in the second database and the small-population app initial vector mapping table, wherein a small-population app is an app whose installation amount is smaller than a preset installation amount;

step S3, obtaining a center vector corresponding to each class of small-population apps, obtaining an intra-class distance and an inter-class distance for each class based on the center vectors and the target vectors of all small-population apps in the class, and changing the class labels of all small-population apps in every class whose ratio of intra-class distance to inter-class distance is smaller than a preset ratio to the (M+1)-th class.

2. The system of claim 1, characterized in that

when the processor executes the computer program, the following steps are also implemented:

step S01, determining each small-population app class whose ratio of intra-class distance to inter-class distance in step S3 is greater than or equal to the preset ratio as a class to be processed;

step S02, obtaining the first distance between the target vector of each small-population app in the (M+1)-th class and the center vector of each class to be processed, obtaining the minimum first distance, and comparing it with a preset first distance threshold; if the minimum first distance is smaller than the first distance threshold, changing the class label of the small-population app to the label of the class to be processed corresponding to the minimum first distance.

3. The system of claim 1, characterized in that

the processor, when executing the computer program, further performs the steps of:

4. The system of claim 1, characterized in that

5. The system of claim 4, characterized in that

the initial vector of a small-population app is a 1×m-dimensional vector, the number of small-population apps in a positive sample sequence of the preset time window is n, and step S40 further includes:

6. The system of claim 5, characterized in that

step S2 includes:

step S22, determining the j-th row of the n×m-dimensional output vector as the target vector of the j-th small-population app in the preset order.

7. The system of claim 4, characterized in that

the target vector generation model framework is a multi-layer neural network model, each small-population app in a positive or negative sample sequence corresponds to an independent input channel, the number of input channels equals the preset window size, each neural network layer is configured with a corresponding first weight value, the last neural network layer comprises two neurons, the positive sample label is 10 and the negative sample label is 01, and step S50 includes:

8. The system of claim 7, characterized in that

the model training end condition is that the loss function value is smaller than a preset first loss threshold, or that the loss function value is smaller than a preset second loss threshold and remains unchanged, wherein the first loss threshold is smaller than the second loss threshold.

9. The system of claim 7, characterized in that

the small-population app sequence table comprises a small-population app installation sequence table, a small-population app uninstallation sequence table and a small-population app active sequence table, wherein the small-population app installation sequence table stores small-population app installation sequence records, each comprising a user id, small-population app ids arranged in order of installation time, and the installation time of each small-population app id; the small-population app uninstallation sequence table stores small-population app uninstallation sequence records, each comprising a user id, small-population app ids arranged in order of uninstallation time, and the uninstallation time of each small-population app id; the small-population app active sequence table stores small-population app active sequence records, each comprising a user id, small-population app ids arranged in order of active time, and the active time of each small-population app id;