CN113469244A

CN113469244A - Xiaozhong app classification system

Info

Publication number: CN113469244A
Application number: CN202110733806.1A
Authority: CN
Inventors: 方毅; 周琦; 吕繁荣; 李正; 孙勇韬; 王志豪
Original assignee: Hangzhou Yunshen Technology Co ltd
Current assignee: Hangzhou Yunshen Technology Co ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-10-01
Anticipated expiration: 2041-06-30
Also published as: CN113469244B

Abstract

The invention relates to a Xiaozhong (minority) app classification system that implements the following steps: step S1, generating an input feature vector based on the feature information record of each minority app in a first database, inputting the input feature vector into an app classification model, and obtaining a category label for each minority app so as to generate M classes of minority apps; step S2, inputting the initial vector of each minority app into an app target vector generation model, and generating a target vector corresponding to each minority app; step S3, obtaining a center vector corresponding to each class of minority apps, obtaining an intra-class distance and an inter-class distance for each class based on the center vector of the class and the target vectors of all minority apps in the class, and changing the class label of all minority apps in any class whose ratio of intra-class distance and inter-class distance is smaller than a preset ratio to an (M+1)-th class. The classification accuracy of minority apps is thereby improved.

Description

Xiaozhong app classification system

Technical Field

The invention relates to the technical field of computers, in particular to a Xiaozhong app classification system.

Background

With the rapid development of science and technology, the number of apps is increasing rapidly. Many application scenarios require analysis based on one or more categories of apps, which in turn requires accurate classification of a large number of apps. Most existing app classification methods train a classification model directly on app feature information such as app names and package names. Apps can be divided into minority apps and popular apps according to the magnitude of their installation amount. Minority apps account for a large share of all apps and fall into many categories; their total data volume is very large, but the number of samples for each minority app is small. As a result, classifying massive numbers of minority apps with existing app classification methods yields low accuracy. How to improve the classification accuracy of minority apps has therefore become a technical problem to be solved urgently.

Disclosure of Invention

The invention aims to provide a Xiaozhong app classification system which improves the classification accuracy of the Xiaozhong apps.

According to a first aspect of the present invention, there is provided a minority app classification system, including a pre-built first database and second database, a pre-trained app classification model and app target vector generation model, a memory storing a computer program, and a processor. The first database stores minority app feature information records, each including a minority app id and a plurality of corresponding items of app feature information; the second database stores a minority app sequence table and a minority app initial vector mapping table, and the minority app sequence table comprises one or more of a minority app installation sequence table, a minority app uninstallation sequence table and a minority app active sequence table; the app classification model is obtained by training on the feature information corresponding to first sample minority apps in the first database; the app target vector generation model is obtained by training on the minority app sequence records corresponding to sample user ids in the second database and the minority app initial vector mapping table, wherein a minority app is an app whose installation amount is smaller than a preset installation amount;

when the processor is executing the computer program, the following steps are implemented:

step S1, generating an input feature vector based on the feature information record of each Xiaozhong app in the first database, inputting the input feature vector into the app classification model, and obtaining a category label of each Xiaozhong app so as to generate M types of Xiaozhong apps;

step S2, inputting each Xiaozhong app initial vector into the app target vector generation model, and generating a target vector corresponding to each Xiaozhong app;

step S3, obtaining a center vector corresponding to each class of minority apps, obtaining an intra-class distance and an inter-class distance for each class based on the center vector of the class and the target vectors of all minority apps in the class, and changing the class label of all minority apps in any minority app class whose ratio of intra-class distance and inter-class distance is smaller than a preset ratio to an (M+1)-th class.

Compared with the prior art, the invention has obvious advantages and beneficial effects. By means of the technical scheme, the Xiaozhong app classification system provided by the invention can achieve considerable technical progress and practicability, has wide industrial utilization value and at least has the following advantages:

According to the invention, the minority apps are first coarsely classified by the app classification model, and the coarse classification result is then calibrated based on the target vectors of the minority apps generated by the app target vector generation model, so that the accuracy of minority app classification is improved.

The foregoing is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages can be more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

Fig. 1 is a schematic diagram of a minority app classification system according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means adopted by the present invention to achieve its intended objects and the effects obtained, a specific implementation of the minority app classification system according to the present invention and its effects are described in detail below with reference to the accompanying drawings and preferred embodiments.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.

Example I,

The first embodiment provides a minority app classification system, as shown in Fig. 1, including a pre-built first database and second database, a pre-trained app classification model and app target vector generation model, a memory storing a computer program, and a processor. The first database stores minority app feature information records, each including a minority app id and a plurality of corresponding items of app feature information; the app feature information record includes app package name information. The second database stores a minority app sequence table and a minority app initial vector mapping table. The minority app sequence table comprises one or more of a minority app installation sequence table, a minority app uninstallation sequence table and a minority app active sequence table. The minority app initial vector mapping table stores an initial vector for each minority app, and the initial vectors can be obtained through random initialization. The app classification model is obtained by training on the feature information corresponding to first sample minority apps in the first database; it can be understood that the first sample minority apps are minority apps of known class. The app target vector generation model is obtained by training on the minority app sequence records corresponding to sample user ids in the second database and the minority app initial vector mapping table, wherein a minority app is an app whose installation amount is smaller than a preset installation amount. When the processor executes the computer program, the following steps are implemented:

step S1, generating an input feature vector based on the feature information record of each Xiaozhong app in the first database, inputting the input feature vector into the app classification model, and obtaining a category label of each Xiaozhong app, so as to generate M types of Xiaozhong apps, wherein M is a positive integer;

For the first sample minority apps, the corresponding feature information is built in the first database, and their category information is known. The app classification model is obtained by training on the first sample minority apps and the corresponding category information. The training itself can use any app classification model training method from the prior art and is not described here. Because the accuracy of the minority app classification result obtained with the app classification model alone is not high, the invention introduces the target vector corresponding to each minority app; the classification result produced by the app classification model is further checked and corrected based on these target vectors, which improves the accuracy of minority app classification.

It should be noted that the app target vector generation model learns the sequential relationships among minority apps in the minority app sequences. Constructing samples from these sequences greatly increases the number of reliable samples and improves the training accuracy of the target vector generation model, which in turn improves the accuracy of the target vector obtained for each minority app and of the minority app classification calibrated with those target vectors.

As an embodiment, the center vector of a class may be obtained by averaging each feature over the target vectors of all minority apps in that class. The intra-class distance is computed as follows: obtain a second distance between the target vector of each minority app in the class and the center vector of the class, and take the maximum of the second distances as the intra-class distance of the class. The inter-class distance is computed as follows: obtain a third distance between the center vector of the class and the center vector of each other class of minority apps, and take the minimum of the third distances as the inter-class distance of the class. Minority app classes whose ratio of intra-class distance and inter-class distance is smaller than the preset ratio are classes with low classification accuracy and reliability, so all minority apps of those classes are moved into a separate class outside the M classes, which serves as the (M+1)-th class. The class labels of minority apps in classes whose ratio is greater than or equal to the preset ratio are kept unchanged. In this way the accurately and reliably classified classes are screened out, and the accuracy of the minority app classification result is improved. It should be noted that the preset ratio is set according to the required classification accuracy.
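The center vector, intra-class distance and inter-class distance computations described above can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes Euclidean distance, and it assumes the ratio is taken as inter-class distance divided by intra-class distance (the text does not fix the orientation), so that a small ratio marks a poorly separated class. The function names and the data layout are hypothetical.

```python
import math

def center(vectors):
    """Average each feature over the target vectors of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unreliable_classes(classes, preset_ratio):
    """Return the labels whose apps would be relabeled to class M+1.

    classes: dict mapping a class label to a non-empty list of target
    vectors; at least two classes are assumed.
    """
    centers = {label: center(vecs) for label, vecs in classes.items()}
    flagged = set()
    for label, vecs in classes.items():
        # Intra-class distance: largest distance from a member to the center.
        intra = max(math.dist(v, centers[label]) for v in vecs)
        # Inter-class distance: smallest distance to any other class center.
        inter = min(math.dist(centers[label], centers[other])
                    for other in centers if other != label)
        # ASSUMPTION: ratio = inter / intra; the source leaves this open.
        ratio = inter / intra if intra > 0 else math.inf
        if ratio < preset_ratio:
            flagged.add(label)
    return flagged
```

With two compact classes and one widely scattered class, only the scattered class would be expected to fall below the preset ratio and be moved to the (M+1)-th class.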

Example II,

In the first embodiment, the classes produced by the app classification model that have low accuracy and low reliability are filtered out, and the class of the minority apps in the filtered classes is changed to the (M+1)-th class. To further improve the accuracy of the minority app classification result, further analysis may be performed based on the target vector of each minority app in each remaining class, and minority apps that are misclassified within a class are filtered out and moved into the (M+1)-th class. Specifically, as an embodiment, after step S3 is executed, the method further includes:

step S4, determining each minority app class whose ratio of intra-class distance and inter-class distance is greater than or equal to the preset ratio as a category to be processed; for each category to be processed, setting an initial radius and increasing the radius by a preset radius increment step size, acquiring the minority app density and the minority app recall rate of the category at each radius, determining the minority apps that actually belong to the category based on the distributions of app density and app recall rate at different radii, and changing the label of the remaining minority apps to the (M+1)-th class.

As an embodiment, the step S4 may further include:

step S41, obtaining the minority app recall rate and minority app density of each category to be processed at different radii based on the following formulas:

R＝xδ

where rec is the minority app recall rate at the current radius, Density is the minority app density at the current radius, N is the number of minority apps in the category to be processed, R is the value of the current radius, the counter takes values starting from 1 (1, 2, 3, …), dist_i is an indicator value, x_i is the target vector of the i-th minority app in the category to be processed, and the center vector is that of the category to be processed;

step S42, taking the radius of the category to be processed as the abscissa, obtaining a first curve with the minority app recall rate as the ordinate, and obtaining a second curve with the minority app density as the ordinate;

step S43, obtaining a target radius based on the first curve and the second curve, and determining the minority apps located within the target radius of the category to be processed as the minority apps actually belonging to the category.
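Steps S41 and S42 can be illustrated with a short sketch. The formulas for rec and Density are not reproduced in the text, so everything here is an assumption made for illustration: dist_i is taken as 1 when the Euclidean distance from x_i to the center vector is at most R and 0 otherwise, rec is the sum of dist_i divided by N, Density is the sum of dist_i divided by R, and the radius is stepped as a multiple of a step size `delta`. The function name and parameters are hypothetical.

```python
import math

def recall_density_curves(vectors, center, delta, steps):
    """Return (radii, recall, density) for one category to be processed.

    ASSUMPTIONS (not confirmed by the source): indicator = 1 if the app
    lies within radius R of the center; recall = count / N;
    density = count / R.
    """
    n = len(vectors)
    distances = [math.dist(v, center) for v in vectors]
    radii, recall, density = [], [], []
    for k in range(1, steps + 1):
        r = k * delta                                  # radius at step k
        inside = sum(1 for d in distances if d <= r)   # sum of indicators
        radii.append(r)
        recall.append(inside / n)
        density.append(inside / r)
    return radii, recall, density
```

Plotting `recall` against `radii` gives the first curve and `density` against `radii` gives the second curve.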

Further, the step S43 may further include:

step S431, acquiring a preset recall rate threshold, and judging whether an elbow point exists in the segment of the second curve over the abscissa range in which the first curve is greater than or equal to the recall rate threshold; if so, taking the radius value corresponding to the elbow point as the target radius; otherwise, executing step S432;

it should be noted that the elbow point of the segment of the second curve may be determined directly with an existing elbow point determination method, which is not described here.

Step S432, performing a staged operation on the first curve and the second curve based on the recall rate threshold, and determining the target radius.
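Step S431 leaves the elbow point method open, so the sketch below swaps in one common heuristic: the elbow is the interior point farthest from the straight line joining the endpoints of the segment. This is a stand-in chosen for illustration, not the method of the patent. The fallback of step S432 is not specified in enough detail to sketch, so the function simply returns None in that case. All names are hypothetical.

```python
import math

def elbow_index(xs, ys):
    """Index of the interior point farthest from the chord, or None."""
    if len(xs) < 3:
        return None
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return None
    best, best_distance = None, 0.0
    for i in range(1, len(xs) - 1):
        # Perpendicular distance from point i to the chord.
        d = abs(dy * (xs[i] - xs[0]) - dx * (ys[i] - ys[0])) / norm
        if d > best_distance:
            best, best_distance = i, d
    return best

def target_radius(radii, recall, density, recall_threshold):
    """Radius at the elbow of the density curve where recall >= threshold.

    Returns None when no elbow is found (the case handled by step S432).
    """
    keep = [i for i, r in enumerate(recall) if r >= recall_threshold]
    xs = [radii[i] for i in keep]
    ys = [density[i] for i in keep]
    k = elbow_index(xs, ys)
    return None if k is None else xs[k]
```

Apps whose distance to the class center is within the returned radius would then be kept in the category, and the rest relabeled to the (M+1)-th class.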

Through the second embodiment, the accuracy of the classification result of each category can be further improved on the basis of the first embodiment, so that the accuracy of the classification result of the Xiaozhong app is improved.

Example III,

It should be noted that the technical details of the third embodiment may be implemented on the basis of the first or second embodiment. To further improve the accuracy of the minority app classification result, the class of each minority app in the (M+1)-th class may be further determined. When the processor executes the computer program, the following steps are further implemented:

step S01, determining the minority app classes whose ratio of intra-class distance and inter-class distance in step S3 is greater than or equal to the preset ratio as categories to be processed;

it can be understood that the minority app classes filtered out in step S3 are classes whose classification result is inaccurate and unreliable. Therefore, when the class of a minority app in the (M+1)-th class is further determined, only the minority app classes whose ratio of intra-class distance and inter-class distance is greater than or equal to the preset ratio are taken as categories to be processed, which improves the accuracy of determining the class of the minority apps in the (M+1)-th class.

Step S02, obtaining a first distance between the target vector of each minority app in the (M+1)-th class and the center vector of each category to be processed, obtaining the minimum of the first distances, and comparing the minimum with a preset first distance threshold; if the minimum of the first distances is smaller than the first distance threshold, changing the class label of the minority app to the label of the category to be processed corresponding to the minimum of the first distances.

It can be understood that, through steps S01-S02, minority apps that were wrongly placed in the (M+1)-th class can be corrected, which further improves the accuracy of the minority app classification result.
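Steps S01-S02 amount to a thresholded nearest-center assignment, which can be sketched as follows. The sketch assumes Euclidean distance for the first distance; the function name and the dictionary layout are hypothetical.

```python
import math

def reassign_from_extra_class(extra_apps, centers, first_distance_threshold):
    """Return {app_id: new_label} for apps moved out of class M+1.

    extra_apps: dict mapping app id to the target vector of an app
                currently labeled as class M+1.
    centers:    dict mapping the label of each category to be processed
                to its center vector.
    """
    moved = {}
    for app_id, vector in extra_apps.items():
        # First distance to every center; keep the nearest one.
        label, distance = min(
            ((c, math.dist(vector, ctr)) for c, ctr in centers.items()),
            key=lambda pair: pair[1],
        )
        if distance < first_distance_threshold:
            moved[app_id] = label
    return moved
```

Apps that are not close enough to any center are left in the (M+1)-th class.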

The technical details of the first embodiment, the second embodiment and the third embodiment can be implemented by the following embodiments.

The installation amounts of minority apps and popular apps differ greatly in magnitude, so a preset installation amount can be determined from the installation amounts of apps, and popular apps and minority apps can be divided based on the preset installation amount. As an embodiment, the processor, when executing the computer program, further performs the following step:

step S100, acquiring an app installation amount distribution graph based on the installation amounts of the full set of apps, wherein the full set of apps comprises popular apps and minority apps, and determining the installation amount corresponding to the turning point at which the app installation amount distribution graph drops suddenly as the preset installation amount.
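Step S100 does not say how the turning point is located. One possible heuristic, shown purely as an assumption, is to sort the installation amounts in descending order and pick the point with the largest relative drop between neighbors; the amount just before that drop is then used as the preset installation amount. The function name is hypothetical and only positive installation amounts are considered.

```python
def preset_installation_amount(install_counts):
    """Installation amount just before the largest relative drop.

    ASSUMPTION: the 'sudden drop' is the largest relative decrease
    between neighboring apps in the descending ranking.
    """
    counts = sorted((c for c in install_counts if c > 0), reverse=True)
    best_drop, preset = 0.0, None
    for higher, lower in zip(counts, counts[1:]):
        drop = (higher - lower) / higher
        if drop > best_drop:
            best_drop, preset = drop, higher
    return preset

counts = [3_000, 1_000_000, 5_000, 900_000, 4_000, 800_000]
preset = preset_installation_amount(counts)
# Apps with an installation amount below the preset are minority apps.
minority = sorted(c for c in counts if c < preset)
```

On real long-tailed data a smoothed histogram or a dedicated change-point method may be more robust than this neighbor-to-neighbor comparison.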

As an embodiment, when executing the computer program, the processor further implements the following steps for constructing the target vector generation model:

step S10, selecting a second sample minority app from the minority app sequence table corresponding to a sample user;

step S20, selecting, based on a preset time window, a window sequence including the second sample minority app from the minority app sequence table as a positive sample sequence;

step S30, randomly extracting minority apps from the first database and combining them with the second sample minority app to construct a negative sample sequence, where the number of minority apps in the positive sample sequence is equal to the number of minority apps in the negative sample sequence;

it should be noted that, because the number of minority apps is huge, the probability that randomly extracted minority apps happen to form a positive sample sequence is very small. Randomly extracting minority apps therefore meets the requirement for constructing negative sample sequences, and the construction is very efficient. To further improve the accuracy of negative sample sequence construction, as another embodiment, in step S30 the minority apps adjacent to the second sample minority app may first be determined from the positive sample sequences generated from the sample user's sequences, and then minority apps other than those adjacent apps are randomly extracted from the first database and combined with the second sample minority app to construct the corresponding negative sample sequence.

Step S40, constructing the sample input vector corresponding to each positive sample sequence and each negative sample sequence based on the minority app initial vector mapping table;

as an embodiment, the minority app initial vector is a 1 × m-dimensional vector, the number of minority apps in the positive sample sequence of the preset time window is n, and step S40 further includes:

step S401, converting each app id into its corresponding initial vector according to the order of the apps in each sample sequence, where the initial vector of each app id forms one row of the sample input vector, finally obtaining an n × m-dimensional vector.

Step S50, training a preset target vector generation model framework based on the sample labels corresponding to the positive and negative samples and the sample input vectors corresponding to the positive and negative sample sequences, and generating the target vector generation model.
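The sample construction in steps S10 to S401 can be sketched as below. Several simplifications are assumptions made only for illustration: the preset time window is approximated by a fixed number n of consecutive apps, the negative sequence keeps the second sample minority app at the same position it has in the positive sequence, and the apps adjacent to it are excluded from random extraction (the variant described as another embodiment). All names and the toy data are hypothetical.

```python
import random

def positive_window(sequence, index, n):
    """n consecutive apps from a user's sequence that include position index."""
    start = max(0, min(index - n // 2, len(sequence) - n))
    return sequence[start:start + n]

def negative_sequence(target, positive, all_apps, rng):
    """Same length as the positive sequence, with random non-adjacent apps."""
    pos = positive.index(target)
    adjacent = set(positive[max(0, pos - 1):pos + 2])   # includes target
    pool = [a for a in all_apps if a not in adjacent]
    others = rng.sample(pool, len(positive) - 1)
    return others[:pos] + [target] + others[pos:]

def input_matrix(sequence, initial_vectors):
    """Step S401: one row per app id, giving an n x m input."""
    return [list(initial_vectors[app]) for app in sequence]

rng = random.Random(0)
all_apps = list("abcdefghij")
# Randomly initialized 1 x m initial vectors (m = 4 here).
initial_vectors = {a: [rng.uniform(-1, 1) for _ in range(4)] for a in all_apps}
user_sequence = ["a", "b", "c", "d", "e"]
positive = positive_window(user_sequence, 2, 3)
negative = negative_sequence("c", positive, all_apps, rng)
positive_input = input_matrix(positive, initial_vectors)
```

Each positive sequence would be paired with the label 10 and each negative sequence with the label 01 before training.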

As an embodiment, the target vector generation model framework is a multi-layer neural network model. Each app in the positive and negative sample sequences corresponds to an independent input channel, and the number of input channels is equal to the size of the preset window; that is, if the number of minority apps in the positive sample sequence of the preset time window is n, the multi-layer neural network model has n independent input channels, and the order of the input channels is consistent with the order of the minority apps in the sample sequence. Each layer of the neural network is configured with corresponding first weight values, which are the model parameters of the target vector generation model that need to be updated. The last layer of the neural network includes two neurons; correspondingly, the positive sample label is 10 and the negative sample label is 01. Step S50 includes:

step S501, inputting the positive and negative sample data of the current batch into the target vector generation model framework, and obtaining a pair of probability prediction values for each sample;

step S502, obtaining a current loss function value based on the probability predicted values of all samples in the current batch and the sample labels, judging whether the current loss function value meets the preset model training end condition, if so, executing step S504, otherwise, executing step S503;

as an embodiment, the model training end condition is that the loss function is smaller than a preset first loss threshold, or that the loss function is smaller than a preset second loss threshold and remains unchanged, where the first loss threshold is smaller than the second loss threshold.

Step S503, obtaining current parameter adjustment values based on the partial derivatives of the current loss function, updating the first weight values of each layer of the neural network based on the current parameter adjustment values, taking the positive and negative sample data of the next batch as the positive and negative sample data of the current batch, and returning to step S501;

step S504, generating the target vector generation model by taking the input channels of the current target vector generation model framework as the input and the vector produced by the layer preceding the last layer of the neural network as the output.
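The training end condition checked in step S502 can be sketched as a small predicate. The text does not define what "remains unchanged" means, so the sketch assumes it means the loss has varied by no more than `tolerance` over the last `patience` batches; both parameters and the function name are hypothetical.

```python
def training_should_end(loss_history, first_threshold, second_threshold,
                        patience=3, tolerance=1e-6):
    """End training when the loss is below the first threshold, or below
    the second threshold and flat over the last `patience` values.

    ASSUMPTION: 'remains unchanged' is read as a plateau within tolerance.
    """
    current = loss_history[-1]
    if current < first_threshold:
        return True
    recent = loss_history[-patience:]
    plateau = (len(recent) == patience
               and max(recent) - min(recent) <= tolerance)
    return current < second_threshold and plateau
```

In a training loop this check would run after each batch loss is computed, branching to step S504 when it returns True and to step S503 otherwise.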

As an example, the step S2 includes:

step S21, inputting n minority app initial vectors into the app target vector generation model in a preset order, and generating an n × m-dimensional output vector;

step S22, determining the j-th row of the n × m-dimensional output vector as the target vector of the j-th minority app in the preset order, where j takes values from 1 to n.

It can be understood that the target vectors of n minority apps can be obtained simultaneously through the target vector generation model. If only the target vector of one preset target minority app is needed, the target minority app and (n-1) randomly extracted minority apps may be input into the app target vector generation model, and the vector at the position of the target minority app in the output vector is determined as its target vector.
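The row-to-app mapping of steps S21-S22 can be sketched as follows. The trained model is represented by any callable that maps an n × m input to an n × m output; the `model` argument and the function name are hypothetical placeholders, not an API from the source.

```python
def target_vectors(app_ids, initial_vectors, model):
    """Map each app in the preset order to its row of the model output.

    app_ids:         n app ids in the preset order.
    initial_vectors: dict mapping app id to its 1 x m initial vector.
    model:           placeholder callable, n x m rows in, n x m rows out.
    """
    inputs = [initial_vectors[app] for app in app_ids]   # n x m input
    outputs = model(inputs)                              # n x m output
    # Row j of the output is the target vector of the j-th app.
    return {app: outputs[j] for j, app in enumerate(app_ids)}
```

To obtain the target vector of a single app, the same function can be called with that app plus (n-1) randomly chosen apps, keeping only the entry for the app of interest.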

It should be noted that the app target vector generation model and the app classification model in the embodiments of the present invention are both trained on data corresponding to sample minority apps. The data of minority apps are basically of the same order of magnitude and differ little in quantity, so the trained app target vector generation model and app classification model are accurate and reliable, which further improves the accuracy of minority app classification.

The minority app sequence table comprises any one or more of a minority app installation sequence table, a minority app uninstallation sequence table and a minority app active sequence table. It can be understood that the more types of sequence tables are included, the higher the accuracy but the larger the computation; the fewer types are included, the smaller the computation but the lower the accuracy. The sequence tables can therefore be chosen according to specific application requirements. When different sequence combinations are selected, the sample data generated from different sequences correspond to different loss weights, subject to the requirement that the loss weight of the installation sequence is greater than that of the uninstallation sequence, and the loss weight of the uninstallation sequence is greater than that of the active sequence. As an embodiment, the minority app sequence table includes a minority app installation sequence table, a minority app uninstallation sequence table and a minority app active sequence table. The minority app installation sequence table is used for storing minority app installation sequence records, each including a user id, minority app ids arranged in order of installation time, and the installation time corresponding to each minority app id; the minority app uninstallation sequence table is used for storing minority app uninstallation sequence records, each including a user id, minority app ids arranged in order of uninstallation time, and the uninstallation time corresponding to each minority app id; the minority app active sequence table is used for storing minority app active sequence records, each including a user id, minority app ids arranged in order of active time, and the active time corresponding to each minority app id;

in step S20, a positive sample sequence obtained from the minority app installation sequence table is a first positive sample sequence, a positive sample sequence obtained from the minority app uninstallation sequence table is a second positive sample sequence, and a positive sample sequence obtained from the minority app active sequence table is a third positive sample sequence; a first loss weight, a second loss weight and a third loss weight are set for the first, second and third positive sample sequences respectively, where the first loss weight > the second loss weight > the third loss weight;

in step S502, a current loss function value is obtained based on the predicted probability values, the sample labels, the first loss weight, the second loss weight, and the third loss weight of all samples of the current batch.
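A weighted batch loss of this kind can be sketched as below. The source does not name the loss function, so cross-entropy over the two output neurons is an assumption, as is the choice to give each sample the weight of the sequence table it was built from. The weight values, the source keys and the function name are hypothetical; only the ordering install > uninstall > active comes from the text.

```python
import math

def weighted_batch_loss(predictions, labels, sources, loss_weights, eps=1e-12):
    """Mean cross-entropy over a batch, scaled per sample by its source weight.

    predictions: list of (p_positive, p_negative) pairs from the last layer.
    labels:      list of (1, 0) for label 10 or (0, 1) for label 01.
    sources:     list of keys into loss_weights, one per sample.
    """
    total = 0.0
    for (p_pos, p_neg), (y_pos, y_neg), source in zip(predictions, labels, sources):
        # ASSUMPTION: cross-entropy; probabilities are clamped to avoid log(0).
        cross_entropy = -(y_pos * math.log(max(p_pos, eps))
                          + y_neg * math.log(max(p_neg, eps)))
        total += loss_weights[source] * cross_entropy
    return total / len(predictions)

# Hypothetical weights satisfying first > second > third.
loss_weights = {"install": 1.0, "uninstall": 0.6, "active": 0.3}
```

With these weights, a misprediction on a sample built from the installation sequence contributes more to the batch loss than the same misprediction on a sample built from the active sequence.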

Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A minority app classification system, characterized in that,

the system comprises a pre-built first database and second database, a pre-trained app classification model and app target vector generation model, a memory storing a computer program, and a processor; the first database stores minority app feature information records, each including a minority app id and a plurality of corresponding items of app feature information; the second database stores a minority app sequence table and a minority app initial vector mapping table, and the minority app sequence table comprises one or more of a minority app installation sequence table, a minority app uninstallation sequence table and a minority app active sequence table; the app classification model is obtained by training on the feature information corresponding to first sample minority apps in the first database; the app target vector generation model is obtained by training on the minority app sequence records corresponding to sample user ids in the second database and the minority app initial vector mapping table, wherein a minority app is an app whose installation amount is smaller than a preset installation amount;

2. The system of claim 1,

when the processor is executing the computer program, the following steps are also implemented:

3. The system of claim 1,

the processor, when executing the computer program, further implements the steps of:

4. The system of claim 1,

5. The system of claim 4,

the minority app initial vector is a 1 × m-dimensional vector, the number of minority apps in the positive sample sequence of the preset time window is n, and step S40 further includes:

6. The system of claim 5,

the step S2 includes:

step S22, determining the j-th row of the n × m-dimensional output vector as the target vector of the j-th minority app in the preset order.

7. The system of claim 4,

the target vector generation model framework is a multi-layer neural network model, each app in the positive and negative sample sequences corresponds to an independent input channel, the number of input channels is equal to the size of the preset window, each layer of the neural network is configured with corresponding first weight values, the last layer of the neural network comprises two neurons, the positive sample label is 10, the negative sample label is 01, and step S50 includes:

8. The system of claim 7,

the model training ending condition comprises that a loss function is smaller than a preset first loss threshold value or the loss function is smaller than a preset second loss threshold value and is kept unchanged, and the first loss threshold value is smaller than the second loss threshold value.

9. The system of claim 7,

the minority app sequence table comprises a minority app installation sequence table, a minority app uninstallation sequence table and a minority app active sequence table; the minority app installation sequence table is used for storing minority app installation sequence records, each including a user id, minority app ids arranged in order of installation time, and the installation time corresponding to each minority app id; the minority app uninstallation sequence table is used for storing minority app uninstallation sequence records, each including a user id, minority app ids arranged in order of uninstallation time, and the uninstallation time corresponding to each minority app id; the minority app active sequence table is used for storing minority app active sequence records, each including a user id, minority app ids arranged in order of active time, and the active time corresponding to each minority app id;