CN112417086B

CN112417086B - Data processing method, device, server and storage medium

Info

Publication number: CN112417086B
Application number: CN202011382318.2A
Authority: CN
Inventors: 胡俊杨
Original assignee: Shenzhen Hefei Technology Co ltd
Current assignee: Shenzhen Hefei Technology Co ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2024-02-27
Anticipated expiration: 2040-11-30
Also published as: CN112417086A

Abstract

The application discloses a data processing method, a device, a server and a storage medium. The data processing method comprises the following steps: acquiring application program usage records of a plurality of users and the user category of each user; based on the application program usage records of the plurality of users, generating a one-hot vector corresponding to each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector; for the user corresponding to the application program usage record used to generate each first one-hot vector, generating a one-hot vector corresponding to the user category as a third one-hot vector; training an initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model; and, based on the target model, obtaining an application vector corresponding to each application program and a user category vector corresponding to the application program. The method makes it possible to obtain more representative application program vectors.

Description

Data processing method, device, server and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method, apparatus, server, and storage medium.

Background

With rapid advances in technology and living standards, the internet has penetrated many aspects of life. As the internet becomes widespread, the number of application programs keeps increasing. To facilitate certain services, features of application programs may be extracted; however, the accuracy of the features extracted by related approaches still needs improvement.

Disclosure of Invention

In view of the above problems, the present application proposes a data processing method, apparatus, server, and storage medium.

In a first aspect, an embodiment of the present application provides a data processing method, where the method includes: acquiring application program usage records of a plurality of users and the user category of each user; based on the application program usage records of the plurality of users, generating a one-hot vector corresponding to each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector, wherein the second one-hot vector is the one-hot vector corresponding to any other application program in the window in which the application program corresponding to that first one-hot vector is located; for the user corresponding to the application program usage record used to generate each first one-hot vector, generating a one-hot vector corresponding to the user category as a third one-hot vector; training an initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model; and acquiring an application vector corresponding to each application program and a user category vector corresponding to the application program based on the target model.

In a second aspect, embodiments of the present application provide a data processing apparatus, the apparatus including a data acquisition module, a first generation module, a second generation module, a model training module and a vector acquisition module. The data acquisition module is used for acquiring application program usage records of a plurality of users and the user category of each user; the first generation module is configured to generate, based on the application program usage records of the plurality of users, a one-hot vector corresponding to each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector, where the second one-hot vector is the one-hot vector corresponding to any other application program in the window in which the application program corresponding to that first one-hot vector is located; the second generation module is used for generating, for the user corresponding to the application program usage record used to generate each first one-hot vector, a one-hot vector corresponding to the user category as a third one-hot vector; the model training module is used for training an initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model; and the vector acquisition module is used for acquiring an application vector corresponding to each application program and a user category vector corresponding to the application program based on the target model.

In a third aspect, an embodiment of the present application provides a server, including: one or more processors; a memory; one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs configured to perform the data processing method provided in the first aspect.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored therein program code that is callable by a processor to perform the data processing method provided in the first aspect above.

According to the scheme provided by the present application, application program usage records of a plurality of users and the user category of each user are acquired; based on the application program usage records of the plurality of users, a one-hot vector corresponding to each application program is generated as a first one-hot vector, together with a second one-hot vector corresponding to each first one-hot vector, the second one-hot vector being the one-hot vector corresponding to any other application program in the window in which the application program corresponding to that first one-hot vector is located; for the user corresponding to the application program usage record used to generate each first one-hot vector, a one-hot vector corresponding to the user category is generated as a third one-hot vector; the initial language model is trained with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, and the trained language model is taken as the target model; and the application vector corresponding to each application program and the user category vector corresponding to the application program are obtained based on the target model. In this way, vectors of application programs can be generated from users' application usage records, so that the obtained vectors capture the usage behavior patterns of the application programs; in addition, user category vectors can be obtained during model training, providing feature information about the application programs from another dimension.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 illustrates a flow chart of a data processing method according to one embodiment of the present application.

FIG. 2 shows a flow chart of a data processing method according to another embodiment of the present application.

Fig. 3 is a flowchart illustrating step S220 in a data processing method according to another embodiment of the present application.

FIG. 4 is a schematic diagram of an initial language model according to an embodiment of the present application.

Fig. 5 shows another schematic structural diagram of the initial language model provided in the embodiment of the present application.

FIG. 6 illustrates a block diagram of a data processing apparatus according to one embodiment of the present application.

Fig. 7 is a block diagram of a server for performing a data processing method according to an embodiment of the present application.

Fig. 8 is a storage unit for storing or carrying program code for implementing a data processing method according to an embodiment of the present application.

Detailed Description

In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.

At present, the internet is widely used and the number of application programs keeps increasing. To facilitate certain services, features of application programs may be obtained; for example, in scenarios such as risk control and risk prediction, the population using an application program can be analyzed based on the features of the application program, and risk management and control can then be carried out.

In the related art, vector features of an application program may be extracted according to the time sequence in which application programs are used. In one approach, feature vectors of application programs may be extracted based on the Item2Vec technique. Item2Vec schemes are mostly based on the continuous bag-of-words (CBOW) model and the Skip-gram model. The bag-of-words model that CBOW builds on is a simplified representation used in natural language processing and information retrieval (IR): a text (e.g., a sentence or document) is represented as the bag (multiset) of its words, ignoring grammar and even word order but keeping multiplicity. In the Word2Vec algorithm, the specific operation is to use a shallow neural network and, through the task of predicting a target word from its context, obtain the parameter matrix of the first fully connected layer of the network; this parameter matrix holds dense word vectors of a given dimension for all words in the corpus. The Skip-gram model uses the same network structure as the CBOW model, but the training process is completely different: Skip-gram predicts the words within a certain window around a target center word, so for one center word a number of sample pairs corresponding to the window size can be generated.

For the CBOW model, since the center word is predicted from several surrounding words, the parameter update during training is averaged over them, so rare words with a low frequency of occurrence are trained poorly. The Skip-gram model trains rare words better, but because a sample pair is formed for every surrounding word within the window, its training time grows with the window size and is longer than that of CBOW. Schemes that obtain feature vectors of application programs based on the CBOW and Skip-gram models assume by default that the composition and distribution of words are the same across the whole corpus. In practice, however, users of different ages and genders have different habits of using applications, so the vectors obtained with such schemes are not specific to these groups, and the extracted feature vectors of application programs are not accurate enough.

In order to solve the above problems, the inventor proposes the data processing method, device, server and storage medium of the embodiments of the present application, which can generate vectors of application programs based on users' application usage records, so that the obtained vectors capture the usage behavior patterns of the application programs; in addition, user category vectors can be obtained during model training, providing feature information about the application programs from another dimension. The specific data processing method is described in detail in the following embodiments.

Referring to fig. 1, fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application. In a specific embodiment, the data processing method is applied to the data processing apparatus 400 shown in fig. 6 and to the server 100 (fig. 7) configured with the data processing apparatus 400. The specific flow of this embodiment is described below taking a server as an example. It will be understood that the server in this embodiment may be a single physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud computing, cloud storage, CDN (Content Delivery Network) and an artificial intelligence platform. Where the data processing method provided in the embodiment of the present application is performed by a server cluster or a distributed system formed by a plurality of physical servers, different steps of the method may be performed by different physical servers, or may be performed in a distributed manner by servers built on the distributed system. For example, the step of acquiring application usage records of a plurality of users may be performed by one server alone, and the subsequent steps by another server.

The flow shown in fig. 1 is described in detail below. The data processing method may specifically include the following steps:

Step S110: an application usage record for a plurality of users is obtained, along with a user category for each user.

In the embodiment of the application, the server may acquire application program usage records of a plurality of users and the user category of each of those users, so as to generate application vectors that characterize the features of the application programs.

In some embodiments, the application usage record may describe how the user used application programs during historical use of the terminal. The application usage record may include one or more of the following: which applications were used, the time at which each application was started, the duration of use of each application, and the frequency of use of each application. Of course, the specific content of the application usage record is not limited, and it may include further usage information, for example the categories of applications used on different dates. The duration of use of an application can be understood as the length of time during which the application is in a running state, which may be a foreground or a background running state; the frequency of use of an application can be understood as the ratio of the number of times the application is used within a period to the length of that period. One use can be understood as one run, lasting from the start of the application until its main process is killed, where the main process is the entry point of the instructions executed by the application.
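As a rough illustration of the frequency of use defined above (number of uses within a period divided by the length of the period), the following sketch may help. The `UsageEvent` record layout, its field names, and the choice of hours as the time unit are assumptions made for this example only; the application does not prescribe them.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageEvent:
    """One run of an application (hypothetical record layout)."""
    app: str
    start: datetime
    duration_s: float  # time spent in a running state, in seconds


def usage_frequency(events, period_start, period_end):
    """Return uses per hour for each application started in [period_start, period_end)."""
    hours = (period_end - period_start).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("period must have positive length")
    counts = Counter(e.app for e in events if period_start <= e.start < period_end)
    return {app: n / hours for app, n in counts.items()}


# Hypothetical data: three runs within a two-hour period.
events = [
    UsageEvent("app_a", datetime(2020, 11, 30, 9, 0), 120.0),
    UsageEvent("app_a", datetime(2020, 11, 30, 9, 30), 60.0),
    UsageEvent("app_b", datetime(2020, 11, 30, 10, 15), 300.0),
]
freq = usage_frequency(
    events, datetime(2020, 11, 30, 9, 0), datetime(2020, 11, 30, 11, 0)
)
```

Here `freq` maps each application to its number of runs per hour over the chosen period.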

In some embodiments, the user category of a user may be obtained by classifying each user according to a preset classification rule. The server can classify a user according to the user's basic information and the classification rule to obtain the user category. The classification rule may include classification by gender, by age, by occupation, by location, by education level, and the like; the specific classification rule is not limited and may be determined according to the requirements of the actual scenario. For example, in scenarios that focus on exploring application usage behavior across occupations, classification by occupation may be selected.

In some embodiments, the server may collect application usage records of the users from the terminals of the respective users, and it may be appreciated that the terminals of the users may record the usage of the applications in the process of being used, so as to generate application usage records; the server may also obtain application usage records for the user, as well as the user's category, from other servers. The specific manner in which the server obtains the application usage record of the user and the user category may not be limited.

Step S120: based on the application program usage records of the plurality of users, generating a one-hot vector corresponding to each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector, wherein the second one-hot vector is the one-hot vector corresponding to any other application program in the window in which the application program corresponding to that first one-hot vector is located.

In the embodiment of the application, after acquiring the application program usage records of the users and the user categories of the users, the server may process them to obtain training samples. Specifically, the server may generate, from the application program usage records of the plurality of users, a one-hot vector corresponding to each application program as a first one-hot vector. In addition, for each first one-hot vector, a corresponding second one-hot vector is generated, which is the one-hot vector corresponding to any other application program in the window in which the application program corresponding to the first one-hot vector is located. That is, the second one-hot vector corresponds to an application program surrounding the application program of the first one-hot vector; in natural language processing terms, it plays the role of a word (the application program of the second one-hot vector) within a certain window around the target word (the application program of the first one-hot vector). A one-hot vector represents a special combination of bits in which exactly one bit is 1 and all other bits are 0; it is called one-hot because only a single 1 (the "hot" bit) is present.

In some implementations, the server may generate, from each user's usage record, at least one first one-hot vector for each application program and the second one-hot vector corresponding to that first one-hot vector. For each user, the number of first one-hot vectors generated for each application program is not limited and may be one or more.

Step S130: for the user corresponding to the application program usage record used to generate each first one-hot vector, generating a one-hot vector corresponding to the user category as a third one-hot vector.

In this embodiment of the present application, in order to learn the usage behavior patterns of application programs for different user categories, the server may, while generating the training samples, also generate a one-hot vector of the corresponding user category for each first one-hot vector and use it as the third one-hot vector. It will be appreciated that each first one-hot vector is generated from the application usage record of one user, so each first one-hot vector corresponds to a user, and a one-hot vector can be generated for that user's user category.
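A minimal sketch of how a one-hot vector for a user's category might be produced is given below. The category names, the user identifiers and the helper names `one_hot` and `third_one_hot` are hypothetical and serve only to illustrate the idea.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical user categories (e.g. the result of classifying by occupation).
categories = ["student", "office_worker", "retired"]
category_index = {name: i for i, name in enumerate(categories)}

# Hypothetical mapping from each user to that user's category.
user_category = {"user_a": "student", "user_b": "retired"}


def third_one_hot(user):
    """One-hot vector of the category of the user whose record produced a sample."""
    return one_hot(category_index[user_category[user]], len(categories))


vec_a = third_one_hot("user_a")
vec_b = third_one_hot("user_b")
```

Every training sample derived from a given user's record would carry the same category vector, which is what lets the model separate usage patterns by user category.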

Step S140: training the initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model.

In the embodiment of the present application, after the training samples are determined, the initial language model may be trained on them to obtain a trained language model, which is used as the target model. Specifically, the server may train the initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data. It can be appreciated that each first one-hot vector and its corresponding third one-hot vector form a training sample pair with the corresponding second one-hot vector, and the initial language model is trained on a large number of such sample pairs to obtain the trained model.

In the embodiment of the present application, the input of the initial language model is the first one-hot vector and its corresponding third one-hot vector, and the output is a predicted application program within the window around the application program corresponding to the first one-hot vector. The model parameters are updated continuously according to the difference between the output of the initial language model and the output data of the sample pair, until a model whose output differs only slightly from the output data is obtained; this model is the finally trained model.
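The training loop described above can be sketched in plain Python as follows. This is only an illustrative toy, not the model of the application: the structure (one weight matrix per input, summed into a hidden layer, followed by a softmax output), the class name `TinyModel`, the hyperparameters and the sample data are all assumptions. Indices are used in place of explicit one-hot vectors, since multiplying a one-hot vector by a weight matrix simply selects one row.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyModel:
    """(app index, category index) -> probability of each app being in the window."""

    def __init__(self, n_apps, n_cats, dim, seed=0):
        rng = random.Random(seed)
        self.app_emb = random_matrix(n_apps, dim, rng)  # selected by the first one-hot vector
        self.cat_emb = random_matrix(n_cats, dim, rng)  # selected by the third one-hot vector
        self.out_w = random_matrix(dim, n_apps, rng)    # hidden layer -> scores over apps

    def forward(self, app, cat):
        # Assumption: the two projections are summed to form the hidden layer.
        hidden = [a + c for a, c in zip(self.app_emb[app], self.cat_emb[cat])]
        n_out = len(self.out_w[0])
        logits = [sum(h * row[j] for h, row in zip(hidden, self.out_w)) for j in range(n_out)]
        return hidden, softmax(logits)

    def train_step(self, app, cat, target, lr):
        """One SGD step on the cross-entropy loss; returns the loss before the update."""
        hidden, probs = self.forward(app, cat)
        d_logits = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
        d_hidden = [sum(w * d for w, d in zip(row, d_logits)) for row in self.out_w]
        for k, row in enumerate(self.out_w):
            for j, d in enumerate(d_logits):
                row[j] -= lr * hidden[k] * d
            self.app_emb[app][k] -= lr * d_hidden[k]
            self.cat_emb[cat][k] -= lr * d_hidden[k]
        return -math.log(probs[target])


# Hypothetical samples: (first one-hot index, third one-hot index, second one-hot index).
samples = [(0, 0, 1), (1, 0, 0), (1, 0, 2), (2, 1, 1), (1, 1, 2)]
model = TinyModel(n_apps=3, n_cats=2, dim=4)
epoch_losses = []
for _ in range(300):
    losses = [model.train_step(a, c, t, lr=0.2) for a, c, t in samples]
    epoch_losses.append(sum(losses) / len(losses))
```

The mean loss per epoch in `epoch_losses` reflects the "difference between the output result and the output data" that drives the parameter updates; a production system would more likely rely on an existing deep learning framework than on hand-written gradients.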

Step S150: and acquiring an application vector corresponding to each application program and a user category vector corresponding to the application program based on the target model.

The target model obtained in the embodiment of the present application outputs, from the first one-hot vector of an application program and the third one-hot vector of the user category corresponding to that first one-hot vector, the predicted one-hot vector of an application program within the window around that application program. The target model has therefore learned the usage behavior patterns of application programs for different user categories; that is, for each application program it has learned both the application features and the user category features of the users who use that application program. Because of these learned features, the target model can accurately generate, from the input first and third one-hot vectors, the one-hot vector of an application program in the window around the application program corresponding to the first one-hot vector. The learned features are stored in the model parameters, so the model parameters of the target model contain application vectors that characterize the application features and user category vectors that characterize the user features. The application vector of each application program and the user category vector corresponding to each application program can therefore be obtained from the model parameters in the model file.
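To make the idea of reading vectors out of the model parameters concrete, the sketch below assumes, in line with the Word2Vec description in the background, that the application vectors are the rows of an input-side weight matrix. The matrix values, the application names and the helper `one_hot_times_matrix` are hypothetical.

```python
def one_hot_times_matrix(one_hot, matrix):
    """Multiply a one-hot row vector by a weight matrix given as a list of rows."""
    cols = len(matrix[0])
    return [sum(one_hot[i] * matrix[i][j] for i in range(len(matrix))) for j in range(cols)]


# Hypothetical trained parameters: one row per application, two dimensions each.
app_weights = [
    [0.2, -0.1],  # app_a
    [0.5, 0.3],   # app_b
    [-0.4, 0.8],  # app_c
]
apps = ["app_a", "app_b", "app_c"]

# Reading the rows directly gives the dense application vectors.
app_vectors = {name: app_weights[i] for i, name in enumerate(apps)}

# Multiplying by a one-hot vector selects the same row.
selected = one_hot_times_matrix([0, 1, 0], app_weights)
```

Under the same assumption, user category vectors would be read from the rows of the weight matrix attached to the category input.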

The data processing method provided by the embodiment of the application can explore the usage behavior patterns of application programs for different user categories and, based on these patterns, obtain more representative application vectors. At the same time, a dense vector representation of the user category dimension is obtained in the same training run. When the application vectors and user category vectors are used in further applications, they can replace one-hot vectors, carry better mathematical meaning, and are better suited to being fed directly into a model. For example, when characterizing a user's application usage behavior on this basis, the naturally additive property of the application vectors can be exploited to obtain better features: for the application programs used in a past period, the vectors obtained with this scheme can be weighted by duration of use and summed to give a vector that characterizes the user's application usage behavior in that period.
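The duration-weighted sum mentioned above might look as follows. Normalizing the weights so that they sum to 1 is an assumption of this sketch, as are the function name `user_behavior_vector` and the example vectors.

```python
def user_behavior_vector(usage, app_vectors):
    """Combine application vectors, weighting each by its share of total usage time.

    usage: list of (application name, duration of use in minutes).
    app_vectors: mapping from application name to its dense vector.
    """
    total = sum(duration for _, duration in usage)
    if total <= 0:
        raise ValueError("total usage duration must be positive")
    dim = len(next(iter(app_vectors.values())))
    result = [0.0] * dim
    for app, duration in usage:
        weight = duration / total  # assumption: weights are normalized to sum to 1
        for k, value in enumerate(app_vectors[app]):
            result[k] += weight * value
    return result


# Hypothetical two-dimensional application vectors and one user's usage in a period.
app_vectors = {"app_a": [1.0, 0.0], "app_b": [0.0, 1.0]}
usage = [("app_a", 30.0), ("app_b", 10.0)]
behavior = user_behavior_vector(usage, app_vectors)
```

With these inputs, `app_a` accounts for three quarters of the usage time, so the resulting vector lies three quarters of the way towards the vector of `app_a`.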

Referring to fig. 2, fig. 2 is a flow chart illustrating a data processing method according to another embodiment of the present application. The data processing method is applied to the server, and will be described in detail with respect to the flow shown in fig. 2, and the data processing method specifically includes the following steps:

Step S210: an application usage record for a plurality of users is obtained, along with a user category for each user.

In the embodiment of the present application, step S210 may refer to the content of the foregoing embodiment, which is not described herein.

Step S220: based on the application program usage records of the plurality of users, generating a one-hot vector corresponding to each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector, wherein the second one-hot vector is the one-hot vector corresponding to any other application program in the window in which the application program corresponding to that first one-hot vector is located.

In some embodiments, referring to fig. 3, the server generating, based on the application usage records of the users, a one-hot vector of each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector, may include:

Step S221: generating a plurality of application use sequences corresponding to the application usage record of each user according to the usage times of the application programs in the application usage records of the plurality of users.

The server may generate a plurality of application usage sequences of each user by using time as a dimension according to the application usage record of each user.

As one way, the server generates a plurality of application use sequences corresponding to the application use records of each user according to the use time of the application in the application use records of the plurality of users, and may include:

according to the usage times of the application programs in the application usage records of the plurality of users, sorting the application programs present in each application usage record in order of usage time to obtain a first use sequence corresponding to the application usage record of each user; and, according to a preset segmentation rule, segmenting the first use sequence corresponding to the application usage record of each user to obtain a plurality of second use sequences, which serve as the plurality of application use sequences corresponding to the application usage record of that user.

Illustratively, the application usage record of user A is: application A used at 9:18, application B used at 9:48, application C used at 10:18, application D used at 10:48, application E used at 11:18, and application F used at 11:48. Sorting by usage time gives the sequence: application A, application B, application C, application D, application E, application F. If the preset segmentation rule is to segment according to a preset interval duration, and the preset interval duration is 30 minutes, three use sequences can be obtained, and these three sequences are the plurality of application use sequences corresponding to user A. Of course, the above is only an example, and other preset segmentation rules are possible, for example segmenting by daytime and nighttime use, or by use on working days and non-working days.
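The sort-then-segment procedure can be sketched as below. The segmentation rule used here, starting a new sequence whenever the gap between two consecutive uses exceeds a threshold, is only one possible reading of "preset interval duration" and is an assumption of this sketch; the function name `build_use_sequences` and the sample data are likewise hypothetical and are not the example of user A.

```python
from datetime import datetime, timedelta


def build_use_sequences(records, max_gap):
    """Sort one user's records by time and split them where the gap exceeds max_gap.

    records: list of (application name, start time) for a single user.
    max_gap: timedelta; a longer pause between two uses starts a new sequence.
    """
    ordered = sorted(records, key=lambda record: record[1])  # the first use sequence
    sequences, current, previous = [], [], None
    for app, start in ordered:
        if previous is not None and start - previous > max_gap:
            sequences.append(current)
            current = []
        current.append(app)
        previous = start
    if current:
        sequences.append(current)
    return sequences


# Hypothetical, deliberately unsorted records for one user.
day = datetime(2020, 11, 30)
records = [
    ("app_c", day.replace(hour=10, minute=40)),
    ("app_a", day.replace(hour=9, minute=0)),
    ("app_b", day.replace(hour=9, minute=10)),
    ("app_d", day.replace(hour=10, minute=50)),
    ("app_e", day.replace(hour=13, minute=0)),
]
sequences = build_use_sequences(records, max_gap=timedelta(minutes=30))
```

Other segmentation rules, such as day versus night or working day versus non-working day, would replace only the condition that decides when a new sequence begins.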

Step S222: based on the plurality of application use sequences corresponding to the application usage record of each user, generating at least one one-hot vector corresponding to each application program as a first one-hot vector, and taking the one-hot vector corresponding to any other application program in the window corresponding to each first one-hot vector as a second one-hot vector, wherein the window is the target window of the application use sequence centered on the application program corresponding to that first one-hot vector.

In this embodiment, after generating the plurality of application use sequences of the respective users, the server may generate, according to the application use sequences, a one-hot vector corresponding to each application program as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector.

As one way, the server generating, based on the plurality of application use sequences corresponding to the application usage record of each user, at least one one-hot vector corresponding to each application program as a first one-hot vector, and the one-hot vector corresponding to any other application program in the window in which the application program corresponding to each first one-hot vector is located as a second one-hot vector, may include:

generating a first zero vector according to all application programs present in the application usage records of the plurality of users, wherein the number of elements in the first zero vector equals the number of all application programs; generating at least one one-hot vector corresponding to each application program as a first one-hot vector according to the first zero vector and the plurality of application use sequences corresponding to the application usage record of each user; and, for any other application program in the window in which the application program corresponding to each first one-hot vector is located, generating a one-hot vector from the first zero vector as a second one-hot vector.

In this manner, the server may sort all of the above application programs randomly or in a certain order, and first construct a zero vector whose length equals the number of all application programs (i.e. the number of elements is the number of all application programs), where each element corresponds to one application program in the sorting result. For example, if the application programs are application A, application B, application C, application D, application E and application F, sorted in that order, the generated first zero vector is (0,0,0,0,0,0), and its elements correspond in order to application A, application B, application C, application D, application E and application F. To generate the one-hot vector of an application program, the element corresponding to that application program is set to 1; for example, the first one-hot vector generated for application E is (0,0,0,0,1,0).
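The construction just described, a zero vector with one element per application followed by setting a single element to 1, can be sketched as follows. The helper names `build_index`, `zero_vector` and `one_hot_for` are illustrative only.

```python
def build_index(apps):
    """Map each application to its position in the chosen ordering."""
    return {app: position for position, app in enumerate(apps)}


def zero_vector(size):
    """A fresh zero vector with one element per application."""
    return [0] * size


def one_hot_for(app, index):
    """Copy of the zero vector with the element for `app` set to 1."""
    vec = zero_vector(len(index))
    vec[index[app]] = 1
    return vec


# The six applications of the example, in the order A to F.
apps = ["A", "B", "C", "D", "E", "F"]
index = build_index(apps)
vector_e = one_hot_for("E", index)
```

Building a fresh zero vector for every call avoids accidentally sharing one mutable list between several one-hot vectors.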

Illustratively, suppose all applications contained in the application usage records of the plurality of users are application A, application B, application C, application D, application E and application F, sorted in that order. Since the number of applications is 6, a zero vector (0,0,0,0,0,0) of 6 elements can be generated, with each element of the zero vector corresponding to one application in that order. Then, according to an application usage sequence of a user and the zero vector, the first one-hot vectors corresponding to the applications in the application usage sequence and the second one-hot vectors corresponding to each first one-hot vector are generated. For example, if an application usage sequence of user B is application A, application B, application C, application B, application A, and the window size is 2, then for the first application B in the sequence, the other applications in the window where application B is located are application A and application C. For application B, the generated first one-hot vector is (0,1,0,0,0,0); the other applications for which a second one-hot vector corresponding to this first one-hot vector can be generated are application A and application C; for application A, the generated second one-hot vector is (1,0,0,0,0,0), and for application C, the generated second one-hot vector is (0,0,1,0,0,0). As another example, if the application usage sequence of user B is application A, application B, application C, application B, application A, and the window size is 3, then for application C, the other applications in the window are application A and application B. A first one-hot vector (0,0,1,0,0,0) corresponding to application C can therefore be generated, and for this first one-hot vector, the one-hot vector (1,0,0,0,0,0) corresponding to application A and the one-hot vector (0,1,0,0,0,0) corresponding to application B can each be generated as a second one-hot vector corresponding to it.
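
As a minimal sketch of the pair generation described above, the following Python code builds the (first one-hot vector, second one-hot vector) pairs for one application usage sequence. The helper names `one_hot` and `build_pairs` are hypothetical, and the `radius` parameter (the number of positions considered on each side of the center application) is an assumption, since the text does not fix how the window size maps to positions; `radius=1` reproduces the first example above.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    vec = [0] * size  # plays the role of the "first zero vector"
    vec[index] = 1
    return vec


def build_pairs(sequence, apps, radius):
    """Build (first, second) one-hot pairs for one application usage sequence.

    `apps` is the ordered list of all applications; `radius` (an assumption)
    is how many positions on each side of the center application are in the window.
    """
    position = {app: i for i, app in enumerate(apps)}
    pairs = []
    for center_pos, center in enumerate(sequence):
        start = max(0, center_pos - radius)
        stop = min(len(sequence), center_pos + radius + 1)
        for ctx_pos in range(start, stop):
            context = sequence[ctx_pos]
            if ctx_pos == center_pos or context == center:
                continue  # only *other* applications in the window are targets
            first = one_hot(position[center], len(apps))
            second = one_hot(position[context], len(apps))
            pairs.append((first, second))
    return pairs


apps = ["A", "B", "C", "D", "E", "F"]
pairs = build_pairs(["A", "B", "C", "B", "A"], apps, radius=1)
print(pairs[1])  # pair for the first B with neighbour A
print(pairs[2])  # pair for the first B with neighbour C
```

Each pair later becomes one training sample: the first element is model input and the second element is the expected output.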

Step S230: for the user corresponding to the application usage record used to generate each first one-hot vector, generating the one-hot vector corresponding to the user category as a third one-hot vector.

In this embodiment of the present application, when generating the third one-hot vector corresponding to each first one-hot vector for the user corresponding to the application usage record used to generate that first one-hot vector, a zero vector may be constructed first. Specifically, all categories included in the user categories are sorted randomly or in a certain order, and a zero vector whose length is the number of all categories (i.e., the number of elements in the vector is the number of all categories) is generated as a second zero vector; the one-hot vector of the user category corresponding to the application usage sequence used to generate each first one-hot vector is then generated according to the second zero vector.

Illustratively, if the user categories include a, b, c and d, and the sorted order is d, c, b and a, a second zero vector (0,0,0,0) can be constructed, whose elements correspond to d, c, b and a, respectively. If one of the first one-hot vectors is generated according to one of the application usage sequences of user C, and the user category of user C is d, the third one-hot vector corresponding to that first one-hot vector is (1,0,0,0).
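
The category example above can be sketched in a few lines of Python. The function name `category_one_hot` is hypothetical and is used only for illustration.

```python
def category_one_hot(category, ordered_categories):
    """Return the one-hot vector of `category` under the given ordering."""
    vec = [0] * len(ordered_categories)  # plays the role of the "second zero vector"
    vec[ordered_categories.index(category)] = 1
    return vec


ordered_categories = ["d", "c", "b", "a"]  # sorted order from the example
third = category_one_hot("d", ordered_categories)
print(third)  # [1, 0, 0, 0]
```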

Step S240: inputting the first one-hot vector and the third one-hot vector corresponding to the first one-hot vector into the initial language model to obtain an output vector output by the initial language model.

In this embodiment of the present application, an initial language model may be pre-configured and trained to obtain a target model which, given the one-hot vector of an input application and the one-hot vector of the corresponding user category, outputs predicted one-hot vectors of the applications surrounding that application.

In some implementations, referring to fig. 4, the initial language model 500 may include an input module 510, a vector conversion module 520, a vector cross module 530, a vector stitching module 540, and a classifier 550, which are connected in sequence. The input module 510 is configured to input a first one-hot vector and the third one-hot vector corresponding to the first one-hot vector; the vector conversion module 520 is configured to convert the input first one-hot vector into an application vector representing features of the application, and to convert the input third one-hot vector into a user category vector representing features of the user category; the vector cross module 530 is configured to cross the application vector and the user category vector obtained by the vector conversion module 520, so as to obtain a plurality of crossed vectors; the vector stitching module 540 is configured to stitch the plurality of crossed vectors obtained by the vector cross module 530 to obtain a stitched vector; the classifier 550 is configured to output a classification output vector according to the stitched vector obtained by the vector stitching module 540. The output vector is the model's prediction of the one-hot vector corresponding to another application in the window surrounding the application that corresponds to the input first one-hot vector; according to the output vector, the predicted other applications in that window can be determined.

In the training process of the initial language model, the server can adjust model parameters of the initial language model according to the output result of the initial language model until the model parameters meet expectations. Therefore, in one training process, the first independent heat vector and the corresponding third independent heat vector in the training sample can be input into the initial language model to obtain the output vector output by the initial language model.

For the initial language model introduced above, inputting the first unique heat vector and the third unique heat vector corresponding to the first unique heat vector into the initial language model to obtain an output vector output by the initial language model may include: inputting the first independent heat vector and the corresponding third independent heat vector into the input module; the vector conversion module converts the first unique hot vector into an application vector and converts the third unique hot vector into a user category vector; the vector crossing module carries out crossing processing on the application vector and the user category vector obtained by the vector conversion module to obtain a plurality of crossed vectors; the vector splicing module splices the plurality of vectors after the cross processing to obtain spliced vectors; and the classifier outputs a corresponding output vector according to the spliced vector.

In some implementations, referring to fig. 5, the input module 510 of the initial language model 500 may include: a first input unit 511 and a second input unit 512. The first input unit 511 is used for inputting a first independent heat vector; the second input unit 512 is used for inputting a third independent heat vector.

For the initial language model 500, inputting the first unique heat vector and the corresponding third unique heat vector into the input module may include: the first independent heat vector is input to the first input unit, and the third independent heat vector is input to the second input unit.

In addition, the vector conversion module 520 of the initial language model 500 may include a first conversion unit 521 and a second conversion unit 522. The first conversion unit 521 is configured to convert the input first one-hot vector into an application vector; the second conversion unit 522 is configured to convert the input third one-hot vector into a user category vector. The first conversion unit 521 and the second conversion unit 522 may each be implemented by a fully connected layer; after the initial language model 500 is trained, the parameters of the fully connected layers contain the application vector corresponding to each application and the user category vector corresponding to the application.

For the initial language model 500, the vector conversion module 520 converts the first unique hot vector of the input into an application vector and the third unique hot vector of the input into a user category vector, may include:

the first conversion unit 521 converts the input first one-hot vector into an application vector, and the second conversion unit 522 converts the input third one-hot vector into a user category vector.

In some embodiments, the vector cross module 530 performs cross processing on the application vector and the user category vector obtained by the vector conversion module, which may include:

the vector cross module 530 respectively performs direct splicing, item-by-item multiplication and item-by-item addition on the application vector and the user category vector obtained by the vector conversion module 520, to obtain a first cross vector, a second cross vector and a third cross vector. Direct splicing joins the two vectors into one vector; for example, if the application vector is A and the user category vector is B, the spliced vector is [A, B]. Item-by-item multiplication multiplies the corresponding elements of the two vectors to obtain a new vector, and item-by-item addition adds the corresponding elements of the two vectors to obtain a new vector.

In this embodiment, the vectors corresponding to the application and to the user category in a sample are crossed by direct splicing, item-by-item addition and item-by-item multiplication, so that dual information about the user category and the features of the application is obtained, which guides the model toward more specific and representative vectors.

In this embodiment, the vector stitching module 540 stitches the plurality of vectors after the intersecting processing to obtain a stitched vector, including: the vector stitching module 540 stitches the first, second and third intersecting vectors to obtain a stitched vector.

In the above embodiment, the classifier may be implemented by a single fully connected layer whose number of output units equals the number of all applications, so that it can output the one-hot vectors of different applications, that is, the predicted one-hot vectors of the other applications in the window surrounding the application corresponding to the input one-hot vector.
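
To make the data flow through the conversion, cross, stitching and classifier stages concrete, here is a minimal forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the patented implementation: the names `matvec`, `softmax` and `forward` are hypothetical, the layers carry no bias terms, the application vector and user category vector are assumed to share the same length d so that item-by-item operations are defined, and the softmax normalization of the classifier output is an assumption the text does not specify.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: a bias-free fully connected layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Normalise scores into probabilities (softmax is an assumption here)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(first, third, w_app, w_cat, w_out):
    app_vec = matvec(w_app, first)   # vector conversion: application vector
    cat_vec = matvec(w_cat, third)   # vector conversion: user category vector
    spliced = app_vec + cat_vec                          # first cross vector
    product = [a * b for a, b in zip(app_vec, cat_vec)]  # second cross vector
    summed = [a + b for a, b in zip(app_vec, cat_vec)]   # third cross vector
    stitched = spliced + product + summed                # vector stitching
    return softmax(matvec(w_out, stitched)), stitched


# Toy sizes: 3 applications, 2 user categories, vector length d = 2.
w_app = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]                       # d x 3
w_cat = [[1.0, 2.0], [3.0, 4.0]]                                 # d x 2
w_out = [[0.01 * (r + c) for c in range(8)] for r in range(3)]   # 3 x 4d
output, stitched = forward([0, 1, 0], [1, 0], w_app, w_cat, w_out)
print(stitched)
print(output)
```

With vector length d, the stitched vector has 4d elements (2d from splicing, d from multiplication, d from addition), which fixes the input width of the classifier layer.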

Step S250: obtaining a loss value between the output vector and the second one-hot vector corresponding to the first one-hot vector.

In this embodiment of the present application, during training, after the output vector is obtained from the input data of a training sample, a loss value between the output vector and the second one-hot vector (i.e., the ground-truth data) corresponding to the first one-hot vector may be obtained. As one approach, the loss value may be determined based on the distance between the two vectors.

Step S260: based on the loss value, iteratively updating the model parameters of the initial language model until the loss value meets a preset condition, and obtaining the trained language model as a target model.

In this embodiment of the present application, after the loss value between the output vector and the second one-hot vector corresponding to the input first one-hot vector is calculated, the gradient of each model parameter may be calculated from the loss value, and each model parameter is updated by gradient descent based on the calculated gradient until the error converges, so as to obtain the trained language model as the target model.
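
The loss computation and parameter update can be illustrated with a deliberately reduced sketch: a single bias-free linear layer trained by gradient descent on a squared-distance loss. The squared Euclidean distance is one possible reading of "distance between the vectors", and the names `squared_distance` and `train`, the learning rate, and the stopping threshold are all assumptions made for this example.

```python
def squared_distance(output, target):
    """Squared Euclidean distance between the output and the target vector."""
    return sum((o - t) ** 2 for o, t in zip(output, target))


def train(weights, features, target, learning_rate, threshold, max_steps):
    """Gradient descent on one linear layer until the loss drops below `threshold`."""
    losses = []
    for _ in range(max_steps):
        output = [sum(w * x for w, x in zip(row, features)) for row in weights]
        loss = squared_distance(output, target)
        losses.append(loss)
        if loss < threshold:  # stand-in for the preset stopping condition
            break
        for r, row in enumerate(weights):
            # d(loss)/d(row[j]) = 2 * (output[r] - target[r]) * features[j]
            scale = 2 * (output[r] - target[r])
            for j, x in enumerate(features):
                row[j] -= learning_rate * scale * x
    return losses


weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 applications, 2 input features
losses = train(weights, [1.0, 0.5], [0, 1, 0], 0.1, 1e-6, 1000)
print(len(losses), losses[-1])
```

In the full model the same loop would update the conversion layers as well as the classifier, with the gradients obtained by backpropagation rather than by the closed form used here.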

Step S270: acquiring, based on the target model, the application vector corresponding to each application and the user category vector corresponding to the application.

In this embodiment of the present application, since the vector conversion module 520 in the initial language model converts the first one-hot vector into the application vector and converts the third one-hot vector into the user category vector, the application vector and the user category vector corresponding to each application may be obtained based on the parameter matrix of the vector conversion module in the target model obtained by training.

For the initial language model shown in fig. 5, the application vector of each application can be obtained based on the parameter matrix of the first conversion unit in the target model obtained by training, and the user category vector corresponding to each application vector can be obtained based on the parameter matrix of the second conversion unit in the target model obtained by training. Taking the first conversion unit as an example, when the first conversion unit is implemented by a fully connected layer, the fully connected layer has a parameter W, which is a matrix of size d×c, where c is the size of the set of all applications (for example, c is 3 if the set of all applications is {A, B, C}), and d is the preset vector length, which can be adjusted as required. During forward propagation of the neural network, the fully connected layer internally multiplies the one-hot vector, which has c elements, by the d×c parameter matrix. Since only one element of the one-hot vector is 1 and the rest are 0, for the application corresponding to the i-th element of the one-hot vector, the result of the multiplication is the i-th column of the matrix; that is, the i-th column of the parameter matrix is the application vector corresponding to that application. Similarly, the user category vector is obtained from the parameter matrix of the second conversion unit.
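
The column-lookup argument can be checked with a small sketch. The matrix values and the helper names `fully_connected` and `column` are hypothetical; the point is only that multiplying a d×c matrix by a one-hot vector returns the column selected by the position of the 1.

```python
def fully_connected(w, one_hot_vec):
    """Bias-free fully connected layer: multiply the d x c matrix by the input."""
    return [sum(wj * xj for wj, xj in zip(row, one_hot_vec)) for row in w]


def column(w, i):
    """Return the i-th column of a matrix stored as a list of rows."""
    return [row[i] for row in w]


w = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]  # d = 2, c = 3 (illustrative values)
apps = ["A", "B", "C"]

# Reading the columns directly gives every application vector without a forward pass.
application_vectors = {app: column(w, i) for i, app in enumerate(apps)}
print(application_vectors["B"])  # [0.2, 0.5]
```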

In some embodiments, the data processing method may further include: generating an index relation between each application and its corresponding application vector and user category vector, and storing the index relation, wherein the index relation is used for looking up the application vector and the user category vector corresponding to an application. It can be understood that the application vector of each application and its corresponding user category vector can be saved, and an index relation between each application and its application vector and user category vector can be formed, so that in subsequent use the application vector of each application and its corresponding user category vector can be conveniently queried by using the index as the key of a dictionary.

In some implementations, the obtained application vectors and user category vectors may be used for risk-control modeling. For example, the obtained application vectors can be used to characterize the features of a user's application usage: because vectors can naturally be added, the application vectors of the applications used by the user can be superimposed to obtain better features. For instance, for the applications used during a past period, a vector characterizing the user's behavior in that period can be obtained by a weighted sum of the obtained application vectors according to usage time, for use in subsequent risk-control modeling.
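
A minimal sketch of the usage-time weighting follows. The function name `user_behavior_vector` is hypothetical, and normalizing the weights so that they sum to 1 is an assumption; the text only states that the vectors are weighted and summed according to usage time.

```python
def user_behavior_vector(usage_seconds, application_vectors):
    """Weighted sum of application vectors, weighted by each app's share of usage time."""
    total = sum(usage_seconds.values())
    length = len(next(iter(application_vectors.values())))
    result = [0.0] * length
    for app, seconds in usage_seconds.items():
        weight = seconds / total  # normalisation is an assumption
        for k, value in enumerate(application_vectors[app]):
            result[k] += weight * value
    return result


application_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0]}  # illustrative values
vector = user_behavior_vector({"A": 30, "B": 10}, application_vectors)
print(vector)  # [0.75, 0.25]
```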

According to the data processing method provided by the embodiments of the present application, the following is achieved. The traditional item2vec algorithm treats all corpora in the corpus equally, so the relations among items in different types of corpora are averaged or diluted, and the resulting vectors are neither specific nor representative; by adding the corpus category (i.e., the user category) as input information, the model can learn the relations among corpus elements within different categories. In addition, the vectors corresponding to the application and to the user category in a sample are crossed by direct splicing, item-by-item addition and item-by-item multiplication, so that dual information about the user category and the features of the application is obtained, which guides the model toward more specific and representative vectors.

Referring to fig. 6, a block diagram of a data processing apparatus 400 according to an embodiment of the present application is shown. The data processing apparatus 400 is applied to the above-described server and includes: a data acquisition module 410, a first generation module 420, a second generation module 430, a model training module 440, and a vector acquisition module 450. The data acquisition module 410 is configured to obtain application usage records of a plurality of users and the user category of each user; the first generation module 420 is configured to generate, based on the application usage records of the plurality of users, a one-hot vector corresponding to each application as a first one-hot vector, and a second one-hot vector corresponding to each first one-hot vector, where the second one-hot vector is the one-hot vector corresponding to any other application in the window where the application corresponding to the first one-hot vector is located; the second generation module 430 is configured to generate, for the user corresponding to the application usage record used to generate each first one-hot vector, the one-hot vector corresponding to the user category as a third one-hot vector; the model training module 440 is configured to train the initial language model by using the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, so as to obtain a trained language model as a target model; the vector acquisition module 450 is configured to obtain, based on the target model, the application vector corresponding to each application and the user category vector corresponding to the application.

In some implementations, the model training module 440 may be specifically configured to: inputting the first independent heat vector and the third independent heat vector corresponding to the first independent heat vector into the initial language model to obtain an output vector output by the initial language model; acquiring a loss value between the output vector and a second independent heat vector corresponding to the first independent heat vector; and based on the loss value, iteratively updating the model parameters of the initial language model until the loss value meets a preset difference condition, and obtaining the trained language model as a target model.

As one way, the initial language model includes an input module, a vector conversion module, a vector intersection module, a vector stitching module, and a classifier. The model training module 440 inputs the first unique heat vector and the third unique heat vector corresponding to the first unique heat vector into the initial language model, and obtains an output vector output by the initial language model, which may include: inputting the first independent heat vector and the corresponding third independent heat vector into the input module; the vector conversion module converts the first unique hot vector into an application vector and converts the third unique hot vector into a user category vector; the vector crossing module carries out crossing processing on the application vector and the user category vector obtained by the vector conversion module to obtain a plurality of crossed vectors; the vector splicing module splices the plurality of vectors after the cross processing to obtain spliced vectors; the classifier is used for outputting a corresponding output vector according to the spliced vector.

In this embodiment, the vector acquisition module 450 may be specifically configured to: and acquiring an application vector and a user category vector corresponding to each application program based on the parameter matrix of the vector conversion module.

In this embodiment, the input module includes a first input unit and a second input unit. Inputting the first one-hot vector and the corresponding third one-hot vector into the input module may include: inputting the first one-hot vector to the first input unit, and inputting the third one-hot vector to the second input unit. The vector conversion module includes a first conversion unit and a second conversion unit. The vector conversion module converting the input first one-hot vector into an application vector and converting the input third one-hot vector into a user category vector may include: the first conversion unit converts the input first one-hot vector into an application vector, and the second conversion unit converts the input third one-hot vector into a user category vector.

In this embodiment, the vector cross module performing cross processing on the application vector and the user category vector obtained by the vector conversion module includes: the vector cross module respectively performs direct splicing, item-by-item multiplication and item-by-item addition on the application vector and the user category vector obtained by the vector conversion module, to obtain a first cross vector, a second cross vector and a third cross vector. The vector stitching module stitching the plurality of crossed vectors to obtain a stitched vector includes: the vector stitching module stitches the first cross vector, the second cross vector and the third cross vector to obtain the stitched vector.

In some embodiments, the first generation module 420 may be specifically configured to: generating a plurality of application use sequences corresponding to the application use records of each user according to the use time of the application in the application use records of the plurality of users; and generating at least one independent heat vector corresponding to each application program as a first independent heat vector based on a plurality of application use sequences corresponding to application program use records of each user, and taking the independent heat vector corresponding to any other application program in a window corresponding to each first independent heat vector as a second independent heat vector, wherein the window is a target window of the application use sequence, which is centered by the application program corresponding to each first independent heat vector.

In this embodiment, the first generation module 420 generating a plurality of application usage sequences corresponding to the application usage record of each user according to the usage times of the applications in the application usage records of the plurality of users may include: sorting, according to the usage times of the applications in the application usage records of the plurality of users, the applications existing in each application usage record in order of usage time, to obtain a first usage sequence corresponding to the application usage record of each user; and segmenting, according to a preset segmentation rule, the first usage sequence corresponding to the application usage record of each user, to obtain a plurality of second usage sequences as the plurality of application usage sequences corresponding to the application usage record of each user.
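
The sort-then-segment procedure can be sketched as follows. The preset segmentation rule is not specified in this passage, so the rule used here (start a new sequence when the gap between two consecutive usages exceeds `max_gap`) is purely an assumption, as are the function name `build_usage_sequences` and the `(timestamp, application)` record layout.

```python
def build_usage_sequences(records, max_gap):
    """Sort (timestamp, app) records by time, then split where the time gap is large."""
    ordered = sorted(records, key=lambda record: record[0])  # the first usage sequence
    sequences = []
    previous_time = None
    for timestamp, app in ordered:
        # Hypothetical segmentation rule: a long pause starts a new sequence.
        if previous_time is None or timestamp - previous_time > max_gap:
            sequences.append([])
        sequences[-1].append(app)
        previous_time = timestamp
    return sequences


records = [(50, "C"), (0, "A"), (10, "B"), (500, "B"), (510, "A")]
sequences = build_usage_sequences(records, max_gap=60)
print(sequences)  # [['A', 'B', 'C'], ['B', 'A']]
```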

In this embodiment, the first generating module 420 generates, based on the application usage records of each user and corresponding application usage sequences, at least one unique hot vector corresponding to each application as a first unique hot vector, and unique hot vectors corresponding to any other application in a window where the application corresponding to each first unique hot vector is located as a second unique hot vector, which may include: generating a first zero vector according to all application programs existing in application program usage records of the plurality of users, wherein the number of elements contained in the first zero vector is the same as the number of all application programs; generating at least one independent heat vector corresponding to each application program as a first independent heat vector according to the first zero vector and a plurality of application use sequences corresponding to application program use records of each user; and generating a single thermal vector as a second single thermal vector according to the first zero vector for any other application program in the window where the application program corresponding to each first single thermal vector is located.

In some embodiments, the data processing apparatus 400 may further include an index generation module. The index generation module is used for generating index relations between each application program and the corresponding application vector and the corresponding user category vector after the application vector and the corresponding user category vector corresponding to each application program are acquired based on the target model, and storing the index relations, wherein the index relations are used for searching the application vector and the user category vector corresponding to the application program.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.

In several embodiments provided herein, the coupling of the modules to each other may be electrical, mechanical, or other.

In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.

In summary, according to the scheme provided by the present application, application usage records of a plurality of users and the user category of each user are obtained. Based on the application usage records of the plurality of users, a one-hot vector corresponding to each application is generated as a first one-hot vector, together with a second one-hot vector corresponding to each first one-hot vector, where the second one-hot vector is the one-hot vector corresponding to any other application in the window where the application corresponding to the first one-hot vector is located. For the user corresponding to the application usage record used to generate each first one-hot vector, the one-hot vector corresponding to the user category is generated as a third one-hot vector. The initial language model is trained by using the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, and the trained language model is taken as a target model. Based on the target model, the application vector corresponding to each application and the user category vector corresponding to the application are obtained. In this way, vectors of applications can be generated from the users' application usage records, and the obtained vectors can capture the usage patterns of the applications; in addition, user category vectors are obtained during model training, so that feature information of the applications is obtained from another dimension.

Referring to fig. 7, a block diagram of a server according to an embodiment of the present application is shown. The server 100 may be a physical server, a cloud server, or the like, capable of running applications. The server 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, wherein the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more program(s) configured to perform the method as described in the foregoing method embodiments.

The processor 110 may include one or more processing cores. The processor 110 connects the various parts of the server 100 through various interfaces and lines, and performs the functions of the server 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware in at least one of the following forms: digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 110 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.

The memory 120 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the server 100 in use (e.g., phonebook, audio-video data, chat log data), and the like.

Referring to fig. 8, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 800 has stored therein program code which can be invoked by a processor to perform the methods described in the foregoing method embodiments.

The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 comprises a non-transitory computer-readable storage medium. The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, one of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of data processing, the method comprising:

acquiring application program use records of a plurality of users and user categories of each user;

sorting, according to the use time of the application programs in the application program use records of the plurality of users, the application programs existing in the application program use records in order of use time, to obtain a first use sequence corresponding to the application program use record of each user;

according to a preset segmentation rule, segmenting a first use sequence corresponding to the application program use record of each user to obtain a plurality of second use sequences serving as a plurality of application use sequences corresponding to the application program use record of each user;

generating a first zero vector according to all application programs existing in application program usage records of the plurality of users, wherein the number of elements contained in the first zero vector is the same as the number of all application programs;

generating, according to the first zero vector and the plurality of application use sequences corresponding to the application program use records of each user, at least one one-hot vector corresponding to each application program as a first one-hot vector;

generating, according to the first zero vector, a one-hot vector for any other application program in the window in which the application program corresponding to each first one-hot vector is located, as a second one-hot vector, wherein the second one-hot vector is the one-hot vector corresponding to any other application program in that window;

generating, for the user corresponding to the application program usage record used for generating each first one-hot vector, a one-hot vector corresponding to the user category as a third one-hot vector;

training an initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model;

and acquiring, based on model parameters of the target model, application vectors of application features corresponding to each application program and corresponding user category vectors, wherein the application vectors of the application features corresponding to each application program are used for representing the usage behavior rules of each application program.
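By way of non-limiting illustration, the sample-generation steps recited in claim 1 (sorting by use time, segmenting, and generating the first, second and third one-hot vectors) may be sketched as follows. The `Record` type, the function names, the fixed-length segmentation rule and the window size are assumptions made only for this sketch and are not specified in the present application.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions for this sketch.
Record = namedtuple("Record", ["user", "app", "use_time"])


def one_hot(index, size):
    """Copy of a zero vector of length `size` with position `index` set to 1."""
    vec = [0] * size  # plays the role of the "first zero vector"
    vec[index] = 1
    return vec


def build_samples(records, user_category, segment_len=3, window=1):
    """Return (apps, categories, samples); each sample is (first, third, second)."""
    apps = sorted({r.app for r in records})
    categories = sorted(set(user_category.values()))
    app_idx = {a: i for i, a in enumerate(apps)}
    cat_idx = {c: i for i, c in enumerate(categories)}

    samples = []
    for user in sorted({r.user for r in records}):
        own = sorted((r for r in records if r.user == user),
                     key=lambda r: r.use_time)
        seq = [r.app for r in own]  # first use sequence, ordered by use time
        # Second use sequences: fixed-length segments (an assumed rule).
        segments = [seq[i:i + segment_len]
                    for i in range(0, len(seq), segment_len)]
        third = one_hot(cat_idx[user_category[user]], len(categories))
        for seg in segments:
            for pos, app in enumerate(seg):
                first = one_hot(app_idx[app], len(apps))
                lo = max(0, pos - window)
                hi = min(len(seg), pos + window + 1)
                for ctx in range(lo, hi):
                    if ctx != pos:  # every other position inside the window
                        second = one_hot(app_idx[seg[ctx]], len(apps))
                        samples.append((first, third, second))
    return apps, categories, samples
```

In this sketch each first one-hot vector is paired once with the third one-hot vector of its user and once with each other application in its window, so a single application may yield several training samples.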

2. The method of claim 1, wherein training the initial language model with the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model, comprises:

inputting the first one-hot vector and its corresponding third one-hot vector into the initial language model to obtain an output vector output by the initial language model;

acquiring a loss value between the output vector and the second one-hot vector corresponding to the first one-hot vector;

and based on the loss value, iteratively updating the model parameters of the initial language model until the loss value meets a preset difference condition, and obtaining the trained language model as a target model.
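By way of non-limiting illustration, the iterative update recited in claim 2 may be sketched with a deliberately simplified stand-in model. The single softmax layer over the concatenated inputs, the cross-entropy loss, the learning rate, the threshold and the epoch limit are all assumptions made for this sketch; they are not the initial language model of the present application.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, lr=0.5, loss_threshold=0.1, max_epochs=500):
    """samples: list of (first, third, second) one-hot triples.

    Stops when the mean cross-entropy loss of an epoch falls below
    `loss_threshold` (the assumed "preset difference condition").
    """
    n_in = len(samples[0][0]) + len(samples[0][1])
    n_out = len(samples[0][2])
    weights = [[0.0] * n_out for _ in range(n_in)]  # model parameters
    loss = float("inf")
    epochs = 0
    for epoch in range(max_epochs):
        epochs = epoch + 1
        total = 0.0
        for first, third, second in samples:
            x = first + third  # concatenated input
            logits = [sum(x[i] * weights[i][j] for i in range(n_in))
                      for j in range(n_out)]
            out = softmax(logits)  # output vector
            total -= sum(t * math.log(o) for t, o in zip(second, out))
            for i in range(n_in):
                if x[i]:
                    for j in range(n_out):
                        # Gradient of cross-entropy w.r.t. the logits is out - target.
                        weights[i][j] -= lr * (out[j] - second[j]) * x[i]
        loss = total / len(samples)
        if loss < loss_threshold:
            break
    return weights, loss, epochs
```

The epoch limit is only a safeguard for the sketch; the stopping criterion of interest is the comparison of the loss value against the preset condition.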

3. The method of claim 2, wherein the initial language model includes an input module, a vector conversion module, a vector crossing module, a vector splicing module, and a classifier, and inputting the first one-hot vector and its corresponding third one-hot vector into the initial language model to obtain an output vector output by the initial language model comprises:

inputting the first one-hot vector and its corresponding third one-hot vector into the input module;

the vector conversion module converts the first one-hot vector into an application vector and converts the third one-hot vector into a user category vector;

the vector crossing module carries out crossing processing on the application vector and the user category vector obtained by the vector conversion module to obtain a plurality of crossed vectors;

the vector splicing module splices the plurality of vectors after the cross processing to obtain spliced vectors;

and the classifier outputs a corresponding output vector according to the spliced vector.

4. The method of claim 3, wherein the acquiring, based on the target model, of an application vector and a user category vector corresponding to each application program comprises:

and acquiring an application vector and a user category vector corresponding to each application program based on the parameter matrix of the vector conversion module.
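By way of non-limiting illustration, the reason the vectors can be read from a parameter matrix is that multiplying a one-hot vector by a matrix selects one row of that matrix. The function name and the example values below are assumptions made for this sketch.

```python
def convert(one_hot_vec, matrix):
    """Multiply a one-hot row vector by a parameter matrix."""
    width = len(matrix[0])
    return [sum(h * row[j] for h, row in zip(one_hot_vec, matrix))
            for j in range(width)]


# Hypothetical trained parameter matrix: one row per application program.
app_matrix = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

# The one-hot vector of the application at index 1 selects row 1.
app_vector = convert([0, 1, 0], app_matrix)
```

Under this assumption, the application vector of the i-th application program is simply the i-th row of the corresponding parameter matrix, and likewise for the user category vector.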

5. The method of claim 3, wherein the input module comprises a first input unit and a second input unit, and inputting the first one-hot vector and its corresponding third one-hot vector into the input module comprises:

inputting the first one-hot vector to the first input unit, and inputting the third one-hot vector to the second input unit;

the vector conversion module comprises a first conversion unit and a second conversion unit, and the vector conversion module converting the first one-hot vector into an application vector and converting the third one-hot vector into a user category vector comprises:

the first conversion unit converts the first one-hot vector into an application vector, and the second conversion unit converts the third one-hot vector into a user category vector.

6. The method of claim 3, wherein the vector crossing module performing cross processing on the application vector and the user category vector obtained by the vector conversion module to obtain a plurality of crossed vectors comprises:

the vector crossing module respectively performs direct splicing, element-wise multiplication and element-wise addition on the application vector and the user category vector obtained by the vector conversion module to obtain a first cross vector, a second cross vector and a third cross vector;

the vector splicing module splicing the plurality of crossed vectors to obtain a spliced vector comprises:

the vector splicing module splices the first cross vector, the second cross vector and the third cross vector to obtain the spliced vector.
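By way of non-limiting illustration, the forward pass through the conversion, crossing and splicing modules and the classifier recited in claims 3, 5 and 6 may be sketched as follows. The sketch assumes that the application vector and the user category vector have the same dimension (so that element-wise operations are defined) and that the classifier is a single linear layer followed by softmax; neither assumption is stated in the claims, and all function and parameter names are hypothetical.

```python
import math


def embed(one_hot_vec, matrix):
    """One-hot row vector times a parameter matrix (selects one row)."""
    width = len(matrix[0])
    return [sum(h * row[j] for h, row in zip(one_hot_vec, matrix))
            for j in range(width)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(first, third, app_matrix, cat_matrix, clf_weights):
    """Return (spliced vector, output vector) for one input pair."""
    app_vec = embed(first, app_matrix)   # first conversion unit
    cat_vec = embed(third, cat_matrix)   # second conversion unit
    cross1 = app_vec + cat_vec                           # direct splicing
    cross2 = [a * c for a, c in zip(app_vec, cat_vec)]   # element-wise product
    cross3 = [a + c for a, c in zip(app_vec, cat_vec)]   # element-wise sum
    spliced = cross1 + cross2 + cross3   # length is 4 times the vector dimension
    # Assumed classifier: one weight vector per output class, then softmax.
    logits = [sum(s * w for s, w in zip(spliced, col)) for col in clf_weights]
    return spliced, softmax(logits)
```

With vectors of dimension d, the direct splice contributes 2d elements and the other two cross vectors contribute d elements each, so the classifier in this sketch receives 4d inputs.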

7. The method according to any one of claims 1-6, wherein after the acquiring, based on the target model, of an application vector corresponding to each application program and a corresponding user category vector, the method further comprises:

and generating an index relation between each application program and the corresponding application vector and the user category vector, and storing the index relation, wherein the index relation is used for searching the application vector and the user category vector corresponding to the application program.
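By way of non-limiting illustration, the index relation of claim 7 may be held in a simple in-memory mapping. How an application program is associated with a particular user category is not detailed here, so the `app_to_category` mapping, the function name and the dictionary keys below are assumptions made only for this sketch.

```python
def build_index(apps, app_matrix, categories, cat_matrix, app_to_category):
    """Map each application name to its application vector and category vector.

    `app_to_category` is a hypothetical mapping from application name to the
    user category whose vector should be stored alongside it.
    """
    cat_row = {c: cat_matrix[i] for i, c in enumerate(categories)}
    return {
        app: {
            "app_vector": app_matrix[i],
            "user_category_vector": cat_row[app_to_category[app]],
        }
        for i, app in enumerate(apps)
    }


index = build_index(
    apps=["chat", "mail"],
    app_matrix=[[0.1, 0.2], [0.3, 0.4]],
    categories=["office", "student"],
    cat_matrix=[[1.0, 0.0], [0.0, 1.0]],
    app_to_category={"chat": "student", "mail": "office"},
)
```

A lookup such as `index["mail"]` then returns both stored vectors for that application program without re-reading the model parameters.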

8. A data processing apparatus, the apparatus comprising: a data acquisition module, a first generation module, a second generation module, a model training module and a vector acquisition module, wherein,

the data acquisition module is used for acquiring application program use records of a plurality of users and user categories of each user;

the first generation module is used for sorting, according to the use time of the application programs in the application program use records of the plurality of users, the application programs existing in the application program use records in order of use time, so as to obtain a first use sequence corresponding to the application program use record of each user; segmenting, according to a preset segmentation rule, the first use sequence corresponding to the application program use record of each user to obtain a plurality of second use sequences serving as a plurality of application use sequences corresponding to the application program use record of each user; generating a first zero vector according to all application programs existing in application program usage records of the plurality of users, wherein the number of elements contained in the first zero vector is the same as the number of all application programs; generating, according to the first zero vector and the plurality of application use sequences corresponding to the application program use records of each user, at least one one-hot vector corresponding to each application program as a first one-hot vector; and generating, according to the first zero vector, a one-hot vector for any other application program in the window in which the application program corresponding to each first one-hot vector is located, as a second one-hot vector, wherein the second one-hot vector is the one-hot vector corresponding to any other application program in that window;

the second generation module is used for generating, for the user corresponding to the application program usage record used for generating each first one-hot vector, a one-hot vector corresponding to the user category as a third one-hot vector;

the model training module is used for training an initial language model by taking the first one-hot vector and its corresponding third one-hot vector as input data and the second one-hot vector as output data, to obtain a trained language model as a target model;

the vector acquisition module is used for acquiring application vectors of application features corresponding to each application program and corresponding user category vectors based on model parameters of the target model, and the application vectors of the application features corresponding to each application program are used for representing the usage behavior rules of each application program.

9. A server, comprising:

one or more processors;

a memory;

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-7.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for executing the method according to any one of claims 1-7.