CN117034181A - Data prediction method, device and readable storage medium

Data prediction method, device and readable storage medium

Info

Publication number
CN117034181A
Authority
CN
China
Prior art keywords
data
identity
target
sample
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210454694.0A
Other languages
Chinese (zh)
Inventor
樊鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Guangzhou Tencent Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Guangzhou Tencent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Guangzhou Tencent Technology Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210454694.0A priority Critical patent/CN117034181A/en
Publication of CN117034181A publication Critical patent/CN117034181A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the present application discloses a data prediction method, a device, and a readable storage medium, relating to the field of artificial intelligence. The method includes: acquiring target data to be predicted, where the target data indicates the identity category of an object; desensitizing the target data to obtain target desensitized data; processing the target desensitized data with each of N data prediction techniques to obtain N identity prediction sets, where each identity prediction set contains the probabilities that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer; and determining a target identity category of the object based on the N identity prediction sets. The embodiment of the present application can improve the accuracy of data prediction and reduce the risk of identity information leakage.

Description

Data prediction method, device and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data prediction method, apparatus, and readable storage medium.
Background
Many current businesses need to determine the identity of an object in order to provide personalized services and improve service accuracy. At present, the identity information of an object is usually predicted manually based on experience, which yields low accuracy and high labor cost. In addition, this manual approach easily leaks user identity information during prediction, so the risk of identity information leakage is high.
Disclosure of Invention
The embodiment of the application provides a data prediction method, data prediction equipment and a readable storage medium, which can improve the accuracy of data prediction and reduce the risk of identity information leakage.
In a first aspect, the present application provides a data prediction method, including:
acquiring target data to be predicted, wherein the target data is used for indicating the identity category of an object;
desensitizing the target data to obtain target desensitized data;
processing the target desensitization data by adopting N data prediction technologies respectively to obtain N identity prediction sets, wherein each identity prediction set comprises the probability that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer;
a target identity class of the object is determined based on the N sets of identity predictions.
In a second aspect, the present application provides a data prediction apparatus comprising:
the data acquisition unit is used for acquiring target data to be predicted, wherein the target data is used for indicating the identity category of the object;
the data desensitization unit is used for carrying out desensitization processing on the target data to obtain target desensitization data;
the data processing unit is used for respectively processing the target desensitization data by adopting N data prediction technologies to obtain N identity prediction sets, wherein each identity prediction set comprises the probability that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer;
And the identity determining unit is used for determining the target identity category of the object based on the N identity prediction sets.
In a third aspect, the present application provides a computer device comprising: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, where the network interface is used for providing data communication functions, the memory is used for storing a computer program, and the processor is used for calling the computer program to cause the computer device including the processor to execute the above data prediction method.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the above-described data prediction method.
In a fifth aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data prediction method provided in the various alternatives in the first aspect of the application.
In the embodiment of the application, the data security can be improved and the risk of leakage of the identity information of the object can be reduced by desensitizing the target data to be predicted. The target desensitization data is processed by using a plurality of data prediction technologies, and the processing modes of the target desensitization data by using each data prediction technology are different, so that the processing results are different, and the identity classification of the object is determined by combining the plurality of data prediction technologies, so that the identity prediction from a plurality of dimensions can be realized, and the accuracy of the data prediction is improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a data prediction system according to an embodiment of the present application;
fig. 2 is an application scenario schematic diagram of a data prediction method provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data prediction method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of average pooling provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of a training model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of processing desensitized data based on a model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of another model-based processing of desensitized data provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of yet another model-based processing of desensitized data provided by an embodiment of the present application;
FIG. 10a is a schematic diagram showing a comparison of model effects according to an embodiment of the present application;
FIG. 10b is a schematic diagram showing a comparison of business effects according to an embodiment of the present application;
FIG. 11 is a flowchart illustrating another data prediction method according to an embodiment of the present application;
FIG. 12 is a flowchart of another data prediction method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a data prediction apparatus according to an embodiment of the present application;
fig. 14 is a schematic diagram of a composition structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Artificial intelligence is a comprehensive discipline covering a wide range of fields and involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
In the embodiment of the present application, all data related to user information (such as the target data) are authorized by the user. The present application relates to machine learning technology in the field of artificial intelligence. Optionally, machine learning techniques may be used to process the target desensitized data, for example, to obtain N identity prediction sets, so that a target identity category of an object (i.e., a user) is determined based on the N identity prediction sets. Optionally, machine learning techniques may also be used to desensitize the target data to obtain the target desensitized data, and so on. The technical solution of the present application can be applied to scenarios in which target data of a user is predicted to determine the identity category of the user. By determining the identity category of an object, targeted advertisement delivery can be realized and the user's click-through rate improved. Alternatively, by determining the identity category of an object, the object's financial loan capability can be predicted, so that loans are released in a targeted way in scenarios such as house or car purchases, reducing losses for lending institutions. The technical solution of the present application can also be applied to other scenarios that require predicting the identity category of an object, which is not limited by the present application. By desensitizing the target data of the object, data security can be improved and the risk of leaking the object's identity information reduced. The target desensitized data is processed with multiple data prediction techniques to determine the identity category of the object, so identity prediction can be performed from multiple dimensions, improving the accuracy of data prediction.
Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a data prediction system according to an embodiment of the present application, as shown in fig. 1, a computer device may perform data interaction with terminal devices, and the number of terminal devices may be one or at least two, for example, when the number of terminal devices is plural, the terminal devices may include terminal device 101a, terminal device 101b, and terminal device 101c in fig. 1. Taking the terminal device 101a as an example, the computer device 102 may obtain target data to be predicted; further, the computer device 102 may desensitize the target data to obtain target desensitized data. Further, the computer device 102 may also process the target desensitized data by using N data prediction techniques, to obtain N identity prediction sets; a target identity class of the object is determined based on the N sets of identity predictions. Alternatively, the computer device 102 may send the target identity class of the object to the terminal device 101a, so that the terminal device 101a performs corresponding service processing based on the target identity class of the object. By desensitizing target data of the object, data security can be improved, and risk of identity information leakage of the object is reduced. The target desensitization data is processed by using various data prediction technologies to determine the identity category of the object, so that the identity prediction can be performed from multiple dimensions, and the accuracy of the data prediction is improved.
It is understood that the computer devices mentioned in the embodiments of the present application include, but are not limited to, terminal devices and servers. In other words, the computer device may be a server or a terminal device, or a system formed by a server and a terminal device together. The above terminal device may be an electronic device, including, but not limited to, a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a vehicle-mounted device, an intelligent voice interaction device, an augmented reality/virtual reality (AR/VR) device, a head-mounted display, a wearable device, a smart speaker, a smart home appliance, an aircraft, a digital camera, a camera, or another mobile internet device (MID) with network access capability. The servers mentioned above may be independent physical servers, server clusters or distributed systems formed by multiple physical servers, or cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, vehicle-road collaboration, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
Further, referring to fig. 2, fig. 2 is a schematic application scenario diagram of a data prediction method according to an embodiment of the present application. As shown in fig. 2, the computer device 20 may obtain target data 21 to be predicted, for example, target data 21 including "name: Zhang San; interests: sports; usual residence: XX City". Further, the computer device 20 may perform desensitization processing on the target data 21 to obtain target desensitized data 22, for example, "name: Zhang*; interests: **; usual residence: **". Further, the computer device 20 may process the target desensitized data 22 using each of N data prediction techniques to obtain N identity prediction sets. For example, N equals 3; that is, the target desensitized data 22 is processed with the data prediction techniques corresponding to 3 target models, obtaining 3 identity prediction sets. Each identity prediction set includes the probabilities that the object belongs to M (e.g., 4) predicted identity categories; for example, the first identity prediction set includes "first identity category 0.3, second identity category 0.6, third identity category 0.55, fourth identity category 0.35"; the second identity prediction set includes "first identity category 0.28, second identity category 0.58, third identity category 0.62, fourth identity category 0.38"; the third identity prediction set includes "first identity category 0.4, second identity category 0.68, third identity category 0.5, fourth identity category 0.4". The target identity category of the object is thus determined based on the 3 identity prediction sets; for example, the target identity category of the object is the second identity category. The first, second, third, and fourth identity categories are 4 different identity categories.
Further, referring to fig. 3, fig. 3 is a flow chart of a data prediction method according to an embodiment of the present application; as shown in fig. 3, the data prediction method may be applied to a computer device, and includes, but is not limited to, the following steps:
s101, obtaining target data to be predicted.
In the embodiment of the present application, the computer device may acquire the target data to be predicted from the terminal device, from local storage, or from a third-party terminal. The target data may be used to indicate the identity category of an object; for example, the target data may include the object's identity information, such as the object's interests, usual residence, zip code, job, the types of applications installed on the object's terminal, application usage periods, the traffic usage corresponding to each application, and so on. Because objects differ in working status, the types of applications installed on their terminals also differ; for example, an object with a learning application installed on its terminal may be a student, while an object with Tencent Meeting installed on its terminal may belong to a first identity category, a second identity category, a third identity category, and so on. Optionally, the first, second, and third identity categories may be determined according to the position of the user. Therefore, by judging the target identity category of the object, such as the working status of the object, in combination with the target data, the identity category of the object can be determined, and targeted services can then be provided based on the target identity category of the object, improving user experience and saving resources.
Optionally, when the computer device detects a start instruction for the target application, target data to be predicted is acquired. The target application may refer to a preset application, or may refer to an application having some functions. For example, the target application may refer to an application with an information push function, or the target application may be associated with a shopping class application, and when a purchase instruction for a target product in the target application is detected, the target application may jump to the shopping class application associated with the target application, so that a user can conveniently and quickly purchase the target product, and user experience is improved.
Optionally, the computer device may obtain the target data to be predicted in advance, determine the target identity category of the object based on it, and store the target identity category; when a start instruction for the target application is detected, the target identity category of the object can be determined quickly, so that information is pushed quickly and in a targeted way based on it, improving data pushing efficiency and user experience. Further, the computer device may acquire target data to be predicted once every target period, determine the target identity category of the object based on the acquired target data, and update the stored target identity category of the object. Because the identity category of an object may change over time, re-acquiring the object's data every target period and re-determining its identity category keeps the stored identity category up to date, so the content of pushed information can be adjusted accordingly, improving the accuracy of information pushing, the user's click-through rate, and the user experience.
Optionally, the computer device may further obtain association data of an associated object, determine the identity category of the associated object based on the association data, and push targeted information to the object based on the identity category of the associated object. An associated object may refer to a user having an association relationship with the object, and may include, but is not limited to, the user's friends, parents, children, spouse, and the like. The association data is acquired with user authorization and may be used to indicate the identity category of the associated object. By determining the identity category of the associated object, information related to that identity category can be pushed to the object, improving the user's click-through rate.
Optionally, the computer device may further process the target data, for example by feature processing. Optionally, the computer device may construct portrait features for the object, which may include, but are not limited to: user basic attributes, device basic attributes, network connection attributes, and the like. Further, the computer device may construct business vertical-type features based on business characteristics; vertical-type features may include the user's click-through rates, conversion rates, and so on for certain types of advertisements. Furthermore, the computer device can combine the time dimension to aggregate the portrait features and business features over different time spans, splice the user's features when subsequently processing the target data to obtain spliced features, and then desensitize and predict on the spliced features, as sketched below. By performing feature processing on the target data, the object's feature vector incorporates the time dimension, making the object's features more complete.
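As an illustration of the time-dimension aggregation and feature splicing just described, the following minimal Python sketch aggregates hypothetical per-day portrait and business features over several spans and splices them into one vector per object. The array names, shapes, spans, and choice of aggregation functions are assumptions for illustration, not taken from the application.

```python
import numpy as np

def aggregate_window(daily_features: np.ndarray) -> np.ndarray:
    # Aggregate one feature block over a time window with the methods the
    # section names (summation, median, standard deviation).
    return np.concatenate([
        daily_features.sum(axis=0),
        np.median(daily_features, axis=0),
        daily_features.std(axis=0),
    ])

# Hypothetical inputs: 180 days of portrait and business features per user.
portrait = np.random.rand(180, 8)   # e.g. device / network attributes per day
business = np.random.rand(180, 4)   # e.g. click-through / conversion stats

# Splice aggregates of several time spans into one vector for the user.
spliced = np.concatenate([
    aggregate_window(portrait[-180:]),  # last half year
    aggregate_window(portrait[-30:]),   # last month
    aggregate_window(business[-30:]),
])
print(spliced.shape)  # one spliced feature vector per user
```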
S102, desensitizing treatment is carried out on the target data, and target desensitization data are obtained.
In the embodiment of the present application, using the target data directly to predict the identity category of the object risks leaking the object's identity information. By desensitizing the target data before performing the subsequent processing steps, the risk of data leakage can be reduced and data security improved.
Optionally, the computer device may desensitize the target data based on a preset desensitization rule to obtain the target desensitized data. Specifically, the computer device may acquire key characters in the target data and replace the key characters with preset characters; the target data after replacement with the preset characters is determined as the target desensitized data. The key characters may be determined according to the type of the target data. For example, if the type of the target data is a name type, the key characters may be the characters other than the surname, i.e., the given name; for example, if the target data is "Zhang San", the key character is "San". If the type of the target data is an address type, the key characters may include district- or street-level characters; for example, if the target data is "Shenzhen Nanshan XX Street XX Technology Park", the key characters may be "Nanshan XX Street XX Technology Park", "XX Technology Park", or the like. The preset characters are used to replace the key characters and may include "*", "#", "?", and the like. By replacing the key characters with preset characters, the target desensitized data is obtained. For example, if the target data is of address type and reads "Shenzhen Nanshan XX Street XX Technology Park", the key characters are "XX Street XX Technology Park"; after replacing them with preset characters, the target desensitized data is "Shenzhen Nanshan ****".
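A minimal sketch of this rule-based masking, assuming two illustrative data types; the masking rules, the prefix cutoff, and the placeholder character are assumptions rather than the application's exact rules.

```python
def mask_name(name: str, placeholder: str = "*") -> str:
    # Keep the surname, mask the given name (the key characters).
    surname, _, given = name.partition(" ")
    if given:
        return surname + " " + placeholder * len(given)
    return name[0] + placeholder * (len(name) - 1)

def mask_address(address: str, keep_prefix: int = 8,
                 placeholder: str = "*") -> str:
    # Keep a coarse prefix (e.g. city/district) and mask street-level detail;
    # a real rule would parse the address instead of cutting at a fixed index.
    return address[:keep_prefix] + placeholder * max(0, len(address) - keep_prefix)

print(mask_name("Zhang San"))                      # Zhang ***
print(mask_address("Shenzhen Nanshan XX Street"))  # Shenzhen *****************
```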
Optionally, the manner in which the computer device desensitizes the target data may further include: acquiring the original sensitivity probability of the target data; adding noise to the target data to obtain noise data, and determining the target sensitivity probability of the noise data based on the noise data and the original sensitivity probability; and if the target sensitivity probability is within the target probability interval, determining the noise data as the target desensitized data. Alternatively, the computer device may perform K-anonymization on the target data to obtain an anonymized equivalent group M1. Further, the computer device may extract the set of sensitive attributes S in the equivalent group M1, and extract and analyze the original sensitivity probability α of each sensitive attribute in the set S. Further, the computer device may take the original sensitivity probability α as input and add trace noise to the data of the set S. Specifically, the computer device may construct a vector set that obeys the Laplace distribution for the set S, generate noise parameters, and add the noise parameters to the target data to obtain noise data. Whether the target sensitivity probability of the noise data is within the target probability interval is determined based on the noise data and the original sensitivity probability; if it is, the noise parameters are recorded into the equivalent group M1 to obtain the target desensitized data.
S103, processing the target desensitization data by adopting N data prediction technologies respectively to obtain N identity prediction sets.
In the embodiment of the present application, the computer device can process the target desensitized data with each of multiple data prediction techniques to obtain multiple identity prediction sets. Each identity prediction set includes the probabilities that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer. Assuming that N equals 3 and M equals 4, processing the target desensitized data with the N data prediction techniques yields 3 identity prediction sets; for example, the first identity prediction set includes "first identity category 0.3, second identity category 0.6, third identity category 0.55, fourth identity category 0.35"; the second identity prediction set includes "first identity category 0.28, second identity category 0.62, third identity category 0.58, fourth identity category 0.38"; the third identity prediction set includes "first identity category 0.4, second identity category 0.68, third identity category 0.5, fourth identity category 0.4". Each identity prediction set contains M predicted identity categories; that is, processing the target desensitized data with each data prediction technique determines the probabilities that the object belongs to the M predicted identity categories, yielding the N identity prediction sets.
Optionally, the target desensitized data includes sparse features and/or dense features, so the computer device can process the dense features and/or the sparse features with each of the N data prediction techniques to obtain the N identity prediction sets. Dense features may refer to features in which the number of non-zero values is greater than a threshold, i.e., most values in a dense feature are non-zero and only a small portion are 0. Sparse features may refer to features in which the number of non-zero values is less than or equal to the threshold, i.e., most values in a sparse feature are 0 and only a small portion are non-zero.
Optionally, for example, the N identity prediction sets include at least two of a first identity prediction set, a second identity prediction set, and a third identity prediction set; since the target desensitization data includes sparse and dense features, the computer device may determine N sets of identity predictions in combination with at least two of the following ways:
in the first mode, the computer device may perform feature compression on the sparse features, perform feature stitching on the compressed features and the dense features to obtain first stitched features, and determine a first identity prediction set based on the first stitched features.
In a second mode, the computer device may perform feature compression on the sparse features and the dense features, perform feature stitching on the compressed features to obtain second stitched features, and determine a second identity prediction set based on the second stitched features.
In a third mode, the computer device may perform feature compression on the sparse features, perform feature stitching on the compressed features and the dense features to obtain third stitched features, and perform weight processing on the third stitched features based on the attention mechanism to determine a third identity prediction set.
In the first mode, the target data contains sparse features, that is, most feature values are 0 and only a few are non-zero, so predicting directly from the sparse features performs poorly; compressing the high-dimensional sparse features into low-dimensional dense features before prediction improves the prediction effect. In the second mode, directly compressing both the sparse and the dense features improves data processing efficiency. In the third mode, because the sparse features are compressed, the prediction effect is better, and the attention mechanism applies weights to the features, so each feature is attention-trained against the other features: the importance of each combination of features is judged from its weight, and more important combinations receive higher weights, which ultimately improves the prediction effect. A sketch of the three modes follows.
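The following PyTorch sketch illustrates the three fusion modes under stated assumptions: sparse features are categorical indices compressed through an embedding, dense features are float vectors, and all layer sizes are illustrative. It is a schematic of the splicing logic only, not the application's models.

```python
import torch
import torch.nn as nn

n_categories, emb_dim, dense_dim = 1000, 16, 8
embed = nn.Embedding(n_categories, emb_dim)  # compresses sparse -> dense
proj = nn.Linear(dense_dim, emb_dim)         # compresses dense features
attn = nn.MultiheadAttention(embed_dim=emb_dim, num_heads=2, batch_first=True)

sparse_idx = torch.randint(0, n_categories, (4, 5))  # batch of 4, 5 sparse fields
dense = torch.randn(4, dense_dim)

# Mode 1: compress the sparse features, then splice with the raw dense features.
sparse_emb = embed(sparse_idx).flatten(1)            # (4, 5 * emb_dim)
first_splice = torch.cat([sparse_emb, dense], dim=1)

# Mode 2: compress both sparse and dense features, then splice.
dense_proj = proj(dense)                             # (4, emb_dim)
second_splice = torch.cat([sparse_emb, dense_proj], dim=1)

# Mode 3: like mode 1, but weight the spliced fields with an attention mechanism.
fields = torch.cat([embed(sparse_idx), dense_proj.unsqueeze(1)], dim=1)  # (4, 6, emb_dim)
weighted, _ = attn(fields, fields, fields)
third_splice = weighted.flatten(1)

print(first_splice.shape, second_splice.shape, third_splice.shape)
```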
Alternatively, the computer device may determine the identity prediction set based on any one of the three modes described above, and thereby determine the target identity category of the object based on that identity prediction set. For example, the identity prediction set includes the probabilities that the object belongs to M predicted identity categories; the computer device may take the predicted identity category with the maximum probability and determine it as the target identity category of the object. Alternatively, if several of the M predicted identity categories have probabilities greater than the target threshold, each such predicted identity category can be determined as a candidate target identity category; by outputting these candidate categories and their probabilities, the target identity category of the object can be further judged manually, improving the accuracy of identity prediction.
Optionally, the computer device may process the target desensitized data with the data prediction techniques corresponding to the N target models to obtain the N identity prediction sets. The N target models may include, but are not limited to, logistic regression models, support vector machines (SVM), convolutional neural networks (Convolutional Neural Network, CNN), long short-term memory networks (Long Short-Term Memory, LSTM), Deep & Cross networks (DCN), product-based neural networks (Product-based Neural Network, PNN), the AutoInt deep recommendation model (Automatic Feature Interaction Learning via Self-Attentive Neural Networks), and the like.
Optionally, the computer device may perform feature compression on the sparse feature based on the Deep & Cross model, perform feature stitching on the compressed feature and the dense feature to obtain a first stitched feature, and determine the first identity prediction set based on the first stitched feature. Optionally, the computer device may perform feature compression on the sparse features and the dense features using the PNN model, perform feature stitching on the compressed features to obtain second stitched features, and determine a second identity prediction set based on the second stitched features. Optionally, the computer device may perform feature compression on the sparse features using an AutoInt model, perform feature stitching on the compressed features and the dense features to obtain third stitched features, and perform weight processing on the third stitched features based on an attention mechanism to determine a third identity prediction set.
In the embodiment of the present application, because the N target models have different model structures, each target model processes features differently, and the identity category results obtained by processing the target desensitized data with each target model may differ, which reflects the object's identity category more completely. By combining multiple target models to process the target desensitized data, the data can be predicted from multiple dimensions, improving the accuracy of data processing. Optionally, before the target desensitized data is processed with the N target models, the N target models may be trained in advance and saved after training; when target data to be predicted is subsequently acquired and desensitized, the target desensitized data can be processed based on the saved N target models. For the specific training of the N target models, refer to the method in the embodiment corresponding to fig. 4, which is not elaborated here.
S104, determining the target identity category of the object based on the N identity prediction sets.
In the embodiment of the present application, because the probabilities that the object belongs to the M predicted identity categories are determined with each data prediction technique, yielding N identity prediction sets, the target identity category of the object can be determined based on the N identity prediction sets. An object may refer to a user whose identity needs to be predicted. Optionally, if the target identity category is a user work category, the user work category indicates the position of working people within their work unit; that is, the identity category of the user may be determined based on the user's position. The target identity category of the object may include, but is not limited to, a first identity category, a second identity category, a third identity category, a fourth identity category, and so on. Optionally, the identity categories of users may be preset or pre-divided. Alternatively, the target identity category may be a user lifestyle category, and the target identity category of the object may include, but is not limited to, a first lifestyle category, a second lifestyle category, a third lifestyle category, and so on. The user lifestyle category can reflect the user's situation at work or in daily life; by determining the user's lifestyle category, targeted services such as targeted advertisement pushing can be provided, saving resources and improving user experience.
Optionally, the identity category of the user may be determined based on the user's work category; for example, the identity category of a user in the first work category is the first identity category, the identity category of a user in the second work category is the second identity category, the identity category of a user in the third work category is the third identity category, and so on. The user's work category may include, but is not limited to, social services, cultural relics, scientific research, art, creation, computing, mathematics, and the like. Since each work category corresponds to a different field, the kinds of information that users in each work category focus on also differ. By determining the user's work category, targeted information recommendation can be realized, improving the user experience and, in turn, the user's click-through rate.
Alternatively, the computer device may determine predicted identity categories belonging to the same category from each of the identity prediction sets, and determine the target identity category of the object based on the predicted identity categories of the same category. In particular, the computer device may determine predicted identity categories belonging to the same category from the N identity prediction sets, and probabilities of the predicted identity categories of the same category; based on the probabilities of the predicted identity categories of the same kind in the N identity prediction sets, determining the probability of each predicted identity category in the M predicted identity categories to obtain the total probability of each predicted identity category; determining the maximum probability from the total probabilities of M predicted identity categories; and determining the predicted identity class corresponding to the maximum probability as the target identity class of the object.
Alternatively, the computer device may determine the probability of each of the M predicted identity categories based on the average of the probabilities of the predicted identity categories of the same category in the N predicted identity sets, resulting in a total probability for each predicted identity category, thereby determining the target identity category of the object.
For example, N equals 3, and the N identity prediction sets include a first, a second, and a third identity prediction set: the first identity prediction set is, for example, "first identity category 0.3, second identity category 0.6, third identity category 0.55, fourth identity category 0.35"; the second is "first identity category 0.28, second identity category 0.58, third identity category 0.62, fourth identity category 0.38"; the third is "first identity category 0.4, second identity category 0.68, third identity category 0.5, fourth identity category 0.4". By combining the probabilities of like predicted identity categories, the computer device may determine that the total probability of the "first identity category" is (0.3+0.28+0.4)/3=0.327; the total probability of the "second identity category" is (0.6+0.58+0.68)/3=0.62; the total probability of the "third identity category" is (0.55+0.62+0.5)/3=0.56; and the total probability of the "fourth identity category" is (0.35+0.38+0.4)/3=0.377. The computer device may then determine the maximum probability, 0.62, among the total probabilities of the 4 predicted identity categories, and determine the corresponding predicted identity category, the "second identity category", as the target identity category of the object.
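A minimal sketch of this fusion step, reusing the numbers from the worked example: average the probabilities of like categories across the N identity prediction sets and take the category with the maximum total probability.

```python
import numpy as np

categories = ["first", "second", "third", "fourth"]
prediction_sets = np.array([
    [0.30, 0.60, 0.55, 0.35],  # identity prediction set 1
    [0.28, 0.58, 0.62, 0.38],  # identity prediction set 2
    [0.40, 0.68, 0.50, 0.40],  # identity prediction set 3
])

total = prediction_sets.mean(axis=0)          # per-category total probability
target = categories[int(total.argmax())]
print(dict(zip(categories, total.round(3))))  # {'first': 0.327, 'second': 0.62, ...}
print("target identity category:", target)    # second
```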
Optionally, if the total probability of each of the M predicted identity categories is greater than the target threshold, the computer device may determine the predicted identity category corresponding to the total probability as the target identity category of the object, and then may manually determine the multiple target identity categories to determine which target identity category the object belongs to.
Optionally, if each identity prediction set contains a target predicted identity category whose probability is greater than that of the other predicted identity categories among the M predicted identity categories, and this target predicted identity category is the same in every identity prediction set, the target predicted identity category is determined as the target identity category of the object.
For example, the N identity prediction sets include a first identity prediction set, a second identity prediction set, and a third identity prediction set; the first, second, and third sets of identity predictions each include a first identity class, a second identity class, and a third identity class. And the probability of the second identity class in the first identity prediction set is greater than the probability of the other predicted identity classes in the first identity prediction set; the probability of a second identity class in the second identity prediction set is greater than the probability of other predicted identity classes in the second identity prediction set; the probability of the second identity class in the third identity prediction set is greater than the probability of the other predicted identity classes in the third identity prediction set. In other words, the probability that the second identity class is the target predicted identity class in all the 3 identity prediction sets is the largest, so the second identity class can be directly determined as the target identity class of the object.
Since the probability of the second identity category is the maximum probability in every identity prediction set, even if the probabilities across the N identity prediction sets are averaged, the average total probability of the second identity category remains greater than the average total probability of the other predicted categories. Therefore, by directly determining the second identity category as the target identity category of the object, the total probabilities of the other predicted identity categories need not be calculated and compared, improving data processing efficiency.
In the embodiment of the application, the data security can be improved and the risk of leakage of the identity information of the object can be reduced by desensitizing the target data to be predicted. The target desensitization data is processed by using a plurality of data prediction technologies, and the processing modes of the target desensitization data by using each data prediction technology are different, so that the processing results are different, and the identity classification of the object is determined by combining the plurality of data prediction technologies, so that the identity prediction from a plurality of dimensions can be realized, and the accuracy of the data prediction is improved.
Optionally, referring to fig. 4, fig. 4 is a schematic flow chart of a model training method according to an embodiment of the present application. The model training method can be applied to computer equipment; as shown in fig. 4, the model training method includes, but is not limited to, the following steps:
S201, obtaining sample data to be predicted.
In the embodiment of the application, the computer device may obtain the sample data to be predicted from the terminal device, may obtain the sample data to be predicted from the local storage, or may obtain the sample data to be predicted from the third party terminal. Wherein the sample data is used to indicate a sample identity class of the sample object. For example, the sample data may include interests of the subject, usual places, postal codes, jobs, types of applications installed on the terminals of the subject, application usage periods, traffic usage conditions corresponding to the applications, and so on.
Optionally, the computer device may obtain at least one piece of initial sample data and filter it based on an anomaly rule to obtain sample filtered data, where the anomaly rule includes at least one anomalous behavior; the sample filtered data is then filtered based on a distribution anomaly theorem to obtain the sample data to be predicted, where the distribution anomaly theorem filters data based on probability. That is, after the initial sample data is acquired, whether it contains any of the anomalous behaviors may be detected; if a piece of initial sample data contains an anomalous behavior, it may be filtered out. For example, when the frequency with which a subject starts an application during a sleep period is detected to be greater than a preset number of times, the initial sample data corresponding to the subject is filtered out. Likewise, if the detected duration for which the subject operates a certain application exceeds 24 hours, the initial sample data corresponding to the subject is filtered out.
Further, filtering the sample filtered data based on the distribution anomaly theorem may be carried out by using the Raida criterion (the 3σ rule) to determine abnormal values and thereby filter the sample filtered data. The Raida criterion first assumes that a group of measured data contains only random error, calculates the standard deviation, and determines an interval according to a certain probability; any error exceeding this interval is considered a gross error rather than a random error, and the data containing it should be removed. A sketch of the two filtering stages follows.
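A sketch of the two-stage filtering under stated assumptions: first drop samples that match an anomaly rule, then apply the 3σ (Raida) criterion to a numeric column. The record field names and thresholds are hypothetical.

```python
import numpy as np

def rule_filter(records):
    # Drop users exhibiting the anomalous behaviors named above.
    kept = []
    for r in records:
        if r["sleep_period_launches"] > 5:  # hypothetical preset threshold
            continue
        if r["max_session_hours"] >= 24:    # continuous operation >= 24 hours
            continue
        kept.append(r)
    return kept

def three_sigma_filter(values: np.ndarray) -> np.ndarray:
    # Raida criterion: treat points beyond mean +/- 3*std as gross errors.
    mu, sigma = values.mean(), values.std()
    return values[np.abs(values - mu) <= 3 * sigma]

records = [{"sleep_period_launches": 0, "max_session_hours": 2},
           {"sleep_period_launches": 9, "max_session_hours": 30}]
print(len(rule_filter(records)))  # 1: the second record is filtered out
```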
In a specific implementation, the computer device may obtain labeled seed users, that is, seed users with identity category labels, based on manual labeling and business logic; the seed users may serve as positive and negative samples with exact labels. For example, a batch of seed users may be coarsely recalled and then filtered by manual screening, for example by manually screening out users who are clearly out of rule, such as users who seldom use a certain learning-class application at lesson time yet are marked as the first identity category; such users may be employees or belong to identity categories other than students. Further, the computer device may verify the seed users based on business logic, e.g., the first identity category should not spend excessive time on game applications; for example, if a user marked as the first identity category frequently plays game applications during commuting periods, the seed user is indicated as abnormal and may be filtered out. Further, a basic portrait of the filtered seed users may be obtained, which may include some behavioral data of the user in certain applications, such as whether the user's terminal has a phone manager installed and whether the phone manager's harassment-interception or call-assistant functions are used, to further filter the seed users. Further, an anomaly-type index of the seed users, i.e., an abnormal user type evaluation index, may also be calculated. In real business scenarios, fake users and cases where a computer controls the terminal device may exist; to eliminate the influence of non-real users on the modeling analysis, anomaly-type indexes are set based on business experience, such as obvious anomalies in a user's traffic usage in certain classes of applications or in the time distribution of that traffic, for example, detecting several times within a week that the user continuously operates the terminal device throughout the sleep period. Further, the sample filtered data is filtered based on the distribution anomaly theorem. After the abnormal seed users are filtered out, the normal seed users can be stored in a distributed file system (Hadoop Distributed File System, HDFS), facilitating fast access in subsequent processes.
Optionally, the computer device may perform feature processing, such as offline feature processing, on the stored seed users. Specifically, the computer device may construct portraits; for example, rich portraits may be built from users' historical behavioral data and may include, but are not limited to: user basic attributes, device basic attributes, network connection attributes, and the like. For example, the user basic attributes can reflect the user's identity information, the device basic attributes might record the phone brand (e.g., Huawei), and the network connection attributes might record that Wi-Fi was connected 10 times this week. Further, the computer device may construct business vertical-type features based on business characteristics; vertical-type features may include the user's click-through rates, conversion rates, etc. for certain types of advertisements. Further, the computer device may also aggregate portrait features and business features over different time spans in combination with the time dimension. For example, the aggregated portrait of the user over the last half year / last 3 months / last 1 month / last 1 week can be calculated, where the aggregation method may be any one or more of summation, median, and standard deviation. As shown in fig. 5, a schematic diagram of average pooling provided in an embodiment of the present application, the numbers in fig. 5 represent features; by average-pooling (i.e., averaging) features with a larger data size, they can be turned into aggregate features with a smaller data size. In fig. 5, averaging the four values (1, 2, 3, 0) in the upper-left corner of the left 4*4 grid yields the value (1.5) in the upper-left corner of the right 2*2 grid, so the 4*4 data is converted into 2*2 data, reducing the subsequent amount of computation.
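As a concrete check of the Fig. 5 example, the sketch below average-pools a 4×4 feature grid into 2×2 blocks; the grid values outside the top-left block are made up for illustration.

```python
import numpy as np

grid = np.array([[1, 2, 5, 6],
                 [3, 0, 7, 8],
                 [4, 4, 1, 1],
                 [0, 0, 3, 3]], dtype=float)

# Reshape into (block_row, row_in_block, block_col, col_in_block) and
# average within each 2x2 block.
pooled = grid.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(pooled)  # top-left block (1, 2, 3, 0) -> 1.5, matching the figure
```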
Further, the features in the sample data may also be normalized and discretized. The discretization processing includes: One-Hot Encoding; for example, a binary user preference feature such as "likes sports" becomes (1, 0) or (0, 1). Count Encoding; for example, applied to the user's WiFi POI (Point of Interest) feature, it can be used to identify the user's degree of interest in a POI, e.g., the user has consumed at the "food - Chinese cuisine - Cantonese cuisine" POI 3 times. Consolidation Encoding; multiple values under certain category variables can be generalized into the same information. For example, the system version feature of an Android phone includes the three versions "4.2", "4.4", and "5.0", which can all be summarized as "low-version Android system". Experiments show that the Consolidation Encoding approach brings more positive benefit than directly one-hot encoding the Android system version feature. Finally, the computer device can combine the processed features and store them offline in the HDFS system, facilitating fast access in subsequent processes. For each user, the data input to the model may be a Y×1 numeric vector, Y being a positive integer representing the feature dimension, such as (1, 0, 31, 4, 0.2, 9.3, 8.8, …, 0, 0, 1, 2, 34); the numbers represent the user's various features, one Y-dimensional feature vector per user. The sketch below illustrates the three encodings.
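The following sketch illustrates the three encodings named above on toy values; the vocabularies, POI strings, and version grouping are illustrative stand-ins, not the application's exact categories.

```python
from collections import Counter

# One-Hot Encoding for a small categorical feature.
def one_hot(value, vocabulary):
    return [1 if value == v else 0 for v in vocabulary]

print(one_hot("sports", ["sports", "arts"]))  # [1, 0]

# Count Encoding: replace a POI value with the user's visit count for it.
visits = ["food-chinese-cantonese", "food-chinese-cantonese",
          "food-chinese-cantonese", "gym"]
counts = Counter(visits)
print(counts["food-chinese-cantonese"])       # 3: degree of interest in the POI

# Consolidation Encoding: merge several category values into one bucket.
LOW_VERSIONS = {"4.2", "4.4", "5.0"}
def consolidate(android_version):
    return "low-version android" if android_version in LOW_VERSIONS else android_version

print(consolidate("4.4"))                     # low-version android
```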
Alternatively, the computer device may train a plurality of models using the sample data, from which N target models are selected. Specifically, the computer device may input sample data to be predicted into K models for training, and determine index parameters of the K models, where the index parameters are used to reflect classification performance conditions of the models, and K is a positive integer greater than N; n target models are determined from the K models based on index parameters of the K models. When N target models are determined, the sample desensitization data can be processed by adopting data prediction technologies corresponding to the N target models respectively, so as to obtain N sample prediction sets.
The index parameter of a model may be the AUC (area under curve), a model evaluation index in the machine learning field equal to the area under the ROC curve. By calculating the AUC value of each of the K models, the N target models with the best effect can be selected from the K models; parameter optimization then performs a grid search over the hyperparameters of the selected models, which can be expected to improve the models' evaluation index AUC. The larger the AUC value, the more likely the classification algorithm in the current model is to rank positive samples ahead of negative samples, giving better classification results.
Alternatively, the computer device may divide the feature-processed sample set, i.e., divide the sample data into training sample data and verification sample data. For example, the sample data may be divided according to the time window it belongs to, with earlier sample data used as training sample data and later sample data used as verification sample data; the ratio of training sample data to verification sample data may be 5:1. Further, multiple models (e.g., K models) may be trained in parallel based on default parameters, and the better models (e.g., the N target models) selected from them. The K models may include, but are not limited to: logistic regression models, SVM models, CNN models, LSTM models, Deep & Cross models, PNN models, AutoInt models, and the like. Further, after the models are retrained with the optimized parameters, they are verified on the verification sample data to test the stability of the model effect, facilitating subsequent use. A sketch of this selection loop follows.
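A sketch of the selection loop under stated assumptions: a time-ordered 5:1 split, K candidate models trained on the earlier portion, and the N with the highest validation AUC kept. The scikit-learn models are stand-ins for the families listed above, and the synthetic data replaces real sample data.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

# Synthetic stand-in for the feature-processed, time-ordered sample set.
X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
split = len(X) * 5 // 6  # earlier 5/6 for training, later 1/6 for verification
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]

candidates = {  # K candidate models with default-ish parameters
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "gbdt": GradientBoostingClassifier(),
}
aucs = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

N = 2  # keep the N models with the best validation AUC
targets = sorted(aucs, key=aucs.get, reverse=True)[:N]
print(aucs, "-> selected target models:", targets)
```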
S202, desensitizing treatment is carried out on the sample data, and sample desensitization data are obtained.
In the embodiment of the application, since predicting the identity class of an object using the sample data may risk leaking the object's identity information, performing the subsequent steps only after desensitizing the sample data can reduce the risk of data leakage and improve data security.
Optionally, the manner in which the computer device desensitizes the sample data may include: acquiring the original sample sensitivity probability of sample data; adding noise to the sample data to obtain sample noise data, and determining target sample sensitivity probability of the sample noise data based on the sample noise data and the original sample sensitivity probability; and if the target sample sensitivity probability is in the target probability interval, determining the sample noise data as sample desensitization data.
Specifically, the computer device may perform K-anonymization on the sample data to obtain an anonymized equivalent group M1, where the equivalent group M1 may be as shown in table 1:
TABLE 1
Further, the computer device may extract a set of sensitive attributes S from the equivalent group M1, and extract and analyze the sample sensitivity probability α of each sensitive attribute in the set S. The set of sensitive attributes S may include only the sensitive data in the equivalent group M1, or may include all data in the equivalent group M1. That is, in the embodiment of the present application, only the sensitive data in the sample data may be processed, or the whole sample data may be processed, which is not limited in the embodiment of the present application. Further, the computer device may take the sample sensitivity probability α as input and add a small amount of noise to the data of the set S, with the calculation formula shown in formula (1-1):
α_p = α + Lap(ΔS/ε)  (1-1)

where, given the sample sensitive data set S = {S_1, S_2, …, S_n}, α_p is the target sample sensitivity probability of the new sensitive attribute, i.e. the sensitivity probability of the sample data after desensitization, and its value lies in the interval (0, 1]; α is the original sample sensitivity probability value; and Lap(ΔS/ε) is a small random noise parameter.
Alternatively, the computer device may determine the noise parameter in the following manner. The computer device may construct a set of vectors for the S-set that obey a Laplace (Laplace) distribution, with the specific formula shown in formula (1-2):
Δf = max ||f(D_1) − f(D_2)||_1  (1-2)

where Δf represents the sensitivity of the function, D_1 and D_2 are datasets, f: D → R^d is the function, the max is taken over the pair of datasets with the greatest difference, and ||·||_1 denotes the Manhattan distance. For the Laplace mechanism: given a dataset D and a function f: D → R^d, ε-differential privacy is provided if the random mechanism M satisfies the following formula (1-3):
M(D)=f(D)+Lap(Δf/ε) (1-3)
where Δf is the sensitivity, Lap(Δf/ε) is random noise subject to the Laplace distribution, and the noise magnitude depends on the sensitivity Δf and the differential privacy budget ε.
Further, by means of the above formula (1-2), a sensitivity parameter corresponding to the sensitivity probability α is generated; substituting this sensitivity parameter into formula (1-3), the computer device can determine the random noise subject to the Laplace distribution, i.e. determine the noise. By substituting the determined noise into the above formula (1-1), the target sample sensitivity probability of the noise-added data can be calculated; if the target sample sensitivity probability falls in the target interval, e.g. α_p belongs to the interval (0, 1], the noised data are input into the equivalent group M1 to obtain an equivalent group M2, i.e. the sample desensitization data.
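As a rough sketch of formulas (1-1) to (1-3): the helper below draws Lap(Δf/ε) noise with numpy and accepts the perturbed probabilities only when they remain in (0, 1]. The sensitivity and epsilon values in the example are illustrative assumptions, not parameters fixed by the application.

```python
import numpy as np

def laplace_noise(sensitivity: float, epsilon: float, size=None):
    """Sample Lap(Δf/ε) noise as in formula (1-3); a larger
    sensitivity or a smaller privacy budget epsilon yields
    heavier noise."""
    return np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=size)

def desensitize(alpha: np.ndarray, sensitivity: float, epsilon: float):
    """Perturb the original sensitivity probabilities alpha per
    formula (1-1) and keep the result only if every value stays
    in the target interval (0, 1]."""
    alpha_p = alpha + laplace_noise(sensitivity, epsilon, size=alpha.shape)
    if np.all((alpha_p > 0) & (alpha_p <= 1)):
        return alpha_p          # usable as the desensitized probabilities
    return None                 # outside the target interval; resample

# Example: three sensitive attributes, sensitivity 0.1, epsilon 1.0.
alpha = np.array([0.3, 0.5, 0.7])
print(desensitize(alpha, sensitivity=0.1, epsilon=1.0))
```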
By desensitizing the data, even if an illegal terminal device acquires the desensitized data, it is difficult to restore them to the original data, so the risk of identity information leakage can be reduced and data security improved.
S203, processing the sample desensitization data by adopting N data prediction technologies respectively to obtain N sample prediction sets.
In the embodiment of the application, the computer equipment can respectively process the sample desensitization data by adopting the data prediction technology corresponding to the N target models to obtain N sample prediction sets. Each sample prediction set comprises probabilities that sample objects belong to M prediction identity categories.
Optionally, if there are a plurality of sample data to be predicted, the computer device may divide the sample data into a training set and a test set, use the training set to train the models and the test set to test them, improving the accuracy of model prediction. Specifically, the computer device may divide the plurality of sample desensitization data to determine a training set and a test set, the data amount of the training set being greater than that of the test set; train the N target models using the training set respectively to obtain N trained target models; process the test set based on the N trained target models respectively to obtain N test probabilities; and determine the N sample prediction sets based on the N test probabilities.
Optionally, the computer device may read the low-order feature matrix and the high-order feature matrix in the sample desensitization data and splice them. The low-order feature matrix may include, but is not limited to, features such as name and hobbies, and the high-order feature matrix may include, but is not limited to, time. For example, the computer device may uniformly divide the training set into 3 shares and train the N target models using a "leave-one-out" scheme, then predict the held-out data and the test set with the trained N target models. The 3 sets of predicted training data are then merged to obtain new training data, and the 3 predicted test sets are combined by the mean method to obtain a new test data set. The output results of the N target models on the new training data and new test data are obtained, and average pooling of these outputs yields the final result, i.e. the sample identity category to which the sample data belong.
As shown in fig. 6, fig. 6 is a schematic flow chart of training a model provided in an embodiment of the present application. The training set is divided 4-fold to obtain 4 training sets and 4 verification sets. The first of the N target models is trained on the 4 training sets respectively, and the trained first target model is verified on the 4 verification sets respectively, yielding 4 probabilities x1 to x4, i.e. probabilities of the sample classes to which the sample objects belong; stitching x1 to x4 gives S1_train. Further, the first target model may be tested on the 4 test sets to obtain 4 probabilities c1 to c4, and averaging c1 to c4 gives S1_test. The other target models among the N target models are trained and tested in the same manner, obtaining the corresponding S_train and S_test for each target model. Using S_train and S_test, the output results of the N target models are obtained, and average pooling of these outputs yields the final result, i.e. the sample identity category to which the sample data belong.
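The fold-wise flow of fig. 6 can be sketched as follows, assuming numpy arrays and scikit-learn-style models with fit/predict_proba; S_train collects the out-of-fold predictions and S_test averages the per-fold test predictions. This is an illustrative reconstruction, not the patent's exact code.

```python
import numpy as np
from sklearn.model_selection import KFold

def stack_one_model(model, X_train, y_train, X_test, n_folds=4):
    """Out-of-fold predictions form S_train; the per-fold test-set
    predictions are averaged into S_test, as in fig. 6. The model
    is refit from scratch on each fold."""
    s_train = np.zeros(len(X_train))
    s_test_folds = []
    for fit_idx, val_idx in KFold(n_splits=n_folds).split(X_train):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        s_train[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
        s_test_folds.append(model.predict_proba(X_test)[:, 1])
    return s_train, np.mean(s_test_folds, axis=0)

def average_pool(outputs):
    """Average-pool the N target models' outputs into the final result."""
    return np.mean(outputs, axis=0)
```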
S204, determining the target sample identity category of the sample object based on the N sample prediction sets.
In the embodiment of the present application, the manner of determining the target identity class of the sample object based on the N sample prediction sets may refer to the manner of determining the target identity class of the object based on the N identity prediction sets in step S104 corresponding to fig. 3, which is not described herein.
S205, acquiring sample identity labels of sample data, and training to obtain N target models based on the sample identity labels and the target sample identity types.
In the embodiment of the application, the computer device acquires the sample identity labels of the sample data and trains the N target models based on the sample identity labels and the target sample identity categories. The sample identity label is the actual identity class of the sample data; when training a target model, the sample identity label of the sample data can be predetermined, which is equivalent to knowing the actual identity class of the sample data. Processing the sample data with the model yields the model output result, i.e. the sample identity class to which the sample object belongs, which is one or more of the M predicted identity classes. The purpose of training the model is to make the sample identity class to which the sample object belongs agree with the sample identity label of the sample data as far as possible. If, among the plurality of sample data, the number of sample objects whose predicted sample identity class is consistent with the sample identity label is greater than or equal to a preset number, the model at this point can be saved for subsequent use. Otherwise, the model continues to be trained and its model parameters adjusted, so that after the sample data are processed by the model, the output sample identity class agrees with the sample identity label as far as possible.
Optionally, the N target models may include a Deep & Cross model. As shown in fig. 7, fig. 7 is a schematic diagram of processing desensitized data based on a model provided in an embodiment of the present application. The model in fig. 7 includes an embedding and stacking layer, a cross network layer, a deep network layer, and a combination output layer. By inputting the sample desensitization data into the embedding and stacking layer, the sparse features in the sample desensitization data are compressed, and the compressed features are stitched with the dense features to obtain a first stitched feature. Further, the first stitched feature is input into the cross network layer and the deep network layer for processing to obtain a first cross feature and a first deep feature. The cross network layer can effectively learn predictive cross features of bounded degree, while the deep network layer can capture highly nonlinear interactions. Then, the first cross feature and the first deep feature are stitched and input into the combination output layer, which combines them and predicts on the combined feature to obtain the model prediction result, i.e. the sample identity category to which the sample object belongs.
The core architecture of the Deep & Cross model comprises two parts: discrete feature embedding and high-order cross features. Discrete feature embedding follows the idea of Word2Vec: the problem Word2Vec originally solved is that the one-hot representation of words is too sparse and the vector representations of different words are entirely unrelated, so it embeds ten-thousand-dimensional one-hot word representations into dense vectors of a few hundred dimensions. Combinations of high-order cross features tend to bring positive business effects, such as the associated features "USA" and "Thanksgiving", or "China" and "Chinese New Year". By designing the cross network so that feature crossing is applied explicitly at each layer, predictive cross features of bounded degree are learned effectively, and no manual feature engineering or exhaustive search is required. Secondly, the cross network is simple and effective: by design, the highest polynomial degree increases with each layer and is determined by the layer depth. The network consists of all cross terms, whose coefficients differ from each other. Moreover, the cross network is memory-efficient and easy to implement, and has nearly an order of magnitude fewer parameters than a DNN at comparable LogLoss. Optionally, the ReLU function may be used as the activation function when training the Deep & Cross model, with Dropout added.
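For illustration, one cross-network layer can be written as x_{l+1} = x0·(x_l·w) + b + x_l, so that stacking L layers yields explicit feature crosses whose highest degree is bounded by the depth. The numpy sketch below shows this; the dimensions and random weights are placeholder assumptions, not learned parameters.

```python
import numpy as np

def cross_layer(x0: np.ndarray, xl: np.ndarray,
                w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One cross-network layer: x_{l+1} = x0 * (x_l . w) + b + x_l.
    Each layer raises the highest polynomial degree of the explicit
    feature crosses by one, so layer depth bounds the cross degree."""
    return x0 * xl.dot(w) + b + xl

d = 8
x0 = np.random.randn(d)          # stacked embedding + dense input
xl = x0.copy()
for _ in range(3):               # three cross layers -> degree-4 crosses
    w, b = np.random.randn(d), np.zeros(d)
    xl = cross_layer(x0, xl, w, b)
```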
Optionally, the N target models may include a PNN model. As shown in fig. 8, fig. 8 is a schematic diagram of another model-based processing of desensitized data according to an embodiment of the present application. The model in fig. 8 includes an embedding layer, a product layer, a first hidden layer, and a second hidden layer. If the sample desensitization data contains N field features whose one-hot vector is X, each field generates an embedding vector; by inputting the sample desensitization data into the embedding layer, the model learns an embedding representation of each field, obtaining a first embedded feature. The first-order features and second-order cross features of the first embedded feature are stitched through the product layer to obtain a second stitched feature. The first hidden layer fully learns the high-order combined features of the second stitched feature to obtain a first hidden feature, and the second hidden layer fully learns the high-order combined features of the first hidden feature to obtain the prediction probability, finally yielding the model prediction result, i.e. the sample identity category to which the sample object belongs.
The PNN model uses a second-order pairwise-connected product layer (Pair-wisely Connected Product Layer) to multiply the embedding vectors two by two and uses the result as input to the subsequent layers. The product layer designed by the PNN model combines the features via inner-product and outer-product operations, increasing the depth of feature combination and crossing. For the inner-product form of PNN, since the product of two vectors is a scalar, the individual scalars can be directly stitched into one large vector, which then serves as the MLP input. For the outer-product form, the product of two vectors corresponds to the matrix product of a column vector with a row vector, so the result is a matrix; directly stitching the matrices, as in the inner-product form, would give far too many dimensions, so a simplified scheme directly sums the matrices to obtain a new matrix as the MLP input. Optionally, the hidden layers may use a three-layer 200-400-100 structure, with the ReLU function as the activation function and Dropout added.
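A minimal numpy sketch of the two product-layer variants just described — pairwise inner products stitched into one vector, and a simplified outer-product form that sums first and takes a single outer product. The field count and embedding size are illustrative assumptions.

```python
import numpy as np

def inner_product_layer(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise inner products of the N field embeddings; each pair
    yields a scalar, and the scalars are stitched into one vector
    that feeds the following MLP (inner-product form of PNN)."""
    n = embeddings.shape[0]
    return np.array([embeddings[i].dot(embeddings[j])
                     for i in range(n) for j in range(i + 1, n)])

def outer_product_layer(embeddings: np.ndarray) -> np.ndarray:
    """Simplified outer-product form: sum the embeddings first, then
    take one outer product instead of keeping every pairwise matrix."""
    s = embeddings.sum(axis=0)
    return np.outer(s, s).ravel()

fields = np.random.randn(5, 16)          # 5 fields, embedding size 16
mlp_input = np.concatenate([inner_product_layer(fields),
                            outer_product_layer(fields)])
```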
Optionally, the N target models may include an AutoInt model. As shown in fig. 9, fig. 9 is a schematic diagram of processing desensitized data based on a model according to another embodiment of the present application. The model in fig. 9 includes an input layer, an embedding layer, an interaction layer, and an output layer. The sample desensitization data are input into the input layer and passed to the embedding layer, obtaining a second embedded feature. The embedding layer maps the discrete features and continuous features of the input data into embedding vectors of equal length: a discrete feature is a direct table lookup, a multi-valued discrete feature uses average mapping, and a continuous feature is equivalent to multiplication by a bias-free Dense layer. Further, the second embedded feature is processed through the interaction layer; multiple interaction layers can be stacked to achieve high-order crossing of the features, obtaining the interaction feature. Since the key to feature combination is knowing which features combined together have strong characterization capability, the interaction layer is in effect the feature-selection step of manual feature engineering. Based on a Self-Attention mechanism, the features of each field attend to the features of every other field; the importance of combining one field's features with another's is judged by the attention weight, more important combinations are given higher weight, and finally a weighted sum of the features is generated as the result of combining that field's features with all other fields' features. The interaction feature is processed through the output layer to obtain the prediction probability, finally yielding the model prediction result, i.e. the sample identity class to which the sample object belongs.
Because the AutoInt model automatically finds features and performs high-order crossing on them, it not only overcomes the MLP's weak ability to capture multiplicative feature combinations, but also explains well which feature combinations are more effective, so using this model can improve data prediction efficiency. By providing a method for explicitly learning high-dimensional feature crosses, the interpretability of the model output can be improved; based on a self-attentive neural network, high-dimensional feature crosses can be learned automatically, effectively improving prediction accuracy. Optionally, the batch size of the AutoInt model may be 1024 and the embedding dimension d may be 16; further, an Adam optimizer may be used, with the dropout parameter set to 0.5.
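A single-head numpy sketch of one interaction layer, following the self-attention weighting described above; the weight matrices here are random placeholders that a real model would learn, and an actual AutoInt layer typically uses multiple heads plus a residual connection.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def interacting_layer(E, Wq, Wk, Wv):
    """One AutoInt interaction layer (single head): each field attends
    to every other field; the attention weights judge which field
    combinations matter, and the weighted sum is the combined
    representation of that field with all other fields."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return attn @ V

d = 16                                     # embedding dimension, as above
E = np.random.randn(10, d)                 # 10 fields' embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = interacting_layer(E, Wq, Wk, Wv)     # stack layers for higher orders
```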
Optionally, by predicting a large amount of sample data, the sample identity category to which the sample object belongs can be determined, so that targeted advertisement delivery is performed for the sample object, and further, model parameters can be adjusted according to parameters such as advertisement click rate and advertisement conversion rate of the sample object, and further, the model prediction effect is improved.
Optionally, the model flow may further be solidified: the trained target model is solidified, and offline training, verification, alarming and solidification are performed at regular times. The model may also be trained on an offline experimental data set; after parameter tuning, the trained model is solidified based on the Saver() method of TensorFlow, generating 4 files: a checkpoint text file recording the path information list of the model files; a model.ckpt.meta file recording the graph structure; and the binary model.ckpt.data and model.ckpt.index files holding the variable weight information in the model and its index. Because the trained model is solidified, the technical scheme of the application has strong reusability. Further, the user identity type of the positive sample can be replaced, such as with "user life style"; the server then accumulates the corresponding log data, and finally the same feature stitching, feature processing and model training methods are used to determine the result.
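A minimal sketch of solidifying a model with TensorFlow's Saver(), using the tf.compat.v1 interface; the toy variable and output path are assumptions, and the exact file names produced (checkpoint, model.ckpt.meta / .data / .index) depend on the TensorFlow version.

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

# A stand-in graph; in practice this would be the tuned target model.
w = tf1.get_variable("w", shape=[16, 2])
saver = tf1.train.Saver()

with tf1.Session() as sess:
    sess.run(tf1.global_variables_initializer())
    # Writes the checkpoint path-list file plus the meta/data/index files.
    saver.save(sess, "./model.ckpt")
```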
Optionally, in the embodiment of the application, the effect of the model can be evaluated based on the online traffic of an A/B Test, and the evaluation indexes may include, but are not limited to, the advertisement click rate and the advertisement conversion rate. Specifically, after training the model, the computer device may detect the target data using the trained model, determine the target identity class of the object, and push information for the object based on that class; acquire the advertisement click rate and/or advertisement conversion rate of the object for the pushed information within a preset time period; and, if the advertisement click rate and/or advertisement conversion rate is smaller than the expected threshold, determine that the current loss of the model is greater than the loss threshold and continue training the model until, when the target identity class of the object is determined based on the model and information is pushed accordingly, the corresponding advertisement click rate and/or advertisement conversion rate is greater than or equal to the expected threshold.
Since the effect of a model can be determined in this way, the N target models can also be selected from a plurality of models in this way. An A/B Test makes two schemes (such as two pages) for the same target, lets one part of the users use scheme A and another part use scheme B, and records the users' usage to see which scheme better fits the design. After determining the target identity category of a user with the model, information such as targeted advertisements can be pushed to the user based on that identity category, and the user's advertisement click rate and advertisement conversion rate for the pushed information over a period of time are acquired, so that the scheme, i.e. the model, can be adjusted.
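A sketch of the threshold check described above; the counts and floor values are invented for illustration, and in practice the floors would come from the configured expected thresholds.

```python
def model_passes(clicks: int, impressions: int, conversions: int,
                 ctr_floor: float, cvr_floor: float) -> bool:
    """Online A/B evaluation: keep the model only if the click rate
    and conversion rate measured on the pushed information reach the
    expected thresholds; otherwise training continues."""
    ctr = clicks / impressions if impressions else 0.0
    cvr = conversions / clicks if clicks else 0.0
    return ctr >= ctr_floor and cvr >= cvr_floor

# e.g. floors set to an assumed baseline's rates
print(model_passes(210, 10000, 3, ctr_floor=0.018, cvr_floor=0.007))
```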
As shown in fig. 10a, fig. 10a is a schematic diagram comparing model effects, covering three schemes for predicting a user's working state: a manually formulated strong rule scheme, a non-deep learning scheme, and the technical scheme of the application. Offline, the AUC value of the manually formulated strong rule scheme is 0.59, that of the non-deep learning scheme is 0.65, and that of the technical scheme of the application is 0.8. Online, the AUC value of the manually formulated strong rule scheme is 0.57, that of the non-deep learning scheme is 0.6, and that of the technical scheme of the application is 0.75. In terms of both offline and online AUC effect, the AUC of the technical scheme of the application is obviously improved compared with the other technical schemes, showing that its model effect is better.
Further, as shown in fig. 10b, fig. 10b is a schematic diagram comparing service effects. The advertisement click rate of the manually formulated strong rule scheme is 0.8%, that of the non-deep learning scheme is 1.8%, and that of the technical scheme of the application is 2.1%; the advertisement conversion rate of the manually formulated strong rule scheme is 0.1%, that of the non-deep learning scheme is 0.7%, and that of the technical scheme of the application is 1.3%. In terms of both advertisement click rate and advertisement conversion rate, the technical scheme of the application is obviously improved compared with the other technical schemes, and thus achieves a better effect.
According to the embodiment of the application, the target model is trained by acquiring a large amount of sample data, so that the target model can predict the input sample data, and the parameters in the target model can be continuously adjusted, thereby improving the accuracy of model prediction.
Optionally, referring to fig. 11, fig. 11 is a flowchart of another data prediction method according to an embodiment of the present application. The data prediction method can be applied to computer equipment, and the technical scheme of the application is mainly described in three stages of sample data acquisition, model training and model use; as shown in fig. 11, the data prediction method includes, but is not limited to, the steps of:
S301, sample data preparation.
The sample data preparation is mainly used for acquiring sample data, and filtering is performed on the sample data based on rules, so that abnormal data in the sample data are filtered. The method for preparing the sample data corresponds to step S201 in fig. 4, and is not described here again.
S302, characteristic processing of sample data.
The feature processing manner of the sample data corresponds to step S201 in fig. 4, and is not described herein.
S303, selecting a multipath model.
The multi-path model selection is to train K models based on the screened sample data, and select N target models from the K models. The specific manner of selecting the N target models corresponds to step S201 in fig. 4, and is not described herein.
S304, desensitizing the sample data.
The manner of desensitizing the sample data corresponds to step S202 in fig. 4, and is not described herein.
S305, training the selected model based on the desensitized sample data, and storing the trained model.
Because N target models are selected, the N target models can be trained by using the desensitized sample data, and the trained N target models are saved. The specific implementation manner of the training model in step S305 may refer to the implementation manners of step S203 to step S204, which are not described herein.
S306, obtaining target data to be predicted.
S307, desensitizing the target data.
S308, processing the target data after the desensitization processing based on the trained model, and determining the target identity category of the object.
The specific implementation manner of processing the target data in steps S306 to S308 may refer to the implementation manner of processing the target data in steps S101 to S104, which is not described herein.
In the embodiment of the application, the data security can be improved and the risk of leakage of the identity information of the object can be reduced by desensitizing the target data to be predicted. The N data prediction technologies are adopted to process the target desensitization data respectively to obtain N identity prediction sets, and as the target desensitization data are processed by using a plurality of data prediction technologies, the processing modes of the target desensitization data by each data prediction technology are different, the processing results are different, and the identity classification of the object is determined by combining the plurality of data prediction technologies, so that the identity prediction from a plurality of dimensions can be realized, and the accuracy of the data prediction is improved.
Optionally, referring to fig. 12, fig. 12 is a flowchart of another data prediction method according to an embodiment of the present application. As shown in fig. 12, the data prediction method includes, but is not limited to, the steps of:
S401, acquiring the labeled seed users.
The labeled seed users may be positive and negative samples, the label indicating whether a given sample is positive.
S402, obtaining a basic portrait of the seed user.
The base portrait may include some behavioral data of the seed user in certain applications, such as whether a certain class of functions in those applications is activated — for example, whether functions such as harassment interception or the answer assistant in a mobile phone manager application are activated.
S403, calculating an abnormal user type evaluation index.
The abnormal user type evaluation index may be used to determine whether a false user is operating the terminal device; for example, whether a user is abnormal may be determined from the time and continuous duration of the user's operation of the application.
S404, filtering abnormal seed users based on the distribution abnormal theorem.
The distribution abnormality theorem is used to determine abnormal values in the sample data, thereby filtering out abnormal seed users.
S405, determining whether the number of seed users meets the standard.
If yes, that is, the number of seed users reaches the standard, which indicates that the sample training data is enough, step S406 is executed; if not, that is, the number of the seed users does not reach the standard, which means that the sample training data is insufficient, step S401 is performed until the number of the seed users reaches the standard. Steps S401 to S405 are offline data preparation processes, and specific implementation may refer to step S201 in fig. 4.
S406, constructing object features.
The object features may include, but are not limited to, portrait features of the object, business vertical type features.
S407, constructing an aggregation feature by combining the time dimension, and performing feature processing on the aggregation feature to obtain a processed feature.
Wherein the computer device can obtain the aggregated features by aggregating portrait features and business features of different time spans. Feature processing of the aggregated features may include normalization feature processing and discretization feature processing, among others.
And S408, combining the processed features and storing the features in an HDFS offline.
Wherein, by combining the processed features, the multi-dimensional features of each user, such as one user for one Y-dimensional feature, can be obtained.
S409, solidifying characteristic processing logic, timing offline automatic calculation and storing offline calculation results in the HDFS.
Wherein by solidifying the logic of the feature processing, the data may be subsequently processed based on the processing logic. Steps S406 to S409 are offline feature processing, and the specific implementation may refer to step S201 in fig. 4.
S410, randomly dividing a sample set of feature processing to be used as a training set and a test set.
The training set can be used for training the model by dividing the sample set into the training set and the testing set, so that the accuracy of the model is improved. And further testing the model by using the test set, determining whether the detection result of the model is the same as the true value of the sample, and determining the accuracy of the model, so that the model can be selected and adjusted based on the accuracy of the model.
S411, training a plurality of models based on default parameters, and selecting N models from the plurality of models.
The default parameters may refer to the initial parameters in a model; training a model in fact adjusts these initial parameters repeatedly, so that the result obtained by processing the model input data with the adjusted parameters matches the sample true value as closely as possible. After training the plurality of models, N models may be selected based on the accuracy of their predictions — e.g. the number of times a model's prediction equals the sample true value — keeping the models whose counts exceed those of the other models. Alternatively, the AUC value of each model may be calculated, and N models selected from the plurality of models based on their AUC values. Steps S410 to S411 are the model selection process; for the specific implementation, refer to step S201 in fig. 4.
S412, reading in the low-order features and the high-order features, and splicing the features according to columns.
The low-order features and the high-order features are the object features stored in the HDFS, and a user may include the high-order features and the low-order features, and by splicing the high-order features and the low-order features, the situation of the user may be more completely reflected. Alternatively, other ways of stitching the low-order features and the high-order features may be performed.
S413, performing K anonymization processing on the spliced features to obtain an equivalent group M1.
S414, extracting a sensitive attribute set S in M1.
S415, extracting and analyzing the sensitivity probability alpha of each sensitivity attribute in the set S.
S416, adding noise to the set S.
S417, constructing a vector group which obeys the Laplace distribution for each sensitive group S.
S418, generating a sensitivity value conforming to the sensitivity probability α, and inputting the noised data into the equivalent group to obtain an equivalent group M2, i.e. the desensitized sample data.
Step S412 to step S418 are data desensitizing processes, and specific implementation may refer to step S202 in fig. 4.
S419, uniformly dividing the training data into Z parts, and training N target models by using a leave-one-out method.
Where Z is a positive integer, and "leave-one-out" here refers to dividing a large dataset into q small datasets, using q−1 of them as the training set and the remaining one as the test set, then selecting the next one as the test set with the remaining q−1 as the training set, and so on. Using this scheme, as much effective information as possible can be obtained from limited data, and the samples are learned from multiple angles, helping to avoid local extrema.
S420, predicting the rest training data and test data by using the trained N target models.
S421, combining the Z predicted training data to obtain new training data.
S422, merging the Z predicted test data by using a mean method to obtain new test data.
S423, determining N target models based on the new training data and the new test data, and solidifying the N models.
S424, solidifying the model training flow.
Step S419 to step S424 are processes of training a model based on the sensitive data, and the specific implementation manner may refer to step S203 to step S204 in fig. 4. After the model training process is cured, the offline training of the model, the model verification, the alarming and the curing of the offline processed model can be performed regularly. For example, when any link in the training process is abnormal, an alarm can be given to prompt related personnel to process.
Optionally, the process from step S419 to step S424 may also be referred to as a Stacking ensemble learning framework. Stacking fuses multiple classification models through a meta classifier: the first-level classifiers output prediction results after being trained on the training set, and the meta classifier is then trained on those outputs. That is, the Stacking ensemble learning framework may predict the data through the N target models and then average-pool the prediction results of the N target models to obtain the final prediction result.
Optionally, after training to obtain the target model, the target identity class of the object may be determined based on the target model, and specifically, reference may be made to the related description of the above embodiment, which is not repeated herein.
In the embodiment of the application, the data security can be improved and the risk of leakage of the identity information of the object can be reduced by desensitizing the target data to be predicted. The N data prediction technologies are adopted to process the target desensitization data respectively to obtain N identity prediction sets, and as the target desensitization data are processed by using a plurality of data prediction technologies, the processing modes of the target desensitization data by each data prediction technology are different, the processing results are different, and the identity classification of the object is determined by combining the plurality of data prediction technologies, so that the identity prediction from a plurality of dimensions can be realized, and the accuracy of the data prediction is improved.
The method of the embodiment of the application is described above, and the device of the embodiment of the application is described below.
Referring to fig. 13, fig. 13 is a schematic diagram showing a composition structure of a data prediction apparatus according to an embodiment of the present application, where the data prediction apparatus may be a computer program (including program code) running in a terminal device; the data prediction device can be used for executing corresponding steps in the data prediction method provided by the embodiment of the application. For example, the data processing apparatus 130 includes:
A data obtaining unit 1301, configured to obtain target data to be predicted, where the target data is used to indicate an identity class of an object;
a data desensitizing unit 1302, configured to perform desensitization processing on the target data, so as to obtain target desensitized data;
the data processing unit 1303 is configured to process the target desensitized data by using N data prediction technologies, to obtain N identity prediction sets, where each identity prediction set includes probabilities that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer;
an identity determination unit 1304 for determining a target identity class of the object based on the N sets of identity predictions.
Optionally, dense features and/or sparse features are included in the target desensitization data; the data processing unit 1303 is specifically configured to: process the dense features and/or the sparse features by adopting N data prediction technologies respectively to obtain the N identity prediction sets.
Optionally, the N identity prediction sets include at least two of a first identity prediction set, a second identity prediction set, and a third identity prediction set; the data processing unit 1303 is specifically configured to:
performing feature compression on the sparse features, performing feature stitching on the compressed features and the dense features to obtain first stitching features, and determining the first identity prediction set based on the first stitching features; and/or the number of the groups of groups,
Performing feature compression on the sparse features and the dense features, performing feature stitching on the compressed features to obtain second stitching features, and determining the second identity prediction set based on the second stitching features; and/or the number of the groups of groups,
and carrying out feature compression on the sparse features, carrying out feature stitching on the compressed features and the dense features to obtain third stitching features, and carrying out weight processing on the third stitching features based on an attention mechanism to determine the third identity prediction set.
Optionally, the identity determination unit 1304 is specifically configured to:
determining predicted identity categories belonging to the same category and probabilities of the predicted identity categories of the same category from the N identity prediction sets;
based on the probabilities of the predicted identity categories of the same kind in the N identity prediction sets, determining the probability of each predicted identity category in the M predicted identity categories to obtain the total probability of each predicted identity category;
determining the maximum probability from the total probability of the M predicted identity categories;
and determining the predicted identity class corresponding to the maximum probability as the target identity class of the object.
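As a sketch of the identity determination unit's logic — sum the same-category probabilities across the N identity prediction sets to obtain each category's total probability, then take the category with the maximum total. The array shapes and values are illustrative assumptions.

```python
import numpy as np

def target_identity(prediction_sets: np.ndarray) -> int:
    """prediction_sets has shape (N, M): N identity prediction sets,
    each holding the object's probabilities for M predicted identity
    categories. Summing same-category probabilities gives the total
    probability per category; the argmax is the target identity class."""
    totals = prediction_sets.sum(axis=0)
    return int(np.argmax(totals))

sets = np.array([[0.7, 0.2, 0.1],    # prediction set 1
                 [0.6, 0.3, 0.1]])   # prediction set 2
print(target_identity(sets))         # -> 0 (largest total probability)
```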
Optionally, the data desensitizing unit 1302 is specifically configured to:
acquiring the original sensitivity probability of the target data;
Adding noise to the target data to obtain noise data, and determining target sensitivity probability of the noise data based on the noise data and the original sensitivity probability;
and if the target sensitivity probability is in a target probability interval, determining the noise data as target desensitization data.
Optionally, the data prediction device 130 further includes: model training unit 1305:
obtaining sample data to be predicted, wherein the sample data is used for indicating a sample identity category of a sample object;
desensitizing the sample data to obtain sample desensitization data;
processing the sample desensitization data by adopting N data prediction technologies respectively to obtain N sample prediction sets, wherein each sample prediction set comprises the probability that a sample object belongs to M prediction identity categories;
determining a target sample identity class for the sample object based on the N sample prediction sets;
acquiring a sample identity tag of the sample data, and training to obtain N target models based on the sample identity tag and the target sample identity class;
the data processing unit 1303 is specifically configured to:
and processing the target desensitization data by adopting data prediction technologies corresponding to the N target models respectively to obtain N identity prediction sets.
Optionally, the model training unit 1305 is specifically configured to:
obtaining at least one initial sample data, and filtering the at least one initial sample data based on an abnormal rule to obtain sample filtering data, wherein the abnormal rule comprises at least one abnormal behavior;
and filtering the sample filtering data based on a distribution abnormality theorem to obtain sample data to be predicted, wherein the distribution abnormality theorem is used for filtering the data based on probability.
Optionally, the model training unit 1305 is specifically configured to:
inputting the sample data to be predicted into K models for training, and determining index parameters of the K models, wherein the index parameters are used for reflecting the classification performance condition of the models, and K is a positive integer greater than N;
determining N target models from the K models based on index parameters of the K models;
and processing the sample desensitization data by adopting data prediction technologies corresponding to the N target models respectively to obtain N sample prediction sets.
It should be noted that, in the embodiment corresponding to fig. 13, the content not mentioned may be referred to the description of the method embodiment, and will not be repeated here.
In the embodiment of the application, the data security can be improved and the risk of leakage of the identity information of the object can be reduced by desensitizing the target data to be predicted. The target desensitization data is processed by using a plurality of data prediction technologies, and the processing modes of the target desensitization data by using each data prediction technology are different, so that the processing results are different, and the identity classification of the object is determined by combining the plurality of data prediction technologies, so that the identity prediction from a plurality of dimensions can be realized, and the accuracy of the data prediction is improved.
Referring to fig. 14, fig. 14 is a schematic diagram of a composition structure of a computer device according to an embodiment of the present application. As shown in fig. 14, the computer device 140 may include: processor 1401, memory 1402, and network interface 1403. The processor 1401 is connected to the memory 1402 and the network interface 1403, for example the processor 1401 may be connected to the memory 1402 and the network interface 1403 by a bus. The computer device may be a terminal device or a server.
The processor 1401 is configured to support the data processing apparatus to perform the corresponding functions in the data processing method described above. The processor 1401 may be a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), a hardware chip or any combination thereof. The hardware chip may be an Application-specific integrated circuit (ASIC), a programmable logic device (Programmable Logic Device, PLD), or a combination thereof. The PLD may be a complex programmable logic device (Complex Programmable Logic Device, CPLD), a Field programmable gate array (Field-Programmable Gate Array, FPGA), general array logic (Generic Array Logic, GAL), or any combination thereof.
The memory 1402 stores program codes and the like. Memory 1402 may include Volatile Memory (VM), such as random access Memory (Random Access Memory, RAM); memory 1402 may also include Non-Volatile Memory (NVM), such as Read-Only Memory (ROM), flash Memory (flash Memory), hard Disk (HDD) or Solid State Drive (SSD); memory 1402 may also include a combination of the above types of memory. In an embodiment of the present invention, the memory 1402 is used to store a program for website security detection, interactive traffic data, and the like.
The network interface 1403 is used to provide network communication functionality.
The processor 1401 may call the program code to perform the following operations:
acquiring target data to be predicted, wherein the target data is used for indicating the identity category of an object;
desensitizing the target data to obtain target desensitized data;
processing the target desensitization data by adopting N data prediction technologies respectively to obtain N identity prediction sets, wherein each identity prediction set comprises the probability that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer;
A target identity class of the object is determined based on the N sets of identity predictions.
It should be understood that the computer device 140 described in the embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3, 4, 11 and 12, and may also perform the description of the data processing apparatus in the embodiment corresponding to fig. 13, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
The embodiments of the present application also provide a computer readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of the previous embodiments; the computer may be part of the computer device mentioned above, such as the processor 1401 described above. As an example, the program instructions may be executed on one computer device, or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, which may constitute a blockchain network.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium and executed by the processor, such that the computer device performs the steps performed in the embodiments of the methods described above.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, may include processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (10)

1. A method of data prediction, comprising:
obtaining target data to be predicted, wherein the target data is used for indicating the identity category of an object;
desensitizing the target data to obtain target desensitized data;
processing the target desensitization data by adopting N data prediction technologies respectively to obtain N identity prediction sets, wherein each identity prediction set comprises probabilities that the object belongs to M predicted identity categories, N is a positive integer greater than or equal to 2, and M is a positive integer;
a target identity class of the object is determined based on the N sets of identity predictions.
2. The method of claim 1, wherein the target desensitization data includes dense features and/or sparse features therein;
the processing of the target desensitization data by using N data prediction techniques to obtain N identity prediction sets includes:
and processing the dense features and/or the sparse features by adopting N data prediction technologies respectively to obtain N identity prediction sets.
3. The method of claim 2, wherein the N sets of identity predictions include at least two of a first set of identity predictions, a second set of identity predictions, and a third set of identity predictions;
The processing of the dense features and/or the sparse features by using N data prediction techniques, respectively, to obtain N identity prediction sets, includes:
performing feature compression on the sparse features, performing feature stitching on the compressed features and the dense features to obtain first stitching features, and determining the first identity prediction set based on the first stitching features; and/or the number of the groups of groups,
performing feature compression on the sparse features and the dense features, performing feature stitching on the compressed features to obtain second stitching features, and determining the second identity prediction set based on the second stitching features; and/or the number of the groups of groups,
and performing feature compression on the sparse features, performing feature stitching on the compressed features and the dense features to obtain third stitching features, and performing weight processing on the third stitching features based on an attention mechanism to determine the third identity prediction set.
4. A method according to any of claims 1-3, wherein said determining a target identity class of the object based on the N sets of identity predictions comprises:
determining predicted identity categories belonging to the same category and the probability of the predicted identity categories of the same category from the N identity prediction sets;
Based on the probabilities of the predicted identity categories of the same kind in the N identity prediction sets, determining the probability of each predicted identity category in the M predicted identity categories to obtain the total probability of each predicted identity category;
determining the maximum probability from the total probability of the M predicted identity categories;
and determining the predicted identity class corresponding to the maximum probability as the target identity class of the object.
5. The method of claim 1, wherein the desensitizing the target data to obtain target desensitized data comprises:
acquiring the original sensitivity probability of the target data;
adding noise to the target data to obtain noise data, and determining target sensitivity probability of the noise data based on the noise data and the original sensitivity probability;
and if the target sensitivity probability is in a target probability interval, determining the noise data as target desensitization data.
6. The method according to claim 1, wherein the method further comprises:
obtaining sample data to be predicted, wherein the sample data is used for indicating a sample identity category of a sample object;
desensitizing the sample data to obtain sample desensitization data;
Processing the sample desensitization data by adopting N data prediction technologies respectively to obtain N sample prediction sets, wherein each sample prediction set comprises the probability that a sample object belongs to M prediction identity categories;
determining a target sample identity class of the sample object based on the N sample prediction sets;
acquiring a sample identity tag of the sample data, and training to obtain N target models based on the sample identity tag and the target sample identity class;
the processing of the target desensitization data by using N data prediction techniques to obtain N identity prediction sets includes:
and processing the target desensitization data by adopting data prediction technologies corresponding to the N target models respectively to obtain N identity prediction sets.
7. The method of claim 6, wherein the obtaining sample data to be predicted comprises:
acquiring at least one initial sample data, and filtering the at least one initial sample data based on an abnormal rule to obtain sample filtering data, wherein the abnormal rule comprises at least one abnormal behavior;
and filtering the sample filtering data based on a distribution abnormality theorem to obtain sample data to be predicted, wherein the distribution abnormality theorem is used for filtering the data based on probability.
8. The method of claim 6, wherein the method further comprises:
inputting the sample data to be predicted into K models for training, and determining index parameters of the K models, wherein the index parameters are used for reflecting the classification performance condition of the models, and K is a positive integer greater than N;
determining N target models from the K models based on index parameters of the K models;
the processing of the sample desensitization data by using N data prediction techniques to obtain N sample prediction sets includes:
and processing the sample desensitization data by adopting data prediction technologies corresponding to the N target models respectively to obtain N sample prediction sets.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory, the network interface for providing data communication functions, the memory for storing program code, the processor for invoking the program code to cause the computer device to perform the method of any of claims 1-8.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-8.
CN202210454694.0A 2022-04-27 2022-04-27 Data prediction method, device and readable storage medium Pending CN117034181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210454694.0A CN117034181A (en) 2022-04-27 2022-04-27 Data prediction method, device and readable storage medium

Publications (1)

Publication Number Publication Date
CN117034181A true CN117034181A (en) 2023-11-10

Family

ID=88623168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210454694.0A Pending CN117034181A (en) 2022-04-27 2022-04-27 Data prediction method, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN117034181A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination