WO2017202006A1

WO2017202006A1 - Data processing method and device, and computer storage medium

Info

Publication number: WO2017202006A1
Application number: PCT/CN2016/109729
Authority: WO
Inventors: 陈玲; 陈谦; 陈培炫
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2016-05-25
Filing date: 2016-12-13
Publication date: 2017-11-30
Also published as: CN106056444A

Abstract

A data processing method and device, and a computer storage medium. The method comprises: collecting behavior data of a first account, the behavior data comprising Internet-based online behavior data and offline behavior data (S202); obtaining a first characteristic variable of the first account according to the behavior data (S204), wherein the first characteristic variable is used for representing behavior characteristics of the first account; inputting the first characteristic variable into a data analysis model (S206), wherein the data analysis model is used for outputting a first numerical value according to the first characteristic variable, and the first numerical value is used for representing the probability that a behavior of the first account does not satisfy a preset condition; and recording the first numerical value outputted by the data analysis model (S208).

Description

Data processing method and device, computer storage medium

Technical field

The present invention relates to the field of data processing, and in particular to a data processing method and apparatus, and a computer storage medium.

Background technique

Data processing can be applied to many services. Taking the existing personal credit reporting business as an example, the data processing procedure is as follows:

A personal credit rating is established by collecting data from banks. In general, the rating is built from the data in a basic credit information database, which includes credit information, public records, and inquiry records. Credit information includes credit card records, bank loan records, personal asset records, and other credit loan records. Public records include personal housing provident fund records, personal pension insurance records, and the like. Inquiry records include personal addresses, contact information, and the like. When a personal credit rating is established, the bank's credit information is used as the primary basis, and an individual's credit rating is obtained by means of a sample survey. However, because the bank's credit information is updated slowly, it cannot reflect the individual's true credit level in time, which makes the credit rating inaccurate. Moreover, the data obtained through a prior-art sample survey cannot fully reflect the true credit level of a bank account, so the resulting credit rating, and hence the data, is inaccurate.

In response to the above problems, no effective solution has been proposed yet.

Summary of the invention

The embodiment of the invention provides a data processing method and device, and a computer storage medium, to solve at least the technical problem that the credit level of the account cannot be accurately obtained and the data is inaccurate.

According to an aspect of an embodiment of the present invention, a data processing method is provided, including: collecting behavior data of a first account, where the behavior data includes Internet-based online behavior data and offline behavior data; acquiring a first feature variable of the first account according to the behavior data, where the first feature variable is used to represent a behavior characteristic of the first account; inputting the first feature variable into a data analysis model, where the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to indicate a probability value that the behavior of the first account does not satisfy a preset condition; and recording the first value output by the data analysis model.

According to another aspect of the embodiments of the present invention, a data processing apparatus is provided, including: a collecting unit, configured to collect behavior data of a first account, where the behavior data includes Internet-based online behavior data and offline behavior data; an acquiring unit, configured to acquire, according to the behavior data, a first feature variable of the first account, where the first feature variable is used to represent behavior characteristics of the first account; an input unit, configured to input the first feature variable into a data analysis model, where the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to indicate a probability value that the behavior of the first account does not satisfy a preset condition; and a recording unit, configured to record the first value output by the data analysis model.

The collecting unit, the acquiring unit, the input unit, and the recording unit may be implemented by a central processing unit (CPU), a digital signal processor (DSP), or a field-programmable gate array (FPGA).

The embodiment of the invention further provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions are configured to execute the data processing method described above.

In the embodiments of the present invention, the first feature variable represents the behavior characteristics of the first account, which are obtained from the Internet-based behavior data of the first account. The first feature variable is then input into the data analysis model to obtain the probability value that the behavior of the first account does not satisfy the preset condition. Because the behavior data of the first account in social applications covers the behavior of the first account relatively widely, the behavior data input into the data analysis model can fully reflect the behavior of the first account. The probability value obtained by the analysis is therefore more accurate, which solves the technical problem that the credit level of an account cannot be obtained accurately.

Brief description of the drawings

The drawings described herein are intended to provide a further understanding of the invention and form a part of the invention. In the drawings:

FIG. 1 is a schematic diagram of a network architecture in accordance with an embodiment of the present invention;

FIG. 2 is a flow chart of a data processing method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a model architecture in accordance with an embodiment of the present invention;

FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 5 is a hardware configuration diagram of a server according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the scope of the present invention.

It is to be understood that the terms "first", "second" and the like in the specification and claims of the present invention are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in an order other than the one illustrated or described herein. In addition, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the steps or units that are expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.

Example 1

In accordance with an embodiment of the present invention, a method embodiment that can be performed by the apparatus embodiment of the present application is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings can be executed in a computer system, such as one running a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one described herein.

According to an embodiment of the present invention, a data processing method is provided.

In this embodiment, the data processing method can be applied to a hardware environment formed by the terminal 102 and the server 104 shown in FIG. 1. As shown in FIG. 1, the terminal 102 is connected to the server 104 through a network. The network includes but is not limited to a mobile communication network, a wide area network, a metropolitan area network, or a local area network. The terminal 102 may be a mobile phone, a PC, a notebook computer, or a tablet.

The main working principle of the hardware environment system shown in Figure 1 is:

The server 104 collects behavior data of a plurality of terminals 102. This includes behavior data generated when a terminal 102 performs actions over the Internet (for example, chatting in an instant messaging application, watching videos, or playing games) and behavior data generated when a terminal 102 combines the Internet with offline actions (for example, storing exercise data in the cloud through a wearable device during exercise). The server 104 analyzes the feature variables of one or more terminals 102 according to the collected behavior data, and then obtains, according to those feature variables, the probability that the behavior of a given terminal satisfies a preset condition (for example, the credit rating of that terminal). Further, when the credit of a given terminal 102 is obtained through the feature variables of a plurality of terminals 102, the plurality of terminals have an association relationship (such as a friend relationship) with that terminal 102.

Since the behavior data of the first account is based on social applications and is not limited to the bank data of the prior art, the collected behavior data covers a wider range and can reflect, from multiple aspects, the probability value of the behavior of the first account meeting the preset condition. This improves the accuracy of the obtained probability value and thereby solves the technical problem that the prior art cannot accurately obtain the credit level of an account.

FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention. The data processing method provided by the embodiment of the present invention is described in detail below with reference to FIG. 2. As shown in FIG. 2, the data processing method mainly includes the following steps:

Step S202: Collect behavior data of the first account, where the behavior data includes Internet-based online behavior data and offline behavior data.

Step S204: Acquire a first feature variable of the first account according to the behavior data, where the first feature variable is used to represent a behavior feature of the first account.

Step S206: Input the first feature variable into the data analysis model, where the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to indicate a probability value that the behavior of the first account does not satisfy the preset condition.

Step S208: Record the first value output by the data analysis model.
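As an illustration only, steps S202 to S208 can be sketched in Python as a small pipeline. The names `ScoreLog`, `process_account`, and the three `demo_*` stand-ins are hypothetical and do not come from the specification; the collection, feature extraction, and model are passed in as plain functions so that the sketch stays independent of any particular data source or model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ScoreLog:
    """Hypothetical store for recorded first values, keyed by account ID."""
    records: Dict[str, float] = field(default_factory=dict)


def process_account(
    account_id: str,
    collect: Callable[[str], List[dict]],          # S202: online + offline behavior data
    extract: Callable[[List[dict]], List[float]],  # S204: first feature variable
    model: Callable[[List[float]], float],         # S206: data analysis model
    log: ScoreLog,
) -> float:
    behavior_data = collect(account_id)
    first_feature = extract(behavior_data)
    first_value = model(first_feature)
    if not 0.0 <= first_value <= 1.0:
        raise ValueError("the model must output a probability in [0, 1]")
    log.records[account_id] = first_value          # S208: record the first value
    return first_value


# Toy stand-ins (assumptions for the demo, not part of the described method).
def demo_collect(account_id: str) -> List[dict]:
    return [{"source": "online", "event": "chat"},
            {"source": "offline", "event": "check_in"}]


def demo_extract(data: List[dict]) -> List[float]:
    online = sum(1 for d in data if d["source"] == "online")
    offline = sum(1 for d in data if d["source"] == "offline")
    return [float(online), float(offline)]


def demo_model(features: List[float]) -> float:
    # Placeholder scoring rule: more recorded activity gives a lower value.
    return 1.0 / (1.0 + sum(features))
```

Calling `process_account("user_a", demo_collect, demo_extract, demo_model, ScoreLog())` runs the four steps in order and leaves the result in the log.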

The first feature variable represents the behavior characteristics of the first account, which are obtained from the Internet-based behavior data of the first account. The first feature variable is then input into the data analysis model to obtain the probability value that the behavior of the first account does not satisfy the preset condition. Because the behavior data of the first account in social applications covers the behavior of the first account relatively widely, the behavior data input into the data analysis model can fully reflect the behavior of the first account, so the probability value obtained by the analysis is more accurate.

Specifically, the behavior data includes online behavior data and offline behavior data of the first account based on the Internet.

The virtual-space behavior data on the Internet includes, but is not limited to:

1) Basic demographic attributes of the user, such as name, age, gender, region, education, and occupation;

2) Virtual value-added service data, such as dressing up a virtual account character, purchasing game items, film and television membership services, cloud storage value-added services, and music data packages;

3) Social interaction behavior data, such as chat, email, voice calls, microblog and personal-space posts, Douban reviews, knowledge Q&A, and reading public-account articles;

4) Economic behavior data, such as payment, wealth management, shopping, stocks, funds, P2P, and finance;

5) Entertainment and leisure behavior data, such as video on demand, music playback, karaoke, and news reading;

6) Educational behavior data, such as online reading, open course study, vocational test practice, skill training, and the use of translation software;

7) Other Internet and mobile application behavior data, such as app downloads and searches.

The online data can be collected from the information that the user fills in, or that the application actively reports, through the user's mobile phone, tablet, or PC in instant messaging applications, game clients, app download platforms, financial platforms, shopping software, and the like.

Offline scene data includes, but is not limited to:

1) O2O (online to offline) data, such as housekeeping services, city services, and beauty care;

2) Wearable device data, such as medical and health data and sports data;

3) LBS (location-based service) data, such as navigation, check-ins, and ride-hailing;

4) Travel data, such as ticket bookings and hotel reservations.

It can be seen that the behavior data covers actions in a variety of online and offline scenarios and includes behavior data on almost every aspect of life. Therefore, the probability value obtained according to the behavior data reflects the true probability value of the account more accurately. In addition, when the behavior data changes, the change is immediately fed back to, or immediately obtained by, the server. The behavior data is therefore updated quickly, and the probability value obtained from the promptly updated behavior data can reflect the probability that the behavior of the first account does not satisfy the preset condition. The probability that the preset condition is not satisfied may be the probability of default, such as the probability of behavior that does not comply with a contract.

For example, when the credit of user A is obtained according to the behavior data of user A, the chat behavior of user A's account in an instant messaging application, the behavior of watching videos in a video application, the behavior of downloading applications, and the like may be collected. Extracting first feature variables from each kind of behavior data separately yields first feature variables of different categories, for example, a first feature variable of the instant messaging category, a first feature variable of the video category, and a first feature variable of the download category. All of the first feature variables of the different categories may be input into the data analysis model, which outputs the first value. It is also possible to input only some of the first feature variables of the different categories into the data analysis model.

Generally, the friends of user A are similar to user A, and the behavior data of user A's friends can also reflect the probability that the behavior of user A does not satisfy the preset condition. Therefore, when the first feature variable is input to the data analysis model, the feature variables associated with the friends of user A can be input at the same time.

That is, inputting the first feature variable into the data analysis model includes: acquiring a second feature variable, where the second feature variable is used to represent behavior characteristics of a plurality of second accounts having an association relationship with the first account; and inputting the first feature variable and the second feature variable to the data analysis model, where the data analysis model is further configured to output the first value according to the first feature variable and the second feature variable.

The second feature variable is obtained in the same way as the first feature variable, which is described in detail later. The first account and the plurality of second accounts having the association relationship are friends; that is, the plurality of second accounts are friends of the first account. Both the online behavior and the offline behavior in the above examples can be mapped to the behavior of an application account through a certain correspondence. For example, if the second account registers for a navigation service and an instant messaging application with a mobile phone number, then the behavior data of the second account in the navigation service and in the instant messaging application are both collected when the behavior data of the second account is acquired.

Further, inputting the first feature variable and the second feature variable to the data analysis model includes: acquiring the intimacy between each of the plurality of second accounts and the first account, where the intimacy is generated according to the interaction behavior between each second account and the first account; and obtaining a third feature variable according to the intimacy and the second feature variable by using the following formula:

$$\upsilon' = f\big((\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_n),\ (\upsilon_1, \upsilon_2, \ldots, \upsilon_i, \ldots, \upsilon_n)\big),$$

where $\upsilon'$ represents the third feature variable, $i$ denotes the i-th second account, $\alpha_i$ is the intimacy between the i-th second account and the first account, $\upsilon_i$ is the second feature variable of the i-th second account, and $f$ denotes the weighted average of the second feature variables of the top $n$ second accounts, ranked from highest to lowest intimacy, with the intimacy used as the weight. The first feature variable and the third feature variable are then input to the data analysis model.

In this embodiment, the second feature variables of the second accounts are processed so that they better reflect the behavior characteristics of the first account. Therefore, when the third feature variable is obtained, each second feature variable is multiplied by its corresponding weight value, and a weighted average is then taken. The weight value represents the intimacy between the first account and the second account: the closer the two accounts are, the greater the weight value, and vice versa. Intimacy can be measured by the interaction between the first account and the second account. For example, the more chats there are between the first account and the second account, the more intimate their relationship. Likewise, the higher the degree of overlap between the first account and the second account, the more intimate the relationship between the two accounts. The intimacy and the degree of overlap can be obtained by means of a trained model. Interaction behaviors include interactions in the circle of friends, payment interactions (such as red envelopes), and sports interactions (such as walking 10,000 steps). Intimacy can be reflected by the information exchanged, including the number of messages sent and received, the number of days on which messages are exchanged, the ratio of messages sent to messages received, and the number of information interactions per day. The information includes text, video, and voice information. Intimacy can also be derived from commenting, liking, marking a friend as a special friend, giving a gift, or blocking.

For example, the third feature variable may be the weighted average of the second feature variables of the top 10 friends ranked by intimacy, with each friend's intimacy used as the weight.
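The intimacy-weighted average over the closest friends can be sketched as follows. This is a minimal illustration, not the specification's exact formula: it assumes that intimacy values are non-negative and that the average is normalized by the sum of the selected intimacy values, and the function name `third_feature_variable` is hypothetical.

```python
from typing import List, Sequence, Tuple


def third_feature_variable(
    friends: Sequence[Tuple[float, Sequence[float]]],
    n: int = 10,
) -> List[float]:
    """Combine friends' second feature variables into one vector.

    Each entry of `friends` is (intimacy, second_feature_vector) for one
    second account. Only the n accounts with the highest intimacy are used.
    Normalizing by the sum of intimacy values is an assumption of this sketch.
    """
    top = sorted(friends, key=lambda f: f[0], reverse=True)[:n]
    total = sum(alpha for alpha, _ in top)
    if not top or total <= 0:
        raise ValueError("need at least one friend with positive intimacy")
    dim = len(top[0][1])
    return [sum(alpha * v[k] for alpha, v in top) / total for k in range(dim)]
```

For instance, with `n=2` and friends `[(3.0, [1.0, 0.0]), (1.0, [0.0, 1.0]), (0.5, [10.0, 10.0])]`, the least intimate friend is ignored and the closer of the remaining two dominates the result.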

The general characteristics of a group can reflect the characteristics of an individual user in that group. Therefore, obtaining the probability value that a behavior does not satisfy the preset condition according to the characteristics of a group reflects the user's creditworthiness more accurately. It should be noted that, when the second feature variables of the plurality of second accounts are obtained, the top n second accounts are selected according to the intimacy between each second account and the first account, and the third feature variable is generated according to the intimacy and the second feature variables of those accounts.

Because the collected behavior data covers a wide range, the formats of the obtained data also differ. Therefore, after the behavior data is obtained, abnormal data is deleted, duplicate data is removed, data with large fluctuations is filtered out, and missing data is filled in. Abnormal data is data that is obviously outside a reasonable range. For example, a person's age normally does not exceed one hundred; if the collected data shows an age beyond that range, it is deleted as abnormal data. If the collected ages include 0 and 49, both values lie within the range of 0 to 100; however, most of the other data lies between 18 and 45, so 0 and 49 are singular points with large fluctuation values.
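The four cleaning operations can be illustrated with a short Python sketch on (account ID, age) records. The specification does not say how a large fluctuation is measured; the sketch assumes a distance from the median expressed in multiples of the median absolute deviation (MAD), and it assumes that missing values are filled with the median. The name `clean_records` and the threshold `k` are hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def clean_records(
    records: List[Tuple[str, Optional[float]]],
    valid_range: Tuple[float, float] = (0, 100),
    k: float = 3.0,
) -> List[Tuple[str, float]]:
    # 1. Remove exact duplicates while preserving order.
    seen, unique = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)

    # 2. Delete abnormal values outside the valid range (missing values stay).
    low, high = valid_range
    in_range = [(a, v) for a, v in unique if v is None or low <= v <= high]

    known = [v for _, v in in_range if v is not None]
    if not known:
        return []

    # 3. Filter singular points: too far from the median, measured in MADs.
    med = median(known)
    mad = median(abs(v - med) for v in known) or 1.0
    kept = [(a, v) for a, v in in_range if v is None or abs(v - med) <= k * mad]

    # 4. Fill missing values with the median of the remaining known values.
    fill = median(v for _, v in kept if v is not None)
    return [(a, fill if v is None else v) for a, v in kept]
```

With this rule an age of 150 is dropped as abnormal, while an age of 0 among values clustered around 30 is dropped as a singular point even though it lies inside the valid range.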

After this basic processing, the behavior data is divided into multiple dimensions based on data sources and business characteristics, for example basic information, social interaction, and financial management, and is classified and integrated into the database. When the data is written to the database, the data types and data structures can be agreed upon in advance. For example, the type of a numeric value is int and the type of a region name is a string. Other forms are also possible and are not enumerated here.
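Agreeing on data types before writing to the database can be sketched as a simple schema coercion. The field names in `SCHEMA` and the function `to_schema` are hypothetical examples chosen for illustration; the specification only states that types such as int and string are agreed upon.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical agreed schema: field name -> target type.
SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "account_id": str,
    "age": int,
    "region": str,
    "payment_amount": float,
}


def to_schema(raw: Dict[str, Any],
              schema: Optional[Dict[str, Callable[[Any], Any]]] = None) -> Dict[str, Any]:
    """Coerce one raw record to the agreed types; absent or empty fields become None."""
    schema = SCHEMA if schema is None else schema
    row: Dict[str, Any] = {}
    for name, cast in schema.items():
        value = raw.get(name)
        row[name] = None if value is None or value == "" else cast(value)
    return row
```

Records from different sources then share one structure, so downstream feature extraction does not need to handle per-source formats.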

Because a large amount of behavior data is stored, and many of the data items are correlated, the data needs to be filtered so that the more significant features are obtained and input into the data analysis model. For example:

a) In the basic attributes, civil servants tend to have stable jobs and reliable sources of income, which can reflect a user's economic ability and willingness to repay;

b) In social interactions, users who often fail to reply to messages in a timely manner may be lazy, which reflects a tendency to procrastinate;

c) Frequent purchases of value-added services and frequent online shopping can reflect a user's financial ability;

d) In economic behavior, purchases of stocks, funds, and P2P products can reflect a user's risk tolerance and economic ability;

e) Hailing rides but often canceling the order, or having a low rating, can reflect a user's reputation;

f) If the people with whom a user frequently interacts are all high-quality people who honor their agreements and have strong economic ability, this also reflects on the user to a certain extent.

The above basic attributes, social interaction behavior, purchase behavior, taxi behavior and friend attributes can all reflect the behavior characteristics of the first account.

In an embodiment of the present invention, acquiring the first feature variable of the first account according to the behavior data includes: acquiring the information gain of each feature in the behavior data, where the information gain is used to indicate the amount of information included in the behavior data; determining whether the information gain is within a preset value range; if the information gain is within the preset value range, constructing a derived variable according to the behavior data, where the derived variable is merged or split behavior data; if the information gain is outside the preset value range, deleting the feature corresponding to that information gain and constructing the derived variable according to the remaining features; and using the derived variable as the first feature variable.

In an embodiment of the present invention, deleting a feature corresponding to an information gain outside the preset value range and then constructing a derived variable according to the remaining features includes: deleting the feature corresponding to the information gain outside the preset value range; obtaining the correlation coefficients of the remaining features; combining the features whose correlation coefficient is greater than or equal to a preset coefficient into one merged feature; and using the merged feature as the derived variable.

Behavior data consists of features and feature values. For example, the number of text chats, the number of voice calls, and the payment amount in the collected behavior data are all features. If there are 9 text chats, 10 voice calls, and a payment amount of 100, these numbers are the feature values. The information gain reflects the amount of information in a feature. If the amount of information is less than a threshold, the feature can be deleted. For example, the features of each type are sorted by information gain, and the features whose information gain is less than the threshold are deleted. The correlation of the remaining features is then examined. If some features are strongly correlated, they are combined to obtain the first feature variable. If a feature is weakly correlated with the others and highly significant, it can be refined into multiple features. For example, the number of chats can be split into the number of evening chats, daytime chats, weekend chats, and weekday chats, the number of chat days, and so on. Conversely, evening chats and daytime chats can be combined into chats.
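A minimal sketch of this selection-and-merging step is given below. It makes several assumptions that the specification leaves open: information gain is computed for discrete feature values against a binary compliance/default label, correlation is the Pearson coefficient, and strongly correlated features are merged by summation (as in combining evening and daytime chats). The function names and thresholds are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def entropy(labels: Sequence[int]) -> float:
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """H(labels) minus the entropy of labels conditioned on the feature value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_and_merge(
    features: Dict[str, List[float]],
    labels: Sequence[int],
    min_gain: float,
    min_corr: float,
) -> Dict[str, List[float]]:
    # Delete features whose information gain is below the threshold.
    kept = {name: vals for name, vals in features.items()
            if information_gain(vals, labels) >= min_gain}
    # Merge the remaining features whose correlation reaches the preset coefficient.
    merged: Dict[str, List[float]] = {}
    used = set()
    names = list(kept)
    for i, a in enumerate(names):
        if a in used:
            continue
        group = [a]
        for b in names[i + 1:]:
            if b not in used and pearson(kept[a], kept[b]) >= min_corr:
                group.append(b)
                used.add(b)
        used.add(a)
        merged["+".join(group)] = [sum(col) for col in zip(*(kept[g] for g in group))]
    return merged
```

In practice, continuous features would usually be binned before the information gain is computed; the sketch omits that step for brevity.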

In this embodiment, the behavior data can be flexibly split and merged to construct the first feature variable. When splitting and merging, the same or different methods can be used for different features (for example, principal component analysis for some features and clustering for others), which increases the flexibility of constructing the first feature variable.

In an embodiment of the present invention, when the data analysis model is established, a sub-model may be generated for each category into which the behavior data is classified, and each sub-model outputs a first sub-value. Processing these first sub-values yields the first value output by the data analysis model. Further, when a sub-model is established, it may be trained on the sample data of its category. Alternatively, each category may be further divided, a low-level model may be established for each division, the low-level models may be combined into the sub-model, and the sub-models then constitute the data analysis model.

In an embodiment of the present invention, before the first feature variable and the second feature variable are input to the data analysis model, the method further includes: dividing the behavior data into a plurality of categories; establishing a sub-model for each of the plurality of categories, where each sub-model is configured to output a first sub-value according to the first feature variable and/or the second feature variable, and the first sub-value is used to represent the probability value that, in the category corresponding to the sub-model, the behavior of the first account does not satisfy the preset condition; and constructing the plurality of sub-models corresponding to the plurality of categories into the data analysis model.

In an embodiment of the present invention, establishing a sub-model for each of the plurality of categories includes: establishing a sub-model for each category using the same or different training models; or using the same or different training The model establishes a low-level model for each subcategory under each category, and constructs a low-level model corresponding to multiple subcategories under each category as a sub-model.

The training models used to build the sub-models of the categories can be the same or different. For example, among 10 categories, 5 categories may use a decision tree to train their sub-models, and the other 5 may use a neural network.

In an embodiment of the present invention, constructing the plurality of sub-models corresponding to the plurality of categories into the data analysis model includes constructing the plurality of sub-models into the data analysis model by a formula in which $P_{\text{total}}$ represents the first value, $i$ denotes the i-th sub-model of the plurality of sub-models, $n$ is the number of sub-models, each sub-model has its own coefficient, $P_i'$ is the first sub-value output by the i-th sub-model, and $P_0$ is a constant term.

Further, dividing the behavior data into a plurality of categories includes: dividing the behavior data into a plurality of categories according to the type of the service included in the behavior data; or dividing the data including the target object in the behavior data into one category, and not in the behavior data The data including the target object is divided into another category.

The three division methods, namely hierarchical division, division by service type, and division by whether the target object is included, may each be used on its own to construct the sub-models, or any two or all three may be combined. For example, sub-models may first be established according to whether the target object is included, and each sub-model may then be further divided according to service type.

Division by service mainly refers to the data categories described above, such as basic information, value-added services, social interaction, and economic behavior. Grouping is mainly based on business characteristics. For example, in economic activities, users with credit cards and users without credit cards behave quite differently in payment, shopping, wealth management, and so on, so they can be divided into two groups for which models are built separately. Layering mainly concerns the level of the overall model architecture. For example, the sub-model layer can be further divided into multiple dimension layers, and the machine learning algorithms adopted in each layer can differ considerably.

When generating a submodel, the detailed approach is as follows:

1) Obtain good and bad samples, divide training sets and test sets; good samples are behavior data at the time of compliance, and bad samples are behavior data at the time of default.

2) According to the business characteristics of each sub-model, extract multi-dimensional features of the user and the user's friends, and train multi-layer sub-models using multiple machine learning algorithms such as regression, classification, and segmentation. Taking the social interaction sub-model as an example, the steps are as follows:

1. Extract features of the user and the user's friends in at least the following dimensions: text chat, voice message, video call, picture posting, comments and likes, and question-and-answer interaction;

2. Use machine learning algorithms such as LR (logistic regression), decision trees, neural networks, and GBDT to train the dimension-layer models of the social interaction sub-model, and output credit probability values;

3. Train the social interaction sub-model using the algorithms described in step 2, and output the credit probability value (the first sub-value).

3) Using the credit probability values output by the sub-models as input values, train the total model using the formula that combines the sub-model outputs into the first value, and output the predicted probability value (the first value);
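Step 3) can be sketched as follows. The combining formula is not reproduced in this text, so the sketch assumes a logistic combination of the first sub-values with one coefficient per sub-model plus a constant; that form, the function names, the learning rate, and the sample values are all assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_model(sub_values, coefs, p0):
    """Combine first sub-values into one probability (assumed logistic form)."""
    z = p0 + sum(c * p for c, p in zip(coefs, sub_values))
    return sigmoid(z)

def train_total_model(samples, labels, lr=0.5, epochs=2000):
    """Fit the coefficients and the constant by gradient descent on log loss."""
    n = len(samples[0])
    m = len(samples)
    coefs, p0 = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_c, grad_0 = [0.0] * n, 0.0
        for x, y in zip(samples, labels):
            err = total_model(x, coefs, p0) - y
            grad_0 += err
            for j in range(n):
                grad_c[j] += err * x[j]
        p0 -= lr * grad_0 / m
        coefs = [c - lr * g / m for c, g in zip(coefs, grad_c)]
    return coefs, p0

# Each row holds the first sub-values of two hypothetical sub-models;
# label 1 marks a bad (default) sample, label 0 a good sample.
train_x = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
train_y = [1, 1, 1, 0, 0, 0]
coefs, p0 = train_total_model(train_x, train_y)
high_risk = total_model([0.85, 0.75], coefs, p0)
low_risk = total_model([0.15, 0.2], coefs, p0)
```

In this sketch the sub-models are trained first and frozen; only the coefficients and the constant of the total model are fitted on their outputs.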

This embodiment will be described with reference to FIG. 3.

The data processing method of this embodiment is mainly divided into four parts, including data acquisition, data processing, feature mining and model construction.

(1) Data collection. This includes collecting online data and offline scene data. Online data includes data on games, finance, apps, shopping, social activity, and education, such as game titles and purchase amounts. Offline scene data includes data on daily life, navigation, travel, check-ins, medical care, and sports, such as medical records, hotel bookings, and travel destinations.

(2) Data processing. This includes cleaning, integration, and standardization. Cleaning includes deduplication, deletion of singular points (outliers), removal of abnormal data, and completion of missing information; integration includes grouping data of the same category together; and standardization includes unifying data types and unifying the data structures used for storage.

(3) Feature mining. The processed data is mined, for example, using graph computation and text mining methods. The mined features cover user basic information, social interaction, personality traits, hobbies, emotional orientation, life circle, physical health, and financial management.

(4) Model construction. The mined features are classified into categories, for example social interaction, hobbies, health, and personality, and a model is built for each category. Each model can be obtained using a different learning and training method. For the social interaction model, the social interaction features can be further subdivided into chat features, phone features, and video features. After the sub-models are built, the total model is obtained. The first feature variable and the third feature variable are then input into the sub-models, and the total model outputs the first value.

For example, the first feature variable includes feature a1, feature a2, and feature a3. Feature b1, feature b2, and feature b3 of the second account are also acquired as the third feature variable, and the features are input into the sub-model as follows: y = f(a1*b1) + f(a2*b2) + f(a3*b3). Features a1, a2, and a3 and features b1, b2, and b3 form three pairs of sequentially corresponding features. For example, feature a1 represents the payment amount of the first account and feature b1 represents the payment amount of the second account; feature a2 represents the game type of the first account and feature b2 represents the game type of the second account; feature a3 represents the number of exercise sessions of the first account and feature b3 represents the number of exercise sessions of the second account.
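The pairwise input above can be illustrated with a short sketch. The embodiment does not specify the function f, so `math.log1p` is used here purely as a placeholder; the numeric encoding of the game type and all sample values are likewise assumptions.

```python
import math

def f(x):
    """Hypothetical transform; the embodiment does not specify f."""
    return math.log1p(x)

def paired_input(first_features, third_features):
    """Compute y = f(a1*b1) + f(a2*b2) + f(a3*b3) over corresponding pairs."""
    if len(first_features) != len(third_features):
        raise ValueError("features must correspond pairwise")
    return sum(f(a * b) for a, b in zip(first_features, third_features))

# Payment amount, encoded game type, exercise count (illustrative values).
a = [100.0, 2.0, 5.0]   # first account
b = [80.0, 2.0, 3.0]    # second account
y = paired_input(a, b)
```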

In an embodiment of the present invention, in order to improve the readability of the first value, the first value is converted into a third value that reflects the credit level of the first account. The first value represents the probability that the first account defaults; after conversion to the third value, it indicates the credit level of the first account. That is, after recording the first value output by the data analysis model, the method further comprises converting the first value to the third value S by using the following method:

Wherein, S is used to indicate the degree to which the behavior of the first account satisfies the preset condition, b represents a reference value, p represents a first value, and st represents a step size.
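The conversion formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes a widely used scorecard-style mapping, S = b - st * log2(p / (1 - p)), in which b is the score at even odds and st is the number of points lost each time the default odds double. This form, the function name `to_score`, and the default values of b and st are assumptions and may differ from the formula of this embodiment.

```python
import math

def to_score(p, b=600.0, st=50.0):
    """Map a default probability p to a score S (assumed log-odds form).

    b is the reference value and st is the step size; both defaults
    are illustrative and not taken from the embodiment.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    odds = p / (1.0 - p)
    return b - st * math.log2(odds)

scores = [round(to_score(p), 1) for p in (0.5, 0.2, 0.8)]
```

Under this assumed mapping, a lower default probability yields a higher score, which matches the stated purpose of S as the degree to which the behavior satisfies the preset condition.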

When credit is evaluated in this embodiment, the features used comprehensively cover the user's online and offline behavior. They include not only basic user information, social interaction, financial activities, hobbies, and life circle, but also deeper features such as the user's personality characteristics and emotional inclination, which better characterize the stable traits of the user's mental outlook and personality. At the same time, a multi-layer machine learning approach is adopted, which increases the complexity and predictive ability of the algorithm and improves the accuracy of the user's credit level.

It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described action sequence, because certain steps may be performed in other sequences or concurrently in accordance with the present invention. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a cell phone, a computer, a server, or a network device, etc.) to perform the methods described in various embodiments of the present invention.

Example 2

According to an embodiment of the present invention, there is further provided a data processing apparatus for implementing the above data processing method. The data processing apparatus is mainly used to perform the data processing method provided by the foregoing embodiment of the present invention. The data processing apparatus provided by this embodiment is described in detail below:

FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention. As shown in FIG. 4, the data processing apparatus mainly includes a collecting unit 10, an obtaining unit 20, an input unit 30, and a recording unit 40.

The collecting unit 10 is configured to collect behavior data of the first account, the behavior data including Internet-based online behavior data and offline behavior data.

The obtaining unit 20 is configured to obtain a first feature variable of the first account according to the behavior data, where the first feature variable is used to represent a behavior feature of the first account.

The input unit 30 is configured to input the first feature variable into the data analysis model, wherein the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to indicate a probability value that the behavior of the first account does not satisfy the preset condition.

The recording unit 40 is configured to record the first value output by the data analysis model.

Behavior data includes actions in a variety of online and offline scenarios and covers all aspects of life. Therefore, the probability values obtained from this behavior data reflect the true probability value of the account more accurately. In addition, when the behavior data changes, the change is immediately fed back to the server or instantly obtained by the server. The behavior data is therefore updated quickly, and the probability value obtained from the instantly updated behavior data reflects the probability that the current behavior of the first account does not satisfy the preset condition. The probability value of not satisfying the preset condition may be a probability of default, such as failing to comply with contracted behavior.

That is, the input unit includes: a first acquiring subunit, configured to acquire a second feature variable, wherein the second feature variable is used to represent behavior characteristics of a plurality of second accounts having an association relationship with the first account; and an input subunit, configured to input the first feature variable and the second feature variable to the data analysis model, wherein the data analysis model is further configured to output the first value according to the first feature variable and the second feature variable.

Further, the input subunit includes: a first obtaining module, configured to acquire the intimacy between the plurality of second accounts and the first account, wherein the intimacy is generated according to the interaction behavior of each second account with the first account, and the third feature variable is obtained from the intimacy and the second feature variable using the following formula:

$$\upsilon' = f\big((\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_n),\ (\upsilon_1, \upsilon_2, \ldots, \upsilon_i, \ldots, \upsilon_n)\big)$$

where $\upsilon'$ represents the third feature variable, i denotes the i-th second account, $\alpha_i$ is the intimacy between the i-th second account and the first account, $\upsilon_i$ is the second feature variable of the i-th second account, and f denotes a weighted average of the second feature variables, weighted by intimacy, over the first n second accounts ranked by intimacy from high to low; and an input module, configured to input the first feature variable and the third feature variable into the data analysis model.

For example, the third feature variable may be the intimacy-weighted average of the second feature variables of the first n second accounts.

The general characteristics of a group can reflect the characteristics of a user in that group. Therefore, the probability value that the behavior does not satisfy the preset condition can be obtained from the characteristics of the group, which reflects the user's credit level more accurately. It should be noted that, when acquiring the second feature variables of the plurality of second accounts, the first n second accounts are first selected by ranking the intimacy between each second account and the first account, and the third feature variable is then generated from the intimacy and the second feature variables.
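The selection of the first n second accounts and the weighted average can be sketched as follows. The exact form of f is assumed here to be a plain intimacy-weighted mean of scalar feature values; the function name `third_feature` and the sample values are hypothetical.

```python
def third_feature(intimacies, second_features, n):
    """Intimacy-weighted average of the second feature variables of the
    n second accounts with the highest intimacy (assumed form of f)."""
    ranked = sorted(
        zip(intimacies, second_features),
        key=lambda pair: pair[0],
        reverse=True,
    )[:n]
    total = sum(alpha for alpha, _ in ranked)
    if total == 0:
        raise ValueError("intimacy weights sum to zero")
    return sum(alpha * value for alpha, value in ranked) / total

alphas = [0.9, 0.1, 0.6, 0.3]      # intimacy of each second account
upsilons = [0.2, 0.9, 0.4, 0.7]    # second feature variable of each second account
v3 = third_feature(alphas, upsilons, n=2)
```

With n=2, only the two second accounts with intimacy 0.9 and 0.6 contribute to the result.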

Because the collected behavior data covers a wide range, the data formats obtained also differ. Therefore, after the behavior data is obtained, abnormal data is deleted, duplicate data is removed, data with large fluctuation values is filtered out, and missing data is completed. Abnormal data may be data that clearly falls outside a certain range. For example, a person's age usually does not exceed one hundred, so if the collected data shows an age greater than 100, that abnormal data is deleted. If the collected ages include 0 and 49, both fall within the range of 0 to 100; however, if most of the other data lies between 18 and 45, then 0 and 49 are singular points with large fluctuation values.
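The cleaning steps in the age example can be sketched as follows. This is a minimal illustration under assumptions: the record fields `account` and `age`, the fixed typical band of 18 to 45 used to detect singular points, and median filling for missing values are all choices made for the sketch, not requirements of the embodiment.

```python
from statistics import median

def clean_ages(records, low=0, high=100, typical=(18, 45)):
    """Deduplicate, drop out-of-range ages, drop singular points, fill gaps."""
    seen, unique = set(), []
    for rec in records:                       # remove duplicate records
        key = (rec["account"], rec["age"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    in_range = [r for r in unique             # delete abnormal data
                if r["age"] is None or low <= r["age"] <= high]
    kept = [r for r in in_range               # filter out singular points
            if r["age"] is None or typical[0] <= r["age"] <= typical[1]]
    fill = median(r["age"] for r in kept if r["age"] is not None)
    for r in kept:                            # complete missing data
        if r["age"] is None:
            r["age"] = fill
    return kept

raw = [
    {"account": "A", "age": 25}, {"account": "A", "age": 25},
    {"account": "B", "age": 130}, {"account": "C", "age": 0},
    {"account": "D", "age": 49}, {"account": "E", "age": None},
    {"account": "F", "age": 30}, {"account": "G", "age": 40},
]
cleaned = clean_ages(raw)
```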

In an embodiment of the present invention, the obtaining unit includes: an acquiring subunit, configured to acquire an information gain of a feature in the behavior data, where the information gain is used to represent the amount of information included in the behavior data; a judging subunit, configured to determine whether the information gain is within a preset value range; a constructing subunit, configured to construct a derived variable according to the behavior data when the information gain is within the preset value range, wherein the derived variable is merged or split behavior data; a deleting subunit, configured to, when the information gain is outside the preset value range, delete the feature corresponding to that information gain and then construct the derived variable according to the remaining features; and a determining subunit, configured to use the derived variable as the first feature variable.

In an embodiment of the present invention, the deleting subunit includes: a second acquiring module, configured to acquire correlation coefficients of the remaining features after the feature corresponding to the information gain outside the preset value range is deleted; a merging module, configured to merge features whose correlation coefficient is greater than or equal to a preset coefficient into one merged feature; and a determining module, configured to use the merged feature as the derived variable.

Features and feature values constitute the behavior data. For example, the number of text chats, the number of voice calls, and the payment amount in the collected behavior data are all features. In "9 text chats, 10 voice calls, and a payment amount of 100", the numbers are the feature values. The information gain reflects the amount of information carried by a feature; if it is less than a threshold, the feature can be deleted. For example, the features of each type are sorted by information gain, and features whose information gain is below the threshold are deleted. The correlation among the remaining features is then checked, and strongly correlated features are merged to obtain the first feature variable. If a feature is weakly correlated with the others and highly significant, it can be refined into multiple features. For example, the number of chats can be split into evening chats, daytime chats, weekend chats, and weekday chats. Conversely, evening chats and daytime chats can be merged into chats.
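The feature screening described above can be sketched as follows. The sketch computes information gain against a binary default label after splitting each feature at its mean, then merges two features by summing them when their Pearson correlation is high. The split rule, the thresholds 0.1 and 0.9, the merge by summation, and the feature names are all assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Label entropy minus the entropy remaining after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

labels = [1, 1, 0, 0]                      # 1 = default, 0 = compliant
features = {
    "night_chats": [1, 2, 8, 9],
    "day_chats": [2, 4, 16, 18],
    "constant_flag": [1, 1, 1, 1],
}
gains = {}
for name, values in features.items():
    mean = sum(values) / len(values)       # binarize each feature at its mean
    gains[name] = information_gain([v > mean for v in values], labels)

kept = [name for name, g in gains.items() if g >= 0.1]   # drop low-gain features
r = pearson(features["night_chats"], features["day_chats"])
# Merge the two chat features into one derived variable if strongly correlated.
chats = ([a + b for a, b in zip(features["night_chats"], features["day_chats"])]
         if r >= 0.9 else None)
```

Here `constant_flag` carries no information about the label and is dropped, while the two chat counts are merged into a single derived variable.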

In an embodiment of the present invention, the apparatus further includes: a dividing unit, configured to divide the behavior data into a plurality of categories before the first feature variable and the second feature variable are input to the data analysis model; a first establishing unit, configured to establish a sub-model for each of the plurality of categories, wherein each sub-model is configured to output a first sub-value according to the first feature variable and/or the second feature variable, and the first sub-value is used to indicate, under the category corresponding to the sub-model, the probability value that the behavior of the first account does not satisfy the preset condition; and a second establishing unit, configured to construct the plurality of sub-models corresponding to the plurality of categories as the data analysis model.

In an embodiment of the present invention, the first establishing unit includes: a first establishing subunit, configured to establish a sub-model for each category by using the same or different training models; or a second establishing subunit, configured to establish a low-level model for each subcategory under each category by using the same or different training models, and to construct the low-level models corresponding to the multiple subcategories under each category as the sub-model.

The training models used to build the sub-models of the categories can be the same or different. For example, among 10 categories, 5 categories may use a decision tree training model, and the other 5 may use a neural network training model.

In an embodiment of the present invention, the second establishing unit is further configured to construct multiple sub-models into a data analysis model in the following manner:

where the i-th sub-model has its own coefficient, $P_i'$ is the first sub-value output by the i-th sub-model, and $P_0$ is a constant.

In an embodiment of the present invention, the dividing unit includes: a first dividing subunit, configured to divide the behavior data into multiple categories according to a service type included in the behavior data; or a second dividing subunit, configured to divide the data in the behavior data that includes the target object into one category, and to divide the data in the behavior data that does not include the target object into another category.

In an embodiment of the present invention, the apparatus further includes: a converting unit, configured to convert the first value into the third value S by using the following method after recording the first value output by the data analysis model:

Example 3

According to an embodiment of the present invention, a server for implementing the above data processing method is further provided. As shown in FIG. 5, the server mainly includes a processor 501, a data interface 503, a memory 505, and a network interface 507, where:

The data interface 503 transmits the behavior data acquired by the third party tool to the processor 501 mainly by means of data transmission.

The memory 505 is mainly used to store behavior data and data analysis models.

The network interface 507 is mainly used for network communication with the server, and obtains behavior data provided by the terminal from other servers.

The processor 501 is mainly configured to perform the following operations:

Collecting behavior data of the first account, where the behavior data includes Internet-based online behavior data and offline behavior data; acquiring a first feature variable of the first account according to the behavior data, where the first feature variable is used to represent a behavior characteristic of the first account; inputting the first feature variable to a data analysis model, wherein the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to indicate a probability value that the behavior of the first account does not satisfy a preset condition; and recording the first value output by the data analysis model.

The processor 501 is further configured to acquire a second feature variable, where the second feature variable is used to represent behavior characteristics of a plurality of second accounts that have an association relationship with the first account; and to input the first feature variable and the second feature variable to the data analysis model, wherein the data analysis model is further configured to output the first value according to the first feature variable and the second feature variable.

The processor 501 is further configured to acquire the intimacy between the plurality of second accounts and the first account, where the intimacy is generated according to the interaction behavior of each of the second accounts with the first account; and to obtain a third feature variable according to the intimacy and the second feature variable by using the following formula:

$$\upsilon' = f\big((\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_n),\ (\upsilon_1, \upsilon_2, \ldots, \upsilon_i, \ldots, \upsilon_n)\big)$$

where $\upsilon'$ denotes the third feature variable, i denotes the i-th second account, $\alpha_i$ is the intimacy between the i-th second account and the first account, $\upsilon_i$ is the second feature variable of the i-th second account, and f denotes a weighted average of the second feature variables, weighted by intimacy, over the first n second accounts ranked by intimacy from high to low. The first feature variable and the third feature variable are then input to the data analysis model.

In the embodiment of the present invention, the specific examples in this embodiment may refer to the examples described in Embodiment 1 and Embodiment 2, and details are not described herein again.

Example 4

Embodiments of the present invention also provide a storage medium. In the embodiment, the above storage medium may be used to store program codes of the data processing method of the embodiment of the present invention.

In this embodiment, the foregoing storage medium may be located in at least one of a plurality of network devices in a mobile communication network, a wide area network, a metropolitan area network, or a local area network.

In the present embodiment, the storage medium is arranged to store program code for performing the following steps:

S1: Collect behavior data of the first account, where the behavior data includes Internet-based online behavior data and offline behavior data.

S2: Acquire a first feature variable of the first account according to the behavior data, where the first feature variable is used to represent a behavior feature of the first account.

S3: Input the first feature variable to a data analysis model, wherein the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to represent a probability value that the behavior of the first account does not satisfy the preset condition.

S4: Record the first value output by the data analysis model.

In an embodiment of the present invention, the storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.

In an embodiment of the present invention, the processor performs the following according to the program code stored in the storage medium: acquiring the second feature variable, where the second feature variable is used to represent behavior characteristics of a plurality of second accounts having an association relationship with the first account; and inputting the first feature variable and the second feature variable to the data analysis model, wherein the data analysis model is further configured to output the first value according to the first feature variable and the second feature variable.

In an embodiment of the present invention, the processor performs the following according to the program code stored in the storage medium: acquiring the intimacy between the plurality of second accounts and the first account, wherein the intimacy is generated according to the interaction behavior of each second account with the first account; and obtaining the third feature variable according to the intimacy and the second feature variable by using the following formula:

$$\upsilon' = f\big((\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_n),\ (\upsilon_1, \upsilon_2, \ldots, \upsilon_i, \ldots, \upsilon_n)\big)$$

where $\upsilon'$ represents the third feature variable, i denotes the i-th second account, $\alpha_i$ is the intimacy between the i-th second account and the first account, $\upsilon_i$ is the second feature variable of the i-th second account, and f denotes a weighted average of the second feature variables, weighted by intimacy, over the first n second accounts ranked by intimacy from high to low. The first feature variable and the third feature variable are then input to the data analysis model.

For an example of the embodiment of the present invention, reference may be made to the examples described in the foregoing Embodiment 1 and Embodiment 2, and details are not described herein again.

The serial numbers of the embodiments of the present invention are merely for the description, and do not represent the advantages and disadvantages of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in the above-described computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions to cause one or more computer devices (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts that are not detailed in a certain embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function; in actual implementation, there may be other manners of division. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other form.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can also make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be considered within the scope of protection of the present invention.

Industrial applicability

In the embodiment of the present invention, the first feature variable is used to represent the behavior characteristics of the first account, which are obtained from the Internet-based behavior data of the first account. After the first feature variable is input into the data analysis model, the probability value that the behavior of the first account does not satisfy the preset condition can be obtained. Since the behavior data of the first account in the social application covers the behavior of the first account relatively widely, the behavior data input into the data analysis model can fully reflect the behavior of the first account, so that the probability value obtained by analysis is more accurate, thereby solving the technical problem that the credit level of an account cannot be accurately obtained.

Claims

A data processing method comprising:

collecting behavior data of a first account, the behavior data including Internet-based online behavior data and offline behavior data;

acquiring, according to the behavior data, a first feature variable of the first account, wherein the first feature variable is used to represent a behavior feature of the first account;

inputting the first feature variable into a data analysis model, wherein the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to indicate a probability value that the behavior of the first account does not satisfy a preset condition; and

recording the first value output by the data analysis model.
The method of claim 1, wherein inputting the first feature variable into the data analysis model comprises:

acquiring a second feature variable, wherein the second feature variable is used to represent behavior characteristics of a plurality of second accounts having an association relationship with the first account; and

inputting the first feature variable and the second feature variable to the data analysis model, wherein the data analysis model is further configured to output the first value according to the first feature variable and the second feature variable.
The method of claim 2, wherein inputting the first feature variable and the second feature variable to the data analysis model comprises:

acquiring the intimacy between the plurality of second accounts and the first account, wherein the intimacy is generated according to the interaction behavior of each of the second accounts with the first account;

obtaining a third feature variable according to the intimacy and the second feature variable by using the following formula:

$$\upsilon' = f\big((\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_n),\ (\upsilon_1, \upsilon_2, \ldots, \upsilon_i, \ldots, \upsilon_n)\big),$$

where $\upsilon'$ represents the third feature variable, i denotes the i-th second account, $\alpha_i$ is the intimacy between the i-th second account and the first account, $\upsilon_i$ is the second feature variable of the i-th second account, and f denotes a weighted average of the second feature variables, weighted by intimacy, over the first n second accounts ranked by intimacy from high to low; and

inputting the first feature variable and the third feature variable to the data analysis model.
The method according to claim 1, wherein acquiring the first feature variable of the first account according to the behavior data comprises:

acquiring an information gain of a feature in the behavior data, the information gain being used to represent an amount of information included in the behavior data;

determining whether the information gain is within a preset value range;

if the information gain is within the preset value range, constructing a derived variable according to the behavior data, wherein the derived variable is merged or split behavior data;

if the information gain is outside the preset value range, deleting the feature corresponding to the information gain outside the preset value range, and constructing the derived variable according to the remaining features; and

using the derived variable as the first feature variable.
The method according to claim 4, wherein deleting the feature corresponding to the information gain outside the preset value range, and constructing the derived variable according to the remaining features, comprises:

acquiring correlation coefficients of the remaining features after deleting the feature corresponding to the information gain outside the preset value range;

merging features whose correlation coefficient is greater than or equal to a preset coefficient into one merged feature; and

using the merged feature as the derived variable.
The method of claim 2, wherein before inputting the first feature variable and the second feature variable to the data analysis model, the method further comprises:

dividing the behavior data into a plurality of categories;

establishing a sub-model for each of the plurality of categories, wherein each sub-model is configured to output a first sub-value according to the first feature variable and/or the second feature variable, and the first sub-value is used to indicate, under the category corresponding to the sub-model, the probability value that the behavior of the first account does not satisfy the preset condition; and

constructing the plurality of sub-models corresponding to the plurality of categories as the data analysis model.
The method of claim 6 wherein establishing a sub-model for each of the plurality of categories separately comprises:

Establishing a sub-model for each category using the same or different training models; or

Establishing low-level models for the sub-categories under each category by using the same or different training models, and constructing the low-level models corresponding to the plurality of sub-categories under each category as the sub-model.
The method of claim 6, wherein constructing the plurality of sub-models corresponding to the plurality of categories as the data analysis model comprises:

The plurality of sub-models are constructed as the data analysis model in the following manner:

Wherein $P_{\text{total}}$ represents the first value, $i$ denotes the i-th sub-model of the plurality of sub-models, $n$ is the number of the plurality of sub-models, the i-th sub-model has a corresponding coefficient, $P_i'$ is the first sub-value output by the i-th sub-model, and $P_0$ is a constant.
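The combining formula itself does not survive in this text, so the following is only a sketch under an assumed form: a linear combination of the first sub-values, each multiplied by its sub-model's coefficient, plus the constant, clipped to the interval [0, 1]. The linear form, the clipping, and the function name are assumptions for illustration and may differ from the claimed formula.

```python
def combine_submodels(sub_values, coefficients, constant=0.0):
    """Assumed linear combination: constant + sum_i coefficient_i * sub_value_i.

    The result is clipped to [0, 1] so that it can still be read as a
    probability. This form is an assumption, not the formula of the claim.
    """
    if len(sub_values) != len(coefficients):
        raise ValueError("one coefficient is needed per sub-model")
    total = constant + sum(w * p for w, p in zip(coefficients, sub_values))
    return min(1.0, max(0.0, total))
```

With two sub-models that output 0.2 and 0.4 and equal coefficients of 0.5, this assumed combination gives a first value of about 0.3.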
The method of claim 6 wherein dividing the behavioral data into a plurality of categories comprises:

Dividing the behavior data into a plurality of categories according to a type of service included in the behavior data; or

Dividing the data in the behavior data that includes a target object into one category, and dividing the data in the behavior data that does not include the target object into another category.
The method of claim 1, wherein, after recording the first value output by the data analysis model, the method further comprises:

Converting the first value into a third value S by the following formula:

Wherein S is used to indicate the degree to which the behavior of the first account satisfies the preset condition, b represents a reference value, p represents the first value, and st represents a step size.
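The conversion formula is not reproduced in this text. As a hedged illustration only, the sketch below assumes a log-odds scaling of the kind often used for scorecards: S equals the reference value b when p = 0.5 and increases by one step st each time the odds p / (1 - p) halve. The functional form and the default values 600 and 50 are assumptions, not values from the claim.

```python
import math


def probability_to_score(p, base=600.0, step=50.0):
    """Assumed conversion S = base - step * log2(p / (1 - p)).

    `p` is the first value (probability that the behavior does not satisfy
    the preset condition); `base` and `step` stand in for b and st.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return base - step * math.log2(p / (1.0 - p))
```

Under this assumption a lower violation probability maps to a higher score, which matches the description of S as the degree to which the behavior satisfies the preset condition.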
A data processing device comprising:

A collecting unit, configured to collect behavior data of a first account, wherein the behavior data includes Internet-based online behavior data and offline behavior data;

An acquiring unit, configured to acquire, according to the behavior data, a first feature variable of the first account, where the first feature variable is used to represent a behavior feature of the first account;

An input unit, configured to input the first feature variable into a data analysis model, wherein the data analysis model is configured to output a first value according to the first feature variable, and the first value is used to represent the probability that the behavior of the first account does not satisfy a preset condition;

And a recording unit, configured to record the first value output by the data analysis model.
The apparatus of claim 11 wherein said input unit comprises:

a first acquiring subunit, configured to acquire a second feature variable, wherein the second feature variable is used to represent behavior characteristics of a plurality of second accounts that are associated with the first account;

an input subunit, configured to input the first feature variable and the second feature variable into the data analysis model, wherein the data analysis model is further configured to output the first value according to the first feature variable and the second feature variable.
The apparatus of claim 12 wherein said input subunit comprises:

a first obtaining module, configured to acquire the intimacy between the plurality of second accounts and the first account, wherein the intimacy is generated according to the interaction behavior between each of the second accounts and the first account;

a calculation module, configured to acquire a third feature variable according to the intimacy and the second feature variable by using the following formula:

$$\upsilon' = f\big((\alpha_1, \alpha_2, \ldots, \alpha_i, \ldots, \alpha_n),\ (\upsilon_1, \upsilon_2, \ldots, \upsilon_i, \ldots, \upsilon_n)\big),$$

where $\upsilon'$ denotes the third feature variable, $i$ denotes the i-th second account, $\alpha_i$ is the intimacy between the i-th second account and the first account, $\upsilon_i$ is the second feature variable of the i-th second account, and $f$ denotes a weighted average of the second feature variables and the intimacies of the first $n$ second accounts ranked by intimacy from high to low;

And an input module, configured to input the first feature variable and the third feature variable to the data analysis model.
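The calculation module can be illustrated with a short sketch. It assumes that the weighted average uses each account's intimacy as its weight and normalizes by the sum of the selected intimacies; the claim only states that f is a weighted average over the n most intimate second accounts, so the normalization and the function name are assumptions.

```python
def weighted_neighbor_feature(intimacies, features, n):
    """Intimacy-weighted average of the second feature variables of the n
    second accounts with the highest intimacy: sum(a_i * v_i) / sum(a_i)."""
    # Rank second accounts by intimacy, highest first, and keep the top n.
    ranked = sorted(zip(intimacies, features), key=lambda pair: pair[0], reverse=True)[:n]
    weight_sum = sum(a for a, _ in ranked)
    if weight_sum == 0:
        raise ValueError("the selected accounts have zero total intimacy")
    return sum(a * v for a, v in ranked) / weight_sum
```

Given intimacies 0.1, 0.6 and 0.3 with feature values 10, 20 and 40 and n = 2, only the second and third accounts contribute, and the result lies between 20 and 40, closer to the value of the more intimate account.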
The apparatus of claim 11, wherein the acquiring unit comprises:

an obtaining subunit, configured to acquire an information gain of a feature in the behavior data, wherein the information gain is used to represent an amount of information included in the behavior data;

a determining subunit, configured to determine whether the information gain is within a preset value range;

a constructing subunit, configured to construct a derived variable according to the behavior data when the information gain is within the preset value range, wherein the derived variable is merged or split behavior data;

a deleting subunit, configured to, when the information gain is outside the preset value range, delete the feature corresponding to the information gain outside the preset value range and then construct the derived variable according to the remaining features;

a determining subunit, configured to take the derived variable as the first feature variable.
The apparatus of claim 14, wherein the deleting subunit comprises:

a second acquiring module, configured to acquire a correlation coefficient of the remaining features after deleting the feature corresponding to the information gain that is outside the preset value range;

a merging module, configured to combine the features whose correlation coefficient is greater than or equal to a preset coefficient into one merged feature;

a determining module, configured to take the merged feature as the derived variable.
The apparatus of claim 12, wherein the apparatus further comprises:

a dividing unit, configured to divide the behavior data into a plurality of categories before inputting the first feature variable and the second feature variable to a data analysis model;

a first establishing unit, configured to establish a sub-model for each of the plurality of categories, wherein each sub-model is configured to output a first sub-value according to the first feature variable and/or the second feature variable, and the first sub-value is used to indicate the probability that the behavior of the first account does not satisfy the preset condition under the category corresponding to the sub-model;

a second establishing unit, configured to construct a plurality of sub-models corresponding to the plurality of categories as the data analysis model.
The apparatus of claim 16, wherein the first establishing unit comprises:

a first establishing sub-unit for establishing a sub-model for each category using the same or different training models; or

a second establishing subunit, configured to establish low-level models for the sub-categories under each category by using the same or different training models, and to construct the low-level models corresponding to the plurality of sub-categories under each category as the sub-model.
The apparatus according to claim 16, wherein the second establishing unit is further configured to construct the plurality of sub-models into the data analysis model in the following manner:

Wherein $P_{\text{total}}$ represents the first value, $i$ denotes the i-th sub-model of the plurality of sub-models, $n$ is the number of the plurality of sub-models, the i-th sub-model has a corresponding coefficient, $P_i'$ is the first sub-value output by the i-th sub-model, and $P_0$ is a constant.
The apparatus of claim 16, wherein the dividing unit comprises:

a first dividing subunit, configured to divide the behavior data into a plurality of categories according to a type of service included in the behavior data; or

a second dividing subunit, configured to divide the data in the behavior data that includes a target object into one category, and to divide the data in the behavior data that does not include the target object into another category.
The apparatus of claim 11 wherein said apparatus further comprises:

a converting unit, configured to, after the first value output by the data analysis model is recorded, convert the first value into a third value S by the following formula:

Wherein S is used to indicate the degree to which the behavior of the first account satisfies the preset condition, b represents a reference value, p represents the first value, and st represents a step size.
A computer storage medium having stored therein computer executable instructions configured to perform the data processing method of claim 1.