CN113205217B - Data processing method, device, equipment and storage medium - Google Patents


Info

Publication number: CN113205217B
Authority: CN (China)
Prior art keywords: data, training, user, user behavior, loading
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202110494782.9A
Other languages: Chinese (zh)
Other versions: CN113205217A (en)
Inventor: 孙建强 (Sun Jianqiang)
Current assignee: Shanghai Yitan Network Technology Co., Ltd. (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Shanghai Yitan Network Technology Co., Ltd.
Events: application filed by Shanghai Yitan Network Technology Co., Ltd.; priority to CN202110494782.9A; publication of CN113205217A; application granted; publication of CN113205217B; anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals, the resource being the memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03: Credit; Loans; Processing thereof


Abstract

The invention discloses a data processing method, device, equipment and storage medium. Addressing the problem that loading training data into an existing model occupies too much computer memory and takes too long, which makes training inefficient, the method merges the feature data related to the time span and thereby reduces the number of features fed into the model. This greatly reduces the memory occupied by the training data, shortens the time spent loading it, and improves model-training efficiency, all without harming the model's prediction accuracy.

Description

Data processing method, device, equipment and storage medium
Technical Field
The invention belongs to the field of data mining and algorithms in the Internet industry, and in particular relates to a data processing method, device, equipment and storage medium for loading user-model data.
Background
Artificial intelligence technology is developing rapidly. Deep learning, the core technique of modern artificial intelligence, is being applied in more and more fields, such as the Internet and consumer credit, where it performs far beyond traditional methods. Training a deep learning model requires enormous computational resources and computational data, i.e., large-scale data sets.
In data-mining modeling, the features used include a class of time-related user behavior features that describe how a feature trends over time, for example the amounts a user spent in an app over the past 15, 30, and 60 days, which describe the user's spending capacity. The number of such user features grows with the number of time spans of interest.
When the time span is large, the user's behavior inside it cannot be described in detail: if a user spent 100 yuan in the last 15 days, the information "how much the user spent on each day" is lost. To capture the trend of a behavior feature, its value therefore has to be tracked over many time spans. As a result, when model training data is loaded, it occupies too much computer memory and takes too long to load, which slows model training.
In addition, memory consumption is usually reduced by loading training data in real time, but then each load brings in only a small amount of data. Real-time loading also causes frequent interaction between the CPU and the GPU, which lowers the GPU's training efficiency and further slows model training.
Disclosure of Invention
The invention aims to provide a data processing method, device, equipment and storage medium that greatly reduce the memory occupied by training data and shorten the time spent loading it, while keeping the model's prediction accuracy intact.
To solve the above problems, the technical scheme of the invention is as follows:
a data processing method is used for loading user behavior prediction model data, and comprises the following steps:
acquiring historical user behavior data and extracting user behavior data based on a time series;
based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, multiplying the user behavior data by a time-decaying function c(t) and summing, Σ c(t)·a, to characterize the user's summary feature for the behavior, where c(t) = e^(-kt) and k is the decay coefficient;
and collecting the summary features of all users into a data set, and, when the user behavior prediction model needs to load training data, loading the data set into the model as training data.
According to an embodiment of the invention, the decay coefficient k is determined as follows:
A1: let the historical user behavior data be a_history, its interval from the current time be t, and the user's behavior value today be a_today; initialize the decay coefficient k to 0.01;
A2: calculate each user's summary feature valida = Σ e^(-kt)·a_history;
A3: sort the user data in descending order of valida and divide it evenly into n groups;
A4: compute each group's mean valida and mean a_today, obtaining two mean sequences;
A5: calculate the correlation coefficient between the two mean sequences;
A6: if the correlation coefficient is greater than 0.85, take valida as the user's summary feature; if it is less than or equal to 0.85, multiply the decay coefficient k by 1.5 to obtain a new k and repeat steps A2-A6.
According to an embodiment of the present invention, when the user behavior prediction model needs to load training data, loading the data set into the model as training data further comprises:
reading a preset number of data files from the data set to obtain a data file list, the preset number being the maximum number of files the current memory resources can load;
loading the files named in the data file list into memory;
and calling a preset training interface function to read the data files from memory for model training.
According to an embodiment of the present invention, reading a preset number of data files from the data set further comprises:
reading a preset number of data files from the data set by random sampling.
According to an embodiment of the present invention, calling a preset training interface function to read data files from memory for model training further comprises:
determining operation parameters of the training interface function, the operation parameters comprising the number of times each data file participates in training, the number of data files loaded per iteration, and the number of iterations;
and calling the training interface function so that it reads data files from memory according to the operation parameters and performs model training.
A data processing apparatus for loading user behaviour prediction model data, the data processing apparatus comprising:
the data acquisition module is used for acquiring historical data of user behaviors and extracting user behavior data based on time series;
the data processing module is used for multiplying the user behavior data by a time-decaying function c(t), based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, and summing, Σ c(t)·a, to characterize the user's summary feature for the behavior; where c(t) = e^(-kt) and k is the decay coefficient;
and the data loading module is used for collecting the collected characteristics of all users to form a data set, and when the user behavior prediction model needs to load training data, the data set is used as the training data to be loaded into the user behavior prediction model.
According to an embodiment of the invention, the data loading module comprises a data reading unit, a data loading unit and a training processing unit;
the data reading unit is used for reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files which can be loaded by the current memory resource;
the data loading unit is used for loading the file corresponding to the data file list to a memory;
and the training processing unit is used for calling a preset training interface function to read a data file from the memory to perform model training.
A data processing apparatus comprises a memory and a processor; the memory stores computer-readable instructions, and the processor executes them to implement the data processing method of an embodiment of the invention.
A computer-readable medium storing a computer program which, when executed by one or more processors, implements a data processing method in an embodiment of the invention.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects:
1) Addressing the problem that loading training data into an existing model occupies too much computer memory and takes too long, which makes training inefficient, the data processing method in an embodiment of the invention merges the feature data related to the time span and so reduces the number of features fed into the model. This greatly reduces the memory occupied by the training data, shortens data-loading time, and improves model-training efficiency, all without harming the model's prediction accuracy.
2) The data processing method in one embodiment addresses the drawbacks of existing real-time loading: although it reduces memory consumption, each load brings in only a small amount of data, and the frequent CPU-GPU interaction it causes lowers GPU training efficiency and slows model training. By setting the number of data files loaded each time to the maximum number of files the current memory resources can hold, and then having the training interface function read data files directly from memory for model training, the method maximizes the data volume per load and so reduces the number of loads, maximizes the CPU's data-reading efficiency, and reduces the number of CPU-GPU data-reading interactions, so the GPU can train more steadily and model-training efficiency improves.
Drawings
FIG. 1 is a flow diagram of a data processing method in an embodiment of the invention;
FIG. 2 is a graph of the function c(t) = e^(-kt) in an embodiment of the present invention;
FIG. 3 is a flow chart of the determination of the attenuation coefficient k according to one embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
A data processing method, an apparatus, a device and a storage medium according to the present invention will be described in detail with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will become apparent from the following description and from the claims.
Example one
This embodiment provides a data processing method addressing the problem that, when an existing model loads training data, the data occupies too much computer memory and takes too long to load, making model training inefficient.
Specifically, referring to fig. 1, the data processing method is used for loading user behavior prediction model data, and includes the following steps:
s1: acquiring historical data of user behaviors, and extracting user behavior data based on a time sequence;
S2: based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, the user behavior data is multiplied by a time-decaying function c(t) and summed, Σ c(t)·a, to characterize the user's summary feature for the behavior; where c(t) = e^(-kt) and k is the decay coefficient;
s3: and collecting the collected characteristics of all users to form a data set, and loading the data set serving as training data into the user behavior prediction model when the user behavior prediction model needs to load the training data.
The decay coefficient k is determined as follows:
A1: let the historical user behavior data be a_history, its interval from the current time be t, and the user's behavior value today be a_today; initialize the decay coefficient k to 0.01;
A2: calculate each user's summary feature valida = Σ e^(-kt)·a_history;
A3: sort the user data in descending order of valida and divide it evenly into n groups;
A4: compute each group's mean valida and mean a_today, obtaining two mean sequences;
A5: calculate the correlation coefficient between the two mean sequences;
A6: if the correlation coefficient is greater than 0.85, take valida as the user's summary feature; if it is less than or equal to 0.85, multiply the decay coefficient k by 1.5 to obtain a new k and repeat steps A2-A6.
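The text does not specify which correlation coefficient step A5 uses; a standard choice, stated here as an assumption, is the Pearson coefficient between the two group-mean sequences:

```latex
r = \frac{\sum_{i=1}^{n}\left(\overline{valida}_i - \mu_v\right)\left(\overline{a}_{today,i} - \mu_a\right)}
         {\sqrt{\sum_{i=1}^{n}\left(\overline{valida}_i - \mu_v\right)^2}\,
          \sqrt{\sum_{i=1}^{n}\left(\overline{a}_{today,i} - \mu_a\right)^2}}
```

where \overline{valida}_i and \overline{a}_{today,i} are the means of group i from step A4, and \mu_v, \mu_a are the means of those two sequences over all n groups.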
In step S1, historical user behavior data is acquired and time-series-based user behavior data is extracted. Take consumer credit as an example. Consumer credit, a product of financial innovation, is lending that commercial banks extend to natural persons (as opposed to legal persons or other organizations) for personal, non-business consumption. Personal consumer credit refers to commercial-currency credit that a bank or other financial institution provides to an individual consumer in the form of a credit, mortgage, pledge, or guaranteed loan. By the party receiving the loan, consumer credit is divided into buyer credit and seller credit. Buyer credit is lent to consumers purchasing consumer goods, such as personal travel loans, personal comprehensive consumption loans, and personal short-term credit loans. Seller credit is lent, with installment documents as collateral, to enterprises selling consumer goods, such as personal petty loans, personal housing loans, and personal automobile loans. By the form of security, loans are divided into mortgage, pledge, guaranteed, and credit loans.
According to the "Consumer Credit User Behavior Report" recently released by 58 Finance, consumer credit users use credit for daily consumption and for purchasing home appliances and 3C products, with KOLs, acquaintance recommendations, and online advertising as the main channels of awareness. Interest and fees are the main factors users weigh, and low-cost, high-safety products are favored. Loan amounts are generally concentrated between 2,000 and 15,000 yuan, daily interest rates range from 0.021% to 0.05% (2.1 to 5 per ten thousand), and most users repay in installments.
By industry, the top 3 sectors credit users work in are production management/R&D, education and training, and personnel/administration/logistics, accounting for 8%, 7%, and 6% respectively. Users' personal monthly salaries are concentrated between 5,000 and 20,000 yuan, and household monthly incomes between 10,000 and 30,000 yuan. People with stable jobs and moderate incomes thus prefer consumer credit products to meet daily consumption needs in advance, which also indicates that, facing a broad middle-class market, consumer credit has great untapped potential.
For companies providing credit services, it is desirable to identify, through big-data analysis, the user groups that need to apply for credit, in preparation for growing the credit business. Credit companies typically build machine learning models to predict whether credit users will consume or borrow in the near or long term, and these models must be trained to meet the required prediction accuracy. The training data is usually users' historical credit behavior data; as consumer credit spreads and the number of users keeps growing, this historical data increases sharply. If it is not processed properly, model training occupies a large amount of computer memory, loading takes long, and training is inefficient.
To improve training efficiency, this embodiment processes the training data before it is loaded. Because users' historical credit behavior data is strongly correlated with time, the embodiment treats it as a time series, and the time-series-based user behavior data must first be extracted from the users' historical credit behavior data.
In step S2, based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, the user behavior data is multiplied by a time-decaying function c(t) and summed, Σ c(t)·a, to characterize the user's summary feature for the behavior, where c(t) = e^(-kt) and k is the decay coefficient.
For a time-related user behavior a (e.g. the amount consumed on a day), the smaller its interval t from the current time, the greater its effect, and vice versa.
For example, suppose two users each spent 100 yuan over the past 15 days: the first user spent it during the first 5 of those days, the second during the last 5. The second user is then much more likely to spend today than the first.
Based on the principle that the shorter the interval from the current time, the greater the probability that the behavior occurs again, each behavior value a of the user can be multiplied by a time-decaying function c(t) and the products summed, Σ c(t)·a, to characterize the user's summary feature for the behavior.
Here the function c(t) is a decreasing function of the time interval t, taken as the exponential
c(t) = e^(-kt)
where k is the "decay coefficient" to be determined. FIG. 2 plots c(t); as it shows, the larger the value of k, the faster the decay. The user's summary feature can thus be expressed as Σ e^(-kt)·a.
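The summary feature Σ e^(-kt)·a can be sketched in a few lines of Python (the function name summary_feature and the sample values are illustrative, not part of the patent):

```python
import math

def summary_feature(behaviors, k):
    # behaviors: list of (t, a) pairs, where t is the number of days
    # between the behavior and today and a is the behavior value
    # (e.g. the amount spent that day); k is the decay coefficient.
    return sum(math.exp(-k * t) * a for t, a in behaviors)

# Recency principle: 100 yuan spent 2 days ago outweighs 100 yuan
# spent 12 days ago under any positive decay coefficient.
recent = summary_feature([(2, 100.0)], k=0.01)
older = summary_feature([(12, 100.0)], k=0.01)
print(recent > older)  # True
```

A larger k discounts old behavior faster, matching the shape of c(t) in FIG. 2.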
The summary feature Σ e^(-kt)·a represents the user's overall propensity for behavior a. Under random influences, a user with a high summary value will not necessarily be prominent in behavior a today, nor will a user with a low value necessarily do worse; on the whole, however, users with higher summary values tend to stand out on the behavior. The decay coefficient k can therefore be determined as follows:
(1) sort the users by Σ e^(-kt)·a from high to low and divide them into n groups G1, G2, ..., Gn (e.g. 20 to 30 groups, depending on the user volume);
(2) for each group of users, take the mean of Σ e^(-kt)·a and the mean of today's value of behavior a, where a user's summary value is
valida_u = Σ_i e^(-k·t_ui)·a_ui
Note: N_u is the number of consumption days of user u (the index i runs from 1 to N_u), t_ui is the interval between user u's i-th consumption and today, a_ui is the amount of user u's i-th consumption, and a_u is the amount user u consumes today.
(3) choose k so that the correlation coefficient between the two group-mean sequences is maximized; typically a correlation coefficient greater than 0.85 is required.
Specifically, referring to fig. 3, the decay coefficient k is determined as follows:
A1: let the historical user behavior data be a_history, its interval from the current time be t, and the user's behavior value today be a_today; initialize the decay coefficient k to 0.01;
A2: calculate each user's summary feature valida = Σ e^(-kt)·a_history;
A3: sort the user data in descending order of valida and divide it evenly into n groups G1, G2, ..., Gn;
A4: compute each group's mean valida and mean a_today, obtaining the two mean sequences [valida_1, valida_2, ..., valida_n] and [a_today,1, a_today,2, ..., a_today,n];
A5: calculate the correlation coefficient between the two mean sequences;
A6: if the correlation coefficient is greater than 0.85, take valida as the user's summary feature; if it is less than or equal to 0.85, multiply the decay coefficient k by 1.5 to obtain a new k and repeat steps A2-A6.
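Steps A1-A6 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the Pearson choice of correlation, and the max_rounds guard against non-convergence are all assumptions added here.

```python
import math
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length sequences.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_decay_coefficient(users, n_groups=20, threshold=0.85,
                          k=0.01, max_rounds=50):
    # users: list of (history, a_today), where history is a list of
    # (t, a_history) pairs. Grow k by 1.5x (step A6) until the grouped
    # valida means correlate with the grouped a_today means.
    for _ in range(max_rounds):
        valida = [sum(math.exp(-k * t) * a for t, a in h)
                  for h, _ in users]                              # step A2
        order = sorted(range(len(users)),
                       key=lambda i: valida[i], reverse=True)     # step A3
        size = max(1, len(users) // n_groups)
        groups = [order[i:i + size] for i in range(0, len(order), size)]
        v_means = [mean(valida[i] for i in g) for g in groups]    # step A4
        t_means = [mean(users[i][1] for i in g) for g in groups]
        if pearson(v_means, t_means) > threshold:                 # steps A5-A6
            return k, valida
        k *= 1.5
    raise RuntimeError("no k reached the correlation threshold")
```

The group count n is a free parameter chosen from the user volume, as the text notes.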
In step S3, the summary features of all users are collected into a data set, and when the user behavior prediction model needs to load training data, the data set is loaded into the model as training data. Loading the data set as training data further comprises:
reading a preset number of data files from the data set to obtain a data file list, the preset number being the maximum number of files the CPU's current memory resources can load;
loading the files named in the data file list into memory;
and calling a preset training interface function to read the data files from memory for model training.
Specifically, this embodiment takes the maximum number of files the current memory resources can load as the reference, using it as the number of files read in a single data-file read operation.
Under this rule, each read from the training data set fetches a set number of data files, the set number being equal to the maximum number of files the current memory resources can load. In other words, every read fetches as many data files as the current memory resources can hold; if, on some read, fewer data files remain in the training data set than that maximum, all remaining data files are read.
The data files read from the training data set form a data file list, which stores information such as each file's name and storage path.
The files named in the data file list are then loaded into memory: for example, each file is read, by the name and storage path recorded in the list, from the corresponding path, and placed in memory. Loading can proceed in batches across multiple threads, so that a single load brings in more data files.
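A minimal sketch of the sampling-and-loading step, assuming the data set is a flat directory of files. The name load_batch and the max_files parameter are illustrative; deriving max_files from free memory and average file size is left out.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

def load_batch(dataset_dir, max_files):
    # Randomly sample up to max_files file names (the "data file list"),
    # then read the files into memory across multiple threads.
    names = os.listdir(dataset_dir)
    picked = random.sample(names, min(max_files, len(names)))
    paths = [os.path.join(dataset_dir, n) for n in picked]

    def read(path):
        with open(path, "rb") as f:
            return f.read()

    with ThreadPoolExecutor() as pool:
        blobs = list(pool.map(read, paths))
    return dict(zip(picked, blobs))
```

When fewer files remain than max_files, the whole remainder is read, matching the rule above.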
With the above processing, every data load makes full use of the CPU's memory resources, so each load does the most useful work. With the total amount of training data fixed, increasing the number of files per load reduces the number of loads, and hence the number of GPU-CPU data-loading interactions. On one hand, the CPU loads data more efficiently; on the other, the GPU can concentrate on model training, so model-training efficiency improves.
A preset training interface function is then called to read data files from memory and perform model training; that is, after the training data has been loaded into memory, the preset training interface function can be called to read data files from memory for the subsequent model-training process.
During data reading and model training, the operation parameters of the training interface function are determined; they comprise the number of times each data file participates in training, the number of data files loaded per iteration, and the number of iterations. A single piece of training data may be used repeatedly, for example to train the model several times, and a single training pass may load several data files; the concrete loading and training scheme is set flexibly through the parameters of the training interface function.
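The three operation parameters can be seen in a toy driver. Here train_from_memory and train_step are illustrative stand-ins for the preset training interface function, not names from the patent.

```python
import random

def train_from_memory(files, train_step, files_per_iter, iterations,
                      passes_per_file):
    # files: in-memory data files; train_step: the assumed training
    # interface, called once per batch. The operation parameters map to
    # passes_per_file (times each file participates in training),
    # files_per_iter (files loaded per iteration), and iterations.
    queue = [f for f in files for _ in range(passes_per_file)]
    random.shuffle(queue)
    for _ in range(iterations):
        if not queue:
            break
        batch, queue = queue[:files_per_iter], queue[files_per_iter:]
        train_step(batch)
```

Each file ends up in exactly passes_per_file batches, and no batch exceeds files_per_iter files.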
Example two
The present embodiment provides a data processing apparatus, configured to load user behavior prediction model data, and referring to fig. 4, the data processing apparatus includes:
the data acquisition module 1 is used for acquiring historical data of user behaviors and extracting user behavior data based on time series;
the data processing module 2 is used for multiplying the user behavior data by a function c (t) attenuated along with time based on the principle that the smaller the time interval between the last occurrence of the user behavior and the current time interval is, the larger the probability of the user behavior occurring again is, and then summing sigma c (t) a, and is used for representing the summary feature of the user on the behavior; wherein c (t) ═ e-ktK is the attenuation coefficient;
and the data loading module 3 is used for collecting the collected characteristics of all the users to form a data set, and loading the data set serving as training data into the user behavior prediction model when the user behavior prediction model needs to load the training data.
The data loading module 3 includes a data reading unit, a data loading unit, and a training processing unit. The data reading unit is used for reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files that can be loaded by the current memory resource. And the data loading unit is used for loading the file corresponding to the data file list to the memory. And the training processing unit is used for calling a preset training interface function to read the data file from the memory to perform model training.
The functions and implementation methods of the data acquisition module 1, the data processing module 2, and the data loading module 3 are all as described in the first embodiment, and are not described herein again.
Example three
The second embodiment of the present invention describes the data processing apparatus in detail from the perspective of the modular functional entity, and the following describes the data processing apparatus in detail from the perspective of hardware processing.
Referring to fig. 5, the data processing apparatus 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the data processing apparatus 500.
Further, the processor 510 may be arranged to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the data processing device 500.
The data processing apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Vista, and the like.
Those skilled in the art will appreciate that the data processing device configuration shown in fig. 5 is not meant to be limiting of the data processing device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium. The computer readable storage medium has instructions stored therein, which when executed on a computer, cause the computer to perform the steps of the data processing method in the first embodiment.
The modules in the second embodiment, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of software stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the apparatus and the device described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to these embodiments. Any changes made to the present invention remain within its scope, provided that they fall within the scope of the claims of the present invention and their equivalents.

Claims (8)

1. A data processing method is used for loading user behavior prediction model data, and is characterized by comprising the following steps:
acquiring historical data of user behaviors, and extracting user behavior data based on a time sequence;
based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the higher the probability that the behavior occurs again, multiplying the user behavior data by a time-decay function c(t) and then summing to obtain Σ c(t)·a, which represents the summary feature of the user for that behavior; wherein a is the user behavior data, c(t) = e^(-kt), and k is the attenuation coefficient;
collecting the collected characteristics of each user to form a data set, and loading the data set serving as training data into a user behavior prediction model when the user behavior prediction model needs to load the training data;
the determination of the attenuation coefficient k comprises the following steps:
a1: let historical user behavior data be ahistoryT is the interval from the current time, and a is the user behavior value of the current timetodayThe initial value of the attenuation coefficient k is 0.01;
a2: calculating the summary feature valida ═ Σ e of each user-ktahistory
A3: arranging user data according to the descending order of valida, and uniformly dividing the user data into n groups;
a4: computing the valida mean value sum a of each group of userstodayAveraging to obtain two average value sequences;
a5: calculating a correlation coefficient between the two mean value sequences;
a6: if the correlation coefficient is larger than 0.85, taking valida as the summary feature of the corresponding user; if the correlation coefficient is less than or equal to 0.85, multiplying the attenuation coefficient k by 1.5 to obtain a new attenuation coefficient k, and repeating the steps A2-A6.
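For illustration only, steps A1-A6 can be sketched in Python as follows; the per-user history format (a list of (t, a) pairs), the choice of Pearson correlation, and the group count are assumptions not fixed by the claim:

```python
import math
import statistics

def summary_features(histories, k):
    """A2: valida = sum of e^(-k*t) * a_history over each user's events."""
    return [sum(math.exp(-k * t) * a for t, a in h) for h in histories]

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fit_decay_coefficient(histories, today, n_groups=5,
                          k0=0.01, threshold=0.85, max_rounds=50):
    """Steps A1-A6: grow k by 1.5x until the group-mean valida sequence
    correlates with the group-mean a_today sequence above the threshold."""
    k = k0                                                   # A1
    for _ in range(max_rounds):
        valida = summary_features(histories, k)              # A2
        order = sorted(range(len(valida)),
                       key=lambda i: valida[i], reverse=True)  # A3
        size = len(order) // n_groups
        groups = [order[i * size:(i + 1) * size] for i in range(n_groups)]
        v_means = [statistics.fmean(valida[i] for i in g) for g in groups]  # A4
        t_means = [statistics.fmean(today[i] for i in g) for g in groups]
        if pearson(v_means, t_means) > threshold:            # A5/A6: accept
            return k, valida
        k *= 1.5                                             # A6: retry with larger k
    return k, summary_features(histories, k)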
2. The data processing method of claim 1, wherein loading the data set as training data into the user behavior prediction model when the user behavior prediction model requires loading of training data further comprises:
reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files which can be loaded by the current memory resource;
loading the file corresponding to the data file list to a memory;
and calling a preset training interface function to read the data file from the memory for model training.
3. The data processing method of claim 2, wherein reading a preset number of data files from a data set further comprises:
and reading a preset number of data files from the data set in a random sampling mode.
4. The data processing method of claim 2, wherein said calling a preset training interface function to read a data file from said memory for model training further comprises:
determining operation parameters of a training interface function, wherein the operation parameters comprise the times of each data file participating in training, the number of data files loaded in each iteration and the number of iterations;
and calling the training interface function to enable the training interface function to read a data file from the memory according to the operation parameters to perform model training.
5. A data processing apparatus for loading user behavior prediction model data, the data processing apparatus comprising:
the data acquisition module is used for acquiring historical data of user behaviors and extracting user behavior data based on time series;
the data processing module is used for multiplying the user behavior data by a time-decay function c(t), based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the higher the probability that the behavior occurs again, and then summing to obtain Σ c(t)·a, which represents the summary feature of the user for that behavior; wherein a is the user behavior data, c(t) = e^(-kt), and k is the attenuation coefficient, the determination of the attenuation coefficient k comprising the following steps:
a1: let historical user behavior data be ahistoryThe distance from the current time interval is t, and the user behavior value of the current time is atodayThe initial value of the attenuation coefficient k is 0.01;
a2: calculating the summary feature valida ═ Σ e of each user-ktahistory
A3: arranging user data according to the descending order of valida, and uniformly dividing the user data into n groups;
a4: computing the valida mean sum a of each group of userstodayAveraging to obtain two average value sequences;
a5: calculating a correlation coefficient between the two mean value sequences;
a6: if the correlation coefficient is larger than 0.85, taking valida as the summary feature of the corresponding user; if the correlation coefficient is less than or equal to 0.85, multiplying the attenuation coefficient k by 1.5 to obtain a new attenuation coefficient k, and repeating the steps A2-A6;
and the data loading module is used for collecting the collected characteristics of all users to form a data set, and when the user behavior prediction model needs to load training data, the data set is used as the training data to be loaded into the user behavior prediction model.
6. The data processing apparatus of claim 5, wherein the data loading module comprises a data reading unit, a data loading unit, and a training processing unit;
the data reading unit is used for reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files which can be loaded by the current memory resource;
the data loading unit is used for loading the file corresponding to the data file list to a memory;
and the training processing unit is used for calling a preset training interface function to read a data file from the memory to perform model training.
7. A data processing apparatus comprising a memory having computer readable instructions stored therein and a processor which when executing the computer readable instructions implements a data processing method as claimed in any one of claims 1 to 4.
8. A computer-readable medium storing a computer program, characterized in that the computer program, when executed by one or more processors, implements a data processing method according to any one of claims 1 to 4.
CN202110494782.9A 2021-05-07 2021-05-07 Data processing method, device, equipment and storage medium Active CN113205217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110494782.9A CN113205217B (en) 2021-05-07 2021-05-07 Data processing method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113205217A CN113205217A (en) 2021-08-03
CN113205217B true CN113205217B (en) 2022-07-15

Family

ID=77029027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110494782.9A Active CN113205217B (en) 2021-05-07 2021-05-07 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113205217B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793465A (en) * 2013-12-20 2014-05-14 武汉理工大学 Cloud computing based real-time mass user behavior analyzing method and system
CN106997360A (en) * 2016-01-25 2017-08-01 阿里巴巴集团控股有限公司 The treating method and apparatus of user behavior data
CN110516817A (en) * 2019-09-03 2019-11-29 北京华捷艾米科技有限公司 A kind of model training data load method and device
WO2020191282A2 (en) * 2020-03-20 2020-09-24 Futurewei Technologies, Inc. System and method for multi-task lifelong learning on personal device with improved user experience

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10496927B2 (en) * 2014-05-23 2019-12-03 DataRobot, Inc. Systems for time-series predictive data analytics, and related methods and apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gradient boosting regression tree housing recommendation algorithm based on a time decay factor; Feng Chi, Ye Hua; 《信息技术与信息化》 (Information Technology and Informatization); 2019-03-25; pp. 48-51 *

Also Published As

Publication number Publication date
CN113205217A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
AU2022211812A1 (en) Method and system of dynamic model selection for time series forecasting
CN106097044A (en) A kind of data recommendation processing method and device
CN110264342A (en) A kind of business audit method and device based on machine learning
US20210125272A1 (en) Using Inferred Attributes as an Insight into Banking Customer Behavior
Brown et al. Life-cycle costing: A practical guide for energy managers
Chorro et al. Discriminating between GARCH models for option pricing by their ability to compute accurate VIX measures
CN113205217B (en) Data processing method, device, equipment and storage medium
CN107844874A (en) Enterprise operation problem analysis system and its method
Li et al. Econometric analysis of disequilibrium relations between internet finance and real economy in China
CN104750877A (en) Statistical analysis method used for cloud computing resource pricing
Soulas et al. Online machine learning algorithms for currency exchange prediction
CN113515528A (en) Asset screening system and method based on big data and ORACLE mass data
CN112465611A (en) Method and device for pushing article information to user and electronic equipment
CN112418670A (en) Case allocation method, device, equipment and medium
CN112836742A (en) System resource adjusting method, device and equipment
CN113450141A (en) Intelligent prediction method and device based on electricity selling quantity characteristics of large-power customer groups
CN111639910A (en) Standing book generation method, device, equipment and storage medium
Glau et al. A unified view of LIBOR models
CN109474703A (en) Personalized product combines method for pushing, apparatus and system
Zhang et al. Research on P2P Default Risk Prediction Based on Logistic Regression
CN111967966B (en) Automatic wake-up method and system for sleep clients of mobile phone banks
Sreejanya et al. Big Data Analysis Using Financial Risk Management
CN112347371A (en) Resource returning and ratio increasing method and device based on social text information and electronic equipment
CN116204724A (en) Financial product recommendation method and device
CN115564561A (en) Enterprise data processing method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant