CN113205217B - Data processing method, device, equipment and storage medium - Google Patents


Info

Publication number: CN113205217B
Authority: CN (China)
Prior art keywords: data, training, user, user behavior, loading
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202110494782.9A
Other languages: Chinese (zh)
Other versions: CN113205217A (en)
Inventor: 孙建强 (Sun Jianqiang)
Current assignee: Shanghai Yitan Network Technology Co., Ltd. (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Shanghai Yitan Network Technology Co., Ltd.
Events: application filed by Shanghai Yitan Network Technology Co., Ltd.; priority to CN202110494782.9A; publication of CN113205217A; application granted; publication of CN113205217B; anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals, the resource being the memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03: Credit; Loans; Processing thereof


Abstract

The invention discloses a data processing method, device, equipment and storage medium. Addressing the problem that loading training data into an existing model occupies too much computer memory and takes too long, which makes training inefficient, the method merges the feature data related to the time span and thereby reduces the number of features fed into the model. This greatly reduces the memory occupied by the training data, shortens the time spent loading it, and improves model-training efficiency, all without harming the model's prediction accuracy.

Description

Data processing method, device, equipment and storage medium
Technical Field
The invention belongs to the field of data mining and algorithms in the Internet industry, and in particular relates to a data processing method, device, equipment and storage medium for loading user-model data.
Background
Artificial intelligence technology is developing rapidly. Deep learning, the core technique of modern artificial intelligence, is being applied in more and more fields, such as the Internet and consumer credit, where it performs far beyond traditional methods. Training a deep learning model requires enormous computational resources and computational data, i.e., large-scale data sets.
In data-mining modeling, the features used include a class of time-related user behavior features that describe how a feature trends over time, for example the amounts a user spent in an app over the past 15, 30, and 60 days, which describe the user's spending capacity. The number of such user features grows with the number of time spans of interest.
When the time span is large, the user's behavior inside it cannot be described in detail: if a user spent 100 yuan in the last 15 days, the information "how much the user spent on each day" is lost. To capture the trend of a behavior feature, its value therefore has to be tracked over many time spans. As a result, when model training data is loaded, it occupies too much computer memory and takes too long to load, which slows model training.
In addition, memory consumption is usually reduced by loading training data in real time, but then each load brings in only a small amount of data. Real-time loading also causes frequent interaction between the CPU and the GPU, which lowers the GPU's training efficiency and further slows model training.
Disclosure of Invention
The invention aims to provide a data processing method, device, equipment and storage medium that greatly reduce the memory occupied by training data and shorten the time spent loading it, while keeping the model's prediction accuracy intact.
To solve the above problems, the technical scheme of the invention is as follows:
a data processing method is used for loading user behavior prediction model data, and comprises the following steps:
acquiring historical user behavior data and extracting user behavior data based on a time series;
based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, multiplying the user behavior data by a time-decaying function c(t) and summing, Σ c(t)·a, to characterize the user's summary feature for the behavior, where c(t) = e^(-kt) and k is the decay coefficient;
and collecting the summary features of all users into a data set, and, when the user behavior prediction model needs to load training data, loading the data set into the model as training data.
According to an embodiment of the invention, the decay coefficient k is determined as follows:
A1: let the historical user behavior data be a_history, its interval from the current time be t, and the user's behavior value today be a_today; initialize the decay coefficient k to 0.01;
A2: calculate each user's summary feature valida = Σ e^(-kt)·a_history;
A3: sort the user data in descending order of valida and divide it evenly into n groups;
A4: compute each group's mean valida and mean a_today, obtaining two mean sequences;
A5: calculate the correlation coefficient between the two mean sequences;
A6: if the correlation coefficient is greater than 0.85, take valida as the user's summary feature; if it is less than or equal to 0.85, multiply the decay coefficient k by 1.5 to obtain a new k and repeat steps A2-A6.
According to an embodiment of the present invention, when the user behavior prediction model needs to load training data, loading the data set into the model as training data further comprises:
reading a preset number of data files from the data set to obtain a data file list, the preset number being the maximum number of files the current memory resources can load;
loading the files named in the data file list into memory;
and calling a preset training interface function to read the data files from memory for model training.
According to an embodiment of the present invention, reading a preset number of data files from the data set further comprises:
reading a preset number of data files from the data set by random sampling.
According to an embodiment of the present invention, calling a preset training interface function to read data files from memory for model training further comprises:
determining operation parameters of the training interface function, the operation parameters comprising the number of times each data file participates in training, the number of data files loaded per iteration, and the number of iterations;
and calling the training interface function so that it reads data files from memory according to the operation parameters and performs model training.
A data processing apparatus for loading user behaviour prediction model data, the data processing apparatus comprising:
the data acquisition module is used for acquiring historical data of user behaviors and extracting user behavior data based on time series;
the data processing module is used for multiplying the user behavior data by a time-decaying function c(t), based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, and summing, Σ c(t)·a, to characterize the user's summary feature for the behavior; where c(t) = e^(-kt) and k is the decay coefficient;
and the data loading module is used for collecting the collected characteristics of all users to form a data set, and when the user behavior prediction model needs to load training data, the data set is used as the training data to be loaded into the user behavior prediction model.
According to an embodiment of the invention, the data loading module comprises a data reading unit, a data loading unit and a training processing unit;
the data reading unit is used for reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files which can be loaded by the current memory resource;
the data loading unit is used for loading the file corresponding to the data file list to a memory;
and the training processing unit is used for calling a preset training interface function to read a data file from the memory to perform model training.
A data processing apparatus comprises a memory and a processor; the memory stores computer-readable instructions, and the processor executes them to implement the data processing method of an embodiment of the invention.
A computer-readable medium storing a computer program which, when executed by one or more processors, implements a data processing method in an embodiment of the invention.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects:
1) Addressing the problem that loading training data into an existing model occupies too much computer memory and takes too long, which makes training inefficient, the data processing method in an embodiment of the invention merges the feature data related to the time span and so reduces the number of features fed into the model. This greatly reduces the memory occupied by the training data, shortens data-loading time, and improves model-training efficiency, all without harming the model's prediction accuracy.
2) The data processing method in one embodiment addresses the drawbacks of existing real-time loading: although it reduces memory consumption, each load brings in only a small amount of data, and the frequent CPU-GPU interaction it causes lowers GPU training efficiency and slows model training. By setting the number of data files loaded each time to the maximum number of files the current memory resources can hold, and then having the training interface function read data files directly from memory for model training, the method maximizes the data volume per load and so reduces the number of loads, maximizes the CPU's data-reading efficiency, and reduces the number of CPU-GPU data-reading interactions, so the GPU can train more steadily and model-training efficiency improves.
Drawings
FIG. 1 is a flow diagram of a data processing method in an embodiment of the invention;
FIG. 2 is a graph of the function c(t) = e^(-kt) in an embodiment of the present invention;
FIG. 3 is a flow chart of the determination of the attenuation coefficient k according to one embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
A data processing method, an apparatus, a device and a storage medium according to the present invention will be described in detail with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will become apparent from the following description and from the claims.
Example one
This embodiment provides a data processing method addressing the problem that, when an existing model loads training data, the data occupies too much computer memory and takes too long to load, making model training inefficient.
Specifically, referring to fig. 1, the data processing method is used for loading user behavior prediction model data, and includes the following steps:
s1: acquiring historical data of user behaviors, and extracting user behavior data based on a time sequence;
S2: based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, the user behavior data is multiplied by a time-decaying function c(t) and summed, Σ c(t)·a, to characterize the user's summary feature for the behavior; where c(t) = e^(-kt) and k is the decay coefficient;
s3: and collecting the collected characteristics of all users to form a data set, and loading the data set serving as training data into the user behavior prediction model when the user behavior prediction model needs to load the training data.
The decay coefficient k is determined as follows:
A1: let the historical user behavior data be a_history, its interval from the current time be t, and the user's behavior value today be a_today; initialize the decay coefficient k to 0.01;
A2: calculate each user's summary feature valida = Σ e^(-kt)·a_history;
A3: sort the user data in descending order of valida and divide it evenly into n groups;
A4: compute each group's mean valida and mean a_today, obtaining two mean sequences;
A5: calculate the correlation coefficient between the two mean sequences;
A6: if the correlation coefficient is greater than 0.85, take valida as the user's summary feature; if it is less than or equal to 0.85, multiply the decay coefficient k by 1.5 to obtain a new k and repeat steps A2-A6.
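The text does not specify which correlation coefficient step A5 uses; a standard choice, stated here as an assumption, is the Pearson coefficient between the two group-mean sequences:

```latex
r = \frac{\sum_{i=1}^{n}\left(\overline{valida}_i - \mu_v\right)\left(\overline{a}_{today,i} - \mu_a\right)}
         {\sqrt{\sum_{i=1}^{n}\left(\overline{valida}_i - \mu_v\right)^2}\,
          \sqrt{\sum_{i=1}^{n}\left(\overline{a}_{today,i} - \mu_a\right)^2}}
```

where \overline{valida}_i and \overline{a}_{today,i} are the means of group i from step A4, and \mu_v, \mu_a are the means of those two sequences over all n groups.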
In step S1, historical user behavior data is acquired and time-series-based user behavior data is extracted. Take consumer credit as an example. Consumer credit, a product of financial innovation, is lending that commercial banks extend to natural persons (as opposed to legal persons or other organizations) for personal, non-business consumption. Personal consumer credit refers to commercial-currency credit that a bank or other financial institution provides to an individual consumer in the form of a credit, mortgage, pledge, or guaranteed loan. By the party receiving the loan, consumer credit is divided into buyer credit and seller credit. Buyer credit is lent to consumers purchasing consumer goods, such as personal travel loans, personal comprehensive consumption loans, and personal short-term credit loans. Seller credit is lent, with installment documents as collateral, to enterprises selling consumer goods, such as personal petty loans, personal housing loans, and personal automobile loans. By the form of security, loans are divided into mortgage, pledge, guaranteed, and credit loans.
According to the "Consumer Credit User Behavior Report" recently released by 58 Finance, consumer credit users use credit for daily consumption and for purchasing home appliances and 3C products, with KOLs, acquaintance recommendations, and online advertising as the main channels of awareness. Interest and fees are the main factors users weigh, and low-cost, high-safety products are favored. Loan amounts are generally concentrated between 2,000 and 15,000 yuan, daily interest rates range from 0.021% to 0.05% (2.1 to 5 per ten thousand), and most users repay in installments.
By industry, the top 3 sectors credit users work in are production management/R&D, education and training, and personnel/administration/logistics, accounting for 8%, 7%, and 6% respectively. Users' personal monthly salaries are concentrated between 5,000 and 20,000 yuan, and household monthly incomes between 10,000 and 30,000 yuan. People with stable jobs and moderate incomes thus prefer consumer credit products to meet daily consumption needs in advance, which also indicates that, facing a broad middle-class market, consumer credit has great untapped potential.
For companies providing credit services, it is desirable to identify, through big-data analysis, the user groups that need to apply for credit, in preparation for growing the credit business. Credit companies typically build machine learning models to predict whether credit users will consume or borrow in the near or long term, and these models must be trained to meet the required prediction accuracy. The training data is usually users' historical credit behavior data; as consumer credit spreads and the number of users keeps growing, this historical data increases sharply. If it is not processed properly, model training occupies a large amount of computer memory, loading takes long, and training is inefficient.
To improve training efficiency, this embodiment processes the training data before it is loaded. Because users' historical credit behavior data is strongly correlated with time, the embodiment treats it as a time series, and the time-series-based user behavior data must first be extracted from the users' historical credit behavior data.
In step S2, based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the greater the probability that the behavior occurs again, the user behavior data is multiplied by a time-decaying function c(t) and summed, Σ c(t)·a, to characterize the user's summary feature for the behavior, where c(t) = e^(-kt) and k is the decay coefficient.
For a time-related user behavior a (e.g. the amount consumed on a day), the smaller its interval t from the current time, the greater its effect, and vice versa.
For example, suppose two users each spent 100 yuan over the past 15 days: the first user spent it during the first 5 of those days, the second during the last 5. The second user is then much more likely to spend today than the first.
Based on the principle that the shorter the interval from the current time, the greater the probability that the behavior occurs again, each behavior value a of the user can be multiplied by a time-decaying function c(t) and the products summed, Σ c(t)·a, to characterize the user's summary feature for the behavior.
Here the function c(t) is a decreasing function of the time interval t, taken as the exponential
c(t) = e^(-kt)
where k is the "decay coefficient" to be determined. FIG. 2 plots c(t); as it shows, the larger the value of k, the faster the decay. The user's summary feature can thus be expressed as Σ e^(-kt)·a.
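The summary feature Σ e^(-kt)·a can be sketched in a few lines of Python (the function name summary_feature and the sample values are illustrative, not part of the patent):

```python
import math

def summary_feature(behaviors, k):
    # behaviors: list of (t, a) pairs, where t is the number of days
    # between the behavior and today and a is the behavior value
    # (e.g. the amount spent that day); k is the decay coefficient.
    return sum(math.exp(-k * t) * a for t, a in behaviors)

# Recency principle: 100 yuan spent 2 days ago outweighs 100 yuan
# spent 12 days ago under any positive decay coefficient.
recent = summary_feature([(2, 100.0)], k=0.01)
older = summary_feature([(12, 100.0)], k=0.01)
print(recent > older)  # True
```

A larger k discounts old behavior faster, matching the shape of c(t) in FIG. 2.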
The summary feature Σ e^(-kt)·a represents the user's overall propensity for behavior a. Under random influences, a user with a high summary value will not necessarily be prominent in behavior a today, nor will a user with a low value necessarily do worse; on the whole, however, users with higher summary values tend to stand out on the behavior. The decay coefficient k can therefore be determined as follows:
(1) sort the users by Σ e^(-kt)·a from high to low and divide them into n groups G1, G2, ..., Gn (e.g. 20 to 30 groups, depending on the user volume);
(2) for each group of users, take the mean of Σ e^(-kt)·a and the mean of today's value of behavior a, where a user's summary value is
valida_u = Σ_i e^(-k·t_ui)·a_ui
Note: N_u is the number of consumption days of user u (the index i runs from 1 to N_u), t_ui is the interval between user u's i-th consumption and today, a_ui is the amount of user u's i-th consumption, and a_u is the amount user u consumes today.
(3) choose k so that the correlation coefficient between the two group-mean sequences is maximized; typically a correlation coefficient greater than 0.85 is required.
Specifically, referring to fig. 3, the decay coefficient k is determined as follows:
A1: let the historical user behavior data be a_history, its interval from the current time be t, and the user's behavior value today be a_today; initialize the decay coefficient k to 0.01;
A2: calculate each user's summary feature valida = Σ e^(-kt)·a_history;
A3: sort the user data in descending order of valida and divide it evenly into n groups G1, G2, ..., Gn;
A4: compute each group's mean valida and mean a_today, obtaining the two mean sequences [valida_1, valida_2, ..., valida_n] and [a_today,1, a_today,2, ..., a_today,n];
A5: calculate the correlation coefficient between the two mean sequences;
A6: if the correlation coefficient is greater than 0.85, take valida as the user's summary feature; if it is less than or equal to 0.85, multiply the decay coefficient k by 1.5 to obtain a new k and repeat steps A2-A6.
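Steps A1-A6 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the Pearson choice of correlation, and the max_rounds guard against non-convergence are all assumptions added here.

```python
import math
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length sequences.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_decay_coefficient(users, n_groups=20, threshold=0.85,
                          k=0.01, max_rounds=50):
    # users: list of (history, a_today), where history is a list of
    # (t, a_history) pairs. Grow k by 1.5x (step A6) until the grouped
    # valida means correlate with the grouped a_today means.
    for _ in range(max_rounds):
        valida = [sum(math.exp(-k * t) * a for t, a in h)
                  for h, _ in users]                              # step A2
        order = sorted(range(len(users)),
                       key=lambda i: valida[i], reverse=True)     # step A3
        size = max(1, len(users) // n_groups)
        groups = [order[i:i + size] for i in range(0, len(order), size)]
        v_means = [mean(valida[i] for i in g) for g in groups]    # step A4
        t_means = [mean(users[i][1] for i in g) for g in groups]
        if pearson(v_means, t_means) > threshold:                 # steps A5-A6
            return k, valida
        k *= 1.5
    raise RuntimeError("no k reached the correlation threshold")
```

The group count n is a free parameter chosen from the user volume, as the text notes.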
In step S3, the summary features of all users are collected into a data set, and when the user behavior prediction model needs to load training data, the data set is loaded into the model as training data. Loading the data set as training data further comprises:
reading a preset number of data files from the data set to obtain a data file list, the preset number being the maximum number of files the CPU's current memory resources can load;
loading the files named in the data file list into memory;
and calling a preset training interface function to read the data files from memory for model training.
Specifically, this embodiment takes the maximum number of files the current memory resources can load as the reference, using it as the number of files read in a single data-file read operation.
Under this rule, each read from the training data set fetches a set number of data files, the set number being equal to the maximum number of files the current memory resources can load. In other words, every read fetches as many data files as the current memory resources can hold; if, on some read, fewer data files remain in the training data set than that maximum, all remaining data files are read.
The data files read from the training data set form a data file list, which stores information such as each file's name and storage path.
The files named in the data file list are then loaded into memory: for example, each file is read, by the name and storage path recorded in the list, from the corresponding path, and placed in memory. Loading can proceed in batches across multiple threads, so that a single load brings in more data files.
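A minimal sketch of the sampling-and-loading step, assuming the data set is a flat directory of files. The name load_batch and the max_files parameter are illustrative; deriving max_files from free memory and average file size is left out.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

def load_batch(dataset_dir, max_files):
    # Randomly sample up to max_files file names (the "data file list"),
    # then read the files into memory across multiple threads.
    names = os.listdir(dataset_dir)
    picked = random.sample(names, min(max_files, len(names)))
    paths = [os.path.join(dataset_dir, n) for n in picked]

    def read(path):
        with open(path, "rb") as f:
            return f.read()

    with ThreadPoolExecutor() as pool:
        blobs = list(pool.map(read, paths))
    return dict(zip(picked, blobs))
```

When fewer files remain than max_files, the whole remainder is read, matching the rule above.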
With the above processing, every data load makes full use of the CPU's memory resources, so each load does the most useful work. With the total amount of training data fixed, increasing the number of files per load reduces the number of loads, and hence the number of GPU-CPU data-loading interactions. On one hand, the CPU loads data more efficiently; on the other, the GPU can concentrate on model training, so model-training efficiency improves.
A preset training interface function is then called to read data files from memory and perform model training; that is, after the training data has been loaded into memory, the preset training interface function can be called to read data files from memory for the subsequent model-training process.
During data reading and model training, the operation parameters of the training interface function are determined; they comprise the number of times each data file participates in training, the number of data files loaded per iteration, and the number of iterations. A single piece of training data may be used repeatedly, for example to train the model several times, and a single training pass may load several data files; the concrete loading and training scheme is set flexibly through the parameters of the training interface function.
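The three operation parameters can be seen in a toy driver. Here train_from_memory and train_step are illustrative stand-ins for the preset training interface function, not names from the patent.

```python
import random

def train_from_memory(files, train_step, files_per_iter, iterations,
                      passes_per_file):
    # files: in-memory data files; train_step: the assumed training
    # interface, called once per batch. The operation parameters map to
    # passes_per_file (times each file participates in training),
    # files_per_iter (files loaded per iteration), and iterations.
    queue = [f for f in files for _ in range(passes_per_file)]
    random.shuffle(queue)
    for _ in range(iterations):
        if not queue:
            break
        batch, queue = queue[:files_per_iter], queue[files_per_iter:]
        train_step(batch)
```

Each file ends up in exactly passes_per_file batches, and no batch exceeds files_per_iter files.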
Example two
The present embodiment provides a data processing apparatus, configured to load user behavior prediction model data, and referring to fig. 4, the data processing apparatus includes:
the data acquisition module 1 is used for acquiring historical data of user behaviors and extracting user behavior data based on time series;
the data processing module 2 is used for multiplying the user behavior data by a function c (t) attenuated along with time based on the principle that the smaller the time interval between the last occurrence of the user behavior and the current time interval is, the larger the probability of the user behavior occurring again is, and then summing sigma c (t) a, and is used for representing the summary feature of the user on the behavior; wherein c (t) ═ e-ktK is the attenuation coefficient;
and the data loading module 3 is used for collecting the collected characteristics of all the users to form a data set, and loading the data set serving as training data into the user behavior prediction model when the user behavior prediction model needs to load the training data.
The data loading module 3 includes a data reading unit, a data loading unit, and a training processing unit. The data reading unit is used for reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files that can be loaded by the current memory resource. And the data loading unit is used for loading the file corresponding to the data file list to the memory. And the training processing unit is used for calling a preset training interface function to read the data file from the memory to perform model training.
The functions and implementation methods of the data acquisition module 1, the data processing module 2, and the data loading module 3 are all as described in the first embodiment, and are not described herein again.
Example three
The second embodiment of the present invention describes the data processing apparatus in detail from the perspective of the modular functional entity, and the following describes the data processing apparatus in detail from the perspective of hardware processing.
Referring to fig. 5, the data processing apparatus 500 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the data processing apparatus 500.
Further, the processor 510 may be arranged to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the data processing device 500.
The data processing apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Vista, and the like.
Those skilled in the art will appreciate that the data processing device configuration shown in fig. 5 is not meant to be limiting of the data processing device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium. The computer readable storage medium has instructions stored therein, which when executed on a computer, cause the computer to perform the steps of the data processing method in the first embodiment.
The modules in the second embodiment, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of software stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the apparatus and the device described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to these embodiments. Any changes made to the present invention remain within its scope, provided that they fall within the scope of the claims of the present invention and their equivalents.

Claims (8)

1. A data processing method is used for loading user behavior prediction model data, and is characterized by comprising the following steps:
acquiring historical data of user behaviors, and extracting user behavior data based on a time sequence;
based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the higher the probability that the behavior occurs again, multiplying the user behavior data by a time-decay function c(t) and then summing to obtain Σ c(t)·a, which represents the summary feature of the user for that behavior; wherein a is the user behavior data, c(t) = e^(-kt), and k is the attenuation coefficient;
collecting the collected characteristics of each user to form a data set, and loading the data set serving as training data into a user behavior prediction model when the user behavior prediction model needs to load the training data;
the determination of the attenuation coefficient k comprises the following steps:
a1: let historical user behavior data be ahistoryT is the interval from the current time, and a is the user behavior value of the current timetodayThe initial value of the attenuation coefficient k is 0.01;
a2: calculating the summary feature valida ═ Σ e of each user-ktahistory
A3: arranging user data according to the descending order of valida, and uniformly dividing the user data into n groups;
a4: computing the valida mean value sum a of each group of userstodayAveraging to obtain two average value sequences;
a5: calculating a correlation coefficient between the two mean value sequences;
a6: if the correlation coefficient is larger than 0.85, taking valida as the summary feature of the corresponding user; if the correlation coefficient is less than or equal to 0.85, multiplying the attenuation coefficient k by 1.5 to obtain a new attenuation coefficient k, and repeating the steps A2-A6.
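For illustration only, steps A1-A6 can be sketched in Python as follows; the per-user history format (a list of (t, a) pairs), the choice of Pearson correlation, and the group count are assumptions not fixed by the claim:

```python
import math
import statistics

def summary_features(histories, k):
    """A2: valida = sum of e^(-k*t) * a_history over each user's events."""
    return [sum(math.exp(-k * t) * a for t, a in h) for h in histories]

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fit_decay_coefficient(histories, today, n_groups=5,
                          k0=0.01, threshold=0.85, max_rounds=50):
    """Steps A1-A6: grow k by 1.5x until the group-mean valida sequence
    correlates with the group-mean a_today sequence above the threshold."""
    k = k0                                                   # A1
    for _ in range(max_rounds):
        valida = summary_features(histories, k)              # A2
        order = sorted(range(len(valida)),
                       key=lambda i: valida[i], reverse=True)  # A3
        size = len(order) // n_groups
        groups = [order[i * size:(i + 1) * size] for i in range(n_groups)]
        v_means = [statistics.fmean(valida[i] for i in g) for g in groups]  # A4
        t_means = [statistics.fmean(today[i] for i in g) for g in groups]
        if pearson(v_means, t_means) > threshold:            # A5/A6: accept
            return k, valida
        k *= 1.5                                             # A6: retry with larger k
    return k, summary_features(histories, k)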
2. The data processing method of claim 1, wherein loading the data set as training data into the user behavior prediction model when the user behavior prediction model requires loading of training data further comprises:
reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files which can be loaded by the current memory resource;
loading the file corresponding to the data file list to a memory;
and calling a preset training interface function to read the data file from the memory for model training.
3. The data processing method of claim 2, wherein reading a preset number of data files from a data set further comprises:
and reading a preset number of data files from the data set in a random sampling mode.
4. The data processing method of claim 2, wherein said calling a preset training interface function to read a data file from said memory for model training further comprises:
determining operation parameters of a training interface function, wherein the operation parameters comprise the times of each data file participating in training, the number of data files loaded in each iteration and the number of iterations;
and calling the training interface function to enable the training interface function to read a data file from the memory according to the operation parameters to perform model training.
5. A data processing apparatus for loading user behavior prediction model data, the data processing apparatus comprising:
the data acquisition module is used for acquiring historical data of user behaviors and extracting user behavior data based on time series;
the data processing module is used for multiplying the user behavior data by a time-decay function c(t), based on the principle that the shorter the interval between the last occurrence of a user behavior and the current time, the higher the probability that the behavior occurs again, and then summing to obtain Σ c(t)·a, which represents the summary feature of the user for that behavior; wherein a is the user behavior data, c(t) = e^(-kt), and k is the attenuation coefficient, the determination of the attenuation coefficient k comprising the following steps:
a1: let historical user behavior data be ahistoryThe distance from the current time interval is t, and the user behavior value of the current time is atodayThe initial value of the attenuation coefficient k is 0.01;
a2: calculating the summary feature valida ═ Σ e of each user-ktahistory
A3: arranging user data according to the descending order of valida, and uniformly dividing the user data into n groups;
a4: computing the valida mean sum a of each group of userstodayAveraging to obtain two average value sequences;
a5: calculating a correlation coefficient between the two mean value sequences;
a6: if the correlation coefficient is larger than 0.85, taking valida as the summary feature of the corresponding user; if the correlation coefficient is less than or equal to 0.85, multiplying the attenuation coefficient k by 1.5 to obtain a new attenuation coefficient k, and repeating the steps A2-A6;
and the data loading module is used for collecting the collected characteristics of all users to form a data set, and when the user behavior prediction model needs to load training data, the data set is used as the training data to be loaded into the user behavior prediction model.
6. The data processing apparatus of claim 5, wherein the data loading module comprises a data reading unit, a data loading unit, and a training processing unit;
the data reading unit is used for reading a preset number of data files from the data set to obtain a data file list; the preset number is the maximum number of files which can be loaded by the current memory resource;
the data loading unit is used for loading the file corresponding to the data file list to a memory;
and the training processing unit is used for calling a preset training interface function to read a data file from the memory to perform model training.
7. A data processing apparatus comprising a memory having computer readable instructions stored therein and a processor which when executing the computer readable instructions implements a data processing method as claimed in any one of claims 1 to 4.
8. A computer-readable medium storing a computer program, characterized in that the computer program, when executed by one or more processors, implements a data processing method according to any one of claims 1 to 4.
CN202110494782.9A 2021-05-07 2021-05-07 Data processing method, device, equipment and storage medium Active CN113205217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110494782.9A CN113205217B (en) 2021-05-07 2021-05-07 Data processing method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113205217A CN113205217A (en) 2021-08-03
CN113205217B true CN113205217B (en) 2022-07-15

Family

ID=77029027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110494782.9A Active CN113205217B (en) 2021-05-07 2021-05-07 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113205217B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793465A (en) * 2013-12-20 2014-05-14 武汉理工大学 Cloud computing based real-time mass user behavior analyzing method and system
CN106997360A (en) * 2016-01-25 2017-08-01 阿里巴巴集团控股有限公司 The treating method and apparatus of user behavior data
CN110516817A (en) * 2019-09-03 2019-11-29 北京华捷艾米科技有限公司 A kind of model training data load method and device
WO2020191282A2 (en) * 2020-03-20 2020-09-24 Futurewei Technologies, Inc. System and method for multi-task lifelong learning on personal device with improved user experience

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10496927B2 (en) * 2014-05-23 2019-12-03 DataRobot, Inc. Systems for time-series predictive data analytics, and related methods and apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Gradient boosting regression tree housing recommendation algorithm based on a time decay factor; Feng Chi, Ye Hua; 《信息技术与信息化》 (Information Technology and Informatization); 2019-03-25; pp. 48-51 *

Also Published As

Publication number Publication date
CN113205217A (en) 2021-08-03

Similar Documents

Publication Publication Date Title
AU2022211812A1 (en) Method and system of dynamic model selection for time series forecasting
CN106097044A (en) A kind of data recommendation processing method and device
CN110264342A (en) A kind of business audit method and device based on machine learning
US20210125272A1 (en) Using Inferred Attributes as an Insight into Banking Customer Behavior
Brown et al. Life-cycle costing: A practical guide for energy managers
Chorro et al. Discriminating between GARCH models for option pricing by their ability to compute accurate VIX measures
CN113205217B (en) Data processing method, device, equipment and storage medium
CN107844874A (en) Enterprise operation problem analysis system and its method
Li et al. Econometric analysis of disequilibrium relations between internet finance and real economy in China
CN104750877A (en) Statistical analysis method used for cloud computing resource pricing
Soulas et al. Online machine learning algorithms for currency exchange prediction
CN113515528A (en) Asset screening system and method based on big data and ORACLE mass data
CN112465611A (en) Method and device for pushing article information to user and electronic equipment
CN112418670A (en) Case allocation method, device, equipment and medium
CN112836742A (en) System resource adjusting method, device and equipment
CN113450141A (en) Intelligent prediction method and device based on electricity selling quantity characteristics of large-power customer groups
CN111639910A (en) Standing book generation method, device, equipment and storage medium
Glau et al. A unified view of LIBOR models
CN109474703A (en) Personalized product combines method for pushing, apparatus and system
Zhang et al. Research on P2P Default Risk Prediction Based on Logistic Regression
CN111967966B (en) Automatic wake-up method and system for sleep clients of mobile phone banks
Sreejanya et al. Big Data Analysis Using Financial Risk Management
CN112347371A (en) Resource returning and ratio increasing method and device based on social text information and electronic equipment
CN116204724A (en) Financial product recommendation method and device
CN115564561A (en) Enterprise data processing method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant