CN113836411A - Data processing method and device and computer equipment

Info

Publication number: CN113836411A
Application number: CN202111108679.2A
Authority: CN
Inventors: 卢晓威; 何其真; 钟礼刚
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2021-09-22
Filing date: 2021-09-22
Publication date: 2021-12-24
Anticipated expiration: 2041-09-22
Also published as: CN113836411B

Abstract

The application discloses a data processing method, a data processing device and computer equipment, wherein the method comprises the following steps: processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is generated in a previous time period of the current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset. The present application also provides a computer-readable storage medium. The method and the device can effectively ensure the quantity and freshness of the online training data, and improve the online training efficiency of the model.

Description

Data processing method and device and computer equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, and a computer device.

Background

With the development of internet technology, more and more users choose to browse, select or purchase the required goods on the internet. Each e-commerce platform provides abundant and various commodities for users, and meanwhile, recommends commodities to the users in different degrees by adopting various recommendation technologies. In order to achieve the purpose of recommending various useful information to a user in time and avoiding recommending useless information as much as possible, user behavior data of clicking or browsing commodity advertisements by a plurality of users are collected, and therefore a click rate estimation model capable of estimating click probabilities of different users on recommended data is trained.

Generally speaking, since user preferences change over time, a click rate estimation model trained only on historical user behavior data, that is, offline user behavior data, often estimates the user's click rate on recommended data inaccurately. However, in the prior art, historical user behavior data and real-time user behavior data are stored in different ways, so the two cannot be used effectively together for model training; that is, the training data available for training the model online comes from a single source, which results in poor accuracy of the model trained online.

Disclosure of Invention

The application provides a data processing method, a data processing device and computer equipment, which can solve the problems that the training data comes from a single source and that the accuracy of the model trained online is low.

First, to achieve the above object, the present application provides a data processing method, including:

processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset.

In one example, the processing of the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set includes: pulling each piece of initial user behavior data from the first storage unit; tagging each piece of initial user behavior data with a message code in sequence; and extracting the valid fields from the tagged initial user behavior data to obtain corresponding user behavior data, and recording the corresponding user behavior data in the first online user behavior data set.

In one example, the valid field includes at least one of a user ID, user identity information, behavior data generation time, and recommendation data.

In one example, the processing of the user behavior data in the first storage unit further includes: distributing the initial user behavior data generated by the same user side in the first storage unit to the same computing node in the Flink computing engine for data processing.

In one example, the merging of the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set includes: deduplicating all user behavior data of the first online user behavior data set and the second online user behavior data set, and sorting the data by generation time; removing expired user behavior data whose generation time is earlier than a preset time threshold to obtain the third online user behavior data set; and storing the third online user behavior data set in the second storage unit in place of the data previously stored there.

In one example, the method further comprises: scanning all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the version number comprises snapshot information of all behavior data of the first online user behavior data set; storing the first version number to the second storage unit; and storing the first online user behavior data set and the first version number into a preset third storage unit.

In one example, after the training of the model is completed and the online user behavior data set in the second storage unit is updated again, the method further comprises: querying the first online user behavior data set from the third storage unit according to the first version number, for use in a regression training process of the model.

In addition, to achieve the above object, the present application also provides a data processing apparatus, including:

the processing module is used for processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; the merging module is used for merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period; the extraction module is used for performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; and the training module is used for training a model based on the online user feature data set.

Further, the present application also proposes a computer device, which includes a memory and a processor, wherein the memory stores a computer program that can be executed on the processor, and the computer program implements the steps of the data processing method as described above when executed by the processor.

Further, to achieve the above object, the present application also provides a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the data processing method as described above.

Compared with the prior art, the data processing method, the data processing device, the computer equipment and the computer readable storage medium provided by the application can process the user behavior data in the first storage unit to obtain the corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset. After the new online user behavior data set is obtained, the online user behavior data in the second storage unit is directly updated and replaced, so that the online user characteristic data set is rapidly extracted for online model training, the quantity and freshness of online training data are guaranteed, and the online training efficiency of the model is improved.

Drawings

FIG. 1 is a schematic diagram of an application environment according to an embodiment of the present application;

FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a data processing method of the present application;

FIG. 3 is a flow diagram of a data processing framework in an illustrative example of the present application;

FIG. 4 is a flowchart illustrating the effect of generating training data during an online regression training process according to an exemplary embodiment of the present application;

FIG. 5 is a block diagram of a program module of an embodiment of the data processing apparatus of the present application;

FIG. 6 is a diagram of an alternative hardware architecture of the computer device of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the descriptions in this application referring to "first", "second", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when the technical solutions are contradictory or the combination cannot be realized, such a combination should be considered not to exist and is not within the protection scope of the present application.

Fig. 1 is a schematic diagram of an application environment according to an embodiment of the present application. Referring to fig. 1, the computer device 1 is connected to a data server 20, and the data server 20 is connected to a user terminal 10. Any user terminal 10 can access data on the data server 20, for example by accessing an App page or a web page, and the data server 20 can then push recommended data to the user terminal 10 through the App page or the web page. After obtaining authorization from the user terminal 10, the data server 20 can also obtain user information data and user behavior data from the user terminal 10 and store them in a corresponding database, for example a Kafka distributed log system.

Therefore, after the computer device 1 is connected to the data server 20, the initial user behavior data of all the user sides in the current time period acquired by the data server 20 can be acquired and stored in a preset first storage unit; processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model, such as a click through rate prediction model, based on the online user feature data set.

In this embodiment, the data server 20 may be a mobile phone, a tablet, a portable device, a PC, or another data service platform, such as a video service platform or an online shopping platform; the user terminal 10 may be a mobile phone, a tablet, a portable device, a PC, or the like; and the computer device 1 may be a mobile phone, a tablet, a portable device, a PC, a server, or the like. Of course, in other embodiments, the computer device 1 may be combined with the data server 20 into the same electronic device, or the computer device 1 may be attached to the data server 20 as a separate functional module to implement the data processing function.

Example one

Fig. 2 is a schematic flowchart of an embodiment of a data processing method according to the present application. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by way of example with the computer apparatus 1 as the execution subject.

As shown in fig. 2, the data processing method may include steps S200 to S206.

Step S200, processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, where the first storage unit stores the user behavior data in the current time period.

Specifically, the computer device 1 is connected to a data server that is dedicated to providing data services for users. Each user side can access data on the data server, for example by accessing an App page or a web page. After obtaining authorization from the user side, the data server can obtain user data from the user side, including user information data and initial user behavior data, and store it in a preset first storage unit, for example a database corresponding to the data server. The user information data includes data such as a user ID (identifier), user gender, age, occupation, or online age. The user behavior data includes the user's clicks on, browsing of, comments on, and visits to the target recommendation data, and whether the user purchased the product or service corresponding to the target recommendation data, where the recommendation data may be a text link or a picture link of a commodity advertisement.

In this embodiment, when each user accesses data on the data server through a respective user side, for example by accessing an App page or a web page, the data server may record access log information of each user for the target recommendation data. For example, the data server embeds tracking points in a web page or an App page of the target data in advance, and can then detect how each user accesses that page; or it embeds tracking points in video frame data of the target data in advance, and can then detect how each user watches the video data of the target data. The access or viewing records include initial user behavior data such as clicking, browsing, commenting, access time, and whether a product or service was purchased. The initial user behavior data generated by all the user sides is then stored in the preset first storage unit, such as a Kafka distributed log system.

Next, the computer device 1 performs data processing, such as data cleaning, on each piece of the initial user behavior data in the first storage unit, so as to obtain a corresponding first online user behavior data set.

In an exemplary example, the processing, by the computer device 1, of the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set includes: pulling each piece of initial user behavior data from the first storage unit; tagging each piece of initial user behavior data with a message code in sequence; and extracting the valid fields from the tagged initial user behavior data to obtain corresponding user behavior data, and recording the corresponding user behavior data in the first online user behavior data set. The valid fields include at least one of a user ID, user identity information, behavior data generation time, and recommendation data. For example, when a user browses and clicks recommended data through the user side, the generated initial user behavior data includes the following fields: user ID, timestamp of the behavior, ID of the recommended data acted on, merchant ID corresponding to the recommended data, event type (exposure or click), exposure position of the recommended data, and the like. The computer device 1 may therefore extract the preset valid fields, such as the user ID, user identity information, behavior data generation time, and recommendation data, from the initial user behavior data according to a preset character recognition technology.
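The pull, tag, and extract steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the JSON message layout, the field names `user_id`, `ts`, `item_id`, and `event`, and the helper `clean_records` are all hypothetical, and a plain Python list stands in for the first storage unit.

```python
import json

# Hypothetical valid-field list; the description names user ID, identity
# information, generation time, and recommendation data only as examples.
VALID_FIELDS = ("user_id", "ts", "item_id", "event")


def clean_records(raw_messages):
    """Tag each pulled message with a sequential code and keep only valid fields."""
    dataset = []
    for msg_code, raw in enumerate(raw_messages, start=1):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop messages that cannot be parsed
        if not isinstance(record, dict):
            continue  # drop messages that are not key-value records
        if any(field not in record for field in VALID_FIELDS):
            continue  # drop messages that lack a valid field
        cleaned = {field: record[field] for field in VALID_FIELDS}
        cleaned["msg_code"] = msg_code  # code assigned in pull order
        dataset.append(cleaned)
    return dataset
```

Because the code is assigned before filtering, dropped messages leave gaps in the sequence; whether that is desirable depends on how the message code is used downstream, which the description does not specify.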

In an exemplary example, the processing, by the computer device 1, of the user behavior data in the first storage unit further includes: distributing the initial user behavior data generated by the same user side in the first storage unit to the same computing node in the Flink computing engine for data processing.

In this embodiment, the computer device 1 stores the collected initial user behavior data in Kafka, and the Flink computing engine then pulls the initial user behavior data from Kafka for consumption. Specifically, the Flink computing engine includes a plurality of computing nodes, a degree of parallelism can be set for each computing node, and different computing nodes process data simultaneously to improve throughput. However, this concurrency also means that, when different computing nodes simultaneously process initial user behavior data of the same user side, overlapping or repeated computation may occur. To solve this problem, the computer device 1 uses the keyBy operator of the Flink computing engine to send the parallel messages generated by the same user side to the same computing node, so that out-of-order messages at user granularity are processed sequentially and the concurrency problem is resolved.
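The idea of routing all messages of one user side to one computing node can be illustrated without Flink. The sketch below is an assumption-laden stand-in: `node_for` and `partition_by_user` are hypothetical helpers, CRC32 is chosen only because it is a stable hash available in the standard library, and Flink's own key-to-node assignment differs in detail.

```python
import zlib
from collections import defaultdict


def node_for(user_id, num_nodes):
    """Map a user ID to a node index with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_nodes


def partition_by_user(records, num_nodes):
    """Group records by node; records of one user keep their arrival order."""
    per_node = defaultdict(list)
    for record in records:
        per_node[node_for(record["user_id"], num_nodes)].append(record)
    return dict(per_node)
```

Since the mapping depends only on the user ID, two messages of the same user can never be handled by different nodes at the same time, which is the property the description relies on.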

Step S202, merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, where the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period.

Specifically, after the computer device 1 performs data processing on all initial user behavior data in Kafka to obtain the corresponding first online user behavior data set, the first online user behavior data set is stored in a preset second storage unit, such as the online store Redis (Remote Dictionary Server). Redis is an open-source, log-type key-value database written in ANSI C that is accessed over a network, holds data in memory, and can optionally persist it. In this embodiment, Redis is used to store online user behavior data. When Redis already holds an online user behavior data set, such as the second online user behavior data set, the computer device 1 needs to merge the first online user behavior data set with the second online user behavior data set to obtain a merged third online user behavior data set.

In an exemplary example, the merging, by the computer device 1, of the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set includes: deduplicating all user behavior data of the first online user behavior data set and the second online user behavior data set, and sorting the data by generation time; removing expired user behavior data whose generation time is earlier than a preset time threshold to obtain the third online user behavior data set; and storing the third online user behavior data set in the second storage unit in place of the data previously stored there. Specifically, the computer device 1 combines every piece of user behavior data of the first online user behavior data set and the second online user behavior data set, sorts the combined data by generation time, and discards the online user behavior data with the earliest generation times, thereby ensuring that the third online user behavior data set in Redis is both sufficiently large and up to date.
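The merge described above (deduplicate, sort by generation time, drop expired records, replace the stored set) can be sketched as below. The sketch rests on assumptions: a dict stands in for the second storage unit, the record fields `user_id`, `ts`, `item_id`, and `event` are hypothetical, and two records are treated as duplicates when all four fields match, a rule the description does not specify.

```python
def merge_behavior_sets(first_set, second_set, expire_before):
    """Deduplicate both sets, sort by generation time, and drop expired records."""
    seen = set()
    merged = []
    for record in list(second_set) + list(first_set):
        # Assumed duplicate rule: identical user, time, item, and event type.
        key = (record["user_id"], record["ts"], record["item_id"], record["event"])
        if key in seen:
            continue
        seen.add(key)
        merged.append(record)
    merged.sort(key=lambda r: r["ts"])
    # Keep only records generated at or after the time threshold.
    return [r for r in merged if r["ts"] >= expire_before]


def replace_in_store(store, key, first_set, expire_before):
    """Merge the new set with the stored set and overwrite the stored set."""
    third_set = merge_behavior_sets(first_set, store.get(key, []), expire_before)
    store[key] = third_set
    return third_set
```

Here `replace_in_store` overwrites the whole entry rather than appending to it, mirroring the replace-and-store step; the threshold is passed in explicitly so that the retention window stays a configuration choice.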

In this embodiment, the computer device 1 pulls the first online user behavior data set and the second online user behavior data set into different computing nodes of the Flink computing engine to perform data merging. During merging, the keyBy operator assigns the user behavior data corresponding to the same user side to the same computing node in the Flink computing engine, so as to avoid a concurrency problem.

Because data merging is performed concurrently on a plurality of nodes, concurrency problems arise easily. For example, computing node A processes message A: it reads history message A1 from Redis, merges the current message A into it, and writes the result back to Redis. At the same time, computing node B, while processing message B, also reads the history from Redis, but the history it reads does not yet contain A; when B writes its result back, message A is overwritten. Therefore, the computer device 1 assigns the user behavior data corresponding to the same user side to the same computing node through the keyBy operator of the Flink computing engine to perform data merging, which effectively avoids this concurrency problem.
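The overwrite described above can be reproduced deterministically in a few lines. The sketch below is only an illustration: a dict stands in for Redis, strings stand in for messages, and the helper names are hypothetical. The first function replays the interleaving in which both nodes read before either writes; the second handles the same two messages one after another, as happens when both are routed to one node.

```python
def read_merge_write(store, key, history_snapshot, message):
    """Write back the history a node read earlier plus its own message."""
    store[key] = history_snapshot + [message]


def interleaved_result():
    """Nodes A and B both read the history before either one writes."""
    store = {"u1": ["A1"]}
    snapshot_a = list(store["u1"])  # node A reads [A1]
    snapshot_b = list(store["u1"])  # node B reads [A1], without A
    read_merge_write(store, "u1", snapshot_a, "A")
    read_merge_write(store, "u1", snapshot_b, "B")  # overwrites A's write
    return store["u1"]


def serialized_result():
    """One node handles both messages of the user in turn."""
    store = {"u1": ["A1"]}
    for message in ("A", "B"):
        read_merge_write(store, "u1", list(store["u1"]), message)
    return store["u1"]
```

In the interleaved case the stored history ends up without message A, while the serialized case keeps both messages.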

Step S204, performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set.

Step S206, training a model based on the online user feature data set.

After the computer device 1 merges the first online user behavior data set and the second online user behavior data set to obtain the third online user behavior data set, it further performs feature extraction on the third online user behavior data set to obtain a corresponding online user feature data set. The online user feature data set is then used to train a preset model, such as a click-through rate prediction model, where the click-through rate prediction model may be an initial model or a mature model that has already been trained on online user behavior data.
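The description does not say which features are extracted or which model is trained, so the following is purely a hypothetical sketch of steps S204 and S206. It assumes records with `user_id` and `event` fields sorted by generation time, treats each record as one sample labeled 1 for a click and 0 for an exposure, uses the user's running click rate as the only feature besides a bias term, and swaps in a plain logistic regression updated by stochastic gradient descent as the click-through rate model.

```python
import math
from collections import Counter


def extract_features(behavior_set):
    """Turn each record into ([bias, prior click rate], label), in time order."""
    seen, clicks = Counter(), Counter()
    samples = []
    for record in behavior_set:
        user = record["user_id"]
        # Click rate of this user over earlier records (0.0 if none yet).
        rate = clicks[user] / seen[user] if seen[user] else 0.0
        label = 1.0 if record["event"] == "click" else 0.0
        samples.append(([1.0, rate], label))
        seen[user] += 1
        clicks[user] += int(label)
    return samples


def sgd_step(weights, features, label, lr=0.1):
    """One logistic-regression gradient step on a single sample."""
    z = sum(w * x for w, x in zip(weights, features))
    pred = 1.0 / (1.0 + math.exp(-z))
    return [w - lr * (pred - label) * x for w, x in zip(weights, features)]


def train_online(weights, samples, lr=0.1):
    """Update the model one sample at a time, as fresh data arrives."""
    for features, label in samples:
        weights = sgd_step(weights, features, label, lr)
    return weights
```

Updating sample by sample lets an already trained model keep its weights and continue learning from the newest merged data, which matches the description's option of starting from either an initial or a mature model.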

In an illustrative example, the computer device 1 further performs the following steps when processing and storing the online user behavior data set: scanning all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the version number comprises snapshot information of all behavior data of the first online user behavior data set; storing the first version number in the second storage unit; and storing the first online user behavior data set and the first version number in a preset third storage unit. That is to say, each time the computer device 1 obtains an online user behavior data set, it generates a version number for that data set by taking a snapshot of each piece of user behavior data in it. It then stores the version number in the second storage unit, that is, the online storage unit Redis, and stores the version number together with the online user behavior data set in the third storage unit, that is, an offline storage unit such as a Hive database.

Thus, after the computer device 1 has finished training the model and the online user behavior data set in the second storage unit has been updated again, the computer device 1 further performs the step of: querying the first online user behavior data set from the third storage unit according to the first version number, for use in a regression training process of the model.

That is, the computer device 1 stores the version number of the online user behavior dataset generated each time to Redis, stores the latest online user behavior dataset to Redis, and stores the version number of the online user behavior dataset generated each time and the online user behavior dataset generated each time to Hive. In the subsequent process, the computer device 1 may obtain the corresponding online user behavior data set from the Hive according to the version number in the Redis, thereby implementing online regression training.
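The version-number bookkeeping can be sketched as follows. Everything concrete here is an assumption: two dicts stand in for Redis and Hive, the helper names are hypothetical, and the version number is derived as a truncated SHA-256 digest of the serialized data set, which is only one possible way to make it reflect a snapshot of the data.

```python
import hashlib
import json


def make_version(behavior_set):
    """Derive a version number from the content of a behavior data set."""
    payload = json.dumps(behavior_set, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def publish(behavior_set, online_store, offline_store):
    """Record the version online and archive the versioned data set offline."""
    version = make_version(behavior_set)
    online_store.setdefault("versions", []).append(version)  # Redis stand-in
    offline_store[version] = list(behavior_set)              # Hive stand-in
    return version


def load_for_regression(version, offline_store):
    """Fetch the archived data set that belongs to a version number."""
    return offline_store[version]
```

Even after the online store has been overwritten by later merges, a version number kept online is enough to retrieve the exact data set from the offline store for regression training.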

Fig. 3 is a flow diagram of data processing in an illustrative example of the present application. In this embodiment, the computer device 1 acquires initial user behavior data, including click behavior or other exposure behavior for recommended data, from a data server and stores it in Kafka. It then cleans the initial user behavior data in Kafka through a preset Flink computing engine to obtain an online user behavior data set, and stores that data set in both Redis and Hive. Because Redis, as the online storage unit, already holds earlier online user behavior data, the newly obtained data set must be merged with the data already in Redis, and the merged result is stored in its place. Next, the computer device 1 performs feature extraction on the online user behavior data in Redis through a preset online engine to obtain corresponding user feature data. Finally, the user feature data is input into a preset click rate estimation model for training.

Referring to fig. 4, fig. 4 is a flowchart illustrating an effect of a process of generating training data in an online regression training process according to an exemplary embodiment of the present application.

In this embodiment, the computer device 1 first obtains initial user behavior data from a data server and cleans it through a preset Flink computing engine to obtain an online user behavior data set. It then scans each piece of online user behavior data in the data set and generates a version number, and stores the online user behavior data set and the version number in Redis and in the behavior data Hive, respectively. Because Redis, as the online storage unit, already holds earlier online user behavior data, the newly obtained data must be merged with the data already in Redis, and the merged result is stored in its place. During online training, feature extraction can be performed on the online user behavior data set in Redis through a preset online engine, and the resulting user feature data is stored in the request record Hive. During regression training, the computer device 1 may query the historical user feature data stored in the request record Hive by version number and splice it into user feature data for the corresponding time periods, and may perform feature extraction on the user behavior data in the behavior data Hive to obtain user feature data for the corresponding time periods, so as to perform regression training on the click rate estimation model.

In summary, the data processing method provided in this embodiment can process the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, where the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset. After the new online user behavior data set is obtained, the online user behavior data in the second storage unit is directly updated and replaced, so that the online user characteristic data set is rapidly extracted for online model training, the quantity and freshness of online training data are guaranteed, and the online training efficiency of the model is improved.

Example two

Fig. 5 schematically shows a block diagram of a data processing apparatus according to the second embodiment of the present application, which may be partitioned into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the second embodiment of the present application. The program modules referred to in the embodiments of the present application refer to a series of computer program instruction segments that can perform specific functions, and the following description will specifically describe the functions of the program modules in the embodiments.

As shown in fig. 5, the data processing apparatus 400 may include a processing module 410, a merging module 420, an extraction module 430, and a training module 440, wherein:

The processing module 410 is configured to process the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, where the first storage unit stores the user behavior data in the current time period.

The merging module 420 is configured to merge the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, where the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period.

The extraction module 430 is configured to perform feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set.

The training module 440 is configured to train a model based on the online user feature data set.
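The way the four modules hand data to one another can be illustrated with a minimal sketch. It is not the implementation of the present application: the class name `DataProcessingApparatus`, the method `run`, and the use of plain Python lists as storage units are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # hypothetical representation of one piece of user behavior data


@dataclass
class DataProcessingApparatus:
    """Illustrative stand-in for apparatus 400; each field plays the role of one module."""
    process: Callable[[List[Record]], List[Record]]               # processing module 410
    merge: Callable[[List[Record], List[Record]], List[Record]]   # merging module 420
    extract: Callable[[Record], Any]                              # extraction module 430
    train: Callable[[List[Any]], Any]                             # training module 440

    def run(self, first_storage: List[Record], second_storage: List[Record]) -> Any:
        # Current-period data -> first online user behavior data set.
        first_set = self.process(first_storage)
        # First set + previous-period (second) set -> third set.
        third_set = self.merge(first_set, second_storage)
        # Feature extraction is applied to each piece of user behavior data.
        features = [self.extract(record) for record in third_set]
        # The model is trained on the online user feature data set.
        return self.train(features)
```

Each module is passed in as a callable so that the sketch stays independent of any particular storage system, feature scheme, or model.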

In an exemplary embodiment, the processing module 410 is further configured to: pull each piece of initial user behavior data from the first storage unit; mark each piece of initial user behavior data with a message code in sequence; and extract the valid fields of the initial user behavior data that has been marked with a message code to obtain corresponding user behavior data, and record that user behavior data in the first online user behavior data set. The valid fields include at least one of a user ID, user identity information, behavior data generation time, and recommendation data.
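A minimal sketch of this processing step follows. The field names in `VALID_FIELDS`, the function name `build_first_dataset`, and the choice of a sequential integer as the message code are assumptions for illustration; the text above names only the kinds of valid fields and does not specify the format of the message code.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical field names standing in for: user ID, user identity information,
# behavior data generation time, and recommendation data.
VALID_FIELDS = ("user_id", "identity", "generated_at", "recommendation")


def build_first_dataset(raw_records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Mark each pulled record with a message code and keep only its valid fields."""
    dataset = []
    # Assumption: the message code is a running sequence number.
    for message_code, raw in enumerate(raw_records):
        record = {name: raw[name] for name in VALID_FIELDS if name in raw}
        record["message_code"] = message_code
        dataset.append(record)
    return dataset
```

Fields that are not in the valid list (for example debugging payloads) are dropped, so the first online user behavior data set carries only what later steps need.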

In an exemplary embodiment, the processing module 410 is further configured to: distribute the initial user behavior data generated by the same user side in the first storage unit to the same computing node in the Flink computing engine for data processing.
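The routing idea can be sketched in plain Python, independent of Flink itself. In Flink, this kind of routing is commonly expressed by keying the stream on a field; the sketch below only imitates the effect with a stable hash. The key field `user_id`, the function names, and the node count are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def node_for(user_id: str, num_nodes: int) -> int:
    """Map a user to a node index with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition_by_user(records: Iterable[Dict[str, Any]], num_nodes: int) -> Dict[int, List[Dict[str, Any]]]:
    """Group records so that all data from the same user lands on the same node."""
    partitions: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        partitions[node_for(record["user_id"], num_nodes)].append(record)
    return dict(partitions)
```

Because the node index depends only on the user key, per-user processing such as ordering or deduplication can be done locally on one node without cross-node coordination.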

In an exemplary embodiment, the merging module 420 is further configured to: perform a deduplication operation on all user behavior data of the first online user behavior data set and the second online user behavior data set, and sort the result by generation time; clear expired user behavior data whose generation time is earlier than a preset time threshold to obtain the third online user behavior data set; and store the third online user behavior data set in the second storage unit, replacing its previous contents.
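The merge described above can be sketched as follows. The deduplication key (user ID, generation time, recommendation data), the field names, and the use of a dictionary as the second storage unit are assumptions for illustration; the text does not state how duplicates are identified.

```python
from typing import Any, Dict, List


def merge_datasets(first: List[Dict[str, Any]],
                   second: List[Dict[str, Any]],
                   min_generated_at: float) -> List[Dict[str, Any]]:
    """Deduplicate both sets, sort by generation time, and drop expired records."""
    seen = set()
    merged = []
    for record in list(second) + list(first):
        # Assumed identity of a record; a real system might use a message ID instead.
        key = (record["user_id"], record["generated_at"], record.get("recommendation"))
        if key in seen:
            continue
        seen.add(key)
        merged.append(record)
    merged.sort(key=lambda r: r["generated_at"])
    # Records generated before the threshold are treated as expired.
    return [r for r in merged if r["generated_at"] >= min_generated_at]


def replace_second_storage(second_storage: Dict[str, Any], third: List[Dict[str, Any]]) -> None:
    """Overwrite the previous-period data set with the merged (third) data set."""
    second_storage["dataset"] = third
```

Writing the third data set back into the second storage unit means the next time period starts from an already deduplicated, time-bounded window.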

In an exemplary embodiment, the merging module 420 is further configured to: scan all user behavior data of the first online user behavior data set to generate a corresponding first version number, where the first version number includes snapshot information of all behavior data of the first online user behavior data set; store the first version number in the second storage unit; and store the first online user behavior data set and the first version number in a preset third storage unit.

In an exemplary embodiment, the training module 440 is further configured to: query the first online user behavior data set from the third storage unit according to the first version number, for use in executing a regression training process of the model.
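A minimal sketch of versioned archiving and lookup follows. Using a SHA-256 content hash as the version number is an assumption chosen because it summarizes every record in the set; the text says only that the version number contains snapshot information. The function names and the dictionaries standing in for the second and third storage units are likewise hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def make_version(dataset: List[Dict[str, Any]]) -> str:
    """Scan every record and derive a version string from the full content (assumed scheme)."""
    payload = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def archive(dataset: List[Dict[str, Any]],
            second_storage: Dict[str, Any],
            third_storage: Dict[str, Any]) -> str:
    """Record the version in the second storage unit and the versioned data in the third."""
    version = make_version(dataset)
    second_storage["version"] = version
    third_storage[version] = list(dataset)
    return version


def load_for_regression(version: str, third_storage: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Fetch the archived data set by version for a later regression training run."""
    return third_storage[version]
```

Since the second storage unit is overwritten every time period, keeping a versioned copy in a third storage unit is what makes an earlier data set retrievable after later updates.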

Example three

Fig. 6 schematically shows a hardware architecture diagram of a computer device 1 suitable for implementing the data processing method according to the third embodiment of the present application. In the present embodiment, the computer device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. For example, the computer device 1 may be a rack server, a blade server, or a tower server (including an independent server or a server cluster composed of a plurality of servers) with a gateway function. As shown in fig. 6, the computer device 1 includes at least, but is not limited to, a memory 510, a processor 520, and a network interface 530, which may be communicatively linked to each other by a system bus. Wherein:

The memory 510 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 510 may be an internal storage module of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 510 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 1. Of course, the memory 510 may also comprise both an internal storage module of the computer device 1 and an external storage device thereof. In this embodiment, the memory 510 is generally used for storing an operating system installed in the computer device 1 and various types of application software, such as program codes of the data processing method. In addition, the memory 510 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 520 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other model training chip in some embodiments. The processor 520 is generally used for controlling the overall operation of the computer device 1, such as performing control and processing related to data interaction or communication with the computer device 1. In this embodiment, processor 520 is configured to execute program codes stored in memory 510 or process data.

Network interface 530 may include a wireless network interface or a wired network interface, and network interface 530 is typically used to establish communication links between computer device 1 and other computer devices. For example, the network interface 530 is used to connect the computer device 1 with an external terminal through a network, establish a data transmission channel and a communication link between the computer device 1 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It should be noted that fig. 6 only shows a computer device having components 510-530, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

In this embodiment, the program code of the data processing method stored in the memory 510 may also be divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 520) to implement the embodiments of the present application.

Example four

The present embodiment also provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of the data processing method in the foregoing embodiment.

In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in the computer device, for example, the program codes of the data processing method in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.

It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications that can be made by the use of the equivalent structures or equivalent processes in the specification and drawings of the present application or that can be directly or indirectly applied to other related technologies are also included in the scope of the present application.

Claims

1. A method of data processing, the method comprising:

processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period;

merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period;

performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set;

training a model based on the online user feature dataset.

2. The data processing method of claim 1, wherein the processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set comprises:

pulling each piece of initial user behavior data from the first storage unit;

sequentially marking each piece of initial user behavior data with message codes;

and extracting effective fields of the initial user behavior data coded by the marked message to obtain corresponding user behavior data, and recording the corresponding user behavior data to a first online user behavior data set.

3. The data processing method of claim 2, wherein the valid field includes at least one of a user ID, user identity information, behavior data generation time, and recommendation data.

4. The data processing method of any of claims 1-3, wherein the processing the user behavior data in the first storage unit further comprises:

and distributing the initial user behavior data generated by the same user side in the first storage unit to the same computing node in the Flink computing engine to perform data processing.

5. The data processing method of claim 1, wherein merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, comprises:

performing duplicate removal operation on all user behavior data of the first online user behavior data set and the second online user behavior data set, and sequencing according to the generation time;

clearing the expired user behavior data whose generation time is earlier than a preset time threshold to obtain the third online user behavior data set;

and storing the third online user behavior data set to the second storage unit in place of its previous contents.

6. The data processing method of any of claims 1-5, wherein the method further comprises:

scanning all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the version number comprises snapshot information of all behavior data of the first online user behavior data set;

storing the first version number to the second storage unit;

and storing the first online user behavior data set and the first version number into a preset third storage unit.

7. The data processing method of claim 6, wherein after the training of the model is completed and the online user behavior data set in the second storage unit is updated again, the method further comprises:

and inquiring the first online user behavior data set from a third storage unit according to the first version number for executing a regression training process of the model.

8. A data processing apparatus, characterized in that the apparatus comprises:

the processing module is used for processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period;

the merging module is used for merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period;

the extraction module is used for extracting the characteristics of each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user characteristic data set;

and the training module is used for training a model based on the online user characteristic data set.

9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory having stored thereon a computer program executable on the processor, the computer program, when executed by the processor, implementing the steps of the data processing method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executable by at least one processor to cause the at least one processor to perform the steps of the data processing method according to any one of claims 1 to 7.