CN113836411A - Data processing method and device and computer equipment - Google Patents


Info

Publication number
CN113836411A
Authority
CN
China
Prior art keywords
user behavior
behavior data
data set
online user
online
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111108679.2A
Other languages
Chinese (zh)
Other versions
CN113836411B (en)
Inventor
卢晓威
何其真
钟礼刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a data processing method, a data processing apparatus, and a computer device. The method comprises: processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data of the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set was generated in the time period preceding the current time period; performing feature extraction on each piece of user behavior data in the third online user behavior data set to obtain a corresponding online user feature data set; and training a model based on the online user feature data set. The application also provides a computer-readable storage medium. The method and apparatus can effectively ensure the quantity and freshness of online training data and improve the efficiency of online model training.

Description

Data processing method and device and computer equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, and a computer device.
Background
With the development of internet technology, more and more users browse, select, or purchase goods online. E-commerce platforms offer users a rich variety of goods and use various recommendation techniques to recommend goods to them. To recommend useful information to users promptly while avoiding useless recommendations as far as possible, platforms collect user behavior data describing how users click on or browse product advertisements, and use it to train a click-through rate prediction model that estimates the probability that a given user will click on a given piece of recommended data.
Because user preferences change over time, a click-through rate prediction model trained only on historical (that is, offline) user behavior data often predicts users' click-through rates on recommended data inaccurately. In the prior art, however, historical user behavior data and real-time user behavior data are stored in different ways, so the two cannot be used effectively together for model training. In other words, the training data available for online model training comes from a single source, and the model trained online is therefore not accurate enough.
Disclosure of Invention
The application provides a data processing method, a data processing apparatus, and a computer device, which can solve the problem that the training data comes from a single source and the model trained online is therefore not accurate enough.
First, to achieve the above object, the present application provides a data processing method, including:
processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset.
In one example, processing the user behavior data in the first storage unit to obtain the corresponding first online user behavior data set includes: pulling each piece of initial user behavior data from the first storage unit; tagging each piece of initial user behavior data, in order, with a message code; and extracting the valid fields from the tagged initial user behavior data to obtain the corresponding user behavior data, and recording that user behavior data in the first online user behavior data set.
In one example, the valid field includes at least one of a user ID, user identity information, behavior data generation time, and recommendation data.
In one example, processing the user behavior data in the first storage unit further includes: distributing the initial user behavior data generated by the same user terminal in the first storage unit to the same computing node of the Flink computing engine for processing.
In one example, merging the first online user behavior data set with the second online user behavior data set in the preset second storage unit to obtain the third online user behavior data set includes: deduplicating all user behavior data of the first and second online user behavior data sets and sorting it by generation time; removing expired user behavior data whose generation time is earlier than a preset time threshold, to obtain the third online user behavior data set; and storing the third online user behavior data set in the second storage unit, replacing its previous contents.
In one example, the method further comprises: scanning all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the version number comprises snapshot information of all behavior data of the first online user behavior data set; storing the first version number to the second storage unit; and storing the first online user behavior data set and the first version number into a preset third storage unit.
In one example, after training of the model is completed and the online user behavior data set in the second storage unit has been updated again, the method further comprises: querying the third storage unit for the first online user behavior data set according to the first version number, for use in a regression training process of the model.
In addition, to achieve the above object, the present application also provides a data processing apparatus, including:
the processing module is used for processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; the merging module merges the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period; the extraction module is used for extracting the characteristics of each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user characteristic data set; and the training module is used for training a model based on the online user characteristic data set.
Further, the present application also proposes a computer device, which includes a memory and a processor, wherein the memory stores a computer program that can be executed on the processor, and the computer program implements the steps of the data processing method as described above when executed by the processor.
Further, to achieve the above object, the present application also provides a computer-readable storage medium storing a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the data processing method as described above.
Compared with the prior art, the data processing method, the data processing device, the computer equipment and the computer readable storage medium provided by the application can process the user behavior data in the first storage unit to obtain the corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset. After the new online user behavior data set is obtained, the online user behavior data in the second storage unit is directly updated and replaced, so that the online user characteristic data set is rapidly extracted for online model training, the quantity and freshness of online training data are guaranteed, and the online training efficiency of the model is improved.
Drawings
FIG. 1 is a schematic diagram of an application environment according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a data processing method of the present application;
FIG. 3 is a flow diagram of a data processing framework in an illustrative example of the present application;
FIG. 4 is a flowchart illustrating the effect of generating training data during an online regression training process according to an exemplary embodiment of the present application;
FIG. 5 is a block diagram of a program module of an embodiment of the data processing apparatus of the present application;
FIG. 6 is a diagram of an alternative hardware architecture of the computer device of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions in this application referring to "first", "second", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present application.
Fig. 1 is a schematic diagram of an application environment according to an embodiment of the present application. Referring to fig. 1, the computer device 1 is connected to a data server 20, and the data server 20 is connected to a user terminal 10. Any user terminal 10 can access data on the data server 20, for example, access data on the data server 20 by accessing an App page or a web page, then the data server 20 can recommend recommended data to the user terminal 10 through the App page or the web page, and the data server 20 can obtain user information data and user behavior data on the user terminal 10 after obtaining authorization of the user terminal 10, and store the user information data and the user behavior data in a corresponding database, for example, a Kafka distributed log system.
Therefore, after the computer device 1 is connected to the data server 20, the initial user behavior data of all the user sides in the current time period acquired by the data server 20 can be acquired and stored in a preset first storage unit; processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model, such as a click through rate prediction model, based on the online user feature data set.
In this embodiment, the data server 20 may be a mobile phone, a tablet, a portable device, a PC, or other data service platforms, such as a video service platform, an online shopping platform, etc.; the user terminal 10 can be used as a mobile phone, a tablet, a portable device, a PC, etc.; the computer device 1 can be used as a mobile phone, a tablet, a portable device, a PC, a server or the like. Of course, in other embodiments, the computer device 1 may be combined with the data server 20 into the same electronic device, or the computer device 1 may also be attached to the data server 20 as a separate functional module to implement the data processing function.
Embodiment one
Fig. 2 is a schematic flowchart of an embodiment of a data processing method according to the present application. It is to be understood that the flowcharts in the method embodiments are not intended to limit the order in which the steps are performed. The following description takes the computer device 1 as the execution subject by way of example.
As shown in fig. 2, the data processing method may include steps S200 to S206.
Step S200, processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, where the first storage unit stores the user behavior data in the current time period.
Specifically, the computer device 1 is connected to a data server that provides data services to users. Each user terminal can access data on the data server, for example through an App page or a web page. After obtaining authorization from the user terminal, the data server can collect data from it, including user information data and initial user behavior data, and store that data in a preset first storage unit, for example a database associated with the data server. The user information data includes items such as the user ID (identifier), gender, age, occupation, and online age. The user behavior data includes the user's clicks on, browsing of, comments on, and visits to target recommendation data, and whether the user purchased the product or service corresponding to the target recommendation data; the recommendation data may be a text link or an image link of a product advertisement. In this embodiment, when users access data on the data server through their user terminals, for example through an App page or a web page, the data server can record each user's access log information for the target recommendation data. For example, the data server embeds tracking points in the web page or App page of the target data in advance and can then detect how each user accesses that page; likewise, it embeds tracking points in the video frame data of the target data in advance and can then detect how each user watches the video. The access or viewing information includes initial user behavior data such as clicks, browsing, comments, access time, and whether a product or service was purchased. The initial user behavior data generated by all user terminals is then stored in the preset first storage unit, such as a Kafka distributed log system.
Next, the computer device 1 performs data processing, such as data cleaning, on each piece of initial user behavior data in the first storage unit to obtain the corresponding first online user behavior data set.
In an illustrative example, the computer device 1 processes the user behavior data in the first storage unit to obtain the corresponding first online user behavior data set by: pulling each piece of initial user behavior data from the first storage unit; tagging each piece of initial user behavior data, in order, with a message code; and extracting the valid fields from the tagged initial user behavior data to obtain the corresponding user behavior data, and recording that user behavior data in the first online user behavior data set. The valid fields include at least one of a user ID, user identity information, the behavior data generation time, and the recommendation data. For example, when a user browses and clicks recommended data through a user terminal, the initial user behavior data generated includes fields such as: the user ID, the timestamp at which the behavior occurred, the ID of the recommended data the behavior acted on, the merchant ID corresponding to the recommended data, the event type (exposure or click), and the exposure position of the recommended data. The computer device 1 can therefore extract the preset valid fields, such as the user ID, user identity information, behavior data generation time, and recommendation data, from the initial user behavior data using a preset character recognition technique.
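The tagging and field-extraction steps can be illustrated with a minimal sketch. It assumes that each raw record arrives as a JSON string, that message codes are simply sequential integers, and that the valid fields carry the hypothetical names `user_id`, `identity`, `event_time`, and `item_id`; none of these details are specified by the application, which refers only to a preset character recognition technique.

```python
import json
from itertools import count

# Hypothetical names for the valid fields listed in the text.
VALID_FIELDS = ("user_id", "identity", "event_time", "item_id")


def process_raw_records(raw_records):
    """Tag each raw record with a message code, then keep only the valid fields."""
    message_codes = count(1)  # assumption: codes are sequential integers
    dataset = []
    for raw in raw_records:
        record = json.loads(raw)  # assumption: records are JSON strings
        code = next(message_codes)
        # Keep only the preset valid fields that are present in this record.
        cleaned = {key: record[key] for key in VALID_FIELDS if key in record}
        cleaned["msg_code"] = code
        dataset.append(cleaned)
    return dataset


raw_records = [
    '{"user_id": "u1", "identity": "member", "event_time": 1700000000,'
    ' "item_id": "ad42", "event": "click", "slot": 3}',
    '{"user_id": "u2", "event_time": 1700000005, "item_id": "ad7",'
    ' "event": "exposure"}',
]
first_dataset = process_raw_records(raw_records)
```

Fields that are not in the valid-field list, such as the event type or exposure slot in this sketch, are dropped from the cleaned record; a real pipeline would choose the list to match the features the model needs.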
Of course, in an illustrative example, the processing of the user behavior data in the first storage unit by the computer device 1 further includes: distributing the initial user behavior data generated by the same user terminal in the first storage unit to the same computing node of the Flink computing engine for processing.
In this embodiment, the computer device 1 stores the collected initial user behavior data in Kafka, and the Flink computing engine then pulls the initial user behavior data from Kafka and consumes it. The Flink computing engine comprises multiple computing nodes, each with a configurable degree of parallelism, and processing runs on different computing nodes simultaneously to improve throughput. Because of this concurrency, however, when different computing nodes process the initial user behavior data of the same user terminal at the same time, computations may overlap or be repeated. To solve this problem, the computer device 1 uses the keyBy operator of the Flink computing engine to send the messages generated in parallel by the same user terminal to the same computing node, so that messages that would otherwise be processed out of order are executed sequentially at user granularity, which resolves the concurrency problem.
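The effect of key-based distribution can be sketched in plain Python. This is not the Flink API; it is a stand-in that assumes a fixed number of nodes and uses a CRC32 hash of the user ID to pick a node, both of which are illustrative choices.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # assumed number of parallel computing nodes


def node_for(user_id: str) -> int:
    """Map a user ID to a node index with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_NODES


def route(events):
    """Group events by target node, preserving arrival order within each node."""
    per_node = defaultdict(list)
    for event in events:
        per_node[node_for(event["user_id"])].append(event)
    return per_node


events = [
    {"user_id": "u1", "seq": 1},
    {"user_id": "u2", "seq": 1},
    {"user_id": "u1", "seq": 2},
    {"user_id": "u1", "seq": 3},
]
queues = route(events)
```

Because the node index depends only on the user ID, every event of a given user lands in the same queue in arrival order, so a single node sees that user's events sequentially while different users are still processed in parallel.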
Step S202, merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, where the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period.
Specifically, after the computer device 1 has processed all the initial user behavior data in Kafka to obtain the corresponding first online user behavior data set, the first online user behavior data set is stored in a preset second storage unit, such as the online store Redis (Remote Dictionary Server). Redis is an open-source, log-type, key-value database written in ANSI C that supports network access and can run in memory or persist its data. In this embodiment, Redis is used to store online user behavior data. When Redis already holds part of the online user behavior data, such as the second online user behavior data set, the computer device 1 needs to merge the first online user behavior data set with the second online user behavior data set to obtain the merged third online user behavior data set.
In an illustrative example, the computer device 1 merges the first online user behavior data set with the second online user behavior data set in the preset second storage unit to obtain the third online user behavior data set by: deduplicating all user behavior data of the first and second online user behavior data sets and sorting it by generation time; removing expired user behavior data whose generation time is earlier than a preset time threshold, to obtain the third online user behavior data set; and storing the third online user behavior data set in the second storage unit, replacing its previous contents. Specifically, the computer device 1 combines every piece of user behavior data from the first and second online user behavior data sets, sorts the combined data by generation time, and discards the data generated before the threshold, which ensures that the third online user behavior data set in Redis is both sufficiently large and up to date.
In this embodiment, the computer device 1 pulls the first online user behavior data set and the second online user behavior data set into different computing nodes of the Flink computing engine for merging. The keyBy operator assigns the user behavior data of the same user terminal to the same computing node for merging, so as to avoid concurrency problems.
Because data merging is processed concurrently on multiple nodes, concurrency problems arise easily. For example, computing node A processes message A: it reads the history message A1 from Redis, merges in the current message A, and writes the result back to Redis. At the same time, computing node B processes message B and also reads the history from Redis, but the history it reads does not yet contain message A; when node B writes its result back, message A is overwritten by B, which is a lost update. The computer device 1 therefore uses the keyBy operator of the Flink computing engine to assign the user behavior data of the same user terminal to the same computing node for merging, which effectively avoids this concurrency problem.
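The merge step described above can be sketched as follows. The sketch assumes that two records are duplicates when they share the same user, item, event type, and generation time, and that the time threshold is expressed as a maximum age relative to the current time; the field names and the in-memory dictionary standing in for the second storage unit are hypothetical.

```python
def merge_datasets(first, second, now, max_age_seconds):
    """Deduplicate, sort by generation time, and drop expired records."""
    unique = {}
    for record in list(second) + list(first):
        # Assumed identity of a record: (user, item, event type, generation time).
        key = (record["user_id"], record["item_id"],
               record["event"], record["event_time"])
        unique[key] = record
    threshold = now - max_age_seconds
    merged = sorted(unique.values(), key=lambda r: r["event_time"])
    # Records generated before the threshold are treated as expired.
    return [r for r in merged if r["event_time"] >= threshold]


second_dataset = [
    {"user_id": "u1", "item_id": "a", "event": "click", "event_time": 100},
    {"user_id": "u1", "item_id": "b", "event": "exposure", "event_time": 500},
]
first_dataset = [
    {"user_id": "u1", "item_id": "b", "event": "exposure", "event_time": 500},
    {"user_id": "u1", "item_id": "c", "event": "click", "event_time": 700},
]

third_dataset = merge_datasets(first_dataset, second_dataset,
                               now=1000, max_age_seconds=600)

# Replace the previous contents of the (stand-in) second storage unit.
second_storage = {"u1": third_dataset}
```

In this run the duplicate exposure of item `b` is kept once, the click on item `a` is older than the threshold and is removed, and the remaining records are ordered by generation time.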
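The overwrite scenario can be reproduced deterministically in a small simulation. The dictionary below is only a stand-in for Redis, and the interleaving of the two nodes is written out by hand rather than produced by real threads, so the sketch illustrates the failure mode, not any particular client library.

```python
store = {"u1": ["A1"]}  # stand-in for the Redis history of one user


def read(user):
    return list(store[user])  # copy, like fetching a value over the network


def write(user, history):
    store[user] = history


# Interleaved read-modify-write on two nodes: both read before either writes.
seen_by_node_a = read("u1")
seen_by_node_b = read("u1")
write("u1", seen_by_node_a + ["A"])
write("u1", seen_by_node_b + ["B"])  # overwrites node A's result
lost_update_result = store["u1"]     # message A is missing here

# The same two messages handled one after another on a single node.
store["u1"] = ["A1"]
for message in ["A", "B"]:
    write("u1", read("u1") + [message])
serialized_result = store["u1"]      # both messages survive
```

Routing all messages of one user to one node corresponds to the second half of the sketch: each read sees the result of the previous write, so no message is lost.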
Step S204, performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set.
Step S206, training a model based on the online user feature data set.
After the computer device 1 merges the first online user behavior data set and the second online user behavior data set to obtain the third online user behavior data set, further performing feature extraction on the third online user behavior data set to obtain a corresponding online user feature data set; the online user feature data set is then used to train a preset model, such as a click-through rate prediction model, where the click-through rate prediction model may be an initial model or a mature model trained by online user behavior data.
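Steps S204 and S206 can be illustrated with a minimal sketch of feature extraction followed by online training. The application does not specify the model or the features; logistic regression trained by stochastic gradient descent is used here only as a common choice for click-through rate prediction, and the fields `slot`, `is_member`, and `clicked` are hypothetical.

```python
import math


def extract_features(record):
    """Turn one behavior record into a numeric feature vector (fields are assumed)."""
    return [
        1.0,                                  # bias term
        record["slot"] / 10.0,                # exposure position, scaled
        1.0 if record["is_member"] else 0.0,  # a user identity flag
    ]


def predict(weights, features):
    """Estimated click probability under a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_online(weights, dataset, learning_rate=0.5, epochs=200):
    """Stochastic gradient descent on log loss, one record at a time."""
    for _ in range(epochs):
        for record in dataset:
            x = extract_features(record)
            error = predict(weights, x) - record["clicked"]
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
    return weights


third_dataset = [
    {"slot": 1, "is_member": True, "clicked": 1},
    {"slot": 2, "is_member": True, "clicked": 1},
    {"slot": 8, "is_member": False, "clicked": 0},
    {"slot": 9, "is_member": False, "clicked": 0},
]
weights = train_online([0.0, 0.0, 0.0], third_dataset)
```

Because the update processes one record at a time, the same `train_online` call can be repeated with the existing weights whenever a newly merged data set becomes available, which corresponds to continuing from a mature model rather than an initial one.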
In an illustrative example, the computer device 1 further performs the following steps while processing and storing the online user behavior data set: scanning all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the version number comprises snapshot information of all behavior data of the first online user behavior data set; storing the first version number in the second storage unit; and storing the first online user behavior data set and the first version number in a preset third storage unit. That is, each time it obtains an online user behavior data set, the computer device 1 generates a version number for that data set by taking a snapshot of every piece of user behavior data in it, and then stores the version number together with the online user behavior data set: the version number is stored in the second storage unit, that is, the online storage unit Redis, and also in the third storage unit, that is, an offline storage unit such as a Hive database.
Thus, after the computer device 1 has finished training the model and the online user behavior data set in the second storage unit has been updated again, the computer device 1 further performs the step of: querying the third storage unit for the first online user behavior data set according to the first version number, for use in a regression training process of the model.
That is, the computer device 1 stores the version number of the online user behavior dataset generated each time to Redis, stores the latest online user behavior dataset to Redis, and stores the version number of the online user behavior dataset generated each time and the online user behavior dataset generated each time to Hive. In the subsequent process, the computer device 1 may obtain the corresponding online user behavior data set from the Hive according to the version number in the Redis, thereby implementing online regression training.
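The versioning scheme can be sketched with two dictionaries standing in for Redis and Hive. The application states only that the version number comprises snapshot information of the data set; deriving it from a SHA-256 digest of the serialized records, as done here, is an assumption, as are the function names.

```python
import hashlib
import json

online_store = {}   # stand-in for Redis (second storage unit)
offline_store = {}  # stand-in for Hive (third storage unit)


def make_version(dataset):
    """Derive a version number from a scan of every record (SHA-256 is an assumption)."""
    canonical = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def publish(dataset):
    """Store the version number online, and the data set plus version offline."""
    version = make_version(dataset)
    online_store.setdefault("versions", []).append(version)
    offline_store[version] = list(dataset)
    return version


def load_for_regression(version):
    """Fetch the data set that was current when the given version was generated."""
    return offline_store[version]


v1 = publish([{"user_id": "u1", "item_id": "ad42", "event_time": 100}])
v2 = publish([{"user_id": "u1", "item_id": "ad7", "event_time": 200}])
replayed = load_for_regression(v1)
```

Even after the online store has moved on to later data, the earlier data set remains retrievable by its version number, which is what makes it possible to rerun training on exactly the data a past model saw.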
Fig. 3 is a flow diagram of data processing in an illustrative example of the present application. In this embodiment, the computer device 1 acquires from the data server initial user behavior data, including click behavior and other exposure behavior on recommended data, and stores it in Kafka. A preset Flink computing engine then cleans the initial user behavior data in Kafka to obtain an online user behavior data set, which is stored in both Redis and Hive. Because Redis, as the online storage unit, already holds online user behavior data, the newly obtained online user behavior data set must be merged with the data already in Redis, and the merged result replaces it. The computer device 1 then performs feature extraction on the online user behavior data in Redis through a preset online engine to obtain the corresponding user feature data. Finally, the user feature data is input into a preset click-through rate prediction model for training.
Referring to fig. 4, fig. 4 is a flowchart illustrating an effect of a process of generating training data in an online regression training process according to an exemplary embodiment of the present application.
In this embodiment, the computer device 1 first obtains initial user behavior data from the data server and cleans it with a preset Flink computing engine to obtain an online user behavior data set. It then scans each piece of online user behavior data in the data set and generates a version number. Next, it stores the online user behavior data set and the version number in Redis and in the behavior data Hive; because Redis, as the online storage unit, already holds online user behavior data, the newly obtained online user behavior data must be merged with the data already in Redis, and the merged result replaces it. A preset online engine can then perform feature extraction on the online user behavior data set in Redis to obtain the corresponding user feature data, which is stored in the request record Hive. During regression training, the computer device 1 can query the historical user feature data stored in the request record Hive by version number and splice it into user feature data for the corresponding time period, and can perform feature extraction on the user behavior data in the behavior data Hive to obtain user feature data for the corresponding time period, so as to perform regression training on the click-through rate prediction model.
In summary, the data processing method provided in this embodiment can process the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, where the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset. After the new online user behavior data set is obtained, the online user behavior data in the second storage unit is directly updated and replaced, so that the online user characteristic data set is rapidly extracted for online model training, the quantity and freshness of online training data are guaranteed, and the online training efficiency of the model is improved.
Example two
Fig. 5 schematically shows a block diagram of a data processing apparatus according to the second embodiment of the present application, which may be partitioned into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the second embodiment of the present application. The program modules referred to in the embodiments of the present application refer to a series of computer program instruction segments that can perform specific functions, and the following description will specifically describe the functions of the program modules in the embodiments.
As shown in fig. 5, the data processing apparatus 400 may include a processing module 410, a merging module 420, an extraction module 430, and a training module 440, wherein:
The processing module 410 is configured to process the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, where the first storage unit stores the user behavior data in the current time period.
The merging module 420 is configured to merge the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, where the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period.
The extraction module 430 is configured to perform feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set.
The training module 440 is configured to train a model based on the online user feature data set.
In an exemplary embodiment, the processing module 410 is further configured to: pull each piece of initial user behavior data from the first storage unit; mark each piece of initial user behavior data with a message code in sequence; and extract valid fields from the initial user behavior data marked with message codes to obtain corresponding user behavior data, and record the corresponding user behavior data in the first online user behavior data set. The valid fields include at least one of a user ID, user identity information, behavior data generation time, and recommendation data.
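The cleaning step described above can be sketched as follows. The present application does not specify the form of the message code, so this sketch assumes a simple sequential integer; the field names in `VALID_FIELDS` and the choice to skip records that contain none of the valid fields are likewise assumptions for illustration.

```python
import itertools
from typing import Dict, Iterable, Iterator, List, Optional

# Hypothetical names for the four valid fields listed in the description.
VALID_FIELDS = ("user_id", "identity", "generated_at", "recommended_item")


def clean_records(raw_records: Iterable[Dict[str, object]],
                  counter: Optional[Iterator[int]] = None) -> List[Dict[str, object]]:
    """Mark each pulled record with a message code, then keep only valid fields."""
    if counter is None:
        counter = itertools.count(1)
    cleaned: List[Dict[str, object]] = []
    for raw in raw_records:
        message_code = next(counter)          # every pulled record gets a code
        record = {f: raw[f] for f in VALID_FIELDS if f in raw}
        if not record:                        # assumption: nothing usable, skip
            continue
        record["message_code"] = message_code
        cleaned.append(record)
    return cleaned
```

Because the code is assigned before filtering, gaps in the message codes of the output indicate records that were discarded during cleaning.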
In an exemplary embodiment, the processing module 410 is further configured to: distribute the initial user behavior data generated by the same user side in the first storage unit to the same computing node in the Flink calculation engine for data processing.
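Routing all data from one user side to one computing node is commonly achieved by hashing a stable key. The sketch below illustrates that idea in plain Python and does not use the Flink API; the key name `user_id` and the node count are assumptions for the example.

```python
import hashlib
from typing import Dict, Iterable, List


def node_for_user(user_id: str, num_nodes: int) -> int:
    """Map a user key to a node index with a hash that is stable across runs."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(records: Iterable[Dict[str, object]],
              num_nodes: int) -> Dict[int, List[Dict[str, object]]]:
    """Group records so that every record of one user lands on one node."""
    buckets: Dict[int, List[Dict[str, object]]] = {n: [] for n in range(num_nodes)}
    for record in records:
        node = node_for_user(str(record["user_id"]), num_nodes)
        buckets[node].append(record)
    return buckets
```

A cryptographic hash is used here only because Python's built-in `hash` for strings varies between interpreter runs; any stable hash would serve the same purpose.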
In an exemplary embodiment, the merging module 420 is further configured to: perform a deduplication operation on all user behavior data of the first online user behavior data set and the second online user behavior data set, and sort the data by generation time; clear expired user behavior data whose generation time is earlier than a preset time threshold to obtain the third online user behavior data set; and store the third online user behavior data set in the second storage unit, replacing the existing data.
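The merge described above (deduplicate, sort by generation time, clear expired data, replace) can be sketched as follows. The deduplication key of user, item, and generation time is an assumption, as are the field names; a dictionary stands in for the second storage unit.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def merge_behavior(first_set: Iterable[Record],
                   second_set: Iterable[Record],
                   time_threshold: int) -> List[Record]:
    """Build the third data set from the fresh (first) and stored (second) sets."""
    unique: Dict[Tuple[object, object, object], Record] = {}
    # Stored records go first so that a fresh duplicate overwrites the old copy.
    for record in list(second_set) + list(first_set):
        key = (record["user_id"], record["item_id"], record["generated_at"])
        unique[key] = record
    # Records generated before the threshold are treated as expired.
    kept = [r for r in unique.values() if r["generated_at"] >= time_threshold]
    kept.sort(key=lambda r: r["generated_at"])
    return kept


def replace_store(store: Dict[str, List[Record]], key: str,
                  third_set: List[Record]) -> None:
    """Overwrite the stored data with the merged result."""
    store[key] = third_set
```

For example, `replace_store(store, "u1", merge_behavior(fresh, store.get("u1", []), time_threshold))` updates one user's entry in a single step.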
In an exemplary embodiment, the merging module 420 is further configured to: scan all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the first version number comprises snapshot information of all behavior data of the first online user behavior data set; store the first version number in the second storage unit; and store the first online user behavior data set and the first version number in a preset third storage unit.
In an exemplary embodiment, the training module 440 is further configured to: query the first online user behavior data set from the third storage unit according to the first version number, for use in executing a regression training process of the model.
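The version-number mechanism can be sketched as follows. The present application states only that the version number comprises snapshot information of the data set; representing it as a time-period label plus a content hash is an assumption made here, and dictionaries stand in for the second and third storage units.

```python
import hashlib
import json
from typing import Dict, List

Record = Dict[str, object]


def make_version(records: List[Record], period: str) -> str:
    """Derive a version string from the period label and the set's content."""
    # Sort the serialized records so the version does not depend on input order.
    canonical = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()[:12]
    return f"{period}-{digest}"


def publish(records: List[Record], period: str,
            second_store: Dict[str, object],
            third_store: Dict[str, List[Record]]) -> str:
    """Record the version online and archive the data set under that version."""
    version = make_version(records, period)
    second_store["version"] = version       # online store keeps the latest version
    third_store[version] = list(records)    # offline store keeps every snapshot
    return version


def load_for_regression(version: str,
                        third_store: Dict[str, List[Record]]) -> List[Record]:
    """Fetch the archived data set that corresponds to a version number."""
    return third_store[version]
```

Because each snapshot is archived under its own version, a later update of the online store does not affect the data retrieved for regression training.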
Example three
Fig. 6 schematically shows a hardware architecture diagram of a computer device 1 suitable for implementing the data processing method according to the third embodiment of the present application. In the present embodiment, the computer device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance. For example, it may be a rack server, a blade server, or a tower server (including an independent server or a server cluster composed of a plurality of servers) with a gateway function. As shown in fig. 6, the computer device 1 includes at least, but is not limited to, a memory 510, a processor 520, and a network interface 530, which may be communicatively linked to each other by a system bus. Wherein:
The memory 510 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 510 may be an internal storage module of the computer device 1, such as a hard disk or a memory of the computer device 1. In other embodiments, the memory 510 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 1. Of course, the memory 510 may also comprise both an internal storage module of the computer device 1 and an external storage device thereof. In this embodiment, the memory 510 is generally used for storing an operating system installed in the computer device 1 and various types of application software, such as program codes of the data processing method. In addition, the memory 510 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 520 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other model training chip in some embodiments. The processor 520 is generally used for controlling the overall operation of the computer device 1, such as performing control and processing related to data interaction or communication with the computer device 1. In this embodiment, processor 520 is configured to execute program codes stored in memory 510 or process data.
Network interface 530 may include a wireless network interface or a wired network interface, and network interface 530 is typically used to establish communication links between computer device 1 and other computer devices. For example, the network interface 530 is used to connect the computer apparatus 1 with an external terminal through a network, establish a data transmission channel and a communication link between the computer apparatus 1 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), or Wi-Fi.
It should be noted that fig. 6 only shows a computer device having components 510-530, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the program code of the data processing method stored in the memory 510 may also be divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 520) to implement the embodiments of the present application.
Example four
The present embodiment also provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of:
processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period; merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of a current time period; performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set; training a model based on the online user feature dataset.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in the computer device, for example, the program codes of the data processing method in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general-purpose computing device, and may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein, or the modules or steps may be separately fabricated into individual integrated circuit modules, or several of them may be fabricated into a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application. All equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present application.

Claims (10)

1. A method of data processing, the method comprising:
processing user behavior data in a first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period;
merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period;
performing feature extraction on each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user feature data set;
training a model based on the online user feature dataset.
2. The data processing method of claim 1, wherein the processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set comprises:
pulling each piece of initial user behavior data from the first storage unit;
sequentially marking each piece of initial user behavior data with message codes;
and extracting valid fields from the initial user behavior data marked with message codes to obtain corresponding user behavior data, and recording the corresponding user behavior data in the first online user behavior data set.
3. The data processing method of claim 2, wherein the valid fields include at least one of a user ID, user identity information, behavior data generation time, and recommendation data.
4. The data processing method of any of claims 1-3, wherein the processing the user behavior data in the first storage unit further comprises:
distributing the initial user behavior data generated by the same user side in the first storage unit to the same computing node in the Flink calculation engine for data processing.
5. The data processing method of claim 1, wherein merging the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, comprises:
performing a deduplication operation on all user behavior data of the first online user behavior data set and the second online user behavior data set, and sorting the data by generation time;
clearing expired user behavior data whose generation time is earlier than a preset time threshold to obtain the third online user behavior data set;
and storing the third online user behavior data set in the second storage unit, replacing the existing data.
6. The data processing method of any of claims 1-5, wherein the method further comprises:
scanning all user behavior data of the first online user behavior data set to generate a corresponding first version number, wherein the version number comprises snapshot information of all behavior data of the first online user behavior data set;
storing the first version number to the second storage unit;
and storing the first online user behavior data set and the first version number into a preset third storage unit.
7. The data processing method of claim 6, wherein after the training of the model is completed and the online user behavior data set in the second storage unit is updated again, the method further comprises:
querying the first online user behavior data set from the third storage unit according to the first version number, for use in executing a regression training process of the model.
8. A data processing apparatus, characterized in that the apparatus comprises:
the processing module is used for processing the user behavior data in the first storage unit to obtain a corresponding first online user behavior data set, wherein the first storage unit stores the user behavior data in the current time period;
the merging module merges the first online user behavior data set with a second online user behavior data set in a preset second storage unit to obtain a third online user behavior data set, wherein the second online user behavior data set is an online user behavior data set generated in a previous time period of the current time period;
the extraction module is used for extracting the characteristics of each piece of user behavior data of the third online user behavior data set to obtain a corresponding online user characteristic data set;
and the training module is used for training a model based on the online user characteristic data set.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory having stored thereon a computer program executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the data processing method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executable by at least one processor to cause the at least one processor to perform the steps of the data processing method according to any one of claims 1 to 7.
CN202111108679.2A 2021-09-22 2021-09-22 Data processing method and device and computer equipment Active CN113836411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111108679.2A CN113836411B (en) 2021-09-22 2021-09-22 Data processing method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111108679.2A CN113836411B (en) 2021-09-22 2021-09-22 Data processing method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN113836411A true CN113836411A (en) 2021-12-24
CN113836411B CN113836411B (en) 2024-08-27

Family

ID=78960391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111108679.2A Active CN113836411B (en) 2021-09-22 2021-09-22 Data processing method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN113836411B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030130899A1 (en) * 2002-01-08 2003-07-10 Bruce Ferguson System and method for historical database training of non-linear models for use in electronic commerce
CN111177121A (en) * 2019-12-26 2020-05-19 平安普惠企业管理有限公司 Order data feedback method and device, computer equipment and storage medium
CN111858158A (en) * 2020-06-19 2020-10-30 北京金山云网络技术有限公司 Data processing method and device and electronic equipment
WO2021151360A1 (en) * 2020-08-28 2021-08-05 平安科技(深圳)有限公司 Data leak warning method and apparatus, device, and computer readable storage medium
CN112070226A (en) * 2020-09-02 2020-12-11 北京百度网讯科技有限公司 Training method, device and equipment of online prediction model and storage medium
US20210248513A1 (en) * 2020-09-02 2021-08-12 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training online prediction model, device and storage medium
CN112612768A (en) * 2020-12-11 2021-04-06 上海哔哩哔哩科技有限公司 Model training method and device
CN112613938A (en) * 2020-12-11 2021-04-06 上海哔哩哔哩科技有限公司 Model training method and device and computer equipment
CN112307762A (en) * 2020-12-24 2021-02-02 完美世界(北京)软件科技发展有限公司 Search result sorting method and device, storage medium and electronic device
CN112597395A (en) * 2020-12-28 2021-04-02 上海众源网络有限公司 Object recommendation method, device, equipment and storage medium
CN113268645A (en) * 2021-05-07 2021-08-17 北京三快在线科技有限公司 Information recall method, model training method, device, equipment and storage medium
CN113220657A (en) * 2021-05-14 2021-08-06 上海哔哩哔哩科技有限公司 Data processing method and device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
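Cen Kailun et al.: "Design and Implementation of a Spark-Based Real-Time E-commerce Recommendation System under Big Data", Modern Computer (Professional Edition), no. 24, 25 August 2016 (2016-08-25) *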

Also Published As

Publication number Publication date
CN113836411B (en) 2024-08-27

Similar Documents

Publication Publication Date Title
CN109582876B (en) Tourist industry user portrait construction method and device and computer equipment
CN111339073A (en) Real-time data processing method and device, electronic equipment and readable storage medium
CN112613938B (en) Model training method and device and computer equipment
CN113220657B (en) Data processing method and device and computer equipment
CN108334641B (en) Method, system, electronic equipment and storage medium for collecting user behavior data
CN103502899A (en) Dynamic predictive modeling platform
CN110781372B (en) Method and device for optimizing website, computer equipment and storage medium
US11809455B2 (en) Automatically generating user segments
CN114663198A (en) Product recommendation method, device and equipment based on user portrait and storage medium
CN111080417A (en) Processing method for improving booking smoothness rate, model training method and system
CN110717597A (en) Method and device for acquiring time sequence characteristics by using machine learning model
US20140046708A1 (en) Systems and methods for determining a cloud-based customer lifetime value
CN112561565A (en) User demand identification method based on behavior log
CN111861605A (en) Business object recommendation method
CN110807050B (en) Performance analysis method, device, computer equipment and storage medium
CN113139826A (en) Method and device for determining distribution authority of advertisement space and computer equipment
CN113792039B (en) Data processing method and device, electronic equipment and storage medium
CN113688022A (en) Browser performance monitoring method, device, equipment and medium
CN111127057B (en) Multi-dimensional user portrait recovery method
CN111737080A (en) Abnormal transaction suspicion monitoring method and device, computer equipment and storage medium
CN115187330A (en) Product recommendation method, device, equipment and medium based on user label
CN113836411B (en) Data processing method and device and computer equipment
CN112560938A (en) Model training method and device and computer equipment
CN113159877B (en) Data processing method, device, system and computer readable storage medium
CN115271769A (en) Method, device and equipment for estimating delivery effect data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant