CN114528935A - Incremental model training method and device based on streaming data and electronic equipment - Google Patents

Publication number
CN114528935A
Authority
CN
China
Prior art keywords
model
training
incremental
real
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210150386.9A
Other languages
Chinese (zh)
Inventor
田大钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingdao Zhilian Beijing Technology Co ltd
Original Assignee
Dingdao Zhilian Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingdao Zhilian Beijing Technology Co ltd filed Critical Dingdao Zhilian Beijing Technology Co ltd
Priority to CN202210150386.9A priority Critical patent/CN114528935A/en
Publication of CN114528935A publication Critical patent/CN114528935A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Abstract

The application provides an incremental model training method and device based on streaming data, and an electronic device. The method comprises: acquiring raw stream data; extracting real-time features from the raw stream data and adding them to a full-scale feature library; determining a primary model and a secondary model from an initial training model, and alternately training the primary model and the secondary model multiple times in a micro-batch training mode based on the real-time features to obtain an incremental training model, wherein in each round of alternate training one of the two models undergoes incremental training while the other provides the online inference service; and correcting the incremental training model based on the full-scale feature library accumulated over a historical time period to obtain a corrected training model. By alternating micro-batch incremental training and online inference between the two models, the method provides a real-time, uninterrupted online inference service and lets the model track and respond to changes in real user behavior more promptly.

Description

Incremental model training method and device based on streaming data and electronic equipment
Technical Field
The application relates to the technical field of machine learning, in particular to an incremental model training method and device based on stream data and electronic equipment.
Background
Incremental model training is a machine learning method that is currently attracting wide attention. In it, input data is continuously used to extend the knowledge of an existing model, that is, to train the model further, which makes it a dynamic learning technique. In a traditional machine learning training scheme, by contrast, feature engineering is performed on the data offline to generate a full-scale feature library, a machine learning algorithm is then trained on the full-scale feature data, and the resulting model is finally deployed online for inference.
Most existing incremental model training methods operate in an offline scenario: after a model has been trained on part or all of the full-scale feature data, it is retrained and updated with new feature data. Although this achieves incremental training, it does not meet the requirements of real business scenarios well in terms of the timeliness of model training, the evaluation of model error, and the like. Moreover, in the streaming data scenarios that are now common in business systems, existing incremental learning techniques fit the patterns in the streaming data poorly and make timely model updates difficult. As a result, the model cannot reflect a user's short-term behavioral interests, and user experience and business metrics cannot be improved.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a method and an apparatus for incremental model training based on streaming data, and an electronic device, which use a dual-model architecture to perform incremental model training, model updating, and online inference on the basis of an original model, so as to solve the above problems in the prior art.
In a first aspect, an embodiment of the present application provides a method for incremental model training based on streaming data, where the method includes: acquiring raw stream data; extracting real-time features from the raw stream data and adding them to a full-scale feature library; determining a primary model and a secondary model from an initial training model, and alternately training the primary model and the secondary model multiple times in a micro-batch training mode based on the real-time features to obtain an incremental training model, wherein in each round of alternate training one of the primary model and the secondary model undergoes incremental training while the other provides the online inference service, and the model that undergoes incremental training in each round differs from the one trained in the previous round; and correcting the incremental training model based on the full-scale feature library accumulated over a historical time period to obtain a corrected training model.
Optionally, correcting the incremental training model based on the full-scale feature library accumulated over the historical time period to obtain a corrected training model includes: obtaining an evaluation result for the incremental training model from user business metrics, the user business metrics comprising click-through rate, browsing duration, playback duration, new-user rate, retention rate, daily active users, monthly active users, and conversion rate; and correcting the incremental training model according to the evaluation result and the full-scale feature library accumulated over the historical time period to obtain the corrected training model.
In this implementation, correcting the model with the full-scale features from the historical time period removes the influence of short-term user behavioral interests on the model, so that the model reflects the underlying data patterns, and hence the patterns of change in user behavior, more comprehensively and accurately.
Optionally, alternately training the primary model and the secondary model multiple times in a micro-batch training mode based on the real-time features to obtain an incremental training model includes: presetting a trigger condition according to the actual business scenario, the trigger condition being whether the volume of real-time feature data has reached a preset feature data volume; and determining whether the volume of real-time feature data satisfies the preset trigger condition and, if so, issuing a micro-batch training instruction and performing micro-batch training and updating on the primary (or secondary) model to obtain the trained incremental training model.
In this implementation, the trigger-based design lets the model perform incremental training on its own and learn new features whenever the volume of real-time feature data reaches the preset feature data volume, so that the model is trained and updated on demand.
Optionally, the initial training model is a model trained offline on the full-scale feature data of the full-scale feature library; and the primary model and the secondary model are trained alternately until the incremental training process of the model ends.
Optionally, the online inference service deploys the incremental training model on a server or in the cloud as an online service. During its interaction with the user in an intelligent service scenario, the business system invokes the online inference service and passes in context information about the user; the online inference service computes an inference result with the incremental training model and returns it to the business system, which processes the result and returns it to the user side.
In this implementation, the incremental training model serves real-time online inference, so that the online inference service runs without interruption and the inference model is updated in step with the incremental training model.
Optionally, extracting real-time features from the raw stream data and adding them to a full-scale feature library includes: extracting the real-time features of the raw stream data with a big-data real-time computing technology, the real-time computing technology being the Flink real-time computing framework; and adding the features extracted in real time to the full-scale feature library.
Optionally, the features extracted in real time are stored in a message queue for incremental model training, the message queue being the Kafka distributed publish-subscribe messaging system.
In this implementation, the feature message queue enables trigger-based model training: whenever the volume of real-time feature data reaches the preset feature data volume, the model performs incremental training on its own and learns the new features, so that the model is trained and updated on demand.
In a second aspect, an embodiment of the present application further provides a device for incremental model training based on streaming data, where the device includes: an acquisition module, configured to acquire raw stream data; an extraction module, configured to extract real-time features from the raw stream data and add them to a full-scale feature library; a training module, configured to determine a primary model and a secondary model from an initial training model and to alternately train the primary model and the secondary model multiple times in a micro-batch training mode based on the real-time features to obtain an incremental training model, wherein in each round of alternate training one of the primary model and the secondary model undergoes incremental training while the other provides the online inference service, and the model that undergoes incremental training in each round differs from the one trained in the previous round; an inference module, configured to deploy the incremental training model, updated in real time, as an online inference service, so that real-time online inference always uses the latest model; and a correction module, configured to correct the incremental training model based on the full-scale feature library accumulated over a historical time period to obtain a corrected training model.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor executes, when running the computer program, the steps in any implementation manner of the incremental model training method based on streaming data described above.
In a fourth aspect, an embodiment of the present application further provides a readable storage medium, where a computer program is stored in the readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps in any implementation manner of the incremental model training method based on streaming data.
In summary, the present application provides an incremental model training method and device based on streaming data, and an electronic device. Real-time features are extracted from the raw stream data with a real-time computing technology, and trigger-based incremental training is performed on the initial training model with a micro-batch training technique, so that the model is trained on demand. Alternating between a primary model and a secondary model substantially increases the speed of model updating and deployment and provides a real-time, uninterrupted online inference service, so that changes in real user behavior are tracked and responded to more promptly. The model is then corrected in light of its online evaluation results, so that it reflects data patterns over a longer time span, avoids forgetting historical features, and achieves lifelong learning.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flowchart of a method for incremental model training based on streaming data according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a dual-model rotation process in an incremental model training method based on streaming data according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of another incremental model training method based on streaming data according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an incremental model training apparatus based on streaming data according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device for incremental model training based on streaming data according to an embodiment of the present application.
Reference numerals: 400 - model training device; 410 - acquisition module; 420 - extraction module; 430 - training module; 440 - inference module; 450 - correction module; 500 - model training electronic device; 510 - processor; 520 - memory; 530 - bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "primary," "secondary," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without any creative effort belong to the protection scope of the embodiments of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for incremental model training based on streaming data according to an embodiment of the present application, including the following steps:
S1: Acquire raw stream data.
Raw stream data is a sequence of ordered, high-volume data items that arrive rapidly and continuously; a data stream can generally be viewed as a dynamic data set that grows without bound over time.
Optionally, the raw stream data may be at least one of: log files generated by users of mobile or Web applications, online shopping data, in-game player activity, social networking site information, data from financial trading floors or geospatial services, and telemetry from devices or instruments connected in a data center. Such data is used in fields such as network monitoring, sensor networks, aerospace, meteorological measurement and control, and financial services. For example, a company can analyze raw stream data to gain insight into its business and customer activity, such as service usage (for metering/billing), server activity, website clicks, and the geographic locations of equipment, personnel, and physical objects, and so respond quickly to new situations. As another example, a company may continuously analyze social media streams to track changes in public opinion of its brands and products and react promptly when necessary.
S2: Extract real-time features from the raw stream data and add them to a full-scale feature library.
The full-scale feature library stores all features extracted over the historical time period and is used for subsequent model correction, which prevents historical features from being forgotten after long-term learning of new features. In addition to being stored in the full-scale feature library, the extracted real-time features are sent to the Kafka message queue for storage, for use by the downstream training module.
Optionally, the real-time features are extracted from the raw stream data with a big-data real-time computing technology, namely the Flink real-time computing framework; the features extracted in real time are stored in a message queue, namely the Kafka distributed publish-subscribe messaging system.
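The dual write described in S2 can be illustrated with a minimal, hedged sketch. It is not the Flink or Kafka implementation: an in-memory list stands in for the full-scale feature library, a standard-library queue stands in for the Kafka topic, and the class name `FeaturePipeline`, the event fields, and the extracted feature set are all hypothetical.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class FeaturePipeline:
    """Writes every extracted feature record to two sinks (illustrative stand-ins)."""
    # Stand-in for the full-scale feature library (assumption: a plain list).
    full_library: list = field(default_factory=list)
    # Stand-in for the Kafka message queue (assumption: an in-process queue).
    feature_queue: queue.Queue = field(default_factory=queue.Queue)

    def extract(self, raw_event: dict) -> dict:
        # Hypothetical feature set derived from one raw stream event.
        return {
            "user_id": raw_event["user_id"],
            "item_id": raw_event["item_id"],
            "is_purchase": 1 if raw_event["action"] == "purchase" else 0,
            "dwell_seconds": float(raw_event.get("dwell_seconds", 0.0)),
        }

    def process(self, raw_event: dict) -> None:
        features = self.extract(raw_event)
        self.full_library.append(features)  # kept for later model correction
        self.feature_queue.put(features)    # consumed by the training module


pipeline = FeaturePipeline()
for event in [
    {"user_id": "u1", "item_id": "i9", "action": "click", "dwell_seconds": 12},
    {"user_id": "u1", "item_id": "i9", "action": "purchase"},
]:
    pipeline.process(event)
```

In a real deployment the two sinks would be independent systems; the sketch only shows that each feature record reaches both the long-term store and the training queue.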
In some embodiments, the feature extraction may include data cleaning and sampling of the raw stream data. Data cleaning removes dirty data, such as untrustworthy samples; data sampling, using random sampling, stratified sampling, or similar methods, addresses the imbalance between positive and negative samples in classification problems.
In some embodiments, the feature extraction may also include preprocessing of the raw stream data, such as normalization, discretization, binarization, dummy coding, and hashing.
In some embodiments, the feature extraction may also include feature selection on the raw stream data to remove redundant and noisy features. The feature selection method may be filter feature selection, which scores each individual feature by its correlation with the target value and keeps the top-ranked features; wrapper feature selection, such as recursive feature elimination, which treats feature selection as a search over feature subsets and evaluates each candidate subset with a model; or embedded feature selection, which derives feature importance from the model itself, for example through regularization.
In some embodiments, the raw stream data may be subjected to online dimension reduction, for example by Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA). Reducing the dimensionality of the data in advance, according to actual business requirements, speeds up online real-time feature extraction and model training, so that the model is updated in a timely manner.
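Several of the preprocessing and filter-selection options listed above can be sketched in plain Python. The sketch is illustrative only: the function names are hypothetical, Pearson correlation is chosen here as one possible correlation measure for filter selection, and MD5 is used merely as a stable hash for bucketing.

```python
import hashlib


def min_max_normalize(values):
    """Scale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def dummy_encode(values):
    """One indicator column per category, categories in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def hash_bucket(value, num_buckets):
    """Stable hashing of a string feature into a fixed number of buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def top_k_features(columns, target, k):
    """Filter selection: keep the k columns most correlated with the target."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]


columns = {"dwell": [5.0, 20.0, 35.0, 50.0], "noise": [1.0, -1.0, 1.0, -1.0]}
selected = top_k_features(columns, target=[0, 0, 1, 1], k=1)
```

Here `selected` keeps only the `dwell` column, because the `noise` column has zero correlation with the target.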
Taking online shopping data generated by users of a mobile or Web application as an example of the raw stream data, extracting its real-time features with a real-time computing technology may include cleaning the raw stream data, for example by removing users who have made no purchase within 30 days. It may also include computing, for each user, the total number of clicks, favorites, add-to-cart actions, and purchases per commodity category; the times at which each user makes purchases; the popularity of each commodity or its rank within its category; the total number of interactions with each commodity; and the like.
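The shopping-data example can be sketched as a small batch function. This is an assumption-laden illustration rather than a streaming job: the event schema (`user_id`, `category`, `action`, `time`) and the function name `build_user_features` are hypothetical, and a real system would compute these counts incrementally in the stream processor.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def build_user_features(events, now, window_days=30):
    """Count actions per (category, action) for each user, then drop users
    with no purchase inside the cleaning window."""
    cutoff = now - timedelta(days=window_days)
    counts = defaultdict(Counter)
    recent_buyers = set()
    for e in events:
        counts[e["user_id"]][(e["category"], e["action"])] += 1
        if e["action"] == "purchase" and e["time"] >= cutoff:
            recent_buyers.add(e["user_id"])
    # Cleaning rule from the example: keep only users who bought within the window.
    return {user: dict(c) for user, c in counts.items() if user in recent_buyers}


now = datetime(2022, 2, 17)
events = [
    {"user_id": "u1", "category": "books", "action": "click", "time": now - timedelta(days=40)},
    {"user_id": "u1", "category": "books", "action": "click", "time": now - timedelta(days=3)},
    {"user_id": "u1", "category": "books", "action": "purchase", "time": now - timedelta(days=5)},
    {"user_id": "u2", "category": "toys", "action": "click", "time": now - timedelta(days=1)},
]
features = build_user_features(events, now)
```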
S3: Determine a primary model and a secondary model from the initial training model, and alternately train the primary model and the secondary model multiple times in a micro-batch training mode based on the real-time features to obtain an incremental training model. In each round of alternate training, one of the primary model and the secondary model undergoes incremental training while the other provides the online inference service, and the model that undergoes incremental training in each round differs from the one trained in the previous round.
The initial training model is a model trained offline on the full-scale feature data of the full-scale feature library. Micro-batch training trains the model on a small subset of the training samples at a time; in the method provided by the present application, the batch size of this subset is determined by the preset feature data volume. Online inference differs from model training: training learns a capability from existing data, whereas inference applies that capability, in simplified form, so that it runs quickly and efficiently on unseen data and produces the expected result. The online inference service uses the trained incremental training model as the inference model and provides online inference to users.
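As a hedged illustration of a single micro-batch update, the sketch below applies one gradient step of logistic regression to an existing weight vector. The application does not specify the model type or optimizer; logistic regression, the learning rate, and the function names are assumptions chosen only to make the idea concrete.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def micro_batch_update(weights, batch, lr=0.1):
    """Return new weights after one averaged gradient step on a micro-batch.

    `batch` is a list of (feature_vector, label) pairs with labels in {0, 1}.
    The existing weights are the starting point, so knowledge already in the
    model is extended rather than retrained from scratch.
    """
    grad = [0.0] * len(weights)
    for x, y in batch:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    n = len(batch)
    return [w - lr * g / n for w, g in zip(weights, grad)]


new_weights = micro_batch_update([0.0, 0.0], [([1.0, 2.0], 1), ([1.0, -1.0], 0)])
```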
Optionally, alternately training the primary model and the secondary model multiple times in a micro-batch training mode based on the real-time features to obtain an incremental training model includes: presetting a trigger condition according to the actual business scenario, the trigger condition being whether the volume of real-time feature data has reached a preset feature data volume; and determining whether the volume of real-time feature data satisfies the preset trigger condition and, if so, issuing a micro-batch training instruction and performing micro-batch training and updating on the primary (or secondary) model to obtain the trained incremental training model.
Optionally, the primary model and the secondary model are trained alternately until the incremental training process of the model ends. Here the incremental training process is ended by manual intervention; for example, thresholds may be set so that incremental training stops and the incremental training model is corrected when certain business metrics fall below or rise above the set thresholds.
In this implementation, trigger-based micro-batch incremental learning lets the model be trained and updated entirely on demand, which reduces manual operation, automates incremental model training, and fits the patterns in the stream data better. The model is therefore updated in a timely manner and better reflects users' short-term behavioral interests.
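The trigger condition can be sketched as a buffer that fires a training callback once the preset feature data volume is reached. The class name `MicroBatchTrigger` and the callback interface are hypothetical; in the described system the buffer would be fed from the message queue.

```python
class MicroBatchTrigger:
    """Buffers feature records and fires training when the preset volume is reached."""

    def __init__(self, batch_size, train_fn):
        self.batch_size = batch_size  # preset feature data volume
        self.train_fn = train_fn      # called with each full micro-batch
        self.buffer = []
        self.rounds = 0

    def on_feature(self, record):
        """Add one record; return True if this record triggered a training round."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            self.train_fn(batch)
            self.rounds += 1
            return True
        return False


seen_batches = []
trigger = MicroBatchTrigger(batch_size=3, train_fn=seen_batches.append)
fired = [trigger.on_feature({"id": i}) for i in range(7)]
```

With a batch size of 3 and seven incoming records, training fires on the third and sixth records and one record remains buffered.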
Optionally, the online inference service deploys the incremental training model on a server or in the cloud as an online service. During its interaction with the user in an intelligent service scenario, the business system invokes the online inference service and passes in context information about the user; the online inference service computes an inference result with the incremental training model and returns it to the business system, which processes the result and returns it to the user side.
In this implementation, alternating between the primary model and the secondary model allows online model training and real-time online inference to proceed simultaneously, and the model used for online inference is updated in step with model training. This solves the problem that the inference service is unavailable while a model is being copied, and greatly increases the speed of model updating and deployment.
Taking online shopping data generated by users of a mobile or Web application as the raw stream data, the method extracts real-time features related to the online shopping data, such as a user's browsing duration, number of clicks, total number of purchases, number of favorites, and number of add-to-cart actions for certain commodities. Micro-batch training is then used to incrementally train and update the primary model and the secondary model and to serve online inference, so that the incremental training model can infer each user's demand or preference for each commodity from the user's historical behavior. The incremental training model, updated in real time, is deployed as the online inference service. During the interaction between the business system and a user in an intelligent service scenario such as intelligent recommendation, the online inference service is invoked and passed query information about that user, such as commodity names the user has searched for or commodity pages the user has browsed; the business system then processes the inference result of the incremental training model and returns it to the user side, completing the intelligent recommendation.
S4: Correct the incremental training model based on the full-scale feature library accumulated over the historical time period to obtain a corrected training model.
Optionally, correcting the incremental training model based on the full-scale feature library accumulated over the historical time period to obtain a corrected training model includes: obtaining an evaluation result for the incremental training model from user business metrics, the user business metrics comprising click-through rate, browsing duration, playback duration, new-user rate, retention rate, daily active users, monthly active users, and conversion rate; and correcting the incremental training model according to the evaluation result and the full-scale feature library accumulated over the historical time period to obtain the corrected training model.
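One way the evaluation step might feed the correction decision is sketched below. The application does not define how the evaluation result is computed; the two metrics shown, the relative-drop rule, the 10% tolerance, and the function names are all assumptions for illustration.

```python
def evaluate(impressions, clicks, conversions):
    """Compute two of the listed business metrics from raw counts."""
    click_rate = clicks / impressions if impressions else 0.0
    conversion_rate = conversions / clicks if clicks else 0.0
    return {"click_rate": click_rate, "conversion_rate": conversion_rate}


def needs_correction(current, baseline, tolerance=0.1):
    """True if any metric fell more than `tolerance` (relative) below its baseline."""
    return any(current[name] < baseline[name] * (1 - tolerance) for name in baseline)


current = evaluate(impressions=1000, clicks=50, conversions=5)
baseline = {"click_rate": 0.06, "conversion_rate": 0.1}
correct_now = needs_correction(current, baseline)
```

In this example the click rate has dropped from 0.06 to 0.05, which exceeds the assumed 10% tolerance, so correction would be requested.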
The user business metrics can be set according to different business requirements. The historical time period may be a long interval, such as the most recent several months, half year, or even several years, and can be adjusted flexibly according to actual requirements. Forgetting of historical features means that previously learned content is almost completely forgotten once new knowledge is learned; this is the catastrophic forgetting problem in deep learning. The correction method is similar in spirit to transfer learning: starting from the trained incremental training model, the model parameters are adjusted according to the data patterns of the forgotten historical features. This completes the correction, and the corrected model better reflects data patterns over a longer time span.
In this implementation, correcting the model with the full-scale features accumulated in the full-scale feature library over the historical time period removes the influence of short-term user behavioral interests on the model. The corrected model reflects both recent data patterns and data patterns over a longer time span, which effectively addresses the forgetting of historical features in deep learning, reflects the patterns of change in user behavior more comprehensively and accurately, and achieves lifelong learning for the model.
For example, suppose users are to be segmented on the basis of online shopping stream data. If the recently learned features are mainly behavioral features of highly active users, evaluating the model against the user business metrics may reveal that the current model's response to the behavioral patterns of less active users is fading. Model correction is then required: the current model can be corrected with the online shopping features from the most recent three months, removing the influence of short-term user behavioral interests on the model.
According to the incremental model training method based on streaming data, real-time features are extracted from the raw stream data with a real-time computing technology, and trigger-based incremental training is performed on the initial training model with a micro-batch training technique, so that the model is trained on demand. Alternating between a primary model and a secondary model greatly increases the speed of model updating and deployment and provides a real-time, uninterrupted online inference service, so that changes in real user behavior are tracked and responded to more promptly. The model is then corrected in light of its online evaluation results, so that it reflects data patterns over a longer time span, avoids forgetting historical features, and achieves lifelong learning.
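A correction pass of this kind can be sketched as fine-tuning the incremental model's parameters on samples drawn from the full-scale feature library. The sketch assumes a linear model trained by stochastic gradient descent on squared error; the application does not prescribe the model, loss, learning rate, or number of passes, and the function names are hypothetical.

```python
def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def mse(weights, data):
    """Mean squared error of a linear model over (features, target) pairs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def correct_model(weights, history, epochs=20, lr=0.05):
    """Fine-tune existing weights on historical samples (does not retrain from scratch)."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in history:
            err = predict(w, x) - y
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical history drawn from the full-scale feature library.
history = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)]
drifted = [3.0, 0.0]  # incremental model skewed toward recent data (assumed)
corrected = correct_model(drifted, history)
```

Starting from the drifted parameters rather than from scratch mirrors the transfer-learning flavor of the correction: the recent knowledge is kept as the starting point and only adjusted toward the long-term data.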
Referring to fig. 2, fig. 2 is a schematic diagram of a dual-model rotation process in an incremental model training method based on streaming data according to an embodiment of the present application, including the following steps:
S21: Acquire the real-time features extracted from the original stream data.
Optionally, the acquired real-time features are added to the full-scale feature library and, at the same time, sent to the message queue Kafka for storage, for use by the downstream training module.
S22: Perform incremental model training on the main model by micro-batch training to obtain incremental training model 1.
Optionally, using a micro-batch training technique, triggered incremental model training is performed on the main model based on the real-time features, so that the model is trained and updated on demand.
S23: Deploy the secondary model as the online inference service.
Optionally, before S22 and S23, a main model and a secondary model are determined from an initial training model, where the initial training model is a model trained offline in advance on the full-scale feature data of the full-scale feature library. The method of the present application does not limit which of the main model and the secondary model is trained first: in other embodiments, S22 may perform incremental model training on the secondary model by micro-batch training, in which case S23 deploys the main model as the online inference service.
S24: Provide the online inference service with the main model after micro-batch training.
Optionally, after micro-batch training, the main model is updated to incremental training model 1, and incremental training model 1 is used to provide the online inference service.
S25: Copy incremental training model 1 to update the secondary model for the next round of micro-batch training.
Optionally, after the main model replaces the secondary model in providing the online inference service, incremental training model 1 is copied to update the secondary model, and the updated secondary model is used for the next round of micro-batch training. While the secondary model is being updated, the main model handles the online inference service, which solves the problem that the online inference service is unavailable while a model is being copied.
S26: Perform incremental model training on the secondary model by micro-batch training to obtain incremental training model 2.
S27: Copy incremental training model 2 to update the main model for the next round of micro-batch training.
In this dual-model alternation process, micro-batch training and the online inference service alternate between the main model and the secondary model, so that the online model can be trained on demand while the online inference service continues in real time without interruption. This speeds up model updating and deployment, keeps the model used for online inference in step with each incremental training update, and solves the problem that the online inference service is unavailable while a model is being copied. During alternate training, combining the features extracted in real time with the micro-batch training technique addresses the timeliness of model training, accelerates model updating, and allows real changes in user behavior to be tracked and responded to more promptly.
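The rotation of S22 to S27 can be sketched as two model slots whose roles swap after every micro-batch. This is an illustrative sketch under stated assumptions, not the patented implementation: `DualModelRotation` and `train_running_sum` are hypothetical names, the model is a toy dictionary, and a production system would additionally need the role swap to be atomic with respect to concurrent inference requests.

```python
import copy
from typing import Any, Callable, List


class DualModelRotation:
    """Two model slots: slot 0 is the main model, slot 1 the secondary model."""

    def __init__(self, initial_model: Any,
                 train_fn: Callable[[Any, List[Any]], Any]) -> None:
        # Both models start as copies of the offline-trained initial model.
        self.models = [copy.deepcopy(initial_model), copy.deepcopy(initial_model)]
        self.serving_index = 1  # S23: the secondary model serves first
        self._train_fn = train_fn

    @property
    def serving_model(self) -> Any:
        """The model currently answering online inference requests."""
        return self.models[self.serving_index]

    def on_micro_batch(self, batch: List[Any]) -> int:
        """Train the non-serving model, promote it, then refresh the other one.

        Returns the index of the slot that was trained in this round.
        """
        train_index = 1 - self.serving_index
        # S22 / S26: incremental training on the model that is not serving.
        self.models[train_index] = self._train_fn(self.models[train_index], batch)
        # S24: the freshly trained model takes over the inference service.
        self.serving_index = train_index
        # S25 / S27: copy it into the retired slot, ready for the next round.
        self.models[1 - train_index] = copy.deepcopy(self.models[train_index])
        return train_index


def train_running_sum(model: dict, batch: List[float]) -> dict:
    """Toy incremental update: accumulate the count and sum of seen values."""
    model["count"] += len(batch)
    model["total"] += sum(batch)
    return model
```

Because the retired slot is refreshed from a copy, the slot trained in each round always differs from the one trained in the previous round, while the serving slot is never modified in place.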
Referring to fig. 3, fig. 3 is a schematic flowchart of another incremental model training method based on streaming data according to an embodiment of the present application, including the following steps:
S31: Acquire original stream data.
Optionally, the original stream data may be log data collected by the log collector Flume from sources such as HTTP, DataBase, and DataSync.
S32: Send the original stream data to the message queue Kafka.
Optionally, the original stream data is collected and read by the log collection program Flume and synchronized into the message queue Kafka for buffering, so that it can be read later.
S33: Extract real-time features from the original stream data.
Optionally, the data in Kafka is consumed and processed (for example, cleaning and field matching) using a real-time computing technology such as Flink or Spark. This step is the same as step S2 described above and is not repeated here.
S34: Add the real-time features to the full-scale feature library and send them to the message queue Kafka for storage.
Optionally, the extracted real-time features are stored in the full-scale feature library and are also sent back to the message queue Kafka for storage, for use by the downstream training module.
S35: Determine a main model and a secondary model from the initial training model; based on the real-time features, alternately perform incremental model training with updating and the online inference service on the main model and the secondary model by micro-batch training; and obtain an incremental training model after multiple rounds of micro-batch training.
The initial training model is a model trained offline in advance on the full-scale feature data of the full-scale feature library. The online inference service may be deployed on a server, using the server's high performance to handle inference requests for users; the model may also be deployed on a mobile terminal, such as a mobile phone or an embedded Internet of Things terminal. In each round of alternate training, incremental model training is performed on one of the main model and the secondary model while the other provides the online inference service, and the model trained in each round is different from the model trained in the previous round.
In some embodiments, the online inference framework used by the online inference service may be a TensorFlow, PyTorch, TensorRT, XGBoost, OpenVINO, or MediaPipe inference framework, or a self-developed inference framework.
S36: Evaluate the incremental training model according to user service indexes.
Optionally, the user service indexes are new users, retention, daily active users (DAU), monthly active users (MAU), conversion rate, and the like for a product, app, or platform. New users may be counted as daily new users, monthly new users, and so on. Retention describes whether newly registered users are still present at a later time, and may be measured as daily retention. Daily active users are users who registered previously and log in to the product, app, or platform on a day after their registration date. Monthly active users is the number of distinct active users in a month, that is, the daily active users of each day in the month combined and then deduplicated.
S37: Correct the incremental training model according to the evaluation result and the full-scale feature library obtained over the historical time period to obtain a corrected training model.
Optionally, the evaluation result of the model is obtained from the service indexes. Thresholds are set, and when a service index falls below or rises above its threshold, the alternating incremental training of the main model and the secondary model is stopped and the resulting incremental training model is corrected to obtain the corrected training model.
Optionally, after S37, the method further includes using the corrected training model as the initial training model and performing incremental model training, model evaluation, and model correction again according to S35 to S37.
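The evaluation in S36 and the threshold check in S37 can be illustrated with a small sketch. It assumes login events are available as (date, user ID) pairs and that thresholds are given as per-index lower and upper bounds. All function names and the example threshold values are hypothetical and do not come from the application.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Mapping, Set, Tuple


def daily_active_users(events: Iterable[Tuple[date, str]]) -> Dict[date, Set[str]]:
    """Group login events by day; each user counts at most once per day."""
    by_day: Dict[date, Set[str]] = defaultdict(set)
    for day, user_id in events:
        by_day[day].add(user_id)
    return dict(by_day)


def monthly_active_users(by_day: Mapping[date, Set[str]], year: int, month: int) -> int:
    """Union the daily active users of one month, then count distinct users."""
    users: Set[str] = set()
    for day, day_users in by_day.items():
        if day.year == year and day.month == month:
            users |= day_users
    return len(users)


def needs_correction(metrics: Mapping[str, float],
                     lower: Mapping[str, float],
                     upper: Mapping[str, float]) -> bool:
    """True when any index is below its lower bound or above its upper bound."""
    too_low = any(metrics[k] < bound for k, bound in lower.items() if k in metrics)
    too_high = any(metrics[k] > bound for k, bound in upper.items() if k in metrics)
    return too_low or too_high
```

For instance, `needs_correction({"conversion_rate": 0.02}, {"conversion_rate": 0.05}, {})` signals that alternating training should stop and correction should start, since the conversion rate is below its assumed lower bound.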
In the above embodiments, the real-time computing technology may also be Storm, Spark Streaming, or another real-time computing framework, and the message queue may also be Redis, ActiveMQ, or another message queue. The general-purpose big-data real-time computing technologies mentioned here, whether for data acquisition, synchronization, storage, or computation, are mature in the industry and offer many options, which are not listed one by one.
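Since the message queue is interchangeable, the triggered micro-batch training can be sketched independently of Kafka. The example below uses an in-memory buffer as a stand-in for the feature queue and fires a training callback once a preset data volume is reached. `MicroBatchTrigger` and its parameters are hypothetical names used only for illustration; a real deployment would read from the queue's consumer client instead.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class MicroBatchTrigger:
    """Buffers real-time features and fires training when enough have arrived."""

    def __init__(self, batch_size: int,
                 on_batch: Callable[[List[Any]], None]) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._batch_size = batch_size  # preset feature data volume
        self._on_batch = on_batch      # e.g. one round of incremental training
        self._buffer: Deque[Any] = deque()

    @property
    def pending(self) -> int:
        """Number of features waiting for the next micro-batch."""
        return len(self._buffer)

    def offer(self, feature: Any) -> bool:
        """Add one feature; return True if this call triggered a micro-batch."""
        self._buffer.append(feature)
        if len(self._buffer) < self._batch_size:
            return False
        batch = [self._buffer.popleft() for _ in range(self._batch_size)]
        self._on_batch(batch)
        return True
```

A data-volume trigger is only one possible condition; a time-based or combined trigger could be substituted by changing the check inside `offer`.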
In the incremental model training method based on streaming data, real-time features are extracted from the original stream data using a real-time computing technology, and triggered incremental training is performed on the initial training model by combining a feature message queue with a micro-batch training technique, so that the model is trained on demand. Alternating between a main model and a secondary model greatly speeds up model updating and deployment, provides a real-time, uninterrupted online inference service, and allows real changes in user behavior to be tracked and responded to more promptly. The model is then corrected based on its online evaluation results, so that it reflects data patterns over a longer time dimension, avoids forgetting historical features, and achieves lifelong learning.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an incremental model training device based on streaming data according to an embodiment of the present application. The model training device 400 includes: an obtaining module 410, configured to obtain original stream data; an extraction module 420, configured to extract real-time features of the original stream data and add them to a full-scale feature library; a training module 430, configured to acquire an initial training model, determine a main model and a secondary model from the initial training model, and alternately perform incremental model training with updating and the online inference service on the main model and the secondary model by micro-batch training based on the real-time features; an inference module 440, configured to deploy the incremental training model updated in real time as the online inference service, so that the real-time online inference service uses the latest model; and a correction module 450, configured to correct the incremental training model based on the full-scale feature library obtained over the historical time period to obtain a corrected training model.
For a detailed description of the incremental model training apparatus, please refer to the description of the related method steps in the above embodiments.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 500 includes a memory 510 and a processor 520 connected via a bus 530. The memory 510 stores a computer program which, when read and executed by the processor 520, causes the electronic device 500 to perform all or part of the processes of the methods in the embodiments described above, so as to implement incremental model training based on streaming data.
It should be understood that the electronic device may be a Personal Computer (PC), a tablet Computer, a smart phone, or other electronic device having a logical computing function.
The embodiment of the application also provides a readable storage medium, wherein a computer program is stored in the readable storage medium, and when the computer program is read and executed by a processor, the steps in the incremental model training method based on the streaming data are executed.
The above embodiments are only specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the present application and shall fall within its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for incremental model training based on streaming data is characterized by comprising the following steps:
acquiring original stream data;
extracting real-time features of the original stream data, and adding the real-time features to a full-scale feature library;
determining a main model and a secondary model according to the initial training model, and performing repeated alternate training on the main model and the secondary model in a micro-batch training mode based on the real-time features to obtain an incremental training model; in each alternate training process, performing incremental model training on one of the main model and the secondary model while using the other model to provide an online inference service, wherein the model on which incremental model training is performed in each alternate training process is different from that in the previous alternate training process;
and correcting the incremental training model based on the full-scale feature library obtained in the historical time period to obtain a corrected training model.
2. The method of claim 1, wherein the modifying the incremental training model based on the full-scale feature library obtained over a historical period of time to obtain a modified training model comprises:
obtaining an evaluation result of the incremental training model according to user service indexes, wherein the user service indexes comprise click-through rate, browsing duration, playing duration, new-user rate, retention rate, daily active users, monthly active users, and conversion rate;
and correcting the incremental training model according to the evaluation result and the full-scale feature library obtained in the historical time period to obtain a corrected training model.
3. The method of claim 1, wherein the performing a plurality of alternating training on the primary model and the secondary model based on the real-time features by means of micro-batch training to obtain an incremental training model comprises:
presetting a trigger condition according to the actual service scenario, wherein the trigger condition is whether the data volume of the real-time features reaches a preset feature data volume;
and judging whether the data volume of the real-time features meets the preset trigger condition, and if so, triggering a micro-batch training instruction and performing micro-batch training and updating on the main (or secondary) model to obtain a trained incremental training model.
4. The method according to claim 1, wherein the initial training model is a model obtained by offline training on the full-scale feature data of the full-scale feature library; and the main model and the secondary model are trained alternately until the incremental model training process ends.
5. The method of claim 1, wherein, in the online inference service, the incremental training model is deployed on a server or in the cloud as an online service; in an intelligent service scenario, a business system triggers a call to the online inference service through its interaction with a user and passes in context information about the user; and the online inference service returns an inference result to the business system according to the incremental training model, and the business system processes the result and returns it to the user side.
6. The method of claim 1, wherein said extracting real-time features of said raw stream data into a full-scale feature library comprises:
extracting the real-time features of the original stream data using a big-data real-time computing technology, wherein the real-time computing technology is the Flink real-time computing framework;
and adding the features extracted in real time to the full-scale feature library.
7. The method of claim 6, further comprising: storing the features extracted in real time in a message queue for incremental model training, wherein the message queue is the Kafka distributed publish-subscribe messaging system.
8. An incremental model training device based on streaming data, comprising:
the acquisition module is used for acquiring original stream data;
the extraction module is used for extracting real-time features of the original stream data and adding the real-time features to a full-scale feature library;
the training module is used for determining a main model and a secondary model according to the initial training model and performing repeated alternate training on the main model and the secondary model in a micro-batch training mode based on the real-time features to obtain an incremental training model; in each alternate training process, performing incremental model training on one of the main model and the secondary model while using the other model to provide an online inference service, wherein the model on which incremental model training is performed in each alternate training process is different from that in the previous alternate training process;
the inference module is used for deploying the incremental training model updated in real time as an online inference service, so that the real-time online inference service uses the latest model;
and the correction module is used for correcting the incremental training model based on the full-scale feature library obtained in the historical time period to obtain a corrected training model.
9. An electronic device comprising a memory storing a computer program and a processor executing the computer program to perform the method of incremental model training based on streaming data of any of claims 1-7.
10. A readable storage medium storing a computer program which, when executed on a processor, performs the method for incremental model training based on streaming data of any one of claims 1 to 7.
CN202210150386.9A 2022-02-18 2022-02-18 Incremental model training method and device based on streaming data and electronic equipment Pending CN114528935A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210150386.9A CN114528935A (en) 2022-02-18 2022-02-18 Incremental model training method and device based on streaming data and electronic equipment

Publications (1)

Publication Number Publication Date
CN114528935A true CN114528935A (en) 2022-05-24

Family

ID=81623659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210150386.9A Pending CN114528935A (en) 2022-02-18 2022-02-18 Incremental model training method and device based on streaming data and electronic equipment

Country Status (1)

Country Link
CN (1) CN114528935A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination