CN116629383A - Stream training method, device, equipment and medium for model - Google Patents

Stream training method, device, equipment and medium for model Download PDF

Info

Publication number
CN116629383A
CN116629383A (application CN202310632866.3A)
Authority
CN
China
Prior art keywords
server
training
data
model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310632866.3A
Other languages
Chinese (zh)
Inventor
张鑫
朱志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lexin Software Technology Co Ltd
Original Assignee
Shenzhen Lexin Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Lexin Software Technology Co Ltd filed Critical Shenzhen Lexin Software Technology Co Ltd
Priority to CN202310632866.3A priority Critical patent/CN116629383A/en
Publication of CN116629383A publication Critical patent/CN116629383A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to artificial intelligence technology and provides a streaming training method, device, equipment and medium for a model. On one hand, the distributed model training system comprises a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample-stream Kafka queue, a training client and a training server; because these servers are distributed across the training process, the problem of high coupling between training components is solved. On the other hand, the data reporting servers consume behavior data generated in real time from a Kafka queue and transmit data through the sample-stream Kafka queue, so streaming training improves the real-time performance of the model; combined with the distributed architecture, this supports training on massive data and effectively ensures training efficiency.

Description

Stream training method, device, equipment and medium for model
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for stream training of a model.
Background
With the wide application of deep learning in various industries and fields, a great deal of computing resources and time are consumed for training a large-scale neural network model.
The traditional offline training method needs to preprocess a large amount of data and load it into memory for training. However, this approach often fails on very large data sets because of memory capacity limits. In addition, most traditional model training cannot achieve real-time training and real-time updating, so user behavior cannot be captured and fed back quickly and timely; as a result, the model effect lags.
To solve these problems, some online learning techniques have been proposed in the industry. However, these methods still follow the traditional model development flow and merely restructure the data into stream-batch input: they cannot perform large-scale distributed training, they are inflexible, the coupling between training components remains high, the problem of feature-data consistency persists, and the improvement in model effect is limited.
Disclosure of Invention
The embodiment of the application provides a streaming training method, device, computer equipment and storage medium for a model, aiming to solve the problems that large-scale distributed training cannot be supported during model training, the coupling between training components is high, and model timeliness is poor.
In a first aspect, an embodiment of the present application provides a streaming training method of a model, which is applied to a distributed model training system based on user behavior, where the distributed model training system includes a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample stream Kafka queue, a training client, and a training server, and the streaming training method includes:
The first data reporting server consumes the behavior data generated in real time from a target Kafka queue for reporting the behavior data to obtain a first real-time characteristic;
the second data reporting server consumes the behavior data generated in real time from the target Kafka queue, utilizes the consumed data to construct statistical characteristics to obtain second real-time characteristics, and stores the second real-time characteristics to the characteristic server for the characteristic server to process the second real-time characteristics;
the first data reporting server calls the feature server, acquires feature data from the feature server, splices the acquired feature data with the first real-time feature to obtain sample data, and stores the sample data into the sample stream Kafka queue; the feature data comprises offline features and features obtained after processing based on the second real-time features;
the training client consumes data from the sample flow Kafka queue, acquires a pre-configured sample attribute and a behavior record of a user in the frequency control server, constructs a sample data packet according to the consumed data, the sample attribute and the behavior record, and sends the sample data packet to the training server;
The training server acquires model training parameters from preset model configuration, and trains based on the model training parameters, the sample data packet and a preset model image file of the training server to obtain a target model.
In a second aspect, an embodiment of the present application provides a streaming training device for a model, which is operated in a distributed model training system, where the distributed model training system includes a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample stream Kafka queue, a training client, and a training server, and the streaming training device includes:
the first data reporting server is used for consuming the behavior data generated in real time from a target Kafka queue for reporting the behavior data to obtain a first real-time characteristic;
the second data reporting server is configured to consume behavior data generated in real time from the target Kafka queue, construct a statistical feature by using the consumed data to obtain a second real-time feature, and store the second real-time feature to the feature server for the feature server to process the second real-time feature;
the first data reporting server is further configured to invoke the feature server, acquire feature data from the feature server, splice the acquired feature data with the first real-time feature to obtain sample data, and store the sample data to the sample stream Kafka queue; the feature data comprises offline features and features obtained after processing based on the second real-time features;
The training client is used for consuming data from the sample flow Kafka queue, acquiring a pre-configured sample attribute and a behavior record of a user in the frequency control server, constructing a sample data packet according to the consumed data, the sample attribute and the behavior record, and sending the sample data packet to the training server;
the training server is used for acquiring model training parameters from preset model configuration, and training based on the model training parameters, the sample data packet and a preset model image file of the training server to obtain a target model.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the streaming training method of the model described in the first aspect when the processor executes the computer program.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium, where the computer readable storage medium stores a computer program, which when executed by a processor, causes the processor to perform the streaming training method of the model according to the first aspect.
The embodiment of the application provides a streaming training method, device, equipment and medium for a model. On one hand, the distributed model training system comprises a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample-stream Kafka queue, a training client and a training server; because these servers are distributed across the training process, the problem of high coupling between training components is solved. On the other hand, the data reporting servers consume behavior data generated in real time from a Kafka queue and transmit data through the sample-stream Kafka queue, so streaming training improves the real-time performance of the model; combined with the distributed architecture, this supports training on massive data and effectively ensures training efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is an application scenario schematic diagram of a streaming training method of a model provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method for training a model according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a flow training device of a model provided by an embodiment of the present application;
fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic application scenario diagram of a streaming training method of a model according to an embodiment of the present application; fig. 2 is a flow chart of a flow training method of a model according to an embodiment of the present application, where the flow training method of the model is applied to a distributed model training system based on user behavior and performs data interaction with a user, and the method is executed by application software installed in the distributed model training system.
As shown in fig. 2, the method is applied to a distributed model training system based on user behaviors, where the distributed model training system includes a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample stream Kafka queue, a training client, and a training server, and the method includes steps S101 to S105.
S101, the first data reporting server consumes behavior data generated in real time from a target Kafka queue for reporting the behavior data, and obtains a first real-time feature.
In this embodiment, the technical solution is described with the distributed model training system as the execution body. The user terminal (such as a smart phone, a tablet computer or another intelligent terminal) can exchange data with the distributed model training system. Specifically, the distributed model training system provides a streaming training platform for the model, and the user can log in to the platform from the user terminal. The user interface of the platform is displayed on the terminal and contains at least one data uploading interface (which may include a picture uploading interface, a voice uploading interface, a text uploading interface and the like). The user can upload data through the uploading interface for use when the distributed model training system trains the model.
In this embodiment, the target Kafka queue is configured to uniformly store behavior data of a user, such as behavior data of exposure, clicking, and the like.
Specifically, the first data reporting server may perform preliminary embedded point filtering in the target Kafka queue, so as to retain behavior data of a specified scene.
Wherein, the specified scene may include, but is not limited to: commodity recommendation scenes, information recommendation scenes, video recommendation scenes, and the like.
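The preliminary embedded-point filtering in step S101 can be sketched as follows. This is a minimal simulation of the consume-and-filter step; the scene names, field names and message format are hypothetical stand-ins for the actual Kafka records, and a real deployment would read from the target Kafka queue rather than an in-memory list.

```python
import json

# Hypothetical scene whitelist for preliminary embedded-point filtering (step S101).
SPECIFIED_SCENES = {"goods_recommend", "info_recommend", "video_recommend"}

def filter_behavior_data(messages):
    """Keep only behavior data belonging to a specified scene.

    `messages` simulates records consumed from the target Kafka queue;
    each is a JSON string with at least `scene` and `event` fields.
    """
    kept = []
    for raw in messages:
        record = json.loads(raw)
        if record.get("scene") in SPECIFIED_SCENES:
            kept.append(record)
    return kept

# Simulated real-time behavior data (exposure / click events).
stream = [
    json.dumps({"scene": "video_recommend", "event": "click", "user": "u1"}),
    json.dumps({"scene": "payment", "event": "click", "user": "u2"}),
    json.dumps({"scene": "goods_recommend", "event": "exposure", "user": "u3"}),
]
first_real_time_features = filter_behavior_data(stream)
```

The retained records then serve as the basis of the first real-time feature.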
S102, the second data reporting server consumes the behavior data generated in real time from the target Kafka queue, utilizes the consumed data to construct statistical features to obtain second real-time features, and stores the second real-time features to the feature server for the feature server to process the second real-time features.
Specifically, the manner in which the second data reporting server consumes the behavior data generated in real time from the target Kafka queue is similar to the manner in which the first data reporting server consumes the behavior data generated in real time from the target Kafka queue for reporting the behavior data, which is described above, and is not repeated here.
In this embodiment, the constructing the statistical feature using the consumed data to obtain the second real-time feature includes:
the second data reporting server acquires at least one statistical category and statistical logic of each statistical category;
the second data reporting server performs statistical processing on the consumed data according to the statistical logic of each statistical category to obtain at least one statistical feature;
And the second data reporting server combines the at least one statistical characteristic to obtain the second real-time characteristic.
Wherein the statistical categories may include, but are not limited to, one or a combination of more of the following: food, sports, entertainment, etc.
Wherein the statistics logic for each statistics category may include, but is not limited to, one or a combination of more of the following: the number of clicks under the category, the number of logins (liveness), negative feedback behavior (e.g., a recommended video being quickly swiped away), etc.
By performing statistical processing on the consumed data, the interest preferences of a user can be determined from the resulting statistical features, and a user portrait can then be constructed, improving the accuracy of the model.
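The three steps above (obtain categories and statistical logic, aggregate, combine) can be sketched as a per-category click count. The category names, event format and feature-key naming are illustrative assumptions, and click counting stands in for whatever statistical logic each category configures.

```python
from collections import Counter

def build_statistical_features(behavior_records, categories):
    """Aggregate per-category click counts and combine them into one
    second real-time feature dict (illustrative statistical logic)."""
    # Statistical processing: count clicks under each configured category.
    clicks = Counter(
        r["category"] for r in behavior_records
        if r["event"] == "click" and r["category"] in categories
    )
    # Combine the per-category statistics into the second real-time feature.
    return {f"clicks_{c}": clicks.get(c, 0) for c in categories}

records = [
    {"category": "food", "event": "click"},
    {"category": "sports", "event": "click"},
    {"category": "food", "event": "click"},
    {"category": "food", "event": "exposure"},  # exposures are not counted here
]
second_real_time_feature = build_statistical_features(
    records, ["food", "sports", "entertainment"]
)
```

The resulting feature dict would then be stored to the feature server for further processing.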
S103, the first data reporting server calls the feature server, acquires feature data from the feature server, splices the acquired feature data with the first real-time feature to obtain sample data, and stores the sample data into the sample stream Kafka queue; the feature data comprises offline features and features obtained after processing based on the second real-time features.
In this embodiment, the feature server provides feature storage and feature output services for training and prediction.
Wherein the feature store is derived from two types: firstly, the offline features can be synchronized to the feature server through an offline feature pushing tool, secondly, the real-time features are calculated in real time through the first data reporting server and the second data reporting server and synchronized to the feature server.
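The splicing in step S103 amounts to merging the feature data fetched from the feature server with the first real-time feature. A minimal sketch, with all field names assumed for illustration:

```python
def splice_sample(feature_data, first_real_time_feature):
    """Splice feature data from the feature server (offline features plus
    features derived from the second real-time feature) with the first
    real-time feature to form one piece of sample data (step S103)."""
    sample = dict(feature_data)              # offline + processed real-time features
    sample.update(first_real_time_feature)   # append the first real-time feature
    return sample

feature_data = {"offline_age_bucket": 3, "clicks_food": 2}
first_real_time_feature = {"scene": "video_recommend", "event": "click"}
sample_data = splice_sample(feature_data, first_real_time_feature)
# `sample_data` would then be written to the sample-stream Kafka queue.
```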
In this embodiment, before the feature data is obtained from the feature server, the method further includes:
and the feature server acquires historical behavior data at preset time intervals to serve as the offline feature.
The preset time interval may be configured in a user-defined manner, such as a month, a day, etc.
Traditional features are updated at a low, usually fixed, frequency, so feedback on user behavior lags. Because of this lag, features updated at a fixed frequency cannot achieve the expected effect and are not suitable for scenes with high requirements on model real-time performance.
In this embodiment, real-time features and offline features are stored in the feature server at the same time. Unlike traditional model development, which entirely adopts the offline-training and online-calling mode, in this embodiment offline data can be periodically collected for training, and streaming training on real-time data is also supported: user behavior data can be learned within seconds, so timely feedback is made on user behavior. This helps raise core indexes such as the CTR (Click-Through Rate) of service scenes such as recommendation and search by more than 30%. At the same time, the inconsistency between the features used during training and the features used during calling is resolved, avoiding deviation in model effect.
S104, the training client consumes data from the sample flow Kafka queue, obtains a pre-configured sample attribute and a behavior record of a user in the frequency control server, constructs a sample data packet according to the consumed data, the sample attribute and the behavior record, and sends the sample data packet to the training server.
In this embodiment, the distributed model training system provides a stable training service based on a distributed architecture. The system comprises services such as the first data reporting server, the second data reporting server, the feature server, the frequency control server, the sample-stream Kafka queue, the training client and the training server, which transfer and circulate data through real-time interfaces and Kafka queues. This reduces the coupling between services, allows flexible cross-platform migration and deployment, and improves development efficiency. It also improves the efficiency of bringing a model online and allows prediction services to be deployed faster: a model that needed 3 weeks from development to landing in the original mode can now be completed in only 2 weeks, raising algorithm delivery efficiency by 50%. Meanwhile, the consistency of online and offline data is ensured, and the performance bottleneck caused by insufficient single-machine training resources is solved.
In this embodiment, the sample attributes may include, but are not limited to, one or more of the following: the positive-negative ratio of samples, the sample batch size (i.e., the maximum number of samples used during training), etc. For example, the positive-negative ratio of samples can be 1:1, 1:10, etc.
In this embodiment, before the behavior record of the user is obtained from the frequency control server, the method further includes:
the frequency control server caches each exposure sample for a preset time period;
and when a click on the material corresponding to any exposure sample is detected within the preset time period, the frequency control server discards that exposure sample and keeps the click behavior as a behavior record.
The positive sample refers to a sample that is fed back by the user (i.e., after the sample is exposed to the user, the user clicks the material corresponding to the exposure behavior), and the negative sample refers to a sample that is not fed back by the user (i.e., after the sample is exposed to the user, the user does not click the material corresponding to the exposure behavior).
The preset duration may be configured in a user-defined manner, for example, 180 seconds.
For example: after exposing a short video to a user, the frequency control server caches the short video ID, and when the clicking action of the user on the short video is detected within 180 seconds, the sample can be determined to be a positive sample, namely the exposure action obtains positive feedback of the user. However, since the exposure behavior must exist in advance regardless of positive feedback or negative feedback, in order to avoid recording the positive feedback data that is clicked by the user as a negative sample (i.e., the exposure behavior is not fed back by the user), the exposure sample that is fed back by the user is directly discarded, and only the click record is kept as a positive sample, so as to ensure the accuracy of the click sample and the exposure sample logic.
Further, a sample data packet is constructed from the consumed data, the sample attributes and the behavior records; that is, a Batch sample data packet is generated by unified packaging and sent to the training server through an interface for model training.
S105, the training server acquires model training parameters from preset model configuration, and trains based on the model training parameters, the sample data packet and a preset model image file of the training server to obtain a target model.
The training server comprises a preset model diagram file and preset model configuration.
The preset model configuration is used for storing configured model training parameters.
The model training parameters refer to some parameter items which need to be configured during training.
The preset model diagram file may be a pre-written network diagram, and may be generated based on the network structure of a DeepFM (Deep Factorization Machine) model developed with TensorFlow.
In this embodiment, after receiving a Batch sample data packet sent by the training client, the training server loads the preset model diagram file to perform training, and may create a local model parameter cache during training.
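One training step on a Batch sample data packet can be sketched as follows. This is only a stand-in for the DeepFM graph loaded from the preset model diagram file: a plain logistic-regression update with log-loss gradients, where `params` plays the role of the local model parameter cache and all names are hypothetical.

```python
import math

def train_step(params, batch, lr=0.1):
    """One illustrative streaming-training step: update a sparse weight
    vector from a Batch sample data packet of (features, label) pairs."""
    for features, label in batch:
        z = sum(params.get(k, 0.0) * v for k, v in features.items())
        pred = 1.0 / (1.0 + math.exp(-z))  # model output in (0, 1)
        grad = pred - label                # gradient of log loss w.r.t. z
        for k, v in features.items():
            # SGD update on the local model parameter cache.
            params[k] = params.get(k, 0.0) - lr * grad * v
    return params

params = {}  # empty parameter cache before the first packet
train_step(params, batch=[({"x": 1.0}, 1)], lr=0.1)
```

After each step, the updated parameters would be synchronized to the parameter server as described below.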
Specifically, the distributed model training system further comprises a parameter server; after the target model is obtained, the method further comprises the following steps:
the parameter server stores model parameters of the target model.
The model parameters are the model after training; in essence, they are a collection of numerical data.
In the above embodiment, the model parameters are synchronized to the parameter server promptly after training, so that downstream services can invoke them.
In this embodiment, the distributed model training system further includes a prediction server; after the target model is obtained, the method further comprises the following steps:
in response to a call request from a model requester for the target model, the prediction server calls the feature server to acquire features and calls the parameter server to acquire the model parameters;
the prediction server predicts based on the obtained characteristics, the model parameters and the target model to obtain a prediction result;
and the prediction server feeds back the prediction result to the model requester.
For example: the prediction server calls the feature server to acquire features such as age and gender. When the model is called to perform interest recommendation, using the same features makes it possible to predict the interest of people of the same age and gender in a video or product, so that the video or product can be recommended in a targeted manner.
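The prediction flow (fetch features, fetch parameters, score) can be sketched with a toy linear scorer standing in for the target model; the feature names and parameter values are illustrative assumptions.

```python
def predict(features, model_params):
    """Toy scorer: combine features fetched from the feature server with
    model parameters fetched from the parameter server (step performed by
    the prediction server; all names illustrative)."""
    return sum(model_params.get(name, 0.0) * value
               for name, value in features.items())

features = {"age": 30.0, "clicks_food": 2.0}   # from the feature server
params = {"age": 0.01, "clicks_food": 0.1}     # from the parameter server
score = predict(features, params)
# `score` would be fed back to the model requester as the prediction result.
```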
In the above embodiment, since the feature used in training and the feature used in calling have consistency, deviation of model effect can be avoided.
In this embodiment, the distributed model training system further includes a training monitor server, and the method further includes:
the training monitoring server acquires at least one index;
the training monitoring server collects index values of each index generated in the training process;
the training monitoring server establishes a time-index value curve graph corresponding to each index by utilizing the index value of each index, and when an inflection point appears in the curve graph of the index, the inflection point is determined to be an abnormal point; or the training monitoring server acquires historical data in a configuration time range, calculates the average value of each index according to the historical data, calculates the deviation degree of the index value of each index from the corresponding average value, and determines the abnormal point according to the detected index value when the deviation degree corresponding to the index value is detected to be greater than or equal to a preset threshold value;
and the training monitoring server reports the abnormal points.
For example: the at least one indicator may include, but is not limited to, one or more of the following: AUC (Area Under ROC Curve), ACC (Accuracy), loss, avg prediction (average prediction), avg Label (average Label), and the like.
For example: the training monitoring server may be built using Grafana.
Wherein the outliers may be output once every specified time period (e.g., 30 seconds); the application is not limited in this respect.
Through the embodiment, the training process can be monitored in real time, training abnormality can be found in time in an auxiliary mode, and the abnormal point is automatically located.
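The second monitoring strategy above (comparing each index value's deviation from a historical mean against a preset threshold) can be sketched as follows. The window size, threshold and sample AUC values are illustrative assumptions.

```python
def detect_anomalies(values, window, threshold):
    """Flag index values whose relative deviation from the mean of the
    first `window` historical values meets or exceeds `threshold`."""
    history = values[:window]
    mean = sum(history) / len(history)
    anomalies = []
    for i, v in enumerate(values[window:], start=window):
        # Deviation degree relative to the historical mean.
        deviation = abs(v - mean) / abs(mean) if mean else abs(v - mean)
        if deviation >= threshold:
            anomalies.append((i, v))  # report this point as abnormal
    return anomalies

# Simulated AUC values collected during training; the drop at index 4
# deviates from the historical mean by more than the 20% threshold.
auc_values = [0.75, 0.76, 0.74, 0.75, 0.52, 0.76]
points = detect_anomalies(auc_values, window=4, threshold=0.2)
```

The detected points would then be reported by the training monitoring server.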
According to the technical scheme, on one hand, the distributed model training system comprises a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample-stream Kafka queue, a training client and a training server; because these servers are distributed across the training process, the problem of high coupling between training components is solved. On the other hand, the data reporting servers consume behavior data generated in real time from a Kafka queue and transmit data through the sample-stream Kafka queue, so streaming training improves the real-time performance of the model; combined with the distributed architecture, this supports training on massive data and effectively ensures training efficiency.
The embodiment of the application also provides a streaming training device of the model, which is used for executing any embodiment of the streaming training method of the model. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a streaming training device 100 of a model according to an embodiment of the present application.
As shown in fig. 3, the streaming training device 100 of the model operates in a distributed model training system, and includes a first data reporting server 101, a second data reporting server 102, a feature server 103, a frequency control server 104, a sample stream Kafka queue 105, a training client 106, and a training server 107.
The first data reporting server 101 is configured to consume behavior data generated in real time from a target Kafka queue for reporting the behavior data, so as to obtain a first real-time feature.
In this embodiment, the technical solution is described with the distributed model training system as the execution body. The user terminal (such as a smart phone, a tablet computer or another intelligent terminal) can exchange data with the distributed model training system. Specifically, the distributed model training system provides a streaming training platform for the model, and the user can log in to the platform from the user terminal. The user interface of the platform is displayed on the terminal and contains at least one data uploading interface (which may include a picture uploading interface, a voice uploading interface, a text uploading interface and the like). The user can upload data through the uploading interface for use when the distributed model training system trains the model.
In this embodiment, the target Kafka queue is configured to uniformly store behavior data of a user, such as behavior data of exposure, clicking, and the like.
Specifically, the first data reporting server may perform preliminary tracking-point ("embedded point") filtering on the target Kafka queue, so as to retain only the behavior data of the specified scenes.
Wherein, the specified scene may include, but is not limited to: commodity recommendation scenes, information recommendation scenes, video recommendation scenes, and the like.
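The preliminary filtering described above can be sketched as follows. This is a minimal illustration: the event schema (dicts with "scene" and "action" keys) and the scene identifiers are assumptions for the example, not part of the patent.

```python
# Hypothetical sketch: retain only behavior events from the specified scenes.
SPECIFIED_SCENES = {"commodity_recommendation", "video_recommendation"}

def filter_behavior_events(events, scenes=SPECIFIED_SCENES):
    """Retain only the behavior data generated in one of the specified scenes."""
    return [e for e in events if e.get("scene") in scenes]

events = [
    {"scene": "video_recommendation", "action": "click", "item": "v1"},
    {"scene": "search", "action": "exposure", "item": "s9"},
    {"scene": "commodity_recommendation", "action": "exposure", "item": "c3"},
]
kept = filter_behavior_events(events)
```

In the real system this filter would run inside the consumer loop of the first data reporting server, before the retained events are turned into the first real-time feature.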
The second data reporting server 102 is configured to consume behavior data generated in real time from the target Kafka queue, construct a statistical feature by using the consumed data to obtain a second real-time feature, and store the second real-time feature to the feature server 103 for the feature server 103 to process the second real-time feature.
Specifically, the manner in which the second data reporting server 102 consumes the behavior data generated in real time from the target Kafka queue is similar to the manner in which the first data reporting server 101 consumes the behavior data generated in real time from the target Kafka queue for reporting the behavior data, which is described above, and is not described herein.
In this embodiment, the constructing the statistical feature using the consumed data to obtain the second real-time feature includes:
The second data reporting server 102 obtains at least one statistical category and statistical logic of each statistical category;
the second data reporting server 102 performs statistical processing on the consumed data according to the statistical logic of each statistical category to obtain at least one statistical feature;
the second data reporting server 102 combines the at least one statistical feature to obtain the second real-time feature.
Wherein the statistical categories may include, but are not limited to, one or a combination of more of the following: food, sports, entertainment, etc.
Wherein the statistics logic for each statistics category may include, but is not limited to, one or a combination of more of the following: the number of clicks under each statistics category, the number of logins (liveness), negative feedback behavior (e.g., recommended video is quickly swiped), etc.
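The three-step statistical-feature construction above (obtain categories and their logic, count per category, combine) can be sketched as follows. The category names come from the examples above; the event schema and feature naming are illustrative assumptions.

```python
from collections import Counter

def build_statistical_features(events, categories=("food", "sports", "entertainment")):
    """Apply per-category statistical logic (here: click counts) and combine
    the resulting statistics into a single second real-time feature dict."""
    clicks = Counter(e["category"] for e in events if e.get("action") == "click")
    return {f"clicks_{c}": clicks.get(c, 0) for c in categories}

features = build_statistical_features([
    {"category": "food", "action": "click"},
    {"category": "food", "action": "click"},
    {"category": "sports", "action": "exposure"},
])
```

Other statistical logic named above (login counts, negative-feedback counts) would follow the same pattern with a different predicate.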
By performing statistical processing on the consumed data, the user's interest preferences can be determined from the resulting statistical features, and a user portrait can then be constructed, which improves the accuracy of the model.
The first data reporting server 101 is further configured to call the feature server 103, obtain feature data from the feature server 103, splice the obtained feature data with the first real-time feature to obtain sample data, and store the sample data to the sample stream Kafka queue 105; the feature data comprises offline features and features obtained after processing based on the second real-time features.
In this embodiment, the feature server 103 provides feature storage and feature output services for training and prediction.
The feature store has two sources: first, offline features, which may be synchronized to the feature server 103 by an offline feature pushing tool; and second, real-time features, which may be calculated in real time by the first data reporting server 101 and the second data reporting server 102 and synchronized to the feature server 103.
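The splicing performed by the first data reporting server — combining the feature data fetched from the feature server with the first real-time feature into one sample record — might look like the sketch below. The feature names are hypothetical; in the real system the merged record would then be written to the sample stream Kafka queue.

```python
def splice_sample(first_realtime_feature, feature_data):
    """Concatenate the fetched feature data (offline features plus features
    derived from the second real-time feature) with the first real-time feature."""
    sample = dict(feature_data)
    sample.update(first_realtime_feature)  # real-time values win on overlap (assumption)
    return sample

offline_and_derived = {"age_bucket": "25-30", "clicks_food": 2}
first_realtime = {"last_click_item": "v1"}
sample = splice_sample(first_realtime, offline_and_derived)
```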
In this embodiment, before feature data is obtained from the feature server 103, the feature server 103 obtains historical behavior data at preset time intervals to serve as the offline features.
The preset time interval may be configured in a user-defined manner, such as a month, a day, etc.
Traditionally, features are updated at a low and usually fixed frequency, so feedback on user behavior lags. Because of this lag, features updated at a fixed frequency cannot achieve the expected effect, making them unsuitable for scenes with high requirements on the real-time performance of the model.
In this embodiment, because real-time features and offline features are stored in the feature server 103 at the same time, the scheme differs from traditional model development, which relies entirely on offline training and online calling. Not only can offline data be collected periodically for training, but streaming training on real-time data is also supported: user behavior data can be learned within seconds, so that timely feedback is given to user behavior, helping to raise core indicators such as the CTR (Click-Through Rate) of business scenes such as recommendation and search by more than 30%. At the same time, the problem of the features used during training being inconsistent with the features used during calling is solved, avoiding deviation in the model's effect.
The training client 106 consumes data from the sample stream Kafka queue 105, obtains the pre-configured sample attributes, obtains the user's behavior records from the frequency control server 104, constructs a sample data packet from the consumed data, the sample attributes and the behavior records, and sends the sample data packet to the training server 107.
In this embodiment, the distributed model training system provides a stable training service based on a distributed architecture. The system includes services such as the first data reporting server 101, the second data reporting server 102, the feature server 103, the frequency control server 104, the sample stream Kafka queue 105, the training client 106 and the training server 107, which exchange and circulate data through real-time interfaces and Kafka queues. This reduces the coupling between services, allows flexible cross-platform migration and deployment, improves development efficiency and the efficiency of bringing models online, and enables faster deployment of the prediction service. Whereas developing and deploying a model took about 3 weeks in the original mode, the efficiency of putting algorithms into production is improved by 50%; the consistency of online and offline data is guaranteed; and the performance bottleneck caused by insufficient single-machine training resources is resolved.
In this embodiment, the sample attributes may include, but are not limited to, one or more of the following: the positive-to-negative sample ratio, the sample batch size (i.e., the maximum number of samples used during training), and the like. For example, the positive-to-negative ratio may be 1:1, 1:10, etc.
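Applying the positive-to-negative ratio from the sample attributes could be sketched as below. This is a hypothetical down-sampling of negatives; the patent does not prescribe how the ratio is enforced.

```python
def enforce_ratio(samples, pos_neg_ratio=(1, 10)):
    """Drop surplus negative samples so positives:negatives does not exceed
    the configured ratio (e.g. 1:10)."""
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    max_negatives = len(positives) * pos_neg_ratio[1] // pos_neg_ratio[0]
    return positives + negatives[:max_negatives]

batch = [{"label": 1}] + [{"label": 0}] * 30
balanced = enforce_ratio(batch, pos_neg_ratio=(1, 10))  # 1 positive + 10 negatives
```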
In this embodiment, before the user's behavior records are obtained from the frequency control server, the frequency control server 104 caches each exposure sample for a preset time period;
and within the preset time period, when a click behavior on the material corresponding to any exposure sample is detected, the frequency control server 104 discards that exposure sample and keeps the click behavior as the behavior record.
A positive sample is a sample for which the user gave feedback (i.e., after the sample was exposed to the user, the user clicked the material corresponding to the exposure behavior); a negative sample is a sample for which the user gave no feedback (i.e., after exposure, the user did not click the corresponding material).
The preset duration may be configured in a user-defined manner, for example, 180 seconds.
For example: after a short video is exposed to the user, the frequency control server 104 caches the short video ID; if a click on that short video is detected within 180 seconds, the sample can be determined to be positive, that is, the exposure behavior received positive feedback. Since the exposure behavior always precedes either positive or negative feedback, recording a clicked (positive-feedback) exposure as a negative sample must be avoided. Therefore, an exposure sample that received user feedback is discarded directly, and only the click record is kept as a positive sample, ensuring the logical accuracy of click samples and exposure samples.
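The frequency control logic just described — cache each exposure for a window, and if a click on the same material arrives inside the window, drop the exposure and keep only the click — can be sketched as below. Explicit timestamps replace wall-clock time for illustration, and the class and method names are assumptions.

```python
import time

class FrequencyControl:
    """Cache each exposure for a preset window; a click on the same material
    within the window discards the cached exposure and is kept as the record."""

    def __init__(self, ttl_seconds=180):
        self.ttl = ttl_seconds
        self.exposures = {}  # material_id -> exposure timestamp
        self.records = []    # behavior records later used to label samples

    def on_exposure(self, material_id, now=None):
        self.exposures[material_id] = time.time() if now is None else now

    def on_click(self, material_id, now=None):
        now = time.time() if now is None else now
        ts = self.exposures.pop(material_id, None)
        if ts is not None and now - ts <= self.ttl:
            self.records.append(("click", material_id))  # kept as positive sample

    def flush_expired(self, now=None):
        now = time.time() if now is None else now
        for mid, ts in list(self.exposures.items()):
            if now - ts > self.ttl:
                self.records.append(("exposure", mid))   # negative sample
                del self.exposures[mid]

fc = FrequencyControl(ttl_seconds=180)
fc.on_exposure("v1", now=0)
fc.on_click("v1", now=100)   # within 180 s: exposure dropped, click kept
fc.on_exposure("v2", now=0)
fc.flush_expired(now=300)    # no click within 180 s: kept as exposure record
```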
Further, a sample data packet is constructed from the consumed data, the sample attributes and the behavior records; that is, a Batch sample data packet is generated by unified packaging and sent to the training server 107 through an interface for training the model.
The training server 107 is configured to obtain model training parameters from a preset model configuration, and to train based on the model training parameters, the sample data packet, and a preset model graph file of the training server 107, so as to obtain a target model.
The training server 107 includes the preset model graph file and the preset model configuration.
The preset model configuration is used for storing configured model training parameters.
The model training parameters refer to some parameter items which need to be configured during training.
The preset model graph file may be a pre-written network graph, for example one generated from the network structure of a DeepFM (Deep Factorization Machine) model implemented with TensorFlow.
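The second-order interaction at the heart of an FM/DeepFM network graph can be illustrated with the standard FM identity, sum over pairs ⟨v_i, v_j⟩ = 0.5 · ((Σv)² − Σv²) per embedding dimension. This is a plain-Python sketch of the mathematics only, not the TensorFlow graph file the embodiment refers to.

```python
def fm_second_order(embeddings):
    """Second-order FM interaction score over the active features' embedding
    vectors, via 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), summed per dimension."""
    dim = len(embeddings[0])
    total = [sum(v[d] for v in embeddings) for d in range(dim)]
    squares = [sum(v[d] ** 2 for v in embeddings) for d in range(dim)]
    return sum(0.5 * (total[d] ** 2 - squares[d]) for d in range(dim))

# Two active features with 2-d embeddings: <[1,2],[3,4]> = 1*3 + 2*4 = 11
score = fm_second_order([[1.0, 2.0], [3.0, 4.0]])
```

In a DeepFM model this FM term is summed with a first-order linear term and the output of a deep component over the same embeddings.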
In this embodiment, after receiving a Batch sample data packet sent by the training client 106, the training server 107 loads the preset model graph file for training, and may create a local model parameter cache during training.
Specifically, the distributed model training system further comprises a parameter server; and after the target model is obtained, the parameter server stores model parameters of the target model.
The model parameters are what constitute the trained model; in essence, they are simply a collection of numerical data.
In the above embodiment, the model parameters are synchronized to the parameter server in time after training, after which they can be called by downstream services.
In this embodiment, the distributed model training system further includes a prediction server. After the target model is obtained, in response to a call request from a model requester for the target model, the prediction server calls the feature server 103 to obtain features and calls the parameter server to obtain the model parameters;
the prediction server predicts based on the obtained characteristics, the model parameters and the target model to obtain a prediction result;
and the prediction server feeds back the prediction result to the model requester.
For example: the prediction server calls the feature server 103 to obtain features such as age and gender. When the model is called for interest recommendation, the same features can be used to predict the degree of interest that people of the same age group, or of the same gender, have in a video or product, so that videos or products can be recommended in a targeted manner.
In the above embodiment, since the feature used in training and the feature used in calling have consistency, deviation of model effect can be avoided.
In this embodiment, the distributed model training system further includes a training monitoring server, and the training monitoring server obtains at least one index;
the training monitoring server collects index values of each index generated in the training process;
the training monitoring server establishes a time-index value curve graph corresponding to each index by utilizing the index value of each index, and when an inflection point appears in the curve graph of the index, the inflection point is determined to be an abnormal point; or the training monitoring server acquires historical data in a configuration time range, calculates the average value of each index according to the historical data, calculates the deviation degree of the index value of each index from the corresponding average value, and determines the abnormal point according to the detected index value when the deviation degree corresponding to the index value is detected to be greater than or equal to a preset threshold value;
and the training monitoring server reports the abnormal points.
For example: the at least one indicator may include, but is not limited to, one or more of the following: AUC (Area Under ROC Curve), ACC (Accuracy), loss, avg prediction (average prediction), avg Label (average Label), and the like.
For example: the training monitoring server may be built using Grafana.
The abnormal points may be output once every specified time period (e.g., every 30 seconds); the application is not limited in this respect.
Through the embodiment, the training process can be monitored in real time, training abnormality can be found in time in an auxiliary mode, and the abnormal point is automatically located.
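The deviation-from-mean branch of the monitoring logic above can be sketched as follows. The 50% relative-deviation threshold and the AUC-like values are illustrative assumptions, and the mean is assumed non-zero.

```python
def detect_outliers(history, current_values, threshold=0.5):
    """Flag index values whose relative deviation from the historical mean
    is greater than or equal to the threshold (mean assumed non-zero)."""
    mean = sum(history) / len(history)
    return [v for v in current_values if abs(v - mean) / abs(mean) >= threshold]

# Recent AUC-like readings average about 0.71; a reading of 0.30 deviates ~58%.
outliers = detect_outliers([0.70, 0.72, 0.71], [0.71, 0.30], threshold=0.5)
```

In the described system, each flagged value would then be reported by the training monitoring server as an abnormal point.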
According to this technical scheme, on one hand, the distributed model training system comprises a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample stream Kafka queue, a training client and a training server; these services run in a distributed manner during training, which resolves the problem of high coupling between training components. On the other hand, the data reporting servers consume behavior data generated in real time from the Kafka queue and transmit data through the sample stream Kafka queue; streaming training improves the real-time performance of the model and, combined with the distributed architecture, supports training on massive data while effectively guaranteeing training efficiency.
The streaming training apparatus of the above model may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 400 is a server, or a cluster of servers. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to fig. 4, the computer apparatus 400 includes a processor 402, a memory, and a network interface 405 connected by a device bus 401, wherein the memory may include a storage medium 403 and an internal memory 404.
The storage medium 403 may store an operating system 4031 and a computer program 4032. The computer program 4032, when executed, may cause the processor 402 to perform a streaming training method of a model.
The processor 402 is used to provide computing and control capabilities, supporting the operation of the overall computer device 400.
The internal memory 404 provides an environment for the execution of a computer program 4032 in the storage medium 403, which computer program 4032, when executed by the processor 402, causes the processor 402 to perform a streaming training method of the model.
The network interface 405 is used for network communication, such as providing transmission of data information. It will be appreciated by those skilled in the art that the architecture shown in fig. 4 is merely a block diagram of part of the architecture relevant to the present application and does not limit the computer device 400 to which the present application is applied; a particular computer device 400 may include more or fewer components than shown, combine certain components, or arrange components differently.
The processor 402 is configured to execute the computer program 4032 stored in the memory, so as to implement the streaming training method of the model disclosed in the embodiment of the present application.
Those skilled in the art will appreciate that the embodiment of the computer device shown in fig. 4 is not limiting of the specific construction of the computer device, and in other embodiments, the computer device may include more or less components than those shown, or certain components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor, and in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in fig. 4, and will not be described again.
It should be appreciated that in embodiments of the present application, the processor 402 may be a central processing unit (Central Processing Unit, CPU), the processor 402 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the application, a computer-readable storage medium is provided. The computer readable storage medium may be a nonvolatile computer readable storage medium or a volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program when executed by a processor implements a streaming training method of the model disclosed by the embodiment of the application.
The data in this case were obtained legally.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another apparatus, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present application may be essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a background server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A streaming training method of a model, applied to a distributed model training system based on user behavior, the distributed model training system comprising a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample stream Kafka queue, a training client and a training server, characterized in that the method comprises:
the first data reporting server consumes the behavior data generated in real time from a target Kafka queue for reporting the behavior data to obtain a first real-time characteristic;
the second data reporting server consumes the behavior data generated in real time from the target Kafka queue, utilizes the consumed data to construct statistical characteristics to obtain second real-time characteristics, and stores the second real-time characteristics to the characteristic server for the characteristic server to process the second real-time characteristics;
The first data reporting server calls the feature server, acquires feature data from the feature server, splices the acquired feature data with the first real-time feature to obtain sample data, and stores the sample data into the sample stream Kafka queue; the feature data comprises offline features and features obtained after processing based on the second real-time features;
the training client consumes data from the sample flow Kafka queue, acquires a pre-configured sample attribute and a behavior record of a user in the frequency control server, constructs a sample data packet according to the consumed data, the sample attribute and the behavior record, and sends the sample data packet to the training server;
the training server acquires model training parameters from a preset model configuration, and performs training based on the model training parameters, the sample data packet and a preset model graph file of the training server, so as to obtain a target model.
2. The method of claim 1, wherein constructing statistical features using the consumed data results in second real-time features, comprising:
The second data reporting server acquires at least one statistical category and statistical logic of each statistical category;
the second data reporting server performs statistical processing on the consumed data according to the statistical logic of each statistical category to obtain at least one statistical feature;
and the second data reporting server combines the at least one statistical characteristic to obtain the second real-time characteristic.
3. The method of streaming training a model of claim 1, wherein prior to the obtaining feature data from the feature server, the method further comprises:
and the feature server acquires historical behavior data at preset time intervals to serve as the offline feature.
4. The method for streaming training of a model according to claim 1, wherein before said obtaining a behavior record of a user in the frequency control server, the method further comprises:
the frequency control server caches each exposure sample for a preset time period;
and in the preset time period, when clicking behaviors aiming at materials corresponding to any exposure sample are detected, discarding the any exposure sample by the frequency control server, and reserving the clicking behaviors as behavior records.
5. The method of streaming model training according to claim 1, wherein the distributed model training system further comprises a parameter server; after the target model is obtained, the method further comprises the following steps:
the parameter server stores model parameters of the target model.
6. The method of streaming model training according to claim 5, wherein the distributed model training system further comprises a prediction server; after the target model is obtained, the method further comprises:
in response to a call request from a model requester for the target model, the prediction server calls the feature server to acquire features, and calls the parameter server to acquire the model parameters;
the prediction server predicts based on the obtained characteristics, the model parameters and the target model to obtain a prediction result;
and the prediction server feeds back the prediction result to the model requester.
7. The method of streaming model training according to claim 1, wherein the distributed model training system further comprises a training monitoring server, the method further comprising:
the training monitoring server acquires at least one index;
The training monitoring server collects index values of each index generated in the training process;
the training monitoring server establishes a time-index value curve graph corresponding to each index by utilizing the index value of each index, and when an inflection point appears in the curve graph of the index, the inflection point is determined to be an abnormal point; or the training monitoring server acquires historical data in a configuration time range, calculates the average value of each index according to the historical data, calculates the deviation degree of the index value of each index from the corresponding average value, and determines the abnormal point according to the detected index value when the deviation degree corresponding to the index value is detected to be greater than or equal to a preset threshold value;
and the training monitoring server reports the abnormal points.
8. A streaming training device of a model, operating in a distributed model training system, the distributed model training system comprising a first data reporting server, a second data reporting server, a feature server, a frequency control server, a sample stream Kafka queue, a training client and a training server, characterized in that:
the first data reporting server is used for consuming the behavior data generated in real time from a target Kafka queue for reporting the behavior data to obtain a first real-time characteristic;
The second data reporting server is configured to consume behavior data generated in real time from the target Kafka queue, construct a statistical feature by using the consumed data to obtain a second real-time feature, and store the second real-time feature to the feature server for the feature server to process the second real-time feature;
the first data reporting server is further configured to invoke the feature server, acquire feature data from the feature server, splice the acquired feature data with the first real-time feature to obtain sample data, and store the sample data to the sample stream Kafka queue; the feature data comprises offline features and features obtained after processing based on the second real-time features;
the training client is used for consuming data from the sample flow Kafka queue, acquiring a pre-configured sample attribute and a behavior record of a user in the frequency control server, constructing a sample data packet according to the consumed data, the sample attribute and the behavior record, and sending the sample data packet to the training server;
the training server is used for acquiring model training parameters from a preset model configuration, and performing training based on the model training parameters, the sample data packet and a preset model graph file of the training server, so as to obtain a target model.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a streaming training method of a model according to any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform a streaming training method of a model according to any of claims 1 to 7.
CN202310632866.3A 2023-05-31 2023-05-31 Stream training method, device, equipment and medium for model Pending CN116629383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310632866.3A CN116629383A (en) 2023-05-31 2023-05-31 Stream training method, device, equipment and medium for model

Publications (1)

Publication Number Publication Date
CN116629383A true CN116629383A (en) 2023-08-22

Family

ID=87613149

Country Status (1)

Country Link
CN (1) CN116629383A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination