CN112925947A - Training sample processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112925947A
CN112925947A (application CN202110197693.8A)
Authority
CN
China
Prior art keywords
data
event data
dotting event
dotting
storing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110197693.8A
Other languages
Chinese (zh)
Inventor
胡志勇
孟蕊
张冠星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd filed Critical Bigo Technology Singapore Pte Ltd
Priority to CN202110197693.8A priority Critical patent/CN112925947A/en
Publication of CN112925947A publication Critical patent/CN112925947A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • G06Q30/0255Targeted advertisements based on user history

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Library & Information Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a training sample processing method, apparatus, device and storage medium. The method comprises the following steps: obtaining dotting event data, parsing the dotting event data through a stream computing engine to obtain the user response data for the recommended content corresponding to each dotting event, and using the user response data as the label of the corresponding dotting event data; acquiring feature data corresponding to the recommended content, and preprocessing the feature data through the stream computing engine based on a preset first preprocessing rule; storing the dotting event data in a distributed columnar database, and storing the labels of the dotting event data and the corresponding preprocessed feature data in associated fields of the dotting event data; and, after all labels of the dotting event data have been stored in the corresponding associated fields, using the dotting event data and the data in the associated fields as training samples and storing the training samples in a distributed message system or a distributed file system, thereby solving the problem that training samples cannot be produced in real time.

Description

Training sample processing method, device, equipment and storage medium
Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular to a training sample processing method, apparatus, device and storage medium.
Background
In scenarios such as short-video recommendation, live-stream recommendation and advertising, the timeliness of recommended content is increasingly important. In a recommendation system, timeliness plays a very important role in recommendation quality: the faster the model of the recommendation system is updated, the better it reflects users' recent habits and current trends, and the more relevant the content it can recommend to users. The timeliness of a recommendation system has two major components: the timeliness of the features and the timeliness of the model.
To achieve highly timely content recommendation, the model needs to be trained quickly and the features need to be generated in real time, so that the whole recommendation pipeline runs fast. To train the model quickly, the prior art can train it online through online-learning techniques. However, online training requires training samples to be produced in real time, which in turn requires real-time processing of big data. The foundation of current big-data real-time processing technology is fairly solid: several excellent stream processing platforms are increasingly mature, and real-time data processing can be achieved through the stream computing engines they provide. However, this real-time processing technology has not yet been applied to producing training samples; that is, traditional means of producing training samples cannot output them in real time.
Disclosure of Invention
The embodiments of the present application provide a training sample processing method, apparatus, device and storage medium, which can solve the problem that training samples cannot be produced in real time and can ensure the timeliness of the recommendation model and of the recommendation system.
In a first aspect, an embodiment of the present application provides a training sample processing method, including:
obtaining dotting event data, parsing the dotting event data through a stream computing engine to obtain the user response data for the recommended content corresponding to each dotting event, and using the user response data as the label of the corresponding dotting event data;
acquiring feature data corresponding to the recommended content, and preprocessing the feature data through the stream computing engine based on a preset first preprocessing rule;
storing the dotting event data in a distributed columnar database, and storing the labels of the dotting event data and the corresponding preprocessed feature data in the associated fields of the dotting event data;
and, after all labels of the dotting event data have been stored in the corresponding associated fields, using the dotting event data and the data in the associated fields as training samples, and storing the training samples in the distributed message system or the distributed file system.
In a second aspect, an embodiment of the present application provides a training sample processing apparatus, including:
a tag analysis module, configured to acquire dotting event data, parse the dotting event data through a stream computing engine to obtain the user response data for the recommended content corresponding to each dotting event, and use the user response data as the label of the corresponding dotting event data;
a feature preprocessing module, configured to acquire feature data corresponding to the recommended content, preprocess the feature data through the stream computing engine based on a preset first preprocessing rule, and store the dotting event data and the feature data in a distributed message system;
a data summarization module, configured to store the dotting event data in a distributed columnar database, and store the labels of the dotting event data and the corresponding preprocessed feature data in the associated fields of the dotting event data;
and a training sample generation module, configured to, after all labels of the dotting event data have been stored in the corresponding associated fields, use the dotting event data and the data in the associated fields as training samples and store the training samples in the distributed message system or the distributed file system.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the training sample processing method described in the first aspect.
In a fourth aspect, embodiments of the present application provide a storage medium containing computer-executable instructions for performing the training sample processing method according to the first aspect when executed by a computer processor.
According to the method, the dotting event data are acquired and parsed through the stream computing engine to obtain the user response data for the recommended content corresponding to each dotting event, and the user response data serve as the label of the corresponding dotting event data; feature data corresponding to the recommended content are acquired and preprocessed through the stream computing engine based on a preset first preprocessing rule; the dotting event data are stored in a distributed columnar database, and the labels of the dotting event data and the corresponding preprocessed feature data are stored in the associated fields of the dotting event data; after all labels of the dotting event data have been stored in the corresponding associated fields, the dotting event data and the data in the associated fields are used as training samples and stored in the distributed message system or the distributed file system. With these technical means, the dotting event data are parsed by the stream computing engine so that their label information is collected quickly, and the feature data are preprocessed by the stream computing engine so that their memory footprint, and therefore the downstream cache pressure, is reduced. Because the dotting event data and the feature data come from different data sources and are merged in the distributed columnar database before being output to the downstream model, faults can be discovered and resolved quickly; this avoids generating large amounts of erroneous data that would severely affect the recommendation model, avoids spending substantial manpower and resources rolling the recommendation model back, and improves the stability of the recommendation system.
Based on the stream computing engine and the data convergence processing, fast and stable production of training samples is achieved, which guarantees the timeliness of the recommendation model and, in turn, of the recommendation system.
Drawings
Fig. 1 is a flowchart of a training sample processing method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a process of merging dotting event data and associated data according to an embodiment of the present application;
FIG. 3 is a flow chart of training sample production in accordance with one embodiment of the present disclosure;
FIG. 4 is a flowchart of training sample storage according to a first embodiment of the present application;
FIG. 5 is a flow chart of another training sample processing method provided in the first embodiment of the present application;
FIG. 6 is a block diagram of a training sample processing system according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a training sample processing apparatus according to a second embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, specific embodiments of the present application will be described in detail with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some but not all of the relevant portions of the present application are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The training sample processing method, apparatus, device and storage medium described herein aim to: acquire dotting event data, parse the dotting event data through a stream computing engine to obtain the user response data for the recommended content corresponding to each dotting event, and use the user response data as the label of the corresponding dotting event data; acquire feature data corresponding to the recommended content and preprocess the feature data through the stream computing engine based on a preset first preprocessing rule; store the dotting event data in a distributed columnar database, and store the labels of the dotting event data and the corresponding preprocessed feature data in the associated fields of the dotting event data; and, after all labels of the dotting event data have been stored in the corresponding associated fields, use the dotting event data and the data in the associated fields as training samples and store them in the distributed message system or the distributed file system. By contrast, the traditional way of producing training samples does not apply real-time data processing technology; it cannot guarantee the timeliness of the training samples, and therefore cannot guarantee the timeliness of the model, which reduces the timeliness of the recommendation system and the value of the data. For this reason, the embodiments of the present application provide a training sample processing method, apparatus, device and storage medium that process training samples in real time based on a stream computing engine, so as to efficiently produce training samples in real time and improve the timeliness of the model and of the recommendation system.
The first embodiment is as follows:
fig. 1 is a flowchart of a training sample processing method according to an embodiment of the present disclosure, where the training sample processing method provided in this embodiment may be performed by a training sample processing apparatus, and the training sample processing apparatus may be implemented by software and/or hardware.
The following description will be given taking a training sample processing apparatus as an example of a subject that performs a training sample processing method. Referring to fig. 1, the training sample processing method includes:
s110, dotting event data are obtained, the dotting event data are analyzed through a stream calculation engine, user response data of recommended contents corresponding to the dotting events are analyzed, and the user response data serve as tags corresponding to the dotting event data.
Specifically, the dotting event data is user behavior data collected by embedding points at the client, and can reflect the user's preference degree for recommended content corresponding to the dotting event. Therefore, the dotting event data can be analyzed, the reaction data of the user to the recommended content is obtained, and the reaction data is used as the label information of the dotting event data, so that efficient real-time output of training samples can be achieved subsequently. Furthermore, as the amount of dotting event data is overlarge, a stream computing engine is introduced to process the dotting event data in real time so as to improve the output rate of training sample data. The stream computation engine can adopt a flink, which is an open source stream processing framework and provides a distributed stream data stream engine, and any data stream program can be executed in a data parallel and pipeline mode. The embodiment of the application analyzes and processes a large amount of dotting event data streams based on flink batch processing and stream processing characteristics so as to improve the real-time output efficiency of training samples.
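The patent does not give the parsing logic itself; a minimal sketch of the idea, with an illustrative event schema and an illustrative action-to-label mapping (`stream_id`, `action`, `parse_dotting_event` and all values are assumptions, not from the source), might look like:

```python
# Toy sketch of label extraction from a dotting (tracking) event.
# The event fields and the action-to-label mapping are illustrative assumptions.
def parse_dotting_event(event: dict) -> dict:
    """Derive the user-response label for the recommended content."""
    # Map the raw user action to a response label (e.g. click/like = positive).
    response = {"click": 1, "like": 1, "skip": 0, "impression": 0}
    label = response.get(event.get("action", "impression"), 0)
    return {
        "stream_id": event["stream_id"],    # ties the event to its feature data
        "content_id": event["content_id"],
        "label": label,
    }

events = [
    {"stream_id": "s1", "content_id": "v42", "action": "click"},
    {"stream_id": "s2", "content_id": "v43", "action": "skip"},
]
labels = [parse_dotting_event(e) for e in events]
```

In a real deployment this mapping would run inside a Flink operator over the event stream rather than over an in-memory list.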
S120, feature data corresponding to the recommended content are acquired, and the feature data are preprocessed through the stream computing engine based on a preset first preprocessing rule.
The feature data are the basis on which the recommended content is produced, and they represent the content information of the recommended content. The data source of the feature data is very stable, but its memory footprint is very large, which easily hurts downstream data-caching efficiency. The feature data are therefore preprocessed: the preprocessed feature data have a small footprint and put little pressure on the downstream cache, which improves data-caching efficiency and, in turn, the efficiency of training sample production.
In addition, in the embodiments of the present application, the Flink-based stream computing engine performs the parsing of the dotting event data and the preprocessing of the feature data in parallel. While this raises the data processing rate, an independent processing module is built for each of the two processing flows, which fully decouples the training sample processing system and improves its stability.
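The patent does not specify the first preprocessing rule; one plausible sketch of a footprint-reducing rule, under the assumption that it keeps only model-relevant fields and discretizes continuous features (the field names and bucket width are invented for illustration), is:

```python
# Hypothetical "first preprocessing rule": keep only the fields the model
# needs and bucketize a continuous feature so the cached record is smaller.
def preprocess_features(raw: dict, keep=("stream_id", "content_id")) -> dict:
    out = {k: raw[k] for k in keep}
    # Discretize a continuous feature into at most 10 buckets (illustrative).
    out["duration_bucket"] = min(int(raw.get("duration_s", 0) // 30), 9)
    return out

compact = preprocess_features(
    {"stream_id": "s1", "content_id": "v42", "duration_s": 95,
     "raw_text": "large payload dropped by the rule"}
)
```

The bulky `raw_text` field never reaches the cache, which is the footprint reduction the section describes.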
S130, the dotting event data are stored in a distributed columnar database, and the labels of the dotting event data and the corresponding preprocessed feature data are stored in the associated fields of the dotting event data.
Specifically, when the dotting event data are produced, the same stream identifier is automatically configured for the dotting event data and the corresponding feature data; when the labels of the dotting event data and the preprocessed feature data are generated, the same stream identifier is also configured for them, so that the dotting event data and the associated data can be merged later. Illustratively, referring to fig. 2, fig. 2 is a flowchart of merging dotting event data and associated data in the first embodiment of the present application. As shown in fig. 2, the merging process includes:
S1301, at a second time node after the dotting event data are stored in the distributed columnar database, determining the storage row of the dotting event data in the database according to the stream identifier of the dotting event data;
S1302, storing the label and the preprocessed feature data, which are configured with the same stream identifier as the dotting event data, into the remaining columns of that storage row.
Illustratively, the dotting event data are stored in a distributed columnar database, which may be HBase (Hadoop Database), a highly reliable, high-performance, column-oriented, scalable distributed storage system. HBase is used as intermediate storage: based on its column-oriented storage, the dotting event data, the labels and the corresponding preprocessed feature data are stored in fields of the same row, which achieves the goal of converging the dotting event data with the associated data. Further, after the dotting event data are stored in HBase, in order to prevent large data streams such as the dotting event data and the associated data from being written to HBase at the same time, destabilizing the training sample processing system and causing the associated data to be lost, the labels and the preprocessed feature data are written to the associated fields in HBase one day after the dotting event data are stored. The storage row of the dotting event data is determined according to its stream identifier, and the label and the preprocessed feature data are stored in the remaining columns of that row.
In the embodiments of the present application, in order to fully decouple the training sample processing system, the HBase storage is also set up as an independent flow. After the Flink stream computing engine has executed the label-parsing flow and the preprocessing flow in parallel, the generated labels and preprocessed feature data are stored in the corresponding fields, which guarantees the independence of each flow, the full decoupling of all links in the training sample processing system, and the stability of the system.
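The row-merge idea can be illustrated with a toy in-memory stand-in for the columnar store (this is a simulation, not the HBase API; the class and method names are invented): rows are keyed by the shared stream identifier, so the label and the preprocessed features land in sibling columns of the event's row.

```python
# Toy stand-in for the HBase merge step. Rows are keyed by the stream
# identifier shared by the event and its associated data, so later writes
# with the same identifier fill the remaining columns of the same row.
class ColumnarStore:
    def __init__(self):
        self.rows = {}  # stream_id -> {column: value}

    def put_event(self, stream_id: str, event: dict) -> None:
        self.rows.setdefault(stream_id, {})["event"] = event

    def put_assoc(self, stream_id: str, column: str, value) -> None:
        # Same row as the event, because both carry the same stream identifier.
        self.rows.setdefault(stream_id, {})[column] = value

store = ColumnarStore()
store.put_event("s1", {"action": "click"})
store.put_assoc("s1", "label", 1)
store.put_assoc("s1", "features", {"duration_bucket": 3})
```

With HBase itself, the stream identifier would play the role of the row key, and "label" and "features" would be column qualifiers in the same row.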
S140, after all labels of the dotting event data have been stored in the corresponding associated fields, the dotting event data and the data in the associated fields are used as training samples, and the training samples are stored in the distributed message system or the distributed file system.
Specifically, once the dotting event data and the associated data are stored in HBase, the preprocessing of the training sample is complete; the dotting event data and the associated data then only need to be stored in a training sample database, from which the downstream recommendation model pulls training samples as needed. However, the dotting event data arrive with large and unstable delays. If the dotting event data and the associated data are stored in the training sample database too early, the label information of the dotting event data will be incomplete and the training effect of the samples will suffer, so the problem becomes how to decide that the label information has been completely written to the associated fields. One option would be to cache the dotting event data and the associated data for a period before reading them from HBase, but such a caching task is very complex. Instead, when the data in HBase are read, the reading of the related data is delayed, which achieves the same effect as caching. Specifically, referring to fig. 3, fig. 3 is a flowchart of training sample production in the first embodiment of the present application. As shown in fig. 3, the training sample production process includes:
s1401, judging whether all the tags of the dotting event data are stored in corresponding associated fields, and if all the tags of the dotting event data are stored in the corresponding associated fields, taking the data in the dotting event data and the associated fields as sample data of a first response level;
and S1402, taking the dotting event data and the data in the associated field as sample data of a second response level at a first time node after the label starts to be stored in the distributed columnar database.
Illustratively, whether all labels of the dotting event data have been stored in HBase is checked; if not, the query is repeated until they have. Once the labels are fully stored in HBase, the dotting event data and the associated data are stored in the training sample database, so that the downstream recommendation model can conveniently pull them. In the embodiments of the present application, the dotting event data and the associated data are written to the training sample database as soon as the label data are fully stored in HBase, which amounts to producing the training samples in real time and guarantees the timeliness of the dotting event data and the associated data. Accordingly, the dotting event data and associated data that are written to the training sample database immediately after the label data are complete are used as sample data of the first response level.
Illustratively, in order to enrich the time granularity of the training samples, one hour after the labels begin to be stored in HBase, the dotting event data and the associated data are stored in the training sample database and used as sample data of the second response level.
Through these different training sample production strategies, the embodiments of the present application support training samples at multiple time granularities and enrich their usage scenarios.
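The two production strategies can be sketched as a completeness check over the stored rows (a simulation under assumed names; the patent does not give this code): a row becomes a first-response-level sample as soon as all expected columns are filled, and remaining rows are flushed as second-response-level samples at the delayed time node.

```python
# Sketch of the two output strategies. A row is a first-response-level
# sample when every expected column (event, label, features) is present;
# otherwise it is emitted as a second-response-level sample only when the
# delayed (hour-level) flush fires.
def classify_samples(rows: dict, expected_columns: set, hourly_flush: bool = False):
    realtime, hourly = [], []
    for stream_id, row in rows.items():
        if expected_columns <= set(row):   # all label/feature columns filled
            realtime.append(stream_id)
        elif hourly_flush:
            hourly.append(stream_id)
    return realtime, hourly

rows = {
    "s1": {"event": {}, "label": 1, "features": {}},
    "s2": {"event": {}},  # label has not arrived yet
}
rt, hr = classify_samples(rows, {"event", "label", "features"}, hourly_flush=True)
```

Delaying the read of incomplete rows until the hourly flush is what gives the same effect as caching, without a separate caching task.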
Further, the sample data of the first response level and the sample data of the second response level need to be stored in different training sample databases, so that the recommendation model can pull the corresponding training samples from the corresponding database. Specifically, referring to fig. 4, fig. 4 is a flowchart of training sample storage in the first embodiment of the present application. As shown in fig. 4, the training sample storage process includes:
s1403, storing the sample data of the first response level into a theme message unit of the distributed message system, wherein the theme message unit is used for storing message data of the same theme;
and S1404, storing the sample data of the second response level into the distributed file system.
Illustratively, the distributed message system is Kafka, a high-throughput distributed publish-subscribe messaging system; the sample data of the first response level can be stored in a Kafka topic. Real-time training samples are provided to the recommendation model through the Kafka cluster: the dotting event data and the associated data are stored in the Kafka topic, and the recommendation model pulls them from that topic when it pulls training samples from Kafka.
Illustratively, the distributed file system is HDFS (Hadoop Distributed File System), a highly fault-tolerant distributed file system; the sample data of the second response level can be stored in HDFS.
According to the embodiments of the present application, real-time training samples are provided to the recommendation model through the real-time push capability of the Kafka cluster, which realizes real-time production of training samples and improves the timeliness of the recommendation model and of the recommendation system.
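The routing of finished samples to the two sinks can be sketched as follows; both sinks are simulated in memory (this is not the Kafka or HDFS client API, and the topic name and path are invented for illustration):

```python
from collections import defaultdict

# Illustrative routing of finished samples: first-response-level samples go
# to a (simulated) Kafka topic, second-response-level samples to a
# (simulated) HDFS path.
class SampleRouter:
    def __init__(self):
        self.kafka_topics = defaultdict(list)  # topic name -> messages
        self.hdfs_files = defaultdict(list)    # file path  -> records

    def emit(self, sample: dict, level: int) -> None:
        if level == 1:
            self.kafka_topics["training_samples"].append(sample)
        else:
            self.hdfs_files["/samples/hourly"].append(sample)

router = SampleRouter()
router.emit({"stream_id": "s1"}, level=1)
router.emit({"stream_id": "s2"}, level=2)
```

In production the `level == 1` branch would be a Kafka producer send and the other branch an HDFS write, matching S1403 and S1404 above.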
On the other hand, fig. 5 is a flowchart of another training sample processing method provided in the first embodiment of the present application. Referring to fig. 5, the training sample processing method includes:
s210, obtaining dotting event data sent by a client and characteristic data of push content corresponding to the dotting event sent by a server through a stream data management system.
S220, storing the dotting event data and the corresponding characteristic data into the distributed message system, wherein the dotting event data and the corresponding characteristic data are configured with the same stream identification.
Specifically, dotting event data and characteristic data collected by the stream data management system are stored in Kafka.
S230, obtaining dotting event data, parsing the dotting event data through the stream computing engine to obtain the user response data for the recommended content corresponding to each dotting event, and using the user response data as the label of the corresponding dotting event data.
S240, acquiring feature data corresponding to the recommended content, and preprocessing the feature data through the stream computing engine based on a preset first preprocessing rule.
S250, storing the dotting event data in the distributed columnar database, and storing the labels of the dotting event data and the corresponding preprocessed feature data in the associated fields of the dotting event data.
S260, after all labels of the dotting event data have been stored in the corresponding associated fields, using the dotting event data and the data in the associated fields as training samples, and storing the training samples in the distributed message system or the distributed file system.
The steps S230 to S260 may refer to steps S110 to S140.
S270, using the dotting event data and the corresponding feature data as backup data, and storing the backup data in the distributed file system.
S280, preprocessing the backup data through the stream computing engine based on a preset second preprocessing rule to obtain preprocessed backup data, and storing the preprocessed backup data in the distributed file system.
Specifically, the dotting event data and the corresponding feature data are stored as backup data in the original logs on HDFS. When training sample production fails, or when the dotting event data and the corresponding feature data are needed for other processing, the backup data in the original logs can be used directly. Preprocessing the feature data and the dotting event data reduces the memory footprint of the backup data and the pressure on its cache. Furthermore, the preprocessed backup data are stored in HDFS so that they can be pulled from HDFS directly. Understandably, the backup data and the sample data are stored in different locations in HDFS.
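The patent does not specify the second preprocessing rule either; one plausible sketch, under the assumption that it merges the raw event with its features and strips bulky raw fields before re-archiving (all names here are illustrative), is:

```python
# Hypothetical "second preprocessing rule" for backup data: merge the raw
# event with its feature record and drop bulky raw fields before the
# preprocessed backup is written back to the file system.
def preprocess_backup(event: dict, features: dict, drop=("raw_payload",)) -> dict:
    merged = {**event, **features}
    for key in drop:
        merged.pop(key, None)  # bulky field removed to shrink the backup
    return merged

record = preprocess_backup(
    {"stream_id": "s1", "raw_payload": "x" * 1024},
    {"duration_bucket": 3},
)
```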
Exemplarily, referring to fig. 6, fig. 6 is a schematic diagram of the framework of a training sample processing system according to an embodiment of the present application. As shown in fig. 6, the server sends the feature data to the stream data management system, and the client sends the dotting event data to the stream data management system. The dotting event data and feature data in the stream data management system are stored into the distributed message system, from which the stream calculation engine pulls them and performs label analysis and feature preprocessing in parallel. The labels and preprocessed feature data produced by the stream calculation engine, together with the dotting event data, are stored into the associated fields of the distributed columnar database. According to a real-time sample generation strategy and an hour-level sample generation strategy, the dotting event data and its associated data are taken, at different time nodes, as sample data of a first response level and of a second response level respectively; the sample data of the first response level is stored in the distributed message system, and the sample data of the second response level is stored in the distributed file system. On the other hand, the dotting event data and feature data in the distributed message system are stored into the distributed file system as backup data; the stream calculation engine pulls and preprocesses the backup data, and the preprocessed backup data is stored back into the distributed file system. The recommendation model can then pull suitable sample data on demand from the store holding the corresponding sample data.
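The two sample generation strategies in fig. 6 can be sketched as follows. A deque stands in for a topic of the distributed message system and a list for the distributed file system; "first response level" samples (labels complete in real time) go to the message system, while the hour-level strategy flushes the accumulated window to the file system as "second response level" samples. The class and method names are assumptions for illustration.

```python
# Sketch of the real-time and hour-level sample generation strategies.
from collections import deque

class SampleRouter:
    def __init__(self):
        self.message_system = deque()  # stand-in: real-time sample topic
        self.file_system = []          # stand-in: hour-level sample files
        self.pending = []              # samples awaiting the hour-level flush

    def on_sample(self, sample, labels_complete):
        """Real-time strategy: route label-complete samples immediately."""
        if labels_complete:
            self.message_system.append(sample)  # first response level
        self.pending.append(sample)

    def hourly_flush(self):
        """Hour-level strategy: persist everything seen in the window."""
        self.file_system.append(list(self.pending))  # second response level
        self.pending.clear()
```

A real-time model consumer would read from `message_system` with low latency, while an hourly training job would read the flushed windows from `file_system`.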
Furthermore, to ensure that the tasks in each link of the system run stably, real-time monitoring and alarms are configured for all tasks during operation, covering the data sources, the logic inside each running task, the quality of the output data, data delay, and so on. A large number of monitoring indicators provide a detailed view of sample output, so that system faults are discovered and resolved in time, improving the reliability and stability of the system.
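A minimal sketch of that monitoring: each running task reports a few indicators (data delay, output rate, error ratio), and an alarm is raised whenever an indicator crosses its threshold. All thresholds and indicator names here are hypothetical, not values from the application.

```python
# Per-task indicator check: empty list means the task is healthy.
THRESHOLDS = {
    "max_delay_s": 300,         # data delay
    "min_output_per_min": 100,  # volume/quality of data output
    "max_error_ratio": 0.01,    # errors in the task's running logic
}

def check_task(name, metrics):
    """Return alarm messages for one task's reported metrics."""
    alarms = []
    if metrics["delay_s"] > THRESHOLDS["max_delay_s"]:
        alarms.append(f"{name}: data delay {metrics['delay_s']}s over limit")
    if metrics["output_per_min"] < THRESHOLDS["min_output_per_min"]:
        alarms.append(f"{name}: sample output rate too low")
    if metrics["error_ratio"] > THRESHOLDS["max_error_ratio"]:
        alarms.append(f"{name}: error ratio too high")
    return alarms
```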
In summary, dotting event data is acquired and parsed through a stream calculation engine to obtain the user response data for the recommended content corresponding to the dotting event, and the user response data is taken as the label corresponding to the dotting event data; the feature data corresponding to the recommended content is acquired and preprocessed through the stream calculation engine based on a preset first preprocessing rule; the dotting event data is stored into a distributed columnar database, and the labels of the dotting event data and the corresponding preprocessed feature data are stored into the associated fields of the dotting event data; after all the labels of the dotting event data have been stored in the corresponding associated fields, the dotting event data and the data in the associated fields are taken as training samples and stored in the distributed message system or the distributed file system. With this technical means, the dotting event data is parsed by the stream calculation engine so that its label information is collected quickly, and the feature data is preprocessed by the stream calculation engine so that its storage footprint and the subsequent cache pressure are reduced. Because the dotting event data and the feature data come from different data sources and are converged in the distributed columnar database before being output to the downstream model, faults can be discovered and resolved quickly, avoiding both the generation of a large amount of erroneous data that would severely affect the recommendation model and the large expenditure of manpower and material resources needed to roll the recommendation model back, thereby improving the stability of the training sample processing system.
Based on the stream calculation engine and the data convergence processing, training samples are output quickly and stably, guaranteeing the timeliness of the recommendation model and therefore of the recommendation system.
Example two:
Based on the above embodiments, fig. 7 is a schematic structural diagram of a training sample processing apparatus according to a second embodiment of the present application. Referring to fig. 7, the training sample processing apparatus provided in this embodiment specifically includes: a label analysis module 21, a feature preprocessing module 22, a data summarization module 23 and a training sample generation module 24.
The tag analysis module 21 is configured to acquire dotting event data, analyze the dotting event data through a stream calculation engine, analyze user response data of recommended content corresponding to a dotting event, and use the user response data as a tag corresponding to the dotting event data;
a feature preprocessing module 22, configured to acquire feature data corresponding to the recommended content, and preprocess the feature data through the stream calculation engine based on a preset first preprocessing rule, where the dotting event data and the feature data are stored in a distributed message system;
a data summarization module 23 configured to store the dotting event data into a distributed columnar database, and store the tags of the dotting event data and the corresponding pre-processing feature data into the associated fields of the dotting event data;
and a training sample generation module 24, configured to, after all the tags of the dotting event data are stored in the corresponding associated fields, take the data in the dotting event data and the associated fields as training samples, and store the training samples in the distributed message system or the distributed file system.
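The composition of the four modules above can be pictured as a simple pipeline, with each module reduced to a callable stage for brevity. The stage names mirror the description, but this wiring is an illustrative assumption, not the patented apparatus structure.

```python
# Illustrative wiring of modules 21-24 as sequential pipeline stages.
class TrainingSampleProcessor:
    def __init__(self, label_analysis, feature_preprocess, data_summarize, sample_generate):
        # modules 21-24, applied in order
        self.stages = [label_analysis, feature_preprocess, data_summarize, sample_generate]

    def run(self, batch):
        """Pass a batch of dotting events through all four modules."""
        for stage in self.stages:
            batch = stage(batch)
        return batch
```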
In the above, dotting event data is acquired and parsed through a stream calculation engine to obtain the user response data for the recommended content corresponding to the dotting event, and the user response data is taken as the label corresponding to the dotting event data; the feature data corresponding to the recommended content is acquired and preprocessed through the stream calculation engine based on a preset first preprocessing rule; the dotting event data is stored into a distributed columnar database, and the labels of the dotting event data and the corresponding preprocessed feature data are stored into the associated fields of the dotting event data; after all the labels of the dotting event data have been stored in the corresponding associated fields, the dotting event data and the data in the associated fields are taken as training samples and stored in the distributed message system or the distributed file system. With this technical means, the dotting event data is parsed by the stream calculation engine so that its label information is collected quickly, and the feature data is preprocessed by the stream calculation engine so that its storage footprint and the subsequent cache pressure are reduced. Because the dotting event data and the feature data come from different data sources and are converged in the distributed columnar database before being output to the downstream model, faults can be discovered and resolved quickly, avoiding both the generation of a large amount of erroneous data that would severely affect the recommendation model and the large expenditure of manpower and material resources needed to roll the recommendation model back, thereby improving the stability of the recommendation system.
Based on the stream calculation engine and the data convergence processing, training samples are output quickly and stably, guaranteeing the timeliness of the recommendation model and therefore of the recommendation system.
The training sample processing device provided by the second embodiment of the present application can be used for executing the training sample processing method provided by the first embodiment, and has corresponding functions and beneficial effects.
Example three:
an embodiment of the present application provides an electronic device, and with reference to fig. 8, the electronic device includes: an input device 33, an output device 34, a memory 32, and one or more processors 31; the memory 32 for storing one or more programs; when the one or more programs are executed by the one or more processors 31, the one or more processors 31 are enabled to implement the training sample processing method provided in the first embodiment. The electronic device provided above can be used to execute the training sample processing method provided in the first embodiment above, and has corresponding functions and advantages.
Example four:
embodiments of the present application also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, are configured to perform a training sample processing method. Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the training sample processing method described above, and may also perform related operations in the training sample processing method provided in any embodiment of the present application.
The foregoing is considered as illustrative of the preferred embodiments of the invention and the technical principles employed. The present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the claims.

Claims (10)

1. A method of training sample processing, comprising:
obtaining dotting event data, analyzing the dotting event data through a stream calculation engine, analyzing user response data of recommended contents corresponding to the dotting events, and taking the user response data as tags corresponding to the dotting event data;
acquiring feature data corresponding to the recommended content, and preprocessing the feature data through the stream calculation engine based on a preset first preprocessing rule;
storing the dotting event data into a distributed column database, and storing the labels of the dotting event data and the corresponding preprocessing characteristic data into the associated fields of the dotting event data;
and after all the labels of the dotting event data are stored in the corresponding associated fields, taking the data in the dotting event data and the associated fields as training samples, and storing the training samples in the distributed message system or the distributed file system.
2. The method of claim 1, wherein after all the tags of the dotting event data are stored in the corresponding associated fields, taking the data in the dotting event data and the associated fields as training samples comprises:
judging whether the tags of the dotting event data are all stored in the corresponding associated fields or not, and if the tags of the dotting event data are all stored in the corresponding associated fields, taking the data in the dotting event data and the associated fields as sample data of a first response level;
and at a first time node after the tag begins to be stored in the distributed columnar database, taking the dotting event data and the data in the associated field as sample data of a second response level.
3. The method of claim 2, wherein storing the training samples into the distributed messaging system or a distributed file system comprises:
storing the sample data of the first response level into a theme message unit of the distributed message system, wherein the theme message unit is used for storing message data of the same theme;
and storing the sample data of the second response level into the distributed file system.
4. The method of claim 1, further comprising, prior to said obtaining dotting event data:
acquiring dotting event data sent by a client and characteristic data of push content corresponding to a dotting event sent by a server through a stream data management system;
and storing the dotting event data and the corresponding characteristic data into the distributed message system, wherein the dotting event data and the corresponding characteristic data are configured with the same stream identification.
5. The method of claim 4, wherein storing the tag of the dotting event data and the corresponding preprocessed feature data into associated fields of the dotting event data, respectively, comprises:
determining a storage row of the dotting event data in the distributed columnar database according to the flow identification of the dotting event data at a second time node after the dotting event data is stored in the distributed columnar database;
and storing the label and the pre-processing characteristic data into the rest columns of the storage row, wherein the label and the pre-processing characteristic data are configured with the same stream identification as the dotting event data.
6. The method of claim 4, further comprising, after said storing said dotting event data and corresponding characteristic data into said distributed messaging system:
and taking the dotting event data and the corresponding characteristic data as backup data, and storing the backup data into the distributed file system.
7. The method of claim 6, further comprising, after said storing said backup data in a distributed file system:
and preprocessing the backup data through the stream calculation engine based on a preset second preprocessing rule to obtain preprocessed backup data, and storing the preprocessed backup data into the distributed file system.
8. A training sample processing apparatus, comprising:
the system comprises a tag analysis module, a flow calculation engine and a data processing module, wherein the tag analysis module is configured to acquire dotting event data, analyze the dotting event data through the flow calculation engine, analyze user response data of recommended contents corresponding to a dotting event, and use the user response data as tags corresponding to the dotting event data;
the characteristic preprocessing module is configured to acquire characteristic data corresponding to the recommended content, preprocess the characteristic data through the stream calculation engine based on a preset first preprocessing rule, and store the dotting event data and the characteristic data in a distributed message system;
a data summarization module configured to store the dotting event data into a distributed columnar database, and store tags of the dotting event data and corresponding pre-processing feature data into associated fields of the dotting event data;
and the training sample generation module is configured to take the dotting event data and the data in the associated fields as training samples and store the training samples in the distributed message system or the distributed file system after all the labels of the dotting event data are stored in the corresponding associated fields.
9. An electronic device, comprising:
a memory and one or more processors;
the memory for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the training sample processing method of any of claims 1-7.
10. A storage medium containing computer-executable instructions for performing the training sample processing method of any of claims 1-7 when executed by a computer processor.
CN202110197693.8A 2021-02-22 2021-02-22 Training sample processing method, device, equipment and storage medium Pending CN112925947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110197693.8A CN112925947A (en) 2021-02-22 2021-02-22 Training sample processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112925947A true CN112925947A (en) 2021-06-08

Family

ID=76170054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110197693.8A Pending CN112925947A (en) 2021-02-22 2021-02-22 Training sample processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112925947A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165282A (en) * 2018-08-01 2019-01-08 王冠 A kind of network data grasping means and system
CN109408347A (en) * 2018-09-28 2019-03-01 北京九章云极科技有限公司 A kind of index real-time analyzer and index real-time computing technique
CN110555076A (en) * 2019-08-22 2019-12-10 上海数禾信息科技有限公司 Data marking method, processing method and device
CN110728370A (en) * 2019-09-16 2020-01-24 北京达佳互联信息技术有限公司 Training sample generation method and device, server and storage medium
CN111666490A (en) * 2020-04-28 2020-09-15 中国平安财产保险股份有限公司 Information pushing method, device, equipment and storage medium based on kafka

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨成: "Design and Research of a Personalized Recommendation System Based on a Ranking Model", China Master's Theses Full-text Database, Information Science and Technology, no. 11, 15 November 2018 (2018-11-15), pages 1-68 *
王姣尧 et al.: "Fast TLD Visual Object Tracking Using Kernelized Correlation Filtering", Journal of Image and Graphics, vol. 23, no. 11, pages 1686-1696 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705629A (en) * 2021-08-09 2021-11-26 北京三快在线科技有限公司 Training sample generation method and device, storage medium and electronic equipment
CN116910567A (en) * 2023-09-12 2023-10-20 腾讯科技(深圳)有限公司 Online training sample construction method and related device for recommended service
CN116910567B (en) * 2023-09-12 2024-03-15 腾讯科技(深圳)有限公司 Online training sample construction method and related device for recommended service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination