CN113705629A - Training sample generation method and device, storage medium and electronic equipment

Info

Publication number: CN113705629A
Application number: CN202110909830.6A
Authority: CN
Inventors: 刘磊; 仝晔
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-11-26
Anticipated expiration: 2041-08-09
Also published as: CN113705629B

Abstract

The embodiments of the application provide a training sample generation method and device, a storage medium and electronic equipment, which relate to the technical field of data processing and aim to generate high-quality training samples. The method comprises: acquiring exposure data and storing the exposure data in a columnar storage engine, wherein the exposure data carries a request ID; querying a cache that stores a plurality of request data for request data whose request ID is the same as the request ID carried by the exposure data; if such request data exists, storing the retrieved request data in the row of the columnar storage engine that holds the exposure data; and generating a training sample from each row in the columnar storage engine.

Description

Training sample generation method and device, storage medium and electronic equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a training sample generation method and apparatus, a storage medium, and an electronic device.

Background

In artificial intelligence applications such as internet search, recommendation and advertising, a neural network model infers the user's intention from the user's online requests or behavior and gives a personalized response or feedback, helping the user complete actions such as clicking and placing an order. The neural network model is trained on a large number of training samples, so collecting and processing training samples efficiently and accurately is critical to both the efficiency and the effect of model training.

The training sample collection schemes in the related art suffer from high memory usage, data redundancy, data delay and similar defects.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a training sample generation method, apparatus, storage medium, and electronic device, so as to overcome the above problems or at least partially solve the above problems.

In a first aspect of the embodiments of the present invention, a training sample generation method is provided, where the method includes:

acquiring exposure data, and storing the exposure data in a columnar storage engine, wherein the exposure data carries a request ID;

querying whether request data with the same request ID as the request ID carried by the exposure data exists in a cache storing a plurality of request data;

in a case where request data with the same request ID as the request ID carried by the exposure data exists, storing the retrieved request data in the row of the exposure data in the columnar storage engine;

generating a training sample according to each row in the columnar storage engine.

Optionally, the method further comprises:

storing the row key of the exposure data in the columnar storage engine into a KV storage engine, wherein the row key of the columnar storage engine comprises the request ID, and the KV storage engine takes the request ID contained in the row key as the Key and the other information contained in the row key as the Value;

querying whether request data with the same request ID as the request ID carried by the exposure data exists in a cache storing a plurality of request data comprises:

reading the Key from the KV storage engine;

and comparing the request IDs carried by the plurality of request data in the cache with the read Key, respectively, to determine whether request data with the same request ID as that carried by the exposure data exists.

Optionally, before acquiring the exposure data, the method further comprises:

acquiring request data carrying a request ID;

caching the acquired request data in a message queue in the order of acquisition;

and reading the request data from the message queue one by one, and comparing the request ID carried by each read request data with the request ID carried by the exposure data, to determine whether request data with the same request ID as that carried by the exposure data exists.

Optionally, the method further comprises:

acquiring feature data associated with the request data;

associating the request data with the feature data to obtain associated data;

caching the acquired request data in a message queue in the order of acquisition comprises:

caching the associated data corresponding to the acquired request data in the message queue in the order of acquisition.

Optionally, after acquiring the exposure data, the method further comprises:

acquiring user behavior data, wherein the user behavior data carries a request ID;

querying whether the user behavior data includes user behavior data with the same request ID as that carried by the exposure data;

and in a case where user behavior data with the same request ID as that carried by the exposure data exists, storing the retrieved user behavior data in the row of the exposure data in the columnar storage engine.

Optionally, the method further comprises:

determining the exposure data with the same request ID as that carried by the user behavior data, wherein the determining comprises:

reading the Key from the KV storage engine;

and comparing the request ID carried by the user behavior data with the read Key to determine the exposure data with the same request ID as that carried by the user behavior data.

Optionally, the method further comprises:

deleting, at preset time intervals, request data in the cache whose request ID differs from the request ID carried by the exposure data.

In a second aspect of the embodiments of the present invention, there is provided a training sample generation apparatus, including:

an acquisition module, configured to acquire exposure data and store the exposure data in a columnar storage engine, wherein the exposure data carries a request ID;

a query module, configured to query whether request data with the same request ID as that carried by the exposure data exists in a cache storing a plurality of request data;

a storage module, configured to store, in a case where request data with the same request ID as the exposure data is found, the retrieved request data in the row of the exposure data in the columnar storage engine;

and a generating module, configured to generate a training sample according to each row in the columnar storage engine.

In a third aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the training sample generation method disclosed in the embodiments of the present application.

In a fourth aspect of the embodiments of the present invention, an electronic device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the training sample generation method disclosed in the embodiments of the present application is implemented.

The embodiment of the invention has the following advantages:

In this embodiment, exposure data may be acquired and stored in a columnar storage engine, wherein the exposure data carries a request ID; a cache storing a plurality of request data is queried for request data with the same request ID as the request ID carried by the exposure data; if such request data exists, the retrieved request data is stored in the row of the exposure data in the columnar storage engine; and a training sample is generated according to each row in the columnar storage engine.

In this manner, exposure data and request data having the same request ID are stored in the same row of the columnar storage engine, and a training sample can be generated from each row. Storing data in the columnar storage engine reduces memory pressure; storing only request data whose request ID matches saves storage space in the columnar storage engine; processing only the small amount of stored request data avoids data redundancy and saves computing resources; the corresponding request data can be obtained immediately after the exposure data is acquired in real time, which avoids data delay; and the resulting training samples can be used for both offline model training and streaming model training, meeting the needs of various kinds of model training.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart illustrating the steps of a training sample generation method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a method for generating training samples according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a training sample format according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a training sample generating device according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

Each time a user sends a request to an online service system from a terminal, the system calls the model inference service for recommendation, search or advertising, and the instrumentation along the whole link reports two types of data: traffic data and request data. Traffic data comprises exposure data and user behavior data. Exposure data contains fields such as an exposure identifier and a request ID, and refers to data displayed on the user terminal. User behavior data contains fields such as a behavior identifier and a request ID, and refers to data the user has acted on through operations such as clicking or collecting, or data displayed on a page for a long time. Request data contains a request ID and a request context, and refers to data that the user has requested and that may be presented.

By generation time, the three kinds of data are produced in the following order: request data (first) -> exposure data (next) -> user behavior data (last). Together they are the raw material for producing model training samples.

For a single user action, the resulting request data, exposure data and user behavior data are in a funnel relationship: request data : exposure data : user behavior data = R : N : M, where R > N > M. For example, a user opens the home page of a take-away app (application); the backend recalls hundreds to thousands of merchants according to the user's location, and the data of each merchant passes through inference in the (recommendation/search/advertisement) model service, generating a large amount of request data. Only part of the request data is selected and returned for display to the user, and the displayed data is the exposure data. After viewing an exposed merchant, the user may or may not perform an operation (for example, a click); only data that has been operated on counts as user behavior data.

The three kinds of data relate to the same user action and, for example, all contain the same request ID. They can be associated through the request ID and, after some specific processing logic, finally form a model training sample.

In the related art, training samples are collected by offline batch processing or by real-time stream processing.

With offline batch processing, a large amount of request data, exposure data, user behavior data, feature data and the like must be loaded into memory for association. As the business iterates and the data grows, service memory becomes a bottleneck, and OOM (out of memory) errors easily disrupt data output. Because of the large data volume, the whole computing task is time-consuming, taking hours or even tens of hours. The reported real-time data must first be converted into offline data, so the previous day's full data becomes available only once a day at the earliest, which delays the data source by one day. This delay affects downstream use, lengthens the model's iteration cycle and degrades the model's effect.

With real-time stream processing, the three kinds of data are generated in sequence, so associating them at the same time requires the data that arrives first to be held in memory while waiting for the data that arrives later. The later the data (such as user behavior data) arrives, the more request data or exposure data is backlogged in memory, leaving the computing engine short of memory and prone to OOM. If the backlogged data is discarded to avoid OOM, the numbers of positive and negative samples in the training samples become unbalanced, which affects the training effect of the model. Furthermore, if real-time stream processing is used to collect both streaming training samples and offline training samples, the two must be stored separately, and storing them separately consumes a large amount of computing resources.

The training sample collection methods in the related art occupy a large amount of memory, produce redundant data, suffer from data delay, and can serve only offline model training or only online model training. To solve these problems, the applicant proposes the following: external storage (a columnar storage engine, a KV storage engine and a message queue) is introduced to hold the data to be associated instead of memory, and the processing flow and data structures are designed so that performance is improved, data redundancy is reduced and data loss is avoided, finally yielding a general-purpose training sample collection method.

Referring to FIG. 1, which shows a flowchart of the steps of a training sample generation method according to an embodiment of the present invention, the training sample generation method may specifically include the following steps:

Step S110: acquiring exposure data, and storing the exposure data in a columnar storage engine, wherein the exposure data carries a request ID.

In the training samples, the useful data is mainly exposure data and user behavior data: positive samples are user behavior data, and negative samples are data that was exposed but not operated on by the user. The amount of exposure data is generally larger than the amount of user behavior data, so if user behavior data were the main driving stream, some negative samples would be missed, that is, less exposure data would be retained. If request data were the main driving stream, much useless request data would be retained, because the amount of request data is far larger than the amount of exposure data. The applicant therefore proposes using exposure data as the main driving stream.
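The trade-off between the three candidate driving streams can be illustrated with a small calculation. This is only a sketch under a simplified model: the counts R, N and M are hypothetical numbers chosen to respect the funnel relationship R > N > M, and the function name `rows_kept` is an illustrative assumption rather than part of the method.

```python
def rows_kept(driver: str, r: int, n: int, m: int) -> dict:
    """Count what is stored when one stream drives the association.

    r, n, m are the numbers of request, exposure and behavior records
    for one user action (hypothetical values, with r > n > m).
    """
    if driver == "behavior":
        # Only rows with a user action exist, so items that were exposed
        # but never acted on (the negatives) are missed in this model.
        return {"rows": m, "positives": m, "negatives": 0, "useless": 0}
    if driver == "exposure":
        # One row per exposed item: positives plus exposed-but-ignored negatives.
        return {"rows": n, "positives": m, "negatives": n - m, "useless": 0}
    if driver == "request":
        # One row per requested item: most rows were never shown to the user.
        return {"rows": r, "positives": m, "negatives": n - m, "useless": r - n}
    raise ValueError(f"unknown driving stream: {driver}")


R, N, M = 1000, 20, 2  # hypothetical funnel for one home-page visit
summary = {d: rows_kept(d, R, N, M) for d in ("behavior", "exposure", "request")}
for driver, counts in summary.items():
    print(driver, counts)
```

Under these assumed numbers, only the exposure-driven variant keeps every negative sample without also keeping rows that were never displayed.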

The columnar storage engine is external storage. Storing data in it effectively reduces memory pressure and avoids OOM (out of memory). Columnar storage saves more space than row-oriented storage, searches data within a column efficiently, can use any column as an index, and minimizes irrelevant input and output during queries, avoiding full-table scans.

Each piece of exposure data carries an exposure identifier when it is acquired, and is stored directly in the columnar storage engine HBase (a database). The exposure data carries a request ID. The request ID is present and unique in the request data, the exposure data and the user behavior data, and may include an ID of the requested content and a user ID. For the same piece of data, the request ID stays unchanged whether the data appears as request data, exposure data or user behavior data. For example, merchant data carries a merchant ID; when a user requests the merchant data, it also carries the user ID, and the request ID of the merchant data can be obtained from the merchant ID and the user ID. When the merchant data is exposed after being requested, it carries an exposure identifier, but the request ID is unchanged. When the user clicks the merchant data after it is exposed, the exposure identifier becomes a behavior identifier, which indicates that the user clicked or collected the merchant data, and the request ID still remains unchanged.

Step S120: querying whether request data with the same request ID as the request ID carried by the exposure data exists in a cache storing a plurality of request data.

Request data is generated earlier than exposure data, so a plurality of request data has already been cached before the exposure data is acquired. After the exposure data is acquired, the cached request data is queried for request data whose request ID is the same as the request ID carried by the exposure data.

Step S130: in a case where request data with the same request ID as the request ID carried by the exposure data exists, storing the retrieved request data in the row of the exposure data in the columnar storage engine.

If request data with the same request ID as the exposure data is found, each piece of retrieved request data is stored in the row of the columnar storage engine that holds the exposure data with that request ID.

Step S140: generating a training sample according to each row in the columnar storage engine.

At query time, all data of a row can be obtained from the row key of that row alone. A training sample can be obtained from the data of each row in the columnar storage engine.

Based on the exposure identifier in each row of data in the columnar storage engine, a labeled training sample is obtained, the label indicating that the row of data was exposed.

With the technical solution of this embodiment, exposure data can be acquired and stored in a columnar storage engine, wherein the exposure data carries a request ID; a cache storing a plurality of request data is queried for request data with the same request ID as the request ID carried by the exposure data; if such request data exists, the retrieved request data is stored in the row of the exposure data in the columnar storage engine; and a training sample is generated according to each row in the columnar storage engine.

In this manner, exposure data and request data having the same request ID are stored in the same row of the columnar storage engine, and a training sample can be generated from each row. Storing data in the columnar storage engine reduces memory pressure; storing only request data whose request ID matches saves storage space in the columnar storage engine; processing only the small amount of stored request data avoids data redundancy and saves computing resources; the corresponding request data can be obtained quickly after the exposure data is acquired in real time, which avoids data delay; and the resulting training samples can be used for both offline model training and streaming model training, meeting the needs of various kinds of model training.
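Steps S110 to S140 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: plain Python dictionaries and lists stand in for the columnar storage engine and the cache, the row key is simplified to the bare request ID, and all function and field names (`store_exposure`, `attach_request`, `request_id`, `context`) are hypothetical.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the columnar storage engine:
# row key -> {column name: value}. A real deployment would use external storage.
ColumnarStore = Dict[str, Dict[str, object]]


def store_exposure(store: ColumnarStore, exposure: dict) -> str:
    """S110: write one exposure record into its own row."""
    row_key = exposure["request_id"]  # simplified row key (assumption)
    store.setdefault(row_key, {})["exposure"] = exposure
    return row_key


def attach_request(store: ColumnarStore, cache: List[dict], row_key: str) -> bool:
    """S120/S130: find cached request data with the same request ID and,
    if it exists, store it in the exposure's row."""
    for request in cache:
        if request["request_id"] == row_key:
            store[row_key]["request"] = request
            return True
    return False


def generate_samples(store: ColumnarStore) -> List[dict]:
    """S140: emit one training sample per row."""
    return [{"row_key": key, **columns} for key, columns in store.items()]


cache = [
    {"request_id": "r1", "context": "home page"},
    {"request_id": "r2", "context": "home page"},  # never exposed
    {"request_id": "r3", "context": "search"},
]
store: ColumnarStore = {}
for exposure in [{"request_id": "r1"}, {"request_id": "r3"}]:
    key = store_exposure(store, exposure)
    attach_request(store, cache, key)

samples = generate_samples(store)
```

Because exposure data drives the flow, the request `r2`, which was never exposed, never reaches the store.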

To make querying for request data with the same request ID as the exposure data more efficient, the applicant proposes introducing a KV storage engine, which is a kind of external storage, to improve retrieval performance.

Optionally, as an embodiment, the training sample generation method further includes:

Step S210: storing the row key of the exposure data in the columnar storage engine into a KV storage engine, wherein the row key of the columnar storage engine comprises the request ID, and the KV storage engine takes the request ID contained in the row key as the Key and the other information contained in the row key as the Value.

Since the request ID does not change, the row key (rowkey) of the columnar storage engine includes the request ID and may take the form service identifier_timestamp_request ID. The Key of the KV (key-value) storage engine is the request ID, and the Value is the part of the row key other than the request ID, which may take the form service identifier_timestamp.

Each time a piece of exposure data is acquired, it is stored in the columnar storage engine, and the components of its row key are then written to the KV storage engine.
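The row key layout and its split into a KV entry can be sketched as below. The underscore separator follows the example form given in the text; the function names, the sample service identifier and the millisecond timestamp are assumptions made for illustration, and a dictionary stands in for the KV storage engine.

```python
from typing import Dict, Optional, Tuple

SEP = "_"  # separator taken from the example row key form in the text


def build_row_key(service_id: str, timestamp_ms: int, request_id: str) -> str:
    """Row key of the columnar store: service identifier, timestamp, request ID."""
    return SEP.join([service_id, str(timestamp_ms), request_id])


def to_kv_entry(service_id: str, timestamp_ms: int, request_id: str) -> Tuple[str, str]:
    """KV entry: Key is the request ID, Value is the rest of the row key."""
    return request_id, SEP.join([service_id, str(timestamp_ms)])


def rebuild_row_key(kv: Dict[str, str], request_id: str) -> Optional[str]:
    """Recover the full row key from the KV store, or None if no exposure exists."""
    value = kv.get(request_id)
    if value is None:
        return None
    return value + SEP + request_id


kv_store: Dict[str, str] = {}  # hypothetical stand-in for the KV storage engine
key, value = to_kv_entry("takeaway", 1628467200000, "req42")
kv_store[key] = value
row_key = rebuild_row_key(kv_store, "req42")
```

Rebuilding the row key by concatenation, rather than by splitting a stored row key on the separator, keeps the sketch correct even if an identifier itself contains an underscore.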

Step S220: querying whether request data with the same request ID as the request ID carried by the exposure data exists in a cache storing a plurality of request data comprises:

Step S221: reading the Key from the KV storage engine.

Step S222: comparing the request IDs carried by the plurality of request data in the cache with the read Key, respectively, to determine whether request data with the same request ID as that carried by the exposure data exists.

To query through the KV store whether request data with the same request ID as the exposure data exists, each Key is first read from the KV store, and the request IDs carried by the cached request data are then compared with the read Keys. If a request ID is identical to a Key, request data with the same request ID as the exposure data exists. The row key of the columnar storage engine is obtained from the Key and its corresponding Value in the KV store, and the request data carrying that request ID is stored in the row identified by that row key.

With the technical solution of this embodiment, the request ID in the row key of the columnar storage engine is used directly as the Key of the KV storage engine. Compared with directly querying whether the cache holds request data with the same request ID as the exposure data, performing the query through the Key of the KV storage engine offers low latency and high retrieval performance.
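Steps S221 and S222 might look like the following sketch. As an assumption, a dictionary lookup stands in for reading Keys from the KV storage engine and comparing them one by one, and dictionaries stand in for both stores; `join_requests_via_kv` and the field names are hypothetical.

```python
from typing import Dict, Iterable, List


def join_requests_via_kv(
    requests: Iterable[dict],
    kv: Dict[str, str],
    store: Dict[str, Dict[str, object]],
) -> List[str]:
    """Store each cached request whose request ID appears as a Key in the KV store.

    Returns the row keys that received request data.
    """
    matched = []
    for request in requests:
        request_id = request["request_id"]
        value = kv.get(request_id)  # S221/S222: is this request ID a known Key?
        if value is None:
            continue  # no exposure with this request ID, so nothing is stored
        row_key = f"{value}_{request_id}"  # Value + Key gives the row key
        store.setdefault(row_key, {})["request"] = request
        matched.append(row_key)
    return matched


kv = {"r1": "takeaway_100"}
store = {"takeaway_100_r1": {"exposure": {"request_id": "r1"}}}
cached = [
    {"request_id": "r1", "context": "home page"},
    {"request_id": "r2", "context": "home page"},
]
matched = join_requests_via_kv(cached, kv, store)
```

The columnar store is touched only for request IDs that already have an exposure row, which is where the retrieval benefit described above would come from.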

To further avoid OOM, the applicant proposes using a message queue, which is external storage, as the storage space for caching request data.

Optionally, as an embodiment, before acquiring the exposure data, the method further includes:

Step S310: acquiring request data carrying a request ID.

Request data is generated before exposure data and can therefore be acquired before the exposure data is acquired.

Step S320: caching the acquired request data in a message queue in the order of acquisition.

Using the sequential read/write capability of the message queue, the acquired request data is cached in the message queue in the order in which it was acquired, to wait for the exposure data.

Step S330: querying whether request data with the same request ID as the request ID carried by the exposure data exists in a cache storing a plurality of request data comprises: reading the request data from the message queue one by one, and comparing the request ID carried by each read request data with the request ID carried by the exposure data, to determine whether request data with the same request ID as that carried by the exposure data exists.

After the exposure data is acquired and stored in the columnar storage engine, the cache storing a plurality of request data is queried as follows: request data is read from the message queue one by one, its request ID is obtained and compared with the request ID of the exposure data; if the two are the same, request data with the same request ID as the exposure data exists in the message queue.

The retrieved request data is then stored in the row of the columnar storage engine that holds the exposure data, and a training sample is generated from each row in the columnar storage engine.

With the technical solution of this embodiment, acquired request data is first stored in the message queue. After a preset delay, a delayed-consumption service queries whether request data with the same request ID as the exposure data exists and, if so, stores the request data in different columns under the row key that has the same request ID. Through the delayed-consumption service, request data is written to the columnar storage engine only after the exposure data has had time to arrive, and only if it carries the same request ID as some exposure data; request data with other request IDs is filtered out, which saves storage space and computing resources in the columnar storage engine.
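The delayed-consumption idea can be sketched with an in-process queue. This is an illustrative assumption only: `collections.deque` stands in for the external message queue, a set of request IDs stands in for the lookup against exposure data, and the delay, timestamps and names (`consume_delayed`, `delay_s`) are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Set, Tuple

QueuedRequest = Tuple[float, dict]  # (arrival time in seconds, request data)


def consume_delayed(
    queue: Deque[QueuedRequest],
    exposed_ids: Set[str],
    store: Dict[str, Dict[str, object]],
    now: float,
    delay_s: float,
) -> Tuple[int, int]:
    """Consume, in arrival order, every request that has waited at least delay_s.

    Requests whose request ID matches an exposure are stored in that row;
    the others are filtered out. Returns (kept, dropped).
    """
    kept = dropped = 0
    while queue and now - queue[0][0] >= delay_s:
        _, request = queue.popleft()
        request_id = request["request_id"]
        if request_id in exposed_ids:
            store.setdefault(request_id, {})["request"] = request
            kept += 1
        else:
            dropped += 1
    return kept, dropped


queue: Deque[QueuedRequest] = deque([
    (0.0, {"request_id": "r1"}),
    (1.0, {"request_id": "r2"}),
    (9.0, {"request_id": "r3"}),  # too recent, stays queued
])
exposed_ids = {"r1", "r3"}
store = {rid: {"exposure": {"request_id": rid}} for rid in exposed_ids}
kept, dropped = consume_delayed(queue, exposed_ids, store, now=10.0, delay_s=5.0)
```

Passing `now` explicitly keeps the sketch deterministic; a running service would presumably read a clock instead.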

Optionally, as an embodiment, the method further includes:

Step S410: acquiring feature data associated with the request data;

Step S420: associating the request data with the feature data to obtain associated data;

Step S430: caching the acquired request data in a message queue in the order of acquisition comprises: caching the associated data corresponding to the acquired request data in the message queue in the order of acquisition.

Requested data has many characteristics. So that the neural network model trained on the generated training samples judges more accurately, the feature data of the request data can be retained; feature data refers to labels that describe the request data. When the requested data is a merchant, the feature data may be labels such as chain store, fast food store, snack store and stir-fry store; when the requested data is a video, the feature data may be tags such as spicy, sweet, numbing, snack, main menu and barbecue.

When request data is generated, the feature data associated with it is obtained. The association with the feature data is performed by a stream processing framework. Using the sequential read/write capability of the message queue, the associated data is written to the message queue for caching in the order in which the request data was acquired.

With the technical solution of this embodiment, feature data and request data are associated to obtain associated data, which is written to the message queue for caching; the message queue is then queried for request data with the same request ID as the request ID carried by the exposure data, the corresponding associated data is stored in the columnar storage engine, and the training sample is finally generated from the data in the columnar storage engine. The generated training sample therefore contains the feature data, and the trained neural network model judges more accurately.
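Steps S410 to S430 can be illustrated as follows. The sketch assumes that feature data is looked up by an item identifier carried in the request data; the field `item_id`, the feature table and the function `associate` are hypothetical, and `collections.deque` again stands in for the message queue rather than for a stream processing framework.

```python
from collections import deque
from typing import Deque, Dict, List


def associate(request: dict, feature_table: Dict[str, List[str]]) -> dict:
    """S410/S420: attach the feature labels of the requested item to the request."""
    features = feature_table.get(request["item_id"], [])
    return {**request, "features": list(features)}


# Hypothetical feature labels keyed by merchant ID.
feature_table = {"m1": ["chain store", "fast food store"]}

requests = [
    {"request_id": "r1", "item_id": "m1"},
    {"request_id": "r2", "item_id": "m2"},  # no known features
]

# S430: cache the associated data, not the bare request, in acquisition order.
message_queue: Deque[dict] = deque()
for request in requests:
    message_queue.append(associate(request, feature_table))
```

Each queued element keeps its request ID, so the later comparison against exposure data works on the associated data unchanged.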

The useful data in the training samples includes user behavior data, so user behavior data is also acquired to generate the training samples. Optionally, as an embodiment, after acquiring the exposure data, the method further includes:

Step S510: acquiring user behavior data, wherein the user behavior data carries a request ID.

User behavior data is data obtained after exposure data is clicked or collected, or data whose exposure duration reaches a preset duration, so it is acquired after the exposure data. The user behavior data carries a behavior identifier and a request ID.

Step S520: querying whether the user behavior data includes user behavior data with the same request ID as that carried by the exposure data.

To prevent user behavior data that does not correspond to any exposure data in the columnar storage engine from being stored there, it is necessary to first query whether the user behavior data includes user behavior data with the same request ID as that carried by the exposure data.

Step S530: in a case where user behavior data with the same request ID as that carried by the exposure data exists, storing the retrieved user behavior data in the row of the exposure data in the columnar storage engine.

If user behavior data with the same request ID as the exposure data is found, each piece of retrieved user behavior data is stored in the row of the columnar storage engine that holds the exposure data with that request ID, and the behavior identifier is stored in that row.

A training sample is generated according to each row in the columnar storage engine. At query time, all data of a row can be obtained from the row key of that row alone, and a training sample can be obtained from the data of each row. Whether a behavior identifier is present among the identifiers of a row shows whether that row of data was clicked, collected or exposed for a long time.

With the technical solution of this embodiment, user behavior data can be acquired and stored in the columnar storage engine. In this way, for both offline model training and streaming model training, all exposure data and user behavior data of a row can be obtained simply by traversing that row. In addition, when this embodiment is combined with the foregoing embodiments, traversing each row in the columnar storage engine yields all exposure data, user behavior data, request data, feature data and the like of that row, and the identifiers of the row show whether the user clicked, collected or viewed the row of data for a long time.
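The behavior path and the resulting labels can be sketched as below. The label convention (1 when a behavior identifier is present in the row, 0 when the row was only exposed) follows the positive and negative sample definitions given earlier; the dictionary store, the simplified row key and the names `attach_behavior` and `to_labeled_sample` are assumptions for illustration.

```python
from typing import Dict, List

Row = Dict[str, dict]


def attach_behavior(store: Dict[str, Row], behavior: dict) -> bool:
    """S520/S530: store behavior data only if an exposure row with the
    same request ID already exists."""
    row = store.get(behavior["request_id"])
    if row is None:
        return False  # no matching exposure, so the behavior data is not stored
    row["behavior"] = behavior
    return True


def to_labeled_sample(row_key: str, row: Row) -> dict:
    """Label a row: 1 if it holds behavior data, 0 if it was only exposed."""
    return {"row_key": row_key, "label": 1 if "behavior" in row else 0}


store: Dict[str, Row] = {
    "r1": {"exposure": {"request_id": "r1"}},
    "r2": {"exposure": {"request_id": "r2"}},
}
behaviors = [
    {"request_id": "r1", "behavior_id": "click"},
    {"request_id": "r9", "behavior_id": "click"},  # no exposure row
]
results = [attach_behavior(store, b) for b in behaviors]
samples: List[dict] = [to_labeled_sample(key, row) for key, row in store.items()]
```

Both the positive row `r1` and the negative row `r2` survive, which is the sample balance the exposure-driven design is meant to preserve.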

Optionally, as an embodiment, the method further includes:

Step S610: storing the row key of the exposure data in the columnar storage engine into a KV storage engine, wherein the row key of the columnar storage engine includes a request ID, and the KV storage engine takes the request ID carried by the exposure data and contained in the row key as the Key and takes the other information contained in the row key as the Value;

Step S620: determining the exposure data carrying the same request ID as the user behavior data, which includes:

Step S621: reading a Key from the KV storage engine;

Step S622: comparing the request ID carried by the user behavior data with the read Key to determine the exposure data carrying the same request ID as the user behavior data.

Since the request ID does not change, the row key of the columnar storage engine includes the request ID and may take the form service identifier_timestamp_request ID. The Key of the KV storage engine is the request ID, and the Value is the remaining information in the row key other than the request ID, which may be service identifier_timestamp.

To query, via the KV storage engine, whether user behavior data carrying the same request ID as the exposure data exists in the cache, each Key may first be read from the KV storage engine, and the request ID carried by the user behavior data in the cache is then compared with the read Key. If a request ID identical to the Key exists, user behavior data carrying the same request ID as the exposure data exists. The row key of the columnar storage engine is then obtained from the Key and its corresponding Value in the KV storage engine, and the user behavior data carrying that request ID is stored in the row identified by that row key.

By adopting the technical solution of this embodiment of the application, the request ID in the row key of the columnar storage engine is used directly as the Key of the KV storage engine. Compared with directly querying whether the cache holds user behavior data carrying the same request ID as the exposure data, querying through the Key of the KV storage engine has the advantages of low latency and high retrieval performance.
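As a non-limiting illustration, the relationship between the row key and the KV entry can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the service identifier and the timestamp contain no underscore (the request ID may contain one).

```python
from typing import Dict, Optional, Tuple


def split_row_key(row_key: str) -> Tuple[str, str]:
    """Split 'service_timestamp_requestID' into (Key, Value) for the KV storage."""
    service, timestamp, request_id = row_key.split("_", 2)
    return request_id, f"{service}_{timestamp}"


def join_row_key(key: str, value: str) -> str:
    """Splice the Value and the Key back into the columnar row key."""
    return f"{value}_{key}"


def lookup_row_key(kv: Dict[str, str], request_id: str) -> Optional[str]:
    """Return the row key for a request ID, or None if no exposure was stored."""
    value = kv.get(request_id)
    return None if value is None else join_row_key(request_id, value)


kv: Dict[str, str] = {}
k, v = split_row_key("svcA_1628467200_req_1")
kv[k] = v
found = lookup_row_key(kv, "req_1")
missing = lookup_row_key(kv, "req_2")
```

The sketch uses a direct lookup by Key rather than reading and comparing every Key; either approach yields the same row key when a match exists.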

Optionally, as an embodiment, the method further includes: deleting, at preset time intervals, the request data in the cache whose request ID differs from the request ID carried by the exposure data.

To reduce the pressure of storing request data, once all request data in the cache or message queue that carries the same request ID as the exposure data has been stored in the columnar storage engine, the request data in the cache or message queue whose request ID differs from the request ID carried by the exposure data can be deleted at preset time intervals, thereby removing redundant request data.
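As a non-limiting illustration, the periodic deletion can be sketched as follows. The class `RequestCache`, its method names, and the convention of passing the current time explicitly are assumptions made for the sketch, not part of the application.

```python
from typing import Dict, Set


class RequestCache:
    """Hypothetical request-data cache purged at a preset interval."""

    def __init__(self, purge_interval: float) -> None:
        self.purge_interval = purge_interval
        self._entries: Dict[str, dict] = {}
        self._last_purge = 0.0

    def put(self, request: dict) -> None:
        self._entries[request["request_id"]] = request

    def __contains__(self, request_id: str) -> bool:
        return request_id in self._entries

    def maybe_purge(self, now: float, exposure_ids: Set[str]) -> int:
        """Once per interval, delete requests whose ID matches no exposure data.

        Returns the number of deleted entries (0 if the interval has not elapsed).
        """
        if now - self._last_purge < self.purge_interval:
            return 0
        stale = [rid for rid in self._entries if rid not in exposure_ids]
        for rid in stale:
            del self._entries[rid]
        self._last_purge = now
        return len(stale)


cache = RequestCache(purge_interval=60.0)
cache.put({"request_id": "r1"})
cache.put({"request_id": "r2"})
too_early = cache.maybe_purge(now=30.0, exposure_ids={"r1"})
purged = cache.maybe_purge(now=60.0, exposure_ids={"r1"})
```

Here only `r2`, which matches no exposure data, is deleted once the interval has elapsed.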

Optionally, as an embodiment, referring to fig. 2, which shows a schematic diagram of a training sample generation method, the training sample generation method includes:

The first step: the exposure data serves as the main driving stream. When request data arrives first, it is associated with the real-time feature data by the stream processing framework; taking advantage of the sequential read-write performance of the message queue, the associated request data and feature data are written into the message queue for buffering to wait for the exposure data.

The second step: when each piece of exposure data arrives, it is written directly into the columnar storage engine HBase with the row key service identifier_timestamp_request ID. At the same time, the components of the row key are written into the KV storage engine, where the Key is the request ID and the Value is service identifier_timestamp. The Key and the Value in the KV storage are spliced together to form the row key in HBase. The KV storage is introduced to improve the retrieval performance of row keys.

The third step: the data in the message queue is read and written sequentially. After a configurable delay, a delayed consumption service finds the row key corresponding to the exposure data in the KV storage and then updates the request data into different columns under the same row key.

The fourth step: when the user behavior data finally arrives, the corresponding row key is fetched from the KV storage, and the data is then updated into different columns under the same row key in HBase.

The fifth step: in both streaming training and offline training, all data such as exposure, click and feature data can be obtained simply by traversing the row key of each row. For example: row key: service A_timestamp_request ID; column name: exposure identifier/behavior identifier; column value: features/other fields.
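As a non-limiting illustration, the first three steps can be sketched with in-memory stand-ins. The dictionaries `hbase` and `kv`, the `queue`, the function names and the explicit `now` and `delay` parameters are all hypothetical; a real deployment would use HBase, a KV storage engine and a message queue as described above.

```python
from collections import deque
from typing import Deque, Dict, Optional

hbase: Dict[str, Dict[str, object]] = {}  # stand-in for HBase: row key -> columns
kv: Dict[str, str] = {}                   # stand-in for KV storage: request ID -> "service_timestamp"
queue: Deque[dict] = deque()              # stand-in for the message queue


def on_request(request: dict, features: dict, now: float) -> None:
    """Step 1: associate request data with feature data and buffer the result."""
    queue.append({"request": request, "features": features, "arrived": now})


def on_exposure(service: str, timestamp: int, exposure: dict) -> str:
    """Step 2: write the exposure row and record the row-key parts in KV storage."""
    request_id = exposure["request_id"]
    row_key = f"{service}_{timestamp}_{request_id}"
    hbase[row_key] = {"exposure": exposure}
    kv[request_id] = f"{service}_{timestamp}"
    return row_key


def consume_delayed(now: float, delay: float) -> int:
    """Step 3: consume buffered records at least `delay` old, in arrival order.

    Returns the number of records joined onto an exposure row.
    """
    joined = 0
    while queue and now - queue[0]["arrived"] >= delay:
        record = queue.popleft()
        request_id = record["request"]["request_id"]
        prefix: Optional[str] = kv.get(request_id)
        if prefix is None:
            continue  # no exposure arrived for this request; nothing to update
        row = hbase[f"{prefix}_{request_id}"]
        row["request"] = record["request"]
        row["features"] = record["features"]
        joined += 1
    return joined


on_request({"request_id": "r1"}, {"ctr_7d": 0.12}, now=0.0)
on_request({"request_id": "r2"}, {"ctr_7d": 0.30}, now=1.0)
row_key = on_exposure("svcA", 1628467200, {"request_id": "r1", "item": "i9"})
before_delay = consume_delayed(now=2.0, delay=5.0)
after_delay = consume_delayed(now=6.0, delay=5.0)
```

The delay gives exposure data time to arrive before the buffered request data is consumed; in this sketch a request with no exposure by that time (`r2`) is simply dropped.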

It is understood that, for convenience of illustration, the stream processing framework is drawn as three separate blocks in fig. 2, whereas in practical application there is only one. The stream processing framework processes each piece of acquired data, including request data, exposure data and user behavior data. The format of the finally obtained training sample is shown in fig. 3 and includes a data identifier, a request ID, a feature set and additional fields; whether the data was clicked or collected can be determined from the data identifier, which may be represented by 1 or 0.
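As a non-limiting illustration, converting rows into samples of the described format can be sketched as follows. Fig. 3 is not reproduced here, so the field names of `TrainingSample`, the `behavior:<type>` column naming and the choice to place the row key in the additional fields are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class TrainingSample:
    label: int                  # data identifier: 1 = behavior occurred, 0 = not
    request_id: str
    features: Dict[str, float]
    extra: Dict[str, str] = field(default_factory=dict)  # additional fields


def rows_to_samples(table: Dict[str, Dict[str, object]],
                    behavior: str = "click") -> Iterator[TrainingSample]:
    """Traverse every row and emit one training sample per row."""
    for row_key, columns in table.items():
        # Row key layout assumed to be service_timestamp_requestID.
        request_id = row_key.split("_", 2)[2]
        yield TrainingSample(
            label=1 if f"behavior:{behavior}" in columns else 0,
            request_id=request_id,
            features=dict(columns.get("features", {})),
            extra={"row_key": row_key},
        )


example_table = {
    "svcA_1628467200_r1": {"features": {"ctr_7d": 0.12}, "behavior:click": {}},
    "svcA_1628467201_r2": {"features": {"ctr_7d": 0.30}},
}
samples = list(rows_to_samples(example_table))
```

A row with a click column yields a sample labeled 1, and a row with exposure only yields a sample labeled 0.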

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Fig. 4 is a schematic structural diagram of a training sample generation apparatus according to an embodiment of the present invention, and as shown in fig. 4, the training sample generation apparatus includes an obtaining module, an inquiring module, a storage module, and a generation module, where:

Optionally, as an embodiment, the apparatus further includes:

the first KV module is used for storing the row key of the exposure data in the columnar storage engine into the KV storage engine, wherein the row key of the columnar storage engine includes a request ID, and the KV storage engine takes the request ID carried by the exposure data and contained in the row key as the Key and takes the other information contained in the row key as the Value;

the query module further comprises:

the first reading submodule is used for reading a Key from the KV storage engine;

and the first comparison sub-module is used for comparing the request IDs carried by the plurality of pieces of request data in the cache with the read keys respectively so as to determine whether the request data identical to the request IDs carried by the exposure data exists.

Optionally, as an embodiment, the apparatus further includes:

the request data acquisition module is used for acquiring request data carrying a request ID before acquiring the exposure data;

the caching module is used for caching the acquired request data into a message queue according to the acquired sequence;

the query module further comprises:

and the second comparison submodule is used for reading the request data from the message queue one by one and comparing the request ID carried by the read request data with the request ID carried by the exposure data so as to determine whether the request data same as the request ID carried by the exposure data exists or not.

Optionally, as an embodiment, the apparatus further includes:

a characteristic data acquisition module for acquiring characteristic data associated with the request data;

the association module is used for associating the request data with the characteristic data to obtain associated data;

the cache module comprises:

and the cache submodule is used for caching the associated data corresponding to the acquired request data into the message queue according to the acquired sequence.

Optionally, as an embodiment, the apparatus further includes:

the behavior data acquisition module is used for acquiring user behavior data after acquiring the exposure data, and the user behavior data carries a request ID;

an ID query module, configured to query whether the user behavior data includes user behavior data carrying the same request ID as the exposure data;

and the behavior data storage module is used for storing the inquired user behavior data to a row of the exposure data in the columnar storage engine under the condition that the inquired user behavior data which is the same as the request ID carried by the exposure data exists.

Optionally, as an embodiment, the apparatus further includes:

the second KV module is configured to store a row Key of the exposure data in the column-type storage engine into the KV storage engine, where the row Key of the column-type storage engine includes a request ID, the KV storage engine uses the request ID carried by the exposure data included in the row Key as a Key, and uses other information included in the row Key as a Value;

a determining module, configured to determine exposure data that is the same as the request ID carried in the user behavior data, including:

the second reading submodule is used for reading a Key in the KV storage engine;

and the second comparison submodule is used for comparing the request ID carried by the user behavior data with the read Key to determine the exposure data which is the same as the request ID carried by the user behavior data.

Optionally, as an embodiment, the apparatus further includes:

and the deleting module is used for deleting the request data which is different from the request ID carried by the exposure data in the cache every preset time period.

It should be noted that, because the apparatus embodiment is similar to the method embodiment, it is described only briefly; for relevant details, reference may be made to the method embodiment.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed, the method for generating a training sample according to any of the above embodiments is implemented.

An embodiment of the present invention further provides an electronic device, including: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training sample generation method of any of the above embodiments.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The training sample generation method, the training sample generation device, the storage medium and the electronic device provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is intended only to help in understanding the method and core concept of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A method for generating training samples, the method comprising:

2. The method of claim 1, further comprising:

reading Key from the KV storage engine;

3. The method of claim 1, wherein prior to acquiring exposure data, the method further comprises:

acquiring request data carrying a request ID;

4. The method of claim 3, further comprising:

acquiring characteristic data associated with the request data;

associating the request data with the feature data to obtain associated data;

5. The method of claim 1, wherein after acquiring exposure data, the method further comprises:

6. The method of claim 5, further comprising:

reading Key from the KV storage engine;

7. The method according to any one of claims 1-6, further comprising:

8. A training sample generation apparatus, the apparatus comprising:

9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the training sample generation method of any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the training sample generation method of any one of claims 1 to 7 when executing the computer program.