CN109933580B

CN109933580B - Training data generation method and device and server

Info

Publication number: CN109933580B
Application number: CN201910114035.0A
Authority: CN
Inventors: 郑培凝; 孙逸; 金子添; 顾海斌
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2019-02-14
Filing date: 2019-02-14
Publication date: 2020-12-25
Anticipated expiration: 2039-02-14
Also published as: CN109933580A

Abstract

An embodiment of the invention provides a training data generation method. After initial record data is obtained, the method obtains a target feature name from the initial record data; obtains stored historical mapping data; generates current mapping data according to the target feature name, the timestamps in the historical mapping data and a preset time window; and replaces the target feature name in the cleaned initial record data with the corresponding feature identifier, according to the mapping relation between the target feature name and the feature identifier in the current mapping data, to generate training data. Because the mapping relation is fixed within the time window, only one mapping relation between feature names and feature identifiers needs to be stored per time window. This simplifies the operation of tracing a feature identifier in the training data back to its feature name during model training.

Description

Training data generation method and device and server

Technical Field

The present invention relates to the technical field of training data generation, and in particular, to a training data generation method, apparatus, and server.

Background

Many prediction models exist, and constructing a prediction model requires training a model on training data. Currently, model training is usually performed periodically. For example, with a training period of one month, the model is trained during that month using the training data generated each day.

In the prior art, the method for generating training data every day is as follows: firstly, feature names are obtained from initial record data, and then feature identifiers are distributed to the feature names to generate training data.

In the process of implementing the invention, the inventors found that the prior art has at least the following problem: each time training data is generated (that is, every day), feature identifiers are reassigned to the feature names, so the mapping relation between feature names and feature identifiers is not fixed. Moreover, the generated training data usually contains only the feature identifiers and not the feature names. As a result, during model training it is cumbersome for developers to trace a feature identifier in the training data back to its feature name.

Disclosure of Invention

The embodiment of the invention aims to provide a training data generation method, a training data generation device and a training data generation server, so that the operation of tracing the characteristic corresponding to the characteristic identifier according to the characteristic identifier in the training data in the model training process of a developer is simplified.

The specific technical scheme is as follows:

in order to achieve the above object, in a first aspect, an embodiment of the present invention provides a training data generation method applied to a server, where the method includes:

obtaining initial record data; each record in the initial record data at least comprises: a characteristic name;

obtaining a target characteristic name from the initial record data;

obtaining stored historical mapping data; the history mapping data includes: mapping relations between the feature names and the feature identifications and timestamps corresponding to the mapping relations;

generating current mapping data according to the target feature name, the time stamp in the historical mapping data and a preset time window; the current mapping data includes: determining a mapping relation between the feature name and the feature identifier in the preset time window according to the timestamp; the mapping relation in the current mapping data comprises the mapping relation between the target feature name and the feature identifier; the preset time window is preset according to the model training period and/or the change condition of the statistical characteristic increment;

and replacing the target characteristic name in the initial recording data with the corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data to generate training data.

Optionally, the step of obtaining the target feature name from the initial record data includes:

carrying out data cleaning on the initial recording data to obtain the cleaned initial recording data and a target characteristic name;

the step of generating training data by replacing the target feature name in the initial recording data with a corresponding feature identifier according to the mapping relationship between the target feature name and the feature identifier in the current mapping data includes:

and replacing the target characteristic name in the cleaned initial record data with the corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data to generate training data.

Optionally, the initial record data is stored in an HBase database, each record in the HBase database corresponds to one terminal, and each record includes: a terminal identification, a characteristic name and a characteristic value;

the step of obtaining initial recording data includes:

from the HBase database, initial record data saved for each terminal is obtained.

Optionally, the initial record data is stored in the HBase database through the following steps:

reading data obtained from each terminal within a preset unit time to generate a data file; the storage format of the data in the data file is the same as that of the data in the HBase database;

and importing the data file into the HBase database to obtain initial record data.

Optionally, the feature name includes a feature type;

the step of cleaning the initial record data to obtain the cleaned initial record data and the target characteristic name comprises the following steps:

counting the occurrence frequency of each feature name in the initial record data;

deleting the feature names and the feature values of which the occurrence times of the feature names are smaller than the preset frequency threshold corresponding to the features in the initial recording data according to the preset frequency threshold for each feature type; and obtaining the cleaned initial record data and the target characteristic name.

Optionally, the step of generating current mapping data according to the target feature name, the timestamp in the historical mapping data, and a preset time window includes:

distributing a feature identifier for a target feature name which is not contained in historical mapping data, generating a new mapping relation, and adding the newly generated mapping relation and a timestamp of generation time into current mapping data, wherein the generation time is the current time;

modifying to the current time the timestamp of each mapping relation in the historical mapping data that corresponds to a target feature name, and adding the mapping relation with the updated timestamp to the current mapping data;

and adding to the current mapping data the mapping relations, other than those of the target feature names, whose timestamps fall within the preset time window.

Optionally, after the generating the current mapping file, the method further includes:

and saving the current mapping file as a history mapping file.

Optionally, the step of generating training data by replacing the target feature name in the cleaned initial record data with a corresponding feature identifier according to the mapping relationship between the target feature name and the feature identifier in the current mapping data includes:

replacing the target characteristic name in the cleaned initial record data with a corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data;

counting the number of the feature identifiers in each record in the cleaned initial record data;

and when the number of the feature identifications contained in the record is smaller than a preset first feature threshold value or larger than a preset second feature threshold value, deleting the record.

In a second aspect, an embodiment of the present invention provides a training data generating apparatus, where the apparatus includes:

the data acquisition module is used for acquiring initial recording data; each record in the initial record data at least comprises: a characteristic name;

the obtaining module is used for obtaining a target characteristic name from the initial record data;

the history obtaining module is used for obtaining the stored history mapping data; the history mapping data includes: mapping relations between the feature names and the feature identifications and timestamps corresponding to the mapping relations;

the mapping module is used for generating current mapping data according to the target characteristic name, the timestamp in the historical mapping data and a preset time window; the current mapping data includes: determining a mapping relation between the feature name and the feature identifier in the preset time window according to the timestamp; the mapping relation in the current mapping data comprises the mapping relation between the target feature name and the feature identifier; the preset time window is preset according to the model training period and/or the change condition of the statistical characteristic increment;

and the generating module is used for replacing the target characteristic name in the initial record data with the corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data to generate training data.

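Optionally, the obtaining module is specifically configured to perform data cleaning on the initial record data to obtain the cleaned initial record data and a target feature name;

the generating module is specifically configured to replace the target feature name in the cleaned initial record data with a corresponding feature identifier according to a mapping relationship between the target feature name and the feature identifier in the current mapping data, so as to generate training data.

Optionally, the initial record data is stored in an HBase database, and the data obtaining module is specifically configured to obtain, from the HBase database, the initial record data saved for each terminal.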

Optionally, the apparatus further includes:

the file generation module is used for reading data obtained from each terminal within preset unit time and generating a data file; the storage format of the data in the data file is the same as that of the data in the HBase database;

and the import module is used for importing the data file into the HBase database to obtain initial record data.

Optionally, the feature name includes a feature type;

the obtaining module includes:

the statistic submodule is used for counting the occurrence frequency of each feature name in the initial record data;

a deleting submodule, configured to delete, according to a preset frequency threshold for each feature type, a feature name and a feature value, in the initial record data, for which the number of occurrences of the feature name is smaller than the preset frequency threshold corresponding to the feature; and obtaining the cleaned initial record data and the target characteristic name.

Optionally, the mapping module includes:

the distribution submodule is used for distributing a characteristic identifier for a target characteristic name which is not contained in historical mapping data, generating a new mapping relation, and adding the newly generated mapping relation and a timestamp of generation time into current mapping data, wherein the generation time is the current time;

the modification submodule is used for modifying to the current time the timestamp of each mapping relation in the historical mapping data that corresponds to a target feature name, and adding the mapping relation with the updated timestamp to the current mapping data;

and the adding submodule is used for adding to the current mapping data the mapping relations, other than those of the target feature names, whose timestamps fall within the preset time window.

Optionally, the apparatus further includes:

and the storage module is used for storing the current mapping file as a history mapping file after the current mapping file is generated.

Optionally, the generating module includes:

the replacing submodule is used for replacing the target characteristic name in the cleaned initial record data with a corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data;

the counting submodule is used for counting the number of the characteristic identifications in each record in the cleaned initial record data;

and the generation submodule is used for deleting the record when the number of the feature identifications contained in the record is smaller than a preset first feature threshold or larger than a preset second feature threshold.

In a third aspect, an embodiment of the present invention provides a server, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;

a memory for storing a computer program;

the processor is used for realizing the following steps when executing the program stored in the memory:

obtaining a target characteristic name from the initial record data;

The present invention also provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the steps of any one of the above-mentioned training data generation methods are implemented.

Embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform any one of the above-mentioned training data generating methods.

According to the training data generation method, the training data generation device and the training data generation server, the target characteristic name can be obtained from initial recorded data after the initial recorded data are obtained; obtaining stored historical mapping data; generating current mapping data according to the target feature name, the time stamp in the historical mapping data and a preset time window; and replacing the target characteristic name in the cleaned initial record data with the corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data to generate training data. Therefore, in the training data generation method provided by the embodiment of the invention, the mapping relation is not regenerated every day, but the mapping relation in the time window is kept, and in the tracing process, only one mapping relation between the feature name and the feature identifier needs to be stored according to the time window. The method can simplify the operation that a developer traces the features corresponding to the feature identifications according to the feature identifications in the training data in the model training process. Particularly, when the period set by the time window is the same as the period of model training, the characteristic name can be traced back subsequently only by storing a mapping relation between the characteristic name and the characteristic identifier, and the operation is greatly simplified.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a schematic flow chart of a training data generation method according to an embodiment of the present invention;

Fig. 2 is another schematic flow chart of a training data generation method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention;

Fig. 4 is another schematic structural diagram of a training data generating apparatus according to an embodiment of the present invention;

Fig. 5 is a block diagram of the mapping module in the embodiment shown in Fig. 4;

Fig. 6 is a schematic structural diagram of the generating module in the embodiment shown in Fig. 4;

Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

In order to simplify the operation of a developer tracing the feature corresponding to the feature identifier according to the feature identifier in the training data in the model training process, the embodiment of the invention provides a training data generation method.

Referring to fig. 1, the method includes:

S101: obtaining initial record data; each record in the initial record data at least comprises: a feature name;

The initial record data may be data to be cleaned that is obtained by preprocessing the user behavior data collected from each terminal, for example by analyzing the user behavior data obtained from each terminal and generating features from it. The initial record data may be saved in the HBase database.

S102: obtaining a target characteristic name from the initial record data;

In one embodiment, the step of obtaining the target feature name from the initial record data may be:

performing data cleaning on the initial record data to obtain the target feature name.

Specifically, in practical applications the occurrence frequencies of different types of features are unlikely to be identical. Therefore, to better fit practice, different frequency thresholds may be preset for different feature types when cleaning the features of the initial record data. Then, according to the frequency threshold preset for each feature type, the feature names whose number of occurrences in the initial record data is smaller than the corresponding threshold are deleted together with their feature values, yielding the cleaned initial record data and the target feature names.

In other embodiments, the target feature name may also be obtained from the initial record data according to the target feature name determined by the developer in advance.

S103: obtaining stored historical mapping data; the history mapping data includes: mapping relations between the feature names and the feature identifications and timestamps corresponding to the mapping relations;

S104: generating current mapping data according to the target feature name, the timestamps in the historical mapping data and a preset time window; the current mapping data includes the mapping relations between feature names and feature identifiers that are determined, according to the timestamps, to fall within the preset time window; the mapping relations in the current mapping data include the mapping relation between the target feature name and its feature identifier; the preset time window is preset according to the model training period and/or the statistically observed change in feature increments;

specifically, it may be:

In this embodiment, after the current mapping file is generated, the current mapping file may be further saved as a history mapping file for use in next generation of training data.

S105: and replacing the target characteristic name in the initial recording data with the corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data to generate training data.

In a specific embodiment, the step of generating the training data may be:

when the number of feature identifiers contained in a record is smaller than a preset first feature number threshold or larger than a preset second feature number threshold, deleting the record, and generating the training data. The second feature number threshold is greater than the first feature number threshold; for example, the first feature number threshold may be 3 and the second feature number threshold may be 10. In that case, a record containing 2 feature identifiers is deleted because 2 is less than the first feature number threshold 3, and a record containing 12 feature identifiers is deleted because 12 is greater than the second feature number threshold 10.

In short, this step deletes outlier records that contain either very few or very many target features, so that they do not affect the model training process.
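The record filter described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `filter_by_feature_count`, the dictionary layout of a record and the sample data are assumptions made for the example, and the thresholds 3 and 10 follow the example values in the text.

```python
from typing import Dict, List

EncodedRecord = Dict[int, float]  # hypothetical layout: feature id -> feature value


def filter_by_feature_count(
    records: List[EncodedRecord], first_threshold: int, second_threshold: int
) -> List[EncodedRecord]:
    """Keep records whose feature count lies in [first_threshold, second_threshold]."""
    if second_threshold <= first_threshold:
        raise ValueError("the second threshold must be greater than the first")
    # A record is deleted when its count is below the first threshold
    # or above the second threshold, so the kept range is inclusive.
    return [
        record for record in records
        if first_threshold <= len(record) <= second_threshold
    ]


# Records with 2, 5 and 12 feature identifiers; only the middle one survives.
records = [
    {i: 1.0 for i in range(2)},
    {i: 1.0 for i in range(5)},
    {i: 1.0 for i in range(12)},
]
kept = filter_by_feature_count(records, first_threshold=3, second_threshold=10)
print([len(r) for r in kept])  # [5]
```

Records whose count equals either threshold are kept in this sketch, since the text only deletes counts strictly smaller than the first threshold or strictly larger than the second.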

As can be seen from the embodiment shown in fig. 1, the training data generation method provided by the embodiment of the present invention can obtain the target feature name from the initial recorded data after obtaining the initial recorded data; obtaining stored historical mapping data; generating current mapping data according to the target feature name, the time stamp in the historical mapping data and a preset time window; and replacing the target characteristic name in the cleaned initial record data with the corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data to generate training data. Therefore, in the training data generation method provided by the embodiment of the invention, the mapping relation is not regenerated every day, but the mapping relation in the time window is kept, and in the tracing process, only one mapping relation between the feature name and the feature identifier needs to be stored according to the time window. The method can simplify the operation that a developer traces the features corresponding to the feature identifications according to the feature identifications in the training data in the model training process. Particularly, when the period set by the time window is the same as the period of model training, the characteristic name can be traced back subsequently only by storing a mapping relation between the characteristic name and the characteristic identifier, and the operation is greatly simplified.

Optionally, in the embodiment shown in Fig. 1, the initial record data may be stored in the HBase database. In the HBase database, each record corresponds to one terminal, and each record comprises: a terminal identifier, a feature name and a feature value.

Thus, step S101 in fig. 1 may specifically include: from the HBase database, initial record data saved for each terminal is obtained.

In practical applications, the initial record data may be saved for the log data obtained from the computer terminal and for the log data obtained from the mobile terminal, respectively, depending on whether the terminal device is a computer terminal or a mobile terminal.

The initial record data saved for each terminal can be stored in the HBase database in the following two ways:

the first import mode, the batch import mode, includes the following steps:

firstly, reading data obtained from each terminal within a preset unit time to generate a data file; and the storage format of the data in the data file is the same as that of the data in the HBase database. For example: and receiving terminal log data of several days, and writing the terminal log data into a data file with the same storage format as that of the data in the HBase.

And then, importing the data file into the HBase database to obtain initial record data.

In practical applications, terminal log data within a preset unit time (for example, 3 days) may be read from HDFS (the Hadoop Distributed File System) and written to an HFile, whose storage format is the same as that of the data in HBase; the HFile is then imported into the HBase database. In this way, the data read from HDFS for the unit time is imported into the HBase database by bulk loading. The preset unit time here serves to obtain the terminal log data for that unit time, whereas the preset time window in the method provided by the embodiment of the present invention serves to keep the mapping relation between feature names and feature identifiers fixed within the window.

The second import mode, a mode of inserting a single record, may specifically have two cases:

In the first case, data obtained from each terminal within a preset unit time is read from HDFS and imported into the HBase database by single-record insertion.

In the second case, the data obtained from each terminal within a preset unit time is read from Hive (a data warehouse) using JDBC (Java Database Connectivity) and imported into the HBase database by single-record insertion.

In addition, in other embodiments, the initial record data in the HBase database may also be kept up to date by using the TTL (time to live) function of the HBase database itself. For example, the TTL may be set to 45 days. The initial record data imported into the HBase database is then deleted automatically once its storage time reaches 45 days, which greatly reduces redundant data.

The method provided by the embodiment of the present invention will be described in further detail below by referring to a specific embodiment.

Referring to fig. 2, with the method provided by the embodiment of the present invention, the process of generating the training data may include:

S201: obtaining, from an HBase database, the initial record data saved for each terminal; each record in the initial record data at least comprises: a terminal identifier, a feature name and a feature value; the feature name comprises a feature type;

Specifically, the format of the feature name (Feature_name) may be: category#subcategory#feature_name, where category represents the category to which the feature belongs, subcategory represents the subcategory to which the feature belongs within that category, and feature_name represents the identification information preset for the feature.

For example: the viewing features include the album viewing duration feature, the episode viewing duration feature, and so on; the reported data features include the terminal device type feature, and so on.

Specific examples are as follows:

Example one: the user watched, for 1 hour, a drama whose identification information is 100291801album; this feature belongs to the album viewing duration feature among the viewing features. The feature name of the feature may be: VV#ALBUM_ID#100291801ALBUM, where VV indicates that the feature belongs to the viewing feature type, ALBUM_ID indicates that the feature belongs to the album viewing duration feature under the viewing features, and 100291801ALBUM is the identification information of the watched drama; the feature value of the feature is 1 (hour).

Example two: the user watched one episode of a television series for 45 minutes, and the identification information of the episode is 1000001500; this feature belongs to the episode viewing duration feature among the viewing features. The feature name of the feature may be: VV#EPISODE_ID#1000001500, where VV indicates the viewing feature type, EPISODE_ID indicates that the feature belongs to the episode viewing duration feature under the viewing features, and 1000001500 is the identification information of the watched episode; the feature value of the feature is 45 minutes.

Example three: the operating system version of the terminal device used by the user is 4.0.4os; this feature belongs to the terminal device type feature among the reported data features. The feature name of the feature may be: SDK#SDK_OS_VERSION#4.0.4OS, where SDK indicates that the feature belongs to the reported data feature type, SDK_OS_VERSION indicates that the feature belongs to the terminal device type feature under the reported data features, and 4.0.4OS is the specific type information of the terminal device used by the user; the feature value of the feature is 4.0.4os.

Because the feature names adopt this format, the classification avoids conflicts between the identification information preset for different features, that is, conflicts caused by identical feature_name values.
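As an illustration of this naming format, the following minimal Python sketch builds and splits a feature name. The helper names `build_feature_name` and `parse_feature_name` and the `FeatureName` type are hypothetical and not part of the embodiment; the sketch also assumes that `#` never occurs inside the category or subcategory.

```python
from typing import NamedTuple

SEPARATOR = "#"  # delimiter of the category#subcategory#feature_name format


class FeatureName(NamedTuple):
    category: str     # e.g. "VV" (viewing feature type)
    subcategory: str  # e.g. "EPISODE_ID"
    identifier: str   # identification information preset for the feature


def build_feature_name(category: str, subcategory: str, identifier: str) -> str:
    """Join the three parts into a single feature name."""
    return SEPARATOR.join((category, subcategory, identifier))


def parse_feature_name(name: str) -> FeatureName:
    """Split a feature name; only the first two separators are structural."""
    parts = name.split(SEPARATOR, 2)
    if len(parts) != 3:
        raise ValueError(f"not a category#subcategory#feature_name string: {name!r}")
    return FeatureName(*parts)


episode = parse_feature_name("VV#EPISODE_ID#1000001500")
album = parse_feature_name("VV#ALBUM_ID#1000001500")
# The same identification information under different subcategories
# yields different feature names, so the two features do not collide.
print(episode != album)  # True
```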

S202: counting the occurrence frequency of each feature name in the initial record data;

S203: according to the frequency threshold preset for each feature type, deleting from the initial record data the feature names whose number of occurrences is smaller than the corresponding preset frequency threshold, together with their feature values, to obtain the cleaned initial record data and the target feature names;

For example: four feature names A, B, C and D exist in the initial record data, and their numbers of occurrences in the initial record data are 8, 18, 25 and 6, respectively. The frequency thresholds preset for features A, B, C and D are 10, 10, 20 and 8, respectively. The feature names whose number of occurrences is smaller than the corresponding threshold are deleted together with their feature values; that is, feature name A and its feature values and feature name D and its feature values are deleted from the initial record data, yielding the cleaned initial record data. The target feature names are the feature names that remain after cleaning, namely B and C.

S204: obtaining stored historical mapping data; the history mapping data includes: mapping relations between the feature names and the feature identifications and timestamps corresponding to the mapping relations;

S205: allocating a feature identifier to each target feature name not contained in the historical mapping data, generating a new mapping relation, and adding the newly generated mapping relation and the timestamp of its generation time to the current mapping data; the generation time is the current time;

S206: modifying to the current time the timestamp of the mapping relation corresponding to each target feature name contained in the historical mapping data, and adding the mapping relation with the updated timestamp to the current mapping data;

S207: adding to the current mapping data the mapping relations, other than those of the target feature names, whose timestamps fall within the preset time window;

S208: saving the current mapping file as a history mapping file;

In this embodiment, the current mapping file may be stored in the HDFS as a history mapping file, so that the stored history mapping data may be obtained from the HDFS.

The embodiment of the present invention does not limit the timing of saving the current mapping file as the history mapping file, and in other embodiments, the timing of saving the current mapping file as the history mapping file may be any step after step S207.

S209: replacing the target characteristic name in the cleaned initial record data with a corresponding characteristic identifier according to the mapping relation between the target characteristic name and the characteristic identifier in the current mapping data;

In practical applications, the timestamp of each mapping relation between a target feature name and its feature identifier is the current time, namely the current day;

Therefore, the target feature names can be replaced directly according to the mapping relations between feature names and feature identifiers whose timestamp is the current time.

S210: counting the number of the feature identifiers in each record in the cleaned initial record data;

S211: when the number of the feature identifiers contained in the record is smaller than a preset first feature threshold value or larger than a preset second feature threshold value, deleting the record;

S212: the remaining records are used as the generated training data.

In practical applications, the size of the preset time window, SLIDING_WINDOW_SIZE, may be set according to the model training period and/or the statistically observed growth in the number of features. For example, the feature statistics show that the number of features does not increase much within 25 days, and the model training period is generally about 30 days, so the size of the time window can be set to 20 to 30 days according to the actual situation. This keeps the mapping relation between feature names and feature identifiers fixed within the preset time window during model training.

For example, in the present embodiment, the size of the preset time window SLIDING_WINDOW_SIZE may be set to 25 days.

The following describes a process of generating a current mapping file in the method provided by the embodiment of the present invention by taking an actual example:

First, the saved historical mapping data can be obtained from the HDFS. In the historical mapping data, the storage format of the data may be $Feature_name\t${id}\t${used_time}, where Feature_name is a feature name, id is the feature identifier corresponding to the feature name, and used_time is the timestamp corresponding to the mapping relation. That is, the historical mapping data includes the mapping relations between feature names and feature identifiers, and the timestamp corresponding to each mapping relation.
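A minimal Python sketch of reading and writing one line in this tab-separated format is given below. The Mapping type and the functions parse_line and format_line are hypothetical names introduced for illustration, and the timestamp is assumed to be a YYYYMMDD string as in the examples of this embodiment.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """One mapping relation as stored in a mapping file (hypothetical type)."""
    feature_name: str
    feature_id: str
    used_time: str  # timestamp, assumed to be a YYYYMMDD string


def parse_line(line: str) -> Mapping:
    """Parse one 'Feature_name<TAB>id<TAB>used_time' line."""
    feature_name, feature_id, used_time = line.rstrip("\n").split("\t")
    return Mapping(feature_name, feature_id, used_time)


def format_line(mapping: Mapping) -> str:
    """Serialize a mapping relation back into the tab-separated format."""
    return "\t".join(mapping)


entry = parse_line("B\t01\t20180801\n")
print(entry)
print(format_line(entry))
```

Keeping the identifier as a string preserves leading zeros such as 01 when the line is written back.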

Then, the current mapping file is generated according to the target feature names, the timestamps in the historical mapping data and the preset time window.

For example, on August 2, 2018, after the initial record data has been cleaned and the cleaned initial record data and the target feature names have been obtained, the historical mapping data mapping.20180801 can be obtained from the HDFS.

Feature identifiers are allocated to the target feature names not contained in mapping.20180801, new mapping relations are generated, and the newly generated mapping relations, with the generation-time timestamp 20180802, are added to the current mapping data mapping.20180802.

The timestamps of the mapping relations corresponding to the target feature names contained in the historical mapping data mapping.20180801 are modified to the current time 20180802, and these mapping relations are added to the current mapping data mapping.20180802.

The mapping relations, other than those of the target feature names, whose timestamps fall within the preset time window, that is, within the preset 25 days, are added to the current mapping data mapping.20180802.

For example: on 2018.08.02, after the initial record data was cleaned, the target feature names obtained were B, C, D, E and F.

From the HDFS, the data in the history mapping data mapping.20180801 is obtained as shown in table one:

Table one

Feature_name	id	used_time
A	00	20180801
B	01	20180801
C	02	20180724
D	03	20180728
G	04	20180729
H	05	20180707

The specific method for generating the current mapping data mapping.20180802 comprises the following steps:

Feature identifiers are allocated to the target feature names E and F, which are not contained in mapping.20180801; for example, the feature identifier 06 is allocated to the target feature name E and the feature identifier 07 to the target feature name F. New mapping relations with the generation time 20180802 are generated, and the newly generated mapping relations, together with the timestamp 20180802, are added to the current mapping data mapping.20180802.

The timestamps of the mapping relations corresponding to the target feature names B, C and D contained in the historical mapping data mapping.20180801 are each modified to the current time 20180802, and these mapping relations are added to the current mapping data mapping.20180802.

The mapping relations, other than those of the target feature names, whose timestamps fall within the preset time window, namely the mapping relations corresponding to the feature names A and G, are added to the current mapping data mapping.20180802.

Since the timestamp corresponding to the mapping relation between the feature name H and the feature identifier 05 is 20180707, which is not within the 25 days of the preset time window, this mapping relation is not added to the current mapping data mapping.20180802.

Thus, the data in the current mapping data mapping.20180802 is shown in table two:

Table two

Feature_name	id	used_time
A	00	20180801
B	01	20180802
C	02	20180802
D	03	20180802
G	04	20180729
E	06	20180802
F	07	20180802
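The generation of mapping.20180802 from mapping.20180801 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the embodiment: the mapping data is held in a dictionary, a new identifier is taken as the largest historical identifier plus one (the embodiment does not specify how identifiers are allocated), and the window is assumed to cover the current day and the 25 days before it. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta

SLIDING_WINDOW_SIZE = 25  # days, as in this embodiment


def parse_day(stamp: str) -> date:
    """Convert a YYYYMMDD timestamp such as 20180802 into a date."""
    return datetime.strptime(stamp, "%Y%m%d").date()


def generate_current_mapping(history, target_names, today,
                             window_days=SLIDING_WINDOW_SIZE):
    """history maps feature name -> (feature id, used_time)."""
    current = {}
    # Assumption: a new id is the largest id in the history plus one.
    next_id = max((int(fid) for fid, _ in history.values()), default=-1) + 1
    for name in target_names:
        if name in history:
            # S206: keep the id, refresh the timestamp to the current day.
            current[name] = (history[name][0], today)
        else:
            # S205: allocate an id to a name missing from the history.
            current[name] = (f"{next_id:02d}", today)
            next_id += 1
    # Assumption: the window covers the current day and the 25 days before it.
    earliest = parse_day(today) - timedelta(days=window_days)
    for name, (fid, used_time) in history.items():
        # S207: carry over non-target mappings still inside the window.
        if name not in current and parse_day(used_time) >= earliest:
            current[name] = (fid, used_time)
    return current


history = {
    "A": ("00", "20180801"), "B": ("01", "20180801"), "C": ("02", "20180724"),
    "D": ("03", "20180728"), "G": ("04", "20180729"), "H": ("05", "20180707"),
}
current = generate_current_mapping(history, ["B", "C", "D", "E", "F"], "20180802")
for name, (fid, used_time) in current.items():
    print(name, fid, used_time, sep="\t")
```

With the Table one input and these assumptions, the result contains the same seven mapping relations as Table two, and H is left out because 20180707 lies before the start of the window.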

The process of generating training data in the method provided by the embodiment of the present invention is further explained below with a practical example:

Taking the above embodiment as an example, the mapping relations between the target feature names and the feature identifiers are obtained from the current mapping data mapping.20180802, that is, the mapping relations in mapping.20180802 whose timestamp is 20180802, namely those corresponding to the feature names B, C, D, E and F. In the cleaned initial record data, each target feature name is replaced with the corresponding feature identifier, that is, the feature names B, C, D, E and F are replaced with 01, 02, 03, 06 and 07, respectively.
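The replacement step can be sketched in Python as follows. The sketch assumes that each cleaned record is a dictionary from feature name to feature value and that the current mapping data is a dictionary from feature name to an (identifier, timestamp) pair; the function name replace_names and the sample records are hypothetical.

```python
def replace_names(records, current_mapping, today):
    """Replace target feature names with their feature identifiers."""
    # Only mapping relations stamped with the current day belong to
    # target feature names.
    name_to_id = {name: fid
                  for name, (fid, used_time) in current_mapping.items()
                  if used_time == today}
    return [{name_to_id[name]: value for name, value in record.items()}
            for record in records]


# Contents of mapping.20180802 from the example above.
current_mapping = {
    "A": ("00", "20180801"), "B": ("01", "20180802"), "C": ("02", "20180802"),
    "D": ("03", "20180802"), "G": ("04", "20180729"), "E": ("06", "20180802"),
    "F": ("07", "20180802"),
}
# Hypothetical cleaned records; they contain target feature names only.
cleaned = [{"B": 1.0, "E": 0.5}, {"C": 2.0, "D": 3.0, "F": 4.0}]

replaced = replace_names(cleaned, current_mapping, "20180802")
print(replaced)
```

Because the cleaned records contain only target feature names, every name has a mapping relation stamped with the current day, so the lookup in this sketch does not need a fallback.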

Each record in the cleaned initial record data is read line by line and the number of feature identifiers in the record is counted; when the number of feature identifiers contained in a record is smaller than a preset first feature threshold or larger than a preset second feature threshold, the record is deleted, and the remaining records form the training data.

For example, the preset first feature threshold is set to 3 and the preset second feature threshold is set to 10. If a record contains fewer than 3 feature identifiers, it is deleted; if a record contains more than 10 feature identifiers, it is also deleted. This ensures that each record in the training data contains no fewer than 3 and no more than 10 features. The reason is that a record containing too few features is of little use as training data, while a record containing too many features is an extreme case and likewise need not be used as training data.
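The record filter of steps S210 to S212 can be sketched in Python as follows, using the thresholds 3 and 10 from the example. Records are assumed to be dictionaries from feature identifier to feature value, and the function name filter_records and the synthetic records are hypothetical.

```python
FIRST_FEATURE_THRESHOLD = 3    # records with fewer identifiers are deleted
SECOND_FEATURE_THRESHOLD = 10  # records with more identifiers are deleted


def filter_records(records,
                   low=FIRST_FEATURE_THRESHOLD,
                   high=SECOND_FEATURE_THRESHOLD):
    """Keep the records whose number of feature identifiers is in [low, high]."""
    return [record for record in records if low <= len(record) <= high]


def make_record(n):
    """Build a synthetic record with n feature identifiers (illustration only)."""
    return {f"{i:02d}": 1.0 for i in range(n)}


records = [make_record(n) for n in (2, 3, 10, 11)]
training_data = filter_records(records)
print([len(record) for record in training_data])  # [3, 10]
```

A record with exactly 3 or exactly 10 feature identifiers is kept, since deletion applies only to counts strictly below the first threshold or strictly above the second.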

According to the training data generation method provided by the embodiment of the invention, the mapping relation is not regenerated every day, but the mapping relation in the time window is reserved, and in the tracing process, only one mapping relation between the feature name and the feature identifier needs to be stored according to the time window. The method can simplify the operation that a developer traces the features corresponding to the feature identifications according to the feature identifications in the training data in the model training process. Particularly, when the period set by the time window is the same as the period of model training, the characteristic name can be traced back subsequently only by storing a mapping relation between the characteristic name and the characteristic identifier, and the operation is greatly simplified.

In addition, in the conventional training data generation method, the initial record data obtained from the client is generally stored in Hive. Data is stored in Hive in the following form: one record contains one client identifier (device_id), one feature name and the feature value corresponding to that feature name. Record data obtained from a client usually contains a plurality of features, such as the user's viewing features, the film type features within the viewing features, the issuer features, the mobile phone type features, and the like. Therefore, for record data containing a plurality of features obtained from one client, a plurality of records must be stored in Hive, which makes storing the record data cumbersome.

Meanwhile, because a plurality of records must be stored for record data containing a plurality of features obtained from one client, the identifier (device_id) of the client is stored many times. For example, if the record data obtained from a client contains 10 features, 10 records need to be stored in Hive for that record data and the device_id is stored 10 times, which causes data redundancy.

The training data generation method provided by the embodiment of the invention can store the initial record data in an HBase database, where data is stored in the following form: one record contains one device_id together with the feature names and corresponding feature values of all the features obtained from the client.

Therefore, for record data containing a plurality of features obtained from one client, only one record needs to be stored in the HBase database, that is, the device_id needs to be stored only once. This solves both the prior-art problem of cumbersome operation when storing record data and the prior-art problem of data redundancy caused by storing the device_id multiple times.
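The difference between the two storage forms can be illustrated with plain Python data structures. The sketch below does not use any Hive or HBase API; the function names, the device identifier and the feature names are hypothetical and only model the shape of the stored records.

```python
def to_row_per_feature(device_id, features):
    """Prior-art form: one record per feature, each repeating the device_id."""
    return [(device_id, name, value) for name, value in features.items()]


def to_record_per_device(device_id, features):
    """Form used by the embodiment: one record holding all features."""
    return {"device_id": device_id, "features": dict(features)}


# Hypothetical record data with 10 features from one client.
features = {f"FEATURE_{i}": float(i) for i in range(10)}

rows = to_row_per_feature("device-001", features)
record = to_record_per_device("device-001", features)

print(len(rows))                 # number of records in the per-feature form
print(len(record["features"]))   # features held by the single record
```

In the per-feature form the device_id appears once in each of the 10 rows, whereas the single record stores it once alongside all 10 features.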

Since Hive has no function for setting a storage duration threshold, in the prior art all the initial record data stored in Hive is generally obtained each day when training data is generated, so the data volume is large and the processing time is long. In the prior art, cleaning the features of the initial record data takes about 10 hours and generating the current mapping data takes about 1 hour; the generated training data is also large and occupies a large storage space.

In practical applications, the training data generation method provided by the embodiment of the invention can set the storage duration threshold of the HBase database according to actual requirements. For example, if the viewing feature VV currently uses data from the last 45 days, the storage duration threshold of the HBase database may be set to 45 days. Thus, each day when training data is generated, only the initial record data of the last 45 days is obtained, so the data volume is small and the processing time is short. In the embodiment of the invention, cleaning the features of the initial record data takes only 4.5 hours and generating the current mapping data takes only 10 minutes, far less than in the prior art; the generated training data is also smaller, which saves storage space.
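The effect of a 45-day storage duration threshold on the data that is processed each day can be sketched in Python as follows. This is a plain filter for illustration only and does not show how HBase itself expires data; the field name day, the function name within_retention and the inclusive boundary are assumptions.

```python
from datetime import date, timedelta

RETENTION_DAYS = 45  # storage duration threshold from the example


def within_retention(records, today, retention_days=RETENTION_DAYS):
    """Keep the records written within the last retention_days days."""
    # Assumption: a record written exactly retention_days ago is still kept.
    cutoff = today - timedelta(days=retention_days)
    return [record for record in records if record["day"] >= cutoff]


today = date(2018, 8, 2)
records = [
    {"device_id": "device-001", "day": date(2018, 8, 1)},
    {"device_id": "device-002", "day": date(2018, 6, 18)},  # 45 days earlier
    {"device_id": "device-003", "day": date(2018, 6, 17)},  # 46 days earlier
]

recent = within_retention(records, today)
print([record["device_id"] for record in recent])
```

Only the records inside the retention period reach the cleaning and mapping steps, which is why the data volume handled each day stays bounded.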

Corresponding to the embodiment shown in fig. 1, an embodiment of the present invention further provides a training data generating apparatus, and referring to fig. 3, the apparatus may include:

a data obtaining module 301, configured to obtain initial record data; each record in the initial record data at least comprises: a feature name;

an obtaining module 302, configured to obtain a target feature name from the initial record data;

a history obtaining module 303, configured to obtain saved history mapping data; the history mapping data includes: mapping relations between the feature names and the feature identifications and timestamps corresponding to the mapping relations;

the mapping module 304 is configured to generate current mapping data according to the target feature name, a timestamp in the historical mapping data, and a preset time window; the current mapping data includes: determining a mapping relation between the feature name and the feature identifier in the preset time window according to the timestamp; the mapping relation in the current mapping data comprises the mapping relation between the target feature name and the feature identifier; the preset time window is preset according to the model training period and/or the change condition of the statistical characteristic increment;

a generating module 305, configured to replace the target feature name in the initial record data with a corresponding feature identifier according to a mapping relationship between the target feature name and the feature identifier in the current mapping data, so as to generate training data.

Optionally, the obtaining module 302 is specifically configured to:

the generating module 305 is specifically configured to replace the target feature name in the cleaned initial record data with a corresponding feature identifier according to a mapping relationship between the target feature name and the feature identifier in the current mapping data, so as to generate training data.

the data obtaining module 301 may be specifically configured to:

Optionally, referring to fig. 4, the apparatus may further include:

a file generation module 401, configured to read data obtained from each terminal within a preset unit time, and generate a data file; the storage format of the data in the data file is the same as that of the data in the HBase;

an importing module 402, configured to import the data file into the HBase to obtain the initial record data;

optionally, the feature name includes a feature type;

the obtaining module 302 may include:

a counting submodule (not shown in the figure) for counting the occurrence frequency of each feature name in the initial record data;

a deletion submodule (not shown in the figure) configured to delete, according to a preset frequency threshold for each feature type, the feature name and the feature value of which the number of occurrences of the feature name in the initial record data is smaller than the preset frequency threshold corresponding to the feature; and obtaining the cleaned initial record data and the target characteristic name.

Optionally, referring to fig. 5, the mapping module 304 may include:

the allocating submodule 501 is configured to allocate a feature identifier to a target feature name that is not included in the historical mapping data, generate a new mapping relationship, and add the newly generated mapping relationship and a timestamp of a generation time to the current mapping data, where the generation time is the current time;

a modification submodule 502, configured to modify a timestamp of generation time of a mapping relationship corresponding to a target feature name included in the historical mapping data to be current time, and add the current time to the current mapping data;

the adding submodule 503 is configured to add a mapping relationship, in which a timestamp falls within the preset time window, except for the target feature name, to the current mapping data.

Optionally, referring to fig. 4, the apparatus may further include:

a saving module 403, configured to save the current mapping file as a history mapping file after the current mapping file is generated.

Optionally, referring to fig. 6, the generating module 305 may include:

a replacing submodule 601, configured to replace, according to a mapping relationship between a target feature name and a feature identifier in the current mapping data, the target feature name in the cleaned initial record data with a corresponding feature identifier;

a statistics submodule 602, configured to count, for each record in the cleaned initial record data, the number of feature identifiers in the record;

the generating sub-module 603 is configured to delete the record when the number of the feature identifiers included in the record is smaller than a preset first feature threshold or larger than a preset second feature threshold.

According to the training data generation device provided by the embodiment of the invention, the mapping relation is not regenerated every day, but the mapping relation in the time window is reserved, and in the tracing process, only one mapping relation between the feature name and the feature identifier needs to be stored according to the time window. The method can simplify the operation that a developer traces the features corresponding to the feature identifications according to the feature identifications in the training data in the model training process. Particularly, when the period set by the time window is the same as the period of model training, the characteristic name can be traced back subsequently only by storing a mapping relation between the characteristic name and the characteristic identifier, and the operation is greatly simplified.

Corresponding to the embodiment shown in fig. 1, the embodiment of the present invention further provides a server, as shown in fig. 7, including a processor 701, a communication interface 702, a memory 703 and a communication bus 704, where the processor 701, the communication interface 702, and the memory 703 complete mutual communication through the communication bus 704,

a memory 703 for storing a computer program;

the processor 701 is configured to implement the following steps when executing the program stored in the memory 703:

obtaining a target characteristic name from the initial record data;

According to the server provided by the embodiment of the invention, because the mapping relation is not regenerated every day, but the mapping relation in the time window is reserved, in the tracing process, only one mapping relation between the feature name and the feature identifier needs to be saved according to the time window. The method can simplify the operation that a developer traces the features corresponding to the feature identifications according to the feature identifications in the training data in the model training process. Particularly, when the period set by the time window is the same as the period of model training, the characteristic name can be traced back subsequently only by storing a mapping relation between the characteristic name and the characteristic identifier, and the operation is greatly simplified.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program realizes the steps of any one of the above training data generation methods when executed by a processor.

In a further embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the training data generation methods of the above embodiments.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A training data generation method applied to a server, the method comprising:

obtaining a target characteristic name from the initial record data;

2. The method of claim 1, wherein the step of obtaining the target feature name from the initial record data comprises:

3. The method according to claim 1, wherein the initial record data is stored in an HBase database, each record in the HBase database corresponds to a terminal, and each record includes: a terminal identification, a characteristic name and a characteristic value;

the step of obtaining initial recording data includes:

4. The method according to claim 3, wherein the initial record data is stored in the HBase database by:

5. The method of claim 2,

the feature name comprises a feature type;

6. The method of claim 2, wherein the step of generating the current mapping data according to the target feature name, the timestamp in the historical mapping data and a preset time window comprises:

7. The method of claim 2, further comprising, after the generating the current mapping file:

and saving the current mapping file as a history mapping file.

8. The method according to claim 2, wherein the step of generating training data by replacing the target feature name in the cleaned initial record data with a corresponding feature identifier according to the mapping relationship between the target feature name and the feature identifier in the current mapping data comprises:

9. An apparatus for generating training data, the apparatus comprising:

10. The apparatus according to claim 9, wherein the obtaining module is specifically configured to:

11. The apparatus according to claim 9, wherein the initial record data is stored in an HBase database, each record in the HBase database corresponds to a terminal, and each record includes: a terminal identification, a characteristic name and a characteristic value;

the data obtaining module is specifically configured to:

12. The apparatus of claim 9, further comprising:

13. The apparatus of claim 10,

the feature name comprises a feature type;

the obtaining module includes:

14. The apparatus of claim 10, wherein the mapping module comprises:

15. The apparatus of claim 10, further comprising:

16. The apparatus of claim 10, wherein the generating module comprises:

17. A server, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1 to 8 when executing a program stored in the memory.