CN117272048A - Data set processing method and device based on machine learning platform - Google Patents

Info

Publication number
CN117272048A
Authority
CN
China
Prior art keywords
data
data set
target
analysis
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311277638.5A
Other languages
Chinese (zh)
Inventor
王清臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zetyun Tech Co ltd
Original Assignee
Beijing Zetyun Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zetyun Tech Co ltd filed Critical Beijing Zetyun Tech Co ltd
Priority to CN202311277638.5A priority Critical patent/CN117272048A/en
Publication of CN117272048A publication Critical patent/CN117272048A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data set processing method and device, an electronic device, and a storage medium based on a machine learning platform, and relates to the technical field of data processing. The method comprises the following steps: acquiring a first data set and first analysis data corresponding to the first data set, wherein the first analysis data comprises at least one of the following: data type, data distribution and statistics; performing first operation processing on the first data set based on the first analysis data to generate a second data set; analyzing the second data set to obtain second analysis data corresponding to the second data set, wherein the second analysis data comprises at least one of the following: data type, data distribution and statistics; and storing the first analysis data and the second analysis data into a target file; wherein the first data set and the second data set are used to train an initial model through the machine learning platform.

Description

Data set processing method and device based on machine learning platform
Technical Field
The invention relates to the technical field of artificial intelligence and machine learning, in particular to a data set processing method and device based on a machine learning platform.
Background
Today, enterprises in many fields place increasingly stringent requirements on data storage and processing. Some enterprises have concentrated their data in data warehouses or big data platforms, where the data changes because data processing criteria are inconsistent.
In the traditional machine learning development and training process, the organization form of a data set is generally simple: a data set can store only a single version and a single fragment, and its metadata can generally describe only a single data set. As a result, for the same service scenario there are multiple data sets storing different versions and different fragments, and the information of each data set and the relationships among them must be recorded manually by personnel. If model development or training needs to rely on multiple versions or multiple fragments, the difficulty of data positioning and data exploration increases.
Therefore, the prior art has the problem of poor training effect on the initial model.
Disclosure of Invention
The embodiment of the invention provides a data set processing method and device based on a machine learning platform, electronic equipment and a storage medium, which are used for solving the problem of poor training effect on a model in the prior art.
In order to solve the above problems, the present invention is achieved as follows:
in a first aspect, an embodiment of the present invention provides a data set processing method based on a machine learning platform, where the method includes:
acquiring a first data set and first analysis data corresponding to the first data set, wherein the first analysis data comprises at least one of the following: data type, data distribution and statistics;
performing first operation processing on the first data set based on the first analysis data to generate a second data set;
analyzing the second data set to obtain second analysis data corresponding to the second data set, wherein the second analysis data comprises at least one of the following: data type, data distribution and statistics;
storing the first analysis data and the second analysis data into a target file;
wherein the first data set and the second data set are used to train an initial model through the machine learning platform.
In a second aspect, an embodiment of the present invention provides a model training method, including:
determining a target data set corresponding to a current scene, and generating target grouping data according to the target data set;
training the initial model based on the target grouping data to obtain a target model.
Optionally, the training the initial model based on the target grouping data includes:
acquiring demand information;
screening target fragment data from the target grouping data according to the demand information, and carrying out data processing on the target fragment data; and training the initial model based on the fragment data obtained after the data processing to obtain a target model.
In a third aspect, an embodiment of the present invention provides a data set processing apparatus based on a machine learning platform, including:
the acquisition module is used for acquiring a first data set and first analysis data corresponding to the first data set, wherein the first analysis data comprises at least one of the following: data type, data distribution and statistics;
the first processing module is used for performing first operation processing on the first data set based on the first analysis data to generate a second data set;
the second processing module is used for analyzing and processing the second data set to obtain second analysis data corresponding to the second data set, and the second analysis data comprises at least one of the following: data type, data distribution and statistics;
the storage module is used for storing the first analysis data and the second analysis data into a target file;
wherein the first data set and the second data set are used to train an initial model through the machine learning platform.
In a fourth aspect, an embodiment of the present invention provides a model training apparatus, including:
the processing module is used for determining a target data set corresponding to the current scene and generating target grouping data according to the target data set;
and the training module is used for training the initial model based on the target grouping data to obtain a target model.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the machine learning platform based dataset processing method as described in the first aspect, and the steps of the model training method as described in the second aspect.
In a sixth aspect, an embodiment of the present invention provides a computer readable storage medium, on which a computer program is stored, the computer program implementing the steps of the machine learning platform based data set processing method according to the first aspect and the steps of the model training method according to the second aspect when executed by a processor.
In the embodiment of the invention, a first data set and first analysis data corresponding to the first data set are first obtained; the first operation processing is then performed on the first data set according to the first analysis data to obtain a second data set; the second data set is then analyzed to obtain second analysis data corresponding to the second data set; and finally the first analysis data and the second analysis data are stored in a target file. For machine learning training, the embodiment of the invention enriches the organization form of the data set and generates multiple versions of the data set, which reduces the difficulty of data positioning and data exploration when the initial model is trained on the target data, and improves the training effect of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a data set processing method based on a machine learning platform according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data set according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data set processing method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a model training method according to an embodiment of the present invention;
FIG. 5 is a flow chart of another model training method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a data set processing apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first," "second," and the like in embodiments of the present invention are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a flow chart of a data set processing method based on a machine learning platform according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step 101, acquiring a first data set and first analysis data corresponding to the first data set, wherein the first analysis data comprises at least one of the following: data type, data distribution and statistics;
step 102, performing first operation processing on the first data set based on the first analysis data to generate a second data set;
step 103, analyzing the second data set to obtain second analysis data corresponding to the second data set, where the second analysis data includes at least one of the following: data type, data distribution and statistics;
step 104, storing the first analysis data and the second analysis data into a target file;
wherein the first data set and the second data set are used to train an initial model through the machine learning platform.
Steps 101, 102, 103, and 104 of the data set processing method may be performed by an electronic device, for example, a computer, which is not limited in the embodiments of the present invention.
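The following is a minimal illustrative sketch of steps 101 to 104, using only the Python standard library. The function names (`analyze`, `first_operation`), the sample rows, the JSON layout of the target file, and the choice of median filling as the first operation are all assumptions made for illustration; the embodiment does not prescribe them.

```python
import json
import statistics
import tempfile
from collections import Counter
from pathlib import Path


def analyze(rows):
    """Return per-column data type, distribution, and basic statistics."""
    report = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        entry = {
            "dtype": type(values[0]).__name__ if values else "unknown",
            "null_count": len(rows) - len(values),
        }
        if values and all(isinstance(v, (int, float)) for v in values):
            entry["stats"] = {
                "min": min(values),
                "max": max(values),
                "median": statistics.median(values),
            }
        else:
            entry["distribution"] = dict(Counter(values))
        report[col] = entry
    return report


def first_operation(rows, analysis):
    """Hypothetical first operation: fill numeric nulls with the column median."""
    result = []
    for row in rows:
        fixed = dict(row)
        for col, info in analysis.items():
            if fixed[col] is None and "stats" in info:
                fixed[col] = info["stats"]["median"]
        result.append(fixed)
    return result


first_dataset = [
    {"region": "north", "balance": 1200.0},
    {"region": "south", "balance": None},
    {"region": "north", "balance": 800.0},
]
first_analysis = analyze(first_dataset)                          # step 101
second_dataset = first_operation(first_dataset, first_analysis)  # step 102
second_analysis = analyze(second_dataset)                        # step 103

# Step 104: store both analyses in one metadata ("target") file.
target_file = Path(tempfile.mkdtemp()) / "metadata.json"
target_file.write_text(
    json.dumps({"V1": first_analysis, "V2": second_analysis}, indent=2)
)
```

Keeping the analysis of every version in the same file is what lets a later training run look up a version without rescanning the data.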
In step 101, the first data set and the corresponding first analysis data may be obtained by slicing an initial data set, for example: firstly, an initial data set and data slicing information corresponding to the initial data set are acquired, wherein the data slicing information can be stored in a target file, and then slicing processing is carried out on the initial data set based on the data slicing information, so that the first data set is obtained.
It should be noted that, the data slicing information may indicate a slicing rule preset by a user, that is, how to perform slicing processing on data according to the data slicing information, for example: the slicing rule may be that the user previously takes the date, the region, etc. as the slicing, and then the slicing process may be performed on the initial data set according to the date and the region.
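A slicing rule of this kind can be sketched as grouping rows by the fields named in the data slicing information. The sketch below assumes the slicing information is a simple dictionary with a `fields` list; that layout, the helper name `slice_dataset`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def slice_dataset(rows, slice_fields):
    """Group rows into fragments keyed by the values of the slicing fields."""
    fragments = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in slice_fields)
        fragments[key].append(row)
    return dict(fragments)


initial_dataset = [
    {"date": "2023-09-01", "region": "north", "amount": 10},
    {"date": "2023-09-01", "region": "south", "amount": 20},
    {"date": "2023-09-02", "region": "north", "amount": 30},
    {"date": "2023-09-01", "region": "north", "amount": 40},
]

# Hypothetical data slicing information preset by the user.
slicing_info = {"fields": ["date", "region"]}

first_dataset = slice_dataset(initial_dataset, slicing_info["fields"])
```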
In addition, the embodiment of the invention is applicable to most scenes, such as: the user product recommendation scenario may include user information and product information, and the initial data set may include data of at least one database matched in the current scenario because the user information and the product information may be stored in different places.
Specifically, the first data set may include at least one piece of fragment data, and the number of fragments in the first data set may be determined according to the data slicing information. For example, the user may specify that the initial data set is pulled as three fragments. The number of fragments can be set in advance and updated by the user according to actual requirements and data conditions, which is not limited in the embodiments of the present invention.
In addition, the first data set may be understood as an initial version of data or a data set, and further data processing may be performed on the first data set according to the service requirement to obtain an updated version of data or a data set (for example, the second data set in the step 102), where the updated version of data or the data set may also be input into the initial model as training text.
In step 102, a first operation is performed on the first data set based on the first analysis data, so as to generate the second data set, and the user may perform the first operation on the data according to the first analysis data (e.g., the relevant statistical index and distribution of the first data set) and according to the actual situation or service requirement of the data.
The determining of the first analysis data may be, for example: judging whether the first analysis data meets the preset conditions of a user, and performing first operation processing on the first data set to obtain the second data set under the condition that the first analysis data meets the preset conditions.
It should be noted that the first analysis data may be obtained by an algorithm in the related art, for example by automated analysis of the data. The analysis results may also include, for example: effective value, maximum value, minimum value, median, mode, composite number, variance, etc.; and length, width, class, etc.
In addition, the above-described preset conditions may be preset by the user, for example: when the value of one or more of the analysis data reaches a preset threshold, it may be understood that the analysis data satisfies the preset condition, and the first operation processing is continuously performed on the first data set, so as to obtain the second data set. The preset threshold may be set according to the service requirement and the actual situation of the data, which is not limited in the embodiment of the present invention.
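The threshold check described above can be sketched as follows. The metric `null_ratio`, the threshold value 0.2, and the choice of dropping incomplete rows as the first operation are assumptions for illustration only.

```python
def meets_preset_condition(analysis, thresholds):
    """Return True when at least one monitored value reaches its threshold."""
    return any(
        name in analysis and analysis[name] >= limit
        for name, limit in thresholds.items()
    )


def null_ratio(rows):
    """Fraction of rows that contain at least one missing value."""
    incomplete = sum(1 for r in rows if any(v is None for v in r.values()))
    return incomplete / len(rows) if rows else 0.0


first_dataset = [
    {"age": 30, "income": 5000},
    {"age": None, "income": 4200},
    {"age": 41, "income": 6100},
    {"age": 25, "income": 3900},
]
first_analysis = {"null_ratio": null_ratio(first_dataset)}
preset_thresholds = {"null_ratio": 0.2}  # hypothetical user-defined threshold

if meets_preset_condition(first_analysis, preset_thresholds):
    # Hypothetical first operation: drop rows with missing values.
    second_dataset = [r for r in first_dataset if None not in r.values()]
else:
    second_dataset = list(first_dataset)
```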
In step 103 and step 104, the second analysis data may be obtained in the same manner as the first analysis data, and the first analysis data and the second analysis data are then stored in the target file. The data set defined in the embodiment of the present invention is provided with a metadata file that records the data source information associated with the data set, where the data source information may be understood as data of the data set, and the metadata file may be understood as the target file.
In this embodiment, a first data set and first analysis data corresponding to the first data set are first obtained; the first operation processing is then performed on the first data set according to the first analysis data to obtain a second data set; the second data set is then analyzed to obtain second analysis data corresponding to the second data set; and finally the first analysis data and the second analysis data are stored in a target file. For machine learning training, the embodiment of the invention enriches the organization form of the data set and generates multiple versions of the data set, which reduces the difficulty of data positioning and data exploration when the initial model is trained on the target data, and improves the training effect of the model.
As an alternative implementation, refer to fig. 2, which is a schematic structural diagram of a data set provided by an embodiment of the present invention. The embodiment of the present invention may define a data set that is provided with a metadata file to record data source information associated with the data set, where the data source information may be understood as data of the data set.
Continuing with the above embodiment as an example, the initial data set may be data in the data set. Because different databases or different versions correspond to different drivers, the system may automatically match the corresponding driver information according to the data source information and then store the data slicing information into the metadata file, so as to finish pulling the initial data set and obtain the first data set (the initial version V1 of the data set). Finally, the pulled first grouping data may be analyzed, and the data type, data distribution or statistics of the first grouping data recorded into the metadata file.
After the initialization of version V1 of the data set is completed, the executing device may obtain the relevant statistical indexes and distributions of the data from the metadata information recorded in the metadata file, and may further process the data according to service requirements, for example by null value processing, generating a tag column, or data encoding. It should be understood that after further processing, the data itself and the corresponding metadata change, forming an updated version V2 of the data set. The updated version V2 may likewise be analyzed to obtain the corresponding data type, data distribution or statistical information, which is stored in the metadata file.
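The three example operations (null value processing, generating a tag column, and data encoding) can be sketched as a single pass that turns V1 rows into V2 rows. The column names, the fill value 0.0, the 50,000 cut-off for the tag, and the integer encoding are hypothetical choices made only for this illustration.

```python
def build_v2(v1_rows):
    """Derive V2 rows from V1 rows with three illustrative operations."""
    # Data encoding: map each known region name to a small integer code.
    categories = sorted({r["region"] for r in v1_rows if r["region"] is not None})
    codes = {name: index for index, name in enumerate(categories)}

    v2_rows = []
    for row in v1_rows:
        # Null value processing: a missing balance is treated as 0.0.
        balance = row["balance"] if row["balance"] is not None else 0.0
        v2_rows.append({
            "balance": balance,
            "region_code": codes.get(row["region"], -1),  # -1 marks unknown
            # Tag column: flag customers at or above a hypothetical cut-off.
            "is_high_value": int(balance >= 50_000),
        })
    return v2_rows


v1_rows = [
    {"region": "north", "balance": 60_000.0},
    {"region": "south", "balance": None},
    {"region": None, "balance": 1_000.0},
]
v2_rows = build_v2(v1_rows)
```

Because V2 has different columns and no nulls, its analysis data differs from that of V1, which is why the updated version is analyzed again before the metadata file is written.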
It should be understood that, depending on the actual service situation and data changes, an updated version V3 or V4 may be generated from the updated version V2. The relevant user may also use the metadata information recorded in the metadata file to select the actual data in the fragment of the data version required for model training, so as to meet the service requirement. The method records the original form of the data and its change process, supports multi-version data as the business changes, and meets the user's data classification requirements within the same version.
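Selecting fragments of a given version from the metadata can be sketched as a lookup over a nested dictionary. The metadata layout, the `rows` and `location` keys, and the location strings below are assumptions; the embodiment does not specify the format of the metadata file.

```python
# Hypothetical in-memory view of the metadata file.
metadata = {
    "V1": {"fragments": {"all": {"rows": 1000, "location": "v1/all"}}},
    "V2": {
        "fragments": {
            "fragment_1": {"rows": 400, "location": "v2/fragment_1"},
            "fragment_2": {"rows": 350, "location": "v2/fragment_2"},
            "fragment_3": {"rows": 200, "location": "v2/fragment_3"},
            "fragment_4": {"rows": 50, "location": "v2/fragment_4"},
        }
    },
}


def select_fragments(metadata, version, min_rows=0):
    """Return locations of fragments in a version with at least min_rows rows."""
    fragments = metadata[version]["fragments"]
    return [
        info["location"]
        for _, info in sorted(fragments.items())
        if info["rows"] >= min_rows
    ]


training_sources = select_fragments(metadata, "V2", min_rows=300)
```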
As another alternative embodiment, the first data set may be obtained by: firstly, obtaining the preset number of fragments of a user, determining a field to be processed in an initial data set according to data fragment information, and finally pulling the initial data set according to the field to be processed to obtain the first data set, wherein the first data set comprises fragment data corresponding to the number of fragments, and the method can generate fragment data meeting service requirements so as to complete training of an initial model.
In addition, pulling the initial data set according to the number of fragments and the key attributes can be implemented by the following code:
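A minimal sketch of such pulling is given here for illustration only. It assumes that the rows are ordered by one key attribute and then split into the requested number of contiguous fragments of near-equal size; the helper name `pull_fragments`, the key attribute `customer_id`, and the sample rows are hypothetical and are not taken from the embodiment.

```python
def pull_fragments(rows, key_field, num_fragments):
    """Order rows by a key attribute and split them into num_fragments parts."""
    if num_fragments < 1:
        raise ValueError("num_fragments must be at least 1")
    ordered = sorted(rows, key=lambda row: row[key_field])
    size, extra = divmod(len(ordered), num_fragments)
    fragments, start = [], 0
    for index in range(num_fragments):
        # The first `extra` fragments receive one additional row each.
        end = start + size + (1 if index < extra else 0)
        fragments.append(ordered[start:end])
        start = end
    return fragments


initial_dataset = [{"customer_id": cid} for cid in (7, 3, 5, 1, 6, 2, 4)]
first_dataset = pull_fragments(initial_dataset, "customer_id", 3)
```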
Optionally, the performing a first operation on the first data set based on the first analysis data to generate a second data set includes:
determining the to-be-processed fragment data in the first data set based on the current scene and the first analysis data;
and acquiring the to-be-processed fragment data as a to-be-processed data set, and performing first operation processing on the to-be-processed data set to generate a second data set.
In this embodiment, the to-be-processed fragment data in the first data set is first determined based on the current scene and the first analysis data, where the to-be-processed fragment data may be data preset by a user or data required by the initial model. The to-be-processed fragment data is then taken as a to-be-processed data set, and the first operation processing is performed on the to-be-processed data set to generate the second data set.
Optionally, the performing a first operation on the to-be-processed data set to generate a second data set includes:
determining a field to be processed of the data in the data set to be processed based on the current scene;
and carrying out data slicing processing on the data set to be processed based on the field to be processed, and generating the second data set based on the data subjected to the data slicing processing, wherein the second data set comprises a plurality of sliced data.
In the case where the current scene is a product recommendation scene, the basic information of a customer and the deposit information or loan information of the corresponding customer are first acquired, and the customer information, the deposit information and the loan information can then be used as the data set to be processed. The field to be processed in the first data set can then be set to the deposit information, and finally data slicing processing is performed on the data set to be processed based on the field to be processed to obtain the second data set. For example, the second data set includes four fragments (fragment 1: deposit balance less than 10,000 yuan; fragment 2: deposit balance between 10,000 and 50,000 yuan; fragment 3: deposit balance between 50,000 and 100,000 yuan; fragment 4: deposit balance over 100,000 yuan).
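The four deposit-balance fragments can be sketched with a boundary lookup. The text does not state which fragment a balance exactly on a boundary belongs to; the sketch assumes each boundary value falls into the higher fragment. The field name `deposit_balance` and the sample customers are hypothetical.

```python
from bisect import bisect_right

# Fragment boundaries in yuan, taken from the example above.
BOUNDARIES = [10_000, 50_000, 100_000]


def fragment_number(deposit_balance):
    """Return 1..4; a balance equal to a boundary goes to the higher fragment."""
    return bisect_right(BOUNDARIES, deposit_balance) + 1


def slice_by_balance(rows):
    """Split rows into the four deposit-balance fragments."""
    fragments = {number: [] for number in range(1, 5)}
    for row in rows:
        fragments[fragment_number(row["deposit_balance"])].append(row)
    return fragments


to_be_processed = [
    {"customer": "a", "deposit_balance": 9_999},
    {"customer": "b", "deposit_balance": 10_000},
    {"customer": "c", "deposit_balance": 75_000},
    {"customer": "d", "deposit_balance": 250_000},
]
second_dataset = slice_by_balance(to_be_processed)
```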
Optionally, the generating the second data set based on the data after the data slicing process includes:
and executing at least one of the following data operations on the data after the data slicing processing: null value processing, data encoding and generating a mark column;
and generating the second data set based on the data after the data operation.
Optionally, after the storing the first analysis data and the second analysis data in the target file, the method further includes:
updating the first data set based on the second data set to obtain a third data set, wherein the third data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the third data set into the target file.
In this embodiment, the first data set may be updated based on the second data set to obtain a new data set, i.e. the third data set, for example: selecting the target slice data in the second data set according to the metadata information corresponding to the second data set stored in the target file, and adding the target slice data in the first data set, wherein the target slice data may be any number of slice data in the second data set, such as partial or full slice data, or may be combining the target slice data with partial slice data in the first data set.
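One way to read this update is sketched below: fragments of the second data set are chosen by their stored metadata and added to a copy of the first data set. The dictionary layout, the `row_count` key, and the selection rule are assumptions made for illustration.

```python
def build_third_dataset(first, second, second_metadata, wanted):
    """Copy the first data set and add the selected fragments of the second."""
    chosen = [name for name, info in second_metadata.items() if wanted(info)]
    third = {name: list(rows) for name, rows in first.items()}
    for name in chosen:
        third[name] = list(second[name])
    return third


first_dataset = {"all": [{"id": 1}, {"id": 2}, {"id": 3}]}
second_dataset = {
    "fragment_1": [{"id": 1}, {"id": 2}],
    "fragment_4": [{"id": 3}],
}
# Hypothetical metadata recorded for the second data set in the target file.
second_metadata = {
    "fragment_1": {"row_count": 2},
    "fragment_4": {"row_count": 1},
}

third_dataset = build_third_dataset(
    first_dataset,
    second_dataset,
    second_metadata,
    wanted=lambda info: info["row_count"] >= 2,
)
```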
Optionally, the target data set comprises the first data set or the second data set;
after said storing said first and second analysis data into said target file, said method further comprises:
inputting the target data set into a target model to obtain a first output result of the target model;
correcting the first output result to obtain a fourth data set, wherein the fourth data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the fourth data set into the target file.
Optionally, after performing correction processing on the first output result to obtain a fourth data set, the method further includes:
combining the fourth data set with the target data set to obtain a fifth data set, wherein the fifth data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the fifth data set into the target file.
For a better understanding of the foregoing embodiments, refer to fig. 3, which is a schematic flow chart of another data set processing method provided by an embodiment of the present invention. In the process of model development and training, the original form of the data can be quickly obtained through the metadata information in the data set, so as to start data exploration, data processing or feature processing (step 1). After the data is explored or processed, the information can be reselected and stored back into the data set to obtain an updated version (step 2). Alternatively, after the data is explored or processed, part of the data can be pushed to the initial model to complete model training (step 3). In subsequent iterative model training, the designated version, or the data under a designated fragment, can be selected directly from the data set for model training (step 4).
The output result of the model training in step 3 may be stored in the data set after correction processing (as the fourth data set). In step 4, the fourth data set can be used directly as training text to train the model, or the fourth data set can be combined with the first data set or the second data set to obtain the fifth data set, which is then used as training text to train the model.
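The correction and combination described above can be sketched as follows. The stand-in model, the form of the corrections (a mapping from row id to a reviewed label), and merging by `id` are all assumptions; the embodiment does not specify how correction or combination is performed.

```python
def target_model(row):
    """Hypothetical stand-in model: predicts 1 for balances of 50,000 or more."""
    return int(row["balance"] >= 50_000)


def correct_outputs(rows, predictions, corrections):
    """Build the fourth data set: model outputs overridden by reviewed labels."""
    return [
        {"id": row["id"], "label": corrections.get(row["id"], prediction)}
        for row, prediction in zip(rows, predictions)
    ]


def combine(target_rows, fourth_rows):
    """Build the fifth data set by attaching corrected labels to target rows."""
    labels = {row["id"]: row["label"] for row in fourth_rows}
    return [{**row, "label": labels[row["id"]]} for row in target_rows]


target_dataset = [
    {"id": 1, "balance": 60_000},
    {"id": 2, "balance": 8_000},
    {"id": 3, "balance": 52_000},
]
first_output = [target_model(row) for row in target_dataset]
corrections = {3: 0}  # hypothetical manual review result

fourth_dataset = correct_outputs(target_dataset, first_output, corrections)
fifth_dataset = combine(target_dataset, fourth_dataset)
```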
Referring to fig. 4, fig. 4 is a flow chart of a model training method provided in an embodiment of the invention. As shown in fig. 4, the method includes the following steps:
step 401, determining a target data set corresponding to a current scene, and generating target grouping data according to the target data set;
and step 402, training the initial model based on the target grouping data to obtain a target model.
In step 401, the target data set may be selected according to user requirement information or model training requirements, and the target grouping data may be generated according to the target data set, for example by screening part of the data in the target data set according to the requirement.
It should be noted that the user may select all or part of the fragment data in the target grouping data as the training text, and the rule and the number of fragments may be preset by the user, which is not limited in the embodiments of the present invention.
In step 402, the initial model is trained using the target grouping data as input text, thereby obtaining a trained target model.
It should be noted that an initial model may be determined based on the target grouping data, that is, the initial model is determined to be a model associated with the current scene according to the target grouping data. A training text is then generated according to the target grouping data and input into the initial model for training, so as to obtain a target model, where the target model may be used for generating prediction information such as product recommendations.
It should be noted that, in the process of determining that the initial model is a model associated with the current scene according to the first grouping data, the initial model may be generated according to the first grouping data; alternatively, an initial model set in the related art may be used, and the initial model associated with the current scene selected from a plurality of initial models according to the first grouping data.
For machine learning training, the embodiment of the invention enriches the organization form of the data set and supports data under multiple fragments, which reduces the difficulty of data positioning and data exploration when model training relies on multiple fragments, and improves the training effect of the model.
Optionally, the training the initial model based on the target packet data includes:
acquiring demand information;
screening target fragment data from the target packet data according to the demand information, and performing data processing on the target fragment data; and training the initial model based on the processed fragment data to obtain a target model.
In this embodiment, the requirement information is acquired first and is used to obtain data that meets the service requirement. Such data may be specific to a particular piece of data; that is, the target fragment data is determined according to the requirement information and may then be extracted directly as the training text. Because services and data properties differ, data can be classified by time or by other dimensions. In this way, a user's need to classify data within the same version can be met, and data management and access policies spanning multiple fragments can be implemented.
It should be noted that, where the data set contains a plurality of group data, the demand information may also be used to screen part of the data in the plurality of group data. Take the product recommendation scenario as the target scenario. First, at the initial stage of creating the data set, basic customer information is accessed from the customer relationship management (Customer Relationship Management, CRM) system, customer deposit information is accessed from the transaction system, and customer loan information is accessed from the credit system. Finally, a wide table is formed with the customer information as the master data and serves as the V1 version of the data set, i.e., the first group data, in which all of these data are stored in a single fragment.
Then, data processing is performed on the V1 version. The processing logic calculates each customer's balance and slices the data by customer deposit balance into 4 slices (slice 1: deposit balance below 10,000 yuan; slice 2: deposit balance between 10,000 and 50,000 yuan; slice 3: deposit balance between 50,000 and 100,000 yuan; slice 4: deposit balance above 100,000 yuan), forming the V2 version of the data set, i.e., the second packet data, which contains four different slices.
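As a non-limiting illustration, the slicing of the V1 data by deposit balance described above could be sketched in Python as follows. The thresholds are read as 10,000, 50,000 and 100,000 yuan; the field name `deposit_balance`, the function names, and the assignment of values that fall exactly on a boundary to the higher slice are assumptions made for this sketch and are not specified by the embodiment.

```python
from collections import defaultdict

# Assumed upper bounds (in yuan) for slices 1-3; slice 4 takes everything else.
SLICE_UPPER_BOUNDS = [(1, 10_000), (2, 50_000), (3, 100_000)]


def slice_for_balance(balance):
    """Return the slice number (1-4) for a deposit balance in yuan."""
    for slice_id, upper in SLICE_UPPER_BOUNDS:
        if balance < upper:
            return slice_id
    return 4  # 100,000 yuan and above


def build_v2(v1_rows):
    """Group the rows of the V1 wide table into slices keyed by slice number."""
    slices = defaultdict(list)
    for row in v1_rows:
        slices[slice_for_balance(row["deposit_balance"])].append(row)
    return dict(slices)


# Hypothetical V1 rows; real rows would carry the full wide-table columns.
v1_rows = [
    {"customer_id": "c1", "deposit_balance": 3_000},
    {"customer_id": "c2", "deposit_balance": 20_000},
    {"customer_id": "c3", "deposit_balance": 75_000},
    {"customer_id": "c4", "deposit_balance": 250_000},
]
v2 = build_v2(v1_rows)
```

In this sketch the V1 rows are left untouched and V2 is a separate mapping from slice number to rows, which mirrors keeping both versions of the data set available.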
The requirement information is then obtained and used to screen the first packet data or the second packet data. In this embodiment, the requirement information selects the second slice data and the third slice data in the second packet data; that is, the second and third slice data are used as the training text to train the initial model. After training is completed, the model result may be evaluated, and once it meets the index standard of the service requirement, the trained model is put into use.
It should be understood that the above process is a process of model training in a product recommendation scene, and in a subsequent iterative training process of a model, raw data in a V1 version can be directly obtained, and data processing and model training are performed to complete iterative training of the model; of course, the processed data can be obtained from the V2 version, the model can be trained, and the iterative training of the model can be completed.
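As a non-limiting illustration, screening the second and third slices of the V2 packet data according to the requirement information could be sketched as follows. The structure of `demand_info` (a dictionary with a `slice_ids` list) and the function name are assumptions introduced only for this sketch.

```python
def select_training_rows(packet_data, demand_info):
    """Pick the slices named in demand_info and flatten them into training rows."""
    wanted = demand_info["slice_ids"]
    missing = [s for s in wanted if s not in packet_data]
    if missing:
        raise KeyError(f"slices not found in packet data: {missing}")
    rows = []
    for slice_id in wanted:
        rows.extend(packet_data[slice_id])
    return rows


# Hypothetical V2 packet data: slice number -> rows in that slice.
v2_packet = {
    1: [{"customer_id": "c1"}],
    2: [{"customer_id": "c2"}],
    3: [{"customer_id": "c3"}],
    4: [{"customer_id": "c4"}],
}
demand_info = {"slice_ids": [2, 3]}
training_rows = select_training_rows(v2_packet, demand_info)
```

Raising an error for an unknown slice, rather than silently skipping it, is a design choice of the sketch so that a mistyped requirement does not shrink the training text unnoticed.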
As an optional implementation, after screening the target fragment data from the target packet data according to the requirement information, performing data processing on the target fragment data, and training the initial model based on the processed fragment data to obtain a target model, the method further comprises:
acquiring data to be predicted under the current scene;
inputting the data to be predicted into the target model to obtain a prediction result data set associated with a target service, wherein the target service comprises one of the following: a batch running service and an online service;
and updating parameters of the target model based on the prediction result data set.
In this embodiment, for the target service of the target model, the data produced by inference may be stored for subsequent iterative training of the model. Specifically, the data to be predicted is input into the target model to obtain a prediction result data set associated with the target service, and this data set may be used as a new training text to train the target model.
Referring to fig. 5, fig. 5 is a flow chart of another model training method provided by the embodiment of the invention. In fig. 5, the target service of the target model may be a model batch running service or a model online service. For the model batch running service, once the batch prediction data is ready it is sent to the model batch running service (step 1); after batch prediction is finished, the labeled data is stored in the data set, completing the persistence of the data (step 2). After the data labels have been corrected, the persisted data may be used for iterative training of the model.
Taking product recommendation as an example, for the model batch running service, a batch of customer information data is obtained from the service system, processed and arranged into a data format that the model can recognize, and passed to the model. After model inference, each record in the batch is labeled to indicate whether the customer is willing to purchase the product. The labeled batch may be persisted into a data set: it may become a new version of the training data set and be used for another round of iterative model training once label correction is completed, or it may be stored in a new data set and archived for later use.
For the model online service, stream data is continuously sent to the model online service as business events occur (step 1); after prediction is completed, the labeled data may be stored in a data set (step 2), and the persisted data may be used for iterative model training once the data labels have been corrected.
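As a non-limiting illustration, the batch flow described above (labeling each record by inference, then storing the labeled batch as a new version of the training data set) could be sketched as follows. The `toy_model` function is only a stand-in for a trained target model, and the version naming scheme and field names are assumptions of this sketch.

```python
def run_batch(model, batch_rows):
    """Attach a predicted label to every record in the batch."""
    return [{**row, "predicted_label": model(row)} for row in batch_rows]


def persist_as_new_version(dataset_versions, labeled_rows):
    """Store labeled rows as the next version (V1, V2, ...) and return its name."""
    version = f"V{len(dataset_versions) + 1}"
    dataset_versions[version] = labeled_rows
    return version


def toy_model(row):
    # Stand-in for a trained model: predicts purchase intent from the balance.
    return row["deposit_balance"] >= 50_000


# Hypothetical existing versions of the training data set and a new batch.
dataset_versions = {"V1": [], "V2": []}
batch = [
    {"customer_id": "c5", "deposit_balance": 12_000},
    {"customer_id": "c6", "deposit_balance": 90_000},
]
labeled = run_batch(toy_model, batch)
new_version = persist_as_new_version(dataset_versions, labeled)
```

The stored labels are still raw predictions at this point; under the flow described above they would be corrected before the new version is used for iterative training.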
Again taking product recommendation as an example, for the model online service the model first needs to be packaged as an online REST application programming interface (Application Programming Interface, API) and exposed to the upstream business system. After the upstream business system integrates the API, once a user completes operations such as deposits, withdrawals, and transfers, the customer information is transmitted to the model's REST API service in real time, and the model recommends products to the customer based on that information, completing online real-time recommendation. While completing the recommendation, the model's REST API service also writes the recommendation data into an existing or new data set for subsequent iterative training or archiving.
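As a non-limiting illustration, the request handling inside such an online service could be sketched as below, independent of any particular web framework. The JSON request shape, the product names, and the `toy_recommender` function are hypothetical; a real deployment would wrap `handle` in an HTTP endpoint, which this sketch leaves out.

```python
import json


def make_handler(model, dataset_rows):
    """Build a request handler that recommends a product and records the result."""

    def handle(request_body):
        customer = json.loads(request_body)
        recommendation = model(customer)
        # Record the recommendation alongside the input for later iterative training.
        dataset_rows.append({**customer, "recommended_product": recommendation})
        return json.dumps({"recommended_product": recommendation})

    return handle


def toy_recommender(customer):
    # Stand-in for the packaged target model.
    if customer["deposit_balance"] >= 50_000:
        return "term_deposit"
    return "savings_plan"


stored_rows = []  # plays the role of the existing or new data set
handle = make_handler(toy_recommender, stored_rows)
response = handle(json.dumps({"customer_id": "c7", "deposit_balance": 80_000}))
```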
Referring to fig. 6, fig. 6 is a block diagram of a data set processing apparatus according to an embodiment of the present invention, and as shown in fig. 6, a data set processing apparatus 600 includes:
The acquiring module 601 is configured to acquire a first data set and first analysis data corresponding to the first data set, where the first analysis data includes at least one of the following: data type, data distribution and statistics;
a first processing module 602, configured to perform a first operation on the first data set based on the first analysis data, and generate a second data set;
the second processing module 603 is configured to perform analysis processing on the second data set to obtain second analysis data corresponding to the second data set, where the second analysis data includes at least one of the following: data type, data distribution and statistics;
a storage module 604, configured to store the first analysis data and the second analysis data into a target file;
wherein the first data set and the second data set are used to train an initial model through the machine learning platform.
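As a non-limiting illustration, the cooperation of the acquiring, processing, and storage modules could be sketched as follows: analysis data (data type, distribution, statistics) is computed for a first data set, an operation guided by that analysis produces a second data set, and both analyses are stored in a target file. The choice of dropping rows with null values as the first operation, the JSON file format, and all names are assumptions of this sketch.

```python
import json
import os
import statistics
import tempfile
from collections import Counter


def analyze(rows):
    """Compute per-column data type, value distribution, and basic statistics."""
    analysis = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        info = {
            "data_type": type(values[0]).__name__ if values else "unknown",
            "null_count": len(rows) - len(values),
        }
        if values and all(isinstance(v, (int, float)) for v in values):
            info["statistics"] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
            }
        else:
            info["distribution"] = dict(Counter(values))
        analysis[col] = info
    return analysis


def first_operation(rows, analysis):
    """Example operation: drop rows with nulls in columns the analysis flagged."""
    null_cols = [c for c, info in analysis.items() if info["null_count"] > 0]
    return [r for r in rows if all(r[c] is not None for c in null_cols)]


first_data_set = [
    {"customer_id": "c1", "deposit_balance": 3000},
    {"customer_id": "c2", "deposit_balance": None},
    {"customer_id": "c3", "deposit_balance": 9000},
]
first_analysis = analyze(first_data_set)
second_data_set = first_operation(first_data_set, first_analysis)
second_analysis = analyze(second_data_set)

# Store both analyses in one target file (a temporary JSON file in this sketch).
target_file = os.path.join(tempfile.mkdtemp(), "analysis.json")
with open(target_file, "w", encoding="utf-8") as fh:
    json.dump({"first": first_analysis, "second": second_analysis}, fh, indent=2)
```

Keeping the analysis of every data set in one file means a later step can compare versions without recomputing the statistics.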
Optionally, the first processing module 602 includes:
a first determining unit configured to determine, based on a current scene and the first analysis data, fragment data to be processed in the first data set;
the first processing unit is used for acquiring the to-be-processed fragment data as a to-be-processed data set, and performing first operation processing on the to-be-processed data set to generate a second data set.
Optionally, the first processing unit is specifically used for:
determining a field to be processed of the data in the data set to be processed based on the current scene;
and carrying out data slicing processing on the data set to be processed based on the field to be processed, and generating the second data set based on the data subjected to the data slicing processing, wherein the second data set comprises a plurality of sliced data.
Optionally, generating the second data set based on the data after the data slicing process includes:
and executing at least one of the following data operations on the data after the data slicing processing: null value processing, data encoding and generating a mark column;
and generating the second data set based on the data after the data operation.
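As a non-limiting illustration, the three data operations named above could be sketched as follows. The embodiment does not define them in detail, so the sketch assumes that null value processing fills a default, that data encoding maps category strings to integer codes, and that the mark column records whether a value was originally missing; the field names are likewise hypothetical.

```python
def process_slice(rows, category_field, default_balance=0):
    """Apply null value processing, data encoding, and mark column generation."""
    # Assign an integer code to each distinct category, sorted for determinism.
    categories = sorted(
        {r[category_field] for r in rows if r[category_field] is not None}
    )
    codes = {name: i for i, name in enumerate(categories)}
    processed = []
    for row in rows:
        out = dict(row)
        # Mark column: record whether the balance was originally missing.
        out["balance_was_null"] = row["deposit_balance"] is None
        # Null value processing: fill missing balances with a default.
        if out["deposit_balance"] is None:
            out["deposit_balance"] = default_balance
        # Data encoding: replace the category with its code (-1 if missing).
        out[category_field] = codes.get(row[category_field], -1)
        processed.append(out)
    return processed, codes


slice_rows = [
    {"customer_id": "c1", "deposit_balance": None, "segment": "retail"},
    {"customer_id": "c2", "deposit_balance": 20000, "segment": "corporate"},
]
processed, codes = process_slice(slice_rows, "segment")
```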
Optionally, the apparatus 600 further comprises:
the updating module is used for updating the first data set based on the second data set to obtain a third data set, and the third data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the third data set into the target file.
Optionally, the target data set comprises the first data set or the second data set;
the apparatus 600 further comprises:
The third processing module is used for inputting the target data set into a target model to obtain a first output result of the target model;
the correction module is used for correcting the first output result to obtain a fourth data set, and the fourth data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the fourth data set into the target file.
Optionally, the apparatus 600 further comprises:
the merging module is used for merging the fourth data set and the target data set to obtain a fifth data set, and the fifth data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the fifth data set into the target file.
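As a non-limiting illustration, the correction and merging modules could be sketched as follows. The embodiment does not specify how the merge is performed; this sketch assumes that corrections are supplied as a mapping from customer identifier to reviewed label and that merging means joining the corrected labels onto the target data set by that identifier.

```python
def correct_outputs(predicted_rows, corrections):
    """Build the fourth data set: replace predicted labels with reviewed ones."""
    corrected = []
    for row in predicted_rows:
        label = corrections.get(row["customer_id"], row["predicted_label"])
        corrected.append({"customer_id": row["customer_id"], "label": label})
    return corrected


def merge_data_sets(target_rows, corrected_rows, key="customer_id"):
    """Build the fifth data set: join corrected labels onto target rows by key."""
    labels = {r[key]: r["label"] for r in corrected_rows}
    return [
        {**row, "label": labels[row[key]]}
        for row in target_rows
        if row[key] in labels
    ]


# Hypothetical first output result of the target model and manual corrections.
first_output = [
    {"customer_id": "c1", "predicted_label": True},
    {"customer_id": "c2", "predicted_label": True},
]
corrections = {"c2": False}
target_data_set = [
    {"customer_id": "c1", "deposit_balance": 3000},
    {"customer_id": "c2", "deposit_balance": 20000},
]
fourth_data_set = correct_outputs(first_output, corrections)
fifth_data_set = merge_data_sets(target_data_set, fourth_data_set)
```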
The model training device provided in this embodiment of the present application is capable of implementing each process of each embodiment of the model training method shown in fig. 4; the technical features of the model training device and the model training method are in one-to-one correspondence and can achieve the same technical effect, so to avoid repetition, no detailed description is given here.
Referring to fig. 7, fig. 7 is a block diagram of a model training apparatus according to an embodiment of the present invention, and as shown in fig. 7, a model training apparatus 700 includes:
A generating module 701, configured to determine a target data set corresponding to a current scene, and generate target packet data according to the target data set;
and the training module 702 is configured to train the initial model based on the target packet data to obtain a target model.
Optionally, training module 702 includes:
the acquisition unit is used for acquiring the demand information;
the training unit is used for screening target fragment data from the target packet data according to the demand information and carrying out data processing on the target fragment data; and training the initial model based on the sliced data processed by the data processing to obtain a target model.
The data set processing device provided in this embodiment of the present application is capable of implementing each process of each embodiment of the data set processing method based on the machine learning platform shown in fig. 1, and technical features of the two correspond to each other one by one, and can achieve the same technical effect, so that repetition is avoided, and no redundant description is provided herein.
It should be noted that, the model training device in the embodiment of the present invention may be a device, or may be a component, an integrated circuit, or a chip in an electronic device.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device includes a memory 801, a processor 802, and a program or instructions stored on the memory 801 and executable on the processor 802. When the program or instructions are executed by the processor 802, any of the steps in the method embodiments corresponding to fig. 1 and fig. 4 can be implemented with the same beneficial effects, which will not be described again herein.
The processor 802 may be, among other things, a central processing unit (Central Processing Unit, CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA), or a graphics processor (Graphics Processing Unit, GPU).
Those of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the methods of the embodiments described above may be implemented by hardware associated with program instructions, where the program may be stored on a readable medium.
The embodiment of the present invention further provides a readable storage medium storing a computer program which, when executed by a processor, can implement any step in the embodiment of the data set processing method corresponding to fig. 1 or any step in the embodiment of the model training method corresponding to fig. 4, and can achieve the same technical effect; to avoid repetition, no further description is provided herein. The readable storage medium may be, for example, a Read-Only Memory (ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The terms "first," "second," and the like in embodiments of the present invention are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the use of "and/or" in this application means at least one of the connected objects; for example, "A and/or B and/or C" encompasses the 7 cases of A alone, B alone, C alone, both A and B, both B and C, both A and C, and all of A, B and C.
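As a brief illustration of the count given above, the 7 cases covered by "A and/or B and/or C" are exactly the non-empty subsets of the three objects (2^3 - 1 = 7), which can be enumerated as follows.

```python
from itertools import combinations

objects = ["A", "B", "C"]

# Every non-empty selection of the connected objects, from singles up to all three.
cases = [
    combo
    for size in range(1, len(objects) + 1)
    for combo in combinations(objects, size)
]
```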
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described example method may be implemented by means of software plus a necessary general hardware platform, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a second terminal device, etc.) to perform the method of the embodiments of the present application.
The embodiments of the present application have been described in connection with the accompanying drawings, but the present application is not limited to the above-described embodiments, which are intended to be illustrative only and not limiting, and many forms can be made by one of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (11)

1. A data set processing method based on a machine learning platform, comprising:
acquiring a first data set and first analysis data corresponding to the first data set, wherein the first analysis data comprises at least one of the following: data type, data distribution and statistics;
performing first operation processing on the first data set based on the first analysis data to generate a second data set;
analyzing the second data set to obtain second analysis data corresponding to the second data set, wherein the second analysis data comprises at least one of the following: data type, data distribution and statistics;
storing the first analysis data and the second analysis data into a target file;
wherein the first data set and the second data set are used to train an initial model through the machine learning platform.
2. The data set processing method according to claim 1, wherein the performing a first operation process on the first data set based on the first analysis data to generate a second data set includes:
determining the to-be-processed fragment data in the first data set based on the current scene and the first analysis data;
and acquiring the to-be-processed fragment data as a to-be-processed data set, and performing first operation processing on the to-be-processed data set to generate a second data set.
3. The data set processing method according to claim 2, wherein the performing a first operation process on the data set to be processed to generate a second data set includes:
determining a field to be processed of the data in the data set to be processed based on the current scene;
and carrying out data slicing processing on the data set to be processed based on the field to be processed, and generating the second data set based on the data subjected to the data slicing processing, wherein the second data set comprises a plurality of sliced data.
4. A data set processing method according to claim 3, wherein said generating the second data set based on the data after the data slicing process comprises:
and executing at least one of the following data operations on the data after the data slicing processing: null value processing, data encoding and generating a mark column;
and generating the second data set based on the data after the data operation.
5. The data set processing method according to claim 2, wherein after said storing the first analysis data and the second analysis data in the target file, the method further comprises:
updating the first data set based on the second data set to obtain a third data set, wherein the third data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the third data set into the target file.
6. The data set processing method according to claim 2, wherein a target data set includes the first data set or the second data set;
after said storing said first and second analysis data into said target file, said method further comprises:
inputting the target data set into a target model to obtain a first output result of the target model;
correcting the first output result to obtain a fourth data set, wherein the fourth data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the fourth data set into the target file.
7. The data set processing method according to claim 4, wherein after performing correction processing on the first output result to obtain a fourth data set, the method further comprises:
combining the fourth data set with the target data set to obtain a fifth data set, wherein the fifth data set is used for training an initial model through the machine learning platform;
and storing the analysis data matched with the fifth data set into the target file.
8. A method of model training, comprising:
determining a target data set corresponding to a current scene, and generating target packet data according to the target data set;
training the initial model based on the target packet data to obtain a target model.
9. The model training method of claim 8, wherein the training the initial model based on the target packet data comprises:
acquiring demand information;
screening target fragment data from the target packet data according to the demand information, and carrying out data processing on the target fragment data; and training the initial model based on the sliced data processed by the data processing to obtain a target model.
10. A machine learning platform based data set processing apparatus, comprising:
the acquisition module is used for acquiring a first data set and first analysis data corresponding to the first data set, wherein the first analysis data comprises at least one of the following: data type, data distribution and statistics;
the first processing module is used for performing first operation processing on the first data set based on the first analysis data to generate a second data set;
the second processing module is used for analyzing and processing the second data set to obtain second analysis data corresponding to the second data set, and the second analysis data comprises at least one of the following: data type, data distribution and statistics;
the storage module is used for storing the first analysis data and the second analysis data into a target file;
wherein the first data set and the second data set are used to train an initial model through the machine learning platform.
11. An electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the data set processing method of any one of claims 1 to 7 and the steps of the model training method of any one of claims 8 to 9.
CN202311277638.5A 2023-09-28 2023-09-28 Data set processing method and device based on machine learning platform Pending CN117272048A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311277638.5A CN117272048A (en) 2023-09-28 2023-09-28 Data set processing method and device based on machine learning platform

Publications (1)

Publication Number Publication Date
CN117272048A true CN117272048A (en) 2023-12-22

Family

ID=89211984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311277638.5A Pending CN117272048A (en) 2023-09-28 2023-09-28 Data set processing method and device based on machine learning platform

Country Status (1)

Country Link
CN (1) CN117272048A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination