CN114626619A - Hive-based data prediction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114626619A
Authority
CN
China
Prior art keywords
data
calling
prediction
target
hive
Prior art date
Legal status
Pending
Application number
CN202210284028.7A
Other languages
Chinese (zh)
Inventor
董萍
周靖植
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202210284028.7A priority Critical patent/CN114626619A/en
Publication of CN114626619A publication Critical patent/CN114626619A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Databases & Information Systems (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of artificial intelligence, and discloses a hive-based data prediction method, a hive-based data prediction device, computer equipment and a storage medium. The method comprises the following steps: storing the trained prediction model as a target format file, and loading the target format file into the data warehouse tool hive to obtain calling information of the prediction model; generating a target calling function of the prediction model according to the calling information and the data calling field; and when a model calling instruction is received, implementing the following steps by calling the target calling function: extracting target data from hive according to the data calling field, calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result. Because the target data and the prediction model are pre-stored in hive so that they can be called directly, and the prediction result is obtained by predicting inside hive, data acquisition and prediction are both completed in hive. This reduces the data interaction process, makes the upstream and downstream of the prediction process fully compatible, gives strong compatibility, and provides the high fault tolerance of hive.

Description

Hive-based data prediction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of model prediction, in particular to a hive-based data prediction method and device, computer equipment and a storage medium.
Background
With the development of the internet and information technology, the importance of data has become increasingly prominent, and data plays an important role in promoting social productivity and driving innovation and development. For example, data can be used to help predict the development trends of business affairs (e.g., personal behaviors and needs), which provides an effective data basis for planning related work.
In a traditional prediction method, a large amount of user data is generally acquired from a user system, model training is performed on the basis of the user data to obtain a prediction model, and the user data in the near future is then predicted through the prediction model on the basis of the prediction requirement to obtain a prediction result, so that relevant planning can be performed according to the prediction result. However, when the prediction model is used for prediction, it is necessary to ensure that the language and environment of the input data are consistent with those of the prediction model; because different systems have different data environments and data forms, a large number of data interaction processes are involved in prediction, and the compatibility and fault tolerance of the conventional prediction mode are therefore not high.
Disclosure of Invention
The invention provides a hive-based data prediction method and device, computer equipment and a storage medium, and aims to solve the technical problems of low compatibility and low fault tolerance of the traditional prediction method.
Provided is a hive-based data prediction method, which comprises the following steps:
storing the prediction model obtained by training as a target format file, and loading the target format file into a data warehouse tool hive to obtain calling information of the prediction model;
generating a target calling function of the prediction model according to the pre-acquired data calling field and calling information;
when a model calling instruction is received, the following steps are realized by calling a target calling function:
extracting target data from hive according to the data calling field, and calling a prediction model according to calling information;
and inputting the target data into the prediction model for prediction to obtain a prediction result.
Further, generating a target calling function of the prediction model according to the pre-acquired data calling field and the calling information, including:
determining target data input into the prediction model according to a user instruction;
determining a storage field of the target data in the hive to obtain a data calling field;
determining a prediction type of a prediction model;
and generating the target calling function according to the calling information, the data calling field and the prediction type as parameters of the target calling function.
Further, extracting target data in hive according to the data calling field, calling a prediction model according to calling information, inputting the target data into the prediction model for prediction to obtain a prediction result, and the method comprises the following steps:
determining calling information, a prediction type and a plurality of data calling fields in a target calling function;
extracting data corresponding to the data calling field in hive as target data;
and calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction so that the prediction model outputs a prediction result corresponding to the prediction type.
Further, loading the target format file into hive to obtain the calling information of the prediction model includes:
loading the target format file into a jar data package to obtain a Java jar file package;
loading the Java jar file package into the hive, wherein the Java jar file package exists in a specified position in the hive;
and taking the specified position in the hive or the file name of the Java jar file package as the calling information of the prediction model.
Further, the target format is a pmml file format, and the trained prediction model is stored as a target format file, including:
acquiring a preset model training data set;
model training is carried out on the model training data set through sklearn to obtain a prediction model, and the prediction model is stored in a pmml file format to obtain a target format file.
Further, obtaining a preset model training data set includes:
selecting a preset amount of user data from hive, and determining data corresponding to a data calling field in the user data as sample data;
the sample data is compiled into a model training data set.
Further, model training is performed on the model training data set through sklearn to obtain a prediction model, and the prediction model is stored in a pmml file format to obtain a target format file, which includes:
calling a sklearn feature processing module to perform feature processing on data in the model training data set to obtain a plurality of feature variables;
calling a sklearn model training module to train the initial model based on the plurality of feature variables to obtain a prediction model;
converting and saving the prediction model to the pmml file format through the sklearn2pmml library function.
Provided is a hive-based data prediction apparatus including:
the loading module is used for storing the prediction model obtained through training as a target format file, and loading the target format file into a data warehouse tool hive to obtain calling information of the prediction model;
the generation module is used for generating a target calling function of the prediction model according to the data calling field and the calling information;
the calling module is used for calling a target calling function to realize the following steps when receiving the model calling instruction:
and extracting target data in hive according to the data calling field, calling a prediction model according to calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result.
There is provided a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the hive-based data prediction method when executing the computer program.
There is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above hive based data prediction method.
In a solution provided by the hive-based data prediction method and device, the computer equipment and the storage medium, the trained prediction model is saved as a target format file, the target format file is loaded into the data warehouse tool hive to obtain the calling information of the prediction model, and a target calling function of the prediction model is then generated according to the calling information and the data calling field. Finally, when a model calling instruction is received, the following steps are implemented by calling the target calling function: extracting target data from hive according to the data calling field, calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result. Because the prediction model and the target data it needs to call are pre-stored in hive, both can subsequently be called directly in hive simply by calling the target calling function of the prediction model, and the prediction result is obtained by predicting inside hive. Data acquisition and prediction are therefore completed in hive, which reduces the data interaction process, makes the upstream and downstream of the prediction process fully compatible, gives strong compatibility, and achieves the high fault tolerance of hive.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of an application environment of a hive-based data prediction method according to an embodiment of the invention;
FIG. 2 is a flow chart of the hive-based data prediction method according to an embodiment of the invention;
FIG. 3 is a flowchart illustrating an implementation of step S10 in FIG. 2;
FIG. 4 is a flowchart illustrating an implementation of step S11 in FIG. 3;
FIG. 5 is a schematic diagram of an implementation of step S12 in FIG. 3;
FIG. 6 is a schematic flow chart of another implementation of step S10 in FIG. 2;
FIG. 7 is a schematic diagram of another implementation of step S20 in FIG. 2;
FIG. 8 is a schematic flow chart of another implementation of step S30 in FIG. 2;
FIG. 9 is a schematic diagram of an embodiment of the hive-based data prediction apparatus;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The hive-based data prediction method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1, in which the terminal equipment communicates with the server through a network.
A user sends a relevant instruction to the server through the terminal equipment. After receiving the instruction, the server stores the prediction model obtained by training as a target format file and loads the target format file into the data warehouse tool hive to obtain the calling information of the prediction model; a target calling function of the prediction model is then generated according to the calling information and the data calling field. Finally, when a model calling instruction is received, the following steps are implemented by calling the target calling function: extracting target data from hive according to the data calling field, calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result. In other words, the trained prediction model is saved as a target format file and loaded into hive to obtain the calling information of the prediction model, and the target calling function of the prediction model is generated according to the calling information and the data calling field; the user sends a calling instruction to the server through the terminal equipment, and after receiving the calling instruction, the server directly calls the target calling function, extracts the target data from hive according to the data calling field of the target calling function, and calls the prediction model according to the calling information of the target calling function, so that a prediction result is obtained by predicting the target data.
It should be understood that hive is a data warehouse tool based on the Hadoop distributed system. It provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop, and can be used to perform data extraction, transformation and loading. Hive can map a structured data file to a database table and provides an SQL query function, and big data processing carried out through hive generally has high fault tolerance.
In this embodiment, the target data to be called by the prediction model and the prediction model itself are stored in hive in advance, so that both can subsequently be called directly in hive simply by calling the target calling function of the prediction model, and the prediction result is obtained by predicting inside hive. Data acquisition and prediction are completed in hive, which reduces the data interaction process, makes the upstream and downstream of the prediction process fully compatible, gives strong compatibility, and achieves the high fault tolerance of hive. Meanwhile, because data acquisition and prediction are finished in hive, manual intervention is greatly reduced and the intelligence and automation of the prediction process are improved.
In this embodiment, the prediction model, the target data to be called by the prediction model, the target calling function and other data are all stored in the database of the server, so that when a prediction task is subsequently executed, the target calling function is directly acquired to call the prediction model for prediction. This reduces data interaction and thereby improves prediction efficiency.
The database in this embodiment is stored in a blockchain network and is used for the data used and generated in the hive-based data prediction method, such as the data of the prediction model, the target data that the prediction model needs to call, the target calling function, and the related data of the prediction result. The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains the information of a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer and the like. Deploying the database in the blockchain improves the security of data storage.
The terminal device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a hive-based data prediction method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s10: and storing the prediction model obtained by training as a target format file, and loading the target format file into the hive to obtain the calling information of the prediction model.
In this embodiment, after obtaining the prediction model through training, the server needs to store the trained prediction model as a target format file and load the target format file into hive, so that the prediction model can subsequently be called directly in hive; after the target format file is loaded into hive, the calling information of the prediction model can be generated according to the storage path of the target format file.
The storage path of the target format file may be used as the calling information of the prediction model, or the file name of the target format file may be used as the calling information of the prediction model after the file name of the target format file is associated with the storage path of the target format file.
S20: and generating a target calling function of the prediction model according to the calling information and the data calling field.
After the target format file of the prediction model is loaded into hive and the calling information of the prediction model is obtained, the server generates a target calling function of the prediction model according to the calling information and the data calling field. The data calling field is the storage field, in hive, of the target data that needs to be called when the prediction model subsequently carries out prediction. In this embodiment, considering that existing data processing generally uses hive, a large amount of user data is stored in hive in advance, and different types of data in the user data have different storage fields in hive; after the target format file of the prediction model is loaded into hive, the storage fields of the target data in hive can be determined directly, so as to generate the target calling function. If hive does not store the target data, then before or after the target format file of the prediction model is loaded into hive, the target data that needs to be called when the prediction model performs prediction is stored in hive in advance, so that the target data can be called directly when the prediction model subsequently performs prediction.
S30: when a model calling instruction is received, the following steps are realized by calling a target calling function: and extracting target data in hive according to the data calling field, calling a prediction model according to calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result.
After the target calling function of the prediction model is determined, relevant personnel determine whether the prediction model needs to be called for prediction according to business requirements. If the data of a certain user (such as user behaviors, needs, working hours and the like) needs to be predicted, the relevant personnel send a calling instruction for the prediction model to the server through the terminal equipment so as to predict the data of that user through the prediction model. After receiving the calling instruction, the server directly calls the target calling function of the prediction model, so that the following steps are implemented by calling the target calling function: extracting the relevant data of the user from hive as target data according to the data calling field in the target calling function, calling the prediction model from its storage path according to the calling information of the target calling function, and inputting the target data into the prediction model so that the prediction model predicts the target data, thereby obtaining the prediction result for the user; the prediction result is then stored in hive.
The target data used by the prediction model is the wide-table data stored in hive, so the target data can be called directly when the prediction model performs prediction, without separately extracting and processing user data and importing it into the prediction model; this reduces the data interaction process. After prediction is finished, the prediction result is kept directly in hive without involving other systems, which further reduces data export operations, so the upstream and downstream can be fully compatible and the compatibility is strong. Meanwhile, owing to the high fault tolerance of hive and the strictness of the environment (Java environment) in hive, for individual items of target data that contain abnormal values, namely data that does not conform to the data definitions in Java, the corresponding prediction result is simply output as empty and the overall prediction result is not affected, so prediction in hive also has high fault tolerance.
In this embodiment, the prediction model obtained through training is saved as a target format file, the target format file is loaded into hive to obtain the calling information of the prediction model, and a target calling function of the prediction model is then generated according to the calling information and the data calling field, the data calling field being the storage field in hive of the target data to be called when the prediction model carries out prediction. Finally, when a model calling instruction is received, the following steps are implemented by calling the target calling function: extracting target data from hive according to the data calling field, calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result. Because the target data to be called by the prediction model and the prediction model itself are stored in hive in advance, both can subsequently be called directly in hive by calling the target calling function of the prediction model, and prediction is carried out in hive to obtain the prediction result. Data acquisition and prediction are completed in hive, the data interaction process is reduced, the upstream and downstream of the prediction process can be fully compatible, the compatibility is strong, and the high fault tolerance of hive is achieved.
In an embodiment, the target format file is a pmml file, as shown in fig. 3, in step S10, the method stores the trained prediction model as the target format file, and specifically includes the following steps:
s11: and acquiring a preset model training data set.
Before the prediction model is trained, a preset model training data set needs to be acquired. Specifically, user data of different users is acquired, each item of user data is cleaned, screened, classified and labeled according to the prediction requirement to generate the sample data corresponding to that user, and the sample data corresponding to each user is then summarized into the model training data set. The sample data in the model training data set corresponds to a plurality of standard labels, and the standard labels can be determined according to the prediction requirement.
For example, taking the prediction requirement of predicting whether a user is a target user as an example, user data of a large number of users needs to be acquired, and sample data is obtained after the user data is cleaned, screened, classified and labeled. The sample data includes data types such as the user's education level, employment type, marital status, occupation, age, historical weekly working hours and income; that is, the standard labels of the sample data include labels such as education level, employment type, marital status, occupation, age, historical weekly working hours and income.
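Purely as an illustration of the data types and standard labels described above (the field names and values below are hypothetical and are not taken from the text), one labeled sample record could be represented as:

# A hypothetical labeled sample record; field names and values are illustrative only.
sample = {
    "education": "bachelor",      # education level (categorical)
    "employment": "private",      # employment type (categorical)
    "marital": "married",         # marital status (categorical)
    "occupation": "sales",        # occupation (categorical)
    "age": 35,                    # age (numerical)
    "hours": 45,                  # historical weekly working hours (numerical)
    "income": 12000,              # income (numerical)
    "is_target_user": 1,          # label determined by the prediction requirement
}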
S12: and performing model training on the model training data set through sklern to obtain a prediction model, and storing the prediction model into a pmml file format to obtain a target format file.
After the model training data set is obtained, it is imported into sklearn; the data of the model training data set is then processed through sklearn, model training is performed on the processed model training data set to obtain the prediction model, and finally the prediction model obtained through training is stored in the pmml file format, so that the target format file is obtained.
It should be understood that sklearn (i.e., scikit-learn) is a Python-based machine learning tool and a commonly used third-party module in machine learning. sklearn is built on NumPy, SciPy, Pandas and Matplotlib and packages common machine learning methods, including Regression, Dimensionality Reduction, Classification, Clustering and other methods, as well as models such as logistic regression, random forest and LightGBM, and it supports the training of many classes of models. Using sklearn to perform model training on the model training data set allows data processing, model training, model evaluation and other operations to be carried out quickly and conveniently, so that an accurate prediction model is obtained; this is simpler and faster than the traditional model training method.
In other embodiments, the model training data set may also be subjected to model training through XGBoost, LightGBM and the like to obtain the prediction model, and the prediction model is saved as a pmml file to obtain the target format file, which is not described here again. In this embodiment, a prediction model saved as a pmml file supports model extension, and currently common models such as decision trees, Logistic Regression (LR) and LightGBM are all supported, so the extensibility of the model is strong.
In this embodiment, a preset model training data set is obtained, model training is performed on the model training data set through sklearn to obtain the prediction model, and the prediction model is stored in the pmml file format to obtain the target format file; the training process of the prediction model is thus refined, and the prediction model obtained through training is stored as the target format file. The model can be trained quickly through sklearn to obtain the prediction model, data preprocessing and model training are also performed quickly, the data processing performance and the model training speed are far superior to those of the traditional model training mode, and a prediction model in the pmml file format has high extensibility.
In an embodiment, as shown in fig. 4, in step S11, acquiring a preset model training data set specifically includes the following steps:
s111: and selecting a preset amount of user data from the hive, and determining data corresponding to the data calling field in the user data as sample data.
User data of different users is imported into hive in advance after standardization processing; in hive, the storage format of each user's user data is the same, and the same type of data has the same storage field. When the model training data set needs to be acquired, the target data used for prediction and the storage fields of the target data in hive are first determined according to the prediction requirement; a preset amount of user data is then acquired directly from hive, and the data corresponding to the data calling fields (namely the storage fields of the target data in hive) is determined in that user data and taken as the sample data corresponding to the users. The data calling fields are used directly as the standard labels of the various kinds of data in the user data, so that the sample data of the users is obtained.
S112: the sample data is compiled into a model training data set.
After the sample data is obtained, the sample data corresponding to each user is collected into a model training data set.
In this embodiment, a preset amount of user data is acquired and selected from hive, the data corresponding to the data calling fields is determined in the user data and taken as sample data, and the sample data is compiled into the model training data set. Because the data is processed uniformly when it is imported into hive, extracting the sample data only requires extracting the data of the corresponding fields, without extra data processing work, so the sample data can be obtained quickly from hive to form the model training data set; the method is simple and fast.
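As a minimal sketch of this extraction step, assuming the sample data is pulled from hive through PyHive (the connection details, table name and field names below are hypothetical and not given in the text):

import pandas as pd
from pyhive import hive  # assumption: PyHive is used to issue the hive query

# Hypothetical HiveServer2 connection.
conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()

# The data calling fields, i.e. the storage fields of the target data in hive,
# plus an assumed label column; names are illustrative only.
call_fields = ["education", "employment", "marital", "occupation",
               "age", "gender", "hours", "income", "is_buy"]

# Select a preset amount of user data; only the calling fields are extracted.
cursor.execute(
    "SELECT " + ", ".join(call_fields) + " FROM user_wide_table LIMIT 200000"
)
rows = cursor.fetchall()

# Compile the sample data into the model training data set.
train_df = pd.DataFrame(rows, columns=call_fields)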
In an embodiment, as shown in fig. 5, in step S12, performing model training on the model training data set through sklearn to obtain a prediction model and storing the prediction model in a pmml file format specifically includes the following steps:
S121: calling the sklearn feature processing module to perform feature processing on the data in the model training data set to obtain a plurality of feature variables.
In the process of performing model training on the model training data set through sklearn to obtain the prediction model, the feature processing module of sklearn first needs to be called to perform feature processing on the data in the model training data set, so as to obtain a plurality of feature variables.
The feature processing mode is determined according to the data properties of the sample data. Taking sample data that includes the user's education level, employment type, marital status, occupation, age, weekly working hours, income and the like as an example, the plurality of feature variables include categorical variables and numerical variables: the education level, employment type, marital status, occupation and the like can be classed as categorical variables, and the age, weekly working hours, income and the like can be classed as numerical variables.
For example, the education level, employment type, marital status and occupation in the sample data are categorized by means of label binarization (LabelBinarizer), and the age, weekly working hours and income in the sample data are processed numerically by means of regularization, missing-value filling and the like. Taking the education level as an example, if the education level includes five categories, such as below high school, high school, bachelor, postgraduate, and doctor and above, the education level of a sample can be converted into a vector of 5 values (corresponding to the five education categories), in which the value for the category the education level belongs to is 1 and the others are 0: if the education level in the sample data is a bachelor degree, the education variable converted from that sample is (0, 0, 1, 0, 0); if the education level in the sample data is high school, the converted variable is (0, 1, 0, 0, 0).
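A small sketch of the label-binarization step just described (the category names follow the education example above; note that scikit-learn orders the output columns by sorted class name, so the exact column order may differ from the ordering used in the description):

from sklearn.preprocessing import LabelBinarizer

education_categories = ["below high school", "high school", "bachelor",
                        "postgraduate", "doctor"]

lb = LabelBinarizer()
lb.fit(education_categories)

# Each education level becomes a 5-dimensional one-hot vector, e.g. a bachelor
# degree maps to a vector with a single 1 in the column for "bachelor".
print(lb.classes_)
print(lb.transform(["bachelor", "high school"]))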
S122: calling the sklearn model training module to train the initial model based on the plurality of feature variables to obtain the prediction model.
After the sklearn feature processing module is used to perform feature processing on the data in the model training data set to obtain the plurality of feature variables, the sklearn model training module is called to train the initial model based on the feature variables to obtain the prediction model.
S123: the prediction model is converted and saved to the pmml file format through the sklearn2pmml library function.
After the sklearn model training module (a Python module) is called to train the initial model based on the plurality of feature variables to obtain the prediction model, the prediction model is converted and stored in the pmml file format through the sklearn2pmml library function, so that the target format file of the prediction model is obtained.
An example of a process of performing feature processing and model training and outputting a pmml file by calling the sklearn2pmml function through Python is as follows:
(The corresponding code listing is provided as drawings in the original publication and is not reproduced here.)
In this embodiment, the sample data is subjected to type-variable processing to obtain labeled categorical variables and labeled numerical variables, the LGBMClassifier function is then called, the model is trained through a PMMLPipeline, and finally the sklearn2pmml function is called to save the model training result in the pmml file format for subsequent calling in hive.
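Since the drawings are not reproduced above, the following is a minimal sketch of the described flow (feature processing, an LGBMClassifier trained through a PMMLPipeline, and export with sklearn2pmml). It is an assumption-laden reconstruction, not the listing from the drawings: the column names mirror the later hive example (Age, Employment, Education, Marital, Occupation, Income, Gender, Deductions, Hours), the label column is_buy and the CSV source are hypothetical, and DataFrameMapper from sklearn-pandas is used for the per-column feature processing.

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Model training data set assembled from hive (hypothetical local copy).
train_df = pd.read_csv("model_training_data.csv")

categorical = ["Employment", "Education", "Marital", "Occupation", "Gender"]
numerical = ["Age", "Income", "Deductions", "Hours"]

# Label-binarize the categorical variables; pass the numerical variables through.
mapper = DataFrameMapper(
    [(col, LabelBinarizer()) for col in categorical] +
    [([col], None) for col in numerical]
)

# Train the model through a PMMLPipeline with an LGBMClassifier.
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", LGBMClassifier(n_estimators=100)),
])
pipeline.fit(train_df[categorical + numerical], train_df["is_buy"])

# Convert and save the trained model in pmml file format for loading into hive.
sklearn2pmml(pipeline, "LightGBMAudit.pmml")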
In this embodiment, the sklearn feature processing module is called to perform feature processing on the data in the model training data set to obtain a plurality of feature variables, the sklearn model training module is called to train the initial model based on the feature variables to obtain the prediction model, and the prediction model is then converted and stored in the pmml file format through the sklearn2pmml library function; this determines the specific process of performing model training on the model training data set through sklearn to obtain the prediction model and storing it in the pmml file format to obtain the target format file. The model is trained directly and the trained prediction model is stored as a pmml file by calling the sklearn2pmml function, and the computing performance is far superior to that of a single-machine Python version. In actual tests, with sample data on the order of 20 million (2kw+) records and 180+ feature variables, the time from loading the data set to outputting the pmml file of the prediction model is less than 10 minutes.
In an embodiment, as shown in fig. 6, in step S10, loading the target format file into the hive to obtain the calling information of the prediction model, the method specifically includes the following steps:
S13: loading the target format file into a jar data package to obtain a Java jar file package.
S14: the Java jar file package is loaded into hive and exists in a specified position in hive.
S15: and taking the specified position in the hive or the file name of the Java jar file package as the calling information of the prediction model.
After the target format file is obtained, that is, after the prediction model is saved as a pmml file, the target format file is loaded into a jar data package to obtain a Java jar file package. The jar data package can be opened, and the target format file is then loaded into the jar data package, newly or as an update according to business requirements, so that the Java jar file package is generated.
After the Java jar file package is obtained, it is loaded into hive and stored at a specified location (storage path) in hive, and the specified location in hive or the file name of the Java jar file package is used as the calling information of the prediction model. The specified location is the absolute path where the Java jar file package is stored, so that the prediction model in the Java jar file package can subsequently be called according to this storage path. When the file name of the Java jar file package is used as the calling information of the prediction model, the file name needs to be associated with the specified location (the storage path of the Java jar file package), so that the prediction model in the Java jar file package can subsequently be called according to the storage path associated with the file name.
In this embodiment, owing to the high fault tolerance of hive and the strictness of Java, the target format file is loaded into the jar data package to obtain the Java jar file package, and the Java jar file package is then loaded into hive; the high fault tolerance of hive and the strictness of Java can thus be combined, which further improves the fault tolerance of the pmml prediction model. Even if individual items of target data contain abnormal values, namely the target data does not conform to the data definitions in Java, the corresponding prediction result of the pmml prediction model is simply output as null, and the overall prediction result is not affected.
In this embodiment, the target format file is loaded into the jar data package to obtain the Java jar file package, the Java jar file package is then loaded into hive and stored at the specified location in hive, and the specified location in hive or the file name of the Java jar file package is used as the calling information of the prediction model. This defines the specific process of loading the target format file into hive to obtain the calling information of the prediction model, combines the high fault tolerance of hive with the strictness of Java, further improves the fault tolerance of the pmml prediction model, and provides a basis for subsequently generating the target calling function according to the calling information of the prediction model.
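The text does not give the concrete statements used for this loading. One way it could be done, as a sketch, is to issue ADD JAR and a function registration statement through PyHive; the jar path and the implementing class name com.example.PmmlUdf below are purely hypothetical, and pmml_udf is the function name used in the later example.

from pyhive import hive  # assumption: statements are issued through PyHive

conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()

# Load the Java jar file package (which wraps the pmml target format file)
# into hive; the jar path acts as part of the calling information.
cursor.execute("ADD JAR /opt/models/pmml_model.jar")

# Register the prediction UDF so it can be referenced as pmml_udf in queries;
# the implementing class name is hypothetical.
cursor.execute("CREATE TEMPORARY FUNCTION pmml_udf AS 'com.example.PmmlUdf'")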
In an embodiment, as shown in fig. 7, in step S20, that is, generating a target calling function of a prediction model according to the calling information and the data calling field, the method specifically includes the following steps:
s21: a plurality of target data input to the predictive model is determined according to the user instruction.
After the prediction model is loaded into hive, a plurality of items of target data that need to be input into the prediction model for prediction are determined according to the user instruction.
For example, taking the business requirement of predicting whether a user is a target user as an example, the data that needs to be input into the prediction model includes data such as the user's education level, employment type, marital status, occupation, age, historical weekly working hours and income; the target data indicated in the user instruction therefore includes a plurality of items of data such as the user's education level, employment type, marital status, occupation, age, historical weekly working hours and income.
S22: and determining a storage field of the target data in the hive to obtain a data call field.
After the target data to be input into the prediction model is determined according to the business requirement, the storage fields of the target data in hive are determined and used as the data calling fields, which provides a basis for subsequently calling the target data according to the data calling fields.
For example, the target data includes a plurality of items of data such as the user's education level, employment type, marital status, occupation, age, gender, historical weekly working hours and income; the storage fields of these data in hive are determined, and the data calling fields accordingly include the education level, employment type, marital status, occupation, age, gender, historical weekly working hours and income fields.
S23: a prediction type of the prediction model is determined.
At the same time, the prediction type of the prediction model is determined according to the prediction requirement. Taking a binary classification model as an example, the prediction types of the prediction model include "is a certain event" and "is not a certain event".
For example, if the prediction requirement is to predict whether a certain user is a target user, the prediction type includes two cases, namely that the user is a target user and that the user is not a target user, where 1 may indicate that the user is a target user and 0 that the user is not. After the prediction type of the prediction model is determined in the target calling function, the prediction result output by the prediction model is the probability of that prediction type: the result output by the binary classification prediction model may be the probability that the user is a target user, or, for the other prediction type, the probability that the user is not a target user.
S24: and generating the target calling function according to the calling information, the data calling field and the prediction type as parameters of the target calling function.
After the calling information, the data calling fields and the prediction type of the target format file are determined, they are used as the parameters of the target calling function to generate the target calling function; that is, the target calling function contains three pieces of parameter information, namely the calling information, the data calling fields and the prediction type, for subsequent reading. The target calling function needs to correspond to the file format of the target format file.
For example, if the target format is the pmml file format, the target calling function is the pmml_udf function, and the generation process of the target calling function is as follows:
create table tmp_dp_pmml_puke_test2 as select
applicant, cellphone, is_buy,
pmml_udf("/LightGBMAudit.pmml", named_struct("Age", Age, "Employment", Employment, "Education", Education, "Marital", Marital, "Occupation", Occupation, "Income", Income, "Gender", Gender, "Deductions", Deductions, "Hours", Hours), "probability(1)")
from tmp_dp_pmml_qudit_test1;
The pmml_udf function includes three parameters: the calling information, the data calling fields and the prediction type. The calling information in this embodiment is the file name of the target format file (the prediction model), namely the LightGBMAudit.pmml field above; the data calling fields in this embodiment are the named_struct part, and include fields such as education level (Education), employment type (Employment), marital status (Marital), occupation (Occupation), age (Age), gender (Gender), historical weekly working hours (Hours), income (Income) and deductions (Deductions); the prediction type in this embodiment is represented by probability(1), namely a certain event, and the output result of the prediction model is the probability of that event.
In this embodiment, a plurality of items of target data to be input into the prediction model are determined according to the business requirement, the storage fields of the target data in hive are then determined to obtain a plurality of data calling fields, and the prediction type of the prediction model is determined according to the prediction requirement, the prediction type being the probability of a preset value obtained by the prediction of the model; finally, the calling information, the data calling fields and the prediction type are used as the parameters of the target calling function to generate the target calling function. This determines the specific steps of generating the target calling function of the prediction model according to the calling information and the data calling fields. Because the prediction type is determined according to the prediction requirement, the data calling fields of the target data are determined in hive, and the calling information, data calling fields and prediction type are used as parameters to generate the target calling function, a basis is provided for subsequent prediction; the prediction model and its input data can be called simply through the target calling function, and the prediction type of the target calling function can be determined according to the prediction requirement so that the prediction model subsequently outputs the corresponding probability. The method is convenient and simple and requires no complex data operations.
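As a sketch of how the three parameters could be assembled into the text of the target calling function (the helper below is illustrative and not part of the described method; the parameter values simply mirror the example above):

def build_target_call(calling_info, call_fields, prediction_type):
    # Assemble a pmml_udf call from the calling information, the data calling
    # fields and the prediction type (illustrative helper only).
    struct_args = ", ".join('"{0}", {0}'.format(f) for f in call_fields)
    return 'pmml_udf("{0}", named_struct({1}), "{2}")'.format(
        calling_info, struct_args, prediction_type)

call_fields = ["Age", "Employment", "Education", "Marital", "Occupation",
               "Income", "Gender", "Deductions", "Hours"]
target_call = build_target_call("/LightGBMAudit.pmml", call_fields, "probability(1)")
print(target_call)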
In an embodiment, as shown in fig. 8, in step S30, extracting the target data from hive according to the data calling field and calling the prediction model according to the calling information to predict the target data and obtain the prediction result, by calling the target calling function, specifically includes the following steps:
s31: and determining calling information, a prediction type and a data calling field in the target calling function.
S32: and extracting data corresponding to the data call field in hive as target data.
S33: and calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction so that the prediction model outputs a prediction result corresponding to the prediction type.
When a prediction model in hive needs to be called for prediction, the target calling function of the prediction model is obtained, and the calling information, the prediction type and the data calling fields (of which there may be several) in the target calling function are determined. The target data corresponding to each data calling field is extracted from hive to obtain a plurality of items of target data, the prediction model is called in hive according to the calling information, and the target data is input into the prediction model for prediction, so that the prediction model outputs a prediction result corresponding to the prediction type. The prediction result here is the probability value corresponding to the prediction type; taking a binary classification model as an example, the prediction result is the probability value of a certain event or the probability value of not being that event.
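A short sketch of issuing such a call through PyHive and reading back the per-user probabilities, reusing the hypothetical table and field names from the example above:

from pyhive import hive

conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()

# Extract the calling-field data and run the prediction inside hive in one query;
# the pmml_udf parameters mirror the example above.
cursor.execute(
    'SELECT applicant, '
    'pmml_udf("/LightGBMAudit.pmml", named_struct("Age", Age, "Employment", Employment, '
    '"Education", Education, "Marital", Marital, "Occupation", Occupation, '
    '"Income", Income, "Gender", Gender, "Deductions", Deductions, "Hours", Hours), '
    '"probability(1)") AS prob '
    'FROM tmp_dp_pmml_qudit_test1 LIMIT 10'
)

# Each row gives the probability (prediction type probability(1)) for one user.
for applicant, prob in cursor.fetchall():
    print(applicant, prob)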
In this embodiment, the calling information, the prediction type and the data calling fields in the target calling function are determined, the target data corresponding to the data calling fields is then extracted from hive, and finally the prediction model is called according to the calling information and the target data is input into the prediction model for prediction, so that the prediction model outputs a prediction result corresponding to the prediction type. This clarifies the specific process of calling the target calling function, extracting target data from hive according to the data calling fields, and calling the prediction model according to the calling information to predict the target data and obtain the prediction result. The whole operation can be completed by directly calling the target calling function without any other operations, which realizes full automation of prediction, reduces the data interaction process, and allows the upstream and downstream of the prediction process to be fully compatible. In addition, the prediction type of the target calling function can be determined according to the prediction requirement, so that the prediction model outputs the probability required by the prediction according to the target calling function.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a hive-based data prediction apparatus is provided, and the hive-based data prediction apparatus corresponds to the hive-based data prediction method in the above embodiments one to one. As shown in fig. 9, the hive-based data prediction apparatus includes a loading module 901, a generating module 902, and a calling module 903. The functional modules are explained in detail as follows:
the loading module 901 is configured to store the trained prediction model as a target format file, and load the target format file into the data warehouse tool hive to obtain calling information of the prediction model;
the generating module 902 is configured to generate a target call function of the prediction model according to the call information and the data call field, where the data call field is a storage field of target data to be called in the hive when the prediction model performs prediction;
the calling module 903 is configured to, when receiving the model call instruction, call a target call function to implement the following steps: and extracting target data in hive according to the data calling field, calling a prediction model according to calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result.
Further, the generating module 902 is specifically configured to:
determining a plurality of target data input into the prediction model according to a user instruction;
determining storage fields of the target data in the hive to obtain data calling fields;
determining a prediction type of a prediction model;
and taking the calling information, the data calling field and the prediction type as parameters of the target calling function to generate the target calling function.
Further, the calling module 903 is specifically configured to:
determining calling information, a prediction type and a data calling field in a target calling function;
extracting data corresponding to the data calling field in hive as target data;
and calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction so that the prediction model outputs a prediction result corresponding to the prediction type.
Further, the loading module 901 is specifically configured to:
loading the target format file into a jar data package to obtain a Java jar file package;
loading the Java jar file package into the hive, wherein the Java jar file package exists in a specified position in the hive;
and taking the specified position in the hive or the file name of the Java jar file package as the calling information of the prediction model.
Further, the target format is a pmml file format, and the loading module 901 is further specifically configured to:
obtaining a model training data set of a prediction model;
model training is carried out on the model training data set through sklearn to obtain a prediction model, and the prediction model is stored in a pmml file format to obtain a target format file.
Further, the loading module 901 is further specifically configured to:
acquiring a preset amount of user data in hive, and determining data corresponding to a data calling field in the user data as sample data;
the sample data is compiled into a model training data set.
Further, the loading module 901 is specifically further configured to:
calling a sklearn feature processing module to perform feature processing on data in the model training data set to obtain a plurality of feature variables;
calling a sklearn model training module to train the initial model based on the plurality of feature variables to obtain a prediction model;
converting and saving the prediction model to the pmml file format through the sklearn2pmml library function.
For specific limitations of the hive-based data prediction apparatus, reference may be made to the above limitations of the hive-based data prediction method, which is not described herein again. The various modules in the hive-based data prediction apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a storage medium and an internal memory. The storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and computer programs in the storage medium to run. The database of the computer equipment is used for storing the prediction model and target data, target calling functions and other data required by the prediction model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a hive-based data prediction method.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
storing the prediction model obtained by training as a target format file, and loading the target format file into a data warehouse tool hive to obtain calling information of the prediction model;
generating a target calling function of the prediction model according to the calling information and the data calling field;
when a model calling instruction is received, the following steps are realized by calling a target calling function:
and extracting target data in hive according to the data calling field, calling a prediction model according to calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
storing the trained prediction model as a target format file, and loading the target format file into a data warehouse tool hive to obtain calling information of the prediction model;
generating a target calling function of the prediction model according to the calling information and the data calling field;
when a model calling instruction is received, the following steps are realized by calling a target calling function:
and extracting target data in hive according to the data calling field, calling a prediction model according to calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A hive-based data prediction method is characterized by comprising the following steps:
storing the prediction model obtained by training as a target format file, and loading the target format file into a data warehouse tool hive to obtain calling information of the prediction model;
generating a target calling function of the prediction model according to the data calling field and the calling information;
when a model calling instruction is received, the following steps are realized by calling the target calling function:
extracting target data in the hive according to the data calling field, calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction to obtain a prediction result.
2. The hive-based data prediction method of claim 1, wherein the generating a target calling function of the prediction model according to the data calling field and the calling information comprises:
determining the target data input into the prediction model according to a user instruction;
determining a storage field of the target data in the hive to obtain a data calling field;
determining a prediction type of the prediction model;
and generating the target calling function by taking the calling information, the data calling field and the prediction type as parameters of the target calling function.
3. The hive-based data prediction method of claim 1, wherein the extracting target data in the hive according to the data call field, calling the prediction model according to the call information, inputting the target data into the prediction model for prediction to obtain a prediction result, comprises:
reading the calling information, the prediction type and the data calling field in the target calling function;
extracting data corresponding to the data calling field in the hive to serve as the target data;
and calling the prediction model according to the calling information, and inputting the target data into the prediction model for prediction so that the prediction model outputs a prediction result corresponding to the prediction type.
4. The hive-based data prediction method of claim 1, wherein the loading the object format file into the hive to obtain the calling information of the prediction model comprises:
packaging the target format file into a jar data package to obtain a Java jar file package;
loading the Java jar file package into the hive, wherein the Java jar file package is stored at a specified position in the hive;
and taking the specified position in the hive or the file name of the Java jar file package as the calling information of the prediction model.
5. The hive-based data prediction method of any one of claims 1 to 4, wherein the target format is a pmml file format, and the saving the trained prediction model as a target format file comprises:
acquiring a preset model training data set;
and performing model training on the model training data set through sklearn to obtain the prediction model, and storing the prediction model in a pmml file format to obtain the target format file.
6. The hive based data prediction method of claim 5, wherein the obtaining the preset model training data set comprises:
selecting a preset amount of user data from the hive, and determining data corresponding to the data calling field in the user data as sample data;
and collecting the sample data into the model training data set.
7. The hive-based data prediction method of claim 5, wherein the performing model training on the model training data set through sklearn to obtain the prediction model and storing the prediction model in a pmml file format comprises:
calling a feature processing module of the sklearn to perform feature processing on data in the model training data set to obtain a plurality of feature variables;
calling a model training module of the sklearn to train an initial model based on the plurality of feature variables to obtain the prediction model;
and converting and saving the prediction model to the pmml file format through a sklearn2pmml library function.
8. A hive-based data prediction apparatus, comprising:
the loading module is used for storing the prediction model obtained through training as a target format file, and loading the target format file into a data warehouse tool hive to obtain calling information of the prediction model;
the generating module is used for generating a target calling function of the prediction model according to the data calling field and the calling information;
the calling module is used for calling the target calling function to realize the following steps when a model calling instruction is received:
extracting the target data in the hive according to the data calling field, and calling the prediction model according to the calling information;
and inputting the target data into the prediction model for prediction to obtain a prediction result.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the computer program implements the steps of the hive based data prediction method as defined in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the hive-based data prediction method according to any one of claims 1 to 7.
CN202210284028.7A 2022-03-22 2022-03-22 Hive-based data prediction method and device, computer equipment and storage medium Pending CN114626619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210284028.7A CN114626619A (en) 2022-03-22 2022-03-22 Hive-based data prediction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210284028.7A CN114626619A (en) 2022-03-22 2022-03-22 Hive-based data prediction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114626619A true CN114626619A (en) 2022-06-14

Family

ID=81903621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210284028.7A Pending CN114626619A (en) 2022-03-22 2022-03-22 Hive-based data prediction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114626619A (en)

Similar Documents

Publication Publication Date Title
Weyuker et al. Comparing the effectiveness of several modeling methods for fault prediction
US11662719B2 (en) Classification modeling for monitoring, diagnostics optimization and control
CN109711802A (en) Item information processing method, device, computer equipment and storage medium
CN110929036A (en) Electric power marketing inspection management method and device, computer equipment and storage medium
CN104423968B (en) It designs the method for service logic, execute its server and storage medium
US20210304073A1 (en) Method and system for developing a machine learning model
CN113626241B (en) Abnormality processing method, device, equipment and storage medium for application program
CN114943383A (en) Prediction method and device based on time series, computer equipment and storage medium
US10984283B2 (en) Recognition of biases in data and models
CN114139490B (en) Method, device and equipment for automatic data preprocessing
CN110134589B (en) Interface test case generation method and device, computer equipment and storage medium
Shamim et al. Cloud Computing and AI in Analysis of Worksite
CN110532773B (en) Malicious access behavior identification method, data processing method, device and equipment
CA3143808A1 (en) Event promoting method, device, computer apparatus, and storage medium
CN113190426B (en) Stability monitoring method for big data scoring system
CN113947076A (en) Policy data detection method and device, computer equipment and storage medium
CN114626619A (en) Hive-based data prediction method and device, computer equipment and storage medium
CN111680645A (en) Garbage classification processing method and device
CN115936895A (en) Risk assessment method, device and equipment based on artificial intelligence and storage medium
CN114722025A (en) Data prediction method, device and equipment based on prediction model and storage medium
WO2022188994A1 (en) Computer-implemented methods referring to an industrial process for manufacturing a product and system for performing said methods
CN112528662A (en) Entity category identification method, device, equipment and storage medium based on meta-learning
CN111784069A (en) User preference prediction method, device, equipment and storage medium
CN111191692B (en) Data calculation method and device based on decision tree and computer equipment
Santos et al. Technological Coefficient to Improve Research Development and Innovation Factors in the World

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination