CN116861254A - Cold and hot data identification method, device, equipment and storage medium - Google Patents


Info

Publication number
CN116861254A
Authority
CN
China
Prior art keywords
data
cold
hot
user
sample
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310954946.0A
Other languages
Chinese (zh)
Inventor
周阳晶
庄校侨
洪日
伍世海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Application filed by China Merchants Bank Co Ltd
Priority to CN202310954946.0A
Publication of CN116861254A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data analysis and discloses a cold and hot data identification method, device, equipment and storage medium. The method comprises: establishing a cold and hot data identification model based on user data, and updating the model according to a daily update period; determining cold data and hot data based on the model; storing the hot data in a preset memory database and the cold data in a preset degradation database; when a user service request is received, determining the data cold and hot type of the target data corresponding to the request; determining a target database according to the data cold and hot type; and querying the target data in the target database and responding to the user service request according to the target data. In this way, cold and hot data are identified by a dedicated model, the training data is refreshed daily so that the data stays fresh, and the model is retrained automatically so that it maintains a high hit rate when identifying cold and hot data.

Description

Cold and hot data identification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying cold and hot data.
Background
With the continuous development of online services, the mobile phone APP has become the main online channel through which enterprises serve their customers. The services it provides rely on computing engines, and the data those engines need, such as product content and service information, must be stored somewhere. For response efficiency, NoSQL memory databases such as Redis and Memcached are often used, and their storage has to be expanded as the business grows; such high-performance databases are often expensive. How to improve the access efficiency and response time of the mobile phone APP while reducing storage cost and application management complexity has become an urgent problem for enterprises. Separating cold and hot data is an effective approach: the data is divided into a cold store and a hot store, where the cold store holds data that is not used frequently and the hot store holds data that is changed and used frequently.
Conventional cold and hot data separation schemes mainly identify cold and hot data with traditional statistical methods: based on the characteristics and regularities of the data, statistical analysis (using indexes such as the mean, median and variance) marks outliers or hotspot data within a statistical period as hot data; alternatively, expert rules drawing on expert knowledge and experience are used to infer and designate the hot data. However, the customer base is huge and customers change frequently, so identifying hot data with these traditional methods leads to problems such as a long data update period and a low hot data hit rate.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The invention mainly aims to provide a cold and hot data identification method, a device, equipment and a storage medium, and aims to solve the technical problems that a traditional hot data identification method in the prior art is long in data update period and low in hot data hit rate.
In order to achieve the above object, the present invention provides a cold and hot data identification method, the method comprising the steps of:
based on user data, establishing a cold and hot data identification model, wherein the cold and hot data identification model is updated according to a daily update period;
determining cold data and hot data based on the cold and hot data identification model;
storing the hot data in a preset memory database, and storing the cold data in a preset degradation database;
when a user service request is received, determining a data cold and hot type of target data corresponding to the user service request;
determining a target database according to the data cold and hot type;
and inquiring the target data in the target database, and responding to the user service request according to the target data.
Optionally, the user data includes user basic information, user behavior data and user exposure data, and the building of the cold and hot data identification model based on the user data includes:
acquiring initial user basic information and initial user behavior data, and preprocessing the initial user basic information and the initial user behavior data to obtain the user basic information and the user behavior data, wherein the preprocessing at least comprises noise reduction processing, null filling and type conversion;
determining sample data according to the user basic information, the user behavior data and the user exposure data;
feature screening is carried out on the user basic information and the user behavior data, and sample features are determined;
training a preset two-classification model based on the sample data and the sample characteristics to obtain the cold and hot data identification model.
Optionally, the determining sample data according to the user basic information, the user behavior data and the user exposure data includes:
determining a user label according to the user data, wherein the user label comprises a positive label and a negative label;
according to the positive label extraction proportion, carrying out stratified sampling on the user data with the positive label to obtain positive label sample data;
according to the negative label extraction proportion, carrying out stratified sampling on the user data with the negative label to obtain negative label sample data;
and respectively carrying out sample extraction on the positive label sample data and the negative label sample data according to a random seed to obtain the sample data.
Optionally, the feature screening of the user basic information and the user behavior data, determining a sample feature, includes:
determining initial sample characteristics in the user basic information and the user behavior data according to preset requirements;
determining the positive label data proportion and the negative label data proportion corresponding to the initial sample characteristics according to the user labels;
according to the positive label data proportion and the negative label data proportion, calculating evidence weight corresponding to the initial sample characteristics;
calculating the information value corresponding to the initial sample feature according to the evidence weight and the corresponding relation between the evidence weight and the information value;
and screening the initial sample characteristics according to an information value threshold value, and determining the sample characteristics, wherein the information value of the sample characteristics is larger than the information value threshold value.
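The screening steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the commonly used definitions WOE = ln(positive proportion / negative proportion) per bin and IV = sum over bins of (positive proportion - negative proportion) x WOE, features that are already binned, and a hypothetical information value threshold of 0.02. The function names, the `eps` smoothing constant and the threshold are all assumptions introduced here.

```python
import math
from collections import Counter


def information_value(feature_bins, labels, eps=1e-6):
    """Return (woe_by_bin, iv) for one pre-binned feature.

    feature_bins: bin identifier of each sample (binning itself is assumed done)
    labels: 1 for the positive label, 0 for the negative label
    eps: assumed floor that avoids log(0) when a bin lacks one of the labels
    """
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = sum(1 for y in labels if y == 0)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both positive and negative labels are required")

    pos_counts = Counter(b for b, y in zip(feature_bins, labels) if y == 1)
    neg_counts = Counter(b for b, y in zip(feature_bins, labels) if y == 0)

    woe, iv = {}, 0.0
    for b in set(feature_bins):
        # Share of all positive (negative) samples that fall into this bin.
        pos_rate = max(pos_counts[b] / pos_total, eps)
        neg_rate = max(neg_counts[b] / neg_total, eps)
        woe[b] = math.log(pos_rate / neg_rate)          # weight of evidence
        iv += (pos_rate - neg_rate) * woe[b]            # information value
    return woe, iv


def screen_features(features, labels, iv_threshold=0.02):
    """Keep the features whose information value exceeds the threshold."""
    kept = []
    for name, bins in features.items():
        _, iv = information_value(bins, labels)
        if iv > iv_threshold:
            kept.append(name)
    return kept
```

A feature whose bins contain the same share of positive and negative samples gets an information value of zero and is dropped, while a feature whose bins separate the two labels is kept.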
Optionally, training a preset two-classification model based on the sample data and the sample features to obtain the cold and hot data identification model, including:
dividing the sample data into training sample data and verification sample data according to a preset dividing proportion;
inputting the training sample data and the sample characteristics into the preset two-classification model for training to obtain an initial cold and hot data identification model;
inputting the verification sample data into the initial cold and hot data identification model to obtain prediction output;
according to the prediction output, the initial cold and hot data identification model is evaluated, and an evaluation result is obtained;
and when the evaluation result meets a preset condition, determining the initial cold and hot data identification model as the cold and hot data identification model.
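The split, train, evaluate and accept steps above can be sketched as follows. The source names neither the two-classification model nor the evaluation metric, so this sketch substitutes a deliberately simple single-feature threshold classifier and uses accuracy as the preset condition; `ThresholdClassifier`, the 0.8 dividing proportion and the 0.9 accuracy floor are hypothetical choices made only for illustration.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=42):
    """Shuffle reproducibly, then cut into training and verification sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


class ThresholdClassifier:
    """Stand-in binary classifier: predicts 1 when the score reaches a threshold."""

    def fit(self, rows):
        pos = [x for x, y in rows if y == 1]
        neg = [x for x, y in rows if y == 0]
        # Threshold halfway between the mean score of each class.
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, x):
        return 1 if x >= self.threshold else 0


def train_and_validate(samples, min_accuracy=0.9):
    """Return (model or None, accuracy); the model is accepted only if it passes."""
    train, valid = split_samples(samples)
    model = ThresholdClassifier().fit(train)
    correct = sum(model.predict(x) == y for x, y in valid)
    accuracy = correct / len(valid)
    return (model if accuracy >= min_accuracy else None), accuracy
```

Returning `None` when the evaluation fails mirrors the condition in the last step: only an initial model that meets the preset condition becomes the cold and hot data identification model.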
Optionally, the cold and hot data identification method includes:
acquiring updated user data according to the daily update period;
training the cold and hot data identification model based on the updated user data and sample characteristics so that the cold and hot data identification model is updated;
and returning to execute the step of determining cold data and hot data based on the cold and hot data identification model according to the updated cold and hot data identification model.
Optionally, the determining the target database according to the data cold and hot type includes:
when the data cold and hot type of the target data is the hot data type, determining the target database as the preset memory database;
and when the data cold and hot type of the target data is the cold data type, determining the target database as the preset degradation database.
In addition, in order to achieve the above object, the present invention also provides a cold and hot data identification device, including:
the model building module is used for building a cold and hot data identification model based on user data, wherein the cold and hot data identification model is updated according to a daily update period;
the data identification module is used for determining cold data and hot data based on the cold and hot data identification model;
the data separation module is used for storing the hot data in a preset memory database and storing the cold data in a preset degradation database;
the service response module is used for determining the data cold and hot type of the target data corresponding to the user service request when the user service request is received;
the service response module is also used for determining a target database according to the data cold and hot type;
the service response module is further used for inquiring the target data in the target database and responding to the user service request according to the target data.
In addition, to achieve the above object, the present invention also proposes a cold and hot data identification apparatus including: a memory, a processor, and a cold and hot data identification program stored on the memory and capable of running on the processor, wherein the cold and hot data identification program is configured to implement the steps of the cold and hot data identification method described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a cold and hot data identification program which, when executed by a processor, implements the steps of the cold and hot data identification method as described above.
In the invention, a cold and hot data identification model is established based on user data and is updated according to a daily update period; cold data and hot data are determined based on the model; the hot data is stored in a preset memory database and the cold data is stored in a preset degradation database; when a user service request is received, the data cold and hot type of the target data corresponding to the request is determined, a target database is determined according to that type, the target data is queried in the target database, and the user service request is responded to according to the target data. Traditional hot data identification methods suffer from a long data update period and a low hot data hit rate. By contrast, the invention establishes a cold and hot data identification model to identify cold and hot data and refreshes the training data daily, which shortens the data update period and keeps the data fresh; the model is retrained automatically, so it maintains a high hit rate when identifying cold and hot data. In addition, hot data is stored in a memory database, so requests are answered quickly, while cold data is stored in a distributed database, which minimizes the resource consumption of the memory database and reduces cost while guaranteeing availability.
Drawings
FIG. 1 is a schematic structural diagram of a cold and hot data identification device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of a cold and hot data identification method according to the present invention;
FIG. 3 is a schematic overall flow chart of an embodiment of a cold and hot data identification method according to the present invention;
FIG. 4 is a flow chart of a second embodiment of the cold and hot data identification method of the present invention;
FIG. 5 is a schematic diagram of a model creation process according to an embodiment of the cold and hot data identification method of the present invention;
FIG. 6 is a block diagram showing a first embodiment of the cold and hot data identification apparatus according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a cold and hot data identification device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the cold and hot data identification apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a client interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables connection and communication between these components. The client interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the client interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as a disk memory. Optionally, the memory 1005 may also be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not limit the cold and hot data identification device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 1, an operating system, a network communication module, a client interface module, and a cold and hot data identification program may be included in the memory 1005 as one type of storage medium.
In the cold and hot data identification apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the client interface 1003 is mainly used for data interaction with clients; the processor 1001 and the memory 1005 in the cold and hot data identification apparatus of the present invention may be provided in the cold and hot data identification apparatus, which calls the cold and hot data identification program stored in the memory 1005 through the processor 1001 and performs the cold and hot data identification method provided by the embodiment of the present invention.
An embodiment of the invention provides a cold and hot data identification method. Referring to fig. 2, fig. 2 is a flow chart of a first embodiment of the cold and hot data identification method of the invention.
In this embodiment, the cold and hot data identification method includes the following steps:
Step S10: based on the user data, a cold and hot data identification model is established, and the cold and hot data identification model is updated according to a daily update period.
It should be noted that the execution body of this embodiment is an intelligent terminal, for example a computer. A cold and hot data identification program is installed in the intelligent terminal, and cold and hot data are identified by that program.
It is understood that the user data is the obtainable data about the relevant users, including user basic information, user behavior data and user exposure data. The user basic information describes the basic situation of the user and comprises basic information of the user and basic information of the APP: the former includes age information, sex information and the like, and the latter includes equipment information and service information, that is, information about the terminal equipment used and the service types provided by the APP. User behavior data is the relevant behavior of the user. The user exposure data, that is, the user's usage and operations in the APP such as click records and view records, is obtained according to the actual application scenario, which this embodiment does not limit.
It should be understood that the cold and hot data identification model is the model established in this embodiment for identifying cold and hot data. The daily update period is the update period of the model: the model is updated once every such period. In this embodiment the daily update period is 1 day, that is, the model is updated every day.
In a specific implementation, the cold and hot data identification model is established by using the user basic information, the user behavior data and the user exposure data and used for the identification of the subsequent cold and hot data, and the cold and hot data identification model is updated every day, so that the freshness of the model is maintained.
Step S20: determining cold data and hot data based on the cold and hot data identification model.
The cold data is data that is not used frequently, and the hot data is data that is changed and used frequently. The data type corresponding to the cold data is a cold data type, and the data type corresponding to the hot data is a hot data type.
In a specific implementation, the cold and hot data identification model is used to identify the data to be classified (which can be determined according to the actual application scenario of the model), determining which data is of the cold data type and which is of the hot data type.
Step S30: storing the hot data in a preset memory database, and storing the cold data in a preset degradation database.
It will be appreciated that the preset memory database is a database set in advance for storing hot data; its type, for example Redis, can be selected flexibly according to actual requirements, and this embodiment does not limit this. The preset degradation database is a database set in advance for storing cold data; it is a distributed database, for example MongoDB or HBase, which can likewise be selected flexibly according to actual requirements. A distributed database generally costs less and performs slightly worse than a memory database, which is why it serves as the degradation database.
It should be appreciated that storing the hot data in the memory database ensures service response performance under highly concurrent requests. The cold data is stored in the degradation database, whose performance is slightly inferior to that of the memory database, but because cold data requests are few, the overall service degradation rate is not affected.
In a specific implementation, the hot data and the cold data are stored in an in-memory database and a degradation database, respectively.
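A minimal sketch of this separation step is given below. It assumes that the two databases can be stood in for by plain Python dictionaries and that the identification model is exposed as a hypothetical `is_hot` predicate; none of these names come from the original text, and a real deployment would write to stores such as Redis and MongoDB or HBase instead.

```python
def separate_hot_and_cold(records, is_hot, memory_db, degradation_db):
    """Write each record to the store that matches its cold/hot type.

    records: mapping of data key -> value to be classified
    is_hot: hypothetical predicate wrapping the identification model
    memory_db / degradation_db: dict stand-ins for the two preset databases
    """
    for key, value in records.items():
        target = memory_db if is_hot(key) else degradation_db
        target[key] = value
    return memory_db, degradation_db


# Example: the model is assumed to have marked users u1 and u3 as hot.
hot_users = {"u1", "u3"}
memory_db, degradation_db = separate_hot_and_cold(
    {"u1": "profile-1", "u2": "profile-2", "u3": "profile-3"},
    is_hot=lambda key: key in hot_users,
    memory_db={},
    degradation_db={},
)
```

After the call, the hot records sit only in `memory_db` and the cold record only in `degradation_db`, so the expensive store holds no cold data.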
Step S40: when a user service request is received, determining the data cold and hot type of target data corresponding to the user service request.
It should be noted that a user service request is a request by the user to access the service. The target data is the data corresponding to the request, that is, the data the user needs; its data cold and hot type is either the cold data type or the hot data type and has to be determined according to the actual situation. The user requests access to the service through a channel, which may be a mobile phone APP, a tablet computer APP, etc.; this embodiment does not limit this.
In particular implementations, it is desirable to determine whether the data currently requested by the user is a hot data type or a cold data type for subsequent service responses.
Step S50: determining a target database according to the data cold and hot type.
Further, the step S50 includes: when the data cold and hot type of the target data is the hot data type, determining the target database as the preset memory database; and when the data cold and hot type of the target data is the cold data type, determining the target database as the preset degradation database.
It can be understood that the target database refers to a database for querying target data, the data of the hot data type is stored in a database corresponding to the hot data, that is, the target data of the hot data type needs to be queried in a preset memory database, and the data of the cold data type is stored in a database corresponding to the cold data, that is, the target data of the cold data type needs to be queried in a preset degradation database.
In a specific implementation, for the target data of the hot data type, the target database is a preset memory database for storing hot data, and for the target data of the cold data type, the target database is a preset degradation database for storing cold data.
Step S60: querying the target data in the target database, and responding to the user service request according to the target data.
It should be understood that if the data requested by the user is of the hot data type, the related data is queried from the memory database corresponding to the hot data and the service response is made; if the data requested by the user is of the cold data type, the related data is queried from the degradation database corresponding to the cold data and the service response is made.
As shown in the overall flow chart of fig. 3, a user accesses a mobile phone service through a channel, a service request passes through a gateway and reaches an application service unit, the application service unit judges the cold and hot types of data, when the request data is identified as the hot data type, related data is queried from a memory database storing the hot data, and when the request data is identified as the cold data type, related data is queried from a database storing the cold data.
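The routing performed by the application service unit in steps S40 to S60 can be sketched as follows. The class name `ApplicationService`, the dictionary stand-ins for the two databases and the `hot_keys` set are hypothetical; the sketch also assumes that any key not identified as hot is treated as cold, which the source does not state.

```python
HOT, COLD = "hot", "cold"


class ApplicationService:
    """Routes each request to the database matching the data's cold/hot type."""

    def __init__(self, hot_keys, memory_db, degradation_db):
        self.hot_keys = hot_keys  # keys the model identified as hot (assumed input)
        self.databases = {HOT: memory_db, COLD: degradation_db}

    def data_type(self, key):
        # Step S40: determine the cold/hot type of the requested data.
        # Assumption: anything not known to be hot is treated as cold.
        return HOT if key in self.hot_keys else COLD

    def handle(self, key):
        # Step S50: pick the target database; step S60: query it and respond.
        kind = self.data_type(key)
        return kind, self.databases[kind].get(key)


service = ApplicationService(
    hot_keys={"u1"},
    memory_db={"u1": "profile-1"},
    degradation_db={"u2": "profile-2"},
)
```

Here `service.handle("u1")` is answered from the memory database and `service.handle("u2")` from the degradation database; a key present in neither store yields `None` as the payload.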
When cold data accounts for more than 50% of the data volume and requests for cold data account for less than 10% of the traffic, the service degradation rate of this embodiment does not increase.
In this embodiment, a cold and hot data identification model is established based on user data and is updated according to a daily update period; cold data and hot data are determined based on the model; the hot data is stored in a preset memory database and the cold data is stored in a preset degradation database; when a user service request is received, the data cold and hot type of the target data corresponding to the request is determined, a target database is determined according to that type, the target data is queried in the target database, and the user service request is responded to according to the target data. Traditional hot data identification methods suffer from a long data update period and a low hot data hit rate. By contrast, this embodiment establishes a cold and hot data identification model to identify cold and hot data and refreshes the training data daily, which shortens the data update period and keeps the data fresh; the model is retrained automatically, so it maintains a high hit rate when identifying cold and hot data. In addition, hot data is stored in a memory database, so requests are answered quickly, while cold data is stored in a distributed database, which minimizes the resource consumption of the memory database and reduces cost while guaranteeing availability.
Referring to fig. 4, fig. 4 is a flow chart of a second embodiment of a cold and hot data identification method according to the present invention.
Based on the above embodiment, the step S10 includes:
Step S101: acquiring initial user basic information and initial user behavior data, and preprocessing the initial user basic information and the initial user behavior data to obtain the user basic information and the user behavior data.
It should be noted that the initial user basic information and the initial user behavior data are the user basic information and user behavior data as directly obtained. They may contain considerable noise, many null values and similar problems, so they need to be preprocessed before later use. The preprocessing in this embodiment includes at least noise reduction, null filling and type conversion; other preprocessing methods are also possible, and this embodiment does not limit this.
In specific implementation, data preprocessing is carried out on initial user basic information and initial user behavior data, noise reduction processing is carried out on data with more noise, null filling is carried out on data with more null values, and type conversion is carried out on some character types, so that user basic information and user behavior data are obtained.
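The three preprocessing operations can be sketched as follows. The source does not say how noise reduction is performed, so this sketch assumes it means clipping values to a plausible range; the function name `preprocess`, the field names, the default values and the bounds are all hypothetical.

```python
def preprocess(rows, defaults, bounds):
    """Apply null filling, type conversion and noise reduction to raw rows.

    rows: list of dicts holding raw (possibly string or missing) values
    defaults: field -> value used when the raw value is null or empty
    bounds: field -> (low, high) range; out-of-range values are clipped
            (clipping is an assumed form of noise reduction)
    """
    cleaned = []
    for row in rows:
        out = {}
        for field, default in defaults.items():
            raw = row.get(field)
            # Null filling: replace missing or empty values with the default.
            if raw is None or raw == "":
                raw = default
            # Type conversion: character data such as "34" becomes a number.
            value = float(raw)
            # Noise reduction: clip implausible values into the allowed range.
            low, high = bounds[field]
            out[field] = min(max(value, low), high)
        cleaned.append(out)
    return cleaned
```

For instance, a row with an empty age and a click count of "9999" would come back with the default age and the click count clipped to the upper bound configured for that field.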
Step S102: determining sample data according to the user basic information, the user behavior data and the user exposure data.
Further, the step S102 includes: determining a user label according to the user data, wherein the user label comprises a positive label and a negative label; according to the positive label extraction proportion, carrying out stratified sampling on the user data with the positive label to obtain positive label sample data; according to the negative label extraction proportion, carrying out stratified sampling on the user data with the negative label to obtain negative label sample data; and respectively carrying out sample extraction on the positive label sample data and the negative label sample data according to a random seed to obtain the sample data.
It can be understood that the user data is used to assign a user label to each data record. There are two user labels, 1 and 0: data labeled "1" corresponds to a user with actual exposure, and data labeled "0" corresponds to a user without exposure. The label "1" is the positive label, and the label "0" is the negative label.
It should be understood that the data can be counted at different levels (layers) to obtain its distribution. For example, age data can be divided into four levels, 0-18, 18-35, 35-50 and over 50, and the data distribution of each of the four levels can be obtained. The positive label extraction proportion and the negative label extraction proportion are extraction proportions set in this embodiment for the positive label data and the negative label data respectively; their specific values can be determined according to actual requirements and are not limited by this embodiment. Extraction is layered, so that a corresponding quantity of data is extracted from each level. The positive label sample data and the negative label sample data are the data extracted from the different levels of positive-label data and of negative-label data, respectively. The random seed is set to 42 in this embodiment; it can be adjusted flexibly according to actual requirements, and this embodiment is not limited thereto. The sample data are the samples used to build the model.
In a specific implementation, the user data with positive labels is sampled layer by layer according to the positive label extraction proportion, the user data with negative labels is sampled layer by layer according to the negative label extraction proportion, and the random seed is then used to extract from the layered sampling results to obtain the sample data.
Sampling the data according to the positive and negative label extraction proportions allows the ratio of positive to negative samples to be adjusted, and the layered sampling ensures that the extracted data covers every level.
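The layered sampling of positive and negative data can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the function names, the half-open age boundaries, the rule of drawing at least one record per layer, and the example proportions (1.0 and 0.2) are all hypothetical; only the four age levels and the seed value 42 come from the text.

```python
import random
from collections import defaultdict


def age_layer(row):
    """Map an age to one of four layers (half-open boundaries are an assumption)."""
    age = row["age"]
    if age < 18:
        return "0-18"
    if age < 35:
        return "18-35"
    if age < 50:
        return "35-50"
    return "50+"


def stratified_sample(rows, layer_of, ratio, rng):
    """Draw about `ratio` of the rows from every layer so each layer is represented."""
    layers = defaultdict(list)
    for row in rows:
        layers[layer_of(row)].append(row)
    sample = []
    for key in sorted(layers):
        members = layers[key]
        k = min(len(members), max(1, round(len(members) * ratio)))
        sample.extend(rng.sample(members, k))
    return sample


def build_sample_data(rows, pos_ratio, neg_ratio, seed=42):
    """Sample positive and negative rows separately, then mix them reproducibly."""
    rng = random.Random(seed)
    positives = [r for r in rows if r["label"] == 1]
    negatives = [r for r in rows if r["label"] == 0]
    sample = stratified_sample(positives, age_layer, pos_ratio, rng)
    sample += stratified_sample(negatives, age_layer, neg_ratio, rng)
    rng.shuffle(sample)
    return sample


# Hypothetical users: one in four has label 1 (exposed), the rest label 0.
users = [{"age": 10 + i, "label": 1 if i % 4 == 0 else 0} for i in range(60)]
sample_data = build_sample_data(users, pos_ratio=1.0, neg_ratio=0.2)
```

Keeping all positives while drawing a fraction of the negatives is one way to rebalance a skewed label distribution; fixing the seed makes the draw repeatable between runs.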
Step S103: performing feature screening on the user basic information and the user behavior data to determine sample features.
Further, the step S103 includes: determining initial sample characteristics in the user basic information and the user behavior data according to preset requirements; determining the positive label data proportion and the negative label data proportion corresponding to the initial sample characteristics according to the user labels; according to the positive label data proportion and the negative label data proportion, calculating evidence weight corresponding to the initial sample characteristics; calculating the information value corresponding to the initial sample feature according to the evidence weight and the corresponding relation between the evidence weight and the information value; and screening the initial sample characteristics according to an information value threshold value, and determining the sample characteristics, wherein the information value of the sample characteristics is larger than the information value threshold value.
It will be appreciated that the preset requirement is the requirement for selecting features; features related to the scene to be finally identified are generally selected. The initial sample features are the features obtained by this preliminary screening and require further analysis. Each initial sample feature has a corresponding data distribution; for example, the age feature has a data distribution over the four levels 0-18, 18-35, 35-50 and over 50. The positive label data proportion and the negative label data proportion are the shares of positive label data and of negative label data, respectively.
It should be understood that the correspondence between the positive label data proportion, the negative label data proportion and the evidence weight (the evidence weight calculation expression) is as follows:
$$WOE_i = \ln\left(\frac{Py_i}{Pn_i}\right)$$
where $i$ denotes the level of the data, $Py_i$ denotes the positive label data proportion of the i-th level data, $Pn_i$ denotes the negative label data proportion of the i-th level data, and $WOE_i$ denotes the evidence weight of the i-th level data. Substituting the related data into the evidence weight calculation expression gives the evidence weight of each level corresponding to each initial sample feature. The correspondence between the evidence weight and the information value, namely the information value calculation expression, is as follows:
$$IV_i = (Py_i - Pn_i) \cdot WOE_i$$
where $i$ denotes the level of the data, $Py_i$ denotes the positive label data proportion of the i-th level data, $Pn_i$ denotes the negative label data proportion of the i-th level data, $WOE_i$ denotes the evidence weight of the i-th level data, and $IV_i$ denotes the information value of the i-th level data. Substituting the related data into the information value calculation expression gives the information value of each level corresponding to each initial sample feature, from which the information value corresponding to the initial sample feature is obtained.
The information value threshold is a preset threshold. The initial sample features are screened against it, and the initial sample features whose information value is larger than the threshold are retained as the sample features subsequently input to the model.
In a specific implementation, information values corresponding to the initial sample features are calculated, feature screening is conducted on the initial sample features according to the information values, and most effective features are selected from the initial sample features so as to reduce the dimension of the data set.
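The screening by evidence weight (WOE) and information value (IV) can be sketched as below. Several points are assumptions for illustration: WOE is taken in its commonly used form, the natural logarithm of Py_i over Pn_i; Py_i and Pn_i are read as the share of all positive (or negative) samples that fall in level i; a feature's information value is taken as the sum of its per-level values; and the small floor `eps`, the threshold 0.02, and the example counts are hypothetical.

```python
import math


def woe_iv(pos_counts, neg_counts, eps=1e-6):
    """Return the per-level WOE list and the feature's total IV.

    pos_counts[i] and neg_counts[i] are the numbers of positive and
    negative samples in level i of a single feature.
    """
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    woe, iv = [], 0.0
    for p, n in zip(pos_counts, neg_counts):
        py = max(p / pos_total, eps)   # Py_i, floored to avoid log(0)
        pn = max(n / neg_total, eps)   # Pn_i, floored to avoid division by zero
        w = math.log(py / pn)          # WOE_i = ln(Py_i / Pn_i) (assumed form)
        woe.append(w)
        iv += (py - pn) * w            # IV_i = (Py_i - Pn_i) * WOE_i, summed over levels
    return woe, iv


def screen_features(level_counts, iv_threshold):
    """Keep the features whose information value is larger than the threshold."""
    return [name for name, (pos, neg) in level_counts.items()
            if woe_iv(pos, neg)[1] > iv_threshold]


# Hypothetical per-level (positive, negative) counts for two candidate features.
counts = {
    "age": ([10, 40, 30, 20], [30, 30, 20, 20]),
    "uninformative": ([25, 25, 25, 25], [25, 25, 25, 25]),
}
selected = screen_features(counts, iv_threshold=0.02)
```

A feature whose positive and negative samples are distributed identically across levels has an information value of zero and is dropped, which is how the screening reduces the dimension of the data set.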
Step S104: training a preset two-classification model based on the sample data and the sample characteristics to obtain the cold and hot data identification model.
Further, the step S104 includes: dividing the sample data into training sample data and verification sample data according to a preset dividing proportion; inputting the training sample data and the sample characteristics into the preset classification model for training to obtain an initial cold and hot data identification model; inputting the verification sample data into the initial cold and hot data identification model to obtain prediction output; according to the prediction output, the initial cold and hot data identification model is evaluated, and an evaluation result is obtained; and when the evaluation result meets a preset condition, determining the initial cold and hot data identification model as the cold and hot data identification model.
It may be understood that the preset division ratio refers to a preset ratio of dividing sample data, and according to the preset division ratio, the sample data may be divided into a training set and a verification set, that is, training sample data and verification sample data, for example: the preset division ratio is 0.7, at this time, 70% of the sample data is used as the training set, and the remaining 30% of the sample data is used as the verification set, and other division ratios may be set, which is not limited in this embodiment.
It should be understood that the preset two-classification model is an XGBoost model. The XGBoost model is trained with the training sample data, and the model obtained after training is the initial cold and hot data identification model, which requires further evaluation. The prediction output is the model's identification result for the verification sample data; it is used to evaluate whether the initial cold and hot data identification model can be put into use, and the outcome of that evaluation is the evaluation result. The preset condition may be set according to actual requirements, for example that the accuracy reaches an accuracy threshold, which is not limited by this embodiment.
In a specific implementation, sample data are divided into training sample data and verification sample data, the training sample data are used for training a model, the verification sample data are used for evaluating the model obtained through training, and if the evaluation meets preset conditions, the model can be considered to be used as a cold and hot data recognition model for subsequent data recognition, so that recognition accuracy is guaranteed.
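The split, evaluate and accept sequence can be sketched as below. To keep the sketch self-contained, XGBoost training is replaced by a hypothetical one-feature threshold rule (`fit_stand_in`); it is only a placeholder for the classifier and not the embodiment's model. The 0.7 division ratio comes from the example above, while the seed, the accuracy threshold of 0.8 and the toy data are assumptions.

```python
import random


def split_samples(samples, train_ratio=0.7, seed=42):
    """Shuffle reproducibly and split into training and verification sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def fit_stand_in(train):
    """Hypothetical stand-in for XGBoost training: a one-feature threshold rule."""
    pos = [features[0] for features, label in train if label == 1]
    neg = [features[0] for features, label in train if label == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda features: 1 if features[0] > cut else 0


def accuracy(model, verification):
    """Share of verification samples whose predicted label matches the true label."""
    hits = sum(1 for features, label in verification if model(features) == label)
    return hits / len(verification)


def accept_model(model, verification, accuracy_threshold):
    """Preset condition (assumed here): accuracy reaches the accuracy threshold."""
    return accuracy(model, verification) >= accuracy_threshold


# Toy samples: a single feature value, labeled 1 when it is at least 50.
samples = [((v,), 1 if v >= 50 else 0) for v in range(100)]
train, verification = split_samples(samples, train_ratio=0.7)
initial_model = fit_stand_in(train)
accepted = accept_model(initial_model, verification, accuracy_threshold=0.8)
```

Only a model that passes the check on held-out verification data is promoted to the cold and hot data identification model; a model that fails would be retrained or tuned rather than deployed.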
Further, the cold and hot data identification model is updated as follows: acquiring updated user data according to the daily update period; training the cold and hot data identification model based on the updated user data and the sample features so that the cold and hot data identification model is updated; and returning to the step of determining cold data and hot data based on the cold and hot data identification model, using the updated cold and hot data identification model.
It should be noted that the updated user data is newly acquired user data, which generally differs from the user data acquired the previous time. The updated user data also needs to be preprocessed.
It can be understood that this embodiment obtains the latest user data every day and trains the existing cold and hot data identification model with the latest user data and the previously obtained sample features, so that the model is updated automatically and kept fresh.
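One update cycle can be sketched as a small function that chains the steps in the order described: fetch, preprocess, retrain, then identify again. Every callable passed in is a hypothetical stand-in (in particular, `retrain` here just recomputes a visit-count cut instead of retraining a classifier), and scheduling the function once per day is left to whatever job scheduler is in use.

```python
def daily_update(model, fetch, preprocess, retrain, identify):
    """Run one update cycle and return the updated model with the new hot and cold sets."""
    updated_data = preprocess(fetch())        # updated user data is preprocessed too
    model = retrain(model, updated_data)      # train the existing model on the new data
    hot, cold = identify(model, updated_data) # return to the identification step
    return model, hot, cold


# Hypothetical stand-ins so the cycle can be exercised end to end.
def fetch():
    return [{"user": "u1", "visits": "12"},
            {"user": "u2", "visits": None},
            {"user": "u3", "visits": "3"}]


def preprocess(rows):
    return [{"user": r["user"], "visits": int(r["visits"] or 0)} for r in rows]


def retrain(model, rows):
    return {"cut": sum(r["visits"] for r in rows) / len(rows)}


def identify(model, rows):
    hot = [r["user"] for r in rows if r["visits"] > model["cut"]]
    cold = [r["user"] for r in rows if r["visits"] <= model["cut"]]
    return hot, cold


model, hot, cold = daily_update({"cut": 0.0}, fetch, preprocess, retrain, identify)
```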
Fig. 5 is a flow chart of model building. First the data is processed; the data is then sampled to obtain samples; features are screened; the samples are split and features are generated; the cold and hot data identification model is trained and verified; and finally new data is processed and, together with the features, used to update the cold and hot data identification model.
In this embodiment, initial user basic information and initial user behavior data are obtained, the initial user basic information and the initial user behavior data are preprocessed to obtain user basic information and user behavior data, sample data are determined according to the user basic information, the user behavior data and user exposure data, feature screening is performed on the user basic information and the user behavior data, sample features are determined, and training is performed on a preset two-class model based on the sample data and the sample features to obtain a cold and hot data identification model. The embodiment designs a cold and hot data identification model based on an XGBoost two-classification model, replaces the traditional analysis method by using a training algorithm model, and simultaneously automatically updates the model daily by using user behavior data and user exposure data, so that the model is kept fresh continuously, the hit rate of the outputted hot data is kept high, and the problem of inaccurate judgment of manual expert experience is solved.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a cold and hot data identification program, and the cold and hot data identification program realizes the steps of the cold and hot data identification method when being executed by a processor.
Referring to fig. 6, fig. 6 is a block diagram illustrating a first embodiment of a cold and hot data identification apparatus according to the present invention.
As shown in fig. 6, the cold and hot data identification device according to the embodiment of the present invention includes:
the model building module 10 is configured to build a cold and hot data identification model based on user data, where the cold and hot data identification model is updated according to a daily update period.
The data identification module 20 is further configured to determine cold data and hot data based on the cold-hot data identification model.
The data separation module 30 is configured to store the hot data in a preset memory database and store the cold data in a preset degradation database.
The service response module 40 is configured to determine, when receiving a user service request, the data cold and hot type of target data corresponding to the user service request.
The service response module 40 is further configured to determine a target database according to the data cold and hot type.
The service response module 40 is further configured to query the target database for the target data, and respond to the user service request according to the target data.
In this embodiment, a cold and hot data identification model is established based on user data, and the cold and hot data identification model is updated according to a daily update period; cold data and hot data are determined based on the cold and hot data identification model; the hot data is stored in a preset memory database and the cold data is stored in a preset degradation database; when a user service request is received, the data cold and hot type of the target data corresponding to the user service request is determined, a target database is determined according to the data cold and hot type, the target data is queried in the target database, and the user service request is responded to according to the target data. Traditional hot data identification methods have a long data update period and a low hot data hit rate. By contrast, this embodiment builds a cold and hot data identification model to identify cold and hot data and updates the training data daily, which shortens the data update period, keeps the data fresh, and lets the model be trained automatically so that it maintains a high hit rate in cold and hot data identification. In addition, the hot data is stored in a memory database, so requests are answered quickly, while the cold data is stored in a distributed database, which minimizes the resource consumption of the memory database and reduces cost while ensuring availability.
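The separation and request-routing behavior of the modules can be sketched with two in-process dictionaries standing in for the preset memory database and the preset degradation database. The class name `ColdHotRouter`, its methods, and the example keys are hypothetical; a real deployment would replace the dictionaries with clients for the actual stores.

```python
class ColdHotRouter:
    """Stores hot data and cold data separately and routes requests accordingly."""

    def __init__(self):
        self.memory_db = {}        # stand-in for the preset memory database
        self.degradation_db = {}   # stand-in for the preset degradation database
        self.hot_keys = set()

    def separate(self, records, hot_keys):
        """Store hot records in the memory database and cold records in the degradation database."""
        self.hot_keys = set(hot_keys)
        self.memory_db.clear()
        self.degradation_db.clear()
        for key, value in records.items():
            target = self.memory_db if key in self.hot_keys else self.degradation_db
            target[key] = value

    def data_type(self, key):
        """Determine the data cold and hot type of the requested target data."""
        return "hot" if key in self.hot_keys else "cold"

    def respond(self, key):
        """Pick the target database from the type, query it, and return the result."""
        target = self.memory_db if self.data_type(key) == "hot" else self.degradation_db
        return target.get(key)


router = ColdHotRouter()
router.separate({"u1": {"offer": "A"}, "u2": {"offer": "B"}}, hot_keys={"u1"})
hot_answer = router.respond("u1")    # served from the memory database
cold_answer = router.respond("u2")   # served from the degradation database
```

Because only the keys identified as hot occupy the memory store, its footprint follows the size of the hot set rather than the size of the whole data set.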
In an embodiment, the user data includes user basic information, user behavior data and user exposure data, and the model building module 10 is further configured to obtain initial user basic information and initial user behavior data, and perform preprocessing on the initial user basic information and the initial user behavior data to obtain the user basic information and the user behavior data, where the preprocessing at least includes noise reduction processing, null filling and type conversion;
determining sample data according to the user basic information, the user behavior data and the user exposure data;
feature screening is carried out on the user basic information and the user behavior data, and sample features are determined;
training a preset two-classification model based on the sample data and the sample characteristics to obtain the cold and hot data identification model.
In an embodiment, the model building module 10 is further configured to determine a user tag according to the user data, where the user tag includes a positive tag and a negative tag;
according to the positive label extraction proportion, carrying out hierarchical sampling on the user data with the positive label to obtain positive label sample data;
According to the negative label extraction proportion, carrying out hierarchical sampling on the user data with the negative label to obtain negative label sample data;
and respectively carrying out sample extraction on the positive label sample data and the negative label sample data according to random seeds to obtain the sample data.
In an embodiment, the model building module 10 is further configured to determine initial sample characteristics in the user basic information and the user behavior data according to a preset requirement;
determining the positive label data proportion and the negative label data proportion corresponding to the initial sample characteristics according to the user labels;
according to the positive label data proportion and the negative label data proportion, calculating evidence weight corresponding to the initial sample characteristics;
calculating the information value corresponding to the initial sample feature according to the evidence weight and the corresponding relation between the evidence weight and the information value;
and screening the initial sample characteristics according to an information value threshold value, and determining the sample characteristics, wherein the information value of the sample characteristics is larger than the information value threshold value.
In an embodiment, the model building module 10 is further configured to divide the sample data into training sample data and verification sample data according to a preset division ratio;
Inputting the training sample data and the sample characteristics into the preset classification model for training to obtain an initial cold and hot data identification model;
inputting the verification sample data into the initial cold and hot data identification model to obtain prediction output;
according to the prediction output, the initial cold and hot data identification model is evaluated, and an evaluation result is obtained;
and when the evaluation result meets a preset condition, determining the initial cold and hot data identification model as the cold and hot data identification model.
In an embodiment, the model building module 10 is further configured to obtain updated user data according to the daily update period;
training the cold and hot data identification model based on the updated user data and sample characteristics so that the cold and hot data identification model is updated;
and returning to execute the step of determining cold data and hot data based on the cold and hot data identification model according to the updated cold and hot data identification model.
In an embodiment, the service response module 40 is further configured to determine that the target database is the preset memory database when the data cold and hot type of the target data is a hot data type;
And when the data cold and hot type of the target data is the cold data type, determining the target database as the preset degradation database.
It should be understood that the foregoing is illustrative only and is not limiting, and that in specific applications, those skilled in the art may set the invention as desired, and the invention is not limited thereto.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, technical details not described in detail in the present embodiment may refer to the cold and hot data identification method provided in any embodiment of the present invention, which is not described herein.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A cold and hot data identification method, characterized in that the cold and hot data identification method comprises:
based on user data, establishing a cold and hot data identification model, wherein the cold and hot data identification model is updated according to a daily update period;
determining cold data and hot data based on the cold and hot data identification model;
storing the hot data in a preset memory database, and storing the cold data in a preset degradation database;
when a user service request is received, determining a data cold and hot type of target data corresponding to the user service request;
determining a target database according to the data cold and hot type;
and inquiring the target data in the target database, and responding to the user service request according to the target data.
2. The method of claim 1, wherein the user data includes user base information, user behavior data, and user exposure data, and wherein the establishing a cold and hot data identification model based on user data comprises:
acquiring initial user basic information and initial user behavior data, and preprocessing the initial user basic information and the initial user behavior data to obtain the user basic information and the user behavior data, wherein the preprocessing at least comprises noise reduction processing, null filling and type conversion;
Determining sample data according to the user basic information, the user behavior data and the user exposure data;
feature screening is carried out on the user basic information and the user behavior data, and sample features are determined;
training a preset two-classification model based on the sample data and the sample characteristics to obtain the cold and hot data identification model.
3. The method of claim 2, wherein the determining sample data from the user base information, the user behavior data, and the user exposure data comprises:
determining a user label according to the user data, wherein the user label comprises a positive label and a negative label;
according to the positive label extraction proportion, carrying out hierarchical sampling on the user data with the positive label to obtain positive label sample data;
according to the negative label extraction proportion, carrying out hierarchical sampling on the user data with the negative label to obtain negative label sample data;
and respectively carrying out sample extraction on the positive label sample data and the negative label sample data according to random seeds to obtain the sample data.
4. The method of claim 3, wherein the feature screening the user base information and the user behavior data to determine sample features comprises:
Determining initial sample characteristics in the user basic information and the user behavior data according to preset requirements;
determining the positive label data proportion and the negative label data proportion corresponding to the initial sample characteristics according to the user labels;
according to the positive label data proportion and the negative label data proportion, calculating evidence weight corresponding to the initial sample characteristics;
calculating the information value corresponding to the initial sample feature according to the evidence weight and the corresponding relation between the evidence weight and the information value;
and screening the initial sample characteristics according to an information value threshold value, and determining the sample characteristics, wherein the information value of the sample characteristics is larger than the information value threshold value.
5. The method of claim 2, wherein training a preset two-classification model based on the sample data and the sample features to obtain the cold-hot data identification model comprises:
dividing the sample data into training sample data and verification sample data according to a preset dividing proportion;
inputting the training sample data and the sample characteristics into the preset classification model for training to obtain an initial cold and hot data identification model;
Inputting the verification sample data into the initial cold and hot data identification model to obtain prediction output;
according to the prediction output, the initial cold and hot data identification model is evaluated, and an evaluation result is obtained;
and when the evaluation result meets a preset condition, determining the initial cold and hot data identification model as the cold and hot data identification model.
6. The method of claim 2, wherein the cold and hot data identification method further comprises:
acquiring updated user data according to the daily update period;
training the cold and hot data identification model based on the updated user data and sample characteristics so that the cold and hot data identification model is updated;
and returning to execute the step of determining cold data and hot data based on the cold and hot data identification model according to the updated cold and hot data identification model.
7. The method of any one of claims 1 to 6, wherein said determining a target database based on said data cold and hot type comprises:
when the data cold and hot type of the target data is the hot data type, determining the target database as the preset memory database;
and when the data cold and hot type of the target data is the cold data type, determining the target database as the preset degradation database.
8. A cold and hot data identification device, characterized in that the cold and hot data identification device comprises:
the model building module is used for building a cold and hot data identification model based on user data, wherein the cold and hot data identification model is updated according to a daily update period;
the data identification module is further used for determining cold data and hot data based on the cold and hot data identification model;
the data separation module is used for storing the hot data in a preset memory database and storing the cold data in a preset degradation database;
the service response module is used for determining the data cold and hot type of the target data corresponding to the user service request when the user service request is received;
the service response module is also used for determining a target database according to the data cold and hot type;
the service response module is further used for inquiring the target data in the target database and responding to the user service request according to the target data.
9. A cold and hot data identification apparatus, the apparatus comprising: a memory, a processor, and a cold and hot data identification program stored on the memory and executable on the processor, the cold and hot data identification program configured to implement the steps of the cold and hot data identification method of any one of claims 1 to 7.
10. A storage medium having stored thereon a cold and hot data identification program which, when executed by a processor, implements the steps of the cold and hot data identification method according to any one of claims 1 to 7.
CN202310954946.0A 2023-07-31 2023-07-31 Cold and hot data identification method, device, equipment and storage medium Pending CN116861254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310954946.0A CN116861254A (en) 2023-07-31 2023-07-31 Cold and hot data identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116861254A true CN116861254A (en) 2023-10-10

Family

ID=88232270

Country Status (1)

Country Link
CN (1) CN116861254A (en)

Similar Documents

Publication Publication Date Title
CN110363387B (en) Portrait analysis method and device based on big data, computer equipment and storage medium
CN108446210B (en) System performance measurement method, storage medium and server
WO2020135535A1 (en) Recommendation model training method and related apparatus
CN110012060B (en) Information pushing method and device of mobile terminal, storage medium and server
CN111614690B (en) Abnormal behavior detection method and device
US8452733B2 (en) Data decay management
US8504558B2 (en) Framework to evaluate content display policies
US9578135B2 (en) Method of identifying remote users of websites
CN112862593B (en) Credit scoring card model training method, device and system and computer storage medium
CN113971527A (en) Data risk assessment method and device based on machine learning
CN116450982A (en) Big data analysis method and system based on cloud service push
CN117291428B (en) Enterprise management APP-based data background management system
CN116886619A (en) Load balancing method and device based on linear regression algorithm
CN109711656B (en) Multisystem association early warning method, device, equipment and computer readable storage medium
CN113434746B (en) User tag-based data processing method, terminal equipment and storage medium
CN109547931B (en) Server for determining location of mobile terminal
CN113938430B (en) Flow control method, device, equipment and storage medium
CN116992294B (en) Satellite measurement and control training evaluation method, device, equipment and storage medium
CN107943678B (en) Method for evaluating application access process and evaluation server
CN115982646B (en) Management method and system for multisource test data based on cloud platform
CN111382345B (en) Topic screening and publishing method, device and server
CN116861254A (en) Cold and hot data identification method, device, equipment and storage medium
CN111506813A (en) Remote sensing information accurate recommendation method based on user portrait
CN115168509A (en) Processing method and device of wind control data, storage medium and computer equipment
CN116595428B (en) User classification method and system based on CNN (CNN) log spectrum analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination