CN116821759A - Identification prediction method and device for category labels, processor and electronic equipment - Google Patents

Identification prediction method and device for category labels, processor and electronic equipment

Info

Publication number
CN116821759A
Authority
CN
China
Prior art keywords
behavior data
object behavior
target
model
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310771876.5A
Other languages
Chinese (zh)
Inventor
邓彬月
赵清华
邱惠婷
黄云泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC

Abstract

The application discloses a method and a device for identifying and predicting category labels, a processor, and electronic equipment. The method relates to the technical field of big data processing and comprises the following steps: acquiring an object behavior data set to be identified; when a first object behavior data subset having feature words is found in the object behavior data set, determining the category label of each item of object behavior data in the first object behavior data subset based on the feature words and a target dictionary; and predicting, by using a target model, a second object behavior data subset having no feature words in the object behavior data set to obtain the category label of each item of object behavior data in the second object behavior data subset, wherein the target model is obtained by training on the object behavior data in the first object behavior data subset and the category labels of the objects in the first object behavior data subset. The application addresses the low accuracy of category label acquisition in the related art.

Description

Identification prediction method and device for category labels, processor and electronic equipment
Technical Field
The application relates to the technical field of big data processing, in particular to a method and a device for identifying and predicting class labels, a processor and electronic equipment.
Background
At present, the economy of China enters a new stage of high-quality development, the fields and regions of various customer economic activities are widened continuously, the financial transaction relations among different customers are diversified, and the fund circulation characteristics are complicated.
In the prior art, classification rules are often established from customer transaction behavior data to identify target data or target objects, which are then marked with category labels for use by subsequent services. However, this approach is severely limited when key feature information in the data has missing values: the corresponding category labels cannot be obtained accurately, so the accuracy of category label acquisition is low.
Aiming at the problem of low accuracy of acquiring category labels in the related technology, no effective solution is proposed at present.
Disclosure of Invention
The application mainly aims to provide a method, a device, a processor and electronic equipment for identifying and predicting category labels, so as to solve the problem of low accuracy in acquiring the category labels in the related technology.
In order to achieve the above object, according to an aspect of the present application, there is provided an identification prediction method for category labels. The method comprises the following steps: acquiring an object behavior data set to be identified, wherein the object behavior data set comprises a plurality of items of object behavior data indicating historical behaviors of objects; when a first object behavior data subset having feature words is found in the object behavior data set, determining the category label of each item of object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary indicates mapping relations between a plurality of pairs of feature words and category labels, and the category labels indicate behavior intention information of the objects; and predicting, by using a target model, a second object behavior data subset having no feature words in the object behavior data set to obtain the category label of each item of object behavior data in the second object behavior data subset, wherein the target model is obtained by training on the object behavior data in the first object behavior data subset and the category labels of the objects in the first object behavior data subset.
As an alternative, the method further comprises: and under the condition that the target class label associated with the target object behavior data is acquired, pushing target information corresponding to the target class label to the target object, wherein the target information comprises business marketing information matched with behavior intention information of the target object.
As an optional solution, before predicting, by using the target model, the second subset of object behavior data without feature words in the set of object behavior data, the method further includes: obtaining a preprocessed first training sample set based on the identified class labels and the model entering feature labels associated with the first object behavior data subset, wherein the model entering feature labels are used for indicating static features of objects associated with each object behavior data in the first object behavior data subset; according to a preset sampling mode, the number of training samples of each class of labels in the first training sample set is adjusted to obtain a second training sample set after sampling; training the initialization model based on the second training sample set, and determining the initialization model after training as a target model.
As an optional solution, adjusting the number of training samples of each category label in the first training sample set according to the preset sampling manner to obtain the sampled second training sample set includes: obtaining a weight parameter corresponding to each category label in the first training sample set, wherein the first weight parameter corresponding to a first category label is the number of samples in the first training sample set divided by ten times the number of samples in a first training sample subset, the first training sample subset is the set of training samples in the first training sample set that belong to the first category label, and the category labels include the first category label; obtaining the number of new training samples corresponding to each category label based on the weight parameter corresponding to that category label, wherein the number of new training samples corresponding to the first category label is the number of samples in the first training sample subset multiplied by a preset multiple, and the preset multiple is the first weight parameter plus one; and adjusting the first training sample set based on the adjusted number of new training samples for each category label to obtain the second training sample set.
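The sampling rule above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it reads the weight as N / (10 × n), where N is the size of the first training sample set and n is the size of the subset for one category label, so the new count is n × (weight + 1). The function names, the rounding step, and the choice to create the extra samples by random duplication are all assumptions; the text does not say how the additional samples are produced.

```python
import random
from collections import Counter


def resampled_counts(labels, factor=10):
    """Return the target sample count per label: n * (N / (factor * n) + 1)."""
    total = len(labels)
    counts = Counter(labels)
    new_counts = {}
    for label, n in counts.items():
        weight = total / (factor * n)               # weight parameter for this label
        new_counts[label] = round(n * (weight + 1))  # preset multiple = weight + 1
    return new_counts


def oversample(samples, labels, seed=0):
    """Grow each label group to its target count by random duplication (assumed)."""
    rng = random.Random(seed)
    targets = resampled_counts(labels)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    second_set = []
    for label, group in by_label.items():
        extra = targets[label] - len(group)
        grown = group + [rng.choice(group) for _ in range(extra)]
        second_set.extend((sample, label) for sample in grown)
    return second_set


labels = ["car loan"] * 90 + ["rent"] * 10
print(resampled_counts(labels))
```

Under this reading every label gains roughly N / 10 samples, so a label with few samples grows proportionally more than a label with many, which narrows the class imbalance without equalising it.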
As an optional solution, training the initialization model based on the second training sample set and determining the trained initialization model as the target model includes: splitting the second training sample set into a training set and a test set according to a preset ratio; training a plurality of initialization models separately with the training set to obtain a trained model for each of the plurality of initialization models, wherein the plurality of initialization models comprise a random forest model, a light gradient boosting machine (LightGBM) model, a decision tree model, and an extreme gradient boosting (XGBoost) model; and scoring the plurality of trained models with the test set and determining the trained model with the highest score as the target model.
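The split, train, score, and select procedure can be sketched in plain Python. To keep the example dependency-free, two toy classifiers stand in for the four tree-based models named above; they are placeholders, not those models. The 80/20 split ratio, the use of accuracy as the score, and all names are assumptions made for illustration.

```python
import random
from collections import Counter


class MajorityClassifier:
    """Toy stand-in: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighbourClassifier:
    """Toy stand-in: predicts the label of the closest training point."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            self.y_[min(range(len(self.X_)), key=lambda i: dist(x, self.X_[i]))]
            for x in X
        ]


def accuracy(model, X, y):
    predictions = model.predict(X)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def select_target_model(candidates, X, y, test_ratio=0.2, seed=0):
    """Split the data, train every candidate, and return the best scorer."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_ratio))
    train, test = indices[:cut], indices[cut:]
    X_train, y_train = [X[i] for i in train], [y[i] for i in train]
    X_test, y_test = [X[i] for i in test], [y[i] for i in test]
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy(model, X_test, y_test)
    best = max(scores, key=scores.get)
    return best, candidates[best], scores


X = [(i, 0) for i in range(10)] + [(i + 100, 0) for i in range(10)]
y = ["rent"] * 10 + ["wage"] * 10
candidates = {
    "majority": MajorityClassifier(),
    "nearest_neighbour": NearestNeighbourClassifier(),
}
best_name, target_model, scores = select_target_model(candidates, X, y)
print(best_name, scores)
```

In practice each candidate would be replaced by a real model object exposing the same `fit` and `predict` methods, and the score could be any metric suited to imbalanced labels.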
As an optional solution, the acquiring the object behavior data set to be identified includes: acquiring all object behavior data of the same object in a target time period; and screening out first object behavior data for indicating that the same object periodically executes the target event in the target time period from all the object behavior data, wherein the object behavior data set comprises the first object behavior data.
As an alternative, before determining the category label of each object behavior data in the first subset of object behavior data based on the feature word and the target dictionary, the method further includes: searching first type information of each object behavior data in the object behavior data set, and performing word segmentation processing on the first type information to obtain feature words, wherein the first type information is information input by the object when the object executes a target event; under the condition that the characteristic words cannot be obtained based on the first type information, searching second type information of each object behavior data in the object behavior data set, and performing word segmentation processing on the second type information to obtain the characteristic words, wherein the second type information is information automatically generated by the object after executing the target event; in the case that the feature words of the second object behavior data cannot be obtained based on the second type information of the second object behavior data, it is determined that the second object behavior data does not have the feature words, and it is determined that the second object behavior data belongs to the second object behavior data subset.
In order to achieve the above object, according to another aspect of the present application, there is provided an identification prediction apparatus for category labels. The apparatus comprises: an acquisition unit configured to acquire an object behavior data set to be identified, wherein the object behavior data set includes a plurality of items of object behavior data indicating historical behaviors of objects; a first recognition unit configured to, when a first object behavior data subset having feature words is found in the object behavior data set, determine the category label of each item of object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary indicates mapping relations between a plurality of pairs of feature words and category labels, and the category labels indicate behavior intention information of the objects; and a second recognition unit configured to predict, by using a target model, a second object behavior data subset having no feature words in the object behavior data set to obtain the category label of each item of object behavior data in the second object behavior data subset, wherein the target model is obtained by training on the object behavior data in the first object behavior data subset and the category labels of the objects in the first object behavior data subset.
As an alternative, the apparatus further includes: and the pushing module is used for pushing the target information corresponding to the target category label to the target object under the condition that the target category label associated with the target object behavior data is acquired, wherein the target information comprises business marketing information matched with the behavior intention information of the target object.
As an alternative, the apparatus further includes: the determining module is used for obtaining a first preprocessed training sample set based on the identified class labels and the model entering feature labels associated with the first object behavior data subset before predicting the second object behavior data subset without the feature words in the object behavior data subset by utilizing the target model, wherein the model entering feature labels are used for indicating static features of objects associated with each object behavior data in the first object behavior data subset; the adjusting module is used for adjusting the number of training samples of each class label in the first training sample set according to a preset sampling mode before predicting a second object behavior data subset without feature words in the object behavior data set by utilizing the target model to obtain a sampled second training sample set; the training module is used for training the initialization model based on the second training sample set before predicting the second object behavior data subset without the feature words in the object behavior data set by using the target model, and determining the initialization model after training as the target model.
As an alternative, the adjusting module includes: the acquiring sub-module is used for acquiring weight parameters corresponding to each class label in the first training sample set, wherein the first weight parameters corresponding to the first class labels are the number of samples in the first training sample set divided by ten times the number of samples in the first training sample sub-set, the first training sample sub-set is a set of training samples belonging to the first class labels in the first training sample set, and each class label comprises the first class labels; the first determining submodule is used for obtaining the number of new training samples corresponding to each category label based on the weight parameter corresponding to each category label, wherein the number of the new training samples corresponding to the first category label is the number of samples of the first training sample subset multiplied by a preset multiple, and the preset multiple is the first weight parameter added by one; and the adjustment sub-module is used for adjusting the first training sample set based on the number of the new training samples adjusted by the category labels to obtain a second training sample set.
As an alternative, the training module includes: a splitting sub-module configured to split the second training sample set into a training set and a test set according to a preset ratio; a training sub-module configured to train a plurality of initialization models separately with the training set to obtain a trained model for each of the plurality of initialization models, wherein the plurality of initialization models comprise a random forest model, a light gradient boosting machine (LightGBM) model, a decision tree model, and an extreme gradient boosting (XGBoost) model; and a second determining sub-module configured to score the plurality of trained models with the test set and determine the trained model with the highest score as the target model.
As an alternative, the acquiring unit includes: the acquisition module is used for acquiring all object behavior data of the same object in a target time period; and the screening module is used for screening first object behavior data for indicating the same object to periodically execute the target event in the target time period from all the object behavior data, wherein the object behavior data set comprises the first object behavior data.
As an alternative, the apparatus further includes: the first searching module is used for searching first type information of each object behavior data in the object behavior data set before determining class labels of each object behavior data in the first object behavior data subset based on the feature words and the target dictionary, and performing word segmentation processing on the first type information to obtain the feature words, wherein the first type information is information input by the object when executing the target event; the second searching module is used for searching second type information of each object behavior data in the object behavior data set under the condition that the characteristic word cannot be obtained based on the first type information before determining the class label of each object behavior data in the first object behavior data subset based on the characteristic word and the target dictionary, and performing word segmentation processing on the second type information to obtain the characteristic word, wherein the second type information is information automatically generated after the object executes the target event; and a third determining module, configured to determine, before determining class labels of respective object behavior data in the first subset of object behavior data based on the feature words and the target dictionary, that the second object behavior data does not have the feature words if the feature words of the second object behavior data cannot be obtained based on the second type information of the second object behavior data, and determine that the second object behavior data belongs to the second subset of object behavior data.
According to the application, the following steps are adopted: acquiring an object behavior data set to be identified, wherein the object behavior data set comprises a plurality of items of object behavior data indicating historical behaviors of objects; when a first object behavior data subset having feature words is found in the object behavior data set, determining the category label of each item of object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary indicates mapping relations between a plurality of pairs of feature words and category labels, and the category labels indicate behavior intention information of the objects; and predicting, by using a target model, a second object behavior data subset having no feature words in the object behavior data set to obtain the category label of each item of object behavior data in the second object behavior data subset, wherein the target model is obtained by training on the object behavior data in the first object behavior data subset and the category labels of the objects in the first object behavior data subset.
The target dictionary and the target model thus jointly complete the identification and labeling of category labels for the object behavior data set. The target dictionary labels the first object behavior data subset, which has feature words, according to preset rules (the mapping relations), and the labeling results serve as sample data for training the target model. The trained target model then predicts labels for the second object behavior data subset, which has no feature words and therefore cannot be processed by the target dictionary. This avoids the limitation that missing key feature information imposes on identification and labeling, widens the range of data that can be labeled, achieves the technical effect of effectively improving the accuracy of category label acquisition, and solves the technical problem of low category label acquisition accuracy in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method for identifying and predicting category labels provided according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for predicting identification of category labels according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a method for predicting identification of category labels according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a device for identifying and predicting category labels according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an electronic device for identifying and predicting category labels according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, related information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by a user or sufficiently authorized by each party. For example, an interface is provided between the system and the relevant user or institution, before acquiring the relevant information, the system needs to send an acquisition request to the user or institution through the interface, and acquire the relevant information after receiving the consent information fed back by the user or institution.
The present application is described below with reference to preferred implementation steps. FIG. 1 is a flowchart of a method for identifying and predicting category labels according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step S101, acquiring an object behavior data set to be identified, wherein the object behavior data set comprises a plurality of items of object behavior data indicating historical behaviors of objects;
Step S102, when a first object behavior data subset having feature words is found in the object behavior data set, determining the category label of each item of object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary indicates mapping relations between a plurality of pairs of feature words and category labels, and the category labels indicate behavior intention information of the objects;
Step S103, predicting, by using a target model, a second object behavior data subset having no feature words in the object behavior data set to obtain the category label of each item of object behavior data in the second object behavior data subset, wherein the target model is obtained by training on the object behavior data in the first object behavior data subset and the category labels of the objects in the first object behavior data subset.
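Steps S101 to S103 can be sketched end to end as follows. This is a simplified illustration under stated assumptions: each record is a dict with hypothetical keys `id`, `words` (already segmented tokens), and `amount`; a record counts as "having a feature word" when one of its tokens appears in the target dictionary; and a nearest-amount rule is a stand-in for the trained target model.

```python
def split_by_feature_words(records, target_dictionary):
    """S102: label records with a dictionary hit; collect the rest for S103."""
    first_subset, second_subset = [], []
    for record in records:
        hits = [w for w in record["words"] if w in target_dictionary]
        if hits:
            first_subset.append((record, target_dictionary[hits[0]]))
        else:
            second_subset.append(record)
    return first_subset, second_subset


def train_nearest_amount(records, labels):
    """Toy stand-in for the target model: copy the label of the closest amount."""
    def predict(record):
        nearest = min(
            range(len(records)),
            key=lambda i: abs(records[i]["amount"] - record["amount"]),
        )
        return labels[nearest]

    return predict


def identify_and_predict(records, target_dictionary, train_model):
    """S101-S103: dictionary labels first, model predictions for the remainder."""
    first, second = split_by_feature_words(records, target_dictionary)
    result = {record["id"]: label for record, label in first}
    if first and second:
        predict = train_model([r for r, _ in first], [l for _, l in first])
        for record in second:
            result[record["id"]] = predict(record)
    return result


records = [
    {"id": 1, "words": ["car", "loan"], "amount": 3000},
    {"id": 2, "words": ["rent"], "amount": 1500},
    {"id": 3, "words": [], "amount": 1400},
    {"id": 4, "words": [], "amount": 3100},
]
dictionary = {"car": "car loan", "rent": "rent"}
print(identify_and_predict(records, dictionary, train_nearest_amount))
```

The key point is the data flow: dictionary-labeled records become the training samples, and only records without a dictionary hit are passed to the model.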
Optionally, in this embodiment, the identification prediction method for category labels may be applied to, but is not limited to, intention identification and prediction scenarios based on regular fund-outflow transactions in the financial field, and may involve, but is not limited to, business content such as identification of bank customer transaction behavior, fund-flow application scenarios, precision marketing services for key customer groups, subset accepting work, and the like. In such scenarios and business content, classification rules may be established from customer transaction behavior data to identify target data or target objects, which are then marked with category labels for subsequent services. However, this approach is severely limited when key feature information in the data has missing values and cannot accurately obtain the corresponding category labels, so the accuracy of category label acquisition is low.
To address these problems, the identification prediction method for category labels uses the target dictionary and the target model together to complete the identification and labeling of category labels for the object behavior data set. The target dictionary labels the first object behavior data subset, which has feature words, according to preset rules (the mapping relations), and the labeling results serve as sample data for training the target model. The trained target model then predicts labels for the second object behavior data subset, which has no feature words and therefore cannot be processed by the target dictionary. This avoids the limitation that missing key feature information imposes on identification and labeling, widens the range of data that can be labeled, achieves the technical effect of effectively improving the accuracy of category label acquisition, and solves the problem of low category label acquisition accuracy.
Alternatively, in the present embodiment, the object behavior data set to be identified may be, but is not limited to, object historical behavior data including a plurality of objects, wherein the object historical behavior data of the same object may be, but is not limited to, historical behavior information for indicating that the object periodically performs a target event within a target period of time.
Further, taking the intention identification and prediction scenario based on regular fund-outflow transactions in the financial field as an example, the object behavior data set to be identified may be, but is not limited to, transaction detail data for transfers of more than ten thousand yuan made within three months by the same customer of one bank to the same account at another bank, which describe the transaction behavior characteristics of the customer.
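A screening step of this kind might look like the sketch below. The record keys (`customer`, `payee_account`, `amount`, `date`) are hypothetical, and "periodically" is interpreted here as at least one qualifying transfer in each of three distinct calendar months inside the window; the text does not define the periodicity test, so that rule is an assumption.

```python
from collections import defaultdict
from datetime import date


def screen_regular_transfers(transactions, start, end, min_amount=10_000, min_months=3):
    """Keep transfers above min_amount that recur between one customer and one payee."""
    groups = defaultdict(list)
    for tx in transactions:
        if start <= tx["date"] <= end and tx["amount"] > min_amount:
            groups[(tx["customer"], tx["payee_account"])].append(tx)
    selected = []
    for txs in groups.values():
        # Assumed periodicity test: activity in at least min_months distinct months.
        months = {(t["date"].year, t["date"].month) for t in txs}
        if len(months) >= min_months:
            selected.extend(sorted(txs, key=lambda t: t["date"]))
    return selected


transactions = [
    {"customer": "A", "payee_account": "X", "amount": 12_000, "date": date(2023, 1, 5)},
    {"customer": "A", "payee_account": "X", "amount": 12_000, "date": date(2023, 2, 5)},
    {"customer": "A", "payee_account": "X", "amount": 12_000, "date": date(2023, 3, 5)},
    {"customer": "A", "payee_account": "Y", "amount": 20_000, "date": date(2023, 1, 10)},
    {"customer": "B", "payee_account": "X", "amount": 5_000, "date": date(2023, 1, 5)},
]
regular = screen_regular_transfers(transactions, date(2023, 1, 1), date(2023, 3, 31))
print(len(regular))
```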
Optionally, in this embodiment, once the object behavior data to be identified is acquired, it may further be checked whether the object behavior data has associated feature words. This may include, but is not limited to, acquiring first type information of the object behavior data, where the first type information may be, but is not limited to, information entered by the object when executing the target event, for example the remark text entered by the user when performing the event to which the object behavior data corresponds. When the remark text is found and word segmentation successfully yields a feature word (i.e., a keyword), the category label matching the feature word is looked up in the target dictionary.
Optionally, in this embodiment, when the first type information is empty or yields no feature word, second type information may further be searched, where the second type information may be, but is not limited to, information generated automatically after the object executes the target event. Taking a fund transaction as the target event, this may include, but is not limited to, performing word segmentation on information such as the service identifier and the Internet service code of the fund transaction to look for feature words and, when a feature word is obtained, looking up the category label matching the feature word in the target dictionary.
As a further illustration, taking the intention identification and prediction scenario based on regular fund-outflow transactions in the financial field as an example, the category label may be used, but is not limited, to indicate the purpose of the object's transaction data. For example, the category labels may include, but are not limited to, "car loan", "house loan", "consume loan", "loan unknown", "wage", "financing", "consumer consumption", "financing lease", "credit card", "rent" and the like.
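The two-level lookup described above can be sketched as follows. The field names `remark` (first type information) and `system_info` (second type information) are hypothetical, and the regular-expression tokenizer is only a placeholder: real Chinese transaction text would need a proper word segmentation tool.

```python
import re


def segment(text):
    """Placeholder tokenizer: lowercase and split on whitespace and punctuation."""
    return [token for token in re.split(r"[\s,;:/|_-]+", text.lower()) if token]


def find_feature_words(record, target_dictionary):
    """Try the user-entered text first, then the automatically generated text."""
    for field in ("remark", "system_info"):
        text = record.get(field) or ""
        words = [w for w in segment(text) if w in target_dictionary]
        if words:
            return words
    return []  # no feature word: the record belongs to the second subset


def assign_subsets(records, target_dictionary):
    """Split records into (first subset with labels, second subset without)."""
    first_subset, second_subset = [], []
    for record in records:
        words = find_feature_words(record, target_dictionary)
        if words:
            first_subset.append((record, target_dictionary[words[0]]))
        else:
            second_subset.append(record)
    return first_subset, second_subset


dictionary = {"rent": "rent", "mortgage": "house loan"}
records = [
    {"remark": "March rent"},
    {"remark": "", "system_info": "SVC_MORTGAGE-REPAY"},
    {"remark": None, "system_info": "TRANSFER 001"},
]
first, second = assign_subsets(records, dictionary)
print([label for _, label in first], len(second))
```

Checking the user-entered field before the system-generated one mirrors the order given in the text; a record reaches the second subset only after both sources fail.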
It should be noted that, the object behavior data set includes object behavior data with feature words and object behavior data without feature words, which may include, but is not limited to, attributing the object behavior data with feature words to the first object behavior data subset, and performing identification prediction labeling of category labels on the object behavior data in the first object behavior data subset through the feature words and the target dictionary; and attributing the object behavior data without feature words to a second object behavior data subset, predicting the second object behavior data subset through a target model obtained through training, obtaining class labels of the object behavior data, and marking.
It should be noted that, the target model may be, but is not limited to, a target model obtained by training based on a class label recognition labeling result of the object behavior data in the first object behavior data subset, where the labeling result is used as sample data to train the initialization model.
According to the embodiment of the application, the object behavior data set to be identified is obtained, wherein the object behavior data set comprises a plurality of object behavior data used for indicating the historical behaviors of the object; under the condition that a first object behavior data subset with feature words in an object behavior data set is found, determining category labels of all object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary is used for indicating mapping relations between pairs of feature words and the category labels, and the category labels are used for indicating behavior intention information of objects; predicting a second object behavior data subset without feature words in the object behavior data set by using a target model to obtain class labels of all object behavior data in the second object behavior data subset, wherein the target model is obtained based on all object behavior data in the first object behavior data subset and class labels of all objects in the first object behavior data subset. 
The method comprises the steps of completing identification labeling of category labels of an object behavior data set by using a target dictionary and a target model, wherein the target dictionary is used for labeling category labels of a first object behavior data subset with characteristic words based on a preset rule (mapping relation), and training the target model by taking labeling results as sample data, so that prediction labeling is further carried out on a second object behavior data subset which cannot be processed by the target dictionary and does not have the characteristic words by using the trained target model, the problem of limitation of identification labeling caused by the lack of key characteristic information of data is avoided, the identification labeling range of the category labels is enlarged, and the technical effect of effectively improving the acquisition accuracy of the category labels is achieved.
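As a non-limiting illustration, the two-stage labeling described above can be sketched as follows. The dictionary contents, the record fields `id` and `feature_words`, and the majority-label stand-in for the target model are all assumptions made for the example; a record is treated here as "having a feature word" only when one of its words appears in the dictionary.

```python
from collections import Counter

# Hypothetical target dictionary: feature word -> category label.
TARGET_DICTIONARY = {"car loan": "vehicle loan", "salary": "payroll"}


def label_with_dictionary(feature_words, dictionary):
    """Return the label of the first feature word found in the dictionary, else None."""
    for word in feature_words:
        if word in dictionary:
            return dictionary[word]
    return None


def train_stand_in_model(labeled_records):
    """Placeholder for the target model: always predicts the most common label."""
    if not labeled_records:
        raise ValueError("no dictionary-labeled samples to train on")
    majority = Counter(label for _, label in labeled_records).most_common(1)[0][0]
    return lambda record: majority


def label_data_set(records, dictionary):
    """Label the first subset by dictionary, then predict the second subset."""
    first_subset, second_subset = [], []
    for record in records:
        label = label_with_dictionary(record["feature_words"], dictionary)
        if label is None:
            second_subset.append(record)           # no usable feature word
        else:
            first_subset.append((record, label))   # rule-based label
    predict = train_stand_in_model(first_subset)   # trained on rule-based labels
    labels = {record["id"]: label for record, label in first_subset}
    labels.update({record["id"]: predict(record) for record in second_subset})
    return labels


records = [
    {"id": 1, "feature_words": ["salary"]},
    {"id": 2, "feature_words": ["salary", "bonus"]},
    {"id": 3, "feature_words": ["car loan"]},
    {"id": 4, "feature_words": []},
]
result = label_data_set(records, TARGET_DICTIONARY)
```

In practice the stand-in would be replaced by a trained classifier; the sketch only shows how the dictionary-labeled subset feeds the model that labels the remaining data.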
As an alternative, the method further comprises:
S1, under the condition that a target class label associated with target object behavior data is acquired, pushing target information corresponding to the target class label to a target object, wherein the target information comprises business marketing information matched with behavior intention information of the target object.
Optionally, in this embodiment, in the case where the target category label associated with the target object behavior data is acquired, the target information corresponding to the target category label may be pushed to the target object. For example, vehicle-loan-related business marketing information corresponding to the vehicle loan category label may be recommended to vehicle loan objects, credit-card-related business marketing information corresponding to the credit card category label may be recommended to credit card objects, and so on.
It should be noted that, taking as an example the scenario of intention recognition and prediction based on regular fund-outflow transactions in the financial field, pushing the business marketing information associated with each category label to the objects carrying that label makes it possible to predict the transaction intention of client objects with different fund purposes and to carry out personalized marketing activities in a targeted manner, which improves the accuracy and efficiency of business expansion.
It should be noted that, taking the same scenario as an example, the invention takes the regular fund-outflow transactions of individual clients as a breakthrough point and proposes a method for recognizing and predicting transaction intention. Based on the transaction behavior characteristics of users over a period of time (for example, regularly transferring funds every month to the same counterparty account outside the bank), initial intention labels (i.e., category labels) are assigned through a target dictionary that fuses in-bank labels and expert recognition rules. On the basis of the identified intention labels, a secondary prediction is performed for the data whose category labels were not identified: machine learning models are constructed, compared and evaluated along multiple dimensions, the result with the highest accuracy is selected, and the actual intention of the user transaction is predicted. According to the transaction intention recognition and prediction results, individually personalized marketing activities can be carried out, users with different fund purposes can be reached in a targeted manner, target users with marketing value can be screened, and the in-bank fund retention rate can be increased. In addition, this embodiment can also refine the target dictionary and iterate the model by means of feedback on the transaction intention recognition and prediction results, thereby providing strong support for subsequent marketing interaction.
As an alternative, before predicting, using the target model, the second subset of object behavior data having no feature words in the set of object behavior data, the method further comprises:
S1, obtaining a preprocessed first training sample set based on an identified class label and a model entering feature label associated with a first object behavior data subset, wherein the model entering feature label is used for indicating static features of objects associated with each object behavior data in the first object behavior data subset;
S2, adjusting the number of training samples of each class of labels in the first training sample set according to a preset sampling mode to obtain a second training sample set after sampling;
and S3, training the initialization model based on the second training sample set, and determining the initialization model after training as a target model.
Optionally, in this embodiment, the identified category label may be, but is not limited to, a category label obtained by matching and identification through the target dictionary, for example on the basis of transaction behavior features, including: transaction annotation information, service identification code information, service major class information, clearing path information, channel type information, transaction amount information, same-name account transfer identification, and service channel type information. The model entering feature label may be, but is not limited to, a label indicating static features of an object, such as basic information features of the object, including: gender information, education level information, income information, marital status information, client level information, age information, etc.
Optionally, in this embodiment, the preset sampling manner may be, but is not limited to being, used to adjust the number of samples of each category label so that the distribution of the training samples is balanced, thereby avoiding the problem of overfitting in the subsequent model training process.
After the adjusted second training sample set is obtained, the initialization model is trained using the second training sample set, and the trained initialization model is determined as the target model. The initialization model may be, but is not limited to, a plurality of models, in which case the model with the best training result may be, but is not limited to being, determined as the target model; the initialization model may also be, but is not limited to, an integrated model of a plurality of models, in which case the integration parameters of the plurality of models may be, but are not limited to being, adjusted through training to obtain the target model.
According to the embodiment provided by the application, a preprocessed first training sample set is obtained based on the identified class labels and the model entering feature labels associated with the first object behavior data subset, wherein the model entering feature labels are used for indicating static features of objects associated with each object behavior data in the first object behavior data subset; according to a preset sampling mode, the number of training samples of each class of labels in the first training sample set is adjusted to obtain a second training sample set after sampling; training the initialization model based on the second training sample set, and determining the initialization model after training as a target model. Based on the identified class labels, the class labels are preprocessed and adjusted to serve as training samples of the target model for training to obtain the target model, so that the recognition prediction labeling of the class labels is carried out on unidentified object behavior data, the problem of recognition labeling limitation caused by the fact that data key feature information is missing is avoided, the recognition labeling range of the class labels is enlarged, and the accuracy of obtaining the class labels is improved.
As an alternative, according to a preset sampling manner, adjusting the number of training samples of each class of labels in the first training sample set to obtain a second sampled training sample set, including:
S1, acquiring weight parameters corresponding to each class label in a first training sample set, wherein the first weight parameters corresponding to the first class labels are the number of samples in the first training sample set divided by ten times the number of samples in a first training sample subset, and the first training sample subset is a set of training samples belonging to the first class labels in the first training sample set, and each class label comprises the first class label;
S2, obtaining the number of new training samples corresponding to each category label based on the weight parameters corresponding to each category label, wherein the number of new training samples corresponding to the first category label is the number of samples of the first training sample subset multiplied by a preset multiple, and the preset multiple is the first weight parameter plus one;
and S3, adjusting the first training sample set based on the number of the new training samples adjusted by the category labels to obtain a second training sample set.
Alternatively, in the present embodiment, the weight parameter corresponding to the category label may be, but not limited to, obtained by dividing the number of samples of all category labels by (10 times the number of samples of the category label).
Optionally, in this embodiment, the number of samples of each category label is adjusted based on the weight parameter corresponding to each category label, where the number of new training samples after adjustment of each category label is the number of training samples before adjustment multiplied by (weight parameter+1).
It should be noted that, based on the preset sampling mode, the class labels with fewer samples can obtain higher weight, and the class labels with more samples can obtain limited weight, so that the whole number of training samples is more balanced, and the problem of over fitting in the subsequent model training process is avoided.
According to the embodiment of the application, the weight parameters corresponding to each class label in the first training sample set are obtained, wherein the first weight parameters corresponding to the first class labels are the number of samples in the first training sample set divided by ten times the number of samples in the first training sample subset, and the first training sample subset is a set of training samples belonging to the first class labels in the first training sample set, and each class label comprises the first class label; obtaining the number of new training samples corresponding to each category label based on the weight parameters corresponding to each category label, wherein the number of new training samples corresponding to the first category label is the number of samples of the first training sample subset multiplied by a preset multiple, and the preset multiple is the first weight parameter plus one; and adjusting the first training sample set based on the number of the new training samples adjusted by the category labels to obtain a second training sample set. Based on the training samples after weight parameter adjustment, the quantity of the training samples is balanced, the possible overfitting problem in the subsequent target model training process is avoided, and the training accuracy of the target model is further improved, so that the accuracy of the target model for subsequently outputting the class label is improved, and the technical effect of improving the acquisition accuracy of the class label is realized.
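As a non-limiting illustration, the weight calculation and the adjusted sample counts described above can be sketched as follows. The function name, the example label counts, and the rounding to whole samples are assumptions made for the example; only the two formulas come from the description.

```python
from collections import Counter


def oversampling_targets(labels, divisor=10):
    """Return (weights, target counts) per category label.

    weight = total samples / (divisor * samples of the class)
    target = samples of the class * (weight + 1)
    """
    counts = Counter(labels)
    total = len(labels)
    weights, targets = {}, {}
    for label, n in counts.items():
        weights[label] = total / (divisor * n)
        # Rounding to a whole number of samples is an assumption of this sketch.
        targets[label] = round(n * (weights[label] + 1))
    return weights, targets


# Hypothetical imbalanced label distribution: 1000 samples in three classes.
labels = ["other"] * 800 + ["credit card"] * 150 + ["payroll"] * 50
weights, targets = oversampling_targets(labels)
```

With these example counts the weights are 0.125, about 0.667, and 2.0, and the adjusted counts are 900, 250, and 150. Each class grows by the same absolute amount (the total divided by ten), so the smallest class triples while the largest grows by only 12.5%.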
As an alternative, training the initialization model based on the second training sample set, and determining the initialization model after training as the target model includes:
S1, dividing a second training sample set into a training set and a testing set according to a preset proportion;
S2, training the plurality of initialization models by using the training set to obtain trained models corresponding to the plurality of initialization models, wherein the plurality of initialization models comprise: a random forest model, a lightweight gradient lifted tree model (LightGBM), a decision tree model, and an extreme gradient lifted tree model (XGBoost);
and S3, scoring the plurality of trained models by using the test set, and determining the trained model with the highest score as a target model.
Optionally, in this embodiment, after determining the adjusted second training sample for training to obtain the target model, the second sample set is segmented into the training set and the test set according to a preset proportion, for example, the second sample set is randomly segmented according to 7:3 to obtain the training set and the test set.
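As a non-limiting illustration, a random 7:3 split of the second training sample set can be sketched as follows. The function name, the fixed seed, and the integer stand-in samples are assumptions made for the example.

```python
import random


def split_samples(samples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(100))                  # stand-in for labeled training samples
train_set, test_set = split_samples(samples)
```

A stratified split, which keeps the label proportions equal in both parts, is a common alternative when classes are imbalanced; the description itself only specifies a random split.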
Optionally, in this embodiment, the training set may be, but not limited to, used to train the initialization model, the test set may be, but not limited to, verifying a trained model obtained by training using the training set, may be, but not limited to, inputting object behavior data in the test set into the trained model, and comparing a class label result output by the trained model with a class label result in the test set, so as to verify an identification prediction effect of a class label of the trained model.
Alternatively, in this embodiment, the class label result output by identifying the object behavior data of the test set based on the trained model may be compared with the identified class label of the test set, and the score of each trained model may be obtained through a plurality of scoring standard parameters, and the trained model with the highest score may be determined as the target model.
Alternatively, in the present embodiment, the above-mentioned plurality of scoring standard parameters may include, but are not limited to: AUC parameters, accuracy parameters, recall parameters, F1 parameters.
Optionally, in this embodiment, the training may be performed by using an ensemble learning model, which uses a "voting" integration manner to combine predictions of multiple machine learning models to generate a result, where each model prediction is considered as a "vote" and a prediction that obtains a plurality of votes is selected as a final prediction, for example, but not limited to, integrating three models including a LightGBM, a decision tree, and XGBoost, combining the prediction probabilities of each class of each model during the integration process, and selecting a class with the highest probability as the final recognition and prediction.
Optionally, in this embodiment, after the initialized ensemble learning model is acquired, the initialized ensemble learning model may be trained using the training set, and the trained ensemble learning model may be verified, adjusted, and optimized using the test set; for example, the prediction weight parameters of each machine learning model in the ensemble learning model may be adjusted according to the verification result.
It should be noted that selecting the ensemble learning model helps ensure the generalization of the model: different models are combined, different models can learn different characteristics of the data, and the fused result can perform better.
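As a non-limiting illustration, combining the per-category prediction probabilities of several models and selecting the category with the highest combined probability can be sketched as follows. The function name, the example probabilities, and the use of a weighted average as the combination rule are assumptions made for the example.

```python
def soft_vote(probabilities_per_model, weights=None):
    """Average per-category probabilities across models and pick the best category.

    probabilities_per_model: one dict per model, mapping category label -> probability.
    weights: optional per-model prediction weights (equal weights by default).
    """
    if weights is None:
        weights = [1.0] * len(probabilities_per_model)
    combined = {}
    for weight, probabilities in zip(weights, probabilities_per_model):
        for label, p in probabilities.items():
            combined[label] = combined.get(label, 0.0) + weight * p
    total_weight = sum(weights)
    averaged = {label: s / total_weight for label, s in combined.items()}
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of three models for one transaction.
model_outputs = [
    {"credit card": 0.60, "payroll": 0.40},
    {"credit card": 0.30, "payroll": 0.70},
    {"credit card": 0.55, "payroll": 0.45},
]
label, averaged = soft_vote(model_outputs)
```

In this example two of the three models individually prefer "credit card", but the averaged probability favors "payroll", which shows how combining probabilities can differ from counting one vote per model. Passing `weights` corresponds to adjusting the prediction weight of each model.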
As an alternative, the acquiring the object behavior data set to be identified includes:
S1, acquiring all object behavior data of the same object in a target time period;
and S2, screening first object behavior data for indicating the same object to periodically execute the target event in the target time period from all the object behavior data, wherein the object behavior data set comprises the first object behavior data.
Alternatively, in the present embodiment, the object behavior data set to be identified may be, but is not limited to, object historical behavior data including a plurality of objects, wherein the object historical behavior data of the same object may be, but is not limited to, historical behavior information for indicating that the object periodically performs a target event within a target period of time.
Further, taking the scenario of intent recognition and prediction based on regular transactions of funds outflow applied in the financial field as an example, the object behavior data set to be recognized may be, but not limited to, transaction detail data with an amount of money greater than ten thousand yuan transferred from the same customer in a certain bank to the same account in another bank in three months, for describing the transaction behavior characteristics of the customer.
According to the embodiment provided by the application, all object behavior data of the same object in a target time period are obtained; and screening out first object behavior data for indicating that the same object periodically executes the target event in the target time period from all the object behavior data, wherein the object behavior data set comprises the first object behavior data. The data information of regular transactions is screened to mine the regularity behind the object behaviors, so that the aim of improving the accuracy of identification labeling of the follow-up category labels is fulfilled.
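As a non-limiting illustration, screening regular transactions from all object behavior data can be sketched as follows. The field names, the amount threshold, and the reading of "periodically" as "transfers to the same counterparty account in at least three distinct months" are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def screen_regular_outflows(transactions, min_months=3, min_amount=10_000):
    """Keep transfers above min_amount whose (customer, counterparty) pair
    appears in at least min_months distinct calendar months."""
    months_by_pair = defaultdict(set)
    for t in transactions:
        if t["amount"] > min_amount:
            pair = (t["customer_id"], t["counterparty_account"])
            months_by_pair[pair].add((t["date"].year, t["date"].month))
    regular_pairs = {
        pair for pair, months in months_by_pair.items() if len(months) >= min_months
    }
    return [
        t for t in transactions
        if t["amount"] > min_amount
        and (t["customer_id"], t["counterparty_account"]) in regular_pairs
    ]


# Hypothetical transaction details inside a three-month target time period.
sample = [
    {"customer_id": "C1", "counterparty_account": "A1", "amount": 20_000, "date": date(2023, 1, 5)},
    {"customer_id": "C1", "counterparty_account": "A1", "amount": 20_000, "date": date(2023, 2, 5)},
    {"customer_id": "C1", "counterparty_account": "A1", "amount": 20_000, "date": date(2023, 3, 5)},
    {"customer_id": "C2", "counterparty_account": "A2", "amount": 15_000, "date": date(2023, 1, 9)},
    {"customer_id": "C2", "counterparty_account": "A2", "amount": 15_000, "date": date(2023, 1, 20)},
    {"customer_id": "C3", "counterparty_account": "A3", "amount": 500, "date": date(2023, 2, 1)},
]
kept = screen_regular_outflows(sample)
```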
As an alternative, before determining the category label of each object behavior data in the first subset of object behavior data based on the feature word and the target dictionary, the method further comprises:
S1, searching first type information of each object behavior data in an object behavior data set, and performing word segmentation processing on the first type information to obtain feature words, wherein the first type information is information input by an object when a target event is executed;
S2, under the condition that the characteristic words cannot be obtained based on the first type information, searching second type information of each object behavior data in the object behavior data set, and performing word segmentation processing on the second type information to obtain the characteristic words, wherein the second type information is information automatically generated after the object executes the target event;
S3, in the case that the feature words of the second object behavior data cannot be obtained based on the second type information of the second object behavior data, determining that the second object behavior data does not have the feature words, and determining that the second object behavior data belongs to the second object behavior data subset.
Optionally, in this embodiment, in the case where the object behavior data to be identified is acquired, whether the object behavior data has an associated feature word may be further searched. This may be, but is not limited to, acquiring first type information of the object behavior data, where the first type information may be, but is not limited to, information input by the object when executing the target event, for example, the remark (postscript) text entered by the user corresponding to the object behavior data when executing the related event. In the case where the remark text is found and successfully segmented to obtain a feature word (i.e., a keyword), the category label matching the feature word is looked up in the target dictionary.
Optionally, in this embodiment, in the case where the first type information is null or lacks a feature word, the second type information may be further searched, where the second type information may be, but is not limited to, information automatically generated after the object executes the target event. Taking a fund transaction as the target event, this may be, but is not limited to, performing word segmentation on information such as the service identification code and the internet service code of the fund transaction to search for a feature word, and, in the case where a feature word is obtained, looking up the category label matching the feature word in the target dictionary.
It should be noted that, in a case where the feature word of the second object behavior data cannot be obtained based on the second type information of the second object behavior data, it is determined that the second object behavior data does not have the feature word, and it is determined that the second object behavior data belongs to the second object behavior data subset.
It should be noted that, in the case where the feature word of the second object behavior data is obtained based on the first type information or the second type information of the second object behavior data, it is determined that the second object behavior data has the feature word, and it is determined that the second object behavior data belongs to the first object behavior data subset.
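As a non-limiting illustration, the fallback from the first type information to the second type information, and the resulting assignment to the two subsets, can be sketched as follows. The field names `remark` and `generated_info`, the keyword list, and the substring matching used in place of real word segmentation are assumptions made for the example.

```python
# Hypothetical keywords that the target dictionary knows about.
KEYWORDS = ("car loan", "rent", "salary")


def segment(text, keywords=KEYWORDS):
    """Stand-in for word segmentation: keep the known keywords that occur in the text."""
    return [keyword for keyword in keywords if keyword in text]


def find_feature_words(record):
    """Try the user-entered remark first, then the automatically generated information."""
    for field in ("remark", "generated_info"):
        words = segment(record.get(field) or "")
        if words:
            return words
    return []


def split_subsets(records):
    """Records with feature words go to the first subset, the rest to the second."""
    first_subset, second_subset = [], []
    for record in records:
        (first_subset if find_feature_words(record) else second_subset).append(record)
    return first_subset, second_subset


records = [
    {"id": 1, "remark": "monthly rent", "generated_info": ""},
    {"id": 2, "remark": None, "generated_info": "salary payment service"},
    {"id": 3, "remark": "", "generated_info": "code 0042"},
]
first_subset, second_subset = split_subsets(records)
```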
As an alternative, the identification prediction method for category labels is applied to an intention recognition and prediction scenario based on regular fund-outflow transactions. In this scenario, the embodiment may, but is not limited to, start from the transaction behavior characteristics of clients and the importance of the funds receiving rules. First, the transaction behavior characteristic of a client transferring funds out to the same counterparty every month is located, and target clients are screened from the regular transaction data, which lays the groundwork for establishing the subsequent intention recognition rules and constructing the model. Second, transaction intention recognition rules are established by fusing in-bank labels and expert rules, the transaction keywords drawn from remark text are continuously updated by means of word segmentation, and the client's own actual transaction intention is mined, which makes it easier to refine the initial category labels that can be identified for clients. Third, unlike the traditional SMOTE oversampling mode, an innovative SMOTE oversampling strategy is applied to the imbalanced client label data identified by the rules to solve the problem of imbalanced distribution of modeling samples, and the processed sample data is used as a training set for supervised machine learning model training, so as to predict as many unidentified transaction labels as possible. Finally, comparing and evaluating the prediction models on multiple accuracy dimensions and on feature importance assesses the accuracy and stability of the models better than a single index does; the prediction model with the highest accuracy is selected to obtain the client's transaction intention, so that business departments can conveniently carry out individually personalized precision marketing for clients according to scenarios with different transaction purposes, which supports the development of a funds receiving strategy.
Further illustratively, as shown in fig. 2, a class label recognition and prediction system based on the above-mentioned class label recognition and prediction method at least includes a screening target client module 202, a rule module 204 for creating classification and recognition of transaction intention, a module 206 for creating machine learning model prediction algorithm, and a model evaluation module 208.
The following describes the above modules to embody the general concept of this embodiment:
(I) Screening target client module 202
The banking individual customer funds transaction detail has huge value, but the funds transaction itself has the characteristics of diversified economic activities, diversified transaction relations, novel funds management method and the like, so that the rule of digging funds circulation tends to be complex. The screening of valuable transaction details from massive complex transaction data becomes a crucial first step, and by mining regularity behind customer transaction behaviors, the regular characteristics of funds leakage can be found, so that an effective funds receiving strategy is formulated.
In this embodiment, the fund network relationships among clients are analyzed in depth, the regular behavior of certain clients transferring funds every month to the same counterparty account outside the bank is located, and key clients with potential marketing value are found by taking the fund circulation pattern as the identification point. Based on the retail business transaction detail data of individual clients, the out-of-bank transfer transactions of clients are screened using the feature data related to the in-bank clearing path code, the debit/credit flag, and the money transfer code; further, the target clients with regular fund-outflow transactions are accurately identified by searching, through client number, counterparty account, and transaction time, for three or more transactions to the same counterparty account. For example, the data scope of the present invention is as follows: transaction detail data in which the same client of the Shenzhen regional branch transfers more than 10,000 yuan to the same counterparty account outside the bank within three months is selected to describe the transaction behavior characteristics of the client.
(II) establishing the transaction intention Classification recognition rules Module 204
For the screened target clients, transaction intention classification and identification rules need to be established. The rules integrate in-bank labels and expert rules, and take transaction details, merchant names, service identification codes, service major classes, transaction annotations, and internet service codes as the classification basis for comprehensively judging the transaction category. First, the remark in the transaction details comes from the client's true transaction intention and is the most important basis for judgment, so the classification of transaction purpose is mainly based on the keywords in this field. A dictionary of category keywords is established for each transaction category, and new keywords can be periodically identified and added to the dictionary as subsequent remark text changes. Then, the merchant name is used to judge the nature of the counterparty account, and the transaction category label is inferred from the purpose of the account in the transaction. Next, the service identification code, the service major class, the transaction annotation, and the internet service code are combined as a further basis for inference; when the transaction remark is null or unknown, the data can be mined from these four features to identify the category label. Starting from the characteristics of client transaction behavior and from marketing value, the invention finally classifies client transaction data into the following 11 types of labels: vehicle loan, housing loan, consumer loan, unknown loan, payroll, financing lease, living consumption, credit card, rent, and other. Specific category identification rules and category labels are shown in table 1 below.
Note that, as shown in table 1, labeling logic of the category identification rule is performed in the order of the rule in the table from top to bottom and then from left to right.
TABLE 1 Classification rules and Classification Label Table
The client transaction intention is divided into 11 types of labels; when related business scenarios are developed, individually personalized marketing can be carried out according to the replacement cost. For example, vehicle loan: assess the client's asset level; the replacement cost is high. Housing loan: assess the client's asset level; if the client is a small business owner, promotion of the general loan product can be tried; the replacement cost is high. Financial management: the counterparty account is still at another bank, the client is considered to have a low risk preference, and conversion can be tried; the replacement cost is high. Rent: assess the client's asset level; the replacement cost is high. Living consumption: this is the client's normal consumption demand and has no fixed business scenario; the replacement cost is high. Credit card: assess the client's asset level and check whether the client holds a credit card with the counterparty bank; the replacement cost is low. Consumer loan: the client has a demand for consumer borrowing, and promotion of low-interest-rate e-borrowed products can be tried; the replacement cost is low. Unknown loan: the client has a loan demand; the replacement cost is low. Payroll: determine whether the client is a small business owner and whether wages are currently paid manually; opening a corporate account can be tried so as to convert the client into a payroll agency client; the replacement cost is low. Financing lease: the client has financing and leasing demands; the replacement cost is low. Overall, the replacement cost for vehicle loan, housing loan, financial management, rent, and living consumption is high, which does not favor developing business scenarios; the replacement cost for credit card, consumer loan, unknown loan, payroll, and financing lease is relatively low, and developing business scenarios can be considered so as to support the maintenance and marketing of clients.
(III) construction of machine learning model prediction algorithm Module 206
After the transaction intention recognition rules are established, observation of the labeled client category labels shows that the transaction intention of most clients cannot be identified because key feature data is missing (under the example data scope, the client transactions that can be labeled account for about 20% of the total data). That is, labeling purely by expert rules has certain limitations, and it is strongly affected by the subjective factor of whether the client is willing to fill in the remark. Therefore, in order to further mine the patterns of client fund transfers and reduce the influence of the client's subjective factors, a machine learning model prediction algorithm is introduced: the client labels identified by the classification rules are used as the training set, and after the innovative SMOTE oversampling strategy is applied, supervised machine learning model training is carried out, so that label prediction is performed for the clients that cannot be identified by the classification rules. The specific implementation steps of the prediction model are as follows:
(1) Data preprocessing
Before the model is built, the raw data needs to be processed. First, the basic information of in-bank clients and the transaction behavior data are integrated (see fig. 3) so that the algorithm modeling features are completed along multiple dimensions, and a raw-data wide table of algorithm modeling features is constructed. Second, the data used for model training is determined by equal-frequency and equal-width binning; the binning is performed on transaction data de-duplicated at the client transaction level. Feature discretization is highly robust to abnormal data and makes the subsequently trained model more stable. Third, the model entering feature labels and the identified category label data are converted into numeric codes using LabelEncoder, so that they are standardized from the original Chinese text information, which facilitates subsequent model training.
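As a non-limiting illustration, equal-width binning, equal-frequency binning, and label encoding can be sketched in plain Python as follows. The function names and example values are assumptions made for the example; the label encoding assigns codes in sorted order of the distinct labels, and the description does not specify the bin counts actually used.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # avoid division by zero if all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to a bin so that bins hold roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def label_encode(labels):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes


amounts = [1, 2, 3, 4, 100, 200]               # hypothetical amounts with two outliers
width_bins = equal_width_bins(amounts, 3)
frequency_bins = equal_frequency_bins(amounts, 3)
codes, classes = label_encode(["payroll", "credit card", "payroll"])
```

With these values, equal-width binning places the four small amounts in one bin while equal-frequency binning spreads the values two per bin, which illustrates why discretized features are less sensitive to outliers than raw amounts.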
(2) SMOTE oversampling innovation strategy
Because the situation that the quantity of ten class labels in the identification rule sample is unbalanced is identified, in order to avoid the problem of fitting exceeding in the subsequent modeling work, an SMOTE innovation strategy is decided to be adopted to increase the quantity of few class samples. In the traditional up-sampling mode, the SMOTE processing is carried out on all data, and the traditional sampling mode can cause over-fitting due to the fact that the number of the identified sample data is too small, so that the distribution between the training set and the test set is extremely unbalanced. Thus, the innovative oversampling strategy is determined according to the respective classification weights, the calculation of the weights is obtained by dividing the overall number of samples by ten times the number of samples of each class, which results in a higher oversampling weight obtained for a smaller number of classes, while the oversampling weights for a sufficient number of classes per se are limited, so that the overall number is more balanced, and the new number of samples per class is obtained by multiplying the calculated weight by the number of samples per class. The unbalance data is corrected by adjusting the oversampling proportion based on the setting of the smote weight, and the excessive difference between the distribution of each category after the smote and the distribution of the original category is not caused. The calculation steps are as follows: let T (total)/N (number of classes) =m according to the total number, resulting in the number of originals per class: class i X i =M;Then, the smoothie coefficient of each class is obtained, so that the smaller number of classes obtain larger smoothie weight, and the number of the obtained smoothies is as follows: class i (X i +1)=smote(classi)。
(3) Construction of machine learning predictive model
A model is constructed from the customer data samples labeled by the identification rules, and is then used to predict labels for customers that the identification rules failed to label, thereby predicting the customers' transaction intention. For model construction, 14 features that are helpful for category prediction are first selected from the structured wide table of model-input features as the initial data features, and the data is randomly split into a training set and a test set at a ratio of 7:3. Based on the model features and objects, this embodiment constructs four supervised machine learning prediction models: random forest, light gradient boosting machine (LightGBM), decision tree, and extreme gradient boosting (XGBoost).
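The 7:3 random split can be sketched as follows. The function name `split_train_test`, the fixed seed, and the integer stand-in samples are assumptions for illustration only; the embodiment does not state how the split is implemented.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def split_train_test(
    samples: Sequence[S], train_ratio: float = 0.7, seed: int = 0
) -> Tuple[List[S], List[S]]:
    """Shuffle the samples and cut them into a training part and a test part."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical samples, represented here by their indices.
train_set, test_set = split_train_test(list(range(10)))
```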
In addition, to ensure the generalization of the model, an ensemble learning model is also used. Its advantage is that it brings different models together: different models can learn different characteristics of the data, and the fused result can perform better.
The ensemble learning model adopts a voting ensemble, which combines the predictions of several machine learning models into one result: each model's prediction is treated as a "vote", and the prediction that receives the most votes is selected as the final prediction. In the model training of this embodiment, the LightGBM, decision tree, and XGBoost models are combined by voting; during integration, the predicted probabilities of each model for each category are combined, and the category with the highest probability is selected as the final prediction.
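The difference between counting votes and combining probabilities can be seen in a small Python sketch. The function names `hard_vote` and `soft_vote`, the simple averaging of probabilities, and the example probabilities are assumptions; the text only states that the per-category probabilities of the three models are combined.

```python
from collections import Counter
from typing import Dict, List


def hard_vote(model_probs: List[Dict[str, float]]) -> str:
    """Each model votes for its most probable label; the majority wins."""
    votes = Counter(max(probs, key=probs.get) for probs in model_probs)
    return votes.most_common(1)[0][0]


def soft_vote(model_probs: List[Dict[str, float]]) -> str:
    """Average the per-label probabilities and pick the highest average."""
    totals: Dict[str, float] = {}
    for probs in model_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=lambda label: totals[label] / len(model_probs))


# Hypothetical per-label probabilities from three models for one sample.
probs_per_model = [
    {"rent": 0.90, "loan": 0.10},
    {"rent": 0.40, "loan": 0.60},
    {"rent": 0.45, "loan": 0.55},
]
hard_result = hard_vote(probs_per_model)
soft_result = soft_vote(probs_per_model)
```

In this example two of the three models favor "loan", so the majority vote returns "loan", while the averaged probabilities favor "rent" because one model is very confident. Probability-based combination therefore takes model confidence into account.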
(4) Model evaluation module 208
To optimize the prediction effect, the invention compares several supervised machine learning models. The identified label data in the test set is compared with each model's predictions along multiple dimensions, and the best prediction model is selected to predict labels for the transaction data that could not be labeled. In terms of prediction distribution, on the existing labeled test set the prediction distributions of LightGBM, the decision tree, XGBoost, and the ensemble learning model are basically consistent with the true label distribution, while the random forest performs worst on categories with few samples; among the samples labeled as "other", the prediction distributions of the LightGBM and XGBoost models are the most similar. In terms of prediction results, the macro-AUC values of all five models exceed 90%. Even the worst-performing random forest reaches a very high AUC; when the class distribution of the samples is unbalanced, the AUC calculated in a multi-class scenario is probably too optimistic about the model. Under an unbalanced sample distribution, the prediction effect of each model cannot be accurately evaluated from AUC and accuracy alone, so three evaluation indexes, precision, recall, and F1 score, are used to measure the quality of model prediction. The F1 score jointly considers precision and recall; the higher the F1 score, the more accurate the model's predictions.
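Why accuracy alone can mislead on unbalanced data is easy to see in a short Python sketch. The function name `macro_prf`, the use of macro averaging for precision, recall, and F1, and the toy labels are assumptions for illustration; the text does not specify the averaging scheme for these three indexes.

```python
from typing import List, Sequence, Tuple


def macro_prf(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float, float]:
    """Return macro-averaged precision, recall, and F1 over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# A model that always predicts the majority class on unbalanced data.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_prf(y_true, y_pred)
```

Here the accuracy is 0.8 even though the minority class "b" is never predicted; the macro-averaged precision, recall, and F1 score are all much lower and expose the failure on the minority class.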
Taken together, comparing the evaluation results of the five constructed models and taking feature importance into account, XGBoost predicts the minority categories better than the other models.
It should be noted that this embodiment takes the regular outflow transactions of individual customers' funds as its entry point and provides a method for identifying and predicting transaction intention. It locates the transaction behavior in which a customer transfers funds out of the bank to the same counterparty every month, and fuses in-bank labels with expert identification rules to assign initial intention labels. On the basis of the identified transactions, the unidentified transactions are then predicted in a second pass: machine learning models are constructed and compared along multiple dimensions, the result with the highest accuracy is selected, and the actual intention of the customer's transaction is predicted. Using the transaction intention identification and prediction results, business departments can carry out individually personalized marketing activities, reach customers with different fund purposes in a targeted manner, screen target customers with marketing value, and improve the in-bank fund retention rate. In addition, this embodiment can refine the customer identification rules and iterate the model based on result feedback from the business departments, providing strong support for subsequent marketing interaction.
It should be noted that, from the massive volume of fund collection transactions, this embodiment locates the transaction behavior in which a customer transfers funds out of the bank to the same counterparty every month and screens target customers with regular transactions according to this behavior, which lays the groundwork for establishing the subsequent intention identification rules and constructing the model, and effectively explores the causes of fund outflow.
It should be noted that the data processing of this embodiment departs from the traditional practice of up-sampling all data: SMOTE weights are set differently according to each category's share of the samples in order to adjust the oversampling proportion, which corrects the overfitting problem caused by sample imbalance.
It should be further noted that, on the basis of effective screening of target customers, the transaction labeling of this embodiment jointly considers in-bank labels, expert identification rules, and comparative modeling with multiple machine learning models, so that the final result is more accurate, the limitations of conventional rule-based labeling are compensated for, the labeling range is further enlarged, and the applicability of the method is broadened.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The embodiment of the application also provides a device for identifying and predicting the class labels, and the device for identifying and predicting the class labels can be used for executing the method for identifying and predicting the class labels. The following describes a device for identifying and predicting category labels provided by the embodiment of the application.
Fig. 4 is a schematic diagram of an identification prediction apparatus of category labels according to an embodiment of the present application. As shown in fig. 4, the apparatus includes:
an obtaining unit 401, configured to obtain an object behavior data set to be identified, where the object behavior data set includes a plurality of object behavior data for indicating a historical behavior of an object;
a first recognition unit 402, configured to determine, when a first subset of object behavior data having feature words in the object behavior data set is found, a category label of each object behavior data in the first subset of object behavior data based on the feature words and a target dictionary, where the target dictionary is used to indicate a mapping relationship between pairs of feature words and the category label, and the category label is used to indicate behavior intention information of an object;
a second recognition unit 403, configured to predict, by using a target model, a second subset of object behavior data without feature words in the object behavior data set, so as to obtain class labels of all object behavior data in the second subset of object behavior data, where the target model is obtained by training based on all object behavior data in the first subset of object behavior data and class labels of all objects in the first subset of object behavior data.
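The division of work between the first and second recognition units can be sketched in Python as follows. This is a hedged illustration only: the dictionary contents, the record field `feature_word`, the function name `partition_and_label`, and the choice to send records whose word is absent from the dictionary to the second subset are all assumptions, not details given by the application.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]

# Hypothetical target dictionary: feature word -> category label.
TARGET_DICTIONARY: Dict[str, str] = {
    "rent": "housing",
    "mortgage": "housing",
    "tuition": "education",
}


def partition_and_label(
    records: List[Record], dictionary: Dict[str, str]
) -> Tuple[List[Tuple[Record, str]], List[Record]]:
    """Label records that have a known feature word; set the rest aside."""
    first_subset: List[Tuple[Record, str]] = []  # labeled by the dictionary
    second_subset: List[Record] = []  # left for the target model
    for record in records:
        word = record.get("feature_word")
        if word is not None and word in dictionary:
            first_subset.append((record, dictionary[word]))
        else:
            second_subset.append(record)
    return first_subset, second_subset


records: List[Record] = [
    {"id": "t1", "feature_word": "rent"},
    {"id": "t2", "feature_word": None},
    {"id": "t3", "feature_word": "tuition"},
]
labeled, unlabeled = partition_and_label(records, TARGET_DICTIONARY)
```

The labeled pairs would then serve as training samples for the target model, and the unlabeled records would be passed to the trained model for prediction.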
Optionally, in the apparatus for identifying and predicting a category label provided in the embodiment of the present application, the apparatus further includes:
a pushing module, configured to push target information corresponding to a target category label to a target object when the target category label associated with target object behavior data is acquired, where the target information includes business marketing information matched with the behavior intention information of the target object.
Optionally, in the apparatus for identifying and predicting a category label provided in the embodiment of the present application, the apparatus further includes:
a determining module, configured to obtain, before the second object behavior data subset without feature words in the object behavior data set is predicted by using the target model, a preprocessed first training sample set based on the identified class labels and the model-input feature labels associated with the first object behavior data subset, where the model-input feature labels are used to indicate static features of the objects associated with each object behavior data in the first object behavior data subset;
an adjusting module, configured to adjust, before the second object behavior data subset without feature words in the object behavior data set is predicted by using the target model, the number of training samples of each class label in the first training sample set according to a preset sampling mode, to obtain a sampled second training sample set;
a training module, configured to train, before the second object behavior data subset without feature words in the object behavior data set is predicted by using the target model, an initialization model based on the second training sample set, and to determine the trained initialization model as the target model.
Optionally, in the identifying and predicting device for a category label provided by the embodiment of the present application, the adjusting module includes:
the acquiring sub-module is used for acquiring weight parameters corresponding to each class label in the first training sample set, wherein the first weight parameters corresponding to the first class labels are the number of samples in the first training sample set divided by ten times the number of samples in the first training sample sub-set, the first training sample sub-set is a set of training samples belonging to the first class labels in the first training sample set, and each class label comprises the first class labels;
the first determining submodule is used for obtaining the number of new training samples corresponding to each category label based on the weight parameter corresponding to each category label, wherein the number of the new training samples corresponding to the first category label is the number of samples of the first training sample subset multiplied by a preset multiple, and the preset multiple is the first weight parameter added by one;
an adjustment sub-module, configured to adjust the first training sample set based on the number of new training samples corresponding to each category label, to obtain a second training sample set.
Optionally, in the identifying and predicting device for a category label provided by the embodiment of the present application, the training module includes:
a splitting sub-module, configured to split the second training sample set into a training set and a test set according to a preset proportion;
a training sub-module, configured to train a plurality of initialization models respectively by using the training set, to obtain a trained model corresponding to each of the plurality of initialization models, where the plurality of initialization models include: a random forest model, a light gradient boosting machine (LightGBM) model, a decision tree model, and an extreme gradient boosting (XGBoost) model;
and the second determining submodule is used for scoring the plurality of trained models by using the test set and determining the trained model with the highest score as the target model.
Optionally, in the identifying and predicting device for a category label provided in the embodiment of the present application, the obtaining unit 401 includes:
the acquisition module is used for acquiring all object behavior data of the same object in a target time period;
and the screening module is used for screening first object behavior data for indicating the same object to periodically execute the target event in the target time period from all the object behavior data, wherein the object behavior data set comprises the first object behavior data.
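The screening step can be illustrated with a short Python sketch. It assumes, following the description of monthly outbound transfers to the same counterparty, that "periodically" means at least one transfer to the same counterparty in every month of the target time period; the field names (`customer`, `counterparty`, `month`), the function name, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Transaction = Dict[str, str]


def screen_regular_transfers(
    transactions: List[Transaction], months: Sequence[str]
) -> List[Transaction]:
    """Keep transfers to counterparties that a customer pays in every month."""
    seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for tx in transactions:
        seen[(tx["customer"], tx["counterparty"])].add(tx["month"])
    required = set(months)
    # A (customer, counterparty) pair is regular if it covers all months.
    regular_pairs = {pair for pair, covered in seen.items() if required <= covered}
    return [
        tx
        for tx in transactions
        if (tx["customer"], tx["counterparty"]) in regular_pairs
        and tx["month"] in required
    ]


transactions: List[Transaction] = [
    {"customer": "c1", "counterparty": "landlord", "month": "2023-01"},
    {"customer": "c1", "counterparty": "landlord", "month": "2023-02"},
    {"customer": "c1", "counterparty": "landlord", "month": "2023-03"},
    {"customer": "c1", "counterparty": "shop", "month": "2023-02"},
    {"customer": "c2", "counterparty": "landlord", "month": "2023-01"},
]
regular = screen_regular_transfers(transactions, ["2023-01", "2023-02", "2023-03"])
```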
Optionally, in the apparatus for identifying and predicting a category label provided in the embodiment of the present application, the apparatus further includes:
the first searching module is used for searching first type information of each object behavior data in the object behavior data set before determining class labels of each object behavior data in the first object behavior data subset based on the feature words and the target dictionary, and performing word segmentation processing on the first type information to obtain the feature words, wherein the first type information is information input by the object when executing the target event;
the second searching module is used for searching second type information of each object behavior data in the object behavior data set under the condition that the characteristic word cannot be obtained based on the first type information before determining the class label of each object behavior data in the first object behavior data subset based on the characteristic word and the target dictionary, and performing word segmentation processing on the second type information to obtain the characteristic word, wherein the second type information is information automatically generated after the object executes the target event;
and a third determining module, configured to determine, before determining class labels of respective object behavior data in the first subset of object behavior data based on the feature words and the target dictionary, that the second object behavior data does not have the feature words if the feature words of the second object behavior data cannot be obtained based on the second type information of the second object behavior data, and determine that the second object behavior data belongs to the second subset of object behavior data.
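The two-stage search for feature words described by these modules can be sketched as below. The field names `remark` (information entered by the object) and `summary` (information generated automatically), the vocabulary, and the whitespace tokenizer are assumptions; real Chinese text would need a proper word segmentation tool in place of `split()`.

```python
from typing import Dict, Optional, Set

# Hypothetical set of feature words known to the target dictionary.
VOCABULARY: Set[str] = {"rent", "mortgage", "tuition"}


def find_feature_word(record: Dict[str, str], vocabulary: Set[str]) -> Optional[str]:
    """Search the user-entered text first, then the auto-generated text."""
    for field in ("remark", "summary"):
        # Whitespace splitting stands in for word segmentation in this sketch.
        for token in record.get(field, "").lower().split():
            if token in vocabulary:
                return token
    # No feature word in either field: the record belongs to the second subset.
    return None


from_remark = find_feature_word({"remark": "monthly rent", "summary": "transfer"}, VOCABULARY)
from_summary = find_feature_word({"remark": "", "summary": "tuition payment"}, VOCABULARY)
not_found = find_feature_word({"remark": "hello", "summary": "transfer"}, VOCABULARY)
```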
The apparatus for identifying and predicting category labels provided by the embodiment of the application uses the target dictionary and the target model together to identify and label the category labels of the object behavior data set. The target dictionary labels the first object behavior data subset, which has feature words, based on preset rules (the mapping relation), and the labeling result is used as sample data to train the target model. The trained target model then predicts and labels the second object behavior data subset, which has no feature words and cannot be processed by the target dictionary. This avoids the limitation on identification and labeling caused by the lack of key feature information in the data, enlarges the range of category label identification and prediction labeling, achieves the technical effect of effectively improving the accuracy of category label identification and prediction, and solves the technical problem in the related art that the accuracy of category label identification and prediction is low.
The apparatus for identifying and predicting category labels includes a processor and a memory. The acquisition unit, the first recognition unit, the second recognition unit, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters the range of identification and prediction labeling of class labels is enlarged, thereby improving the accuracy of the class labels.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program that, when executed by a processor, implements the above-described category label identification prediction method.
The embodiment of the invention provides a processor for running a program, wherein the program runs to execute the identification and prediction method of the category labels.
As shown in fig. 5, an embodiment of the present invention provides an electronic device, where the device includes a processor, a memory, and a program stored in the memory and executable on the processor, and when the processor executes the program, the following steps are implemented:
acquiring an object behavior data set to be identified, wherein the object behavior data set comprises a plurality of object behavior data used for indicating historical behaviors of an object;
Under the condition that a first object behavior data subset with feature words in an object behavior data set is found, determining category labels of all object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary is used for indicating mapping relations between pairs of feature words and the category labels, and the category labels are used for indicating behavior intention information of objects;
predicting a second object behavior data subset without feature words in the object behavior data set by using a target model to obtain class labels of all object behavior data in the second object behavior data subset, wherein the target model is obtained based on all object behavior data in the first object behavior data subset and class labels of all objects in the first object behavior data subset.
The device herein may be a server, PC, PAD, cell phone, etc.
The application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program that initializes the following method steps:
acquiring an object behavior data set to be identified, wherein the object behavior data set comprises a plurality of object behavior data used for indicating historical behaviors of an object;
Under the condition that a first object behavior data subset with feature words in an object behavior data set is found, determining category labels of all object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary is used for indicating mapping relations between pairs of feature words and the category labels, and the category labels are used for indicating behavior intention information of objects;
predicting a second object behavior data subset without feature words in the object behavior data set by using a target model to obtain class labels of all object behavior data in the second object behavior data subset, wherein the target model is obtained based on all object behavior data in the first object behavior data subset and class labels of all objects in the first object behavior data subset.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A method for identifying and predicting category labels, characterized by comprising:
acquiring an object behavior data set to be identified, wherein the object behavior data set comprises a plurality of object behavior data used for indicating historical behaviors of an object;
under the condition that a first object behavior data subset with feature words in the object behavior data set is searched, determining category labels of all object behavior data in the first object behavior data subset based on the feature words and a target dictionary, wherein the target dictionary is used for indicating mapping relations between a plurality of pairs of feature words and the category labels, and the category labels are used for indicating behavior intention information of the objects;
predicting a second object behavior data subset without the feature words in the object behavior data set by using a target model to obtain the category labels of all the object behavior data in the second object behavior data subset, wherein the target model is obtained by training based on all the object behavior data in the first object behavior data subset and the category labels of all the objects in the first object behavior data subset.
2. The method according to claim 1, wherein the method further comprises:
and pushing target information corresponding to the target class label to the target object under the condition that the target class label associated with the target object behavior data is acquired, wherein the target information comprises business marketing information matched with behavior intention information of the target object.
3. The method of claim 1, wherein prior to predicting a second subset of object behavior data in the set of object behavior data that does not have the feature word using the target model, the method further comprises:
obtaining a preprocessed first training sample set based on the identified class labels and the model entering feature labels associated with the first subset of object behavior data, wherein the model entering feature labels are used for indicating static features of objects associated with each object behavior data in the first subset of object behavior data;
according to a preset sampling mode, the number of training samples of each class of labels in the first training sample set is adjusted to obtain a second training sample set after sampling;
training the initialization model based on the second training sample set, and determining the initialization model after training as the target model.
4. The method of claim 3, wherein the adjusting the number of training samples of each class label in the first training sample set according to the preset sampling manner to obtain the sampled second training sample set includes:
obtaining weight parameters corresponding to each class label in the first training sample set, wherein the first weight parameters corresponding to a first class label are the number of samples in the first training sample set divided by ten times the number of samples in a first training sample subset, and the first training sample subset is a set of training samples belonging to the first class label in the first training sample set, and each class label comprises the first class label;
obtaining the number of new training samples corresponding to each category label based on the weight parameters corresponding to each category label, wherein the number of new training samples corresponding to the first category label is the number of samples of the first training sample subset multiplied by a preset multiple, and the preset multiple is the first weight parameter added by one;
and adjusting the first training sample set based on the number of new training samples corresponding to each category label to obtain the second training sample set.
5. The method of claim 3, wherein training the initialization model based on the second set of training samples and determining the trained initialization model as the target model comprises:
dividing the second training sample set into a training set and a testing set according to a preset proportion;
training a plurality of initialization models respectively by using the training set to obtain a trained model corresponding to each of the initialization models, wherein the initialization models comprise: a random forest model, a light gradient boosting machine (LightGBM) model, a decision tree model, and an extreme gradient boosting (XGBoost) model;
and scoring a plurality of trained models by using the test set, and determining the trained model with the highest score as the target model.
6. The method of claim 1, wherein the obtaining the set of object behavior data to be identified comprises:
acquiring all object behavior data of the same object in a target time period;
and screening first object behavior data for indicating the same object to periodically execute a target event in the target time period from all the object behavior data, wherein the object behavior data set comprises the first object behavior data.
7. The method of claim 6, wherein prior to the determining the category labels for each object behavior data in the first subset of object behavior data based on the feature words and target dictionary, the method further comprises:
searching first type information of each object behavior data in the object behavior data set, and performing word segmentation processing on the first type information to obtain the feature words, wherein the first type information is information input by the object when the object executes the target event;
in a case where the feature words cannot be obtained based on the first type information, searching second type information of each object behavior data in the object behavior data set, and performing word segmentation processing on the second type information to obtain the feature words, wherein the second type information is information automatically generated after the object executes the target event;
in a case where the feature words of second object behavior data cannot be obtained based on the second type information of the second object behavior data, determining that the second object behavior data does not have the feature words, and determining that the second object behavior data belongs to the second subset of object behavior data.
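The fallback order described in claim 7 can be illustrated with the sketch below. The substring matcher is a naive stand-in for a real word segmentation step, and the field names `user_note` (first type information) and `system_summary` (second type information), as well as the example dictionary entries, are hypothetical.

```python
def segment(text, vocabulary):
    """Naive stand-in for word segmentation: return the dictionary words
    that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [word for word in vocabulary if word in lowered]


def extract_feature_words(record, vocabulary):
    """Try the first type information, then fall back to the second type."""
    for field in ("user_note", "system_summary"):
        words = segment(record.get(field) or "", vocabulary)
        if words:
            return words
    return []  # neither field yielded a feature word


def partition(records, target_dictionary):
    """Split records into a labelled first subset and an unlabelled second
    subset; labels come from the feature-word-to-label dictionary."""
    first_subset, second_subset = [], []
    for record in records:
        words = extract_feature_words(record, target_dictionary)
        if words:
            first_subset.append((record, target_dictionary[words[0]]))
        else:
            second_subset.append(record)
    return first_subset, second_subset
```

Records that end up in `second_subset` are the ones left for the target model to predict.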
8. An identification prediction apparatus for category labels, comprising:
an acquisition unit configured to acquire an object behavior data set to be identified, where the object behavior data set includes a plurality of object behavior data for indicating a history behavior of an object;
a first recognition unit, configured to determine, when a first subset of object behavior data having feature words in the object behavior data set is found, a category label of each object behavior data in the first subset of object behavior data based on the feature words and a target dictionary, where the target dictionary is used to indicate a mapping relationship between a plurality of pairs of the feature words and the category label, and the category label is used to indicate behavior intention information of the object;
a second recognition unit, configured to predict, by using a target model, a second subset of object behavior data without the feature words in the object behavior data set, to obtain the category labels of each object behavior data in the second subset of object behavior data, where the target model is obtained by training based on each object behavior data in the first subset of object behavior data and the category labels of all objects in the first subset of object behavior data.
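The division of work among the three units of claim 8 might be organised as in the following sketch. It assumes each record already carries an optional `feature_word` field and a numeric `amount` feature; both field names, the class names and the nearest-amount model are hypothetical stand-ins rather than the apparatus described in the application.

```python
class NearestAmountModel:
    """Hypothetical stand-in for the target model: predicts the label of
    the training record with the closest amount."""

    def fit(self, amounts, labels):
        self.samples = list(zip(amounts, labels))
        return self

    def predict(self, amounts):
        return [min(self.samples, key=lambda s: abs(s[0] - a))[1]
                for a in amounts]


class CategoryLabelApparatus:
    def __init__(self, target_dictionary, model):
        self.target_dictionary = target_dictionary
        self.model = model

    def acquire(self, source):
        """Acquisition unit: collect the object behavior data set."""
        return list(source)

    def recognise_first(self, records):
        """First recognition unit: label records through the dictionary."""
        labelled, unlabelled = [], []
        for record in records:
            label = self.target_dictionary.get(record.get("feature_word"))
            if label is None:
                unlabelled.append(record)
            else:
                labelled.append((record, label))
        return labelled, unlabelled

    def recognise_second(self, labelled, unlabelled):
        """Second recognition unit: train on the first subset, then
        predict labels for the second subset."""
        if not labelled or not unlabelled:
            return []
        self.model.fit([r["amount"] for r, _ in labelled],
                       [label for _, label in labelled])
        predictions = self.model.predict([r["amount"] for r in unlabelled])
        return list(zip(unlabelled, predictions))

    def run(self, source):
        records = self.acquire(source)
        labelled, unlabelled = self.recognise_first(records)
        return labelled + self.recognise_second(labelled, unlabelled)
```

Calling `run` returns `(record, label)` pairs, with dictionary-labelled records first and model-predicted records after them.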
9. A processor for running a program, wherein the program, when run, performs the method of any one of claims 1 to 7.
10. An electronic device comprising one or more processors and memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
CN202310771876.5A 2023-06-27 2023-06-27 Identification prediction method and device for category labels, processor and electronic equipment Pending CN116821759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310771876.5A CN116821759A (en) 2023-06-27 2023-06-27 Identification prediction method and device for category labels, processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310771876.5A CN116821759A (en) 2023-06-27 2023-06-27 Identification prediction method and device for category labels, processor and electronic equipment

Publications (1)

Publication Number Publication Date
CN116821759A (en) 2023-09-29

Family

ID=88114067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310771876.5A Pending CN116821759A (en) 2023-06-27 2023-06-27 Identification prediction method and device for category labels, processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN116821759A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117092525A (en) * 2023-10-20 2023-11-21 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment
CN117092525B (en) * 2023-10-20 2024-01-09 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment

Similar Documents

Publication Publication Date Title
CN111291816A (en) Method and device for carrying out feature processing aiming at user classification model
Voican Credit Card Fraud Detection using Deep Learning Techniques.
CN115545886A (en) Overdue risk identification method, overdue risk identification device, overdue risk identification equipment and storage medium
CN116821759A (en) Identification prediction method and device for category labels, processor and electronic equipment
CN112990989B (en) Value prediction model input data generation method, device, equipment and medium
El Qadi et al. Feature contribution alignment with expert knowledge for artificial intelligence credit scoring
CN112508684B (en) Collecting-accelerating risk rating method and system based on joint convolutional neural network
CN112487284A (en) Bank customer portrait generation method, equipment, storage medium and device
CN112926989B (en) Bank loan risk assessment method and equipment based on multi-view integrated learning
CN114612239A (en) Stock public opinion monitoring and wind control system based on algorithm, big data and artificial intelligence
CN113052512A (en) Risk prediction method and device and electronic equipment
Lee et al. Application of machine learning in credit risk scorecard
CN112115258A (en) User credit evaluation method, device, server and storage medium
CN113298448B (en) Lease index analysis method and system based on Internet and cloud platform
Preetham et al. A Stacked Model for Approving Bank Loans
KR102409019B1 (en) System and method for risk assessment of financial transactions and computer program for the same
CN108520042B (en) System and method for realizing suspect case-involved role calibration and role evaluation in detection work
Amara Agglomeration and firm performance in times of economic turmoil: Evidence from Tunisian firm‐level data
CN117688181A (en) Text data clustering method and device, storage medium and electronic equipment
CN116485446A (en) Service data processing method and device, processor and electronic equipment
Das et al. Maximizing Customer Base By Forecasting The Most Profitable Customers Using Logistic Regression
Chen et al. Construction of Bank Credit White List Access System Based on Grey Clustering Algorithm
CN117764692A (en) Method for predicting credit risk default probability
Wei et al. A new dynamic credit scoring model based on the objective cluster analysis
CN112668796A (en) Money return prediction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination