CN117556256A

CN117556256A - Private domain service label screening system and method based on big data

Info

Publication number: CN117556256A
Application number: CN202311528120.4A
Authority: CN
Inventors: 张东晴; 王鹏飞
Original assignee: Nanjing Small Fission Network Technology Co ltd
Current assignee: Nanjing Small Fission Network Technology Co ltd
Priority date: 2023-11-16
Filing date: 2023-11-16
Publication date: 2024-02-13

Abstract

The invention relates to the technical field of big data, in particular to a private domain business label screening system and method based on big data. The system comprises: a data collection unit for collecting user data from different channels and platforms; a data integration unit for integrating and uniformly formatting the collected data; a user portrait unit for integrating multiple dimensions of user data through data mining and analysis based on big data technology to form a complete user portrait; a label management unit for updating and managing different labels in real time; and a label screening unit for screening out the labels that meet the conditions according to the user portrait and the label definitions. The invention standardizes heterogeneous user data based on standard scores, which ensures data uniformity and provides an accurate data basis for generating user portraits; it also forms complete user portraits based on big data technology, thereby improving the quality of the algorithm model and the accuracy of the labels.

Description

Private domain service label screening system and method based on big data

Technical Field

The invention relates to the technical field of big data, in particular to a private domain business label screening system and a private domain business label screening method based on big data.

Background

With the advent of the digital age, businesses face vast amounts of user data, including user identity information, behavioral information, transaction data, and so on. Managing and using this data effectively has become key to how enterprises operate private domain traffic. A tag screening system can help businesses integrate, analyze, and use such data to better understand user needs and preferences. In the prior art, because user data comes from different channels and platforms, data formats and standards may differ, which makes data integration difficult; in addition, errors in data quality and in the algorithms lead to low label accuracy. To address these problems, the present invention proposes a private domain service tag screening system and method based on big data.

Disclosure of Invention

The invention aims to solve the defects in the background technology by providing a private domain service label screening system and a private domain service label screening method based on big data.

The technical scheme adopted by the invention is as follows:

A private domain business label screening system based on big data comprises:

a data collection unit: for collecting data of users from different channels and platforms;

a data integration unit: the system is used for integrating and uniformly formatting the collected data;

a user portrait unit: the method is used for integrating a plurality of dimension data of the user through data mining and analysis based on a big data technology to form a complete user portrait;

label management unit: the method is used for updating and managing different labels in real time according to service requirements and user characteristics;

tag screening unit: the method is used for screening out the labels meeting the conditions according to the updating management of the user portrait and the labels.

As a preferred technical scheme of the invention: the data integration unit includes:

a data cleaning module: the data cleaning module is used for removing repeated, invalid or erroneous data;

a data merging module: the data merging module is used for integrating a plurality of data sources together to form a complete data set;

a data formatting module: the data formatting module is used for carrying out standardized processing on the integrated data and ensuring the uniformity of the data.

As a preferred technical scheme of the invention: the data formatting module is based on the standard score (Z-score), and the formula is as follows:

$$z = \frac{x - \mu}{\sigma}$$

where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.

As a preferred technical scheme of the invention: the user portrait unit includes:

the feature extraction module is used for carrying out statistical analysis on the data and extracting useful information features;

the labeling module is used for classifying and marking the users according to the statistical analysis result;

the portrait assessment module is used for assessing and optimizing the constructed portrait and ensuring the quality and usability of the user portrait;

the feature extraction module is based on a random forest feature selection algorithm, the judgment criterion in the random forest feature selection algorithm is the Gini coefficient, and the formula of the Gini coefficient is as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}$$

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

wherein A is a feature that divides the data set D into $D_1$ and $D_2$, k indexes the K classes, and $p_k$ is the probability that a sample belongs to the k-th class.

As a preferred technical scheme of the invention: the classification labels in the labeling module are based on the Pearson correlation coefficient, which is expressed as follows:

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big[(x - E(x))(y - E(y))\big]}{\sigma_x \sigma_y}$$

wherein cov(x, y) is the covariance of variables x and y, $\sigma_x$ and $\sigma_y$ are the standard deviations of variables x and y, respectively, and E(i) represents the expected value of i;

the portrait assessment module is based on the F-score index, and the formula of the F-score index is as follows:

$$F = \frac{(1 + \alpha^{2})\,P\,R}{\alpha^{2} P + R}$$

wherein α is a weight, P is the precision, R is the recall rate, and the formulas of P and R are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP is the number of samples that are actually positive and are classified as positive, FP is the number of samples that are actually negative but are classified as positive, and FN is the number of samples that are actually positive but are classified as negative.

As a preferred technical scheme of the invention: the tag management unit comprises real-time updating and maintenance management of the tag, application management of the tag, security management of the tag and usage analysis of the tag.

As a preferred technical scheme of the invention: the label screening unit executes the following steps:

selecting a corresponding label screener according to the label type to be screened;

setting corresponding screening conditions according to the content to be screened;

applying the screening conditions to the tag data to obtain tag data meeting the conditions;

and applying the filtered label data to the corresponding service scene.

The method for screening the private domain business labels based on the big data comprises the following steps:

s1: collecting data of users from different channels and platforms;

s2: integrating and uniformly formatting the collected user data;

s3: integrating a plurality of dimension data of the user to form a complete user portrait;

s4: updating and managing different labels according to service requirements and user characteristics;

s5: and screening out labels meeting the conditions according to the updating management of the user portrait and the labels.
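Steps S1 to S5 can be read as a simple data pipeline. The following is a minimal Python sketch of that flow, not the patented implementation: the record format, the function names (`collect`, `integrate`, `build_portraits`, `manage_tags`, `screen`) and the two example tags are all hypothetical and chosen only for illustration.

```python
from typing import Callable

# Hypothetical record format: one dict per user event (not defined in the source).
Record = dict
TagRule = Callable[[dict], bool]


def collect(sources: dict[str, list[Record]]) -> list[Record]:
    """S1: gather records from every channel and remember the originating channel."""
    return [{**rec, "channel": name} for name, recs in sources.items() for rec in recs]


def integrate(records: list[Record]) -> dict[str, Record]:
    """S2: merge all records of a user into one uniformly keyed row."""
    merged: dict[str, Record] = {}
    for rec in records:
        uid = rec["user_id"]
        row = merged.setdefault(uid, {"user_id": uid, "channels": set(), "spend": 0.0})
        row["channels"].add(rec["channel"])
        row["spend"] += float(rec.get("spend", 0.0))
    return merged


def build_portraits(rows: dict[str, Record]) -> dict[str, dict]:
    """S3: derive a (toy) multi-dimensional portrait for each user."""
    return {
        uid: {"spend": row["spend"], "n_channels": len(row["channels"])}
        for uid, row in rows.items()
    }


def manage_tags() -> dict[str, TagRule]:
    """S4: return the current tag definitions (illustrative rules and thresholds)."""
    return {
        "high_spend": lambda p: p["spend"] >= 100.0,
        "multi_channel": lambda p: p["n_channels"] >= 2,
    }


def screen(portraits: dict[str, dict], tags: dict[str, TagRule]) -> dict[str, list[str]]:
    """S5: keep, for each user, the tags whose condition the portrait meets."""
    return {
        uid: sorted(name for name, rule in tags.items() if rule(portrait))
        for uid, portrait in portraits.items()
    }


def run_pipeline(sources: dict[str, list[Record]]) -> dict[str, list[str]]:
    """Chain S1 to S5 in order."""
    return screen(build_portraits(integrate(collect(sources))), manage_tags())
```

Keeping each step as a separate function mirrors the unit structure of the system, so any one step can be replaced (for example, a real portrait model in place of `build_portraits`) without touching the others.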

Compared with the prior art, the private domain service label screening system and method based on big data provided by the invention have the beneficial effects that:

The invention standardizes heterogeneous user data based on standard scores, which ensures data uniformity and provides an accurate data basis for generating user portraits. It forms complete user portraits based on big data technology, applying random forest feature extraction, the Pearson correlation coefficient, and the F-score index to the users' multidimensional data. This improves the quality of the algorithm model and the accuracy of the labels, and in turn improves later marketing effect and customer satisfaction.

Drawings

FIG. 1 is a block diagram of the overall system of the present invention;

FIG. 2 is a system block diagram of a data integration unit of the present invention;

FIG. 3 is a system block diagram of a user portrait unit of the present invention;

FIG. 4 is a flow chart of the method of the present invention.

The reference numerals in the figures denote:

1. a data collection unit;

2. a data integration unit;

21. a data cleaning module; 22. a data merging module; 23. a data formatting module;

3. a user portrait unit;

31. a feature extraction module; 32. a labeling module; 33. a portrait assessment module;

4. a tag management unit;

5. a tag screening unit.

Detailed Description

It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other. The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

Referring to fig. 1-3, the present invention provides a private domain service tag screening system based on big data, comprising:

data collection unit 1: for collecting data of users from different channels and platforms;

the data integration unit 2: the system is used for integrating and uniformly formatting the collected data;

user portrait unit 3: the method is used for integrating a plurality of dimension data of the user through data mining and analysis based on a big data technology to form a complete user portrait;

label management unit 4: the method is used for updating and managing different labels in real time according to service requirements and user characteristics;

tag screening unit 5: the method is used for screening out the labels meeting the conditions according to the updating management of the user portrait and the labels.

The data integration unit 2 includes:

data cleaning module 21: the data cleansing module 21 is configured to remove duplicate, invalid or erroneous data;

data merge module 22: the data merging module 22 is configured to integrate a plurality of data sources together to form a complete data set;

data formatting module 23: the data formatting module 23 is configured to perform normalization processing on the integrated data, so as to ensure uniformity of the data.
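The cleaning and merging modules can be sketched as follows. This is a minimal illustration under stated assumptions: records are plain dicts with hashable values, a record is treated as "invalid" when it lacks a `user_id` or carries a negative `amount`, and sources are merged by `user_id`. The source does not specify any of these rules or field names.

```python
def clean(records):
    """Remove invalid records and exact duplicates (hypothetical validity rules)."""
    seen = set()
    cleaned = []
    for rec in records:
        # Assumption: a missing user_id or a negative amount marks a bad record.
        if not rec.get("user_id") or rec.get("amount", 0) < 0:
            continue
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the whole record
        if key in seen:
            continue  # exact duplicate of a record already kept
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def merge(*sources):
    """Combine several cleaned sources into one data set keyed by user_id.

    Fields from later sources are added to, or overwrite, those from earlier ones.
    """
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["user_id"], {}).update(rec)
    return merged
```

For example, `merge(clean(web_records), clean(crm_records))` would yield one row per user that combines the fields from both channels.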

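The data formatting module 23 is based on the standard score (Z-score), with the following formula:

$$z = \frac{x - \mu}{\sigma}$$

where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.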

The user portrait unit 3 comprises:

a feature extraction module 31, wherein the feature extraction module 31 is used for performing statistical analysis on data and extracting useful information features;

a labeling module 32, wherein the labeling module 32 is used for classifying and labeling users according to the result of statistical analysis;

a portrait assessment module 33, wherein the portrait assessment module 33 is used for assessing and optimizing the constructed portrait and ensuring the quality and usability of the user portrait.

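The feature extraction module 31 is based on a random forest feature selection algorithm, wherein the judgment criterion in the random forest feature selection algorithm is the Gini coefficient, and the formula of the Gini coefficient is as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}$$

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

wherein A is a feature that divides the data set D into $D_1$ and $D_2$, k indexes the K classes, and $p_k$ is the probability that a sample belongs to the k-th class.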

The classification labels in the labeling module 32 are based on the Pearson correlation coefficient, which is formulated as follows:

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big[(x - E(x))(y - E(y))\big]}{\sigma_x \sigma_y}$$

wherein cov(x, y) is the covariance of variables x and y, $\sigma_x$ and $\sigma_y$ are the standard deviations of variables x and y, respectively, and E(i) represents the expected value of i.

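The portrait assessment module 33 is based on the F-score index, and the formula of the F-score index is as follows:

$$F = \frac{(1 + \alpha^{2})\,P\,R}{\alpha^{2} P + R}$$

wherein α is a weight, P is the precision, and R is the recall rate, with

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP is the number of samples that are actually positive and are classified as positive, FP is the number of samples that are actually negative but are classified as positive, and FN is the number of samples that are actually positive but are classified as negative.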

The tag management unit 4 includes real-time update and maintenance management of tags, application management of tags, security management of tags, and usage analysis of tags.

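The label screening unit 5 performs the following steps:

selecting a corresponding label screener according to the label type to be screened;

setting corresponding screening conditions according to the content to be screened;

applying the screening conditions to the tag data to obtain tag data meeting the conditions;

and applying the filtered label data to the corresponding service scene.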

Referring to fig. 4, a private domain service tag screening method based on big data is provided, which includes the following steps:

s1: collecting data of users from different channels and platforms;

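s2: integrating and uniformly formatting the collected user data;

s3: integrating a plurality of dimension data of the user to form a complete user portrait;

s4: updating and managing different labels in real time according to service requirements and user characteristics;

s5: screening out labels meeting the conditions according to the user portrait and the updating management of the labels.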

In this embodiment, a private domain is a domain that an enterprise establishes and operates itself: the enterprise fully controls it and can reach its users repeatedly at low or even zero cost. Within this domain the enterprise can acquire customers in various ways and then retain them, convert them, and drive repeat purchases, so that it can better understand and manage user data and improve marketing effect and customer satisfaction. To complete private domain service tag screening, the data collection unit 1 first collects user data from different channels and platforms, including the users' basic information, behavior information, transaction data, and so on. Such data may be obtained through channels and platforms such as official websites, social media, and CRM systems. The data integration unit 2 then integrates and uniformly formats the collected user data in the following steps: the data cleaning module 21 cleans the collected data, removing duplicate, invalid, or erroneous data to ensure data quality and accuracy; the data merging module 22 integrates multiple data sources into a complete data set; and the data formatting module 23 standardizes the integrated data to ensure its uniformity. Standardization is based on the standard score (Z-score), a common data processing method that converts data of different magnitudes into Z-scores on a uniform scale for comparison. It improves data comparability and, although the standardized values are less directly interpretable than the raw ones, reduces the influence of human factors on the data, making the data more objective. The formula is as follows:

$$z = \frac{x - \mu}{\sigma}$$

where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values. Z-score standardization is simple to use and easy to compute, applies to numerical data, and is not affected by the magnitude of the data.
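A small worked example of Z-score standardization, using only the Python standard library. It is a sketch under one stated assumption: the population standard deviation is used, since the source does not say whether the population or the sample form is intended. The feature names and values are illustrative only.

```python
import statistics


def z_scores(values):
    """Return the Z-score (x - mu) / sigma of every value in `values`."""
    mu = statistics.fmean(values)
    # Assumption: population standard deviation (the source does not specify).
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # All values identical: define every score as 0.0 (a convention chosen here).
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Two features on very different scales (illustrative data only).
monthly_spend = [120.0, 80.0, 100.0, 60.0, 140.0]
visits = [6.0, 4.0, 5.0, 3.0, 7.0]

spend_z = z_scores(monthly_spend)
visits_z = z_scores(visits)
```

Because `visits` here is just `monthly_spend` divided by 20, the two standardized lists coincide up to floating-point error, which illustrates the claim that the result does not depend on the magnitude of the raw data.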

Next, the user portrait unit 3 integrates multiple dimensions of user data to form a complete user portrait, in the following steps. The feature extraction module 31 performs statistical analysis on the standardized data and extracts useful information features, using a random forest feature selection algorithm. A random forest is an ensemble classifier: a strong classifier formed by training multiple base classifiers and combining them by voting. Non-leaf nodes in a random forest are chosen according to the relevance and importance of the feature variables, and the indexes for measuring variable relevance are information gain and the Gini coefficient. The computation used to choose the non-leaf nodes of the decision trees can therefore be applied to feature selection. A common criterion in random forest feature selection is the Gini coefficient, whose main function is to measure the uncertainty associated with each feature variable so that the feature with the smallest uncertainty can be selected. When the Gini coefficient reaches its minimum, all samples belong to one class. In other words, the smaller the Gini coefficient, the smaller the uncertainty and the better the feature. The formula of the Gini coefficient is as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}$$

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

wherein A is a feature that divides the data set D into $D_1$ and $D_2$, k indexes the K classes, and $p_k$ is the probability that a sample belongs to the k-th class. As a feature selection algorithm, the random forest suits high-dimensional feature sets and large-scale data scenarios, and has the advantage of being insensitive to missing (default) values.
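The Gini criterion can be illustrated with a short sketch. It assumes the standard CART form, Gini(D) = 1 − Σ p_k² for a node and the size-weighted sum over the two subsets for a split, and it scores each feature by its best single threshold split. This shows only the split criterion that each tree of a random forest would apply, not a full random forest; the function names and sample data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(feature_values, labels, threshold):
    """Weighted Gini of the split D1 = {x <= threshold}, D2 = {x > threshold}."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(features, labels):
    """Score each feature by its lowest weighted Gini and return (best_name, scores).

    `features` maps a feature name to its list of numeric values.
    """
    scores = {}
    for name, values in features.items():
        # Candidate thresholds: every distinct value except the largest,
        # so that both sides of the split are non-empty.
        candidates = sorted(set(values))[:-1]
        scores[name] = min(
            (gini_split(values, labels, t) for t in candidates),
            default=gini(labels),  # constant feature: no useful split exists
        )
    return min(scores, key=scores.get), scores


# Illustrative data: "visits" separates the classes perfectly, "age" does not.
labels = ["buy", "buy", "skip", "skip"]
features = {"visits": [5, 6, 1, 2], "age": [30, 20, 31, 21]}
best, scores = best_feature(features, labels)
```

A feature whose best split drives the weighted Gini to 0 separates the classes completely, which matches the statement that a smaller Gini coefficient means less uncertainty and a better feature.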

The labeling module 32 then classifies and labels users according to the results of the statistical analysis. The classification labels are based on the Pearson correlation coefficient, which is formulated as follows:

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big[(x - E(x))(y - E(y))\big]}{\sigma_x \sigma_y}$$

wherein cov(x, y) is the covariance of variables x and y, $\sigma_x$ and $\sigma_y$ are the standard deviations of variables x and y, respectively, and E(i) represents the expected value of i. The Pearson correlation coefficient lies in the interval [-1, 1]. A coefficient of 0 indicates that the two variables have no linear correlation; the closer the coefficient is to 1, the stronger the positive correlation, and the closer it is to -1, the stronger the negative correlation. In short, the larger the coefficient, the higher the correlation between the two variables, and the smaller the coefficient, the lower the correlation. User data variables with higher correlation are classified into the same class.
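The grouping of highly correlated variables can be sketched as follows. The Pearson computation follows the definition cov(x, y) / (σ_x σ_y); the greedy grouping rule and the 0.8 threshold are assumptions made for illustration, since the source only says that variables with higher correlation are put into the same class.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        # Undefined for a constant variable; treated as uncorrelated (assumption).
        return 0.0
    return cov / (sigma_x * sigma_y)


def group_correlated(variables, threshold=0.8):
    """Greedily group variables whose correlation with a group's first member
    is at least `threshold` (hypothetical rule and threshold)."""
    groups = []
    for name, values in variables.items():
        for group in groups:
            if pearson(variables[group[0]], values) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no sufficiently correlated group found
    return groups


# Illustrative data only.
variables = {
    "visits": [1, 2, 3, 4, 5],
    "spend": [10, 20, 30, 40, 50],
    "returns": [5, 4, 3, 2, 1],
}
groups = group_correlated(variables)
```

Here `visits` and `spend` move together and end up in one group, while `returns` is negatively correlated with both and forms its own group.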

Finally, the portrait assessment module 33 evaluates and optimizes the user portrait to ensure its quality and usability. The portrait assessment module 33 is based on the F-score index, whose formula is as follows:

$$F = \frac{(1 + \alpha^{2})\,P\,R}{\alpha^{2} P + R}$$

wherein α is a weight, P is the precision, and R is the recall rate, with

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP is the number of samples that are actually positive and are classified as positive, FP is the number of samples that are actually negative but are classified as positive, and FN is the number of samples that are actually positive but are classified as negative.

Recall and precision are both evaluation indexes. Recall examines how many of the positive examples in the original sample are retrieved, that is, it evaluates coverage; whether a higher recall is better must be judged case by case. Precision, the P in the formula, measures how many of the samples classified as positive are actually positive. The related overall accuracy, the ratio of correctly classified samples to the total number of samples, is a relatively intuitive index, and in general the higher it is, the better the classification. However, this too depends on the situation. For example, when the numbers of positive and negative samples are extremely imbalanced and almost every sample is negative, a classifier can reach high accuracy during training and still misjudge the positive samples it later encounters. Because each of these indexes reflects only one aspect of a model's performance, none of them alone can comprehensively evaluate the model. The F-score index is more comprehensive than recall or precision alone, so using the F-score to weigh the quality of the algorithm gives a more intuitive overall evaluation.
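The imbalance problem described above can be made concrete with a short sketch. It assumes the commonly used weighted form F = (1 + α²)·P·R / (α²·P + R), with P = TP / (TP + FP) and R = TP / (TP + FN); the function names and the sample labels are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall, alpha=1.0):
    """Weighted F-score; alpha > 1 favours recall, alpha < 1 favours precision."""
    denom = alpha ** 2 * precision + recall
    return (1 + alpha ** 2) * precision * recall / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Imbalanced illustrative sample: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# A classifier that finds only one of the three positives.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

p, r = precision_recall(y_true, y_pred)
f1 = f_score(p, r)             # alpha = 1 weights precision and recall equally
acc = accuracy(y_true, y_pred)
```

In this sample the accuracy is 0.8 even though two of the three positive users are missed; the F-score with α = 1 is 0.5, which reflects the weak recall that accuracy hides.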

Once the user portrait is generated, the label management unit 4 updates and manages the different labels in real time according to business requirements and user characteristics. The management content comprises: application management of labels, which applies labels to actual business scenarios; real-time updating and maintenance of labels, which updates and maintains labels promptly as business requirements and data change, ensuring their accuracy and completeness; security management of labels, which protects the security and privacy of label data and ensures that it is not leaked or abused; and usage analysis of labels, which statistically analyzes how labels are used in order to understand their effect and value and to provide a reference for optimizing label management.
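A minimal in-memory sketch of three of these management functions (updating and maintenance, application, and usage analysis) is shown below. The `TagRegistry` class and its method names are hypothetical, and security management (access control, privacy protection) is deliberately left out because the source gives no detail on how it is realized.

```python
from collections import Counter
from datetime import datetime, timezone


class TagRegistry:
    """Hypothetical in-memory store for tag definitions and usage counts."""

    def __init__(self):
        self._tags = {}
        self._usage = Counter()

    def upsert(self, name, definition):
        """Updating and maintenance: create a tag or replace its definition."""
        self._tags[name] = {
            "definition": definition,
            "updated_at": datetime.now(timezone.utc),
        }

    def retire(self, name):
        """Maintenance: remove a tag that is no longer needed."""
        self._tags.pop(name, None)

    def use(self, name):
        """Application: fetch a definition for a business scenario and count the use."""
        if name not in self._tags:
            raise KeyError(name)
        self._usage[name] += 1
        return self._tags[name]["definition"]

    def usage_report(self):
        """Usage analysis: tags ordered from most used to least used."""
        return self._usage.most_common()
```

A usage report of this kind is one simple way to see which tags deliver value and which are candidates for retirement.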

Furthermore, the tag filtering unit 5 filters out the tags that match the condition based on the user profile and the updated management of the tags. The system can screen according to different labels, help enterprises to find target user groups quickly, and specifically execute the following steps: selecting a corresponding label screener according to the label type to be screened; setting corresponding screening conditions according to the content to be screened; applying the screening conditions to the tag data to obtain tag data meeting the conditions; and applying the filtered label data to the corresponding service scene. When the label screening is carried out, adjustment and optimization are needed according to actual conditions.
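The four screening steps can be sketched as follows. The two screener functions, the `SCREENERS` lookup table, and the tag names are hypothetical; the source does not define what a label screener looks like, so this is only one plausible reading.

```python
def numeric_screener(value, condition):
    """Keep values inside the closed range given by condition = (low, high)."""
    low, high = condition
    return low <= value <= high


def category_screener(value, condition):
    """Keep values that belong to the set of allowed categories."""
    return value in condition


# Step 1: one screener per tag type (hypothetical tag types).
SCREENERS = {"numeric": numeric_screener, "category": category_screener}


def screen_tags(tag_data, tag_name, tag_type, condition):
    """Return the user ids whose tag `tag_name` meets `condition`.

    `tag_data` maps a user id to that user's {tag name: value} dict.
    """
    screener = SCREENERS[tag_type]        # step 1: select the screener
    return [                              # step 3: apply the condition to the tag data
        user_id
        for user_id, tags in tag_data.items()
        if tag_name in tags and screener(tags[tag_name], condition)
    ]


# Illustrative tag data only.
tag_data = {
    "u1": {"monthly_spend": 320, "tier": "gold"},
    "u2": {"monthly_spend": 40, "tier": "silver"},
    "u3": {"tier": "gold"},
}

# Step 2: set the screening conditions; step 4 would hand the results
# to the business scenario, for example a targeted campaign.
big_spenders = screen_tags(tag_data, "monthly_spend", "numeric", (100, 500))
gold_users = screen_tags(tag_data, "tier", "category", {"gold"})
```

Users that lack the tag being screened (such as `u3` for `monthly_spend`) are simply skipped rather than treated as errors, which is a design choice made here for robustness.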

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims

1. A private domain business label screening system based on big data, characterized by comprising:

data collection unit (1): for collecting data of users from different channels and platforms;

data integration unit (2): the system is used for integrating and uniformly formatting the collected data;

user portrait unit (3): the method is used for integrating a plurality of dimension data of the user through data mining and analysis based on a big data technology to form a complete user portrait;

label management unit (4): the method is used for updating and managing different labels in real time according to service requirements and user characteristics;

label screening unit (5): the method is used for screening out the labels meeting the conditions according to the updating management of the user portrait and the labels.

2. The big data based private domain business label screening system of claim 1, wherein: the data integration unit (2) includes:

data cleaning module (21): the data cleaning module (21) is used for removing repeated, invalid or wrong data;

data merge module (22): the data merging module (22) is used for integrating a plurality of data sources together to form a complete data set;

a data formatting module (23): the data formatting module (23) is used for carrying out standardized processing on the integrated data so as to ensure the uniformity of the data.

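3. The big data based private domain business label screening system of claim 2, wherein: the data formatting module (23) is based on the standard score (Z-score), the formula being as follows:

$$z = \frac{x - \mu}{\sigma}$$

where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.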

4. The big data based private domain business label screening system of claim 1, wherein: the user portrait unit (3) comprises:

a feature extraction module (31), wherein the feature extraction module (31) is used for carrying out statistical analysis on data and extracting useful information features;

a labeling module (32), wherein the labeling module (32) is used for classifying and marking users according to the result of statistical analysis;

a portrait assessment module (33), wherein the portrait assessment module (33) is used for assessing and optimizing the constructed portrait and ensuring the quality and usability of the user portrait;

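the classification labels in the labeling module (32) are based on the Pearson correlation coefficient, which is formulated as follows:

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big[(x - E(x))(y - E(y))\big]}{\sigma_x \sigma_y}$$

wherein cov(x, y) is the covariance of variables x and y, $\sigma_x$ and $\sigma_y$ are the standard deviations of variables x and y, respectively, and E(i) represents the expected value of i;

the portrait assessment module (33) is based on the F-score index, and the formula of the F-score index is as follows:

$$F = \frac{(1 + \alpha^{2})\,P\,R}{\alpha^{2} P + R}$$

wherein α is a weight, P is the precision, and R is the recall rate, with

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

wherein TP is the number of samples that are actually positive and are classified as positive, FP is the number of samples that are actually negative but are classified as positive, and FN is the number of samples that are actually positive but are classified as negative;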

the feature extraction module (31) is based on a random forest feature selection algorithm, wherein the judgment criterion in the random forest feature selection algorithm is the Gini coefficient, and the Gini coefficient is expressed as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}$$

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

wherein A is a feature that divides the data set D into $D_1$ and $D_2$; k indexes the K classes, and $p_k$ is the probability that a sample belongs to the k-th class.

5. The big data based private domain business label screening system of claim 1, wherein: the tag management unit (4) comprises real-time updating and maintenance management of the tag, application management of the tag, security management of the tag and usage analysis of the tag.

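6. The big data based private domain business label screening system of claim 1, wherein: the label screening unit (5) executes the following steps:

selecting a corresponding label screener according to the label type to be screened;

setting corresponding screening conditions according to the content to be screened;

applying the screening conditions to the tag data to obtain tag data meeting the conditions;

and applying the filtered label data to the corresponding service scene.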

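7. A private domain service label screening method based on big data, using the private domain service label screening system based on big data as set forth in any one of claims 1 to 6, characterized by comprising the following steps:

s1: collecting data of users from different channels and platforms;

s2: integrating and uniformly formatting the collected user data;

s3: integrating a plurality of dimension data of the user to form a complete user portrait;

s4: updating and managing different labels in real time according to service requirements and user characteristics;

s5: screening out labels meeting the conditions according to the user portrait and the updating management of the labels.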