CN117556256A - Private domain service label screening system and method based on big data - Google Patents
Private domain service label screening system and method based on big data
- Publication number
- CN117556256A (CN202311528120.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- user
- module
- label
- private domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of big data, and in particular to a private domain service label screening system and method based on big data. The system comprises: a data collection unit for collecting user data from different channels and platforms; a data integration unit for integrating the collected data and formatting it uniformly; a user portrait unit for integrating the user's data across multiple dimensions, through data mining and analysis based on big data technology, to form a complete user portrait; a label management unit for updating and managing different labels in real time; and a label screening unit for screening out the labels that meet the conditions according to the user portrait and the label definitions. The invention standardizes heterogeneous user data using standard scores, which ensures the uniformity of the data and provides an accurate data basis for generating user portraits. It also forms complete data portraits based on big data technology, thereby improving the quality of the algorithm model and the accuracy of the labels.
Description
Technical Field
The invention relates to the technical field of big data, and in particular to a private domain service label screening system and method based on big data.
Background
With the advent of the digital age, enterprises face vast amounts of user data, including user identity information, behavioral information, transaction data, and so on. How to manage and use this data effectively has become key to an enterprise's operation of private domain traffic. A label screening system can help enterprises integrate, analyze, and use such data to better understand user needs and preferences. In the prior art, because user data comes from different channels and platforms, data formats and standards may differ, which makes data integration difficult, and errors in data quality and in the algorithms lead to low label accuracy. In view of these problems, the present invention proposes a private domain service label screening system and method based on big data.
Disclosure of Invention
The invention aims to overcome the shortcomings described in the background by providing a private domain service label screening system and method based on big data.
The technical scheme adopted by the invention is as follows:
A private domain service label screening system based on big data, comprising:
a data collection unit, for collecting user data from different channels and platforms;
a data integration unit, for integrating the collected data and formatting it uniformly;
a user portrait unit, for integrating the user's data across multiple dimensions, through data mining and analysis based on big data technology, to form a complete user portrait;
a label management unit, for updating and managing different labels in real time according to service requirements and user characteristics;
a label screening unit, for screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
As a preferred technical solution of the invention, the data integration unit includes:
a data cleaning module, for removing duplicate, invalid, or erroneous data;
a data merging module, for integrating a plurality of data sources to form a complete data set;
a data formatting module, for standardizing the integrated data to ensure its uniformity.
As a preferred technical solution of the invention, the data formatting module is based on the standard score (Z-score), calculated as follows:
$$ z = \frac{x - \mu}{\sigma} $$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.
As a preferred technical solution of the invention, the user portrait unit includes:
a feature extraction module, for performing statistical analysis on the data and extracting useful information features;
a labeling module, for classifying and labeling users according to the results of the statistical analysis;
a portrait evaluation module, for evaluating and optimizing the constructed portrait to ensure the quality and usability of the user portrait;
the feature extraction module is based on a random forest feature selection algorithm whose judgment criterion is the Gini coefficient, calculated as follows:
$$ \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2), \qquad \mathrm{Gini}(D) = 1 - \sum_{k} p_k^2 $$
where A is a feature that divides the data set D into D_1 and D_2, k indexes the classes, and p_k is the probability that a sample belongs to the k-th class.
As a preferred technical solution of the invention, the classification labels in the labeling module are based on the Pearson correlation coefficient, expressed as follows:
$$ \rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big((x - E(x))(y - E(y))\big)}{\sigma_x \sigma_y} $$
where cov(x, y) is the covariance of variables x and y, σ_x and σ_y are the standard deviations of x and y, respectively, and E(i) denotes the expected value of i;
the portrait evaluation module is based on the F-score index, calculated as follows:
$$ F = \frac{(\alpha^2 + 1)\,P\,R}{\alpha^2 P + R} $$
where α is a weight, P is the accuracy rate, and R is the recall rate; P and R are calculated as follows:
$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} $$
where TP is the number of samples that are actually correct and are classified as correct, FP is the number of samples that are actually incorrect but are classified as correct, and FN is the number of samples that are actually correct but are classified as incorrect.
As a preferred technical solution of the invention, the label management unit covers real-time updating and maintenance of labels, application management of labels, security management of labels, and usage analysis of labels.
As a preferred technical solution of the invention, the label screening unit performs the following steps:
selecting the corresponding label screener according to the type of label to be screened;
setting the corresponding screening conditions according to the content to be screened;
applying the screening conditions to the label data to obtain the label data that meets the conditions;
applying the screened label data to the corresponding service scenario.
A private domain service label screening method based on big data comprises the following steps:
S1: collecting user data from different channels and platforms;
S2: integrating the collected user data and formatting it uniformly;
S3: integrating the user's data across multiple dimensions to form a complete user portrait;
S4: updating and managing different labels according to service requirements and user characteristics;
S5: screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
Compared with the prior art, the private domain service label screening system and method based on big data provided by the invention have the following beneficial effects:
The invention standardizes heterogeneous user data using standard scores, which ensures the uniformity of the data and provides an accurate data basis for generating user portraits. It forms complete data portraits based on big data technology, applying random forest feature extraction, the Pearson correlation coefficient, and the F-score index to the user's multidimensional data. This improves the quality of the algorithm model and the accuracy of the labels, and in turn improves subsequent marketing results and customer satisfaction.
Drawings
FIG. 1 is a block diagram of the overall system of the present invention;
FIG. 2 is a system block diagram of a data integration unit of the present invention;
FIG. 3 is a system block diagram of a user portrait unit of the present invention;
FIG. 4 is a flow chart of the method of the present invention.
The reference numerals in the figures denote:
1. data collection unit;
2. data integration unit;
21. data cleaning module; 22. data merging module; 23. data formatting module;
3. user portrait unit;
31. feature extraction module; 32. labeling module; 33. portrait evaluation module;
4. label management unit;
5. label screening unit.
Detailed Description
It should be noted that, provided there is no conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other. The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Referring to FIGS. 1-3, the present invention provides a private domain service label screening system based on big data, comprising:
a data collection unit 1, for collecting user data from different channels and platforms;
a data integration unit 2, for integrating the collected data and formatting it uniformly;
a user portrait unit 3, for integrating the user's data across multiple dimensions, through data mining and analysis based on big data technology, to form a complete user portrait;
a label management unit 4, for updating and managing different labels in real time according to service requirements and user characteristics;
a label screening unit 5, for screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
The data integration unit 2 includes:
a data cleaning module 21, for removing duplicate, invalid, or erroneous data;
a data merging module 22, for integrating a plurality of data sources to form a complete data set;
a data formatting module 23, for standardizing the integrated data to ensure its uniformity.
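The cleaning and merging modules can be illustrated with a minimal Python sketch. The patent does not specify record formats or merge rules, so everything here is an assumption for illustration: records are plain dictionaries, `user_id` is a hypothetical join key, a record counts as invalid when a required field is missing or empty, and later sources overwrite earlier ones on conflicting fields.

```python
def clean_records(records, required_fields):
    """Drop records missing a required field, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(name) in (None, "") for name in required_fields):
            continue  # invalid: a required field is absent or empty
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue  # duplicate of a record already kept
        seen.add(fingerprint)
        cleaned.append(record)
    return cleaned


def merge_sources(sources, key="user_id"):
    """Combine per-source records into one record per user, keyed by `key`."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return merged


# Hypothetical data from two channels.
website = [
    {"user_id": 1, "age": 30},
    {"user_id": 1, "age": 30},      # duplicate
    {"user_id": None, "age": 22},   # invalid: no user id
]
crm = [{"user_id": 1, "orders": 4}, {"user_id": 2, "orders": 1}]

dataset = merge_sources([
    clean_records(website, ["user_id"]),
    clean_records(crm, ["user_id"]),
])
```

In this sketch `dataset` holds one combined record per user; a production system would likely need fuzzy duplicate detection and explicit conflict rules, which the source does not describe.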
The data formatting module 23 is based on the standard score (Z-score), calculated as follows:
$$ z = \frac{x - \mu}{\sigma} $$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.
The user portrait unit 3 includes:
a feature extraction module 31, for performing statistical analysis on the data and extracting useful information features;
a labeling module 32, for classifying and labeling users according to the results of the statistical analysis;
a portrait evaluation module 33, for evaluating and optimizing the constructed portrait to ensure the quality and usability of the user portrait.
The feature extraction module 31 is based on a random forest feature selection algorithm whose judgment criterion is the Gini coefficient, calculated as follows:
$$ \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2), \qquad \mathrm{Gini}(D) = 1 - \sum_{k} p_k^2 $$
where A is a feature that divides the data set D into D_1 and D_2, k indexes the classes, and p_k is the probability that a sample belongs to the k-th class.
The classification labels in the labeling module 32 are based on the Pearson correlation coefficient, expressed as follows:
$$ \rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big((x - E(x))(y - E(y))\big)}{\sigma_x \sigma_y} $$
where cov(x, y) is the covariance of variables x and y, σ_x and σ_y are the standard deviations of x and y, respectively, and E(i) denotes the expected value of i.
The portrait evaluation module 33 is based on the F-score index, calculated as follows:
$$ F = \frac{(\alpha^2 + 1)\,P\,R}{\alpha^2 P + R} $$
where α is a weight, P is the accuracy rate, and R is the recall rate; P and R are calculated as follows:
$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} $$
where TP is the number of samples that are actually correct and are classified as correct, FP is the number of samples that are actually incorrect but are classified as correct, and FN is the number of samples that are actually correct but are classified as incorrect.
The tag management unit 4 includes real-time update and maintenance management of tags, application management of tags, security management of tags, and usage analysis of tags.
The label screening unit 5 performs the following procedure:
selecting a corresponding label screener according to the label type to be screened;
setting corresponding screening conditions according to the content to be screened;
applying the screening conditions to the tag data to obtain tag data meeting the conditions;
applying the screened label data to the corresponding service scenario.
Referring to FIG. 4, a private domain service label screening method based on big data is provided, which includes the following steps:
S1: collecting user data from different channels and platforms;
S2: integrating the collected user data and formatting it uniformly;
S3: integrating the user's data across multiple dimensions to form a complete user portrait;
S4: updating and managing different labels in real time according to service requirements and user characteristics;
S5: screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
In this embodiment, the private domain service is a field that an enterprise builds and operates itself, that the enterprise fully controls, and in which it can reach users repeatedly at low cost or even for free. In this field, the enterprise can acquire customers in various ways and retain, convert, and win repeat purchases from them, so that it can better understand and manage user data and improve marketing results and customer satisfaction. To perform private domain service label screening, the data collection unit 1 first collects user data from different channels and platforms, including the users' basic information, behavior information, transaction data, and so on. Such data may be obtained through channels and platforms such as official websites, social media, and CRM systems. The data integration unit 2 then integrates the collected user data and formats it uniformly, as follows: the data cleaning module 21 cleans the collected data, removing duplicate, invalid, or erroneous data to ensure data quality and accuracy; the data merging module 22 integrates the multiple data sources into a complete data set; and the data formatting module 23 standardizes the integrated data to ensure its uniformity. Standardization is based on the standard score (Z-score), a common data processing method that converts data of different orders of magnitude onto a uniform Z-score scale for comparison. This improves the comparability of the data and, by reducing the room for subjective interpretation, lessens the influence of human factors and makes the data more objective. The formula is as follows:
$$ z = \frac{x - \mu}{\sigma} $$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values. Z-score standardization is simple to use and easy to compute, applies to numerical data, and is unaffected by the magnitude of the data.
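As a minimal illustration of Z-score standardization, the following Python sketch standardizes one numeric column using only the standard library. It assumes the population standard deviation (the source does not say whether population or sample deviation is intended) and maps a constant column to zeros to avoid division by zero; the function name and sample data are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no spread
    return [(x - mu) / sigma for x in values]


# Hypothetical monthly spend for four users, in arbitrary currency units.
monthly_spend = [120.0, 80.0, 100.0, 100.0]
standardized = z_scores(monthly_spend)
```

After standardization, columns measured in different units (spend, visit counts, session length) share a common scale and can be compared directly.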
Next, the user portrait unit 3 integrates the user's data across multiple dimensions to form a complete user portrait, as follows. The feature extraction module 31 performs statistical analysis on the standardized data and extracts useful information features. Feature extraction in module 31 is based on a random forest feature selection algorithm. A random forest is an ensemble classifier: a strong classifier formed by training a number of base classifiers and combining them by voting. Non-leaf nodes in a random forest are selected according to the relevance and importance of the feature variables, and the indexes that measure variable relevance are information gain and the Gini coefficient. The calculation used to select the non-leaf nodes of the decision trees can also be applied to feature selection. A common criterion in random forest feature selection is the Gini coefficient, whose main function is to quantify the uncertainty of each feature variable so that the feature with the least uncertainty can be selected. When the Gini coefficient reaches its minimum, all samples belong to one class. In other words, the smaller the Gini coefficient, the smaller the uncertainty and the better the feature. The formula of the Gini coefficient is as follows:
$$ \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2), \qquad \mathrm{Gini}(D) = 1 - \sum_{k} p_k^2 $$
where A is a feature that divides the data set D into D_1 and D_2, k indexes the classes, and p_k is the probability that a sample belongs to the k-th class. As a feature selection algorithm, the random forest suits high-dimensional feature sets and large-scale data scenarios, and has the advantage of being insensitive to missing values.
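The Gini criterion can be sketched in a few lines of Python. This is an illustration of the impurity calculation for a single binary split, not the full random forest: it assumes the standard definition Gini(D) = 1 - Σ p_k², weights the two partitions by their size, and uses a hypothetical numeric threshold to form D_1 and D_2. The feature names and data are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(feature_values, labels, threshold):
    """Size-weighted Gini impurity after splitting on feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical feature (weekly visits) and class (activity level) per user.
visits = [1, 2, 8, 9]
activity = ["low", "low", "high", "high"]

clean_split = gini_split(visits, activity, threshold=5)  # separates the classes
poor_split = gini_split(visits, activity, threshold=1)   # leaves a mixed partition
```

A split that separates the classes perfectly yields an impurity of zero, so `clean_split` is lower than `poor_split`; a feature selection procedure along these lines would prefer features whose best split has the lowest impurity.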
The labeling module 32 then classifies and labels users according to the results of the statistical analysis. The classification labels are based on the Pearson correlation coefficient, expressed as follows:
$$ \rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big((x - E(x))(y - E(y))\big)}{\sigma_x \sigma_y} $$
where cov(x, y) is the covariance of variables x and y, σ_x and σ_y are the standard deviations of x and y, respectively, and E(i) denotes the expected value of i. The Pearson correlation coefficient lies in the interval [-1, 1]. A coefficient of 0 indicates that the two variables have no linear correlation; the closer the coefficient is to 1, the stronger the positive correlation; a negative coefficient indicates that the two variables are negatively correlated. In short, the larger the Pearson correlation coefficient, the higher the correlation between the two variables, and the smaller the coefficient, the lower the correlation. Two user data variables with high correlation are classified into the same class.
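The correlation calculation can be sketched as follows, assuming the usual definition of the Pearson coefficient as covariance divided by the product of the standard deviations. The function name and sample variables are hypothetical, and the sketch raises an error for constant inputs, where the coefficient is undefined.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sigma_x * sigma_y)


# Hypothetical per-user variables.
orders = [1, 2, 3, 4]
spend = [10.0, 20.0, 30.0, 40.0]       # rises with orders
days_inactive = [40, 30, 20, 10]       # falls as orders rise

r_positive = pearson(orders, spend)
r_negative = pearson(orders, days_inactive)
```

Variables whose coefficient exceeds a chosen threshold could then be grouped under the same label class; the source does not state a threshold, so any cutoff would be a tuning choice.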
Finally, the portrait evaluation module 33 evaluates and optimizes the user portrait to ensure its quality and usability. The portrait evaluation module 33 is based on the F-score index, calculated as follows:
$$ F = \frac{(\alpha^2 + 1)\,P\,R}{\alpha^2 P + R} $$
where α is a weight, P is the accuracy rate, and R is the recall rate; P and R are calculated as follows:
$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} $$
where TP is the number of samples that are actually correct and are classified as correct, FP is the number of samples that are actually incorrect but are classified as correct, and FN is the number of samples that are actually correct but are classified as incorrect.
The recall rate and the accuracy rate are both evaluation indexes. The recall rate examines how many of the positive examples in the original samples are retrieved, that is, it evaluates coverage. Whether a higher or a lower recall rate is preferable depends on the specific situation. The accuracy rate refers to the ratio of the number of correctly classified samples to the total number of samples, and is a relatively intuitive index. In general, the higher the accuracy rate, the better the classification. This, too, depends on the situation. For example, when the numbers of positive and negative samples are extremely imbalanced and almost every sample is negative, the classifier achieves a high accuracy rate during training but may later misjudge positive samples in the validation set. Because the recall rate and the accuracy rate each reflect only one aspect of a model's performance, neither can comprehensively evaluate the quality of a model on its own. The F-score is a more comprehensive index than either, and using the F-score to assess the algorithm as a whole allows the algorithm to be evaluated fairly intuitively.
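The following Python sketch shows how an F-score could be computed from raw counts. It assumes the common weighted form F = (α² + 1)PR / (α²P + R) with P = TP / (TP + FP) and R = TP / (TP + FN), and treats FN as positives that were classified as negative. The function name, the zero-division handling, and the example counts are assumptions for illustration.

```python
def f_score(tp, fp, fn, alpha=1.0):
    """Weighted F-score from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if (tp + fp) else 0.0  # P: share of predicted positives that are right
    r = tp / (tp + fn) if (tp + fn) else 0.0  # R: share of actual positives that are found
    if p == 0.0 and r == 0.0:
        return 0.0
    return (alpha ** 2 + 1) * p * r / (alpha ** 2 * p + r)


# Hypothetical evaluation of one label: 8 correct hits, 2 false alarms, 8 misses.
balanced = f_score(tp=8, fp=2, fn=8)               # alpha = 1 weighs P and R equally
recall_heavy = f_score(tp=8, fp=2, fn=8, alpha=2)  # alpha > 1 emphasizes recall
```

With P = 0.8 and R = 0.5 in this example, raising α pulls the score toward the weaker recall value, which is why `recall_heavy` comes out below `balanced`.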
Once the user portrait is generated, the label management unit 4 updates and manages different labels in real time according to service requirements and user characteristics. Management covers four areas: application management, which applies labels to actual service scenarios; real-time updating and maintenance, which updates and maintains labels promptly as service requirements and data change, ensuring that labels remain accurate and complete; security management, which protects the security and privacy of label data and ensures that it is not leaked or abused; and usage analysis, which statistically analyzes how labels are used to understand their effect and value and to provide a reference for optimizing label management.
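A minimal in-memory sketch can make the updating and usage-analysis functions concrete. The `LabelStore` class and its methods are hypothetical and are not described in the source: the sketch stores the latest value and update time per user and label, and counts reads per label as a stand-in for usage analysis. Security management (access control, encryption) is deliberately left out.

```python
from collections import Counter
from datetime import datetime, timezone


class LabelStore:
    """Hypothetical in-memory store for user labels with usage counting."""

    def __init__(self):
        self._labels = {}        # (user_id, label) -> (value, updated_at)
        self._usage = Counter()  # label -> number of reads

    def update(self, user_id, label, value):
        """Set or overwrite a label value and record when it changed."""
        self._labels[(user_id, label)] = (value, datetime.now(timezone.utc))

    def get(self, user_id, label):
        """Read a label value and count the read for usage analysis."""
        self._usage[label] += 1
        value, _updated_at = self._labels[(user_id, label)]
        return value

    def usage_report(self):
        """Return how often each label has been read."""
        return dict(self._usage)


store = LabelStore()
store.update(1, "tier", "gold")
store.update(1, "tier", "platinum")  # a later update replaces the earlier value
current_tier = store.get(1, "tier")
```

A usage report of this kind could show which labels are never read and are candidates for retirement, in line with the optimization goal stated above.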
Finally, the label screening unit 5 screens out the labels that meet the conditions according to the user portrait and the updated, managed labels. The system can screen by different labels to help enterprises quickly find target user groups. The specific steps are: selecting the corresponding label screener according to the type of label to be screened; setting the corresponding screening conditions according to the content to be screened; applying the screening conditions to the label data to obtain the label data that meets the conditions; and applying the screened label data to the corresponding service scenario. During label screening, adjustment and optimization are needed according to actual conditions.
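The screening steps can be sketched as a small dispatch table in Python. The source does not define what a label screener is, so this sketch assumes one predicate per label type; the type names `numeric` and `category`, the condition keys, and the sample data are all hypothetical.

```python
# Hypothetical screeners: one predicate per label type.
SCREENERS = {
    "numeric": lambda value, cond: cond["min"] <= value <= cond["max"],
    "category": lambda value, cond: value in cond["allowed"],
}


def screen_labels(label_data, label_type, condition):
    """Select a screener by label type and keep the users that satisfy the condition."""
    screener = SCREENERS[label_type]             # step 1: select the screener
    return {
        user_id: value
        for user_id, value in label_data.items()
        if screener(value, condition)            # step 3: apply the condition
    }


# Hypothetical label data: user id -> label value.
spend_score = {1: 0.9, 2: 0.2, 3: 0.7}
member_tier = {1: "gold", 2: "basic", 3: "silver"}

# Step 2: set the screening conditions for each label.
high_spenders = screen_labels(spend_score, "numeric", {"min": 0.5, "max": 1.0})
premium_users = screen_labels(member_tier, "category", {"allowed": {"gold", "silver"}})

# Step 4: hand the screened users to a service scenario, here a campaign audience.
campaign_audience = sorted(set(high_spenders) & set(premium_users))
```

Keeping screeners in a table makes it easy to add a new label type without changing the screening function, which fits the need for adjustment noted above.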
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat it as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.
Claims (7)
1. A private domain service label screening system based on big data, characterized by comprising:
a data collection unit (1), for collecting user data from different channels and platforms;
a data integration unit (2), for integrating the collected data and formatting it uniformly;
a user portrait unit (3), for integrating the user's data across multiple dimensions, through data mining and analysis based on big data technology, to form a complete user portrait;
a label management unit (4), for updating and managing different labels in real time according to service requirements and user characteristics;
a label screening unit (5), for screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
2. The big data based private domain service label screening system of claim 1, wherein the data integration unit (2) comprises:
a data cleaning module (21), configured to remove duplicate, invalid, or erroneous data;
a data merging module (22), configured to combine a plurality of data sources into a complete data set;
a data formatting module (23), configured to standardize the integrated data so as to ensure data uniformity.
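As an illustration of the cleaning and merging modules of claim 2, the following is a minimal sketch rather than the claimed implementation. It assumes records are flat dictionaries keyed by a hypothetical `user_id` field, treats a record without a `user_id` as invalid, and treats exact repeats as duplicates. The source names `app` and `shop` are made up for the example.

```python
def clean(records):
    """Drop invalid records (no user_id) and exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if not rec.get("user_id"):
            continue  # invalid: cannot be attributed to a user
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def merge(*sources):
    """Combine several cleaned sources into one data set keyed by user_id."""
    merged = {}
    for source in sources:
        for rec in clean(source):
            merged.setdefault(rec["user_id"], {}).update(rec)
    return merged


# Hypothetical data from two channels.
app = [
    {"user_id": "u1", "visits": 12},
    {"user_id": "u1", "visits": 12},  # duplicate
    {"user_id": "", "visits": 3},     # invalid
]
shop = [{"user_id": "u1", "orders": 4}, {"user_id": "u2", "orders": 1}]
dataset = merge(app, shop)
```

Keying the merged set by user keeps one row per user, which is the shape the later portrait and labeling steps would need. Later sources overwrite earlier ones on conflicting fields, which is a design choice of this sketch only.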
3. The big data based private domain service label screening system of claim 2, wherein the data formatting module (23) is based on the standard score (Z-score), given by:
$$z = \frac{x - \mu}{\sigma}$$
where x is a user data sample value, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.
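The standard score of claim 3 can be sketched in a few lines. This is an illustrative example only: the use of the population standard deviation and the zero-variance fallback are assumptions, and `monthly_spend` is made-up sample data.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardize values with z = (x - mu) / sigma."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All samples are identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


monthly_spend = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_scores(monthly_spend))
# prints [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this transformation every feature has mean 0 and unit standard deviation, so features measured on different scales become comparable.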
4. The big data based private domain service label screening system of claim 1, wherein the user portrait unit (3) comprises:
a feature extraction module (31), configured to perform statistical analysis on the data and extract useful information features;
a labeling module (32), configured to classify and label users according to the results of the statistical analysis;
a portrait evaluation module (33), configured to evaluate and optimize the constructed portrait so as to ensure the quality and usability of the user portrait;
the classification labels in the labeling module (32) are based on the Pearson correlation coefficient, given by:
$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big[(x - E(x))(y - E(y))\big]}{\sigma_x \sigma_y}$$
where cov(x, y) is the covariance of variables x and y, σ_x and σ_y are the standard deviations of x and y respectively, and E(i) denotes the expected value of i;
the portrait evaluation module (33) is based on the F-score index, given by:
$$F = \frac{(\alpha^2 + 1)\,P\,R}{\alpha^2 P + R}$$
where α is a weighting factor, P is the precision, and R is the recall, with P and R given by:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP is the number of samples that are actually correct and are classified as correct, FP is the number of samples that are actually incorrect but are classified as correct, and FN is the number of samples that are actually correct but are classified as incorrect;
the feature extraction module (31) is based on a random forest feature selection algorithm, in which the splitting criterion is the Gini coefficient, expressed as:
$$\operatorname{Gini}(D, A) = \frac{|D_1|}{|D|}\operatorname{Gini}(D_1) + \frac{|D_2|}{|D|}\operatorname{Gini}(D_2), \qquad \operatorname{Gini}(D) = 1 - \sum_{k} p_k^2$$
where A is a feature that divides the data set D into D_1 and D_2, k indexes the classes, and p_k is the probability of the k-th class.
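The three quantities referred to in claim 4 can be sketched as small functions. This is a hedged illustration, not the claimed implementation: the F-score uses the standard weighted F-measure form, the random forest criterion is assumed to be the Gini impurity of a binary split, and all inputs are assumed to be non-degenerate (non-zero variance, non-empty splits, at least one positive prediction). The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


def f_score(tp, fp, fn, alpha=1.0):
    """Weighted F-measure from true positives, false positives, false negatives."""
    p = tp / (tp + fp)  # precision
    r = tp / (tp + fn)  # recall
    return (alpha ** 2 + 1) * p * r / (alpha ** 2 * p + r)


def gini(labels):
    """Gini impurity of one node: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini impurity after splitting a data set into two parts."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


print(pearson([1, 2, 3], [2, 4, 6]))          # perfectly linear, close to 1.0
print(f_score(tp=8, fp=2, fn=2))              # precision = recall = 0.8
print(gini_split(["a", "a"], ["b", "b"]))     # pure split, impurity 0.0
```

A feature whose best split gives a lower `gini_split` value separates the classes more cleanly, which is the intuition behind ranking features by impurity reduction in a random forest.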
5. The big data based private domain service label screening system of claim 1, wherein the label management unit (4) performs real-time updating and maintenance management of labels, application management of labels, security management of labels, and usage analysis of labels.
6. The big data based private domain service label screening system of claim 1, wherein the label screening unit (5) executes the following steps:
selecting a corresponding label screener according to the type of label to be screened;
setting corresponding screening conditions according to the content to be screened;
applying the screening conditions to the label data to obtain the label data that meets the conditions;
and applying the screened label data to the corresponding service scenario.
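The four screening steps of claim 6 can be sketched as follows. Everything here is an assumption made for illustration: the label record layout (`user`, `type`, `name`, `score`), the `SCREENERS` registry, and the threshold are hypothetical and are not taken from the patent.

```python
from typing import Any, Callable, Dict, List

Label = Dict[str, Any]
Condition = Callable[[Label], bool]


def keep_matching(labels: List[Label], condition: Condition) -> List[Label]:
    """A generic screener: keep the labels for which the condition holds."""
    return [lab for lab in labels if condition(lab)]


# Step 1: one screener per label type (a hypothetical registry).
SCREENERS: Dict[str, Callable[[List[Label], Condition], List[Label]]] = {
    "behavior": keep_matching,
    "demographic": keep_matching,
}


def screen_labels(label_type: str, condition: Condition,
                  labels: List[Label]) -> List[Label]:
    screener = SCREENERS[label_type]                       # step 1: pick screener
    candidates = [lab for lab in labels if lab["type"] == label_type]
    return screener(candidates, condition)                 # step 3: apply condition


labels = [
    {"user": "u1", "type": "behavior", "name": "high_frequency_buyer", "score": 0.92},
    {"user": "u2", "type": "behavior", "name": "high_frequency_buyer", "score": 0.41},
    {"user": "u3", "type": "demographic", "name": "age_25_34", "score": 0.88},
]

# Step 2: the screening condition is expressed as a predicate.
selected = screen_labels("behavior", lambda lab: lab["score"] >= 0.8, labels)

# Step 4: hand the screened labels to a service scenario, here a target list.
target_users = [lab["user"] for lab in selected]
```

Keeping the screener lookup separate from the condition lets new label types plug in their own screening logic without changing the calling code.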
7. A private domain service label screening method based on big data, using the private domain service label screening system based on big data according to any one of claims 1 to 6, characterized by comprising the following steps:
S1: collecting user data from different channels and platforms;
S2: integrating the collected user data and converting it into a uniform format;
S3: integrating multi-dimensional user data to form a complete user portrait;
S4: updating and managing different labels in real time according to service requirements and user characteristics;
S5: screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311528120.4A CN117556256A (en) | 2023-11-16 | 2023-11-16 | Private domain service label screening system and method based on big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117556256A (en) | 2024-02-13 |
Family
ID=89818058
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311528120.4A, Pending, CN117556256A (en) | Private domain service label screening system and method based on big data | 2023-11-16 | 2023-11-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117556256A (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391603A (en) * | 2017-06-30 | 2017-11-24 | 北京奇虎科技有限公司 | User's portrait method for building up and device for mobile terminal |
CN108764663A (en) * | 2018-05-15 | 2018-11-06 | 广东电网有限责任公司信息中心 | A kind of power customer portrait generates the method and system of management |
CN111191122A (en) * | 2019-12-20 | 2020-05-22 | 重庆邮电大学 | Learning resource recommendation system based on user portrait |
CN112347372A (en) * | 2020-10-30 | 2021-02-09 | 银盛支付服务股份有限公司 | Method for service promotion of financial enterprise based on user portrait scheme |
CN113312531A (en) * | 2021-04-22 | 2021-08-27 | 广州丰石科技有限公司 | User portrait identification method based on DPI analysis and decision tree model |
CN113822390A (en) * | 2021-11-24 | 2021-12-21 | 杭州贝嘟科技有限公司 | User portrait construction method and device, electronic equipment and storage medium |
CN113988221A (en) * | 2021-11-26 | 2022-01-28 | 泰康保险集团股份有限公司 | Insurance user classification model establishing method, classification method, device and equipment |
CN114004584A (en) * | 2021-10-22 | 2022-02-01 | 国网重庆市电力公司电力科学研究院 | Power information management method for building user portrait based on data middleboxes |
CN114547128A (en) * | 2021-12-14 | 2022-05-27 | 浙江吉利控股集团有限公司 | False order identification method, false order identification system, computer equipment and storage medium |
CN114626940A (en) * | 2022-03-31 | 2022-06-14 | 中国工商银行股份有限公司 | Data analysis method and device and electronic equipment |
CN115098599A (en) * | 2022-06-20 | 2022-09-23 | 启明信息技术股份有限公司 | Sketch analysis method and system based on multi-dimensional user preference label |
CN116401564A (en) * | 2023-03-24 | 2023-07-07 | 上海电力大学 | PCA-based redundant variable screening improvement method and device |
CN116595418A (en) * | 2023-05-26 | 2023-08-15 | 国网上海市电力公司 | Multi-dimensional image construction method for scientific and technological achievements |
CN116703129A (en) * | 2023-08-07 | 2023-09-05 | 匠达(苏州)科技有限公司 | Intelligent task matching scheduling method and system based on personnel data image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||