CN117556256A - Private domain service label screening system and method based on big data - Google Patents

Publication number
CN117556256A
Authority
CN
China
Prior art keywords
data
user
module
label
private domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311528120.4A
Other languages
Chinese (zh)
Inventor
张东晴
王鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Small Fission Network Technology Co ltd
Original Assignee
Nanjing Small Fission Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Small Fission Network Technology Co ltd
Priority to CN202311528120.4A
Publication of CN117556256A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of big data, and in particular to a private domain service label screening system and method based on big data. The system comprises: a data collection unit for collecting user data from different channels and platforms; a data integration unit for integrating and uniformly formatting the collected data; a user portrait unit for integrating multiple dimensions of user data through data mining and analysis based on big data technology to form a complete user portrait; a label management unit for updating and managing different labels in real time; and a label screening unit for screening out the labels that meet the conditions according to the user portrait and the label definitions. The invention standardizes heterogeneous user data using standard scores, which ensures the uniformity of the data and provides an accurate data basis for generating user portraits. It also forms complete data portraits based on big data technology, thereby improving the quality of the algorithm model and the accuracy of the labels.

Description

Private domain service label screening system and method based on big data
Technical Field
The invention relates to the technical field of big data, in particular to a private domain business label screening system and a private domain business label screening method based on big data.
Background
With the advent of the digital age, enterprises face vast amounts of user data, including user identity information, behavioral information, transaction data, and so forth. Managing and using this data effectively has become key to how enterprises operate private domain traffic. A label screening system can help enterprises integrate, analyze, and use such data to better understand user needs and preferences. In the prior art, because user data comes from different channels and platforms, data formats and standards may differ, which makes data integration difficult, and errors in data quality and in the algorithms lead to low label accuracy. To address these problems, the present invention proposes a private domain service label screening system and method based on big data.
Disclosure of Invention
The invention aims to remedy the deficiencies described in the background by providing a private domain service label screening system and method based on big data.
The technical scheme adopted by the invention is as follows:
A private domain service label screening system based on big data comprises:
a data collection unit: used for collecting user data from different channels and platforms;
a data integration unit: used for integrating and uniformly formatting the collected data;
a user portrait unit: used for integrating multiple dimensions of user data through data mining and analysis based on big data technology to form a complete user portrait;
a label management unit: used for updating and managing different labels in real time according to service requirements and user characteristics;
a label screening unit: used for screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
As a preferred technical scheme of the invention: the data integration unit includes:
a data cleaning module: the data cleaning module is used for removing duplicate, invalid, or erroneous data;
a data merging module: the data merging module is used for integrating a plurality of data sources into a complete data set;
a data formatting module: the data formatting module is used for standardizing the integrated data to ensure the uniformity of the data.
As a preferred technical scheme of the invention: the data formatting module is based on the standard score (Z-score), with the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.
As a preferred technical scheme of the invention: the user portrait unit includes:
a feature extraction module: the feature extraction module is used for performing statistical analysis on the data and extracting useful information features;
a labeling module: the labeling module is used for classifying and marking the users according to the statistical analysis result;
a portrait evaluation module: the portrait evaluation module is used for evaluating and optimizing the constructed portrait to ensure the quality and usability of the user portrait;
the feature extraction module is based on a random forest feature selection algorithm whose judgment criterion is the Gini coefficient, with the following formula:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where A is a feature that divides the data set D into $D_1$ and $D_2$, k indexes the classes (K classes in total), and $p_k$ is the probability of the k-th class.
As a preferred technical scheme of the invention: the classification labels in the labeling module are based on the Pearson correlation coefficient, with the following formula:
$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E(xy) - E(x)E(y)}{\sigma_x \sigma_y}$$
where cov(x, y) is the covariance of variables x and y, σx and σy are the standard deviations of x and y, respectively, and E(i) represents the expected value of i;
the portrait evaluation module is based on the F-score index, with the following formula:
$$F_\alpha = \frac{(1 + \alpha^2)\,P\,R}{\alpha^2 P + R}$$
where α is a weight, P is the precision, and R is the recall, with P and R given by:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP is the number of positive samples correctly classified as positive, FP is the number of negative samples incorrectly classified as positive, and FN is the number of positive samples incorrectly classified as negative.
As a preferred technical scheme of the invention: the tag management unit comprises real-time updating and maintenance management of the tag, application management of the tag, security management of the tag and usage analysis of the tag.
As a preferred technical scheme of the invention: the label screening unit executes the following steps:
selecting a corresponding label screener according to the label type to be screened;
setting corresponding screening conditions according to the content to be screened;
applying the screening conditions to the tag data to obtain tag data meeting the conditions;
and applying the filtered label data to the corresponding service scene.
A private domain service label screening method based on big data comprises the following steps:
S1: collecting user data from different channels and platforms;
S2: integrating and uniformly formatting the collected user data;
S3: integrating multiple dimensions of user data to form a complete user portrait;
S4: updating and managing different labels according to service requirements and user characteristics;
S5: screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
Compared with the prior art, the private domain service label screening system and method based on big data provided by the invention have the following beneficial effects:
the invention standardizes heterogeneous user data using standard scores, which ensures the uniformity of the data and provides an accurate data basis for generating user portraits. It forms complete data portraits based on big data technology, using random forest feature extraction and applying the Pearson correlation coefficient and the F-score index to the users' multidimensional data. This improves the quality of the algorithm model and the accuracy of the labels, and in turn improves subsequent marketing effectiveness and customer satisfaction.
Drawings
FIG. 1 is a block diagram of the overall system of the present invention;
FIG. 2 is a system block diagram of a data integration unit of the present invention;
FIG. 3 is a system block diagram of a user portrait unit of the present invention;
FIG. 4 is a flow chart of the method of the present invention.
The meaning of each reference numeral in the figures is:
1. a data collection unit;
2. a data integration unit;
21. a data cleaning module; 22. a data merging module; 23. a data formatting module;
3. a user portrait unit;
31. a feature extraction module; 32. a labeling module; 33. a portrait evaluation module;
4. a label management unit;
5. a label screening unit.
Detailed Description
It should be noted that, provided there is no conflict, the embodiments of the invention and the features in the embodiments may be combined with one another. The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to FIGS. 1-3, the present invention provides a private domain service label screening system based on big data, comprising:
a data collection unit 1: used for collecting user data from different channels and platforms;
a data integration unit 2: used for integrating and uniformly formatting the collected data;
a user portrait unit 3: used for integrating multiple dimensions of user data through data mining and analysis based on big data technology to form a complete user portrait;
a label management unit 4: used for updating and managing different labels in real time according to service requirements and user characteristics;
a label screening unit 5: used for screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
The data integration unit 2 includes:
a data cleaning module 21: the data cleaning module 21 is used for removing duplicate, invalid, or erroneous data;
a data merging module 22: the data merging module 22 is used for integrating a plurality of data sources into a complete data set;
a data formatting module 23: the data formatting module 23 is used for standardizing the integrated data to ensure the uniformity of the data.
The data formatting module 23 is based on the standard score (Z-score), with the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.
The user portrait unit 3 includes:
a feature extraction module 31: the feature extraction module 31 is used for performing statistical analysis on the data and extracting useful information features;
a labeling module 32: the labeling module 32 is used for classifying and labeling users according to the result of the statistical analysis;
a portrait evaluation module 33: the portrait evaluation module 33 is used for evaluating and optimizing the constructed portrait to ensure the quality and usability of the user portrait.
The feature extraction module 31 is based on a random forest feature selection algorithm whose judgment criterion is the Gini coefficient, with the following formula:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where A is a feature that divides the data set D into $D_1$ and $D_2$, k indexes the classes (K classes in total), and $p_k$ is the probability of the k-th class.
The classification labels in the labeling module 32 are based on the Pearson correlation coefficient, with the following formula:
$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E(xy) - E(x)E(y)}{\sigma_x \sigma_y}$$
where cov(x, y) is the covariance of variables x and y, σx and σy are the standard deviations of x and y, respectively, and E(i) represents the expected value of i.
The portrait evaluation module 33 is based on the F-score index, with the following formula:
$$F_\alpha = \frac{(1 + \alpha^2)\,P\,R}{\alpha^2 P + R}$$
where α is a weight, P is the precision, and R is the recall, with P and R given by:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP is the number of positive samples correctly classified as positive, FP is the number of negative samples incorrectly classified as positive, and FN is the number of positive samples incorrectly classified as negative.
The tag management unit 4 includes real-time update and maintenance management of tags, application management of tags, security management of tags, and usage analysis of tags.
The label screening unit 5 performs the following procedure:
selecting a corresponding label screener according to the label type to be screened;
setting corresponding screening conditions according to the content to be screened;
applying the screening conditions to the tag data to obtain tag data meeting the conditions;
and applying the filtered label data to the corresponding service scene.
Referring to FIG. 4, a private domain service label screening method based on big data is provided, which includes the following steps:
S1: collecting user data from different channels and platforms;
S2: integrating and uniformly formatting the collected user data;
S3: integrating multiple dimensions of user data to form a complete user portrait;
S4: updating and managing different labels in real time according to service requirements and user characteristics;
S5: screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
In this embodiment, a private domain is a space that an enterprise establishes and operates itself, that the enterprise fully controls, and through which it can reach users repeatedly at low cost or even for free. In this space, the enterprise can acquire customers in various ways and then retain, convert, and drive repeat purchases from them, so that it can better understand and manage user data and improve marketing effectiveness and customer satisfaction. To perform private domain service label screening, the data collection unit 1 first collects user data from different channels and platforms, including the users' basic information, behavior information, transaction data, and so on. Such data may be obtained through channels and platforms such as official websites, social media, and CRM systems. The data integration unit 2 then integrates and uniformly formats the collected user data in the following steps: the data cleaning module 21 cleans the collected data, removing duplicate, invalid, or erroneous data to ensure data quality and accuracy; the data merging module 22 integrates the multiple data sources into a complete data set; and the data formatting module 23 standardizes the integrated data to ensure its uniformity. Standardization is based on the standard score (Z-score), a common data processing method that converts data of different magnitudes onto a uniform Z-score scale so that they can be compared. This improves the comparability of the data and, by de-emphasizing the raw interpretation of the values, reduces the influence of human factors on the data and makes it more objective. The formula is as follows:
$$z = \frac{x - \mu}{\sigma}$$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values. Z-score standardization is simple to use and easy to compute, can be applied to any numerical data, and is not affected by the magnitude of the data.
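As a rough illustration of the cleaning, merging, and standardization steps described above, the following Python sketch runs them on per-channel record lists. The record fields (`user_id`, `channel`, `spend`), the function names, and the choice of the population standard deviation are assumptions made for this example; the patent does not specify them.

```python
from statistics import mean, pstdev


def clean(records):
    """Drop duplicate records and records with a missing id or non-numeric value."""
    seen, kept = set(), []
    for r in records:
        spend = r.get("spend")
        if r.get("user_id") is None or not isinstance(spend, (int, float)):
            continue  # invalid or erroneous record
        key = (r["user_id"], r.get("channel"))
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        kept.append(r)
    return kept


def merge(*sources):
    """Combine several cleaned sources into one per-user total (hypothetical rule)."""
    totals = {}
    for source in sources:
        for r in source:
            totals[r["user_id"]] = totals.get(r["user_id"], 0.0) + r["spend"]
    return totals


def z_scores(values):
    """Standardize values as z = (x - mu) / sigma, using the population sigma."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical; nothing to scale
    return [(x - mu) / sigma for x in values]
```

After standardization the values have mean 0 and unit standard deviation, which is what makes features measured on different scales comparable.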
Next, the user portrait unit 3 integrates multiple dimensions of user data to form a complete user portrait, in the following steps. The feature extraction module 31 performs statistical analysis on the standardized data to extract useful information features. Feature extraction in module 31 is based on a random forest feature selection algorithm. A random forest is an ensemble classifier: a strong classifier formed by training multiple base classifiers and combining them by voting. The criterion for selecting non-leaf nodes in a random forest is the relevance and importance of the feature variables, and the indexes used to measure variable relevance are the information gain and the Gini coefficient. The computation used to select the non-leaf nodes of the decision trees in a random forest can also be applied to feature selection. A common judgment criterion in random forest feature selection is the Gini coefficient, whose main function is to measure the uncertainty of each feature variable so that the feature with the least uncertainty can be selected. When the Gini coefficient reaches its minimum, all samples belong to a single class. In other words, the smaller the Gini coefficient, the smaller the uncertainty and the better the feature. The formula of the Gini coefficient is as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where A is a feature that divides the data set D into $D_1$ and $D_2$, k indexes the classes (K classes in total), and $p_k$ is the probability of the k-th class. As a feature selection algorithm, the random forest is suited to high-dimensional feature sets and large-scale data, and has the advantage of being insensitive to missing values.
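To make the Gini criterion concrete, the sketch below computes the Gini coefficient of a label set and the weighted Gini of a binary split, then picks the feature whose best split has the lowest value. It is not a full random forest: it evaluates the split criterion once on the whole data set, and the threshold-style split (`x <= t`) and all function names are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class frequencies in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Weighted Gini after splitting D into D1 (x <= threshold) and D2 (x > threshold)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(features, labels):
    """Return (name, scores): the feature whose best split has the smallest Gini."""
    scores = {}
    for name, values in features.items():
        # Candidate thresholds: every distinct value except the largest,
        # so that neither side of the split is empty.
        thresholds = sorted(set(values))[:-1]
        scores[name] = min(
            (gini_split(values, labels, t) for t in thresholds),
            default=gini(labels),
        )
    return min(scores, key=scores.get), scores
```

A score of 0 means some split on that feature separates the classes perfectly, which matches the statement that a smaller Gini coefficient indicates a better feature.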
The labeling module 32 then classifies and labels the users based on the results of the statistical analysis. The classification labels are based on the Pearson correlation coefficient, with the following formula:
$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E(xy) - E(x)E(y)}{\sigma_x \sigma_y}$$
where cov(x, y) is the covariance of variables x and y, σx and σy are the standard deviations of x and y, respectively, and E(i) represents the expected value of i. The value of the Pearson correlation coefficient lies in the interval [-1, 1]. A coefficient of 0 indicates that the two variables have no linear correlation; the closer the coefficient is to 1, the stronger the positive correlation, and the closer it is to -1, the stronger the negative correlation. In short, the larger the absolute value of the coefficient, the stronger the correlation between the two variables. User data variables with high correlation are classified into the same class.
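The following sketch shows one way the Pearson coefficient could be computed and then used to place highly correlated variables in the same class. The greedy grouping rule, the 0.8 threshold, and the function names are assumptions for illustration; the patent states only that variables with higher correlation are classified together.

```python
from math import sqrt


def pearson(x, y):
    """rho = cov(x, y) / (sigma_x * sigma_y), using population statistics."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)


def group_correlated(columns, threshold=0.8):
    """Greedily group variables whose correlation with a group's first member
    is at least threshold (hypothetical grouping rule)."""
    groups = []
    for name in columns:
        for group in groups:
            if pearson(columns[group[0]], columns[name]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no sufficiently correlated group found
    return groups
```

With this rule, a variable that is strongly negatively correlated with a group starts its own group, since only coefficients near +1 pass the threshold.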
Finally, the portrait evaluation module 33 evaluates and optimizes the user portrait to ensure its quality and usability. The portrait evaluation module 33 is based on the F-score index, with the following formula:
$$F_\alpha = \frac{(1 + \alpha^2)\,P\,R}{\alpha^2 P + R}$$
where α is a weight, P is the precision, and R is the recall, with P and R given by:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP is the number of positive samples correctly classified as positive, FP is the number of negative samples incorrectly classified as positive, and FN is the number of positive samples incorrectly classified as negative.
Recall and precision are both evaluation indexes. Recall measures how many of the positive examples in the original sample are retrieved, that is, coverage; whether a higher recall is preferable must be judged case by case. Precision is the proportion of samples classified as positive that are actually positive, and is a relatively intuitive index; in general, the higher the precision, the better the classification. It must nevertheless also be judged in context. For example, when the numbers of positive and negative samples are extremely imbalanced and almost every sample is negative, a classifier can score highly during training and still misjudge the positive samples it later encounters. Because recall and precision each reflect only one aspect of a model's performance, neither can evaluate the quality of a model comprehensively. The F-score is a more comprehensive index than either, so the F-score value is used to weigh the quality of the algorithm as a whole, which allows the algorithm to be evaluated intuitively to a certain extent.
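A minimal sketch of the evaluation index follows, assuming the standard weighted form F = (1 + α²)PR / (α²P + R) with P = TP / (TP + FP) and R = TP / (TP + FN). The function names and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
def precision_recall(tp, fp, fn):
    """Return (P, R) from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r


def f_score(tp, fp, fn, alpha=1.0):
    """Weighted F-score; alpha > 1 favours recall, alpha < 1 favours precision."""
    p, r = precision_recall(tp, fp, fn)
    denom = alpha ** 2 * p + r
    if denom == 0:
        return 0.0  # no true positives at all
    return (1 + alpha ** 2) * p * r / denom
```

For example, with TP = 8, FP = 2, FN = 8 the precision is 0.8 and the recall is 0.5; raising α from 1 to 2 pulls the score toward the lower recall value.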
Once the user portrait has been generated, the label management unit 4 updates and manages the different labels in real time according to service requirements and user characteristics. The management covers four areas: application management of labels, which applies the labels to actual service scenarios; real-time updating and maintenance of labels, which updates and maintains the labels promptly as service requirements and data change, ensuring their accuracy and completeness; security management of labels, which protects the security and privacy of the label data and ensures that it is not leaked or abused; and usage analysis of labels, which statistically analyzes how the labels are used in order to understand their effect and value and to provide a reference for optimizing label management.
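The management functions described above could be sketched as a small in-memory registry. Everything here is hypothetical: the `Tag` and `TagRegistry` names, the role-based permission check standing in for security management, and the use counter standing in for usage analysis are illustrative assumptions, not the patent's design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Tag:
    name: str
    rule: str              # human-readable definition of the label
    updated_at: datetime   # time of the last update
    use_count: int = 0     # how often the label has been applied


class TagRegistry:
    def __init__(self, allowed_roles=("admin",)):
        self._tags = {}
        self._allowed = set(allowed_roles)

    def upsert(self, name, rule, role):
        """Create or update a label; only permitted roles may change labels."""
        if role not in self._allowed:
            raise PermissionError(f"role {role!r} may not modify labels")
        now = datetime.now(timezone.utc)
        tag = self._tags.get(name)
        if tag is None:
            tag = self._tags[name] = Tag(name, rule, now)
        else:
            tag.rule, tag.updated_at = rule, now
        return tag

    def use(self, name):
        """Apply a label in a service scenario and record the use."""
        tag = self._tags[name]
        tag.use_count += 1
        return tag.rule

    def usage_report(self):
        """Labels ordered from most to least used."""
        return sorted(
            ((t.name, t.use_count) for t in self._tags.values()),
            key=lambda item: -item[1],
        )
```

The usage report gives a simple view of which labels are actually being applied, which is the kind of signal the usage analysis is meant to feed back into label optimization.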
Next, the label screening unit 5 screens out the labels that meet the conditions based on the user portrait and the updated, managed labels. The system can screen by different labels, helping enterprises find target user groups quickly. It executes the following steps: select the label screener corresponding to the label type to be screened; set the screening conditions according to the content to be screened; apply the screening conditions to the label data to obtain the label data that meet the conditions; and apply the screened label data to the corresponding service scenario. Label screening should be adjusted and optimized according to actual conditions.
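The screening steps above can be sketched as a lookup of a screener by label type followed by applying a condition. The two label types (`numeric`, `category`), the screener functions, and the shape of the label data (a mapping from user id to label value) are assumptions chosen for illustration.

```python
def numeric_screener(tag_values, minimum):
    """Keep users whose numeric label value is at least minimum."""
    return {user for user, value in tag_values.items() if value >= minimum}


def category_screener(tag_values, allowed):
    """Keep users whose categorical label value is in the allowed set."""
    return {user for user, value in tag_values.items() if value in allowed}


# Step 1: one screener per label type (hypothetical types).
SCREENERS = {"numeric": numeric_screener, "category": category_screener}


def screen(tag_type, tag_values, condition):
    """Steps 2-3: apply the given condition with the screener for tag_type."""
    try:
        screener = SCREENERS[tag_type]
    except KeyError:
        raise ValueError(f"no screener for label type {tag_type!r}") from None
    return screener(tag_values, condition)
```

The returned set of user ids is what would be handed to the downstream service scenario in the final step, for example a targeted campaign.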
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is for clarity only; the disclosure is not limited to the embodiments described in detail, and the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims (7)

1. A private domain service label screening system based on big data, characterized by comprising:
a data collection unit (1): used for collecting user data from different channels and platforms;
a data integration unit (2): used for integrating and uniformly formatting the collected data;
a user portrait unit (3): used for integrating multiple dimensions of user data through data mining and analysis based on big data technology to form a complete user portrait;
a label management unit (4): used for updating and managing different labels in real time according to service requirements and user characteristics;
a label screening unit (5): used for screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
2. The big data based private domain service label screening system of claim 1, wherein: the data integration unit (2) includes:
a data cleaning module (21): the data cleaning module (21) is used for removing duplicate, invalid, or erroneous data;
a data merging module (22): the data merging module (22) is used for integrating a plurality of data sources into a complete data set;
a data formatting module (23): the data formatting module (23) is used for standardizing the integrated data to ensure the uniformity of the data.
3. The big data based private domain service label screening system of claim 2, wherein: the data formatting module (23) is based on the standard score (Z-score), with the following formula:
$$z = \frac{x - \mu}{\sigma}$$
where x is the value of a user data sample, μ is the mean of the user data sample values, and σ is the standard deviation of the user data sample values.
4. The big data based private domain business label screening system of claim 1, wherein the user portrait unit (3) comprises:
a feature extraction module (31), configured to perform statistical analysis on the data and extract useful information features;
a labeling module (32), configured to classify and label users according to the results of the statistical analysis;
a portrait assessment module (33), configured to evaluate and optimize the constructed portrait to ensure the quality and usability of the user portrait;
the classification labels in the labeling module (32) are based on the Pearson correlation coefficient, given by:
$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{E\big((x - E(x))(y - E(y))\big)}{\sigma_x \sigma_y}$$
where cov(x, y) is the covariance of variables x and y, σ_x and σ_y are the standard deviations of variables x and y, respectively, and E(i) denotes the expected value of i;
the portrait assessment module (33) is based on the F-score index, given by:
$$F_\alpha = \frac{(1 + \alpha^2)\, P R}{\alpha^2 P + R}$$
where α is a weight, P is the precision, and R is the recall, with P and R given by:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where TP is the number of samples that are actually positive and are classified as positive, FP is the number of samples that are actually negative but are classified as positive, and FN is the number of samples that are actually positive but are classified as negative;
the feature extraction module (31) is based on a random forest feature selection algorithm, in which the splitting criterion is the Gini coefficient, given by:
$$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2, \qquad \mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$
where A is a feature that divides the data set D into D_1 and D_2, k is the class index, and p_k is the probability of the k-th class.
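The three quantities named in claim 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the F-score uses the standard weighted form with weight α, the Pearson coefficient uses population statistics, and the Gini functions show the impurity measure commonly used as the split criterion in random forests. The function names are hypothetical, and zero-denominator cases (constant inputs, no predicted positives) are deliberately not handled.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (sigma_x * sigma_y), population form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


def f_score(tp, fp, fn, alpha=1.0):
    """Weighted F-score from precision P = TP/(TP+FP) and recall R = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + alpha ** 2) * precision * recall / (alpha ** 2 * precision + recall)


def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini impurity after splitting a data set into two parts."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Example values (hypothetical data).
r = pearson([1, 2, 3], [2, 4, 6])            # perfectly linear relationship
f1 = f_score(tp=8, fp=2, fn=2)               # alpha = 1 gives the usual F1
before = gini(["a", "a", "b", "b"])          # evenly mixed two-class set
after = gini_split(["a", "a"], ["b", "b"])   # a split that separates the classes
```

A feature whose split lowers the weighted impurity the most (here from `before` to `after`) is the one a Gini-based tree would prefer, which is the basis for ranking features in random forest feature selection.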
5. The big data based private domain business label screening system of claim 1, wherein the label management unit (4) performs real-time updating and maintenance of labels, application management of labels, security management of labels, and usage analysis of labels.
6. The big data based private domain business label screening system of claim 1, wherein the label screening unit (5) executes the following steps:
selecting a corresponding label screener according to the type of label to be screened;
setting corresponding screening conditions according to the content to be screened;
applying the screening conditions to the label data to obtain the label data that meets the conditions;
applying the screened label data to the corresponding business scenario.
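The screening steps of claim 6 can be sketched as a small dispatch table in Python. Everything concrete here is hypothetical: the two label types (`numeric`, `category`), the label record fields, the condition formats, and the names `SCREENERS` and `screen_labels` are assumptions chosen only to show the select, configure, apply, and hand-off sequence.

```python
def numeric_screener(value, condition):
    """Condition is a (low, high) pair; keep values inside the closed range."""
    low, high = condition
    return low <= value <= high


def category_screener(value, condition):
    """Condition is a collection of allowed values."""
    return value in condition


# Step 1: a screener is selected by label type (hypothetical registry).
SCREENERS = {"numeric": numeric_screener, "category": category_screener}


def screen_labels(labels, label_type, condition):
    """Steps 2 and 3: apply the given condition to labels of the given type."""
    screener = SCREENERS[label_type]
    return [
        label for label in labels
        if label["type"] == label_type and screener(label["value"], condition)
    ]


# Example label data (hypothetical).
labels = [
    {"user_id": "u1", "name": "monthly_spend", "type": "numeric", "value": 320.0},
    {"user_id": "u2", "name": "monthly_spend", "type": "numeric", "value": 45.0},
    {"user_id": "u3", "name": "channel", "type": "category", "value": "wechat"},
]

# Step 4: the screened labels are handed to a business scenario,
# here simply the list of users to target.
high_spenders = [label["user_id"] for label in screen_labels(labels, "numeric", (100.0, 500.0))]
```

Keeping screeners in a registry means a new label type only needs one new function and one new dictionary entry, without changes to `screen_labels`.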
7. A private domain business label screening method based on big data, using the big data based private domain business label screening system of any one of claims 1 to 6, characterized in that the method comprises the following steps:
S1: collecting user data from different channels and platforms;
S2: integrating the collected user data and converting it to a uniform format;
S3: integrating multi-dimensional user data to form a complete user portrait;
S4: updating and managing different labels in real time according to business requirements and user characteristics;
S5: screening out the labels that meet the conditions according to the user portrait and the updated, managed labels.
CN202311528120.4A 2023-11-16 2023-11-16 Private domain service label screening system and method based on big data Pending CN117556256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311528120.4A CN117556256A (en) 2023-11-16 2023-11-16 Private domain service label screening system and method based on big data

Publications (1)

Publication Number Publication Date
CN117556256A true CN117556256A (en) 2024-02-13

Family

ID=89818058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311528120.4A Pending CN117556256A (en) 2023-11-16 2023-11-16 Private domain service label screening system and method based on big data

Country Status (1)

Country Link
CN (1) CN117556256A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391603A (en) * 2017-06-30 2017-11-24 北京奇虎科技有限公司 User's portrait method for building up and device for mobile terminal
CN108764663A (en) * 2018-05-15 2018-11-06 广东电网有限责任公司信息中心 A kind of power customer portrait generates the method and system of management
CN111191122A (en) * 2019-12-20 2020-05-22 重庆邮电大学 Learning resource recommendation system based on user portrait
CN112347372A (en) * 2020-10-30 2021-02-09 银盛支付服务股份有限公司 Method for service promotion of financial enterprise based on user portrait scheme
CN113312531A (en) * 2021-04-22 2021-08-27 广州丰石科技有限公司 User portrait identification method based on DPI analysis and decision tree model
CN113822390A (en) * 2021-11-24 2021-12-21 杭州贝嘟科技有限公司 User portrait construction method and device, electronic equipment and storage medium
CN113988221A (en) * 2021-11-26 2022-01-28 泰康保险集团股份有限公司 Insurance user classification model establishing method, classification method, device and equipment
CN114004584A (en) * 2021-10-22 2022-02-01 国网重庆市电力公司电力科学研究院 Power information management method for building user portrait based on data middleboxes
CN114547128A (en) * 2021-12-14 2022-05-27 浙江吉利控股集团有限公司 False order identification method, false order identification system, computer equipment and storage medium
CN114626940A (en) * 2022-03-31 2022-06-14 中国工商银行股份有限公司 Data analysis method and device and electronic equipment
CN115098599A (en) * 2022-06-20 2022-09-23 启明信息技术股份有限公司 Sketch analysis method and system based on multi-dimensional user preference label
CN116401564A (en) * 2023-03-24 2023-07-07 上海电力大学 PCA-based redundant variable screening improvement method and device
CN116595418A (en) * 2023-05-26 2023-08-15 国网上海市电力公司 Multi-dimensional image construction method for scientific and technological achievements
CN116703129A (en) * 2023-08-07 2023-09-05 匠达(苏州)科技有限公司 Intelligent task matching scheduling method and system based on personnel data image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination