WO2020248843A1 - Big data-based portrait analysis method, device, computer equipment and storage medium - Google Patents

Big data-based portrait analysis method, device, computer equipment and storage medium

Info

Publication number
WO2020248843A1
WO2020248843A1 PCT/CN2020/093359 CN2020093359W WO2020248843A1 WO 2020248843 A1 WO2020248843 A1 WO 2020248843A1 CN 2020093359 W CN2020093359 W CN 2020093359W WO 2020248843 A1 WO2020248843 A1 WO 2020248843A1
Authority
WO
WIPO (PCT)
Prior art keywords
analyzed
factor
profile
factors
portrait
Prior art date
Application number
PCT/CN2020/093359
Other languages
English (en)
French (fr)
Inventor
郑立颖
徐亮
金戈
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020248843A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of big data processing, and in particular to a big data-based portrait analysis method, device, computer equipment and storage medium.
  • To better arrange the work of enterprise employees, companies generally use clustering methods to perform cluster analysis on the employees' user portrait data and determine the employees' group attributes. Likewise, to better expand their business, companies generally use clustering methods to perform cluster analysis on the user portrait data of enterprise customers and determine the customers' group attributes.
  • When the corresponding user portrait data is clustered, however, the computation is heavy and time-consuming, and the clustering results are also unsatisfactory.
  • The embodiments of the present application provide a big data-based portrait analysis method, device, computer equipment, and storage medium to solve the problems of heavy computation, long running time, and unsatisfactory clustering results when analyzing user portrait data.
  • a portrait analysis method based on big data including:
  • the to-be-analyzed portrait data includes the to-be-analyzed portrait factors and the to-be-analyzed factor value corresponding to each to-be-analyzed portrait factor;
  • the target user database is queried according to the user group attribute corresponding to each cluster, and the target object corresponding to the user group attribute is obtained.
  • a portrait analysis device based on big data including:
  • the to-be-analyzed portrait data screening module is used to obtain a portrait analysis request and, based on the portrait analysis request, filter from the user portrait database the to-be-analyzed portrait data that meets the target screening conditions, where the to-be-analyzed portrait data includes the to-be-analyzed portrait factors and the to-be-analyzed factor value corresponding to each to-be-analyzed portrait factor;
  • a standardized factor value acquisition module configured to standardize the to-be-analyzed factor value corresponding to the to-be-analyzed profile factor, and obtain the standardized factor value corresponding to the to-be-analyzed profile factor;
  • the weight value acquisition module is configured to use the CRITIC method to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values, and to acquire the weight values corresponding to each profile factor to be analyzed;
  • the to-be-selected portrait factor determination module is used to screen the to-be-analyzed portrait factors according to the weight value corresponding to each of the to-be-analyzed portrait factors to determine the to-be-selected portrait factors;
  • the target portrait factor determination module is configured to reduce the dimensions of the to-be-selected portrait factors by using the PCA method, and determine the first M to-be-selected portrait factors after the dimensionality reduction as target portrait factors;
  • the user group attribute determination module is used to cluster the target profile factors and the corresponding standardized factor values by using the Kmeans clustering algorithm to obtain K clusters, and to determine the corresponding user group attribute according to the standardized factor values corresponding to each cluster;
  • the target object obtaining module is used to query the target user database according to the user group attribute corresponding to each cluster, and obtain the target object corresponding to the user group attribute.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer-readable instructions:
  • the to-be-analyzed portrait data includes the to-be-analyzed portrait factors and the to-be-analyzed factor value corresponding to each to-be-analyzed portrait factor;
  • the target user database is queried according to the user group attribute corresponding to each cluster, and the target object corresponding to the user group attribute is obtained.
  • One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the to-be-analyzed portrait data includes the to-be-analyzed portrait factors and the to-be-analyzed factor value corresponding to each to-be-analyzed portrait factor;
  • the target user database is queried according to the user group attribute corresponding to each cluster, and the target object corresponding to the user group attribute is obtained.
  • The portrait data to be analyzed that meets the target screening conditions is selected from the user portrait database, and the factor values corresponding to the portrait factors to be analyzed are standardized to obtain standardized factor values, which puts every standardized factor on the same scale and ensures the accuracy of subsequent data processing;
  • the CRITIC method is used to perform weight analysis on the portrait factors to be analyzed and the corresponding standardized factor values to obtain the weight value of each portrait factor to be analyzed, which keeps the weight values objective and improves the accuracy of subsequent calculation results;
  • the portrait factors to be analyzed are screened according to their weight values to determine the portrait factors to be selected, which removes unimportant portrait factors and reduces the complexity of subsequent operations;
  • the PCA method is used to reduce the dimensionality of the portrait factors to be selected, and the first M portrait factors to be selected after the dimensionality reduction are determined as the target portrait factors, which simplifies subsequent calculations and reduces computational overhead;
  • the traditional Kmeans clustering algorithm is very sensitive to interfering data, and a small amount of interfering data can greatly affect the clustering result, making it unsatisfactory.
  • Therefore, the CRITIC method and the PCA method are first used to remove the interfering data and reduce the data dimensionality, and the Kmeans clustering algorithm is then used to cluster the target portrait factors and the corresponding standardized factor values to obtain K clusters. The corresponding user group attribute is determined according to the standardized factor values of each cluster, and the user portrait database is queried according to the user group attribute of each cluster to accurately obtain the target objects corresponding to the user group attributes, so as to screen out the target objects that meet the target screening conditions.
  • FIG. 1 is a schematic diagram of an application environment of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 2 is a flowchart of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 3 is another flowchart of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 4 is another flowchart of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 5 is another flowchart of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 6 is another flowchart of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 7 is another flowchart of a portrait analysis method based on big data in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a portrait analysis device based on big data in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • the big data-based portrait analysis method provided by the embodiment of the present application can be applied to the application environment as shown in FIG. 1.
  • the big data-based portrait analysis method is applied to a portrait analysis system.
  • the portrait analysis system includes a client and a server as shown in FIG. 1.
  • The client and the server communicate over a network so that dimensionality reduction is performed on the portrait factors in the user portrait data and the reduced data is then clustered, which improves clustering efficiency.
  • The client, also called the user side, refers to the program that corresponds to the server and provides local services to the user.
  • The client can be installed on, but is not limited to, various personal computers, laptops, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a portrait analysis method based on big data is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
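  • S201 Obtain a portrait analysis request and, based on the portrait analysis request, filter from the user portrait database the to-be-analyzed portrait data that meets the target screening conditions; the to-be-analyzed portrait data includes the to-be-analyzed portrait factors and the to-be-analyzed factor value corresponding to each to-be-analyzed portrait factor.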
  • the portrait analysis request refers to a request for analyzing user portrait data.
  • User portrait database refers to a database that stores original portrait data.
  • the original portrait data refers to the user portrait data corresponding to each user stored in the user portrait database.
  • The original portrait data is user portrait data obtained based on big data methods. For example, if the users corresponding to the original portrait data are enterprise employees, the original portrait data includes, but is not limited to, the basic personal information of each user (such as year and month of birth and birthplace), business behavior information (such as frequently visited places, working hours, working address, and occupation), and dimensional customer information (such as number of customers and customer types).
  • Target screening conditions refer to the conditions used to filter the original portrait data during this portrait analysis to filter out the user portrait data corresponding to the user to be analyzed.
  • The portrait data to be analyzed refers to the portrait data selected from the original portrait data that meets the target screening conditions and is passed on to subsequent analysis.
  • A portrait factor to be analyzed refers to a specific portrait factor in the portrait data to be analyzed and can be understood as a dimension. For example, birth time, hometown, and occupation are three portrait factors to be analyzed.
  • The value of the factor to be analyzed refers to the value corresponding to the portrait factor to be analyzed.
  • Each portrait factor to be analyzed and its factor value form a key-value pair, for example, birth date: January 1990; hometown: Shenzhen, Guangdong; and occupation: Users.
  • The user portrait database pre-stores the original portrait data corresponding to multiple users. The database is queried according to the target screening conditions, and the user portrait data that meets those conditions is selected from the original portrait data as the portrait data to be analyzed.
  • The target screening condition can be set as meeting the performance standard, in which case the original portrait data of the enterprise employees who meet the performance standard is selected from the original portrait data and determined as the portrait data to be analyzed.
  • the profile data to be analyzed includes profile factors to be analyzed and corresponding factor values to be analyzed.
  • S202 Perform standardization processing on the to-be-analyzed factor value corresponding to the to-be-analyzed profile factor, and obtain the standardized factor value corresponding to the to-be-analyzed profile factor.
  • Standardization processing refers to processing the factor values to be analyzed so that they are on the same order of magnitude.
  • The standardized factor value refers to the value of the portrait factor to be analyzed after standardization.
  • Because the standardized factor values are all on the same order of magnitude, subsequent analysis of them avoids errors in the analysis results caused by the diversity of the raw data.
  • For example, the native place may be Shenzhen, Guangdong; Guangzhou, Guangdong; Dongguan, Guangdong; and so on. To facilitate subsequent analysis, each can be converted to a specific value, such as 0001 for Shenzhen, Guangdong, 0002 for Guangzhou, Guangdong, and 0003 for Dongguan, Guangdong.
  • In this embodiment, the factor values to be analyzed are standardized, that is, converted into dimensionless standardized factor values, so that every standardized factor is on the same scale, which ensures the accuracy of subsequent data processing.
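  • As a minimal sketch of the standardization in S202, the example below assumes min-max scaling, since the text does not name a specific formula. The factor names and values are hypothetical.

```python
def min_max_standardize(values):
    """Scale numeric factor values to the dimensionless range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant factor has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical to-be-analyzed factor values, one list per portrait factor.
factor_values = {
    "working_years": [1, 5, 9],
    "customer_count": [20, 120, 220],
}
standardized = {
    name: min_max_standardize(vals) for name, vals in factor_values.items()
}
```

    After scaling, both factors share the same range even though their raw magnitudes differ by more than an order of magnitude.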
  • S203 Use the CRITIC method to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values, and obtain a weight value corresponding to each profile factor to be analyzed.
  • The CRITIC method (Criteria Importance Through Intercriteria Correlation) is an objective weighting method proposed by Diakoulaki.
  • The CRITIC method determines the objective weight of each portrait factor to be analyzed based on two basic concepts: contrast intensity and the conflict between indicators.
  • Contrast intensity indicates how much the values of the evaluated schemes differ within the same indicator and is expressed as the standard deviation: the larger the standard deviation, the greater the difference among the values of the schemes.
  • The conflict between indicators is based on the correlation between indicators, that is, it indicates the conflict between the portrait factors to be analyzed. If two portrait factors to be analyzed are strongly positively correlated, the conflict between the two indicators is low.
  • the weight value refers to the value used to determine the importance of the profile factor to be analyzed after performing weight analysis on the profile factor to be analyzed and the corresponding standardized factor value.
  • the CRITIC method is used to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values, and then the standardized factor value is multiplied by the weight of each profile factor to be analyzed to obtain the weight value of each profile factor to be analyzed.
  • the CRITIC method is used to determine the weight values of the image factors to be analyzed, to ensure that the weight values of the image factors to be analyzed are objective, and to improve the accuracy of subsequent calculation results.
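  • The two CRITIC concepts above can be sketched as follows. This is an illustrative implementation of the commonly published CRITIC formulation (information content = standard deviation × sum of (1 − correlation), normalized to sum to 1), not necessarily the exact computation used in this application. The helper names and sample data are assumptions.

```python
import math


def pstdev(xs):
    """Population standard deviation (the contrast intensity of one factor)."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def critic_weights(columns):
    """columns: dict mapping factor name -> list of standardized values."""
    names = list(columns)
    info = {}
    for j in names:
        contrast = pstdev(columns[j])
        # Conflict: low correlation with the other factors means high conflict.
        conflict = sum(1 - pearson(columns[j], columns[k]) for k in names)
        info[j] = contrast * conflict
    total = sum(info.values())
    if total == 0:
        # Every factor is constant; fall back to equal weights.
        return {j: 1 / len(names) for j in names}
    return {j: info[j] / total for j in names}


# Hypothetical standardized factor values for three factors and three users.
example = {"a": [0.0, 0.5, 1.0], "b": [0.0, 0.5, 1.0], "c": [1.0, 0.0, 1.0]}
weights = critic_weights(example)
```

    In this sample, factors `a` and `b` are perfectly correlated and so receive equal, lower weights, while the uncorrelated factor `c` receives the largest weight.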
  • S204 Screen the image factors to be analyzed according to the weight value corresponding to each image factor to be analyzed, and determine the image factors to be selected.
  • The portrait factor to be selected refers to a factor with a higher weight value obtained after weight analysis of the portrait factors to be analyzed.
  • After the CRITIC method determines the weight of each portrait factor to be analyzed, the factors whose weight value is greater than the preset weight threshold are selected and determined as the portrait factors to be selected. This excludes the factors with low weight values, that is, it filters out the unimportant portrait factors to be analyzed, thereby reducing the amount of calculation and improving analysis efficiency.
  • The preset weight threshold refers to a preset value used to screen the portrait factors to be analyzed by weight value.
  • When the weight value of a portrait factor to be analyzed is greater than the preset weight threshold, the factor is determined as a portrait factor to be selected. For example, in this portrait analysis, if the weight value of the portrait factor "single type" is greater than the preset weight threshold, that factor is selected and determined as a portrait factor to be selected. When the weight value of a portrait factor to be analyzed is less than the preset weight threshold, the factor is not critical to the overall analysis.
  • For example, if the portrait factor to be analyzed is the birth date and its weight value is less than the preset weight threshold, the birth date is not important to this portrait analysis and therefore needs to be deleted.
  • The portrait factors to be analyzed are screened according to the weight value corresponding to each of them, so as to remove unimportant portrait factors to be analyzed, reduce the complexity of subsequent operations, and improve analysis efficiency.
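  • The screening in S204 amounts to a threshold filter over the weight values. A minimal sketch, with hypothetical factor names, weights, and threshold:

```python
def select_factors(weights, threshold):
    """Keep the factors whose weight exceeds the preset weight threshold."""
    return [name for name, w in weights.items() if w > threshold]


# Hypothetical weight values produced by a prior weight analysis.
weights = {"birth_date": 0.05, "customer_type": 0.40, "working_years": 0.55}
selected = select_factors(weights, threshold=0.10)
```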
  • S205 Use the PCA method to reduce the dimensions of the image factors to be selected, and determine the first M image factors to be selected after the dimensionality reduction as the target image factors.
  • The PCA method, that is, principal component analysis, uses the idea of dimensionality reduction to convert multiple indicators into a few comprehensive indicators (i.e., principal components), where the principal components reflect most of the information of the original variables and do not duplicate one another's information.
  • When multiple variables are introduced, the PCA method summarizes the complex factors into several principal components, which simplifies the problem while yielding more scientific and effective data information.
  • Because the CRITIC weight analysis of the profile factors to be analyzed and the corresponding standardized factor values yields only the weight value of each profile factor to be analyzed, the PCA method is also required to reduce the dimensionality of the profile factors to be selected. This obtains the data characteristics of the profile factors to be selected, further reduces the data dimensionality, and lowers the complexity of the clustering operations.
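  • A minimal PCA sketch using NumPy is shown below. It assumes the data has at least two factors and is given as a samples × factors matrix, and that "the first M" means the M components with the largest variance. The function name and sample data are hypothetical.

```python
import numpy as np


def pca_reduce(X, m):
    """Project the rows of X onto the first m principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:m]
    components = eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()
    return centered @ components, explained


# Hypothetical standardized values: 4 users x 2 to-be-selected factors.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
scores, explained = pca_reduce(X, m=1)
```

    Because the two sample factors are perfectly correlated, a single component captures essentially all of the variance.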
  • S206 Use the Kmeans clustering algorithm to cluster the target profile factors and the corresponding standardized factor values, obtain K clusters, and determine the corresponding user group attributes according to the standardized factor values corresponding to each cluster.
  • The Kmeans clustering algorithm takes K points in the space as initial cluster centers and assigns each point to the nearest center. That is, the standardized factor values corresponding to the target profile factors are assigned to the clusters of the different initial cluster centers.
  • The user group attribute is a common attribute used to represent the users corresponding to each cluster. User group attributes differ according to the analysis purpose. For example, if the purpose is to analyze the job type of business personnel, the user group attribute can be the job type, that is, the group portraits are divided into types suitable for handling complaints, types suitable for product promotion, and types suitable for handling after-sales services.
  • The target profile factors obtained after processing by the CRITIC method and the PCA method are several key factors that affect whether performance meets the target (such as four target profile factors A, B, C, and D). In different portrait data to be analyzed, each target profile factor corresponds to a standardized factor value. For example, target profile factor A can take any of the values A1, A2, ..., An, so the standardized factor values for user 1 may be A1, B2, C3, and D1, and those for user 2 may be A2, B2, C1, and D4.
  • After the standardized factor values corresponding to these target profile factors are clustered into K clusters, the corresponding user group attribute is determined according to the standardized factor values of each cluster. Specifically, this is the process of inductively analyzing the standardized factor values of each target profile factor within each cluster to extract their common attributes.
  • Using the Kmeans clustering algorithm to cluster the target profile factors and the corresponding standardized factor values includes: (1) select the standardized factor values corresponding to k target profile factors from the data as the initial cluster centers; (2) calculate the distance from each cluster object (the standardized factor value corresponding to a target profile factor) to each cluster center, and assign the cluster object to the nearest cluster center according to the principle of minimum distance; (3) according to the clustering result, recalculate the centers of the k clusters and use them as the new cluster centers; (4) calculate the standard measurement function (usually the mean square error), and repeat the calculation of new cluster centers until the standard measurement function converges or the maximum number of iterations is reached, then stop; otherwise, continue iterating. This yields K clusters.
  • the factor data table is searched to determine the corresponding user group attributes.
  • the Kmeans clustering algorithm is used to cluster the data processed by the CRITIC method and the PCA method to improve the clustering efficiency to obtain accurate user group attributes.
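  • Steps (1) to (4) can be sketched in plain Python as follows. The convergence check here is an assumption: it stops when the cluster centers no longer change, as a stand-in for the standard measurement function. The `init_centers` parameter and the sample points are hypothetical.

```python
import random


def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    """Cluster tuples of standardized factor values into k clusters."""
    rng = random.Random(seed)
    # Step (1): choose k initial cluster centers.
    centers = list(init_centers) if init_centers else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step (2): assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step (3): recompute each center as the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        # Step (4): stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical standardized factor values for six users and two factors.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centers, clusters = kmeans(points, 2, init_centers=[(0.0, 0.0), (1.0, 1.0)])
```

    Kmeans is sensitive to its initial centers, so a production implementation would typically run several initializations and keep the best result.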
  • S207 Query the target user database according to the user group attribute corresponding to each cluster, and obtain the target object corresponding to the user group attribute.
  • the target user database refers to a database storing user data
  • the target object refers to users who meet the attributes of the user group.
  • Since the user profile database stores all the data of each user, after the clusters are calculated, the user profile database is queried according to the user group attribute corresponding to each cluster to obtain the target objects corresponding to the user group attributes, which provides accurate data for subsequent analysis.
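  • As an illustration of the query in S207, the sketch below uses an in-memory SQLite database. The table name, column names, and rows are hypothetical, since the text does not describe the schema of the target user database.

```python
import sqlite3

# Hypothetical target user database with one group attribute per user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, group_attribute TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("u1", "product promotion"),
        ("u2", "handling complaints"),
        ("u3", "product promotion"),
    ],
)


def find_target_objects(conn, group_attribute):
    """Return the IDs of users whose group attribute matches."""
    rows = conn.execute(
        "SELECT user_id FROM users WHERE group_attribute = ? ORDER BY user_id",
        (group_attribute,),
    ).fetchall()
    return [row[0] for row in rows]


targets = find_target_objects(conn, "product promotion")
```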
  • The portrait data to be analyzed that meets the target screening conditions is selected from the user portrait database, and the factor values corresponding to the portrait factors to be analyzed are standardized to obtain standardized factor values, which puts every standardized factor on the same scale and ensures the accuracy of subsequent data processing. The CRITIC method is used to perform weight analysis on the portrait factors to be analyzed and the corresponding standardized factor values to obtain the weight value of each portrait factor to be analyzed, which keeps the weight values objective and improves the accuracy of subsequent calculation results. The portrait factors to be analyzed are screened according to their weight values to determine the portrait factors to be selected, which removes unimportant portrait factors and reduces the complexity of subsequent calculations.
  • The PCA method is used to reduce the dimensionality of the portrait factors to be selected, and the first M portrait factors to be selected after the dimensionality reduction are determined as the target portrait factors, which simplifies subsequent calculations and reduces computational overhead.
  • The traditional Kmeans clustering algorithm is very sensitive to interfering data, and a small amount of interfering data can greatly affect the clustering result, making it unsatisfactory.
  • Therefore, the CRITIC method and the PCA method are first used to remove the interfering data and reduce the data dimensionality, and the Kmeans clustering algorithm is then used to cluster the target portrait factors and the corresponding standardized factor values to obtain K clusters. The corresponding user group attribute is determined according to the standardized factor values of each cluster, and the user portrait database is queried according to the user group attribute of each cluster to accurately obtain the target objects corresponding to the user group attributes, so as to screen out the target objects that meet the target screening conditions.
  • the target screening conditions include the dimensions to be filtered and the dimensional threshold corresponding to the dimensions to be filtered.
  • Step S201, that is, screening out from the user portrait database, based on the portrait analysis request, the portrait data to be analyzed that meets the target screening conditions, includes:
  • S301 Query the user portrait database based on the portrait analysis request, and determine the original dimension value corresponding to the dimension to be filtered in each original portrait data.
  • the dimensions to be selected refer to the criteria for screening the original image factors to select the image factors that meet the purpose of the image analysis. For example, if the image analysis is to analyze the work performance of the salesperson, the dimensions to be selected include the salesperson Job performance, working age, client type, and client’s work area.
  • the dimension threshold refers to the value corresponding to the dimension to be filtered.
  • the dimension threshold is artificially set. For example, if the dimension to be filtered is the business performance of a salesperson, in order to analyze the work performance of a salesperson with better performance, set the dimension threshold 70% for subsequent analysis of the performance of salespersons with better performance.
  • the original dimension value is the value of the same dimension of the user obtained through the user’s original profile data. For example, the business performance dimension of the salesperson in the original profile data is obtained, and the average business performance of the salesperson is counted as the original dimension value and recorded The original portrait data table.
  • the portrait data of one user can be collected in an original portrait data table and stored in the user portrait database.
  • the original portrait data table includes the original portrait data of each user; the server compares the dimensions in the original portrait data table with the dimensions to be filtered to quickly pick out the matching dimensions, which speeds up the analysis.
  • the original portrait data table is a table used to store the portrait data of the same user, and different users correspond to different original portrait data tables.
  • a conditional query command can be used to query the data in the original portrait data table, so that the original portrait data matching the dimension threshold can be filtered out quickly.
  • the matched original portrait data is determined as the portrait data to be analyzed. This removes the portrait data that does not need to be analyzed and reduces the subsequent computational complexity before the portrait data to be analyzed is analyzed.
  • the original portrait data whose original dimension value matches the dimension threshold is thus determined as the portrait data to be analyzed, so that irrelevant portrait data is removed and the complexity of the subsequent analysis of the portrait data to be analyzed is reduced.
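The threshold filter described above can be sketched as follows. This is an illustrative sketch only: the record layout (one dict per user), the function name `filter_portraits`, and the reading of "matches the dimension threshold" as "greater than or equal to the threshold" are assumptions, not details given in the application.

```python
def filter_portraits(original_portraits, dimension, threshold):
    """Keep the portrait records whose value for `dimension` reaches `threshold`.

    Assumption: a record "matches" when its original dimension value is
    greater than or equal to the dimension threshold; records that lack
    the dimension are dropped.
    """
    return [
        record
        for record in original_portraits
        if record.get(dimension) is not None and record[dimension] >= threshold
    ]


# Hypothetical records: business performance expressed as a fraction.
portraits = [
    {"user": "A", "business_performance": 0.9},
    {"user": "B", "business_performance": 0.5},
    {"user": "C"},
]
to_analyze = filter_portraits(portraits, "business_performance", 0.7)
```

With the 70% threshold from the salesperson example, only user A remains in the data to be analyzed.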
  • step S202, that is, standardizing the factor values to be analyzed corresponding to the portrait factors to be analyzed to obtain the standardized factor values corresponding to the portrait factors to be analyzed, includes:
  • the numerical conversion rule is a rule that converts the factor values to be analyzed into data of the same magnitude. For example, gender is converted to 0/1 and native place is converted to a corresponding code, which keeps the data comparable.
  • the standardized conversion formula is a formula that converts the factor values to be analyzed into data of the same magnitude. Both the numerical conversion rule and the standardized conversion formula convert factor values to be analyzed into standardized factor values of the same magnitude, which ensures the accuracy of subsequent data processing and makes the data analysis result more reliable.
  • categorical data means that the factor value to be analyzed represents a specific category rather than a continuous quantity.
  • categorical data can be, for example, gender, native place, or business type.
  • when the factor value to be analyzed is categorical data, the numerical conversion rule converts it into a corresponding Arabic numeral to obtain the standardized factor value corresponding to the portrait factor to be analyzed. For example, when the gender is male or female, male is converted to 0 and female is converted to 1.
  • continuous data means that the factor values to be analyzed take values over a continuous interval.
  • continuous data includes, but is not limited to, continuous values such as working hours, number of customers, and customer purchase amount. Specifically, when the factor value to be analyzed is continuous data for which larger is better, such as the number of customers or the customer's purchase amount, the l-th portrait factor to be analyzed should be as large as possible.
  • in that case a standardized conversion formula is applied in which N defines the numerical range of the standardized factor value.
  • when the factor value to be analyzed is continuous data for which smaller is better, such as the customer complaint rate or the customer misunderstanding rate, the portrait factor to be analyzed should be as small as possible, and a corresponding standardized conversion formula is applied in which N again defines the numerical range of the standardized factor value.
  • a numerical conversion rule or a standardized conversion formula corresponding to the portrait factor to be analyzed is obtained, so that categorical data is converted into standardized factor values by the numerical conversion rule and continuous data is converted into standardized factor values by the standardized conversion formula. Converting the values of all portrait factors to be analyzed into standardized factor values of the same magnitude makes the factor values comparable, ensures the accuracy of subsequent data processing, and makes the data analysis results more reliable.
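The two conversions can be illustrated as below. The exact standardized conversion formula is not available in this text, so the sketch assumes min-max scaling onto the range 0 to N, mirrored for "smaller is better" factors; that choice, and the names `encode_categorical` and `scale_continuous`, are assumptions rather than the formula of the application.

```python
def encode_categorical(values, mapping):
    """Numerical conversion rule: map each category to its numeric code."""
    return [mapping[v] for v in values]


def scale_continuous(values, n=1.0, larger_is_better=True):
    """Assumed min-max scaling of continuous values onto [0, n].

    For "smaller is better" factors the scale is mirrored so that the
    best raw value still receives the highest standardized value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    if larger_is_better:
        return [n * (v - lo) / (hi - lo) for v in values]
    return [n * (hi - v) / (hi - lo) for v in values]


gender_codes = encode_categorical(["male", "female", "male"], {"male": 0, "female": 1})
customers = scale_continuous([10, 20, 30])
complaint_rate = scale_continuous([10, 20, 30], larger_is_better=False)
```

After both conversions every factor lies on a comparable numeric scale, which is the property the subsequent weighting and clustering steps rely on.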
  • step S203 that is, using the CRITIC method to perform a weight analysis on the image factor to be analyzed and the corresponding standardized factor value, to obtain the weight value corresponding to each image factor to be analyzed, including:
  • S501 Perform correlation calculation based on the standardized factor values corresponding to any two profile factors to be analyzed, and obtain correlation coefficients corresponding to any two profile factors to be analyzed.
  • the correlation coefficient is a statistical indicator used to reflect the close degree of correlation between variables.
  • the correlation coefficient is calculated by the product-moment method, which is based on the deviations of the two variables from their respective means: the product of the two deviations reflects the degree of correlation between the two variables, which ensures the reliability of the obtained correlation coefficient.
  • in the formula for calculating the correlation coefficient, r_{i,j} denotes the correlation coefficient, and i and j index the standardized factor values corresponding to any two portrait factors to be analyzed.
  • the value of the correlation coefficient lies between -1 and 1, and its properties are as follows: 1) when r>0, the two standardized factor values are positively correlated; when r<0, the two variables are negatively correlated; 2) When
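For reference, the standard product-moment (Pearson) correlation coefficient, which matches the description of deviations from the respective means, can be written as follows. The sample count n and the symbols x_{k,i} (the standardized value of factor i for the k-th user) and \bar{x}_i (the mean of factor i) are assumed notation, not symbols taken from the application.

$$
r_{i,j} = \frac{\sum_{k=1}^{n} (x_{k,i} - \bar{x}_i)(x_{k,j} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{k,i} - \bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{n} (x_{k,j} - \bar{x}_j)^2}}
$$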
  • the quantitative index measures the conflict between each portrait factor to be analyzed and the other portrait factors to be analyzed.
  • the quantitative index of the j-th portrait factor to be analyzed can be calculated as $\sum_{i=1}^{m} (1 - r_{i,j})$, where $r_{i,j}$ is the correlation coefficient between the i-th and the j-th portrait factors to be analyzed and m is the number of portrait factors to be analyzed. The stronger the correlation between two portrait factors to be analyzed, the smaller the quantitative index.
  • S503 Calculate the amount of information corresponding to each profile factor to be analyzed by using the quantitative index corresponding to each profile factor to be analyzed.
  • the amount of information is the value used to judge the importance of a portrait factor to be analyzed. Specifically, $C_j = \sigma_j \sum_{b=1}^{m} (1 - r_{b,j})$ is used to calculate the amount of information of each portrait factor to be analyzed, where $C_j$ is the amount of information contained in the j-th portrait factor to be analyzed, b indexes the b-th portrait factor to be analyzed in this embodiment, m is the number of portrait factors to be analyzed, and $\sigma_j$ is the standard deviation of the j-th portrait factor to be analyzed. Generally, the larger $C_j$ is, the more information the j-th portrait factor to be analyzed contains and the greater its relative importance. The amount of information of each portrait factor to be analyzed is thus determined from its quantitative index, which establishes the importance of each portrait factor to be analyzed relative to all portrait factors to be analyzed.
  • S504 Determine a weight value corresponding to each profile factor to be analyzed according to the amount of information corresponding to each profile factor to be analyzed.
  • the weight proportion of each portrait factor to be analyzed is calculated by the formula $W_j = C_j / \sum_{k=1}^{m} C_k$, and the weight value corresponding to each portrait factor to be analyzed is determined according to the standardized factor value corresponding to that factor multiplied by its weight proportion, which ensures that the weight value corresponding to each portrait factor to be analyzed is reliable. Here $W_j$ is the weight value corresponding to the j-th portrait factor to be analyzed, m is the number of portrait factors to be analyzed, and $C_j$ is the amount of information contained in the j-th portrait factor to be analyzed.
  • the correlation is calculated from the standardized factor values corresponding to any two portrait factors to be analyzed, which ensures that the obtained correlation coefficients are reliable; the quantitative index corresponding to each portrait factor to be analyzed is calculated from the correlation coefficients of any two portrait factors to be analyzed; the amount of information corresponding to each portrait factor to be analyzed is calculated from its quantitative index, which determines the importance of each portrait factor to be analyzed relative to all portrait factors to be analyzed; and the weight value corresponding to each portrait factor to be analyzed is determined from its amount of information, which ensures the objectivity of the obtained weight values.
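Steps S501 to S504 can be sketched as below. This follows the commonly published form of the CRITIC method (conflict as the sum of 1 minus the correlation, information as standard deviation times conflict, weights normalized to sum to 1); the function names, the dict-of-columns input, and the use of the sample standard deviation are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither column is constant


def critic_weights(columns):
    """columns: dict mapping factor name -> list of standardized values."""
    names = list(columns)
    n = len(next(iter(columns.values())))
    info = {}
    for j in names:
        col = columns[j]
        mean = sum(col) / n
        # Sample standard deviation of factor j (assumed convention).
        sigma = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        # Quantitative (conflict) index: sum of (1 - r_ij) over all factors;
        # the i == j term contributes approximately zero.
        conflict = sum(1 - pearson(columns[i], col) for i in names)
        info[j] = sigma * conflict  # amount of information C_j
    total = sum(info.values())
    return {j: info[j] / total for j in names}  # weights W_j sum to 1


weights = critic_weights({
    "a": [0.0, 0.5, 1.0],
    "b": [0.0, 0.5, 1.0],
    "c": [1.0, 0.0, 0.5],
})
```

In this toy input, factors `a` and `b` are identical and therefore add no information to each other, so the uncorrelated factor `c` receives the largest weight.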
  • step S204, that is, screening the portrait factors to be analyzed according to the weight value corresponding to each portrait factor to be analyzed and determining the portrait factors to be selected, includes:
  • the weight value sorting result is the result of sorting the portrait factors to be analyzed by weight value. It can be shown on the display device in positive order (weight values from high to low) or in reverse order (weight values from low to high), so that the weight value sorting result is displayed intuitively.
  • the display device is a device used for storage, display, and operation, and may be a computer or the like.
  • the total weight proportion is the ratio of the sum of the weight values of some of the portrait factors to be analyzed to the sum of the weight values of all portrait factors to be analyzed.
  • specifically, the sum of the weight values of the first X (X ≥ 1) portrait factors to be analyzed is divided by the sum of the weight values of all portrait factors to be analyzed to quickly obtain the total weight proportion.
  • the preset proportion threshold is a threshold set in advance to judge whether the sum of the weight values of the first X portrait factors to be analyzed meets the standard. When the total weight proportion is greater than the preset proportion threshold, the first X portrait factors to be analyzed in the weight value sorting result are determined as the portrait factors to be selected, which removes interference factors, reduces the computational dimension, and improves clustering accuracy.
  • the weight values of all portrait factors to be analyzed are sorted to obtain the weight value sorting result; the total weight proportion of the sum of the weight values of the first X portrait factors to be analyzed relative to the sum of the weight values of all portrait factors to be analyzed is calculated from that result; and when the total weight proportion is greater than the preset proportion threshold, the first X portrait factors to be analyzed in the weight value sorting result are determined as the portrait factors to be selected, which removes interference factors, reduces the computational dimension, and improves clustering accuracy.
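The selection rule can be sketched as a cumulative-share cutoff over the sorted weights. The text does not say how X is chosen, so the sketch assumes the smallest X whose share exceeds the threshold; the function name `select_factors` and the sample weights are hypothetical.

```python
def select_factors(weights, proportion_threshold):
    """Return the top-ranked factors whose cumulative weight share first
    exceeds `proportion_threshold`.

    Assumption: X is the smallest prefix length that satisfies the rule.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(weights.values())
    running = 0.0
    selected = []
    for name, weight in ranked:
        selected.append(name)
        running += weight
        if running / total > proportion_threshold:
            break
    return selected


sample_weights = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
top = select_factors(sample_weights, 0.75)
```

Here the first two factors already cover 80% of the total weight, so the remaining low-weight factors are dropped before dimensionality reduction.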
  • step S206, that is, determining the corresponding user group attribute according to the standardized factor values corresponding to each cluster, includes:
  • S701 Obtain a target profile factor corresponding to each cluster cluster, classify the to-be-analyzed factor value corresponding to the target profile factor according to a preset classification rule, and acquire at least two classification attributes.
  • the classification rule is a rule set in advance for classifying the factor values.
  • for example, for years of service, the classification rule can be set to 0-2, 2-4, 4-6, 6-8, and so on, that is, one classification attribute per 2-year interval, so that at least two classification attributes are obtained and the quantity corresponding to each classification attribute can be determined.
  • S702 Count the number of categories of the target profile factors corresponding to each category attribute, sort in descending order according to the number of categories, and obtain a descending sort result.
  • the category count is the number of values of the target portrait factor that fall under the same classification attribute.
  • the descending sorting result lists the category counts of the classification attributes of the same target portrait factor from largest to smallest.
  • the descending sorting result includes each category count and its classification attribute and can be shown on the display device for easy viewing. For example, when the target portrait factor is years of service, suppose the category count for classification attribute 0-2 is 100, for 2-4 is 300, for 4-6 is 250, for 6-8 is 200, and for 8-10 is 150. Sorting in descending order by category count gives the result 300 (2-4), 250 (4-6), 200 (6-8), 150 (8-10), and 100 (0-2).
  • S703 Calculate the target ratio value corresponding to the sum of the numbers of the first S categories and the sum of the numbers of all categories in the descending sorting result.
  • the target ratio value is the proportion of the category counts of some classification attributes to the total of all category counts. It is calculated by the formula $P = \sum_{i=1}^{S} Q_i / \sum_{i=1}^{M} Q_i$, where P is the target ratio value, $Q_i$ is the category count of the i-th classification attribute in the descending sorting result, M is the number of classification attributes, and S is the position of the S-th classification attribute in the descending sorting result.
  • the preset ratio threshold is a value set in advance to judge whether the target ratio value meets the standard.
  • the preset ratio threshold can be set according to actual conditions to limit the range of group attributes in the target portrait factor.
  • when the target ratio value is greater than the preset ratio threshold, the union of the classification attributes corresponding to the first S category counts is determined as the factor group attribute corresponding to the target portrait factor, which excludes the interference of discrete values with the cluster analysis result.
  • for example, with the preset ratio threshold set to 90%, the descending sorting result gives the union of the first 4 classification attributes, namely 2-4, 4-6, 6-8, and 8-10, as the factor group attribute.
  • S705 Based on the factor group attributes corresponding to the target profile factors, determine the user group attributes corresponding to the cluster clusters.
  • the set of factor group attributes corresponding to all target portrait factors is determined as the user group attribute corresponding to the cluster. The user group attribute is the common attribute of the users who meet the target screening conditions, so that it can subsequently be used for business expansion in scenarios such as personnel recruitment and customer assignment.
  • the factor values of the target portrait factor corresponding to each cluster are classified according to the classification rule to determine the quantity corresponding to each classification attribute; the category counts are sorted in descending order and the descending sorting result is shown on the display device; the target ratio value of the sum of the first S category counts to the sum of all category counts in the descending sorting result is calculated; when the target ratio value is greater than the preset ratio threshold, the union of the classification attributes corresponding to the first S category counts is determined as the factor group attribute corresponding to the target portrait factor; and the user group attribute is determined from the factor group attributes corresponding to the target portrait factors, so that it can subsequently be used for business expansion in scenarios such as personnel recruitment and customer assignment.
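Steps S701 to S704 can be sketched for a single target portrait factor as follows. The half-open 2-year bins, the function name `factor_group_attribute`, and the sample data are assumptions. The text says the target ratio must be "greater than" the threshold, but in the worked example the first four attributes cover exactly 900 of 1000 values at a 90% threshold, so this sketch uses "greater than or equal to" to reproduce that example.

```python
from collections import Counter


def factor_group_attribute(values, bin_width, ratio_threshold):
    """Bin the values, rank the bins by count, and return the union of the
    top bins whose cumulative share reaches `ratio_threshold`.

    Assumptions: bins are half-open intervals [low, low + bin_width), and
    reaching the threshold exactly counts as meeting it.
    """
    counts = Counter()
    for v in values:
        low = int(v // bin_width) * bin_width
        counts[(low, low + bin_width)] += 1
    ranked = counts.most_common()  # descending sorting result
    total = sum(counts.values())
    running = 0
    chosen = []
    for interval, count in ranked:
        chosen.append(interval)
        running += count
        if running / total >= ratio_threshold:
            break
    return sorted(chosen)


# Years of service reproducing the counts of the worked example.
years = [1] * 100 + [3] * 300 + [5] * 250 + [7] * 200 + [9] * 150
attribute = factor_group_attribute(years, bin_width=2, ratio_threshold=0.9)
```

The result is the list of intervals 2-4, 4-6, 6-8, and 8-10; the sparsely populated 0-2 interval is excluded as a discrete outlier group.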
  • a big data-based portrait analysis apparatus is provided, which corresponds one-to-one to the big data-based portrait analysis method in the foregoing embodiment.
  • the big data-based portrait analysis apparatus includes a to-be-analyzed portrait data screening module 801, a standardized factor value acquisition module 802, a weight value acquisition module 803, a to-be-selected portrait factor determination module 804, a target portrait factor determination module 805, a user group attribute determination module 806, and a target object acquisition module 807.
  • the functional modules are described in detail as follows:
  • the to-be-analyzed portrait data screening module 801 is used to obtain the portrait analysis request and, based on the portrait analysis request, screen out from the user portrait database the portrait data to be analyzed that meets the target screening conditions; the portrait data to be analyzed includes the portrait factors to be analyzed and the factor value to be analyzed corresponding to each portrait factor to be analyzed.
  • the standardized factor value obtaining module 802 is used to perform standardization processing on the to-be-analyzed factor value corresponding to the to-be-analyzed profile factor, and obtain the standardized factor value corresponding to the to-be-analyzed profile factor.
  • the weight value acquisition module 803 is configured to use the CRITIC method to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values, and to acquire the weight values corresponding to each profile factor to be analyzed.
  • the to-be-selected portrait factor determination module 804 is configured to screen the portrait factors to be analyzed according to the weight value corresponding to each portrait factor to be analyzed, and determine the portrait factors to be selected.
  • the target profile factor determination module 805 is configured to reduce the dimensions of the profile factors to be selected by using the PCA method, and determine the first M profile factors to be selected after dimensionality reduction as target profile factors.
  • the user group attribute determination module 806 is used to cluster the target profile factor and the corresponding standardized factor value by using the Kmeans clustering algorithm, obtain K clusters, and determine the corresponding according to the standardized factor value corresponding to each cluster User group attributes.
  • the target object obtaining module 807 is configured to query the target user database according to the user group attribute corresponding to each cluster, and obtain the target object corresponding to the user group attribute.
  • the target screening condition includes the dimension to be filtered and the dimension threshold corresponding to the dimension to be filtered;
  • the to-be-analyzed portrait data screening module 801 includes: an original dimension value determining unit and a first judgment unit.
  • the original dimension value determining unit is used to query the user portrait database based on the portrait analysis request, and determine the original dimension value corresponding to the dimension to be filtered in each original portrait data.
  • the first judgment unit is configured to determine the original portrait data as the to-be-analyzed portrait data that meets the target screening condition if the original dimension value matches the dimension threshold value.
  • the standardized factor value acquisition module 802 includes: a factor conversion unit, a categorical data conversion unit, and a continuous data conversion unit.
  • the factor conversion unit is used to obtain a numerical conversion rule or a standardized conversion formula corresponding to the image factor to be analyzed.
  • the categorical data conversion unit is configured to, if the value of the factor to be analyzed is categorical data, use the numerical conversion rule to perform the numerical conversion of the value of the factor to be analyzed to obtain the standardized factor value corresponding to the profile factor to be analyzed.
  • the continuous data conversion unit is configured to, if the value of the factor to be analyzed is continuous data, use a standardized conversion formula to standardize the value of the factor to be analyzed, and obtain the standardized factor value corresponding to the profile factor to be analyzed.
  • the weight value acquisition module 803 includes: a correlation coefficient acquisition unit, a quantization index calculation unit, an information amount calculation unit, and a weight value determination unit.
  • the correlation coefficient acquisition unit is configured to perform correlation calculation based on the standardized factor values corresponding to any two profile factors to be analyzed, and obtain correlation coefficients corresponding to any two profile factors to be analyzed.
  • the quantitative index calculation unit is used to calculate the quantitative index corresponding to each image factor to be analyzed according to the correlation coefficients corresponding to any two image factors to be analyzed.
  • the information amount calculation unit is used to calculate the information amount corresponding to each image factor to be analyzed by using the quantitative index corresponding to each image factor to be analyzed.
  • the weight value determining unit is used to determine the weight value corresponding to each portrait factor to be analyzed according to the amount of information corresponding to each portrait factor to be analyzed.
  • the to-be-selected portrait factor determination module 804 includes: a weight value ranking result obtaining unit, a total weight ratio calculation unit, and a second judgment unit.
  • the weight value sorting result obtaining unit is used to sort the weight values corresponding to all the profile factors to be analyzed, and obtain the weight value sorting results.
  • the total weight proportion calculation unit is used to calculate, in the weight value sorting result, the total weight proportion of the sum of the weight values corresponding to the first X portrait factors to be analyzed relative to the sum of the weight values corresponding to all portrait factors to be analyzed.
  • the second judging unit is configured to determine the top X to-be-analyzed portrait factors in the weight value sorting result as the to-be-selected portrait factors if the total weight proportion is greater than the preset proportion threshold.
  • the user group attribute determination module 806 includes: a classification attribute acquisition unit, a descending order result acquisition unit, a target ratio value calculation unit, a factor group attribute determination unit, and a user group attribute determination unit.
  • the classification attribute acquiring unit is used to acquire the target profile factor corresponding to each cluster cluster, classify the to-be-analyzed factor value corresponding to the target profile factor according to preset classification rules, and acquire at least two classification attributes.
  • the descending sort result obtaining unit is used to count the number of categories of the target profile factors corresponding to each category attribute, and perform descending sorting according to the number of categories to obtain the descending sort result.
  • the target proportion value calculation unit is used to calculate the target proportion value corresponding to the sum of the number of the first S categories and the sum of the numbers of all the categories in the descending sorting result.
  • the factor group attribute determining unit is configured to determine the union of the classification attributes corresponding to the first S category quantities as the factor group attribute corresponding to the target profile factor if the target ratio value is greater than the preset ratio threshold value.
  • the user group attribute determining unit is used to determine the user group attribute corresponding to the cluster cluster based on the factor group attribute corresponding to the target portrait factor.
  • the various modules in the above-mentioned big data-based portrait analysis device can be implemented in whole or in part by software, hardware, and combinations thereof.
  • the foregoing modules may be embedded in, or be independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 9.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store the data used or generated while executing the big data-based portrait analysis method, such as the target portrait factors.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a big data-based portrait analysis method.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the big data-based portrait analysis method of the foregoing embodiment is implemented, for example S201-S207 shown in FIG. 2 or the steps shown in FIGS. 3 to 7, which are not repeated here.
  • alternatively, when the processor executes the computer-readable instructions, the functions of each module/unit of the big data-based portrait analysis apparatus of this embodiment are implemented, for example the functions of the to-be-analyzed portrait data screening module 801, the standardized factor value acquisition module 802, the weight value acquisition module 803, the to-be-selected portrait factor determination module 804, the target portrait factor determination module 805, the user group attribute determination module 806, and the target object acquisition module 807 shown in FIG. 8, which are not repeated here.
  • one or more readable storage media storing computer-readable instructions are provided.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors implement the big data-based portrait analysis method of the foregoing embodiment, for example S201-S207 shown in FIG. 2 or the steps shown in FIGS. 3 to 7, which are not repeated here.
  • alternatively, when the processor executes the computer-readable instructions, the functions of each module/unit of the big data-based portrait analysis apparatus of this embodiment are implemented, for example the functions of the to-be-analyzed portrait data screening module 801, the standardized factor value acquisition module 802, the weight value acquisition module 803, the to-be-selected portrait factor determination module 804, the target portrait factor determination module 805, the user group attribute determination module 806, and the target object acquisition module 807 shown in FIG. 8, which are not repeated here.
  • the readable storage medium in this embodiment includes a nonvolatile readable storage medium and a volatile readable storage medium.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Theoretical Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

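A big data-based portrait analysis method and apparatus, a computer device, and a storage medium, relating to the field of big data processing. The method includes: obtaining a portrait analysis request and, based on the portrait analysis request, screening out from a user portrait database the portrait data to be analyzed that meets target screening conditions, the portrait data to be analyzed including portrait factors to be analyzed and a factor value to be analyzed corresponding to each portrait factor to be analyzed (S201); standardizing the factor values to be analyzed to obtain standardized factor values corresponding to the portrait factors to be analyzed (S202); performing weight analysis on the portrait factors to be analyzed and the corresponding standardized factor values by the CRITIC method to obtain a weight value corresponding to each portrait factor to be analyzed (S203); screening the portrait factors to be analyzed according to the weight value corresponding to each portrait factor to be analyzed to determine portrait factors to be selected (S204); reducing the dimensionality of the portrait factors to be selected by the PCA method and determining the first M portrait factors to be selected after dimensionality reduction as target portrait factors (S205); and clustering the target portrait factors and the corresponding standardized factor values by the Kmeans clustering algorithm to obtain user group attributes, and querying the target objects corresponding to the user group attributes. Performing portrait analysis with this method can improve clustering efficiency.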

Description

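Big data-based portrait analysis method and apparatus, computer device, and storage medium
This application is based on, and claims priority to, the Chinese invention application No. 201910517664.8, filed on June 14, 2019 and entitled "Big data-based portrait analysis method and apparatus, computer device, and storage medium".
Technical Field
This application relates to the field of big data processing, and in particular to a big data-based portrait analysis method and apparatus, a computer device, and a storage medium.
Background
To arrange the work of their employees better, companies currently tend to run cluster analysis on the user portrait data of their employees with a clustering method in order to determine the group attributes of the employees. Likewise, to expand their business better, companies tend to run cluster analysis on the user portrait data of their customers with a clustering method in order to determine the group attributes of the customers.
The inventors realized that in current user portrait data analysis the number of portrait factors corresponding to the user portrait data is huge, and these portrait factors span many dimensions or contain similar dimensions. When a classical clustering method is used to cluster the user portrait data corresponding to such a large number of portrait factors, the computation is heavy and time-consuming, and the clustering result is also unsatisfactory.
Summary
The embodiments of this application provide a big data-based portrait analysis method and apparatus, a computer device, and a storage medium, to solve the problems of heavy computation, long running time, and unsatisfactory clustering results in user portrait data analysis.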
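A big data-based portrait analysis method, including:
obtaining a portrait analysis request and, based on the portrait analysis request, screening out from a user portrait database the portrait data to be analyzed that meets target screening conditions, the portrait data to be analyzed including portrait factors to be analyzed and a factor value to be analyzed corresponding to each of the portrait factors to be analyzed;
standardizing the factor values to be analyzed corresponding to the portrait factors to be analyzed to obtain standardized factor values corresponding to the portrait factors to be analyzed;
performing weight analysis on the portrait factors to be analyzed and the corresponding standardized factor values by the CRITIC method to obtain a weight value corresponding to each of the portrait factors to be analyzed;
screening the portrait factors to be analyzed according to the weight value corresponding to each of the portrait factors to be analyzed to determine portrait factors to be selected;
reducing the dimensionality of the portrait factors to be selected by the PCA method, and determining the first M portrait factors to be selected after dimensionality reduction as target portrait factors;
clustering the target portrait factors and the corresponding standardized factor values by the Kmeans clustering algorithm to obtain K clusters, and determining a corresponding user group attribute according to the standardized factor values corresponding to each of the clusters;
querying a target user database according to the user group attribute corresponding to each cluster to obtain target objects corresponding to the user group attribute.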
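A big-data-based profile analysis apparatus, including:
a to-be-analyzed profile data screening module, configured to obtain a profile analysis request and, based on the profile analysis request, screen a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
a standardized factor value obtaining module, configured to standardize the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
a weight value obtaining module, configured to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
a candidate profile factor determining module, configured to screen the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
a target profile factor determining module, configured to reduce the dimensionality of the candidate profile factors using the PCA method, and determine the first M candidate profile factors after dimensionality reduction as target profile factors;
a user group attribute determining module, configured to cluster the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determine a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
a target object obtaining module, configured to query a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.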
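A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining a profile analysis request and, based on the profile analysis request, screening a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
standardizing the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
performing weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
screening the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
reducing the dimensionality of the candidate profile factors using the PCA method, and determining the first M candidate profile factors after dimensionality reduction as target profile factors;
clustering the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determining a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
querying a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.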
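One or more readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a profile analysis request and, based on the profile analysis request, screening a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
standardizing the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
performing weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
screening the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
reducing the dimensionality of the candidate profile factors using the PCA method, and determining the first M candidate profile factors after dimensionality reduction as target profile factors;
clustering the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determining a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
querying a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.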
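In the above big-data-based profile analysis method, apparatus, computer device, and storage medium, profile data to be analyzed that meets the target screening condition is screened from the user profile database, and the factor values to be analyzed are standardized to obtain standardized factor values, so that all standardized factor values are on the same scale, which ensures the accuracy of subsequent data processing. The CRITIC method is used to perform weight analysis on the profile factors to be analyzed and their standardized factor values to obtain a weight value for each factor, which keeps the weight values objective and improves the accuracy of subsequent computation. The profile factors to be analyzed are screened according to their weight values to determine candidate profile factors, which removes unimportant factors and reduces the complexity of subsequent computation. The PCA method is used to reduce the dimensionality of the candidate profile factors, and the first M candidate profile factors after reduction are determined as target profile factors, which simplifies subsequent computation and lowers its cost. The conventional K-means clustering algorithm is very sensitive to noisy data: a small amount of noise can greatly degrade the clustering result. The CRITIC method and the PCA method are therefore applied first to remove noisy data and reduce the data dimensionality. The K-means clustering algorithm is then used to cluster the target profile factors and their standardized factor values into K clusters, a user group attribute is determined from the standardized factor values of each cluster, and the user profile database is queried according to the user group attribute of each cluster to accurately obtain the target objects corresponding to that attribute, thereby screening out target objects that meet the target screening condition.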
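Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the big-data-based profile analysis method in an embodiment of this application;
FIG. 2 is a flowchart of the big-data-based profile analysis method in an embodiment of this application;
FIG. 3 is another flowchart of the big-data-based profile analysis method in an embodiment of this application;
FIG. 4 is another flowchart of the big-data-based profile analysis method in an embodiment of this application;
FIG. 5 is another flowchart of the big-data-based profile analysis method in an embodiment of this application;
FIG. 6 is another flowchart of the big-data-based profile analysis method in an embodiment of this application;
FIG. 7 is another flowchart of the big-data-based profile analysis method in an embodiment of this application;
FIG. 8 is a schematic diagram of the big-data-based profile analysis apparatus in an embodiment of this application;
FIG. 9 is a schematic diagram of the computer device in an embodiment of this application.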
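Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of this application.
The big-data-based profile analysis method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a profile analysis system that includes the client and the server shown in FIG. 1. The client and the server communicate over a network. The system reduces the dimensionality of the profile factors in user profile data and clusters the reduced data, to improve clustering efficiency. The client, also called the user end, is a program that corresponds to the server and provides local services to the customer. The client can be installed on, but is not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a standalone server or as a cluster of multiple servers.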
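In an embodiment, as shown in FIG. 2, a big-data-based profile analysis method is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:
S201: Obtain a profile analysis request and, based on the profile analysis request, screen the user profile database for profile data to be analyzed that meets the target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed.
Here, a profile analysis request is a request to analyze user profile data. The user profile database is the database that stores raw profile data, and raw profile data is the user profile data stored in that database for each user. The raw profile data is obtained by big data methods. For example, if the users are company employees, the raw profile data includes, but is not limited to, each user's basic personal information (such as date of birth and native place), business-activity information (such as frequently visited places, working hours, work address, and occupation), and customer-dimension information (such as number of customers and customer types). The target screening condition is the condition used in the current analysis to screen the raw profile data and select the user profile data of the users to be analyzed. In general, the client carries the target screening condition for the current analysis when it triggers the profile analysis request. Profile data to be analyzed is the profile data selected from the raw profile data that meets the target screening condition. A profile factor to be analyzed is one specific profile factor in the profile data to be analyzed and can be understood as a dimension; for example, date of birth, native place, and occupation are three profile factors to be analyzed. A factor value to be analyzed is the value of a profile factor to be analyzed. The factor and its value form a key-value pair, for example, date of birth: January 1990; native place: Shenzhen, Guangdong; occupation: user.
Specifically, the user profile database stores the raw profile data of multiple users in advance. The database is queried according to the target screening condition, and the user profile data that meets the condition is selected from the raw profile data as the profile data to be analyzed. For example, to analyze the user profile data of employees who met their performance targets, the target screening condition can be set to "performance target met", and the raw profile data of the employees who met their targets is determined as the profile data to be analyzed, which includes the profile factors to be analyzed and the corresponding factor values to be analyzed.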
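S202: Standardize the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed.
Here, standardization is the process of transforming the factor values to be analyzed so that they are on the same order of magnitude. A standardized factor value is the value of a profile factor to be analyzed after standardization. Because all standardized factor values are on the same order of magnitude, later analysis avoids errors caused by heterogeneous data. For example, native place may be Shenzhen, Guangzhou, or Dongguan in Guangdong; to simplify later analysis, these can be converted to specific codes, such as 0001 for Shenzhen, 0002 for Guangzhou, and 0003 for Dongguan.
The factor values to be analyzed vary widely; that is, different factor values are expressed in different units, which hinders data analysis. This embodiment therefore standardizes the factor values to be analyzed, converting them into dimensionless standardized factor values so that all of them are on the same scale, which ensures the accuracy of subsequent data processing.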
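S203: Perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed.
Here, the CRITIC method (Criteria Importance Through Intercriteria Correlation) is an objective weighting method proposed by Diakoulaki. In this embodiment, the CRITIC method determines the objective weight of each profile factor to be analyzed on the basis of two concepts: contrast intensity and conflict between indicators. Contrast intensity expresses how much the values of the same indicator differ across evaluation schemes and is expressed as a standard deviation: the larger the standard deviation, the larger the differences between schemes within that indicator. Conflict between indicators is based on the correlation between indicators and here expresses the conflict between profile factors to be analyzed: if two factors are strongly positively correlated, the conflict between them is low. The weight value is the value, obtained from the weight analysis, that indicates how important a profile factor to be analyzed is.
Specifically, when user profile data is analyzed, each user's profile data contains a very large number of profile factors to be analyzed. If conventional clustering were applied directly to their factor values, the sheer number of factors would make computation difficult and the clustering result inaccurate. In this embodiment, the CRITIC method is used to perform weight analysis on the profile factors to be analyzed and their standardized factor values, and the standardized factor value is then multiplied by the weight share of each profile factor to be analyzed to obtain that factor's weight value, which determines the factor's relative importance. Determining the weight values with the CRITIC method keeps them objective and improves the accuracy of subsequent computation.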
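S204: Screen the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors.
Here, a candidate profile factor is a profile factor to be analyzed that received a relatively high weight value in the weight analysis. After the CRITIC method has determined the weight of each profile factor to be analyzed, the factors whose weight value exceeds a preset weight threshold are selected as candidate profile factors, and the factors with low weight values are excluded. Filtering out unimportant factors reduces the amount of computation and improves analysis efficiency. The preset weight threshold is a value set in advance for screening the profile factors to be analyzed.
Specifically, to reduce computational complexity while still allowing clustering to accurately identify the user group attributes, a profile factor to be analyzed is determined as a candidate profile factor when its weight value is greater than or equal to the preset weight threshold. For example, if in the current analysis the weight value of the factor "deal type" exceeds the preset weight threshold, that factor is selected as a candidate profile factor. When a factor's weight value is below the preset weight threshold, the factor is not critical to the overall analysis. For example, if the weight value of the factor "date of birth" is below the threshold, date of birth is not important for the current analysis, and that factor is removed. Screening the profile factors to be analyzed by weight value removes unimportant factors, reduces the complexity of subsequent computation, and thereby improves analysis efficiency.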
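S205: Reduce the dimensionality of the candidate profile factors using the PCA method, and determine the first M candidate profile factors after dimensionality reduction as target profile factors.
Here, the PCA method (Principal Component Analysis), also called principal component analysis, uses dimensionality reduction to convert many indicators into a few composite indicators (the principal components). Each principal component reflects most of the information in the original variables, and the information in different components does not overlap. PCA takes many variables into account while reducing the complex factors to a few principal components, which simplifies the problem and yields more scientific and effective results.
Specifically, the CRITIC-based weight analysis only produces a weight value for each profile factor to be analyzed. To cluster better, the PCA method is additionally applied to the candidate profile factors to extract their data features, further reducing the data dimensionality and the complexity of the clustering computation.
The PCA steps for the candidate profile factors are as follows. First, arrange the standardized factor values of the candidate profile factors by rows and columns into a matrix L, and zero-mean each row of the matrix (that is, the standardized factor values of the same candidate profile factor across all users) by subtracting the mean of that row. Next, compute the covariance matrix and its eigenvalues and eigenvectors. Then arrange the eigenvectors as rows of a matrix, ordered from top to bottom by decreasing eigenvalue, and take the first Z rows (Z being a positive integer) to form the matrix P. Y = PL is the data after dimensionality reduction, where L is the matrix before reduction and Y is the product of P and the original matrix L. Reducing the dimensionality of the standardized factor values of the candidate profile factors with PCA preserves the information in the original data while effectively reducing the number of dimensions, which simplifies subsequent clustering, lowers its cost, and improves the clustering result.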
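The PCA steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function names `center_rows`, `covariance`, `top_eigenpairs`, and `pca_reduce` are hypothetical, the covariance is normalized by the number of users, and power iteration with deflation is swapped in for a full eigendecomposition to keep the sketch dependency-free. Each row of `L` holds one candidate factor and each column one user, as in the text.

```python
import math

def center_rows(L):
    """Zero-mean each row (one factor across all users)."""
    return [[x - sum(row) / len(row) for x in row] for row in L]

def covariance(Lc):
    """Covariance matrix of the rows of an already centered matrix."""
    n = len(Lc[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / n for r2 in Lc] for r1 in Lc]

def top_eigenpairs(cov, z, iters=1000):
    """Largest z eigenpairs via power iteration with deflation (an assumption)."""
    size = len(cov)
    a = [row[:] for row in cov]
    pairs = []
    for _ in range(z):
        v = [float(i + 1) for i in range(size)]  # arbitrary non-zero start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(size) for j in range(size))
        pairs.append((lam, v))
        # Deflate: remove the component that was just found.
        for i in range(size):
            for j in range(size):
                a[i][j] -= lam * v[i] * v[j]
    return pairs

def pca_reduce(L, z):
    """Return Y = P * L_centered; the rows of P are the top-z eigenvectors."""
    Lc = center_rows(L)
    pairs = top_eigenpairs(covariance(Lc), z)
    P = [v for _, v in pairs]
    Y = [[sum(p[i] * Lc[i][k] for i in range(len(Lc))) for k in range(len(Lc[0]))]
         for p in P]
    return Y, pairs

L = [[2.0, 4.0, 6.0, 8.0],   # factor 1 for four users
     [1.0, 2.0, 3.0, 4.0]]   # factor 2, perfectly correlated with factor 1
Y, pairs = pca_reduce(L, 1)  # two factors collapse into one component
```

In practice a library eigensolver would normally replace `top_eigenpairs`; the hand-written version is only meant to make the "sort eigenvectors by eigenvalue and keep the first Z rows" step visible.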
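S206: Cluster the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determine a corresponding user group attribute according to the standardized factor values corresponding to each cluster.
Here, the K-means clustering algorithm takes K points in space as initial cluster centers and assigns each point to the nearest center; that is, the standardized factor values of the target profile factors are partitioned among the initial cluster centers. A user group attribute expresses the attributes shared by the users in a cluster, and it varies with the purpose of the analysis. For example, if the purpose is to analyze the job types suited to business staff, the user group attribute can be the job type, dividing the profiles into those suited to handling complaints, those suited to product promotion, those suited to after-sales service, and so on. As another example, if the target screening condition concerns meeting performance targets, the target profile factors obtained after CRITIC and PCA processing are the key factors that affect whether targets are met (for example, four target profile factors A, B, C, and D). In different profile data, each target profile factor has a standardized factor value (for example, factor A can take any of the values A1, A2, ..., An; user 1 may have the values A1, B2, C3, and D1, and user 2 the values A2, B2, C1, and D4). After the standardized factor values of these target profile factors are clustered into K clusters, the user group attribute of each cluster is determined from the standardized factor values in that cluster. Concretely, this means summarizing and analyzing the standardized factor values of each target profile factor within the cluster to extract their shared attributes.
Specifically, clustering the target profile factors and their standardized factor values with the K-means algorithm includes the following steps: (1) select the standardized factor values of k target profile factors from the data as the initial cluster centers; (2) compute the distance from each clustering object (the standardized factor values of the target profile factors) to each cluster center and assign the object to the nearest center according to the minimum-distance rule; (3) based on the assignment, recompute the centers of the k clusters and use them as the new cluster centers; (4) compute the criterion function (the mean squared error is commonly used) and repeat steps (2) and (3) until the criterion function converges or the maximum number of iterations is reached, which yields the K clusters. A factor data table is then queried with the standardized factor values that fall within each cluster to determine the corresponding user group attribute. Applying K-means to data already processed by the CRITIC and PCA methods improves clustering efficiency and yields accurate user group attributes.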
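Steps (1) to (4) can be sketched as a small, deterministic K-means loop. This is an illustrative sketch only: the function name `kmeans` is hypothetical, the first k points are taken as the initial centers (the text does not say how they are chosen), squared Euclidean distance is assumed, and "no assignment changed" is used as the convergence test in place of monitoring the mean squared error.

```python
def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of standardized factor values) into k clusters."""
    # (1) Assumption: use the first k points as the initial cluster centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # (2) Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # (4) Stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # (3) Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centers = kmeans(points, 2)
```

Each returned label indexes a cluster; the per-cluster factor values can then be summarized to derive the user group attribute described above.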
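S207: Query the target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.
Here, the target user database is a database that stores user data, and a target object is a user who matches the user group attribute. In this embodiment, because the user profile database stores all data for every user, once the clusters have been computed, the user profile database is queried according to the user group attribute of each cluster to obtain the target objects corresponding to that attribute, which provides accurate data for subsequent analysis.
In the big-data-based profile analysis method provided by this embodiment, profile data to be analyzed that meets the target screening condition is screened from the user profile database, and the factor values to be analyzed are standardized to obtain standardized factor values, so that all standardized factor values are on the same scale, which ensures the accuracy of subsequent data processing. The CRITIC method is used to perform weight analysis on the profile factors to be analyzed and their standardized factor values to obtain a weight value for each factor, which keeps the weight values objective and improves the accuracy of subsequent computation. The profile factors to be analyzed are screened according to their weight values to determine candidate profile factors, which removes unimportant factors and reduces the complexity of subsequent computation. The PCA method is used to reduce the dimensionality of the candidate profile factors, and the first M candidate profile factors after reduction are determined as target profile factors, which simplifies subsequent computation and lowers its cost. The conventional K-means clustering algorithm is very sensitive to noisy data: a small amount of noise can greatly degrade the clustering result. The CRITIC method and the PCA method are therefore applied first to remove noisy data and reduce the data dimensionality. The K-means clustering algorithm is then used to cluster the target profile factors and their standardized factor values into K clusters, a user group attribute is determined from the standardized factor values of each cluster, and the user profile database is queried according to the user group attribute of each cluster to accurately obtain the target objects corresponding to that attribute, thereby screening out target objects that meet the target screening condition.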
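In an embodiment, as shown in FIG. 3, the target screening condition includes a screening dimension and a dimension threshold corresponding to the screening dimension. Step S201, that is, screening the user profile database, based on the profile analysis request, for profile data to be analyzed that meets the target screening condition, includes:
S301: Query the user profile database based on the profile analysis request, and determine the original dimension value corresponding to the screening dimension in each piece of raw profile data.
Here, a screening dimension is the criterion used to screen the raw profile factors and select those relevant to the purpose of the analysis. For example, if the current analysis examines the work performance of salespeople, the screening dimensions include the salesperson's work performance, years of service, customer types, and customers' fields of work. The dimension threshold is the value associated with a screening dimension and is set manually. For example, if the screening dimension is the salesperson's business performance and the goal is to analyze salespeople with good results, the dimension threshold is set to 70%. The original dimension value is the value of that dimension for a user, derived from the user's raw profile data. For example, the business performance dimension is taken from a salesperson's raw profile data, and the average of that salesperson's business performance is computed as the original dimension value and recorded in the raw profile data table.
Specifically, the profile data of each user can be collected in a raw profile data table stored in the user profile database. The server then checks the dimensions in the raw profile data table against the screening dimension, so that matching dimensions are found quickly and the analysis is sped up. A raw profile data table stores the profile data of one user; different users have different raw profile data tables.
S302: If the original dimension value matches the dimension threshold, determine the raw profile data as profile data to be analyzed that meets the target screening condition.
Specifically, after the server obtains the original dimension value corresponding to the screening dimension, it can query the profile data table with a conditional query and, based on the dimension threshold, quickly select the raw profile data whose original dimension value matches the threshold as the profile data to be analyzed. This removes profile data that does not need to be analyzed, reduces the complexity of subsequent computation, and facilitates the subsequent analysis.
In the big-data-based profile analysis method provided by this embodiment, the raw profile data whose original dimension value matches the dimension threshold is determined as the profile data to be analyzed, which removes profile data that does not need to be analyzed, reduces the complexity of subsequent computation, and facilitates the subsequent analysis.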
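Steps S301 and S302 can be illustrated with a short in-memory filter. This is a sketch under assumptions: the record layout, the field name `performance`, and the function name `screen_profiles` are hypothetical, the original dimension value is taken as the mean of the recorded values (following the business-performance example), and "matches the dimension threshold" is interpreted as "greater than or equal to the threshold". A real system would more likely express this as a conditional database query.

```python
from statistics import mean

def screen_profiles(raw_profiles, dimension, threshold):
    """Keep the raw profiles whose original dimension value meets the threshold."""
    selected = []
    for profile in raw_profiles:
        values = profile.get(dimension)
        if not values:
            continue  # no data for this screening dimension
        original_value = mean(values)    # S301: original dimension value
        if original_value >= threshold:  # S302: assumed meaning of "matches"
            selected.append(profile)
    return selected

raw_profiles = [
    {"user": "u1", "performance": [0.8, 0.9]},
    {"user": "u2", "performance": [0.5, 0.6]},
    {"user": "u3", "performance": [0.7, 0.7]},
]
kept = screen_profiles(raw_profiles, "performance", 0.70)
```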
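In an embodiment, as shown in FIG. 4, step S202, that is, standardizing the factor values to be analyzed corresponding to the profile factors to be analyzed to obtain the corresponding standardized factor values, includes:
S401: Obtain the numeric conversion rule or the standardization formula corresponding to the profile factor to be analyzed.
Here, a numeric conversion rule is a rule that converts factor values to be analyzed into data of the same order of magnitude; for example, gender (male/female) is converted to 0/1 and native place is converted to a corresponding code, so that the data is comparable. A standardization formula is a formula that converts factor values to be analyzed into data of the same order of magnitude. Both serve to convert factor values to be analyzed into standardized factor values of the same order of magnitude, which ensures the accuracy of subsequent data processing and makes the analysis results more reliable.
S402: If the factor value to be analyzed is categorical data, convert it with the numeric conversion rule to obtain the standardized factor value corresponding to the profile factor to be analyzed.
Here, categorical data means that the factor value to be analyzed denotes a specific category rather than a continuous quantity, for example gender, native place, or business type. When the factor value to be analyzed is categorical, the numeric conversion rule converts it to a corresponding Arabic numeral to obtain the standardized factor value; for example, for gender, male is converted to 0 and female to 1.
S403: If the factor value to be analyzed is continuous data, standardize it with the standardization formula to obtain the standardized factor value corresponding to the profile factor to be analyzed.
Continuous data means that the factor value to be analyzed lies in a continuous range; it includes, but is not limited to, working hours, number of customers, and customer purchase amount. Specifically, when the factor value to be analyzed is continuous and larger values are better, as with the number of customers or the customer purchase amount, that is, when the l-th profile factor to be analyzed should be as large as possible, the standardization formula is
Figure PCTCN2020093359-appb-000001
where N limits the numeric range of the standardized factor value. When the factor value to be analyzed is continuous and smaller values are better, as with the customer complaint rate or the customer misunderstanding rate, that is, when the l-th profile factor to be analyzed should be as small as possible, the standardization formula is
Figure PCTCN2020093359-appb-000002
where N limits the numeric range of the standardized factor value.
In the big-data-based profile analysis method provided by this embodiment, the numeric conversion rule or standardization formula corresponding to each profile factor to be analyzed is obtained, so that categorical data is converted into standardized factor values by the numeric conversion rule and continuous data by the standardization formula. Converting the values of the profile factors to be analyzed into standardized factor values of the same order of magnitude makes them comparable, ensures the accuracy of subsequent data processing, and makes the analysis results more reliable.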
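The two branches of S402 and S403 can be sketched as follows. The formula images of the original are not reproduced in this text, so the continuous branch below assumes ordinary min-max scaling into the range [0, N], reversed for factors where smaller is better; this is an assumption, not the formula of the application. The function names `encode_categories` and `scale_continuous` are likewise hypothetical.

```python
def encode_categories(values):
    """S402: map each distinct category to an integer code (1, 2, 3, ...)."""
    codes = {}
    return [codes.setdefault(v, len(codes) + 1) for v in values]

def scale_continuous(values, n=1.0, larger_is_better=True):
    """S403 (assumed form): min-max scaling of continuous values into [0, n]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant factor carries no contrast
    if larger_is_better:
        return [n * (v - lo) / (hi - lo) for v in values]
    return [n * (hi - v) / (hi - lo) for v in values]

native_place = encode_categories(["Shenzhen", "Guangzhou", "Shenzhen"])
customer_count = scale_continuous([10, 20, 30])                          # larger is better
complaint_rate = scale_continuous([10, 20, 30], larger_is_better=False)  # smaller is better
```

After this step every factor, categorical or continuous, is a plain number, which is what the later correlation, PCA, and clustering steps require.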
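In an embodiment, as shown in FIG. 5, step S203, that is, performing weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method to obtain a weight value corresponding to each profile factor to be analyzed, includes:
S501: Compute the correlation between the standardized factor values of any two profile factors to be analyzed, to obtain the correlation coefficient of those two factors.
Here, the correlation coefficient is a statistical indicator of how closely two variables are related. It is computed by the product-moment method: based on the deviations of the two variables from their respective means, the product of the two deviations reflects the degree of correlation between the variables, which makes the resulting coefficient reliable. The formula for the correlation coefficient is
Figure PCTCN2020093359-appb-000003
where r_{i,j} is the correlation coefficient, and i and j denote the standardized factor values of any two profile factors to be analyzed. The correlation coefficient lies between -1 and 1 and has the following properties: (1) r > 0 indicates that the two sets of standardized factor values are positively correlated, and r < 0 indicates that they are negatively correlated; (2) |r| = 1 indicates that they are perfectly linearly correlated, that is, functionally related; (3) r = 0 indicates that there is no linear correlation between them. When 0 < |r| < 1, there is some degree of linear correlation: the closer |r| is to 1, the stronger the linear relationship, and the closer |r| is to 0, the weaker it is.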
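S502: Compute the quantization index of each profile factor to be analyzed from the correlation coefficients of every pair of profile factors to be analyzed.
The quantization index measures how strongly each profile factor to be analyzed conflicts with the other profile factors to be analyzed. Specifically, the quantization index of each factor can be computed by
Figure PCTCN2020093359-appb-000004
where r_{i,j} is the correlation coefficient between the i-th and the j-th profile factors to be analyzed. The stronger the correlation between two profile factors to be analyzed, the smaller the quantization index.
S503: Compute the information content of each profile factor to be analyzed from its quantization index.
The information content is the value used to judge how important a profile factor to be analyzed is. Specifically, the information content of each factor is computed by
Figure PCTCN2020093359-appb-000005
where C_j is the information content of the j-th profile factor to be analyzed, b denotes the b-th profile factor to be analyzed in this embodiment, and δ_j is the standard deviation. In general, the larger C_j is, the more information the j-th factor contains and the greater its relative importance. Determining the information content of each factor from its quantization index establishes how important each factor is relative to all the profile factors to be analyzed.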
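S504: Determine the weight value of each profile factor to be analyzed from its information content.
Specifically, the weight share of each profile factor to be analyzed is computed with the weight share formula
W_j = \frac{C_j}{\sum_{k=1}^{m} C_k}
and the standardized factor value of each factor is multiplied by that factor's weight share to determine the factor's weight value, which makes the weight values reliable. Here, W_j is the weight value corresponding to the profile factor to be analyzed, m is the total number of profile factors to be analyzed, and C_j is the information content of the j-th profile factor to be analyzed.
In the big-data-based profile analysis method provided by this embodiment, the correlation between the standardized factor values of any two profile factors to be analyzed is computed, so that the resulting correlation coefficients are reliable; the quantization index of each factor is computed from the pairwise correlation coefficients; the information content of each factor is computed from its quantization index, which establishes how important each factor is relative to all factors; and the weight value of each factor is determined from its information content, which keeps the weight values objective.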
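Steps S501 to S504 can be sketched end to end. Because several formula images are not reproduced in this text, the sketch assumes the standard CRITIC formulation: the conflict of factor j is the sum over all factors i of (1 - r_ij), the information content is C_j = δ_j times that sum with δ_j taken as the population standard deviation, and the weight share is C_j divided by the sum of all C. The function names `pearson` and `critic_weights` are hypothetical, and the sketch stops at the weight share rather than multiplying it by individual factor values.

```python
import math

def pearson(x, y):
    """S501: product-moment correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant series: treat as uncorrelated
    return cov / math.sqrt(vx * vy)

def critic_weights(columns):
    """Return one weight share per factor; each column holds one factor's values."""
    stds = []
    for col in columns:
        mu = sum(col) / len(col)
        stds.append(math.sqrt(sum((v - mu) ** 2 for v in col) / len(col)))
    # S502: conflict (quantization index) of factor j with all factors.
    conflict = [sum(1 - pearson(ci, cj) for ci in columns) for cj in columns]
    # S503: information content = contrast intensity * conflict.
    info = [s * c for s, c in zip(stds, conflict)]
    total = sum(info)
    # S504: weight share of each factor.
    return [c / total for c in info]

factor_a = [0, 1, 2, 3]
factor_b = [0, 1, 2, 3]   # duplicates factor_a, so it adds little information
factor_c = [3, 2, 1, 0]   # conflicts with both, so it receives more weight
weights = critic_weights([factor_a, factor_b, factor_c])
```

The example shows the intended behavior of the method: two redundant factors share weight, while a factor that disagrees with the rest is treated as more informative.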
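In an embodiment, as shown in FIG. 6, step S204, that is, screening the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed to determine candidate profile factors, includes:
S601: Sort the weight values of all profile factors to be analyzed to obtain a weight ranking.
Here, the weight ranking is the result of sorting the profile factors to be analyzed by weight value. The ranking can be shown on a display device in forward order (from the highest weight value to the lowest) or in reverse order (from the lowest to the highest), so that it can be read at a glance. The display device is a device used for storage, display, and computation, such as a computer.
S602: Compute the total weight share, that is, the sum of the weight values of the first X profile factors to be analyzed in the weight ranking relative to the sum of the weight values of all profile factors to be analyzed.
Here, the total weight share is the proportion of the total weight accounted for by a subset of the profile factors to be analyzed. Specifically, the sum of the weights of the first X (X ≥ 1) factors is divided by the sum of the weight values of all factors to quickly obtain the total weight share.
S603: If the total weight share is greater than a preset share threshold, determine the first X profile factors to be analyzed in the weight ranking as candidate profile factors.
Here, the preset share threshold is a threshold set in advance to judge whether the sum of the weight values of the first X profile factors to be analyzed is sufficient. Specifically, when the total weight share exceeds the preset share threshold, the first X factors in the weight ranking are determined as candidate profile factors, which removes interfering factors, reduces the number of dimensions, and improves clustering accuracy.
In the big-data-based profile analysis method provided by this embodiment, the weight values of all profile factors to be analyzed are sorted to obtain a weight ranking; the total weight share of the first X factors in the ranking relative to all factors is computed; and when the total weight share exceeds the preset share threshold, the first X factors in the ranking are determined as candidate profile factors, which removes interfering factors, reduces the number of dimensions, and improves clustering accuracy.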
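A short sketch of S601 to S603 follows. The text does not say how X is chosen, so the sketch assumes the smallest X whose cumulative weight share exceeds the preset share threshold; the function name `select_candidate_factors`, the factor names, and the weight values are illustrative only.

```python
def select_candidate_factors(weights, share_threshold):
    """Return the top-ranked factors whose cumulative weight share exceeds the threshold."""
    # S601: rank the factors from the highest weight value to the lowest.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(w for _, w in ranked)
    running = 0.0
    selected = []
    for name, w in ranked:
        selected.append(name)
        running += w
        # S602/S603: stop at the first X whose total weight share passes the threshold.
        if running / total > share_threshold:
            break
    return selected

weights = {"deal_type": 0.5, "customer_count": 0.3,
           "working_hours": 0.15, "date_of_birth": 0.05}
candidates = select_candidate_factors(weights, 0.9)
```

With a 90% threshold the low-weight factor `date_of_birth` is dropped, which mirrors the example in S204 where date of birth is judged unimportant.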
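In an embodiment, as shown in FIG. 7, the part of step S206 that determines a corresponding user group attribute according to the standardized factor values corresponding to each cluster includes:
S701: Obtain the target profile factors corresponding to each cluster, and classify the factor values to be analyzed corresponding to the target profile factors according to a preset classification rule, to obtain at least two classification attributes.
Here, the classification rule is a rule set in advance for classifying the factor values. For example, when the target profile factor is working time, the classification rule can define the ranges 0-2, 2-4, 4-6, 6-8, and so on, that is, one classification attribute for every two years, which yields at least two classification attributes whose counts can then be determined.
S702: Count the category count of the target profile factor for each classification attribute, and sort the category counts in descending order to obtain a descending ranking.
Here, the category count is the number of values of the target profile factor that fall under the same classification attribute. The descending ranking lists the classification attributes of one target profile factor from the largest count to the smallest; it includes each category count and the corresponding classification attribute and can be shown on a display device for easy inspection. For example, when the target profile factor is working time, suppose the classification attribute 0-2 has a category count of 100, 2-4 has 300, 4-6 has 250, 6-8 has 200, and 8-10 has 150. Sorting by category count in descending order gives the (count, range) pairs (300, 2-4), (250, 4-6), (200, 6-8), (150, 8-10), and (100, 0-2).
S703: Compute the target ratio of the sum of the first S category counts in the descending ranking to the sum of all category counts.
Here, the target ratio is the proportion of the overall count accounted for by a subset of the categories. It is computed with the target ratio formula
P = \frac{\sum_{i=1}^{S} Q_i}{\sum_{i=1}^{M} Q_i}
where P is the target ratio, Q_i is the category count of the i-th classification attribute, M is the number of classification attributes, and S is the position of the S-th classification attribute in the descending ranking.
S704: If the target ratio is greater than a preset ratio threshold, determine the union of the classification attributes corresponding to the first S category counts as the factor group attribute corresponding to the target profile factor.
Here, the preset ratio threshold is a value set in advance to judge whether the target ratio is sufficient. It can be set according to the actual situation to bound the range of the group attribute within the target profile factor.
Specifically, when the target ratio exceeds the preset ratio threshold, the union of the classification attributes corresponding to the first S category counts is determined as the factor group attribute of the target profile factor, which prevents outlying values from interfering with the cluster analysis result. For example, when the target profile factor is working time and the preset ratio threshold is set to 90%, the descending ranking leads to the union of the first four classification attributes, that is, 2-4, 4-6, 6-8, and 8-10, being determined as the factor group attribute.
S705: Determine the user group attribute corresponding to the cluster based on the factor group attributes of the target profile factors.
Specifically, the set of the factor group attributes of all target profile factors is determined as the user group attribute of the cluster. The user group attribute captures the attributes shared by the users who meet the target screening condition, so that it can support later business expansion, for example in recruitment or customer assignment.
In the big-data-based profile analysis method provided by this embodiment, the factor values to be analyzed of the target profile factors of each cluster are classified according to the preset classification rule to determine the count for each classification attribute; the category counts are sorted in descending order and the ranking is shown on a display device; the target ratio of the sum of the first S category counts to the sum of all category counts is computed; when the target ratio exceeds the preset ratio threshold, the union of the classification attributes corresponding to the first S category counts is determined as the factor group attribute of the target profile factor; and the user group attribute of the cluster is determined from the factor group attributes of the target profile factors, so that it can support later business expansion, for example in recruitment or customer assignment.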
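The working-time example in S702 to S704 can be reproduced with the following sketch. The function name `factor_group_attribute` is hypothetical, and the smallest qualifying S is assumed. One judgment call is labeled in the code: in the worked example the top four ranges account for exactly 90% of the values and are accepted at a 90% threshold, so the comparison below is inclusive (`>=`), even though the step text says "greater than".

```python
def factor_group_attribute(category_counts, ratio_threshold):
    """Return the top-S classification attributes covering the threshold share of values."""
    # S702: sort the classification attributes by category count, descending.
    ranked = sorted(category_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(count for _, count in ranked)
    running = 0
    selected = []
    for attribute, count in ranked:
        selected.append(attribute)
        running += count
        # S703/S704: target ratio P = (sum of first S counts) / (sum of all counts).
        # Assumption: inclusive comparison, to match the worked example in the text.
        if running / total >= ratio_threshold:
            break
    return selected

working_time_counts = {"0-2": 100, "2-4": 300, "4-6": 250, "6-8": 200, "8-10": 150}
group_attribute = factor_group_attribute(working_time_counts, 0.9)
```

Repeating this for every target profile factor of a cluster and collecting the results gives the user group attribute described in S705.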
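It should be understood that the step numbers in the above embodiments do not imply an order of execution. The order in which the processes are executed is determined by their functions and internal logic, and the numbering does not limit the implementation of the embodiments of this application in any way.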
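In an embodiment, a big-data-based profile analysis apparatus is provided, which corresponds one-to-one to the big-data-based profile analysis method in the above embodiments. As shown in FIG. 8, the apparatus includes a to-be-analyzed profile data screening module 801, a standardized factor value obtaining module 802, a weight value obtaining module 803, a candidate profile factor determining module 804, a target profile factor determining module 805, a user group attribute determining module 806, and a target object obtaining module 807. The functional modules are described in detail as follows:
The to-be-analyzed profile data screening module 801 is configured to obtain a profile analysis request and, based on the profile analysis request, screen the user profile database for profile data to be analyzed that meets the target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed.
The standardized factor value obtaining module 802 is configured to standardize the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain the corresponding standardized factor values.
The weight value obtaining module 803 is configured to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed.
The candidate profile factor determining module 804 is configured to screen the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors.
The target profile factor determining module 805 is configured to reduce the dimensionality of the candidate profile factors using the PCA method, and determine the first M candidate profile factors after dimensionality reduction as target profile factors.
The user group attribute determining module 806 is configured to cluster the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determine a corresponding user group attribute according to the standardized factor values corresponding to each cluster.
The target object obtaining module 807 is configured to query the target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.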
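Preferably, the target screening condition includes a screening dimension and a dimension threshold corresponding to the screening dimension, and the to-be-analyzed profile data screening module 801 includes an original dimension value determining unit and a first judging unit.
The original dimension value determining unit is configured to query the user profile database based on the profile analysis request, and determine the original dimension value corresponding to the screening dimension in each piece of raw profile data.
The first judging unit is configured to, if the original dimension value matches the dimension threshold, determine the raw profile data as profile data to be analyzed that meets the target screening condition.
Preferably, the standardized factor value obtaining module 802 includes a factor conversion unit, a categorical data conversion unit, and a continuous data conversion unit.
The factor conversion unit is configured to obtain the numeric conversion rule or the standardization formula corresponding to the profile factor to be analyzed.
The categorical data conversion unit is configured to, if the factor value to be analyzed is categorical data, convert it with the numeric conversion rule to obtain the standardized factor value corresponding to the profile factor to be analyzed.
The continuous data conversion unit is configured to, if the factor value to be analyzed is continuous data, standardize it with the standardization formula to obtain the standardized factor value corresponding to the profile factor to be analyzed.
Preferably, the weight value obtaining module 803 includes a correlation coefficient obtaining unit, a quantization index computing unit, an information content computing unit, and a weight value determining unit.
The correlation coefficient obtaining unit is configured to compute the correlation between the standardized factor values of any two profile factors to be analyzed, to obtain the correlation coefficient of those two factors.
The quantization index computing unit is configured to compute the quantization index of each profile factor to be analyzed from the correlation coefficients of every pair of profile factors to be analyzed.
The information content computing unit is configured to compute the information content of each profile factor to be analyzed from its quantization index.
The weight value determining unit is configured to determine the weight value of each profile factor to be analyzed from its information content.
Preferably, the candidate profile factor determining module 804 includes a weight ranking obtaining unit, a total weight share computing unit, and a second judging unit.
The weight ranking obtaining unit is configured to sort the weight values of all profile factors to be analyzed to obtain a weight ranking.
The total weight share computing unit is configured to compute the total weight share of the sum of the weight values of the first X profile factors to be analyzed in the weight ranking relative to the sum of the weight values of all profile factors to be analyzed.
The second judging unit is configured to, if the total weight share is greater than the preset share threshold, determine the first X profile factors to be analyzed in the weight ranking as candidate profile factors.
Preferably, the user group attribute determining module 806 includes a classification attribute obtaining unit, a descending ranking obtaining unit, a target ratio computing unit, a factor group attribute determining unit, and a user group attribute determining unit.
The classification attribute obtaining unit is configured to obtain the target profile factors corresponding to each cluster, and classify the factor values to be analyzed corresponding to the target profile factors according to a preset classification rule, to obtain at least two classification attributes.
The descending ranking obtaining unit is configured to count the category count of the target profile factor for each classification attribute, and sort the category counts in descending order to obtain a descending ranking.
The target ratio computing unit is configured to compute the target ratio of the sum of the first S category counts in the descending ranking to the sum of all category counts.
The factor group attribute determining unit is configured to, if the target ratio is greater than the preset ratio threshold, determine the union of the classification attributes corresponding to the first S category counts as the factor group attribute corresponding to the target profile factor.
The user group attribute determining unit is configured to determine the user group attribute corresponding to the cluster based on the factor group attributes of the target profile factors.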
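For the specific limitations of the big-data-based profile analysis apparatus, refer to the limitations of the big-data-based profile analysis method above; they are not repeated here. Each module of the apparatus can be implemented wholly or partly in software, hardware, or a combination of the two. The modules can be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them to perform the corresponding operations.
In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database stores data used or generated while the big-data-based profile analysis method is executed, such as the target profile factors. The network interface is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a big-data-based profile analysis method.
In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the big-data-based profile analysis method of the above embodiments, for example S201 to S207 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 7; to avoid repetition, details are not given again here. Alternatively, when executing the computer-readable instructions, the processor implements the functions of the modules and units of the big-data-based profile analysis apparatus embodiment, for example the functions of the to-be-analyzed profile data screening module 801, the standardized factor value obtaining module 802, the weight value obtaining module 803, the candidate profile factor determining module 804, the target profile factor determining module 805, the user group attribute determining module 806, and the target object obtaining module 807 shown in FIG. 8; to avoid repetition, details are not given again here.
In an embodiment, one or more readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the big-data-based profile analysis method of the above embodiments, for example S201 to S207 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 7; to avoid repetition, details are not given again here. Alternatively, when the computer-readable instructions are executed, the processor implements the functions of the modules and units of the big-data-based profile analysis apparatus embodiment, for example the functions of modules 801 to 807 shown in FIG. 8; to avoid repetition, details are not given again here. The readable storage media in this embodiment include non-volatile readable storage media and volatile readable storage media.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by computer-readable instructions that direct the relevant hardware. The computer-readable instructions can be stored in a non-volatile readable storage medium or in a volatile readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
A person skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is only an example. In practice, the functions can be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus can be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features can be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and they shall all fall within the protection scope of this application.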

Claims (20)

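  1. A big-data-based profile analysis method, including:
    obtaining a profile analysis request and, based on the profile analysis request, screening a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
    standardizing the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
    performing weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
    screening the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
    reducing the dimensionality of the candidate profile factors using the PCA method, and determining the first M candidate profile factors after dimensionality reduction as target profile factors;
    clustering the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determining a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
    querying a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.
  2. The big-data-based profile analysis method according to claim 1, where the target screening condition includes a screening dimension and a dimension threshold corresponding to the screening dimension;
    the screening of the user profile database, based on the profile analysis request, for profile data to be analyzed that meets the target screening condition includes:
    querying the user profile database based on the profile analysis request, and determining the original dimension value corresponding to the screening dimension in each piece of raw profile data;
    if the original dimension value matches the dimension threshold, determining the raw profile data as profile data to be analyzed that meets the target screening condition.
  3. The big-data-based profile analysis method according to claim 1, where the standardizing of the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain the corresponding standardized factor values, includes:
    obtaining a numeric conversion rule or a standardization formula corresponding to the profile factor to be analyzed;
    if the factor value to be analyzed is categorical data, converting the factor value to be analyzed with the numeric conversion rule, to obtain the standardized factor value corresponding to the profile factor to be analyzed;
    if the factor value to be analyzed is continuous data, standardizing the factor value to be analyzed with the standardization formula, to obtain the standardized factor value corresponding to the profile factor to be analyzed.
  4. The big-data-based profile analysis method according to claim 1, where the performing of weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed, includes:
    computing the correlation between the standardized factor values of any two of the profile factors to be analyzed, to obtain the correlation coefficient of those two profile factors to be analyzed;
    computing a quantization index of each profile factor to be analyzed from the correlation coefficients of every pair of profile factors to be analyzed;
    computing the information content of each profile factor to be analyzed from its quantization index;
    determining the weight value of each profile factor to be analyzed from its information content.
  5. The big-data-based profile analysis method according to claim 1, where the screening of the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors, includes:
    sorting the weight values of all the profile factors to be analyzed to obtain a weight ranking;
    computing the total weight share of the sum of the weight values of the first X profile factors to be analyzed in the weight ranking relative to the sum of the weight values of all the profile factors to be analyzed;
    if the total weight share is greater than a preset share threshold, determining the first X profile factors to be analyzed in the weight ranking as candidate profile factors.
  6. The big-data-based profile analysis method according to claim 1, where the determining of a corresponding user group attribute according to the standardized factor values corresponding to each cluster includes:
    obtaining the target profile factors corresponding to each cluster, and classifying the factor values to be analyzed corresponding to the target profile factors according to a preset classification rule, to obtain at least two classification attributes;
    counting the category count of the target profile factor for each classification attribute, and sorting the category counts in descending order to obtain a descending ranking;
    computing the target ratio of the sum of the first S category counts in the descending ranking to the sum of all category counts;
    if the target ratio is greater than a preset ratio threshold, determining the union of the classification attributes corresponding to the first S category counts as the factor group attribute corresponding to the target profile factor;
    determining the user group attribute corresponding to the cluster based on the factor group attributes of the target profile factors.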
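  7. A big-data-based profile analysis apparatus, including:
    a to-be-analyzed profile data screening module, configured to obtain a profile analysis request and, based on the profile analysis request, screen a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
    a standardized factor value obtaining module, configured to standardize the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
    a weight value obtaining module, configured to perform weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
    a candidate profile factor determining module, configured to screen the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
    a target profile factor determining module, configured to reduce the dimensionality of the candidate profile factors using the PCA method, and determine the first M candidate profile factors after dimensionality reduction as target profile factors;
    a user group attribute determining module, configured to cluster the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determine a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
    a target object obtaining module, configured to query a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.
  8. The big-data-based profile analysis apparatus according to claim 7, where the target screening condition includes a screening dimension and a dimension threshold corresponding to the screening dimension, and the to-be-analyzed profile data screening module includes:
    an original dimension value determining unit, configured to query the user profile database based on the profile analysis request, and determine the original dimension value corresponding to the screening dimension in each piece of raw profile data;
    a first judging unit, configured to, if the original dimension value matches the dimension threshold, determine the raw profile data as profile data to be analyzed that meets the target screening condition.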
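  9. A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
    obtaining a profile analysis request and, based on the profile analysis request, screening a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
    standardizing the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
    performing weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
    screening the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
    reducing the dimensionality of the candidate profile factors using the PCA method, and determining the first M candidate profile factors after dimensionality reduction as target profile factors;
    clustering the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determining a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
    querying a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.
  10. The computer device according to claim 9, where the target screening condition includes a screening dimension and a dimension threshold corresponding to the screening dimension;
    the screening of the user profile database, based on the profile analysis request, for profile data to be analyzed that meets the target screening condition includes:
    querying the user profile database based on the profile analysis request, and determining the original dimension value corresponding to the screening dimension in each piece of raw profile data;
    if the original dimension value matches the dimension threshold, determining the raw profile data as profile data to be analyzed that meets the target screening condition.
  11. The computer device according to claim 9, where the standardizing of the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain the corresponding standardized factor values, includes:
    obtaining a numeric conversion rule or a standardization formula corresponding to the profile factor to be analyzed;
    if the factor value to be analyzed is categorical data, converting the factor value to be analyzed with the numeric conversion rule, to obtain the standardized factor value corresponding to the profile factor to be analyzed;
    if the factor value to be analyzed is continuous data, standardizing the factor value to be analyzed with the standardization formula, to obtain the standardized factor value corresponding to the profile factor to be analyzed.
  12. The computer device according to claim 9, where the performing of weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed, includes:
    computing the correlation between the standardized factor values of any two of the profile factors to be analyzed, to obtain the correlation coefficient of those two profile factors to be analyzed;
    computing a quantization index of each profile factor to be analyzed from the correlation coefficients of every pair of profile factors to be analyzed;
    computing the information content of each profile factor to be analyzed from its quantization index;
    determining the weight value of each profile factor to be analyzed from its information content.
  13. The computer device according to claim 9, where the screening of the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors, includes:
    sorting the weight values of all the profile factors to be analyzed to obtain a weight ranking;
    computing the total weight share of the sum of the weight values of the first X profile factors to be analyzed in the weight ranking relative to the sum of the weight values of all the profile factors to be analyzed;
    if the total weight share is greater than a preset share threshold, determining the first X profile factors to be analyzed in the weight ranking as candidate profile factors.
  14. The computer device according to claim 9, where the determining of a corresponding user group attribute according to the standardized factor values corresponding to each cluster includes:
    obtaining the target profile factors corresponding to each cluster, and classifying the factor values to be analyzed corresponding to the target profile factors according to a preset classification rule, to obtain at least two classification attributes;
    counting the category count of the target profile factor for each classification attribute, and sorting the category counts in descending order to obtain a descending ranking;
    computing the target ratio of the sum of the first S category counts in the descending ranking to the sum of all category counts;
    if the target ratio is greater than a preset ratio threshold, determining the union of the classification attributes corresponding to the first S category counts as the factor group attribute corresponding to the target profile factor;
    determining the user group attribute corresponding to the cluster based on the factor group attributes of the target profile factors.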
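  15. One or more readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a profile analysis request and, based on the profile analysis request, screening a user profile database for profile data to be analyzed that meets a target screening condition, the profile data to be analyzed including profile factors to be analyzed and a factor value to be analyzed corresponding to each profile factor to be analyzed;
    standardizing the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain standardized factor values corresponding to the profile factors to be analyzed;
    performing weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed;
    screening the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors;
    reducing the dimensionality of the candidate profile factors using the PCA method, and determining the first M candidate profile factors after dimensionality reduction as target profile factors;
    clustering the target profile factors and the corresponding standardized factor values using the K-means clustering algorithm to obtain K clusters, and determining a corresponding user group attribute according to the standardized factor values corresponding to each cluster;
    querying a target user database according to the user group attribute corresponding to each cluster, to obtain target objects corresponding to the user group attribute.
  16. The readable storage medium according to claim 15, where the target screening condition includes a screening dimension and a dimension threshold corresponding to the screening dimension;
    the screening of the user profile database, based on the profile analysis request, for profile data to be analyzed that meets the target screening condition includes:
    querying the user profile database based on the profile analysis request, and determining the original dimension value corresponding to the screening dimension in each piece of raw profile data;
    if the original dimension value matches the dimension threshold, determining the raw profile data as profile data to be analyzed that meets the target screening condition.
  17. The readable storage medium according to claim 15, where the standardizing of the factor values to be analyzed corresponding to the profile factors to be analyzed, to obtain the corresponding standardized factor values, includes:
    obtaining a numeric conversion rule or a standardization formula corresponding to the profile factor to be analyzed;
    if the factor value to be analyzed is categorical data, converting the factor value to be analyzed with the numeric conversion rule, to obtain the standardized factor value corresponding to the profile factor to be analyzed;
    if the factor value to be analyzed is continuous data, standardizing the factor value to be analyzed with the standardization formula, to obtain the standardized factor value corresponding to the profile factor to be analyzed.
  18. The readable storage medium according to claim 15, where the performing of weight analysis on the profile factors to be analyzed and the corresponding standardized factor values using the CRITIC method, to obtain a weight value corresponding to each profile factor to be analyzed, includes:
    computing the correlation between the standardized factor values of any two of the profile factors to be analyzed, to obtain the correlation coefficient of those two profile factors to be analyzed;
    computing a quantization index of each profile factor to be analyzed from the correlation coefficients of every pair of profile factors to be analyzed;
    computing the information content of each profile factor to be analyzed from its quantization index;
    determining the weight value of each profile factor to be analyzed from its information content.
  19. The readable storage medium according to claim 15, where the screening of the profile factors to be analyzed according to the weight value corresponding to each profile factor to be analyzed, to determine candidate profile factors, includes:
    sorting the weight values of all the profile factors to be analyzed to obtain a weight ranking;
    computing the total weight share of the sum of the weight values of the first X profile factors to be analyzed in the weight ranking relative to the sum of the weight values of all the profile factors to be analyzed;
    if the total weight share is greater than a preset share threshold, determining the first X profile factors to be analyzed in the weight ranking as candidate profile factors.
  20. The readable storage medium according to claim 15, where the determining of a corresponding user group attribute according to the standardized factor values corresponding to each cluster includes:
    obtaining the target profile factors corresponding to each cluster, and classifying the factor values to be analyzed corresponding to the target profile factors according to a preset classification rule, to obtain at least two classification attributes;
    counting the category count of the target profile factor for each classification attribute, and sorting the category counts in descending order to obtain a descending ranking;
    computing the target ratio of the sum of the first S category counts in the descending ranking to the sum of all category counts;
    if the target ratio is greater than a preset ratio threshold, determining the union of the classification attributes corresponding to the first S category counts as the factor group attribute corresponding to the target profile factor;
    determining the user group attribute corresponding to the cluster based on the factor group attributes of the target profile factors.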
PCT/CN2020/093359 2019-06-14 2020-05-29 基于大数据的画像分析方法、装置、计算机设备及存储介质 WO2020248843A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910517664.8A CN110363387B (zh) 2019-06-14 2019-06-14 基于大数据的画像分析方法、装置、计算机设备及存储介质
CN201910517664.8 2019-06-14

Publications (1)

Publication Number Publication Date
WO2020248843A1 true WO2020248843A1 (zh) 2020-12-17

Family

ID=68217302

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093359 WO2020248843A1 (zh) 2019-06-14 2020-05-29 基于大数据的画像分析方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110363387B (zh)
WO (1) WO2020248843A1 (zh)

Also Published As

Publication number Publication date
CN110363387B (zh) 2023-09-05
CN110363387A (zh) 2019-10-22

Similar Documents

Publication Publication Date Title
WO2020248843A1 (zh) 基于大数据的画像分析方法、装置、计算机设备及存储介质
WO2020062660A1 (zh) 企业信用风险评估方法、装置、设备及存储介质
WO2019218699A1 (zh) 欺诈交易判断方法、装置、计算机设备和存储介质
WO2021003938A1 (zh) 图像分类方法、装置、计算机设备和存储介质
WO2023024670A1 (zh) 设备聚类方法、装置、计算机设备及存储介质
WO2019200742A1 (zh) 短期盈利的预测方法、装置、计算机设备和存储介质
CN112396428B (zh) 一种基于用户画像数据的客群分类管理方法及装置
AU2019101158A4 (en) A method of analyzing customer churn of credit cards by using logistics regression
WO2020143305A1 (zh) 群体信息分类方法、装置、计算机设备和存储介质
CN109190929B (zh) 一种工作量分摊方法、装置、计算机设备及存储介质
Bian SPSS discriminant function analysis
Wu et al. Bootstrap variability studies in ROC analysis on large datasets
EP4227855A1 (en) Graph explainable artificial intelligence correlation
WO2023083051A1 (zh) 生物特征识别方法、装置、设备和存储介质
CN111581197A (zh) 对数据集中的数据表进行抽样和校验的方法及装置
CN116340831A (zh) 一种信息分类方法、装置、电子设备及存储介质
Machado et al. Ranking the scientific output of researchers in fractional calculus
CN113095604B (zh) 产品数据的融合方法、装置、设备及存储介质
CN114861800A (zh) 模型训练方法、概率确定方法、装置、设备、介质及产品
CN114372835A (zh) 综合能源服务潜力客户识别方法、系统及计算机设备
CN113920366A (zh) 一种基于机器学习的综合加权主数据识别方法
CN115146890A (zh) 企业运营风险告警方法、装置、计算机设备和存储介质
CN113538020B (zh) 获取客群特征关联度方法、装置、存储介质和电子装置
CN112529708B (zh) 一种客户识别方法及装置、电子设备
RU2774046C1 (ru) Способ и система определения наличия критических корпоративных данных в тестовой базе данных

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20823414

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20823414

Country of ref document: EP

Kind code of ref document: A1