WO2023071105A1 - 一种特征变量的分析方法、装置、计算机设备及存储介质 - Google Patents

一种特征变量的分析方法、装置、计算机设备及存储介质 (Feature variable analysis method and apparatus, computer device, and storage medium)

Info

Publication number
WO2023071105A1
WO2023071105A1 PCT/CN2022/089514 CN2022089514W WO2023071105A1 WO 2023071105 A1 WO2023071105 A1 WO 2023071105A1 CN 2022089514 W CN2022089514 W CN 2022089514W WO 2023071105 A1 WO2023071105 A1 WO 2023071105A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
weighted average
data
linear regression
weighted
Prior art date
Application number
PCT/CN2022/089514
Other languages
English (en)
French (fr)
Inventor
黄晨宇
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023071105A1 publication Critical patent/WO2023071105A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/18 - Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/11 - Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 - Protecting data
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes

Definitions

  • the present application belongs to the technical field of data analysis and processing, and in particular relates to a characteristic variable analysis method, device, computer equipment and storage medium.
  • Existing federated learning work focuses mainly on the modeling stage, but in practice the large number of data features and their uneven quality make model training slow and the resulting models inaccurate.
  • The traditional remedy is to analyze the features before modeling: selecting effective features first improves both training speed and model accuracy.
  • The inventor realized that when the features and the labels do not reside on the same client, existing feature variable analysis methods usually require transferring the features or labels to one client before variable analysis can be performed. In federated learning, however, data privacy must be protected, and directly transmitting data features would defeat the privacy-preserving purpose of federated learning. Existing variable analysis methods therefore carry a risk of data leakage when applied directly to federated learning, and it is difficult to guarantee data security.
  • The purpose of the embodiments of the present application is to propose a feature variable analysis method, apparatus, computer device, and storage medium that address the data leakage risk of existing feature variable analysis methods and the resulting difficulty in guaranteeing data security.
  • To solve the above technical problem, an embodiment of the present application provides a feature variable analysis method, adopting the following technical solution:
  • A method for analyzing feature variables, comprising: performing a binning operation on the feature data in a first participant to obtain a data feature set; performing a binning operation on the feature labels in a second participant to obtain a feature label set; calculating a weighted average of the feature data in the data feature set to obtain a first weighted average; calculating a weighted average of the feature labels in the feature label set to obtain a second weighted average; constructing a linear regression equation based on the first weighted average and the second weighted average; calculating the mean square error and the total sum of squares of a target feature variable based on the linear regression equation; and calculating a coefficient of determination of the target feature variable based on the mean square error and the total sum of squares, and evaluating the target feature variable based on the coefficient of determination.
  • Further, performing the binning operation on the feature labels in the second participant to obtain the feature label set includes: acquiring binning information of the feature data and sending the binning information to the second participant; and performing a binning operation on the feature labels in the second participant based on the binning information to obtain the feature label set.
  • Further, calculating the weighted average of the feature data in the data feature set to obtain the first weighted average includes: calculating a feature weight for each piece of feature data in the data feature set based on a preset feature weight algorithm to obtain first weights; performing a weighted summation of the feature data in the data feature set based on the first weights to obtain first weighted results; and calculating the average of the first weighted results to obtain the first weighted average.
  • Further, calculating the weighted average of the feature labels in the feature label set to obtain the second weighted average includes: calculating a feature weight for each feature label in the feature label set based on a preset feature weight algorithm to obtain second weights; performing a weighted summation of the feature labels in the feature label set based on the second weights to obtain second weighted results; and calculating the average of the second weighted results to obtain the second weighted average.
  • Further, constructing the linear regression equation based on the first weighted average and the second weighted average includes: calculating linear regression parameters of the linear regression equation based on the first weighted average and the second weighted average; and constructing the linear regression equation based on the linear regression parameters and a preset least squares method.
  • Further, the linear regression equation is expressed as f(x_t) = a_0 + a_1 x_t, where a_0 and a_1 are the linear regression parameters and x_t is the data feature. The parameters a_0 and a_1 are calculated as
  • a_1 = Σ_{i=1}^{n} (x_{ti} - x̄_t)(y_i - ȳ) / Σ_{i=1}^{n} (x_{ti} - x̄_t)²,  a_0 = ȳ - a_1 · x̄_t,
  • where x_{ti} is the feature value of the i-th data feature, x̄_t is the feature average of the data features in the data feature set (i.e., the first weighted average), y_i is the label value of the i-th feature label, and ȳ is the label average of all feature labels y'_i (i.e., the second weighted average).
  • Further, the coefficient of determination of the target feature variable is calculated as R² = 1 - RMSE/SST, where R² is the coefficient of determination, RMSE is the mean square error, and SST is the total sum of squares.
  • Evaluating the target feature variable based on the coefficient of determination includes: comparing the coefficient of determination with a preset threshold; when the coefficient of determination is greater than or equal to the preset threshold, determining that the target feature variable is a necessary variable; and when the coefficient of determination is smaller than the preset threshold, determining that the target feature variable is an unnecessary variable.
  • To solve the above technical problem, an embodiment of the present application further provides a feature variable analysis apparatus, adopting the following technical solution:
  • A feature variable analysis apparatus, comprising:
  • a first binning module, configured to perform a binning operation on the feature data in a first participant to obtain a data feature set;
  • a second binning module, configured to perform a binning operation on the feature labels in a second participant to obtain a feature label set;
  • a first weighted average module, configured to calculate a weighted average of the feature data in the data feature set to obtain a first weighted average;
  • a second weighted average module, configured to calculate a weighted average of the feature labels in the feature label set to obtain a second weighted average;
  • a linear regression module, configured to construct a linear regression equation based on the first weighted average and the second weighted average;
  • an evaluation parameter calculation module, configured to calculate the mean square error and the total sum of squares of a target feature variable based on the linear regression equation; and
  • a variable evaluation module, configured to calculate a coefficient of determination of the target feature variable based on the mean square error and the total sum of squares, and to evaluate the target feature variable based on the coefficient of determination.
  • To solve the above technical problem, an embodiment of the present application further provides a computer device, adopting the following technical solution:
  • A computer device comprising a memory and a processor, wherein computer-readable instructions are stored in the memory and the processor, when executing the computer-readable instructions, implements the steps of the feature variable analysis method described above, up to and including calculating the coefficient of determination of the target feature variable based on the mean square error and the total sum of squares and evaluating the target feature variable based on the coefficient of determination.
  • To solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, adopting the following technical solution:
  • A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the feature variable analysis method described above, up to and including calculating the coefficient of determination of the target feature variable based on the mean square error and the total sum of squares and evaluating the target feature variable based on the coefficient of determination.
  • The present application discloses a feature variable analysis method, apparatus, computer device, and storage medium, belonging to the technical field of data analysis and processing.
  • The application takes into account the correlation between the feature data held by different clients: the feature data and feature labels stored on different clients are binned, the weighted average of the feature data is calculated to obtain a first weighted average, the weighted average of the feature labels is calculated to obtain a second weighted average, a linear regression equation is constructed based on the first and second weighted averages, the mean square error and total sum of squares of the target feature variable are calculated based on the linear regression equation, the coefficient of determination of the target feature variable is calculated based on the mean square error and total sum of squares, and the target feature variable is evaluated based on the coefficient of determination.
  • In realizing this multi-client joint variable analysis, the application transfers only intermediate evaluation factors such as the weighted averages, the mean square error, and the total sum of squares between clients; the feature data and feature labels themselves are never transferred. Joint variable analysis across multiple clients can therefore be performed while protecting data privacy.
  • FIG. 1 shows an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 shows a flowchart of an embodiment of the feature variable analysis method according to the present application;
  • FIG. 3 shows a schematic structural diagram of an embodiment of the feature variable analysis apparatus according to the present application;
  • FIG. 4 shows a schematic structural diagram of an embodiment of a computer device according to the present application.
  • As shown in FIG. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
  • The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105.
  • The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
  • The terminal devices 101, 102, 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers.
  • The server 105 can be a server that provides various services, for example a background server that supports the pages displayed on the terminal devices 101, 102, 103.
  • The server can be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
  • The feature variable analysis method provided in the embodiments of the present application is generally executed by the server, and correspondingly, the feature variable analysis apparatus is generally provided in the server.
  • It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; any number of terminal devices, networks, and servers may be provided according to implementation needs.
  • Continuing to refer to FIG. 2, a flowchart of an embodiment of the feature variable analysis method according to the present application is shown.
  • The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology.
  • Artificial intelligence (AI) is the theory, method, technology, and application system of using digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
  • A federated learning modeling process may involve multiple participants, and the data features or feature labels it uses may not reside on the client of the same participant, so existing variable analysis methods usually need to transfer the data features or feature labels to one client before variable analysis can be performed; during such transfers, however, the risk of data leakage arises easily and data security is difficult to guarantee.
  • For this reason, the present application provides a feature variable analysis method, apparatus, computer device, and storage medium that realize multi-client joint variable analysis without transferring the feature data and feature labels, so as to protect data privacy.
  • The feature variable analysis method comprises the following steps.
  • Specifically, the server receives a feature variable instruction and performs a binning operation on the feature data in the first participant according to binning conditions uploaded by the user to obtain a data feature set; the server then obtains the binning information of the first participant and, based on that binning information, performs a binning operation on the feature labels in the second participant to obtain a feature label set.
  • In a specific embodiment of the present application, a federated learning modeling task requires participants A and B to jointly participate, each holding part of the joint data: participant A holds part of the sample features x_t, while participant B holds part of the sample features and the label y.
  • For example, participant A is an insurance institution and participant B is a bank.
  • In practice, the bank holds additional user data that the insurance institution does not have, and the feature label y is obtained by extraction from this portion of user data; a toy illustration of this vertical data split is sketched below.
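  • For illustration, a minimal sketch of this vertical data split is given below; the column names and values are assumptions, not taken from the embodiment.

```python
import pandas as pd

# Participant A (e.g. the insurance institution) holds the feature x_t per user id.
party_a = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                        "x_t": [0.2, 1.5, 0.7, 2.3, 1.1, 0.4]})

# Participant B (e.g. the bank) holds the label y for the same user ids.
party_b = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                        "y": [10.0, 31.0, 18.0, 45.0, 25.0, 12.0]})

# The raw columns x_t and y are never exchanged; only bin membership (lists of ids)
# and aggregate statistics (weighted averages, RMSE, SST) cross the boundary.
```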
  • In this embodiment, the electronic device on which the feature variable analysis method runs (for example, the server shown in FIG. 1) may receive the feature variable instruction through a wired or wireless connection.
  • It should be noted that the wireless connection may include, but is not limited to, 3G/4G connections, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra wideband), and other wireless connection methods now known or developed in the future.
  • Specifically, after completing the binning operation, the server first calculates the feature weight of each piece of feature data in the data feature set based on a preset feature weight algorithm, and likewise calculates the label weight of each feature label in the feature label set based on the preset feature weight algorithm; it then calculates the weighted average of the feature data in the data feature set based on the feature weights to obtain the first weighted average, and calculates the weighted average of the feature labels in the feature label set based on the label weights to obtain the second weighted average.
  • The calculation can be written as x'_{t,i} = φ_i · x_{ti} and x̄_t = (1/n) Σ_{i=1}^{n} x'_{t,i} for the feature data, and y'_i = w_i · y_i and ȳ = (1/m) Σ_{i=1}^{m} y'_i for the feature labels,
  • where φ_i is the weight of the i-th data feature in the data feature set, x_{ti} is the feature value of the i-th data feature, n is the number of data features in the data feature set, x̄_t is the feature average of all weighted features x'_{t,i} (i.e., the first weighted average), w_i is the weight of the i-th feature label in the feature label set, y_i is the label value of the i-th feature label, m is the number of feature labels in the feature label set, and ȳ is the label average of all weighted labels y'_i (i.e., the second weighted average).
  • In a specific embodiment of the present application, the feature weight algorithm is the Relief algorithm.
  • The Relief algorithm randomly selects a sample R from a feature data group D, finds the nearest-neighbour sample H within D (called the Near Hit), and finds the nearest-neighbour sample M in the other feature data groups (called the Near Miss); it then updates the weight of each feature according to the following rule: if the distance between R and the Near Hit on a feature is smaller than the distance between R and the Near Miss (the distance here being the similarity between two pieces of feature data), the feature helps distinguish nearest neighbours of the same class from those of different classes, and its weight is increased; conversely, if the distance between R and the Near Hit on a feature is greater than the distance between R and the Near Miss, the feature has a negative effect on distinguishing same-class from different-class nearest neighbours, and its weight is decreased.
  • The above process is repeated p times, and the average weight of each feature is finally obtained.
  • The greater a feature's weight, the stronger its discriminative ability; conversely, the smaller the weight, the weaker the ability.
  • The running time of the Relief algorithm grows linearly with the number of samples p and the number of original features N, so it is very efficient; a sketch of this weighting step, together with the weighted averaging described above, follows.
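  • For illustration, the following minimal sketch implements a Relief-style per-feature weight update of the kind described above, together with the weighted-sum-then-average step; all function and variable names are assumptions, and how the resulting weights map onto the per-item weights φ_i and w_i is left open, since the embodiment does not fully specify it.

```python
import numpy as np

def relief_weights(X, groups, p=100, seed=0):
    """Relief-style weight per feature (column) of X, following the description above.

    X:      (n_samples, n_features) array of feature values
    groups: (n_samples,) array of group ids defining "same class" vs "different class"
    p:      number of random draws of the sample R
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(p):
        i = rng.integers(n)                          # randomly selected sample R
        dist = np.abs(X - X[i]).sum(axis=1)          # L1 distance to every sample
        same = groups == groups[i]
        same[i] = False                              # exclude R itself
        hit = np.argmin(np.where(same, dist, np.inf))                     # Near Hit
        miss = np.argmin(np.where(~(groups == groups[i]), dist, np.inf))  # Near Miss
        # Reward features on which R is close to the Near Hit and far from the Near Miss.
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / p

def weighted_average(values, weights):
    """Weighted value per element followed by an average (first/second weighted average)."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    return float(np.mean(weights * values))
```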
  • Specifically, the server calculates the linear regression parameters of the linear regression equation based on the first weighted average and the second weighted average, and constructs the linear regression equation based on the linear regression parameters and a preset least squares method.
  • It should be noted that the least squares method (also called the method of least squares) is a mathematical optimization technique: it finds the best function fit to the data by minimizing the sum of squared errors. Unknown quantities can be obtained conveniently with least squares while keeping the sum of squared errors between the fitted and actual data minimal. Least squares can also be used for curve fitting, and some other optimization problems can be expressed through least squares by minimizing energy or maximizing entropy. It is a mathematical tool widely applied in error estimation, uncertainty analysis, system identification, prediction, and forecasting; a minimal sketch of this fitting step follows.
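  • A minimal sketch of the closed-form least-squares fit referred to above, with x_bar and y_bar standing for the first and second weighted averages; the names are assumptions.

```python
import numpy as np

def fit_linear_regression(x_t, y, x_bar, y_bar):
    """Least-squares estimates for f(x_t) = a0 + a1 * x_t, centred on the
    first weighted average x_bar and the second weighted average y_bar."""
    x_t, y = np.asarray(x_t, float), np.asarray(y, float)
    a1 = np.sum((x_t - x_bar) * (y - y_bar)) / np.sum((x_t - x_bar) ** 2)
    a0 = y_bar - a1 * x_bar
    return a0, a1

# Example: a0, a1 = fit_linear_regression(party_a["x_t"], party_b["y"], x_bar, y_bar)
```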
  • Specifically, after the server completes construction of the linear regression equation, it calculates the mean square error and the total sum of squares of the target feature variable based on the linear regression equation; the mean square error and the total sum of squares are intermediate evaluation factors of how critical the feature variable is, and are used to evaluate whether the feature variable is a necessary variable.
  • The total sum of squares is calculated as SST = Σ_{i=1}^{m} (y_i - ȳ)², and the mean square error RMSE measures the squared deviation of the labels y_i from the regression predictions f(x_{ti}).
  • The server then calculates the coefficient of determination of the target feature variable based on the mean square error and the total sum of squares.
  • The coefficient of determination directly reflects how critical the feature variable under evaluation is: by comparing it with a preset threshold, the target feature variable is determined to be a necessary variable when the coefficient is greater than or equal to the threshold, and an unnecessary variable when it is smaller than the threshold. A minimal sketch of this computation follows.
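  • A minimal, centralized sketch of this evaluation step; the residual term is taken in sum-of-squares form so that R² = 1 - RMSE/SST coincides with the usual coefficient of determination, which is an assumption about the intended normalization.

```python
import numpy as np

def coefficient_of_determination(x_t, y, a0, a1, y_bar):
    """Intermediate evaluation factors and R^2 for one target feature variable."""
    x_t, y = np.asarray(x_t, float), np.asarray(y, float)
    residuals = y - (a0 + a1 * x_t)
    rmse = np.sum(residuals ** 2)            # mean-square-error term (sum form assumed)
    sst = np.sum((y - y_bar) ** 2)           # total sum of squares
    return 1.0 - rmse / sst

def evaluate(r2, threshold):
    """Necessary variable if R^2 >= threshold, otherwise unnecessary."""
    return "necessary" if r2 >= threshold else "unnecessary"
```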
  • In the above embodiment, the application takes into account the correlation between the feature data of different clients: the feature data and feature labels stored on different clients are binned, the weighted average of the feature data is calculated to obtain the first weighted average, the weighted average of the feature labels is calculated to obtain the second weighted average, a linear regression equation is constructed based on the first and second weighted averages, the mean square error and total sum of squares of the target feature variable are calculated based on the linear regression equation, the coefficient of determination is calculated based on the mean square error and total sum of squares, and the target feature variable is evaluated based on the coefficient of determination.
  • In realizing this multi-client joint variable analysis, only intermediate evaluation factors such as the weighted averages, the mean square error, and the total sum of squares are passed between the clients, and the feature data and feature labels are never transferred; joint variable analysis across multiple clients is therefore achieved while protecting data privacy.
  • Further, performing the binning operation on the feature labels in the second participant to obtain the feature label set proceeds as follows.
  • Specifically, the server obtains the binning information of the feature data and sends the binning information to the second participant, and a binning operation is performed on the feature labels in the second participant based on the binning information to obtain the feature label set.
  • In a specific embodiment of the present application, data binning is performed on the feature data in participant A; assuming q data bins are produced, after participant A's binning is completed the ids of the feature data in each bin are traversed to obtain the binning layout of participant A's feature data, this binning layout is sent to participant B, and participant B bins the labels y corresponding to those ids accordingly.
  • In the above embodiment, by obtaining participant A's binning information and binning the feature labels in participant B according to it, the binning results of participants A and B are kept structurally consistent, reducing errors caused by binning discrepancies; a minimal sketch of this exchange is given below.
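  • A minimal sketch of this bin-membership exchange, reusing the toy frames sketched earlier; quantile binning via pandas is an assumption, since the embodiment only requires that participant A's bin layout be shared with participant B as lists of ids.

```python
import pandas as pd

def bin_and_share(party_a, party_b, q=3):
    """Participant A bins x_t into q bins locally and shares only the ids in each bin;
    participant B then groups its labels y by the same bins."""
    party_a = party_a.copy()
    party_a["bin"] = pd.qcut(party_a["x_t"], q=q, labels=False, duplicates="drop")

    # Binning information sent from A to B: {bin index: list of sample ids}.
    bin_ids = party_a.groupby("bin")["id"].apply(list).to_dict()

    # B reproduces the same bin layout on its side using only the received ids.
    id_to_bin = {sid: b for b, ids in bin_ids.items() for sid in ids}
    party_b = party_b.copy()
    party_b["bin"] = party_b["id"].map(id_to_bin)
    return party_a, party_b, bin_ids
```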
  • Further, calculating the weighted average of the feature data in the data feature set to obtain the first weighted average, and calculating the weighted average of the feature labels in the feature label set to obtain the second weighted average, proceed as follows.
  • Specifically, the server first calculates, through the preset feature weight algorithm, the feature weight of each piece of feature data in the data feature set and the feature weight of each feature label in the feature label set. The server first classifies the feature data in the data feature set to obtain multiple feature data groups and assigns an initial weight (for example, 0.5) to each classified piece of feature data.
  • It then calculates, based on the feature weight algorithm, the similarity of feature data within feature data groups of the same category to obtain a first similarity, and the similarity of feature data between feature data groups of different categories to obtain a second similarity, and adjusts the initial weight of each piece of feature data based on the first similarity and the second similarity to obtain its feature weight, i.e., the first weight; for example, when the difference between the first similarity and the second similarity is greater than or equal to a preset similarity threshold, the initial weight is lowered.
  • Similarly, the feature weight of each feature label, i.e., the second weight, is calculated according to the above procedure.
  • The server then performs a weighted summation of the feature data in the data feature set based on the first weights to obtain first weighted results, and a weighted summation of the feature labels in the feature label set based on the second weights to obtain second weighted results; finally, the server calculates the average of the first weighted results to obtain the first weighted average and the average of the second weighted results to obtain the second weighted average.
  • In the above embodiment, taking into account the correlation between the feature data of different clients, the feature data and feature labels stored on different clients are binned, the weighted average of the feature data is calculated to obtain the first weighted average, and the weighted average of the feature labels is calculated to obtain the second weighted average; feature association is realized through weighting, summation, and averaging of the features, ensuring that as many features as possible are extracted.
  • Further, constructing the linear regression equation based on the first weighted average and the second weighted average includes calculating the linear regression parameters of the linear regression equation based on the first weighted average and the second weighted average, and constructing the linear regression equation based on the linear regression parameters and the preset least squares method.
  • As above, the linear regression equation is expressed as f(x_t) = a_0 + a_1 x_t, where a_0 and a_1 are the linear regression parameters and x_t is the data feature, with
  • a_1 = Σ_{i=1}^{n} (x_{ti} - x̄_t)(y_i - ȳ) / Σ_{i=1}^{n} (x_{ti} - x̄_t)²,  a_0 = ȳ - a_1 · x̄_t,
  • where x_{ti} is the feature value of the i-th data feature, x̄_t is the feature average of the data features in the data feature set (i.e., the first weighted average), y_i is the label value of the i-th feature label, and ȳ is the label average of the feature labels in the feature label set (i.e., the second weighted average).
  • In a specific embodiment of the present application, when the intermediate evaluation factors of the feature variable under evaluation are passed between different clients, the intermediate evaluation factors that need to be transferred can be encrypted with a homomorphic encryption algorithm to further ensure data security.
  • For example, when the server computes the linear regression parameters, participant A first calculates an average-difference term u_i, encrypts it homomorphically to obtain the ciphertext [u_i], and the server sends [u_i] to participant B; participant B decrypts [u_i] to recover u_i and calculates a standard-deviation term v, which it then encrypts homomorphically to obtain [v].
  • The server sends [v] to participant A, which decrypts it to obtain v and uses v to calculate the regression parameter a_1 and an intermediate true value o.
  • Participant A then encrypts the true value o homomorphically to obtain [o] and sends it to participant B, which computes the remaining regression parameter.
  • Homomorphic encryption is a cryptographic technique based on the computational-complexity theory of mathematical problems: processing homomorphically encrypted data produces an output which, when decrypted, matches the output obtained by processing the original unencrypted data in the same way.
  • Homomorphic encryption involves a pair of public and private keys (pk, sk); [·] denotes a ciphertext produced by encrypting with pk, so for a plaintext m, [m] = Enc_pk(m) is the corresponding homomorphically encrypted ciphertext.
  • In the above embodiment, the intermediate evaluation factors that need to be transmitted are encrypted with a homomorphic encryption algorithm, further ensuring data security; a minimal sketch of such an encrypted hand-off is given below.
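  • A minimal sketch of such an encrypted hand-off using the third-party python-paillier (phe) package; the choice of the Paillier scheme and the key placement are assumptions, since the embodiment does not name a concrete homomorphic scheme, and u_i is only a placeholder value.

```python
from phe import paillier  # pip install phe (python-paillier); an assumed choice of scheme

# Participant B generates the key pair; the public key pk is shared, sk stays with B.
public_key, private_key = paillier.generate_paillier_keypair()

# Participant A encrypts an intermediate evaluation factor (e.g. the term u_i) with pk
# and transmits only the ciphertext [u_i]; the raw feature data never leaves A.
u_i = 0.37                                   # placeholder intermediate value
cipher_u = public_key.encrypt(u_i)

# Paillier is additively homomorphic, so ciphertexts can be aggregated in transit,
# e.g. cipher_sum = cipher_u + public_key.encrypt(0.12), before B decrypts.
recovered = private_key.decrypt(cipher_u)
assert abs(recovered - u_i) < 1e-9
```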
  • The coefficient of determination of the target feature variable is again calculated as R² = 1 - RMSE/SST, where R² is the coefficient of determination, RMSE is the mean square error, and SST is the total sum of squares.
  • Evaluating the target feature variable based on the coefficient of determination includes: comparing the coefficient of determination with the preset threshold; determining that the target feature variable is a necessary variable when the coefficient of determination is greater than or equal to the preset threshold; and determining that the target feature variable is an unnecessary variable when the coefficient of determination is smaller than the preset threshold.
  • In the above embodiment, before the server calculates the coefficient of determination, participant A first calculates intermediate parameters w_1i and w_2i (with w_2i = a_1 x_{ti}), encrypts them homomorphically as [w_1i] and [w_2i], and the server sends them to participant B; participant B decrypts them, computes the mean square error RMSE and the total sum of squares SST, encrypts them homomorphically as [RMSE] and [SST], and sends the encrypted [RMSE] and [SST] to the server, which computes the coefficient of determination R² and evaluates the target feature variable accordingly.
  • In a specific embodiment of the present application, when the server evaluates the target feature variable, it compares the coefficient of determination with the preset threshold: when the coefficient of determination is greater than or equal to the threshold, the target feature variable is determined to be a necessary variable; when it is smaller than the threshold, the target feature variable is determined to be an unnecessary variable.
  • It should be emphasized that, to further ensure the privacy and security of the above feature data and feature labels, the feature data and feature labels can also be stored in nodes of a blockchain.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block.
  • A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
  • With further reference to FIG. 3, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a feature variable analysis apparatus.
  • This apparatus embodiment corresponds to the method embodiment shown in FIG. 2.
  • The apparatus can specifically be applied to various electronic devices.
  • As shown in FIG. 3, the feature variable analysis apparatus described in this embodiment comprises:
  • a first binning module 301, configured to perform a binning operation on the feature data in the first participant to obtain a data feature set;
  • a second binning module 302, configured to perform a binning operation on the feature labels in the second participant to obtain a feature label set;
  • a first weighted average module 303, configured to calculate a weighted average of the feature data in the data feature set to obtain a first weighted average;
  • a second weighted average module 304, configured to calculate a weighted average of the feature labels in the feature label set to obtain a second weighted average;
  • a linear regression module 305, configured to construct a linear regression equation based on the first weighted average and the second weighted average;
  • an evaluation parameter calculation module 306, configured to calculate the mean square error and the total sum of squares of the target feature variable based on the linear regression equation; and
  • a variable evaluation module 307, configured to calculate the coefficient of determination of the target feature variable based on the mean square error and the total sum of squares, and to evaluate the target feature variable based on the coefficient of determination.
  • the second binning module 302 specifically includes:
  • a binning information acquisition unit configured to acquire binning information of the feature data, and send the binning information to the second participant;
  • the second binning unit is configured to perform a binning operation on the feature tags in the second participant based on the binning information to obtain a feature tag set.
  • the first weighted average module 303 specifically includes:
  • the first weight calculation unit is configured to calculate the feature weight of each feature data in the data feature set based on a preset feature weight algorithm to obtain the first weight;
  • a first weighting unit configured to weight and sum the feature data in the data feature set based on the first weight to obtain a first weighted result
  • the first average calculation unit is configured to calculate the average of the first weighted results to obtain the first weighted average.
  • the second weighted average module 304 specifically includes:
  • the second weight calculation unit is configured to calculate the feature weight of each feature tag in the feature tag set based on a preset feature weight algorithm to obtain a second weight;
  • a second weighting unit configured to weight and sum the feature tags in the feature tag set based on the second weight to obtain a second weighted result
  • a second average calculation unit, configured to calculate the average of the second weighted results to obtain the second weighted average.
  • linear regression module 305 specifically includes:
  • a regression parameter calculation unit configured to calculate a linear regression parameter of the linear regression equation based on the first weighted average and the second weighted average;
  • a regression equation construction unit configured to construct the linear regression equation based on the linear regression parameters and a preset least square method.
  • Further, the linear regression equation is expressed as f(x_t) = a_0 + a_1 x_t, where a_0 and a_1 are the linear regression parameters and x_t is the data feature, computed as
  • a_1 = Σ_{i=1}^{n} (x_{ti} - x̄_t)(y_i - ȳ) / Σ_{i=1}^{n} (x_{ti} - x̄_t)²,  a_0 = ȳ - a_1 · x̄_t,
  • where x_{ti} is the feature value of the i-th data feature, x̄_t is the first weighted average, y_i is the label value of the i-th feature label, and ȳ is the label average of all feature labels (i.e., the second weighted average).
  • variable evaluation module 307 specifically includes:
  • An evaluation and comparison unit configured to compare the coefficient of determination with a preset threshold
  • a first evaluation result unit configured to determine that the target characteristic variable is a necessary variable when the determination coefficient is greater than or equal to a preset threshold
  • the second evaluation result unit is configured to determine that the target feature variable is an unnecessary variable when the determination coefficient is smaller than a preset threshold.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are communicatively connected to one another through a system bus. It should be noted that only the computer device 4 with components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components need to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), and embedded devices.
  • the computer equipment may be computing equipment such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through keyboard, mouse, remote control, touch pad or voice control device.
  • the memory 41 includes at least one type of readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc.
  • the memory 41 may be an internal storage unit of the computer device 4 , such as a hard disk or memory of the computer device 4 .
  • the memory 41 is generally used to store the operating system and various application software installed in the computer device 4 , such as computer-readable instructions for analysis methods of characteristic variables, and the like.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • The processor 42 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the computer-readable instructions stored in the memory 41 or to process data, for example to run the computer-readable instructions of the feature variable analysis method.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • The present application also provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the steps of the feature variable analysis method described above.
  • Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and contains several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Operations Research (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Complex Calculations (AREA)

Abstract

The present application discloses a feature variable analysis method and apparatus, a computer device, and a storage medium, belonging to data mining technology within the technical field of data analysis and processing. The application first bins the feature data and feature labels stored on different clients, then calculates the weighted average of the feature data to obtain a first weighted average (S203) and the weighted average of the feature labels to obtain a second weighted average (S204), constructs a linear regression equation based on the first and second weighted averages (S205), calculates the mean square error and total sum of squares of the target feature variable based on the linear regression equation (S206), and calculates the coefficient of determination of the target feature variable based on the mean square error and total sum of squares, evaluating the target feature variable based on the coefficient of determination (S207). The application further relates to blockchain technology: the feature data and feature labels may be stored in a blockchain. The application enables multi-client joint variable analysis while protecting data privacy.

Description

一种特征变量的分析方法、装置、计算机设备及存储介质
本申请要求于2021年10月27日提交中国专利局、申请号为202111254424.7,发明名称为“一种特征变量的分析方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请属于数据分析处理技术领域,具体涉及一种特征变量的分析方法、装置、计算机设备及存储介质。
背景技术
现有的联邦学习主要是针对建模的部分,但在实际过程中,由于数据特征量多,特征质量参差不齐导致模型训练速率较慢且效果不好。针对这一情况,传统的方法是在建模前对特征进行分析,在选取了有效的特征后再去进行建模可以拥有提升训练速率,提高模型准确率。
然而,在针对特征变量分析的过程中,发明人意识到现有的特征变量分析方法针对于特征或标签不在同一个客户端上时,通常需要将特征或标签传递到同一个客户端上,然后才能进行变量分析,但在联邦学习中,需要对数据隐私进行保护,直接传输数据特征或会违背联邦学习的隐私保护初衷,因此现有的变量分析方法直接应用于联邦学习时,存在数据泄露的风险,难以保证数据安全。
发明内容
本申请实施例的目的在于提出一种特征变量的分析方法、装置、计算机设备及存储介质,以解决现有的特征变量分析方法可能存在的数据泄露风险,难以保证数据安全的技术问题。
为了解决上述技术问题,本申请实施例提供一种特征变量的分析方法,采用了如下所述的技术方案:
一种特征变量的分析方法,包括:
对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
进一步地,所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合包括:
获取所述特征数据的分箱信息,并将所述分箱信息发送至所述第二参与方中;
基于所述分箱信息对所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
进一步地,所述计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值包括:
基于预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重,得到第一权重;
基于所述第一权重对所述数据特征集合中特征数据进行加权求和,得到第一加权结果;
计算第一加权结果的平均值,得到所述第一加权平均值。
进一步地,所述计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值包括:
基于预设的特征权重算法计算所述特征标签集合中每一个特征标签的特征权重,得到第二权重;
基于所述第二权重对所述特征标签集合中的特征标签进行加权求和,得到第二加权结果;
计算第二加权结果的平均值,得到所述第一加权平均值。
进一步地,所述基于所述第一加权平均值和所述第二加权平均值构建线性回归方程包括:
基于所述第一加权平均值和所述第二加权平均值计算所述线性回归方程的线性回归参量;
基于所述线性回归参量和预设的最小二乘法构建所述线性回归方程。
进一步地,所述线性回归方程的表达式如下:
f(x t)=a 0+a 1x t
其中,a 0和a 1均为线性回归参量,x t为数据特征,a 0和a 1的具体计算公式如下:
Figure PCTCN2022089514-appb-000001
Figure PCTCN2022089514-appb-000002
其中,x ti是第i个数据特征的特征值,
Figure PCTCN2022089514-appb-000003
为数据特征集合中数据特征的特征平均值, 即第一加权平均值,y i是第i个特征标签的标签值,
Figure PCTCN2022089514-appb-000004
为所有特征标签y' i的标签平均值,即第二加权平均值。
进一步地,所述目标特征变量的决定系数的具体计算公式如下:
Figure PCTCN2022089514-appb-000005
其中,R 2为决定系数,RMSE为均方误差,SST为总平方和,所述基于所述决定系数对所述目标特征变量进行评价的包括:
将所述决定系数与预设阈值进行比对;
当所述决定系数大于或等于预设阈值时,确定所述目标特征变量为必要变量;
当所述决定系数小于预设阈值时,确定所述目标特征变量为非必要变量。
为了解决上述技术问题,本申请实施例还提供一种特征变量的分析装置,采用了如下所述的技术方案:
一种特征变量的分析装置,包括:
第一分箱模块,用于对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
第二分箱模块,用于对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
第一加权平均模块,用于计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
第二加权平均模块,用于计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
线性回归模块,用于基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
评价参数计算模块,用于基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
变量评价模块,用于基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
为了解决上述技术问题,本申请实施例还提供一种计算机设备,采用了如下所述的技术方案:
一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下特征变量的分析方法的步骤:
对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
为了解决上述技术问题,本申请实施例还提供一种计算机可读存储介质,采用了如下所述的技术方案:
一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下特征变量的分析方法的步骤:
对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
与现有技术相比,本申请实施例主要有以下有益效果:
本申请公开了一种特征变量的分析方法、装置、计算机设备及存储介质,属于数据分析处理技术领域。本申请考虑到不同客户端的特征数据之间的关联关系,因此在对不同客户端中存储的特征数据和特征标签进行分箱,分别计算特征数据的加权平均值,得到第一加权平均值,以及计算特征标签的加权平均值,得到第二加权平均值,基于第一加权平均值和第二加权平均值构建线性回归方程,基于线性回归方程计算目标特征变量的均方误差和总平方和,基于均方误差和总平方和计算目标特征变量的决定系数,并基于决定系数对目标特征变量进行评价。本申请在实现多客户端联合的变量分析过程中,仅在多客户端之间传递加权平均值、均方误差、总平方和等中间评价因子,而不需要对特征数据和特征标签进行转移,因此可以在保护数据隐私的情况下实现多客户端联合的变量分析。
附图说明
为了更清楚地说明本申请中的方案,下面将对本申请实施例描述中所需要使用的附图 作一个简单介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1示出了本申请可以应用于其中的示例性系统架构图;
图2示出了根据本申请的特征变量的分析方法的一个实施例的流程图;
图3示出了根据本申请的特征变量的分析装置的一个实施例的结构示意图;
图4示出了根据本申请的计算机设备的一个实施例的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合附图,对本申请实施例中的技术方案进行清楚、完整地描述。
如图1所示,系统架构100可以包括终端设备101、102、103,网络104和服务器105。网络104用以在终端设备101、102、103和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
用户可以使用终端设备101、102、103通过网络104与服务器105交互,以接收或发送消息等。终端设备101、102、103上可以安装有各种通讯客户端应用,例如网页浏览器应用、购物类应用、搜索类应用、即时通信工具、邮箱客户端、社交平台软件等。
终端设备101、102、103可以是具有显示屏并且支持网页浏览的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机等等。
服务器105可以是提供各种服务的服务器,例如对终端设备101、102、103上显示的页面提供支持的后台服务器,服务器可以是独立的服务器,也可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。
需要说明的是,本申请实施例所提供的特征变量的分析方法一般由服务器执行,相应地,特征变量的分析装置一般设置于服务器中。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。
继续参考图2,示出了根据本申请的特征变量的分析的方法的一个实施例的流程图。 本申请实施例可以基于人工智能技术对相关的数据进行获取和处理。其中,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。
人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
在联邦学习建模过程可能包含多个参与方,而其中使用到的数据特征或特征标签可能不在同一个参与方的客户端上,因此现有的变量分析方法通常需要将数据特征或特征标签传递到同一个客户端上,然后才能进行变量分析,但在数据特征或特征标签过程中,很容易发射数据泄露风险,难以保证数据安全。为此,本申请提供一种特征变量的分析方法、装置、计算机设备及存储介质,旨在对特征数据和特征标签进行转移的前提下,实现多客户端联合的变量分析,以保护数据隐私。
所述的特征变量的分析方法,包括以下步骤:
S201,对第一参与方中的特征数据进行分箱操作,得到数据特征集合。
S202,对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
具体的,服务器接收特征变量指令,并按照用户上传的分箱条件对第一参与方中的特征数据进行分箱操作,得到数据特征集合;然后服务器获取第一参与方中分箱信息,并基于第一参与方中分箱信息对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
在本申请具体的实施例中,某次联邦学习建模需要参与方A和参与方B共同参与实现,它们各自拥有一部分联合数据,即参与方A拥有样本数据的部分特征x t,参与方B拥有样本数据的部分特征x t和标签y,例如参与方A为保险机构,参与方B为银行,在实际场景中,银行相比于保险机构存在额外的用户数据,通过该部分用户数据提取获得特征标签y。
在本实施例中,特征变量的分析方法运行于其上的电子设备(例如图1所示的服务器)可以通过有线连接方式或者无线连接方式接收特征变量指令。需要指出的是,上述无线连接方式可以包括但不限于3G/4G连接、WiFi连接、蓝牙连接、WiMAX连接、Zigbee连接、UWB(ultra wideband)连接、以及其他现在已知或将来开发的无线连接方式。
S203,计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值。
S204,计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值。
具体的,服务器在完成分箱操作后,先基于预设的特征权重算法分别计算数据特征集 合中每一个特征数据的特征权重,以及基于预设的特征权重算法分别计算特征标签集合中每一个特征标签的标签权重,然后基于特征权重计算数据特征集合中的特征数据的加权平均值,得到第一加权平均值,以及基于标签权重计算特征标签集合中的特征标签的加权平均值,得到第二加权平均值。具体计算公式如下:
Figure PCTCN2022089514-appb-000006
Figure PCTCN2022089514-appb-000007
Figure PCTCN2022089514-appb-000008
Figure PCTCN2022089514-appb-000009
其中,φ i是数据特征集中第i个数据特征的权重,x ti是第i个数据特征的特征值,
Figure PCTCN2022089514-appb-000010
是数据特征集的特征值总和,n数据特征集中数据特征的数量,
Figure PCTCN2022089514-appb-000011
为所有数据特征x' t,i的特征平均值,即第一加权平均值。w i是特征标签集中第i个特征标签的权重,y i是第i个特征标签的标签值,
Figure PCTCN2022089514-appb-000012
是特征标签集的标签值总和,m为特征标签集中特征标签的数量,
Figure PCTCN2022089514-appb-000013
为所有特征标签y' i的标签平均值,即第二加权平均值。
在本申请一种具体的实施例中,特征权重算法为Relief算法,Relief算法通过从任意一个特征数据组合D中随机选择一个样本R,然后从D中寻找最近邻样本H,称为Near Hit,从其他特征数据组合中寻找最近邻样本M,称为NearMiss,然后根据以下规则更新每个特征的权重:如果R和Near Hit在某个特征上的距离小于R和Near Miss上的距离,这里的距离即两个特征数据之间的相似度,则说明该特征对区分同类和不同类的最近邻是有益的,则增加该特征的权重;反之,如果R和Near Hit在某个特征的距离大于R和Near Miss上的距离,说明该特征对区分同类和不同类的最近邻起负面作用,则降低该特征的权重。以上过程重复p次,最后得到各特征的平均权重,特征的权重越大,表示该特征的分类能力越强,反之,表示该特征分类能力越弱。Relief算法的运行时间随着样本的抽样次数p和原始特征个数N的增加线性增加,因而运行效率非常高。
S205,基于所述第一加权平均值和所述第二加权平均值构建线性回归方程。
具体的,服务器基于第一加权平均值和第二加权平均值计算线性回归方程的线性回归参量,
并基于线性回归参量和预设的最小二乘法构建线性回归方程。
需要说明的是,最小二乘法(又称最小平方法)是一种数学优化技术。它通过最小化误差的平方和寻找数据的最佳函数匹配。利用最小二乘法可以简便地求得未知的数据,并使得这些求得的数据与实际数据之间误差的平方和为最小,最小二乘法还可用于曲线拟合,其他一些优化问题也可通过最小化能量或最大化熵用最小二乘法来表达。最小二乘法是一种在误差估计、不确定度、系统辨识及预测、预报等数据处理诸多学科领域得到广泛应用的数学工具。
S206,基于所述线性回归方程计算目标特征变量的均方误差和总平方和。
具体的,服务器在完成线性回归方程的构建后,基于线性回归方程计算目标特征变量的均方误差和总平方和,均方误差和总平方和为关键程度特征变量的中间评价因子,用于评价特征变量是否为必要变量。均方误差的具体计算公式如下:
Figure PCTCN2022089514-appb-000014
总平方和的具体计算公式如下:
Figure PCTCN2022089514-appb-000015
S207,基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
具体的,服务器基于均方误差和总平方和计算目标特征变量的决定系数,决定系数直接反应了待评价特征变量的关键程度,通过比对决定系数与预设阈值,当决定系数大于或等于预设阈值时,确定目标特征变量为必要变量,当决定系数小于预设阈值时,确定目标特征变量为非必要变量。
在上述实施例中,本申请考虑到不同客户端的特征数据之间的关联关系,因此在对不同客户端中存储的特征数据和特征标签进行分箱,分别计算特征数据的加权平均值,得到第一加权平均值,以及计算特征标签的加权平均值,得到第二加权平均值,基于第一加权平均值和第二加权平均值构建线性回归方程,基于线性回归方程计算目标特征变量的均方误差和总平方和,基于均方误差和总平方和计算目标特征变量的决定系数,并基于决定系数对目标特征变量进行评价。本申请在实现多客户端联合的变量分析过程中,仅在多客户端之间传递加权平均值、均方误差、总平方和等中间评价因子,而不需要对特征数据和特征标签进行转移,因此可以在保护数据隐私的情况下实现多客户端联合的变量分析。
进一步地,所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合包括:
获取所述特征数据的分箱信息,并将所述分箱信息发送至所述第二参与方中;
基于所述分箱信息对所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
具体的,服务器获取特征数据的分箱信息,并将分箱信息发送至第二参与方中,基于分箱信息对对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
在本申请具体的实施例中,对参与方A中的特征数据进行数据分箱,假设分了q个数据箱,在完成参与方A的分箱操作后,遍历每一个数据箱中特征数据的id,得到参与方A特征数据的分箱情况,并将参与方A的分箱情况发送给B,参与方B根据参与方A发送的分箱情况,对对应id的标签y进行对应的分箱。
在上述实施例中,通过获取参与方A的分箱信息,并根据参与方A的分箱信息对参与方B中的特征标签进行分箱,保证参与方A和参与方B的分箱结果的结构一致,以减少分箱差异带来的误差。
进一步地,所述计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值包括:
基于预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重,得到第一权重;
基于所述第一权重对所述数据特征集合中特征数据进行加权求和,得到第一加权结果;
计算第一加权结果的平均值,得到所述第一加权平均值。
进一步地,所述计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值包括:
基于预设的特征权重算法计算所述特征标签集合中每一个特征标签的特征权重,得到第二权重;
基于所述第二权重对所述特征标签集合中的特征标签进行加权求和,得到第二加权结果;
计算第二加权结果的平均值,得到所述第一加权平均值;
具体的,服务器先通过预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重和所述特征标签集合中每一个特征标签的特征权重,其中,服务器先对数据特征集合中的特征数据进行分类,得到多个特征数据组,然后为分类后的每一个特征数据赋予初始权重,例如初始权重为“0.5”。在基于特征权重算法计算同一类别的特征数据组中特征数据的相似度,得到第一相似度,计算不同类别的特征数据组之间特征数据的相似度,得到第二相似度,基于第一相似度和第二相似度对特征数据的初始权重进行调整,得到每一个特征数据的特征权重,即第一权重。例如,当第一相似度的第二相似度差值大于或等 于预设相似度阈值时,下调初始权重。同理,按照上述计算过程计算得到每一个特征标签的特征权重,即第二权重。
然后服务器基于第一权重对数据特征集合中特征数据进行加权求和,得到第一加权结果,以及基于第二权重对特征标签集合中的特征标签进行加权求和,得到第二加权结果。最后,服务器计算第一加权结果的平均值,得到第一加权平均值,以及计算第二加权结果的平均值,得到第一加权平均值。
在上述实施例中,考虑到不同客户端的特征数据之间的关联关系,因此在对不同客户端中存储的特征数据和特征标签进行分箱,分别计算特征数据的加权平均值,得到第一加权平均值,以及计算特征标签的加权平均值,得到第二加权平均值,通过对特征进行赋权、加权、求和、求平均值等操作实现特征关联,以保证提取到尽可能多的特征。
进一步地,所述基于所述第一加权平均值和所述第二加权平均值构建线性回归方程包括:
基于所述第一加权平均值和所述第二加权平均值计算所述线性回归方程的线性回归参量;
基于所述线性回归参量和预设的最小二乘法构建所述线性回归方程。
进一步地,所述线性回归方程的表达式如下:
f(x t)=a 0+a 1x t
其中,a 0和a 1均为线性回归参量,x t为数据特征,a 0和a 1的具体计算公式如下:
Figure PCTCN2022089514-appb-000016
Figure PCTCN2022089514-appb-000017
其中,x ti是第i个数据特征的特征值,
Figure PCTCN2022089514-appb-000018
为数据特征集合中数据特征的特征平均值,即第一加权平均值,y i是第i个特征标签的标签值,
Figure PCTCN2022089514-appb-000019
为特征标签集合中特征标签的标签平均值,即第二加权平均值。
在本申请具体的实施例中,待评价特征变量的中间评价因子在不同客户端进行传递时,可以通过同态加密算法对需要进行传递的中间评价因子进行加密,以进一步保证数据安全。例如,在上述实施例中,服务器在计算线性回归参量时,先由参与方A计算平均差
Figure PCTCN2022089514-appb-000020
然后在参与方A通过同态加密对平均差u i进行加密,得到加密后的平均差[u i],服务器将[u i]发送给参与方B,参与方B对[u i]进行解密,得到u i,并计算标准差
Figure PCTCN2022089514-appb-000021
然后在参与方B通过同态加密对标准差v进行加密,得到加密后的标准差[v],服务器将[v]发送个参与方A,参与方A对[v]进行解密,得到v,并根据v计算回归参数a 1和真实值
Figure PCTCN2022089514-appb-000022
并在参与方A通过同态加密对真实值o进行加密,得到加密后的真实值[o],然后将真实值[o]发送给参与方B,由参与方计算回归参数
Figure PCTCN2022089514-appb-000023
其中,同态加密是基于数学难题的计算复杂性理论的密码学技术。对经过同态加密的数据进行处理得到一个输出,将这一输出进行解密,其结果与用同一方法处理未加密的原始数据得到的输出结果是一样的。同态加密包含一对公私密钥(pk,sk),用[]表征用pk h进行加密后的同态加密,例如m为明文,[m]=Enc pk(m)则为同态加密后的密文。
在上述实施例中,通过同态加密算法对需要进行传递的中间评价因子进行加密,以进一步保证数据安全。
进一步地,所述目标特征变量的决定系数的具体计算公式如下:
Figure PCTCN2022089514-appb-000024
其中,R 2为决定系数,RMSE为均方误差,SST为总平方和,所述基于所述决定系数对所述目标特征变量进行评价的包括:
将所述决定系数与预设阈值进行比对;
当所述决定系数大于或等于预设阈值时,确定所述目标特征变量为必要变量;
当所述决定系数小于预设阈值时,确定所述目标特征变量为非必要变量。
在上述实施例中,服务器在计算目标特征变量的决定系数之前,先由参与方A计算中间参量w 1i和w 2i,其中,
Figure PCTCN2022089514-appb-000025
w 2i=a 1x ti,然后在参与方A通过同态加密对中间参量w 1i和w 2i进行加密,得到加密后的中间参量
Figure PCTCN2022089514-appb-000026
和[w 2i]=Enc pk(a 1x ti),服务器将[w 1i]和[w 2i]发送个参与方B,参与方B对[w 1i]和[w 2i]进行解密,得到中间参量w 1i和w 2i,由参与方B计算均方误差RMSE、总平方和SST,最后参与方B通过同态加密对均方误差 RMSE、总平方和SST进行加密,得到加密后的均方误差[RMSE]、总平方和[SST],并将加密后的均方误差[RMSE]、总平方和[SST]发送给服务器,由服务器计算决定系数R 2,并基于决定系数对目标特征变量进行评价。其中:
Figure PCTCN2022089514-appb-000027
Figure PCTCN2022089514-appb-000028
Figure PCTCN2022089514-appb-000029
Figure PCTCN2022089514-appb-000030
在本申请一种具体的实施例中,服务器对目标特征变量进行评价时,将决定系数与预设阈值进行比对,当决定系数大于或等于预设阈值时,确定目标特征变量为必要变量,当决定系数小于预设阈值时,确定目标特征变量为非必要变量。
需要强调的是,为进一步保证上述特征数据和特征标签的私密和安全性,上述特征数据和特征标签还可以存储于一区块链的节点中。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种特征变量的分析装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图3所示,本实施例所述的特征变量的分析装置包括:
第一分箱模块301,用于对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
第二分箱模块302,用于对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
第一加权平均模块303,用于计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
第二加权平均模块304,用于计算所述特征标签集合中的特征标签的加权平均值,得 到第二加权平均值;
线性回归模块305,用于基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
评价参数计算模块306,用于基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
变量评价模块307,用于基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
进一步地,所述第二分箱模块302具体包括:
分箱信息获取单元,用于获取所述特征数据的分箱信息,并将所述分箱信息发送至所述第二参与方中;
第二分箱单元,用于基于所述分箱信息对所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
进一步地,所述第一加权平均模块303具体包括:
第一权重计算单元,用于基于预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重,得到第一权重;
第一加权单元,用于基于所述第一权重对所述数据特征集合中特征数据进行加权求和,得到第一加权结果;
第一均值计算单元,用于计算第一加权结果的平均值,得到所述第一加权平均值。
进一步地,所述第二加权平均模块304具体包括:
第二权重计算单元,用于基于预设的特征权重算法计算所述特征标签集合中每一个特征标签的特征权重,得到第二权重;
第二加权单元单元,用于基于所述第二权重对所述特征标签集合中的特征标签进行加权求和,得到第二加权结果;
第二均值计算单元单元,用于计算第二加权结果的平均值,得到所述第一加权平均值;
进一步地,所述线性回归模块305具体包括:
回归参量计算单元,用于基于所述第一加权平均值和所述第二加权平均值计算所述线性回归方程的线性回归参量;
回归方程构建单元,用于基于所述线性回归参量和预设的最小二乘法构建所述线性回归方程。
进一步地,所述线性回归方程的表达式如下:
f(x t)=a 0+a 1x t
其中,a 0和a 1均为线性回归参量,x t为数据特征,a 0和a 1的具体计算公式如下:
Figure PCTCN2022089514-appb-000031
Figure PCTCN2022089514-appb-000032
其中,x ti是第i个数据特征的特征值,
Figure PCTCN2022089514-appb-000033
为数据特征集合中数据特征的特征平均值,即第一加权平均值,y i是第i个特征标签的标签值,
Figure PCTCN2022089514-appb-000034
为所有特征标签y' i的标签平均值,即第二加权平均值。
进一步地,所述目标特征变量的决定系数的具体计算公式如下:
Figure PCTCN2022089514-appb-000035
其中,R 2为决定系数,RMSE为均方误差,SST为总平方和,所述变量评价模块307具体包括:
评价比对单元,用于将所述决定系数与预设阈值进行比对;
第一评价结果单元,用于当所述决定系数大于或等于预设阈值时,确定所述目标特征变量为必要变量;
第二评价结果单元,用于当所述决定系数小于预设阈值时,确定所述目标特征变量为非必要变量。
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图4,图4为本实施例计算机设备基本结构框图。
所述计算机设备4包括通过系统总线相互通信连接存储器41、处理器42、网络接口43。需要指出的是,图中仅示出了具有组件41-43的计算机设备4,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机 交互。
所述存储器41至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,所述存储器41可以是所述计算机设备4的内部存储单元,例如该计算机设备4的硬盘或内存。本实施例中,所述存储器41通常用于存储安装于所述计算机设备4的操作系统和各类应用软件,例如特征变量的分析方法的计算机可读指令等。此外,所述存储器41还可以用于暂时地存储已经输出或者将要输出的各类数据。
所述处理器42在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器42通常用于控制所述计算机设备4的总体操作。本实施例中,所述处理器42用于运行所述存储器41中存储的计算机可读指令或者处理数据,例如运行所述特征变量的分析方法的计算机可读指令。
所述网络接口43可包括无线网络接口或有线网络接口,该网络接口43通常用于在所述计算机设备4与其他电子设备之间建立通信连接。
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如上述的特征变量的分析方法的步骤。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。

Claims (20)

  1. 一种特征变量的分析方法,包括:
    对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
    对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
    计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
    计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
    基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
    基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
    基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
  2. 如权利要求1所述的特征变量的分析方法,其中,所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合包括:
    获取所述特征数据的分箱信息,并将所述分箱信息发送至所述第二参与方中;
    基于所述分箱信息对所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
  3. 如权利要求1所述的特征变量的分析方法,其中,所述计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值包括:
    基于预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重,得到第一权重;
    基于所述第一权重对所述数据特征集合中特征数据进行加权求和,得到第一加权结果;
    计算第一加权结果的平均值,得到所述第一加权平均值。
  4. 如权利要求1所述的特征变量的分析方法,其中,所述计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值包括:
    基于预设的特征权重算法计算所述特征标签集合中每一个特征标签的特征权重,得到第二权重;
    基于所述第二权重对所述特征标签集合中的特征标签进行加权求和,得到第二加权结果;
    计算第二加权结果的平均值,得到所述第一加权平均值。
  5. 如权利要求1至4任意一项所述的特征变量的分析方法,其中,所述基于所述第一加权平均值和所述第二加权平均值构建线性回归方程包括:
    基于所述第一加权平均值和所述第二加权平均值计算所述线性回归方程的线性回归参量;
    基于所述线性回归参量和预设的最小二乘法构建所述线性回归方程。
  6. 如权利要求5所述的特征变量的分析方法,其中,所述线性回归方程的表达式如下:
    f(x t)=a 0+a 1x t
    其中,a 0和a 1均为线性回归参量,x t为数据特征,a 0和a 1的具体计算公式如下:
    Figure PCTCN2022089514-appb-100001
    Figure PCTCN2022089514-appb-100002
    其中,x ti是第i个数据特征的特征值,
    Figure PCTCN2022089514-appb-100003
    为数据特征集合中数据特征的特征平均值,即第一加权平均值,y i是第i个特征标签的标签值,
    Figure PCTCN2022089514-appb-100004
    为所有特征标签y' i的标签平均值,即第二加权平均值。
  7. 如权利要求5所述的特征变量的分析方法,其中,所述目标特征变量的决定系数的具体计算公式如下:
    Figure PCTCN2022089514-appb-100005
    其中,R 2为决定系数,RMSE为均方误差,SST为总平方和,所述基于所述决定系数对所述目标特征变量进行评价的包括:
    将所述决定系数与预设阈值进行比对;
    当所述决定系数大于或等于预设阈值时,确定所述目标特征变量为必要变量;
    当所述决定系数小于预设阈值时,确定所述目标特征变量为非必要变量。
  8. 一种特征变量的分析装置,包括:
    第一分箱模块,用于对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
    第二分箱模块,用于对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
    第一加权平均模块,用于计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
    第二加权平均模块,用于计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
    线性回归模块,用于基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
    评价参数计算模块,用于基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
    变量评价模块,用于基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下特征变量的分析方法的步骤:
    对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
    对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
    计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
    计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
    基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
    基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
    基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
  10. 如权利要求9所述的计算机设备,其中,所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合包括:
    获取所述特征数据的分箱信息,并将所述分箱信息发送至所述第二参与方中;
    基于所述分箱信息对所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
  11. 如权利要求9所述的计算机设备,其中,所述计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值包括:
    基于预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重,得到第一权重;
    基于所述第一权重对所述数据特征集合中特征数据进行加权求和,得到第一加权结果;
    计算第一加权结果的平均值,得到所述第一加权平均值。
  12. 如权利要求9所述的计算机设备,其中,所述计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值包括:
    基于预设的特征权重算法计算所述特征标签集合中每一个特征标签的特征权重,得到第二权重;
    基于所述第二权重对所述特征标签集合中的特征标签进行加权求和,得到第二加权结果;
    计算第二加权结果的平均值,得到所述第一加权平均值。
  13. 如权利要求9至12任意一项所述的计算机设备,其中,所述基于所述第一加权平均值和所述第二加权平均值构建线性回归方程包括:
    基于所述第一加权平均值和所述第二加权平均值计算所述线性回归方程的线性回归参量;
    基于所述线性回归参量和预设的最小二乘法构建所述线性回归方程。
  14. 如权利要求13所述的计算机设备,其中,所述线性回归方程的表达式如下:
    f(x t)=a 0+a 1x t
    其中,a 0和a 1均为线性回归参量,x t为数据特征,a 0和a 1的具体计算公式如下:
    Figure PCTCN2022089514-appb-100006
    Figure PCTCN2022089514-appb-100007
    其中,x ti是第i个数据特征的特征值,
    Figure PCTCN2022089514-appb-100008
    为数据特征集合中数据特征的特征平均值,即第一加权平均值,y i是第i个特征标签的标签值,
    Figure PCTCN2022089514-appb-100009
    为所有特征标签y' i的标签平均值,即第二加权平均值。
  15. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下特征变量的分析方法的步骤:
    对第一参与方中的特征数据进行分箱操作,得到数据特征集合;
    对第二参与方中的特征标签进行分箱操作,得到特征标签集合;
    计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值;
    计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值;
    基于所述第一加权平均值和所述第二加权平均值构建线性回归方程;
    基于所述线性回归方程计算目标特征变量的均方误差和总平方和;
    基于所述均方误差和所述总平方和计算所述目标特征变量的决定系数,并基于所述决定系数对所述目标特征变量进行评价。
  16. 如权利要求15所述的计算机可读存储介质,其中,所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合包括:
    获取所述特征数据的分箱信息,并将所述分箱信息发送至所述第二参与方中;
    基于所述分箱信息对所述对第二参与方中的特征标签进行分箱操作,得到特征标签集合。
  17. 如权利要求15所述的计算机可读存储介质,其中,所述计算所述数据特征集合中的特征数据的加权平均值,得到第一加权平均值包括:
    基于预设的特征权重算法计算所述数据特征集合中每一个特征数据的特征权重,得到第一权重;
    基于所述第一权重对所述数据特征集合中特征数据进行加权求和,得到第一加权结果;
    计算第一加权结果的平均值,得到所述第一加权平均值。
  18. 如权利要求15所述的计算机可读存储介质,其中,所述计算所述特征标签集合中的特征标签的加权平均值,得到第二加权平均值包括:
    基于预设的特征权重算法计算所述特征标签集合中每一个特征标签的特征权重,得到第二权重;
    基于所述第二权重对所述特征标签集合中的特征标签进行加权求和,得到第二加权结果;
    计算第二加权结果的平均值,得到所述第一加权平均值。
  19. 如权利要求15至18任意一项所述的计算机可读存储介质,其中,所述基于所述第一加权平均值和所述第二加权平均值构建线性回归方程包括:
    基于所述第一加权平均值和所述第二加权平均值计算所述线性回归方程的线性回归参量;
    基于所述线性回归参量和预设的最小二乘法构建所述线性回归方程。
  20. 如权利要求19所述的计算机可读存储介质,其中,所述线性回归方程的表达式如下:
    f(x t)=a 0+a 1x t
    其中,a 0和a 1均为线性回归参量,x t为数据特征,a 0和a 1的具体计算公式如下:
    Figure PCTCN2022089514-appb-100010
    Figure PCTCN2022089514-appb-100011
    其中,x ti是第i个数据特征的特征值,
    Figure PCTCN2022089514-appb-100012
    为数据特征集合中数据特征的特征平均值,即第一加权平均值,y i是第i个特征标签的标签值,
    Figure PCTCN2022089514-appb-100013
    为所有特征标签y' i的标签平均值,即第二加权平均值。
PCT/CN2022/089514 2021-10-27 2022-04-27 一种特征变量的分析方法、装置、计算机设备及存储介质 WO2023071105A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111254424.7A CN113934983A (zh) 2021-10-27 2021-10-27 一种特征变量的分析方法、装置、计算机设备及存储介质
CN202111254424.7 2021-10-27

Publications (1)

Publication Number Publication Date
WO2023071105A1 true WO2023071105A1 (zh) 2023-05-04

Family

ID=79284685

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089514 WO2023071105A1 (zh) 2021-10-27 2022-04-27 一种特征变量的分析方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN113934983A (zh)
WO (1) WO2023071105A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244650A (zh) * 2023-05-12 2023-06-09 北京富算科技有限公司 特征分箱方法、装置、电子设备和计算机可读存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934983A (zh) * 2021-10-27 2022-01-14 平安科技(深圳)有限公司 一种特征变量的分析方法、装置、计算机设备及存储介质
CN115081004B (zh) * 2022-08-22 2022-11-04 北京瑞莱智慧科技有限公司 数据处理方法、相关装置及存储介质


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140280257A1 (en) * 2013-03-15 2014-09-18 Konstantinos (Constantin) F. Aliferis Data Analysis Computer System and Method For Parallelized and Modularized Analysis of Big Data
WO2021000958A1 (zh) * 2019-07-04 2021-01-07 华为技术有限公司 用于实现模型训练的方法及装置、计算机存储介质
US20210142222A1 (en) * 2019-11-13 2021-05-13 International Business Machines Corporation Automated data and label creation for supervised machine learning regression testing
CN110851786A (zh) * 2019-11-14 2020-02-28 深圳前海微众银行股份有限公司 纵向联邦学习优化方法、装置、设备及存储介质
CN112508199A (zh) * 2020-11-30 2021-03-16 同盾控股有限公司 针对跨特征联邦学习的特征选择方法、装置及相关设备
CN113934983A (zh) * 2021-10-27 2022-01-14 平安科技(深圳)有限公司 一种特征变量的分析方法、装置、计算机设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244650A (zh) * 2023-05-12 2023-06-09 北京富算科技有限公司 特征分箱方法、装置、电子设备和计算机可读存储介质
CN116244650B (zh) * 2023-05-12 2023-10-03 北京富算科技有限公司 特征分箱方法、装置、电子设备和计算机可读存储介质

Also Published As

Publication number Publication date
CN113934983A (zh) 2022-01-14

Similar Documents

Publication Publication Date Title
WO2021120676A1 (zh) 联邦学习网络下的模型训练方法及其相关设备
WO2023071105A1 (zh) 一种特征变量的分析方法、装置、计算机设备及存储介质
WO2022126970A1 (zh) 金融欺诈风险识别方法、装置、计算机设备及存储介质
WO2021120677A1 (zh) 一种仓储模型训练方法、装置、计算机设备及存储介质
WO2021155713A1 (zh) 基于权重嫁接的模型融合的人脸识别方法及相关设备
CN110414987B (zh) 账户集合的识别方法、装置和计算机系统
WO2020173228A1 (zh) 机器学习模型的联合训练方法、装置、设备及存储介质
WO2021174877A1 (zh) 基于智能决策的目标检测模型的处理方法、及其相关设备
CN110855648B (zh) 一种网络攻击的预警控制方法及装置
Liu et al. Keep your data locally: Federated-learning-based data privacy preservation in edge computing
CN113435583A (zh) 基于联邦学习的对抗生成网络模型训练方法及其相关设备
WO2024007599A1 (zh) 基于异构图神经网络的目标服务确定方法和装置
WO2022116491A1 (zh) 基于横向联邦的dbscan聚类方法、及其相关设备
CN112995414B (zh) 基于语音通话的行为质检方法、装置、设备及存储介质
WO2023216494A1 (zh) 基于联邦学习的用户服务策略确定方法及装置
CN111475838A (zh) 基于深度神经网络的图数据匿名方法、装置、存储介质
CN115941322A (zh) 基于人工智能的攻击检测方法、装置、设备及存储介质
WO2024098699A1 (zh) 实体对象的威胁检测方法、装置、设备及存储介质
CN110969261B (zh) 基于加密算法的模型构建方法及相关设备
Chang et al. Cloud computing storage backup and recovery strategy based on secure IoT and spark
Akter et al. Edge intelligence-based privacy protection framework for iot-based smart healthcare systems
Huang et al. Encrypted speech retrieval based on long sequence Biohashing
CN116776150A (zh) 接口异常访问识别方法、装置、计算机设备及存储介质
Feng Application of edge computing and blockchain in smart agriculture system
WO2022142032A1 (zh) 手写签名校验方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885019

Country of ref document: EP

Kind code of ref document: A1