WO2022188648A1 - Data processing method, apparatus, device, computer-readable storage medium and computer program product - Google Patents

Data processing method, apparatus, device, computer-readable storage medium and computer program product

Info

Publication number
WO2022188648A1
Authority
WO
WIPO (PCT)
Prior art keywords: feature, bin, label, binning, distribution information
Prior art date
Application number
PCT/CN2022/078282
Other languages
English (en)
French (fr)
Inventor
范晓亮
陶阳宇
蒋杰
刘煜宏
陈鹏
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP22766174.1A (EP4216074A4)
Publication of WO2022188648A1
Priority to US18/073,333 (US20230100679A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/256Integrating or interfacing systems involving database management systems in federated or virtual databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/219Managing data history or versioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2272Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24547Optimisations to support specific applications; Extensibility of optimisers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning

Definitions

  • the present application relates to artificial intelligence technology, and in particular, to a data processing method, apparatus, device, computer-readable storage medium, and computer program product.
  • Federated learning is an emerging privacy-protection technology that can effectively combine data from all parties for model training without the data leaving its local environment.
  • Embodiments of the present application provide a data processing method, apparatus, computer-readable storage medium, and computer program product, which can improve data security.
  • An embodiment of the present application provides a data processing method, which is applied to a feature-side device, including:
  • the data to be processed includes multiple object identifiers and feature values corresponding to each object identifier;
  • initial binning is performed on the data to be processed to obtain a preset number of bins
  • the respective bins are merged by using the preset binning strategy and the distribution information of the respective bin labels to obtain a final binning result.
  • An embodiment of the present application provides a data processing method based on federated learning, which is applied to a label-side device, including:
  • An embodiment of the present application provides a data processing device, including:
  • a first acquisition module configured to acquire data to be processed, where the data to be processed includes multiple object identifiers and feature values corresponding to each object identifier;
  • a binning module configured to perform initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins;
  • a first sending module configured to determine a plurality of target identifier sets from each bin, and send each target identifier set to the label-side device in sequence;
  • a first receiving module configured to receive the set label distribution information corresponding to each target identifier set sent by the label-side device, and determine the bin label distribution information corresponding to each bin based on each piece of set label distribution information;
  • the merging module is configured to use the preset binning strategy and the distribution information of each binning label to merge the respective bins to obtain a final binning result.
  • the binning module is further configured to:
  • M feature intervals are determined based on the minimum feature value, the maximum feature value and (M-1) feature quantile values;
  • the apparatus further includes:
  • the third acquisition module is configured to acquire the preset partition rule and the number of partitions N;
  • the second determination module is configured to determine the partition identifier of each piece of feature data in the ith bin based on each object identifier in the ith bin and the partition rule, where the partition identifier corresponds to one of the N partitions;
  • the first partition module is configured to allocate each piece of feature data in the ith bin to the ith bin in the partition corresponding to the partition identifier of each piece of feature data.
  • the first sending module is further configured to:
  • Randomly determine R unprocessed object identifiers corresponding to the jth feature value, where R is a positive integer greater than 2 and j = 1, 2, ..., S;
  • the R unprocessed object identifiers are determined as a target identifier set.
  • the first receiving module is further configured to:
  • the bin label distribution information corresponding to the ith bin is updated based on the number of positive samples and the number of negative samples in the set label distribution information, until there is no unprocessed object identifier in the ith bin.
  • the apparatus further includes:
  • a deletion module configured to delete the set label distribution information when the set label distribution information is preset invalid information
  • the third determination module is configured to determine the object identifier in the target identifier set corresponding to the set label distribution information as the unprocessed object identifier.
  • the merging module is further configured to:
  • the label attribute information includes the number of positive samples, the number of negative samples, and the percentage of positive samples
  • the information value of each candidate bin is determined
  • each target bin is merged again with its adjacent bins until the optimization goal is reached, and each final bin is obtained.
  • the apparatus further includes:
  • a fourth acquisition module configured to acquire each final binning of each feature dimension
  • a fourth determination module configured to determine the information value of each final binning of each feature dimension and the total information value corresponding to each feature dimension
  • the feature selection module is configured to perform feature selection based on the information values of each final binning and each total information value to obtain multiple target final bins;
  • the fifth obtaining module is configured to obtain the label distribution information of the final binning of each target, and perform modeling based on the feature data and the label distribution information in the final binning of each target.
  • An embodiment of the present application provides a data processing device, including:
  • the second receiving module is configured to receive the target identifier set sent by the feature-side device, and acquire multiple object identifiers in the target identifier set;
  • the second obtaining module is configured to obtain label information corresponding to each object identifier
  • a first determining module configured to determine the set label distribution information of the target identification set based on each label information
  • the second sending module is configured to send the collective label distribution information to the feature-side device.
  • the first determining module is further configured to:
  • when the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and the number of negative samples is less than the total number of object identifiers, the number of positive samples and the number of negative samples are determined as the set label distribution information.
  • the first determining module is further configured to:
  • the preset invalid information is determined as the set label distribution information of the target identifier set.
  • the apparatus further includes:
  • the fifth obtaining module is configured to obtain the preset partition rule and the number of partitions N;
  • the fifth determination module is configured to determine the partition identifier of each piece of label data based on each object identifier and the partition rule, where the label data includes the object identifier and the label information corresponding to the object identifier, and the partition identifier corresponds to one of the N partitions;
  • the second partition module is configured to add each piece of tag data to the corresponding partition based on the partition identifier.
  • the embodiment of the present application provides a data processing device based on federated learning, including:
  • the processor is configured to implement the method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
  • Embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the methods provided by the embodiments of the present application.
  • Embodiments of the present application provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium; when a processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, the method provided by the embodiments of the present application is implemented.
  • In the embodiments of the present application, initial binning is performed on the to-be-processed data to obtain a preset number of bins; the multiple target identifier sets included in each bin are then sent to the label-side device in turn; the set label distribution information corresponding to each target identifier set is received from the label-side device, and the bin label distribution information corresponding to each bin is determined based on it; finally, the preset binning strategy and the bin label distribution information are used to merge the bins to obtain a final binning result.
  • In this way, the feature-side device sends a target identifier set composed of its own object identifiers to the label-side device, and the label-side device returns the label distribution information corresponding to the target identifier set instead of sending encrypted label information to the feature-side device. This avoids information leakage caused by decryption of the label information, thereby improving data security.
  • FIG. 1 is a schematic diagram of a network architecture of a data processing system 100 according to an embodiment of the present application
  • FIG. 2A is a schematic structural diagram of a feature-side device 400 provided by an embodiment of the present application.
  • FIG. 2B is a schematic structural diagram of the label-side device 200 provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of an implementation of a federated learning-based data processing method provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of the implementation of determining the bin label distribution information corresponding to each bin provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of yet another implementation of the federated learning-based data processing method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of an implementation process of optimally binning a feature provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an implementation process of optimal binning provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an implementation process of randomly selecting a feature and randomly selecting a feature value corresponding to the feature in an embodiment of the present application
  • FIG. 9 is a schematic diagram of information exchange between the Host party and the Guest party according to an embodiment of the present application.
  • The terms "first/second/third" are only used to distinguish similar objects and do not represent a specific ordering of objects. It is understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
  • Federated Learning is a distributed artificial-intelligence approach to training machine learning models that decentralizes the training process, so that user privacy can be maintained without sending data to a centralized server; spreading the training process across many devices also improves efficiency.
  • OB: Optimal Binning.
  • The host (Host) side: a data source without labels, which holds the feature dimensions Xi; the Host side corresponds to the feature-side device in the embodiments of the present application.
  • The guest (Guest) side: the data source that provides the labels; a label indicates whether a sample is a positive example or a negative example and takes the value 0 or 1. The Guest side corresponds to the label-side device in the embodiments of the present application.
  • IV: Information Value.
  • the IV value is mainly used to encode the input variables and evaluate the predictive ability.
  • The size of the IV value of a feature variable indicates the strength of the variable's predictive ability.
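  • The embodiments do not spell out the IV formula. For reference, a commonly used definition (an assumption here, not quoted from the patent) over M bins, with P_i and N_i the positive and negative counts in bin i and P and N the corresponding totals, is:

        \mathrm{WoE}_i = \ln\frac{P_i/P}{N_i/N}, \qquad
        \mathrm{IV} = \sum_{i=1}^{M}\left(\frac{P_i}{P}-\frac{N_i}{N}\right)\mathrm{WoE}_i

  • A larger IV indicates a stronger predictive ability of the feature variable; this is also the definition assumed by the code sketches further below.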
  • The embodiments of the present application provide a feature binning method, apparatus, device, computer-readable storage medium, and computer program product, which can solve the problems of label information leakage and credential stuffing.
  • The following describes exemplary applications of the feature binning device provided by the embodiments of the present application. The device may be implemented as various types of user terminals, such as notebook computers, tablet computers, desktop computers, set-top boxes, and mobile devices (e.g., mobile phones, portable music players, personal digital assistants, dedicated messaging devices, and portable game equipment), and may also be implemented as a server.
  • In the following, an exemplary application in which the device is implemented as a feature-side device will be described.
  • FIG. 1 is a schematic diagram of a network architecture of a data processing system 100 provided by an embodiment of the present application.
  • The network architecture at least includes a label-side device 200 (corresponding to the Guest side in other embodiments), a network 300, a feature-side device 400 (corresponding to the Host side in other embodiments), a database 500-1 of the label-side device, and a database 500-2 of the feature-side device.
  • the label-side device 200 and the feature-side device 400 may be the parties involved in jointly training a machine learning model in vertical federated learning.
  • The feature-side device 400 may be a client, for example a participant device that stores user feature data, such as a device of a bank or a hospital; the client may be a laptop, tablet computer, desktop computer, dedicated training equipment, or other equipment with model training functions. The feature-side device 400 may also be a server.
  • the label-side device 200 may be a device in the government affairs system, and the label-side device 200 may be a client or a server.
  • The label-side device 200 is connected to the feature-side device 400 through the network 300, which can be a wide area network, a local area network, or a combination of the two, and uses wireless or wired links to realize data transmission.
  • When the feature-side device 400 needs to perform binning on the data to be processed (that is, the feature data) it stores, it may first perform initial equal-frequency binning based on the feature values; then, for each initial bin, multiple object identifiers are sent to the label-side device 200 as a target identifier set, and the label-side device returns the label distribution information corresponding to each target identifier set; finally, the preset binning strategy is used to optimize the multiple initial bins and obtain the final binning result.
  • After obtaining the final binning result, the feature-side device performs feature screening to obtain the target feature data for modeling and training the model, and then performs modeling and model training based on the target feature data.
  • FIG. 2A is a schematic structural diagram of a feature-side device 400 provided by an embodiment of the present application.
  • the feature-side device 400 shown in FIG. 2A includes: at least one processor 410 , memory 450 , at least one network interface 420 and user interface 430 .
  • the various components in the feature-side device 400 are coupled together by a bus system 440 .
  • the bus system 440 is used to implement the connection communication between these components.
  • the bus system 440 also includes a power bus, a control bus, and a status signal bus.
  • the various buses are labeled as bus system 440 in FIG. 2A.
  • the processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
  • DSP Digital Signal Processor
  • User interface 430 includes one or more output devices 431 that enable presentation of media content, including one or more speakers and/or one or more visual display screens.
  • User interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
  • Memory 450 may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like.
  • Memory 450 optionally includes one or more storage devices that are physically remote from processor 410 .
  • Memory 450 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • ROM read-only memory
  • RAM random access memory
  • the memory 450 described in the embodiments of the present application is intended to include any suitable type of memory.
  • memory 450 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • the operating system 451 includes system programs for processing various basic system services and performing hardware-related tasks, such as framework layer, core library layer, driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • a presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
  • An input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
  • FIG. 2A shows a data processing apparatus 455 stored in the memory 450, which may be software in the form of programs and plug-ins, including the following software modules: the first acquisition module 4551, the binning module 4552, the first sending module 4553, the first receiving module 4554 and the merging module 4555. These modules are logical, so they can be combined or split arbitrarily according to the implemented functions. The function of each module will be explained below.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the solutions provided in the embodiments of the present application mainly involve the machine learning technology of artificial intelligence, and the technology is described below.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and other techniques.
  • the feature-side device and the label-side device may be node devices in the blockchain system, and correspondingly, the data to be processed and the label information corresponding to each object identifier may be obtained from nodes on the blockchain .
  • the federated learning-based data processing method provided by the embodiment of the present application will be described with reference to the exemplary application and implementation of the terminal provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of an implementation flowchart of a federated learning-based data processing method provided by an embodiment of the present application. The method is applied to a feature-side device. The steps shown in FIG. 3 will be described below.
  • Step S101 acquiring data to be processed.
  • the data to be processed includes multiple object identifiers and feature values corresponding to the respective object identifiers, an object identifier and the feature value corresponding to the object identifier constitute a piece of feature data, and the data to be processed includes multiple pieces of feature data.
  • An object identifier may correspond to one feature dimension (such as age) or to multiple feature dimensions (such as age, gender, region, and monthly income); correspondingly, an object identifier corresponds to one feature value or to multiple feature values.
  • When an object identifier corresponds to multiple feature dimensions, the method provided by the embodiments of the present application can be used to calculate the final binning result of the data to be processed separately for each feature dimension.
  • Step S102 based on the characteristic values corresponding to the respective object identifiers, perform initial binning on the data to be processed to obtain a preset number of bins.
  • In implementation, an unsupervised binning method can be used to perform the initial binning of the data to be processed; for example, equal-frequency binning or equal-width binning can be used to obtain a preset number of bins.
  • The preset number is the preset number of bins, denoted M. The binning interval is determined based on the preset number of bins and the maximum and minimum feature values among the feature values corresponding to the object identifiers; each binning range is then determined based on the minimum feature value and the binning interval; finally, the bin in which each piece of feature data falls is determined according to the binning ranges and the feature value corresponding to each object identifier.
  • For the equal-frequency binning method, the number of object identifiers in the data to be processed and the preset number are first obtained, and the number Q of object identifiers to be included in each bin is determined from them; the feature data are then sorted by feature value, and successive groups of Q pieces of sorted feature data are assigned to the corresponding bins from front to back.
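  • As an illustration of the equal-frequency procedure just described, the following minimal Python sketch (all names are hypothetical, not taken from the patent) sorts the feature data by feature value and assigns Q records to each bin:

        def equal_frequency_binning(data, num_bins):
            """data: list of (object_id, feature_value) pairs; returns num_bins bins."""
            sorted_data = sorted(data, key=lambda pair: pair[1])  # sort by feature value
            q, r = divmod(len(sorted_data), num_bins)             # Q records per bin
            bins, start = [], 0
            for i in range(num_bins):
                size = q + (1 if i < r else 0)                    # spread the remainder
                bins.append(sorted_data[start:start + size])
                start += size
            return bins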
  • Step S103: Determine a plurality of target identifier sets from each bin, and send each target identifier set to the label-side device in sequence.
  • the target identifier set includes at least three object identifiers.
  • the eigenvalues corresponding to each object identifier in the multiple target identifier sets are generally the same.
  • In some embodiments, the feature values may also differ, as long as every object identifier in the target identifier set belongs to the same bin.
  • In implementation, the feature data in each bin may be sorted by feature value, for example in ascending order, so that feature data with the same feature value are arranged together. A feature value present in the bin is then randomly extracted, the object identifier set corresponding to that feature value is determined, and R object identifiers are randomly extracted from that set; these R object identifiers are determined as a target identifier set.
  • For example, suppose the feature dimension is age and an initial bin contains feature data with ages ranging from 18 to 25, so that the feature values included in the bin are {18, 19, 20, 21, 22, 23, 24, 25}. A feature value is randomly extracted, say 20, and the object identifier set with age 20 is determined; assuming this set contains 100 object identifiers, 10 of them are randomly selected each time and determined as a target identifier set, which is sent to the label-side device to obtain the corresponding label distribution information. This continues until the label distribution information of every target identifier set extracted from the object identifier set of that feature value has been obtained; another feature value is then randomly selected and the above process is repeated until all feature values under the feature dimension have been traversed.
  • In this way, multiple target identifier sets can be obtained from one bin; the target identifier sets within a bin can be processed in sequence, while target identifier sets of different bins can be processed in parallel.
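  • The following Python sketch illustrates how target identifier sets could be drawn from one bin as described above; the names are hypothetical, and how a remainder smaller than R should be handled is not specified in the text, so it is left out:

        import random

        def target_identifier_sets(bin_data, R):
            """bin_data: (object_id, feature_value) pairs in one bin; yields sets of R ids."""
            by_value = {}
            for obj_id, value in bin_data:
                by_value.setdefault(value, []).append(obj_id)
            values = list(by_value)
            random.shuffle(values)              # traverse the feature values in random order
            for value in values:
                ids = by_value[value]
                random.shuffle(ids)             # R random unprocessed identifiers per draw
                for k in range(0, len(ids) - len(ids) % R, R):
                    yield ids[k:k + R]          # one target identifier set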
  • Step S104: Receive the set label distribution information corresponding to each target identifier set sent by the label-side device, and determine the bin label distribution information corresponding to each bin based on each piece of set label distribution information.
  • the set label distribution information is used to represent the number of positive samples and negative samples in the target identification set.
  • For example, the set label distribution information may be (4, 6), where 4 represents the number of positive samples and 6 the number of negative samples; this means that a target identifier set containing 10 object identifiers has 4 positive samples and 6 negative samples, but the feature-side device cannot know which of the 10 object identifiers are positive samples and which are negative samples.
  • After receiving each piece of set label distribution information sent by the label-side device, the feature-side device first determines whether it is invalid information; when it is invalid information, the set label distribution information is ignored, and when it is not, the bin label distribution information of the corresponding bin is updated based on the set label distribution information.
  • The initial value of the bin label distribution information is (0, 0): the initial number of positive samples is 0 and the initial number of negative samples is 0.
  • When the first set label distribution information, assumed to be (4, 6), is received, the bin label distribution information is updated to (4, 6); when the second, assumed to be (2, 8), is received, it is updated to (6, 14); and so on, until all the set label distribution information of the bin has been received, at which point the bin label distribution information of the bin is obtained.
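  • A minimal sketch of this accumulation, assuming (0, 0) is the preset invalid information (the choice also made in the FIG. 4 discussion below):

        INVALID = (0, 0)  # assumed preset invalid information

        def accumulate_bin_distribution(set_distributions):
            pos_total, neg_total = 0, 0         # initial bin label distribution (0, 0)
            for pos, neg in set_distributions:
                if (pos, neg) == INVALID:
                    continue                    # invalid information is ignored
                pos_total += pos
                neg_total += neg
            return pos_total, neg_total

        # the example above: (4, 6) then (2, 8) gives (6, 14)
        assert accumulate_bin_distribution([(4, 6), (2, 8)]) == (6, 14)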
  • Step S105 using the preset binning strategy and the distribution information of each binning label to merge the respective bins to obtain a final binning result.
  • the binning strategy may be a binning strategy based on an IV value, or a chi-square binning strategy.
  • In implementation, each pair of adjacent bins may be pre-merged and the label attribute information of the merged bin determined; the merge is then performed only when the label attribute information indicates that the merging condition is satisfied.
  • A bin may be merged either with its previous adjacent bin or with its next adjacent bin; the IV values of the two merged candidates can be compared to determine the target bin, and merging continues in this way until the preset convergence condition is reached, so as to obtain the final binning result.
  • In the data processing method provided by the embodiments of the present application, the to-be-processed data is initially binned to obtain a preset number of bins; the multiple target identifier sets included in each bin are then sent to the label-side device in turn; the set label distribution information corresponding to each target identifier set is received from the label-side device, and the bin label distribution information corresponding to each bin is determined from it; finally, the bins are merged using the preset binning strategy and the bin label distribution information to obtain the final binning result.
  • In this way, the feature-side device sends a target identifier set composed of its own object identifiers to the label-side device, and the label-side device returns the label distribution information corresponding to the target identifier set instead of sending encrypted label information to the feature-side device; this avoids information leakage caused by decryption of the label information, thereby improving data security.
  • In some embodiments, step S102 shown in FIG. 3, "based on the feature values corresponding to the respective object identifiers, initial binning is performed on the data to be processed to obtain a preset number of binning results", can be implemented through the following steps:
  • Step S1021: based on the feature values corresponding to the object identifiers, determine the maximum feature value and the minimum feature value.
  • For example, when binning is performed on the feature dimension of age, the maximum feature value may be 85 and the minimum feature value 10.
  • Step S1022: based on the maximum feature value, the minimum feature value and the preset number M, determine (M-1) feature quantile values.
  • Step S1023 M feature intervals are determined based on the minimum feature value, the maximum feature value and the (M-1) feature quantile values.
  • the first feature interval is [the smallest feature value, the first feature quantile value)
  • the second feature interval is [the first feature quantile value, the second feature quantile value)
  • the Mth feature interval is [the (M-1)th feature quantile value, the largest feature value].
  • Step S1024 based on each feature value and M feature intervals in the data to be processed, perform initial binning on the data to be processed to obtain M bins.
  • In some embodiments, the initial binning may also be performed by equal-frequency binning, in which case each bin contains the same amount of data; for example, if 10 bins are obtained, each bin contains 10% of the data volume.
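  • Steps S1021 to S1024 can be sketched with NumPy as follows (a hedged illustration, assuming the (M-1) quantile values are distinct):

        import numpy as np

        def quantile_bin_index(values, M):
            """Return, for each feature value, the index (0..M-1) of its feature interval."""
            values = np.asarray(values, dtype=float)
            cuts = np.quantile(values, [k / M for k in range(1, M)])  # (M-1) quantile values
            # intervals are [previous cut, next cut); the last one is closed on the right
            return np.digitize(values, cuts, right=False)

        ages = [10, 18, 22, 25, 40, 55, 60, 85]
        print(quantile_bin_index(ages, 4))   # -> [0 0 1 1 2 2 3 3]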
  • the data after the initial binning can also be partitioned by the following steps:
  • Step S1025 obtaining a preset partition rule and the number N of partitions.
  • the preset partitioning rule may be a hash partitioning rule, or may be other preset partitioning rules.
  • Step S1026 based on each object identifier in the ith bin and the partition rule, determine the partition identifier of each piece of feature data in the ith bin.
  • In implementation, a preset hash calculation may be performed on the object identifier to obtain a hash value, and the hash value modulo N is then taken; the result is the partition identifier, which therefore takes a value in 0, 1, ..., N-1.
  • Step S1027 Allocate each piece of feature data in the ith bin to the ith bin in the partition corresponding to the partition identifier of each piece of feature data.
  • Each partition includes M bins, and each piece of feature data in the ith bin is placed into the ith bin of the partition indicated by its partition identifier. For example, if there are 1000 pieces of feature data in the 3rd bin and 4 partitions are preset, these 1000 pieces of feature data are distributed into the 3rd bin of partitions 0, 1, 2 and 3, so that the data of the 3rd bin is divided roughly evenly across the partitions, thereby improving operation efficiency.
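  • A sketch of the partition rule follows; MD5 stands in for the unspecified "preset hash calculation" (Python's built-in hash() is unsuitable here because it is randomized across processes, whereas both parties must compute identical partition identifiers):

        import hashlib

        def partition_id(object_id: str, num_partitions: int) -> int:
            digest = hashlib.md5(object_id.encode("utf-8")).hexdigest()
            return int(digest, 16) % num_partitions   # value in 0, 1, ..., N-1

        partitions = [[] for _ in range(4)]           # N = 4 partitions
        for obj_id, age in [("u001", 23), ("u002", 31), ("u003", 45)]:
            partitions[partition_id(obj_id, 4)].append((obj_id, age))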
  • the label-side device can perform the following steps to partition the label-side label data:
  • Step S201 the label-side device acquires a preset partition rule and the number N of partitions.
  • the partition rules and the number of partitions N adopted by the label-side device are the same as the partition rules and the number of partitions adopted by the feature-side device.
  • Step S202 the label-side device determines the partition identifier of each piece of label data based on each object identifier and the partition rule.
  • the label data includes an object identifier and label information corresponding to the object identifier.
  • the label information is used to indicate whether the object identifier is a positive sample or a negative sample.
  • the label information can be represented by 0 and 1, for example, 0 indicates that the object identifier is a negative sample , 1 indicates that the object is identified as a positive sample.
  • Correspondingly, positive samples may also be called good samples or normal samples, and negative samples may also be called bad samples or default samples.
  • Step S202 may be implemented by performing a preset hash calculation on the object identifier to obtain a hash value and then taking the hash value modulo N; the result is the partition identifier.
  • In addition to storing the label data, the label-side device may also store government-affairs data such as identity features, consumption ability, credit history, and qualifications and honors. Identity features include gender, age, ID card address, student status and educational background; consumption ability includes, for example, the social security and provident fund payment base; credit history includes the number of authorized applications (based on the authorization of credit services in applications (APP, Application)), the number of consecutive performances of credit services, and the number of breaches of credit services.
  • APP Application
  • step S203 the label-side device adds each piece of label data to the corresponding partition based on the partition identifier.
  • In this way, the label data can be divided into the same partition as the feature data based on the object identifier, thereby ensuring that corresponding partitions of the feature-side device and the label-side device hold the same set of object identifiers.
  • step S103 shown in Fig. 3 can be implemented by the following steps:
  • Step S1031 determine the number S of eigenvalues in the ith bin.
  • In step S1031, the feature values in the ith bin are first counted, and the number of distinct feature values is determined.
  • the number of eigenvalues corresponding to different bins may be the same or different.
  • Step S1032 Randomly determine R unprocessed object identifiers corresponding to the jth feature value.
  • In step S1032, one feature value is randomly determined from the S feature values each time, that is, the jth feature value is randomly determined at the jth time; the object identifier set corresponding to the jth feature value is then determined, and R unprocessed object identifiers are randomly determined from that set.
  • Initially, every object identifier is considered an unprocessed object identifier.
  • Step S1033 Determine the R unprocessed object identifiers as a target identifier set.
  • The target identifier set includes R object identifiers, and R is required to be greater than 2 because, if R were equal to 2, the label information of individual object identifiers could be deduced from the label distribution information returned by the label-side device, resulting in information leakage.
  • In some embodiments, step S104 may be implemented through steps S1041 to S1046, as shown in FIG. 4; each step is described below.
  • Step S1041 it is judged whether the p-th set label distribution information in the i-th bin is invalid information.
  • the invalid information may be preset, for example, (0, 0). In some embodiments, the invalid information may also be other preset values, for example, a one-dimensional value of 0.
  • When the p-th set label distribution information is invalid information, go to step S1045; when the p-th set label distribution information is not invalid information, it is valid information, and the process goes to step S1042.
  • Here, i = 1, 2, ..., M, where M is the preset number, and p = 1, 2, ..., W, where W is the total number of target identifier sets in the ith bin.
  • Step S1042 when the set label distribution information is not preset invalid information, acquire the number of positive samples and the number of negative samples in the p-th set label distribution information.
  • In implementation, the set label distribution information has the form (X, Y), where X is the number of positive samples and Y is the number of negative samples; obtaining the values of X and Y therefore yields the number of positive samples and the number of negative samples in the p-th set label distribution information.
  • Step S1043 Determine the object identifier in the target identifier set corresponding to the set label distribution information as the processed object identifier.
  • In implementation, a flag value (flag) representing processed or unprocessed can be set for each object identifier; the flag can be a binary value, where 0 means unprocessed and 1 means processed, and each object identifier is initialized as unprocessed by default.
  • Step S1044: update the bin label distribution information corresponding to the ith bin based on the number of positive samples and the number of negative samples in the set label distribution information, until there is no unprocessed object identifier in the ith bin.
  • Step S1045 when the p-th set label distribution information is preset invalid information, delete the p-th set label distribution information.
  • the set label distribution information is ignored at this time, and the set label distribution information is deleted.
  • Step S1046 Determine the object identifier in the target identifier set corresponding to the p-th set label distribution information as the unprocessed object identifier.
  • That is, each object identifier in the corresponding target identifier set remains an unprocessed object identifier.
  • step S105 shown in FIG. 3 "using the preset binning strategy and the distribution information of each binning label to merge the respective bins to obtain the final binning result" can be implemented by the following steps:
  • Step S1051 Pre-merge each bin with its own adjacent bins to obtain multiple candidate bins.
  • merging two adjacent bins is to merge the feature intervals and label distributions corresponding to the two bins respectively.
  • the feature interval of the first bin is [10, 15)
  • the corresponding label distribution information is (10, 100)
  • the feature interval of the second bin is [15, 20)
  • the corresponding label distribution information is (20, 100).
  • When the first bin and the second bin are merged, the first candidate bin is obtained; its feature interval is [10, 20) and its label distribution information is (30, 200).
  • Step S1052 Determine attribute information of candidate labels of each candidate bin.
  • the label attribute information includes the number of positive samples, the number of negative samples, and the percentage of positive samples. Following the above example, the number of positive samples for the first candidate bin is 30, the number of negative samples is 200, and the percentage of positive samples is 13%.
  • Step S1053: based on the candidate label attribute information, determine whether the candidate bins satisfy the merging condition.
  • the merging conditions may include whether the number of positive samples is greater than a minimum positive sample number threshold, whether the number of negative samples is greater than a minimum negative sample number threshold, whether the percentage of positive samples is greater than a minimum percentage threshold, and less than a maximum percentage threshold.
  • When it is determined based on the candidate label attribute information that a candidate bin meets the merging condition, the process goes to step S1054; when it is determined that the candidate bin does not meet the merging condition, the process goes to step S1056.
  • Step S1054: when it is determined based on the candidate label attribute information that the candidate bins satisfy the merging condition, determine the information value of each candidate bin.
  • Step S1055 Determine the target bin based on the information value of each candidate bin.
  • In step S1055, it may first be determined whether the candidate bins include a candidate bin pair, that is, two candidate bins that contain the same bin. For example, if the first candidate bin is obtained by merging the first bin with the second bin, and the second candidate bin is obtained by merging the second bin with the third bin, then the first candidate bin and the second candidate bin form a candidate bin pair.
  • When candidate bin pairs exist, the candidate bin with the higher information value in each pair is used as a target bin.
  • When there are multiple candidate bin pairs, they are compared in turn to determine the corresponding target bins.
  • When the candidate bins contain no candidate bin pairs, every candidate bin that satisfies the merging condition is a target bin.
  • Step S1056 cancel the pre-merging corresponding to the candidate binning.
  • That is, when a candidate bin does not satisfy the merging condition, the pre-merge corresponding to that candidate bin is cancelled.
  • Step S1057: each target bin is merged again with its adjacent bins until the optimization target is reached, and each final binning result is obtained.
  • After the first round of merging, each target bin includes one or two initial bins, and the IV value of each target bin is obtained; after the second round of merging, each target bin includes one, two, three or four initial bins, and the IV values are obtained again. Whether the optimization goal is reached is determined based on the IV value of each target bin, and when it is reached, the final binning result is determined.
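  • The merging procedure of steps S1051 to S1057 can be sketched as a greedy loop. The following deliberately simplified variant (the thresholds are hypothetical, and it scores the IV of the whole binning after each candidate merge rather than comparing candidate bin pairs individually) shows the shape of one round:

        import math

        MIN_POS, MIN_NEG, MIN_PCT, MAX_PCT = 5, 5, 0.01, 0.99  # hypothetical thresholds

        def iv(bins, total_pos, total_neg):
            """IV over a list of (pos, neg) bin distributions (formula given earlier)."""
            v = 0.0
            for pos, neg in bins:
                p, q = pos / total_pos, neg / total_neg
                if p > 0 and q > 0:
                    v += (p - q) * math.log(p / q)
            return v

        def satisfies_condition(pos, neg):
            pct = pos / (pos + neg)
            return pos > MIN_POS and neg > MIN_NEG and MIN_PCT < pct < MAX_PCT

        def merge_round(bins, total_pos, total_neg):
            """Pre-merge every adjacent pair; keep the admissible merge with the best IV."""
            best, best_iv = None, -math.inf
            for k in range(len(bins) - 1):
                pos = bins[k][0] + bins[k + 1][0]
                neg = bins[k][1] + bins[k + 1][1]
                if not satisfies_condition(pos, neg):
                    continue                  # step S1056: cancel this pre-merge
                candidate = bins[:k] + [(pos, neg)] + bins[k + 2:]
                cand_iv = iv(candidate, total_pos, total_neg)
                if cand_iv > best_iv:         # step S1055: choose by information value
                    best, best_iv = candidate, cand_iv
            return best                       # None when no admissible merge remains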
  • Based on the foregoing embodiments, an embodiment of the present application further provides a data processing method based on federated learning, which is applied to the network architecture shown in FIG. 1. FIG. 5 is a schematic diagram of the implementation process, which includes the following steps:
  • Step S501 the feature-side device acquires the data to be processed.
  • the data to be processed includes multiple object identifiers and feature values corresponding to each object identifier.
  • Step S502 the feature-side device performs initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers to obtain a preset number of bins.
  • Step S503 the feature-side device determines a plurality of target identification sets from each bin, and sends each target identification set to the label-side device in sequence.
  • the target identification set includes at least three object identifications.
  • In implementation, the feature-side device sends each target identifier set to the label-side device through the message queue Pulsar.
  • Step S504 the tag-side device receives the target identifier set sent by the feature-side device, and acquires multiple object identifiers in the target identifier set.
  • step S505 the tag-side device acquires tag information corresponding to each object identifier.
  • The label-side device stores a mapping between object identifiers and label information, thereby ensuring that the time complexity of acquiring the label information corresponding to any object identifier is O(1).
  • Step S506 the tag side device determines the set tag distribution information of the target identification set based on each tag information.
  • the label-side device counts the number of positive samples and the number of negative samples corresponding to the target identification set, and determines the set label distribution information of the target identification set according to the number of positive samples and the number of negative samples.
  • In some embodiments, the number of positive samples and the number of negative samples are verified; when they pass the verification, the number of positive samples and the number of negative samples are determined as the set label distribution information, and when they fail the verification, the preset invalid information is determined as the set label distribution information.
  • Step S507 the tag-side device sends the aggregated tag distribution information to the feature-side device.
  • Step S508 the feature-side device receives each set label distribution information corresponding to each target identification set sent by the label-side device, and determines the bin label distribution information corresponding to each bin based on the set label distribution information.
  • step S509 the feature-side device uses the preset binning strategy and the distribution information of each binning label to merge the respective bins to obtain a final binning result.
  • step S510 the feature-side device obtains each final binning of each feature dimension.
  • optimal binning can be performed based on each feature dimension through the above steps, and each final binning under each feature dimension can be obtained.
  • Step S511 the feature-side device determines the information value of each final binning of each feature dimension and the total information value corresponding to each feature dimension.
  • the IV value of each final binning under each feature dimension has been determined in the process of determining the final binning result, and the total information value corresponding to a certain feature dimension is the sum of the IV values of each final binning under the feature dimension.
  • Step S512 the feature-side device performs feature selection based on the information value of each final binning and each total information value, and obtains multiple target final bins.
  • In step S512, it may first be determined whether the total information value of each feature dimension is greater than a feature information threshold, and the feature dimensions exceeding that threshold are determined as target feature dimensions; it is then determined whether the information value of each final bin under the target feature dimensions is greater than a binning information threshold, and the final bins exceeding that threshold are determined as the target final bins, as sketched below.
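  • A minimal sketch of this two-level screening; the threshold names and values are illustrative, not from the patent:

        FEATURE_IV_THRESHOLD = 0.02
        BIN_IV_THRESHOLD = 0.001

        def select_target_final_bins(features):
            """features: dict mapping feature dimension -> list of (final_bin, iv_value)."""
            targets = []
            for dim, bins in features.items():
                total_iv = sum(iv_value for _, iv_value in bins)
                if total_iv <= FEATURE_IV_THRESHOLD:
                    continue                  # drop feature dimensions with weak total IV
                targets.extend(b for b, iv_value in bins if iv_value > BIN_IV_THRESHOLD)
            return targets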
  • step S513 the feature-side device obtains the label distribution information of the final binning of each target, and performs modeling based on the feature data and the label distribution information in the final binning of each target.
  • In step S513, a vertical federated learning method based on logistic regression can be used for modeling.
  • the model training needs to be performed by using the feature data and label distribution information in the final target binning, and finally a trained neural network model is obtained.
  • In the data processing method provided by the embodiments of the present application, after acquiring the data to be processed, which includes multiple object identifiers and the feature values corresponding to the respective object identifiers, the feature-side device performs initial binning on the data to be processed to obtain a preset number of bins; it then sends the multiple target identifier sets included in each bin to the label-side device in turn, receives the set label distribution information corresponding to each target identifier set, determines the bin label distribution information corresponding to each bin based on it, and merges the bins using the preset binning strategy and the bin label distribution information. In this way, the feature-side device sends target identifier sets composed of its own object identifiers to the label-side device, and the label-side device returns the corresponding label distribution information to the feature-side device instead of sending encrypted label information, which avoids information leakage caused by decryption of the label information, thereby improving data security.
  • feature screening is performed, and the final binning of the target obtained by the screening is used for logistic regression modeling of longitudinal federated learning.
  • In some embodiments, step S506 shown in FIG. 5, in which the label-side device determines the set label distribution information of the target identifier set based on the label information, includes:
  • Step S5061: determine the number of positive samples and the number of negative samples based on the label information.
  • Step S5062: judge whether the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and whether the number of negative samples is also less than that total.
  • When both counts are less than the total number of object identifiers, the label information corresponding to the target identifier set contains both positive and negative samples; the feature-side device cannot recover the label of any individual object identifier from such distribution information, so the flow proceeds to step S5063.
  • When the number of positive samples equals the total number of object identifiers, the labels of all object identifiers in the set are positive; likewise, when the number of negative samples equals that total, they are all negative. Returning the raw counts in these cases would let the feature-side device infer the label of every object identifier, which is not allowed in federated learning, so the flow proceeds to step S5064.
  • Step S5063: when the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and the number of negative samples is less than that total, determine the number of positive samples and the number of negative samples as the set label distribution information.
  • Step S5064: when the number of positive samples equals the total number of object identifiers contained in the target identifier set, or the number of negative samples equals that total, determine the preset invalid information as the set label distribution information of the target identifier set.
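  • As a minimal sketch of this rule, assuming labels are encoded as 0/1 and (0, 0) is the preset invalid information (the function name is illustrative):

    def set_label_distribution(labels):
        """Steps S5061-S5064: return (positives, negatives) only when the set
        is mixed; otherwise return the preset invalid information (0, 0)."""
        total = len(labels)
        positives = sum(1 for y in labels if y == 1)
        negatives = total - positives
        if positives < total and negatives < total:  # both classes present
            return (positives, negatives)
        return (0, 0)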
  • In a vertical federated learning scenario, the party holding the labels is the Guest (that is, the label-side device in other embodiments), and the participants without labels are Hosts (that is, the feature-side devices in other embodiments).
  • Because the Guest holds the labels, it has a natural advantage in optimally binning its own data.
  • The Host side, having no labels, needs to rely on the Guest side's labels to achieve optimal binning.
  • For example, suppose X1 on the Host side represents age. Different age groups have different credit scores and different abilities to honor commitments, so segmenting age in the way that most accurately reflects group characteristics is crucial for a bank.
  • Suppose the Host holds features X = {X1, X2, X3, X4} and the Guest holds only Label = {Y}, where Y is binary; the Host side then needs the Guest side's Y to optimally bin X1, X2, X3, and X4 respectively.
  • The first step of optimal binning is to perform initial binning on the feature (equal-frequency binning, i.e. quantile binning, is commonly used), then to collect the label distribution histogram of each initial bin, and finally to merge some adjacent bins until the optimal binning model converges and the optimal binning is obtained.
  • FIG. 6 is a schematic diagram of the implementation process of optimal binning of a feature provided by an embodiment of the present application.
  • As shown in FIG. 6, the Host performs initial binning on the original feature data 601 to obtain 32 initial bins 602, then obtains the label distribution data of the 32 initial bins (that is, the set label distribution information in other embodiments) and merges adjacent bins when they satisfy the merging conditions; 603 in FIG. 6 shows the possible merged bins, and the optimal merging scheme is determined based on the IV value of each merged bin, yielding the optimal binning result 604.
  • the feature binning method provided by the embodiment of the present application will be described below with reference to each step.
  • the feature binning method provided in this embodiment of the present application can be implemented through the following steps:
  • Step S701: the Host performs equal-frequency binning on Xi according to the hyperparameter M (the number of initial bins).
  • Step S701 corresponds to step S102 in other embodiments.
  • Equal-frequency binning ensures that roughly the same number of elements is allocated to each bin; if M is 20, each bin contains roughly 5% of the elements. After binning, the continuous feature is converted into a discrete feature whose values range from 0 to M-1, as in the sketch below.
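  • A minimal numpy-based sketch of equal-frequency binning (an illustration, not the embodiment's implementation; ties in the data can make some bins slightly uneven):

    import numpy as np

    def equal_frequency_binning(values: np.ndarray, m: int = 20) -> np.ndarray:
        """Map each value to a bin id in [0, m-1] so that the bins hold
        roughly equal numbers of elements (quantile binning)."""
        # m bins need m-1 interior quantile cut points.
        edges = np.quantile(values, np.linspace(0, 1, m + 1)[1:-1])
        return np.searchsorted(edges, values, side="right")

    ages = np.random.randint(18, 80, size=1000)
    bins = equal_frequency_binning(ages, m=20)  # each bin holds about 5% of elements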
  • Step S702: the Guest side and the Host side each perform a hash partition (HashPartition) on the id column.
  • Step S702 corresponds to steps S1025 and S201 in other embodiments.
  • To guarantee running efficiency, the ids in each bin can be hash partitioned; this spreads the data evenly across partitions, ensures that corresponding partitions on the Guest side and the Host side hold the same id sets, and ensures that every hash partition includes all bins.
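  • The partition rule itself can be as simple as a stable hash modulo N. The sketch below uses MD5 rather than Python's built-in hash(), which is salted per process and would therefore not give both parties the same partitioning; the function names are illustrative:

    import hashlib

    def partition_id(object_id: str, n_partitions: int) -> int:
        """Deterministic rule: both parties apply it to the same id column,
        so corresponding partitions hold the same id sets."""
        digest = hashlib.md5(object_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % n_partitions

    def hash_partition(ids, n_partitions: int):
        partitions = [[] for _ in range(n_partitions)]
        for oid in ids:
            partitions[partition_id(oid, n_partitions)].append(oid)
        return partitions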
  • Step S703: sort each partition on the Host side by the value of the feature Xi, and cache the id-to-label mapping on the Guest side.
  • Each hash partition on the Host side can be sorted by the value of Xi in ascending order, with elements of equal value arranged together, as shown at 711 in FIG. 7. Caching the id-to-label mapping inside each partition on the Guest side guarantees that the label corresponding to any id can be obtained with O(1) time complexity.
  • Step S704: the Host collects the label distribution of each initial bin.
  • When implementing this step, the Host generates a random number r within a certain range (for example, 3-8), sequentially takes r ids from the same bin of each hash partition to obtain idSet(r), and sends it to the Guest through the message queue. After receiving idSet(r), the Guest computes the label distribution histogram and verifies its legality, that is, checks whether the label information of any individual id could be inferred back from the result. If it is legal, the Guest sends the label distribution histogram to the Host; otherwise it sends (0,0). After receiving the label distribution histogram, the Host saves the data. This step is repeated until all bins on the Host side have completed their preliminary label statistics, as in the sketch below.
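  • One Host-side round of this exchange can be sketched as follows, with guest_query standing in for the message-queue round trip; transport code is omitted, and a production version would also need a stopping rule for id sets that keep failing verification:

    import random

    def collect_bin_distribution(bin_ids, guest_query, r_range=(3, 8)):
        """Accumulate (positives, negatives) for one bin by repeatedly sampling
        r ids and asking the Guest for their label histogram."""
        unprocessed = set(bin_ids)
        positives = negatives = 0
        while len(unprocessed) > 2:            # the protocol requires more than 2 ids
            r = random.randint(*r_range)
            id_set = random.sample(sorted(unprocessed), min(r, len(unprocessed)))
            pos, neg = guest_query(id_set)     # Guest answers (0, 0) if invalid
            if (pos, neg) == (0, 0):
                continue                       # ids stay unprocessed; resample
            positives += pos
            negatives += neg
            unprocessed.difference_update(id_set)
        return positives, negatives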
  • Step S705: the Host aggregates all the label distribution histograms to obtain the label distribution histogram information of all bins (corresponding to the bin label distribution information in other embodiments).
  • Step S704 and step S705 correspond to step S104 in other embodiments.
  • For example, bins1 has 5 initial bins, and after the above steps the label distribution of each bin can be obtained. Assuming the distributions are as shown in FIG. 7: the 0th bin contains 10 positive examples and 100 negative examples; the 1st bin contains 20 positive examples and 100 negative examples; the 2nd bin contains 10 positive examples and 200 negative examples; the 3rd bin contains 10 positive examples and 100 negative examples; and the 4th bin contains 10 positive examples and 100 negative examples.
  • Step S706: the Host performs optimal binning on the initial bins of binsi according to the optimal binning strategy.
  • Step S706 corresponds to step S105 in other embodiments; essentially, two adjacent bins are selected and merged into a new bin each time, and this process is repeated, subject to the constraints being satisfied, until the final gain is maximized.
  • Commonly used optimal binning strategies include IV-based optimal binning and chi-square-based optimal binning. Continuing the above example, the 0th bin is merged with the 1st bin and the 3rd bin with the 4th bin, finally yielding 3 bins: the 0th bin contains 30 positive examples and 200 negative examples, the 1st bin contains 10 positive examples and 200 negative examples, and the 2nd bin contains 20 positive examples and 200 negative examples.
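  • As an illustration of the IV-based variant, the IV of a candidate merged bin can be computed from the distributions in this example as follows; this uses the WoE = ln(negative% / positive%) convention stated later for step S1054 (some references use the inverse ratio), and is a sketch rather than the embodiment's implementation:

    import math

    bins = [(10, 100), (20, 100), (10, 200), (10, 100), (10, 100)]  # (pos, neg)
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)

    def iv_of(pos: int, neg: int) -> float:
        """IV contribution of one bin: (neg% - pos%) * ln(neg% / pos%)."""
        pos_pct, neg_pct = pos / total_pos, neg / total_neg
        return (neg_pct - pos_pct) * math.log(neg_pct / pos_pct)

    def merged_iv(i: int) -> float:
        """IV of the candidate bin obtained by merging adjacent bins i and i+1."""
        pos = bins[i][0] + bins[i + 1][0]
        neg = bins[i][1] + bins[i + 1][1]
        return iv_of(pos, neg)

    best = max(range(len(bins) - 1), key=merged_iv)  # most informative adjacent merge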
  • In the embodiments of this application, step S704 may be implemented through the following steps (a loop sketch follows the list):
  • Step S7041: randomly select one feature dimension Xi of the Host side. The Host must ensure that Xi is not revealed to the Guest, and the Guest cannot derive Xi by other means.
  • Step S7042: randomly select a feature value F(i)(j) from Xi and obtain the id(i)(j) set of F(i)(j). Likewise, the Host must ensure that F(i)(j) is not revealed to the Guest.
  • Step S7043: the Host generates a random number r each time.
  • The random number r is required to be greater than 2, which guarantees that the Host cannot determine any sample label from the label distribution returned by the Guest; r itself must also not be derivable.
  • Step S7044: the Host randomly selects r unmarked ids id(i)(j)(r) from the id(i)(j) set of F(i)(j) and sends them to the Guest.
  • The Host sends id(i)(j)(r) to the Guest, and it must be guaranteed that, after receiving id(i)(j)(r), the Guest cannot derive Xi or F(i)(j) from it.
  • Step S7045: after receiving id(i)(j)(r), the Guest uses the label Y it holds to count the label distribution of id(i)(j)(r), {n(i)(j)(1)(r), n(i)(j)(0)(r)}.
  • Here, 1 and 0 represent the two values of the binary label Y.
  • The Guest verifies the label distribution to ensure that it does not reveal the label of any single sample: if either of n(i)(j)(1)(r) and n(i)(j)(0)(r) is 0, the verification fails and (0,0) is returned to the Host. The Host likewise cannot derive the label of a single sample by other means.
  • Step S7046: the Guest returns the label distribution {n(i)(j)(1)(r), n(i)(j)(0)(r)} to the Host, as shown in FIG. 9.
  • Step S7047: if the verification succeeds, the Host saves {n(i)(j)(1)(r), n(i)(j)(0)(r)} and marks id(i)(j)(r) as processed.
  • After step S7047, the flow returns to step S7043 until all ids id(i)(j) of F(i)(j) have been counted; it then jumps to step S7042 until all feature values of Xi have been processed; and finally it jumps to step S7041 until every feature in X held by the Host has been processed.
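  • The loop structure of steps S7041 to S7047 can be summarized in the following sketch; ids_by_value is an assumed mapping from each feature dimension to {feature value: id set}, and guest_query stands in for steps S7045/S7046 and is assumed to return (0, 0) on verification failure:

    import random

    def host_label_statistics(ids_by_value, guest_query):
        saved = {}
        dims = list(ids_by_value)
        random.shuffle(dims)                           # S7041: random feature order
        for xi in dims:
            values = list(ids_by_value[xi])
            random.shuffle(values)                     # S7042: random value order
            for value in values:
                pending = set(ids_by_value[xi][value])
                while len(pending) > 2:                # r must stay greater than 2
                    r = random.randint(3, 8)           # S7043
                    batch = random.sample(sorted(pending), min(r, len(pending)))
                    n1, n0 = guest_query(batch)        # S7044-S7046
                    if (n1, n0) == (0, 0):
                        break   # sketch only; a real system would re-batch
                    saved.setdefault((xi, value), []).append((n1, n0))
                    pending.difference_update(batch)   # S7047: mark processed
        return saved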
  • In a payment scenario, after sample alignment each side has a sample size of about 170,000, with 25 feature dimensions on the Host side and 29 on the Guest side.
  • As shown in Table 2, the KS value of the baseline LR is 37.98%, and the KS value of LR after optimal binning is 38.19%, a relative improvement of about 0.5%. For large-scale financial scenarios, this improvement can translate into tens of millions in revenue.
  • Table 2. Model KS comparison: baseline LR vs. binning-based LR
                        Host side   Guest side   Baseline LR   Binning-based LR
    Data volume         176892      176892       -             -
    Feature dimensions  25          29           -             -
    Model KS value      -           -            37.98%        38.19%
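  • The KS value reported here is the standard Kolmogorov-Smirnov statistic over model scores; a common way to compute it is sketched below (an illustration, not code from the embodiments):

    import numpy as np

    def ks_statistic(y_true: np.ndarray, y_score: np.ndarray) -> float:
        """KS = max gap between the cumulative score distributions of the
        positive and negative classes."""
        order = np.argsort(y_score)
        y = y_true[order]
        cum_pos = np.cumsum(y == 1) / max((y == 1).sum(), 1)
        cum_neg = np.cumsum(y == 0) / max((y == 0).sum(), 1)
        return float(np.max(np.abs(cum_pos - cum_neg)))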
  • In the feature binning method provided by the embodiments of this application, the Host sends the IDSet to the Guest instead of the Guest sending label information to the Host, which avoids the risk of leaking the Guest's label information and improves information security; in addition, the positive/negative sample distribution check on each IDSet guards against credential-stuffing risks.
  • Continuing with the exemplary structure in which the data processing apparatus 455 is implemented as software modules, in some embodiments, as shown in FIG. 2A, the software modules of the data processing apparatus 455 stored in the memory 450 may include:
  • the first acquisition module 4551, configured to acquire data to be processed, the data to be processed including multiple object identifiers and the feature values corresponding to the respective object identifiers; the binning module 4552, configured to perform initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers to obtain a preset number of bins; the first sending module 4553, configured to determine multiple target identifier sets from the bins and send each target identifier set in turn to the label-side device, each target identifier set including at least three object identifiers; the first receiving module 4554, configured to receive the set label distribution information corresponding to each target identifier set sent by the label-side device and determine the bin label distribution information corresponding to each bin based on the set label distribution information; and the merging module 4555, configured to merge the bins by using the preset binning strategy and the bin label distribution information to obtain a final binning result.
  • In some embodiments, the apparatus further includes: a third acquisition module, configured to acquire a preset partition rule and a number of partitions N; a second determining module, configured to determine, based on the object identifiers in the ith bin and the partition rule, the partition identifier of each piece of feature data in the ith bin; and a first partition module, configured to assign each piece of feature data in the ith bin to the ith bin of the partition corresponding to that piece's partition identifier.
  • In some embodiments, the apparatus further includes: a deletion module, configured to delete the set label distribution information when the set label distribution information is the preset invalid information; and a third determining module, configured to determine the object identifiers in the target identifier set corresponding to that set label distribution information as unprocessed object identifiers.
  • In some embodiments, the merging module is further configured to: merge each bin with its adjacent bins to obtain multiple candidate bins; determine the candidate label attribute information of each candidate bin, the label attribute information including the number of positive samples, the number of negative samples, and the percentage of positive samples; when it is determined, based on the candidate label attribute information, that the candidate bins satisfy the merging conditions, determine the information value of each candidate bin; determine target bins based on the information values of the candidate bins; and merge each target bin with its adjacent bins again until the optimization goal is reached, obtaining the final bins.
  • In some embodiments, the apparatus further includes: a fourth acquisition module, configured to obtain the final bins of each feature dimension; a fourth determining module, configured to determine the information value of each final bin of each feature dimension and the total information value corresponding to each feature dimension; a feature selection module, configured to perform feature selection based on the information values of the final bins and the total information values to obtain multiple target final bins; and a fifth acquisition module, configured to obtain the label distribution information of each target final bin and perform modeling based on the feature data and label distribution information in each target final bin.
  • FIG. 2B is a schematic structural diagram of the label-side device 200 provided by an embodiment of this application.
  • The label-side device 200 shown in FIG. 2B includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230.
  • The components of the label-side device 200 are coupled together by a bus system 240.
  • For the functions and structures of the processor 210, the network interface 220, the user interface 230, the bus system 240, and the memory 250, reference may be made to the processor 410, the network interface 420, the user interface 430, the bus system 440, and the memory 450 of the feature-side device 400.
  • An embodiment of this application further provides a data processing apparatus 255, stored in the memory 250 of the label-side device 200.
  • The software modules in the data processing apparatus 255 may include:
  • the second receiving module 2551, configured to receive the target identifier set sent by the feature-side device and acquire the multiple object identifiers in the target identifier set; the second acquisition module 2552, configured to acquire the label information corresponding to each object identifier; the first determining module 2553, configured to determine the set label distribution information of the target identifier set based on the label information; and the second sending module 2554, configured to send the set label distribution information to the feature-side device.
  • In some embodiments, the first determining module 2553 is further configured to: when the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and the number of negative samples is less than that total, determine the number of positive samples and the number of negative samples as the set label distribution information.
  • In some embodiments, the first determining module 2553 is further configured to: when the number of positive samples equals the total number of object identifiers contained in the target identifier set, or the number of negative samples equals that total, determine the preset invalid information as the set label distribution information of the target identifier set.
  • In some embodiments, the apparatus further includes: a fifth acquisition module, configured to acquire a preset partition rule and a number of partitions N; a fifth determining module, configured to determine, based on each object identifier and the partition rule, the partition identifier of each piece of label data, the label data including an object identifier and the label information corresponding to the object identifier; and a second partition module, configured to add each piece of label data to the corresponding partition based on the partition identifier.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the above-mentioned federated learning-based data processing method in the embodiment of the present application.
  • An embodiment of this application provides a computer-readable storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to perform the methods provided by the embodiments of this application, for example the methods shown in FIG. 3, FIG. 4, and FIG. 5.
  • In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM, or any device including one of or any combination of the foregoing memories.
  • In some embodiments, the executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
  • As an example, the executable instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files that store one or more modules, subroutines, or code sections).
  • As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A federated-learning-based data processing method, apparatus, and device, and a computer-readable storage medium. The method includes: acquiring data to be processed, the data to be processed including multiple object identifiers and feature values corresponding to the respective object identifiers; performing initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins; determining multiple target identifier sets from the respective bins, and sending each target identifier set in turn to a label-side device; receiving set label distribution information corresponding to each target identifier set sent by the label-side device, and determining bin label distribution information corresponding to each bin based on the set label distribution information; and merging the bins by using a preset binning strategy and the bin label distribution information, to obtain final bins.

Description

Data processing method, apparatus, and device, computer-readable storage medium, and computer program product
CROSS-REFERENCE TO RELATED APPLICATION
This application is based on and claims priority to the Chinese patent application No. 202110258944.9 filed on March 10, 2021, the entire content of which is incorporated herein by reference.
TECHNICAL FIELD
This application relates to artificial intelligence technology, and in particular to a data processing method, apparatus, and device, a computer-readable storage medium, and a computer program product.
BACKGROUND
Federated learning is an emerging privacy-preserving technology that can effectively combine the data of multiple parties for model training on the premise that the data never leaves its owner.
Vertical federated learning usually involves different participants jointly training a machine learning model. Optimal binning, a very common binning model, is widely used in the feature engineering stage before machine learning modeling, and binned data can markedly improve model performance. This non-linear binning model naturally has the ability to split continuous features and discretizes them with split points. However, because different departments of different companies store and maintain their data independently, "data silos" have gradually formed. As data privacy and security protection laws become increasingly complete, how to perform joint binning safely and efficiently across data silos has become a challenge.
SUMMARY
Embodiments of this application provide a data processing method, apparatus, computer-readable storage medium, and computer program product, which can improve data security.
The technical solutions of the embodiments of this application are implemented as follows:
An embodiment of this application provides a data processing method, applied to a feature-side device, including:
acquiring data to be processed, the data to be processed including multiple object identifiers and feature values corresponding to the respective object identifiers;
performing initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins;
determining multiple target identifier sets from the respective bins, and sending each target identifier set in turn to a label-side device;
receiving set label distribution information corresponding to each target identifier set sent by the label-side device, and determining bin label distribution information corresponding to each bin based on the set label distribution information;
merging the bins by using a preset binning strategy and the bin label distribution information, to obtain a final binning result.
An embodiment of this application provides a federated-learning-based data processing method, applied to a label-side device, including:
receiving a target identifier set sent by a feature-side device, and acquiring multiple object identifiers in the target identifier set;
acquiring label information corresponding to each object identifier;
determining set label distribution information of the target identifier set based on the label information;
sending the set label distribution information to the feature-side device.
An embodiment of this application provides a data processing apparatus, including:
a first acquisition module, configured to acquire data to be processed, the data to be processed including multiple object identifiers and feature values corresponding to the respective object identifiers;
a binning module, configured to perform initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins;
a first sending module, configured to determine multiple target identifier sets from the respective bins and send each target identifier set in turn to a label-side device;
a first receiving module, configured to receive set label distribution information corresponding to each target identifier set sent by the label-side device, and determine bin label distribution information corresponding to each bin based on the set label distribution information;
a merging module, configured to merge the bins by using a preset binning strategy and the bin label distribution information, to obtain a final binning result.
In some embodiments, the binning module is further configured to:
determine a maximum feature value and a minimum feature value based on the feature values corresponding to the object identifiers;
determine (M-1) feature quantile values based on the maximum feature value, the minimum feature value, and a preset number M;
determine M feature intervals based on the minimum feature value, the maximum feature value, and the (M-1) feature quantile values;
perform initial binning on the data to be processed based on the feature values in the data to be processed and the M feature intervals, to obtain M bins, where the ith bin includes multiple pieces of feature data whose feature values fall within the ith feature interval, i=1,2,...,M.
In some embodiments, the apparatus further includes:
a third acquisition module, configured to acquire a preset partition rule and a number of partitions N;
a second determining module, configured to determine, based on the object identifiers in the ith bin and the partition rule, a partition identifier of each piece of feature data in the ith bin, the partition identifier corresponding to one of the N partitions;
a first partition module, configured to assign each piece of feature data in the ith bin to the ith bin of the partition corresponding to that piece's partition identifier.
In some embodiments, the first sending module is further configured to:
determine the number S of feature values in the ith bin, where i=1,2,...,M, M is the preset number, and S is a positive integer;
randomly determine R unprocessed object identifiers corresponding to the jth feature value, where R is a positive integer greater than 2 and j=1,2,...,S;
determine the R unprocessed object identifiers as one target identifier set.
In some embodiments, the first receiving module is further configured to:
when the set label distribution information corresponding to the ith bin is not preset invalid information, acquire the number of positive samples and the number of negative samples in the set label distribution information, where i=1,2,...,M and M is the preset number;
determine the object identifiers in the target identifier set corresponding to the set label distribution information as processed object identifiers;
update the bin label distribution information corresponding to the ith bin based on the number of positive samples and the number of negative samples in the set label distribution information, until no unprocessed object identifier remains in the ith bin.
In some embodiments, the apparatus further includes:
a deletion module, configured to delete the set label distribution information when the set label distribution information is preset invalid information;
a third determining module, configured to determine the object identifiers in the target identifier set corresponding to the set label distribution information as unprocessed object identifiers.
In some embodiments, the merging module is further configured to:
merge each bin with its adjacent bins to obtain multiple candidate bins;
determine candidate label attribute information of each candidate bin, the label attribute information including the number of positive samples, the number of negative samples, and the percentage of positive samples;
when it is determined, based on the candidate label attribute information, that the candidate bins satisfy the merging conditions, determine the information value of each candidate bin;
determine target bins based on the information values of the candidate bins;
merge each target bin with its adjacent bins again until the optimization goal is reached, to obtain the final bins.
In some embodiments, the apparatus further includes:
a fourth acquisition module, configured to acquire the final bins of each feature dimension;
a fourth determining module, configured to determine the information value of each final bin of each feature dimension and the total information value corresponding to each feature dimension;
a feature selection module, configured to perform feature selection based on the information values of the final bins and the total information values, to obtain multiple target final bins;
a fifth acquisition module, configured to acquire the label distribution information of each target final bin, and perform modeling based on the feature data and label distribution information in each target final bin.
An embodiment of this application provides a data processing apparatus, including:
a second receiving module, configured to receive a target identifier set sent by a feature-side device, and acquire multiple object identifiers in the target identifier set;
a second acquisition module, configured to acquire label information corresponding to each object identifier;
a first determining module, configured to determine set label distribution information of the target identifier set based on the label information;
a second sending module, configured to send the set label distribution information to the feature-side device.
In some embodiments, the first determining module is further configured to:
determine the number of positive samples and the number of negative samples based on the label information;
when the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and the number of negative samples is less than that total, determine the number of positive samples and the number of negative samples as the set label distribution information.
In some embodiments, the first determining module is further configured to:
when the number of positive samples equals the total number of object identifiers contained in the target identifier set, or the number of negative samples equals that total, determine preset invalid information as the set label distribution information of the target identifier set.
In some embodiments, the apparatus further includes:
a fifth acquisition module, configured to acquire a preset partition rule and a number of partitions N;
a fifth determining module, configured to determine, based on each object identifier and the partition rule, a partition identifier of each piece of label data, the label data including an object identifier and the label information corresponding to the object identifier, the partition identifier corresponding to one of the N partitions;
a second partition module, configured to add each piece of label data to the corresponding partition based on the partition identifier.
An embodiment of this application provides a federated-learning-based data processing device, including:
a memory, configured to store executable instructions;
a processor, configured to implement the methods provided by the embodiments of this application when executing the executable instructions stored in the memory.
An embodiment of this application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the methods provided by the embodiments of this application.
An embodiment of this application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium; when a processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, the methods provided by the embodiments of this application are implemented.
The embodiments of this application have the following beneficial effects:
After data to be processed including multiple object identifiers and the corresponding feature values is acquired, initial binning is performed on it based on those feature values to obtain a preset number of bins; the multiple target identifier sets contained in each bin are sent in turn to the label-side device; the set label distribution information corresponding to each target identifier set is received from the label-side device, and the bin label distribution information corresponding to each bin is determined from it; the bins are then merged using a preset binning strategy and the bin label distribution information to obtain the final binning result. In this data processing procedure, the feature-side device sends target identifier sets composed of its own object identifiers to the label-side device, and the label-side device returns the label distribution information corresponding to each target identifier set instead of sending encrypted label information to the feature-side device, which avoids the information leakage that decryption of the label information could cause and thereby improves data security.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the network architecture of a data processing system 100 provided by an embodiment of this application;
FIG. 2A is a schematic structural diagram of a feature-side device 400 provided by an embodiment of this application;
FIG. 2B is a schematic structural diagram of a label-side device 200 provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of an implementation of the federated-learning-based data processing method provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of determining the bin label distribution information corresponding to each bin, provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of a further implementation of the federated-learning-based data processing method provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the process of optimally binning a feature, provided by an embodiment of this application;
FIG. 7 is a schematic diagram of the implementation process of optimal binning provided by an embodiment of this application;
FIG. 8 is a schematic diagram of randomly selecting a feature and randomly selecting a feature value of that feature in an embodiment of this application;
FIG. 9 is a schematic diagram of the information exchange between the Host side and the Guest side provided by an embodiment of this application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of this application clearer, this application is described below with reference to the accompanying drawings. The described embodiments are not to be regarded as limiting this application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
In the following description, "some embodiments" describes subsets of all possible embodiments; "some embodiments" may be the same or different subsets of all possible embodiments and may be combined with one another where no conflict arises.
In the following description, the terms "first/second/third" merely distinguish similar objects and do not imply a particular ordering; where permitted, the specific order or sequence may be interchanged so that the embodiments described here can be implemented in orders other than those illustrated or described.
It should be noted that the embodiments of this application involve various data, such as feature data and object identifiers; when the embodiments are applied to actual products or technologies, permission or consent is required, and the collection, use, and processing of the relevant data comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art to which this application belongs. The terms used herein are merely for describing the embodiments of this application and are not intended to limit this application.
Before the embodiments of this application are described, the nouns and terms involved in the embodiments are explained; they are to be interpreted as follows.
1) Federated Learning (FL): a form of distributed artificial intelligence suitable for training machine learning models; it decentralizes the training process so that user privacy can be maintained without sending data to a centralized server, and distributing training across many devices also improves efficiency.
2) Vertical Federated Learning: machine learning in which, when two data sets overlap heavily in users but little in user features, the data sets are split vertically (along the feature dimension) and training is performed on the portion of data where the users are the same but the user features are not entirely the same.
3) Binning: splitting a continuous range of values into several segments, each treated as one category; the process of converting continuous values into discrete values is commonly called binning.
4) Equal-frequency binning: bin boundaries are chosen so that each bin contains roughly the same number of elements; for example, with M=50, each bin contains about 2% of the elements.
5) Optimal Binning (OB): a basic assumption of discretizing continuous features is that different value ranges contribute differently to the result; optimal binning determines the number of bins and the bin boundaries so that the total contribution of the bins to the result is maximized.
6) Host: the data source party without labels, holding feature dimensions Xi; the Host corresponds to the feature-side device in the embodiments of this application.
7) Guest: the data source party providing the labels, which mark whether a sample is a positive or a negative example, taking the value 0 or 1; the Guest corresponds to the label-side device in the embodiments of this application.
8) Information Value (IV): in binary classification problems in machine learning, the IV value is mainly used to encode input variables and evaluate their predictive power; the magnitude of a feature variable's IV value indicates the strength of that variable's predictive power.
To better understand the feature binning method provided by the embodiments of this application, the feature binning method in the related art and its shortcomings are first described.
In the related-art feature binning method, the Guest's label information must be homomorphically encrypted and sent to the Host, so the label information is at risk of being cracked by the Host; furthermore, the label distribution information must be aggregated at the Guest, where the Weight of Evidence (WOE) and IV are computed, which may carry a credential-stuffing risk.
On this basis, the embodiments of this application provide a feature binning method, apparatus, device, computer-readable storage medium, and computer program product that can solve the problems of label leakage and credential stuffing. Exemplary applications of the feature binning device provided by the embodiments of this application are described below; the device may be implemented as various types of user terminals such as notebook computers, tablet computers, desktop computers, set-top boxes, and mobile devices (e.g., mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable game devices), or as a server. An exemplary application in which the device is implemented as a feature-side device is described below.
Referring to FIG. 1, which is a schematic diagram of the network architecture of the data processing system 100 provided by an embodiment of this application, the architecture includes at least a label-side device 200 (corresponding to the Guest in other embodiments), a network 300, a feature-side device 400 (corresponding to the Host in other embodiments), a database 500-1 of the label-side device, and a database 500-2 of the feature-side device. To support an exemplary application, the label-side device 200 and the feature-side device 400 may be the participants jointly training a machine learning model in vertical federated learning. The feature-side device 400 may be a client, for example a participant device storing user feature data such as a bank or a hospital; the client may be a notebook computer, tablet computer, desktop computer, dedicated training device, or another device with a model training function, and the feature-side device 400 may also be a server. The label-side device 200 may be a device in a government-affairs system and may be a client or a server. The label-side device 200 is connected to the feature-side device 400 through the network 300, which may be a wide area network, a local area network, or a combination of the two, using wireless or wired links for data transmission.
When the feature-side device 400 needs to bin the data to be processed (i.e., feature data) stored by itself, it may first perform initial equal-frequency binning based on the feature values, and send multiple object identifiers from each initial bin to the label-side device 200 as a target identifier set; the label-side device returns the label distribution information corresponding to the target identifier set. After obtaining the label distribution information of all initial bins, the feature-side device 400 optimizes the multiple initial bins according to a preset binning strategy to obtain the final binning result. In some embodiments, after the final binning result is obtained, the feature-side device performs feature screening to obtain the target feature data used for modeling and model training, and performs modeling and model training based on the target feature data. The data processing system may contain one or more feature-side devices and, likewise, one or more label-side devices.
Referring to FIG. 2A, which is a schematic structural diagram of the feature-side device 400 provided by an embodiment of this application, the feature-side device 400 shown in FIG. 2A includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The components of the feature-side device 400 are coupled together by a bus system 440. It can be understood that the bus system 440 implements the connections and communication among these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are all labeled as the bus system 440 in FIG. 2A.
The processor 410 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 430 includes one or more output devices 431 that enable the presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch-screen display, camera, and other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination of the two. Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, and the like. The memory 450 optionally includes one or more storage devices physically remote from the processor 410.
The memory 450 includes volatile memory or non-volatile memory, or both. The non-volatile memory may be a Read Only Memory (ROM) and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in the embodiments of this application is intended to include any suitable type of memory.
In some embodiments, the memory 450 can store data to support various operations; examples of the data include programs, modules, and data structures, or subsets or supersets thereof, exemplified below.
The operating system 451 includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks.
The network communication module 452 is used to reach other computing devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like.
The presentation module 453 is used to enable the presentation of information via one or more output devices 431 (e.g., display screens, speakers) associated with the user interface 430 (e.g., user interfaces for operating peripherals and displaying content and information).
The input processing module 454 is used to detect one or more user inputs or interactions from one of the one or more input devices 432 and translate the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of this application may be implemented in software; FIG. 2A shows the data processing apparatus 455 stored in the memory 450, which may be software in the form of a program, a plug-in, or the like, including the following software modules: a first acquisition module 4551, a binning module 4552, a first sending module 4553, a first receiving module 4554, and a merging module 4555. These modules are logical, so they may be arbitrarily combined or further split according to the functions implemented. The functions of the modules are described below.
To better understand the method provided by the embodiments of this application, artificial intelligence, its branches, and the application fields involved in the method are first described.
Artificial Intelligence (AI) is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning. The solutions provided by the embodiments of this application mainly involve the machine learning technology of artificial intelligence, which is described below.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and more. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; its applications span all fields of AI. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
In the embodiments of this application, the feature-side device and the label-side device may be node devices in a blockchain system; correspondingly, the data to be processed and the label information corresponding to the object identifiers may be obtained from nodes on the blockchain.
The federated-learning-based data processing method provided by the embodiments of this application is described below in combination with exemplary applications and implementations of the terminal provided by the embodiments of this application.
Referring to FIG. 3, which is a schematic flowchart of an implementation of the federated-learning-based data processing method provided by an embodiment of this application, the method is applied to a feature-side device and is described with reference to the steps shown in FIG. 3.
Step S101: acquire data to be processed.
The data to be processed includes multiple object identifiers and the feature values corresponding to the respective object identifiers; one object identifier and its corresponding feature value constitute one piece of feature data, and the data to be processed includes multiple pieces of feature data.
In addition, one object identifier may correspond to one feature dimension (e.g., age) or to multiple feature dimensions (e.g., age, gender, region, monthly income), and thus to one feature value or to multiple feature values. In the embodiments of this application it is assumed that an object identifier corresponds to one feature dimension; when an object identifier corresponds to multiple feature dimensions, the method provided by the embodiments of this application can be used to compute the final binning result of the data to be processed under each feature dimension separately.
Step S102: perform initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins.
When implementing step S102, an unsupervised binning method may be used for the initial binning, for example equal-frequency binning or equal-width binning, to obtain the preset number of bins.
Taking equal-width binning as an example, the preset number is also the preset number of bins, assumed to be M; the bin width is determined from the maximum and minimum feature values among the feature values corresponding to the object identifiers and the preset number, the bin ranges are determined from the minimum feature value and the bin width, and the bin to which each piece of feature data belongs is then determined from the bin ranges and the feature values corresponding to the object identifiers.
Taking equal-frequency binning as an example, the number of object identifiers in the data to be processed and the preset number are first obtained, and the number Q of object identifiers per bin is determined from them; the pieces of feature data in the data to be processed are then sorted by feature value, and from front to back, every Q pieces of feature data are assigned to the corresponding bin in turn.
Step S103: determine multiple target identifier sets from the respective bins, and send each target identifier set in turn to the label-side device.
Each target identifier set includes at least three object identifiers. In implementation, the feature values corresponding to the object identifiers within one target identifier set are generally the same; when three or more identical feature values can no longer be obtained, the feature values corresponding to the object identifiers in a target identifier set may also differ, but the object identifiers in one target identifier set always belong to the same bin.
When implementing step S103, the feature data in each bin may be sorted by feature value, for example in ascending order with pieces of equal value arranged together; a feature value is then drawn at random from those present in the bin, the set of object identifiers corresponding to that feature value is determined, and R object identifiers are drawn at random from that set; these R object identifiers are determined as one target identifier set.
For example, suppose the feature dimension is age and an initial bin contains feature data for ages 18 to 25. First, the feature values in the bin are determined to be {18,19,20,21,22,23,24,25}; a feature value, say 20, is drawn at random, and the set of object identifiers whose age is 20 is determined. Assuming that set contains 100 object identifiers, 10 of them are drawn at random each time and determined as a target identifier set, which is sent to the label-side device to obtain the corresponding label distribution information, until the label distribution information of all target identifier sets drawn from that feature value's identifier set has been obtained; another feature value is then drawn at random and the process is repeated until all feature values under the feature dimension have been traversed.
In the embodiments of this application, multiple target identifier sets can be obtained from one bin; the target identifier sets of one bin can be processed in sequence, while the target identifier sets of different bins can be processed in parallel.
Step S104: receive the set label distribution information corresponding to each target identifier set sent by the label-side device, and determine the bin label distribution information corresponding to each bin based on the set label distribution information.
The set label distribution information characterizes the numbers of positive and negative samples in a target identifier set; for example, set label distribution information of (4,6) means that, of a target identifier set containing 10 object identifiers, 4 are positive samples and 6 are negative samples, but the feature-side device cannot know which of the 10 object identifiers are positive and which are negative. After receiving each piece of set label distribution information sent by the label-side device, the feature-side device first judges whether it is invalid information; if so, the set label distribution information is ignored, and if not, the bin label distribution information of the corresponding bin is updated based on it.
In implementation, the initial value of the bin label distribution information is (0,0), meaning the initial numbers of positive and negative samples are both 0. When the first set label distribution information, say (4,6), is received, the bin label distribution information is updated to (4,6); when the second, say (2,8), is received, it is updated to (6,14), and so on, until all set label distribution information of the bin has been obtained and the bin label distribution information of the bin is derived.
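For illustration, this incremental update amounts to a simple per-bin accumulator; a minimal sketch (illustrative only, with hypothetical names) is:

    class BinLabelDistribution:
        """Running (positives, negatives) for one bin, starting from (0, 0)."""
        def __init__(self):
            self.positives = 0
            self.negatives = 0

        def update(self, set_distribution):
            if set_distribution == (0, 0):   # preset invalid information
                return                       # ignored; ids remain unprocessed
            pos, neg = set_distribution
            self.positives += pos
            self.negatives += neg

    acc = BinLabelDistribution()
    acc.update((4, 6))   # bin distribution becomes (4, 6)
    acc.update((2, 8))   # bin distribution becomes (6, 14)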
Step S105: merge the bins by using the preset binning strategy and the bin label distribution information, to obtain the final binning result.
The binning strategy may be an IV-value-based binning strategy or a chi-square binning strategy. When implementing step S105, adjacent bins may be pre-merged and the label attribute information of each merged bin determined; merging is then performed when the label attribute information shows that the merging conditions are satisfied. When a bin can be merged either with its preceding neighbor or with its following neighbor, the IV values of the two merged candidates can be compared to determine the target bin. Merging is repeated in this way until a preset convergence condition is reached, yielding the final binning result.
In the federated-learning-based data processing method provided by the embodiments of this application, after data to be processed including multiple object identifiers and the corresponding feature values is acquired, initial binning is performed on it based on those feature values to obtain a preset number of bins; the multiple target identifier sets contained in each bin are then sent in turn to the label-side device, the set label distribution information corresponding to each target identifier set is received, and the bin label distribution information corresponding to each bin is determined from it; the bins are merged using the preset binning strategy and the bin label distribution information to obtain the final binning result. In this data processing procedure, the feature-side device sends target identifier sets composed of its own object identifiers to the label-side device, and the label-side device returns the label distribution information corresponding to each target identifier set instead of sending encrypted label information to the feature-side device, which avoids the information leakage that decryption of the label information could cause and thereby improves data security.
In some embodiments, step S102 shown in FIG. 3, "performing initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers to obtain a preset number of bins", can be implemented through the following steps (a sketch follows step S1024):
Step S1021: determine a maximum feature value and a minimum feature value based on the feature values corresponding to the object identifiers.
In this embodiment, assume binning is performed on the age feature dimension, with a maximum feature value of 85 and a minimum feature value of 10.
Step S1022: determine (M-1) feature quantile values based on the maximum feature value, the minimum feature value, and the preset number M.
When implementing step S1022, a feature interval width can first be determined from the maximum feature value, the minimum feature value, and the preset number M, as (maximum feature value - minimum feature value) / preset number; the (M-1) feature quantile values are then determined from the minimum feature value, the ith feature quantile value being the minimum feature value + i * interval width. For example, with a preset number of 15, the M-1 feature quantile values are 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, and 80.
Step S1023: determine M feature intervals based on the minimum feature value, the maximum feature value, and the (M-1) feature quantile values.
The 1st feature interval is [minimum feature value, 1st feature quantile value); the 2nd is [1st feature quantile value, 2nd feature quantile value); and so on, the Mth feature interval being [(M-1)th feature quantile value, maximum feature value].
Step S1024: perform initial binning on the data to be processed based on the feature values in the data to be processed and the M feature intervals, to obtain M bins.
The ith bin includes multiple pieces of feature data whose feature values fall within the ith feature interval, i=1,2,...,M.
Through the above steps S1021 to S1024, the equal-width initial binning process is completed, yielding the preset number of bins, each containing multiple pieces of feature data.
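For illustration, steps S1021 to S1024 correspond to the following equal-width binning sketch (numpy-based; an illustration under this example's values, not the embodiment's implementation):

    import numpy as np

    def equal_width_binning(values: np.ndarray, m: int) -> np.ndarray:
        """m equal-width intervals between the minimum and maximum feature
        values; returns a bin index in [0, m-1]."""
        vmin, vmax = values.min(), values.max()
        step = (vmax - vmin) / m                 # feature interval width
        edges = vmin + step * np.arange(1, m)    # the (m-1) feature quantile values
        return np.searchsorted(edges, values, side="right")

    ages = np.array([10, 18, 33, 47, 62, 85])
    print(equal_width_binning(ages, m=15))       # bin ids in 0..14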
In some embodiments, the initial binning can instead be performed by equal-frequency binning, in which case the bins contain the same amount of data; for example, if 10 bins are obtained, each bin contains 10% of the data.
In some embodiments, after initial binning of the data to be processed through steps S1021 to S1024, the initially binned data can also be partitioned through the following steps to improve processing efficiency:
Step S1025: acquire a preset partition rule and a number of partitions N.
In the embodiments of this application, the preset partition rule may be a hash partition rule or another preset partition rule.
Step S1026: determine the partition identifier of each piece of feature data in the ith bin based on the object identifiers in the ith bin and the partition rule.
Here i=1,2,...,M. In actual implementation, a preset hash of the object identifier can be computed, and the hash value mod N gives the partition identifier, so the partition identifiers are 0,1,...,N-1.
Step S1027: assign each piece of feature data in the ith bin to the ith bin of the partition corresponding to that piece's partition identifier.
Each partition contains M bins, and the feature data of the ith bin is distributed among the ith bins of the partitions. For example, suppose the 3rd bin contains 1000 pieces of feature data and there are 4 preset partitions; these 1000 pieces of feature data are distributed among the 3rd bins of partition 0, partition 1, partition 2, and partition 3, so that the data of the 3rd bin is spread evenly across the partitions, improving running efficiency.
In some embodiments, corresponding to the partitioning of the data to be processed at the feature-side device, the label-side device can partition its label data through the following steps:
Step S201: the label-side device acquires the preset partition rule and the number of partitions N.
The partition rule and the number of partitions N used by the label-side device are the same as those used by the feature-side device.
Step S202: the label-side device determines the partition identifier of each piece of label data based on the object identifiers and the partition rule.
The label data includes an object identifier and the label information corresponding to it; the label information characterizes whether the object identifier is a positive or a negative sample and can be represented by 0 and 1, for example 0 for a negative sample and 1 for a positive sample. Positive samples may also be called good or normal samples, and negative samples bad or default samples.
When implementing step S202, a preset hash of the object identifier can be computed, and the hash value mod N gives the partition identifier.
In some embodiments, besides label data, the label-side device may also hold government-affairs data packages such as identity characteristics, spending power, credit history, and qualification honors, where the identity characteristics include gender, age, ID-card address, and educational background; spending power includes the social security and housing fund contribution base; and credit history includes the number of authorized applications, the credit service authorizations within an application (APP), the number of consecutive credit-service performances, the number of credit-service defaults, and the like.
Step S203: the label-side device adds each piece of label data to the corresponding partition based on the partition identifier.
Through the above steps S201 to S203, the label data can be partitioned in the same way as the feature data based on the object identifiers at the label-side device, which guarantees that corresponding partitions of the feature-side device and the label-side device hold the same id sets.
In some embodiments, "determining multiple target identifier sets from the respective bins" in step S103 shown in FIG. 3 can be implemented through the following steps:
Step S1031: determine the number S of feature values in the ith bin.
Here i=1,2,...,M, M is the preset number, and S is a positive integer. When implementing step S1031, the feature values in the ith bin are first counted to determine their number; different bins may have the same or different numbers of feature values.
Step S1032: randomly determine R unprocessed object identifiers corresponding to the jth feature value.
R is a positive integer greater than 2 and j=1,2,...,S. When implementing step S1032, one feature value is randomly determined from the S feature values each time (the jth draw determining the jth feature value), the set of object identifiers corresponding to the jth feature value is determined, and R unprocessed object identifiers are then randomly determined from it.
In the embodiments of this application, all object identifiers are initially regarded as unprocessed.
Step S1033: determine the R unprocessed object identifiers as one target identifier set.
A target identifier set includes R object identifiers; R is required to be greater than 2 because, if R were 2, the label information of the individual object identifiers could be deduced from the label distribution information returned by the label-side device, causing information leakage.
In the embodiments of this application, the above process is repeated for the feature data in each bin, thereby determining the multiple target identifier sets of each bin.
In some embodiments, "determining the bin label distribution information corresponding to each bin based on the set label distribution information" in step S104 can be implemented through steps S1041 to S1046 shown in FIG. 4, described below with reference to FIG. 4.
Step S1041: judge whether the pth set label distribution information of the ith bin is invalid information.
The invalid information may be preset, for example (0,0); in some embodiments it may be another preset value, for example the one-dimensional value 0. When the pth set label distribution information is invalid, the flow proceeds to step S1045; when it is not invalid, it is valid information and the flow proceeds to step S1042. Here i=1,2,...,M, M being the preset number, and p=1,2,...,W, W being the total number of target identifier sets in the ith bin.
Step S1042: when the set label distribution information is not the preset invalid information, acquire the number of positive samples and the number of negative samples in the pth set label distribution information.
When the set label distribution information is not the preset invalid information, it has the form (X,Y), where X is the number of positive samples and Y the number of negative samples; acquiring the values of X and Y therefore yields the positive and negative sample counts of the pth set label distribution information.
Step S1043: determine the object identifiers in the target identifier set corresponding to the set label distribution information as processed object identifiers.
In implementation, a flag indicating processed or unprocessed can be set for each object identifier; the flag may be a binary value, with 0 meaning unprocessed and 1 meaning processed, and all object identifiers default to unprocessed initially.
Step S1044: update the bin label distribution information corresponding to the ith bin based on the numbers of positive and negative samples in the set label distribution information, until no unprocessed object identifier remains in the ith bin.
Step S1045: when the pth set label distribution information is the preset invalid information, delete the pth set label distribution information.
When the pth set label distribution information is invalid, it is ignored and deleted.
Step S1046: determine the object identifiers in the target identifier set corresponding to the pth set label distribution information as unprocessed object identifiers.
Since that set label distribution information is invalid, the label distribution of the corresponding target identifier set is regarded as not yet obtained, so the object identifiers in that target identifier set remain unprocessed.
In some embodiments, step S105 shown in FIG. 3, "merging the bins by using the preset binning strategy and the bin label distribution information to obtain the final binning result", can be implemented through the following steps:
Step S1051: pre-merge each bin with its adjacent bins to obtain multiple candidate bins.
Assuming there are M bins, the ith bin and the (i+1)th bin are pre-merged, yielding M-1 candidate bins, i=1,2,...,M. In the embodiments of this application, merging two adjacent bins means merging their feature intervals and their label distributions respectively; for example, if the first bin has feature interval [10,15) and label distribution information (10,100), and the second bin has feature interval [15,20) and label distribution information (20,100), merging them yields the first candidate bin, whose feature interval is [10,20) and whose label distribution information is (30,200).
Step S1052: determine the candidate label attribute information of each candidate bin.
The label attribute information includes the number of positive samples, the number of negative samples, and the percentage of positive samples. Continuing the example above, the first candidate bin has 30 positive samples, 200 negative samples, and a positive-sample percentage of about 13%.
Step S1053: determine, based on the candidate label attribute information, whether the candidate bin satisfies the merging conditions.
The merging conditions may include whether the number of positive samples is greater than a minimum positive-sample threshold, whether the number of negative samples is greater than a minimum negative-sample threshold, and whether the positive-sample percentage is greater than a minimum percentage threshold and smaller than a maximum percentage threshold.
When the candidate bin is determined, based on the candidate label attribute information, to satisfy the merging conditions, the flow proceeds to step S1054; when it does not, the flow proceeds to step S1056.
Step S1054: when the candidate bin is determined, based on the candidate label attribute information, to satisfy the merging conditions, determine the information value of each candidate bin.
When determining the IV value of each candidate bin, the Weight of Evidence of the qth candidate bin is first computed as WoE_q = ln(negative-sample percentage / positive-sample percentage), and the IV value of the qth candidate bin is then computed as IV_q = (negative-sample percentage - positive-sample percentage) * WoE_q.
Step S1055: determine the target bins based on the information values of the candidate bins.
When implementing step S1055, it can first be determined whether the candidate bins include candidate bin pairs, a candidate bin pair being two candidate bins that include the same bin; for example, if the first candidate bin is the merger of the first and second bins and the second candidate bin is the merger of the second and third bins, the first and second candidate bins form a candidate bin pair. When candidate bin pairs exist, the member of each pair with the higher information value is taken as a target bin; the candidate bin pairs are compared in turn to determine the corresponding target bins. When no candidate bin pair exists, each candidate bin satisfying the merging conditions is a target bin.
Step S1056: cancel the pre-merge corresponding to the candidate bin.
If a candidate bin does not satisfy the merging conditions, its pre-merge is canceled.
Step S1057: merge each target bin with its adjacent bins again, until the optimization goal is reached, to obtain the final bins.
After the first round of merging, each target bin includes one or two initial bins, and the IV value of each target bin is obtained; a second round of merging is then performed, after which each target bin includes one, two, three, or four initial bins. The IV values of the target bins are obtained again, and whether the optimization goal has been reached is determined from them; when it has, the final binning result is obtained.
Based on the foregoing embodiments, an embodiment of this application further provides a federated-learning-based data processing method applied to the network architecture shown in FIG. 1. FIG. 5 is a schematic flowchart of a further implementation of the federated-learning-based data processing method provided by an embodiment of this application; as shown in FIG. 5, the flow includes:
Step S501: the feature-side device acquires data to be processed.
The data to be processed includes multiple object identifiers and the feature values corresponding to the respective object identifiers.
Step S502: the feature-side device performs initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins.
Step S503: the feature-side device determines multiple target identifier sets from the respective bins and sends each target identifier set in turn to the label-side device.
Each target identifier set includes at least three object identifiers. In implementation, the feature-side device sends them to the label-side device through the message queue Pulsar.
Step S504: the label-side device receives the target identifier set sent by the feature-side device and acquires the multiple object identifiers in it.
Step S505: the label-side device acquires the label information corresponding to each object identifier.
The label-side device stores a mapping from object identifiers to label information, which guarantees that the label information corresponding to any object identifier can be obtained with O(1) time complexity.
Step S506: the label-side device determines the set label distribution information of the target identifier set based on the label information.
In implementation, the label-side device counts the numbers of positive and negative samples corresponding to the target identifier set and determines the set label distribution information from them. In some embodiments, the numbers of positive and negative samples are verified: when they pass the verification, they are determined as the set label distribution information; when they fail the verification, the preset invalid information is determined as the set label distribution information.
Step S507: the label-side device sends the set label distribution information to the feature-side device.
Step S508: the feature-side device receives the set label distribution information corresponding to each target identifier set sent by the label-side device, and determines the bin label distribution information corresponding to each bin based on it.
Step S509: the feature-side device merges the bins by using the preset binning strategy and the bin label distribution information, to obtain the final binning result.
It should be noted that, for concepts and steps identical to those of other embodiments, reference is made to the description of those embodiments.
Step S510: the feature-side device obtains the final bins of each feature dimension.
When the feature data at the feature-side device includes multiple feature dimensions, optimal binning can be performed for each feature dimension through the above steps, yielding the final bins under each feature dimension.
Step S511: the feature-side device determines the information value of each final bin of each feature dimension and the total information value corresponding to each feature dimension.
The IV value of each final bin under a feature dimension has already been determined while deriving the final binning result; the total information value of a feature dimension is the sum of the IV values of its final bins.
Step S512: the feature-side device performs feature selection based on the information values of the final bins and the total information values, to obtain multiple target final bins.
When implementing step S512, it can first be determined whether the total information value of each feature dimension is greater than a feature information threshold, the feature dimensions above that threshold being taken as target feature dimensions; it is then determined whether the information value of each final bin under a target feature dimension is greater than a bin information threshold, the final bins above that threshold being taken as the target final bins.
Step S513: the feature-side device acquires the label distribution information of each target final bin and performs modeling based on the feature data and label distribution information in each target final bin.
When implementing step S513, a vertical logistic regression federated learning method can be used for modeling. In some embodiments, after the modeling is completed, model training is further performed using the feature data and label distribution information in the target final bins, finally obtaining a trained neural network model.
In the federated-learning-based data processing method provided by the embodiments of this application, after acquiring data to be processed including multiple object identifiers and the corresponding feature values, the feature-side device performs initial binning on it to obtain a preset number of bins, sends the multiple target identifier sets contained in each bin in turn to the label-side device, receives the set label distribution information corresponding to each target identifier set, determines the bin label distribution information corresponding to each bin from it, and merges the bins using the preset binning strategy and the bin label distribution information to obtain the final binning result. In this process, the feature-side device sends target identifier sets composed of its own object identifiers to the label-side device, and the label-side device returns the corresponding label distribution information instead of sending encrypted label information to the feature-side device, which avoids information leakage caused by decryption of the label information and thereby improves data security; afterwards, once the final bins under each feature dimension have been obtained, feature screening is performed, and the target final bins obtained by the screening are used for logistic regression modeling in vertical federated learning.
In some embodiments, step S506 shown in FIG. 5, "the label-side device determines the set label distribution information of the target identifier set based on the label information", includes:
Step S5061: determine the number of positive samples and the number of negative samples based on the label information.
Step S5062: judge whether the number of positive samples is less than the total number of object identifiers contained in the target identifier set and the number of negative samples is less than that total.
When the number of positive samples is less than the total number of object identifiers and the number of negative samples is also less than that total, the label information corresponding to the target identifier set contains both positive and negative samples; in this case the counts are determined as the label distribution information, from which the feature-side device cannot determine the label information of any individual object identifier, and the flow proceeds to step S5063. When the number of positive samples equals the total number of object identifiers, the labels of all object identifiers in the set are positive; when the number of negative samples equals that total, they are all negative. If the counts were used as the label distribution information in these cases, the feature-side device could deduce the label information of every object identifier upon receiving it, which is not allowed in federated learning, so the flow proceeds to step S5064.
Step S5063: when the number of positive samples is less than the total number of object identifiers contained in the target identifier set and the number of negative samples is less than that total, determine the number of positive samples and the number of negative samples as the set label distribution information.
Step S5064: when the number of positive samples equals the total number of object identifiers contained in the target identifier set, or the number of negative samples equals that total, determine the preset invalid information as the set label distribution information of the target identifier set.
An exemplary application of the embodiments of this application in a practical application scenario is described below.
Generally speaking, in a vertical federated learning scenario, the party holding the labels is the Guest (that is, the label-side device in other embodiments) and the participants without labels are Hosts (that is, the feature-side devices in other embodiments). Because the Guest holds the labels, it has a natural advantage in optimally binning its own data; the Host, having no labels, must rely on the Guest's labels to achieve optimal binning.
For example, suppose X1 on the Host side represents age; people in different age groups have different characteristics. Age can be divided into three groups (0-30, 31-50, >50) or, as shown in Table 1, into five groups (0-20, 21-30, 31-45, 46-60, >=61). For a bank, different age groups have different credit scores and different abilities to honor commitments, so segmenting age in the way that most accurately reflects group characteristics is crucial.
Table 1. Binning result and label distribution statistics for the age feature
AGE Bad(Y=1) Good(Y=0) Bad% Good% WOE
0-20 30 100 0.3 0.1 1.098612289
21-30 25 250 0.25 0.25 0
31-45 15 200 0.2 0.2 -0.287682072
46-60 15 200 0.2 0.2 -0.287682072
>61 15 250 0.25 0.25 -0.510825624
total 100 1000 1 1 0
Suppose the Host holds features X={X1,X2,X3,X4}, four feature dimensions in total, and the Guest holds only Label={Y}, where Y is binary. The Host needs the Guest's Y to optimally bin X1, X2, X3, and X4 respectively. The first step of optimal binning is to perform initial binning on the feature (equal-frequency binning, i.e. quantile binning, is commonly used), then to collect the label distribution histogram of each initial bin, and finally to merge some adjacent bins until the optimal binning model converges, obtaining the optimal binning.
FIG. 6 is a schematic diagram of the process of optimally binning a feature provided by an embodiment of this application. As shown in FIG. 6, the Host performs initial binning on the original feature data 601 to obtain 32 initial bins 602, then obtains the label distribution data of the 32 initial bins (that is, the set label distribution information in other embodiments) and merges adjacent bins when they satisfy the merging conditions; 603 in FIG. 6 shows the possible merged bins, and the optimal merging scheme is determined based on the IV value of each merged bin, yielding the optimal binning result 604.
The feature binning method provided by the embodiments of this application is described below with reference to the steps. It can be implemented through the following steps:
Step S701: the Host performs equal-frequency binning on Xi according to the hyperparameter M (the number of initial bins).
Step S701 corresponds to step S102 in other embodiments. Here, equal-frequency binning ensures that roughly the same number of elements is allocated to each bin: if M is 20, each bin contains roughly 5% of the elements, and after binning the continuous feature is converted into a discrete feature whose values range from 0 to M-1.
Step S702: the Guest side and the Host side each perform a hash partition (HashPartition) on the id column.
Step S702 corresponds to steps S1025 and S201 in other embodiments. To guarantee running efficiency, the Guest side and the Host side each hash partition the id column, which both spreads the data evenly across the partitions and ensures that corresponding partitions of the Guest side and the Host side hold the same identifier sets.
In implementation, the ids in each bin can be hash partitioned, which ensures that every hash partition includes all bins.
Step S703: sort each Host-side partition by the value of the feature Xi, and cache the id-to-label mapping on the Guest side.
Here, each hash partition on the Host side can be sorted by the value of Xi in ascending order, with elements of equal value arranged together, as shown at 711 in FIG. 7. Caching the id-to-label mapping inside each partition on the Guest side guarantees that the label corresponding to any id can be obtained with O(1) time complexity.
Step S704: the Host collects the label distribution of each initial bin.
In implementation, the Host generates a random number r within a certain range (for example, 3-8), sequentially takes r ids from the same bin of each hash partition to obtain idSet(r), and sends it to the Guest through the message queue. After receiving idSet(r), the Guest computes the label distribution histogram and verifies its legality, that is, checks whether the label information of any individual id could be inferred back from the result. If it is legal, the Guest sends the label distribution histogram to the Host; otherwise it sends (0,0). After receiving the label distribution histogram, the Host saves the data. This step is repeated until all bins on the Host side have completed preliminary label statistics.
Step S705: the Host aggregates all label distribution histograms to obtain the label distribution histogram information of all bins (corresponding to the bin label distribution information in other embodiments).
Steps S704 and S705 correspond to step S104 in other embodiments. For example, bins1 has 5 initial bins, and after the above five steps the label distribution of each bin can be obtained. Assuming the final distributions are as shown in FIG. 7: the 0th bin contains 10 positive and 100 negative examples; the 1st bin contains 20 positive and 100 negative examples; the 2nd bin contains 10 positive and 200 negative examples; the 3rd bin contains 10 positive and 100 negative examples; and the 4th bin contains 10 positive and 100 negative examples.
Step S706: the Host performs optimal binning on the initial bins of binsi according to the optimal binning strategy.
Step S706 corresponds to step S105 in other embodiments; essentially, two adjacent bins are selected and merged into a new bin each time, and this process is repeated, subject to the constraints being satisfied, until the final gain is maximized. Commonly used optimal binning strategies include IV-based and chi-square-based optimal binning. Continuing the example above, the 0th and 1st bins are merged and the 3rd and 4th bins are merged, finally yielding 3 bins: the 0th with 30 positive and 200 negative examples, the 1st with 10 positive and 200 negative examples, and the 2nd with 20 positive and 200 negative examples.
In the embodiments of this application, the above step S704 can be implemented through the following steps:
Step S7041: randomly select one feature dimension Xi of the Host side.
Here, suppose the Host holds features X={X1,X2,X3,X4}; as shown in FIG. 8, Xi is any one of X1, X2, X3, X4. The Host must ensure that the information Xi 801 is not known to the Guest, and the Guest cannot derive Xi by other means.
Step S7042: randomly select a feature value from Xi.
Suppose Xi contains n discrete feature values {F1, F2, ..., Fn-1}; as shown in FIG. 8, a feature value F(i)(j) 802 is randomly selected and the id(i)(j) set of F(i)(j) is obtained. Likewise, the Host must ensure that F(i)(j) is not known to the Guest, and the Guest cannot derive F(i)(j) by other means.
Step S7043: the Host generates a random number r each time.
Here the random number r is required to be greater than 2, which guarantees that the Host cannot determine sample labels from the label distribution returned by the Guest; r itself must also not be derivable.
Step S7044: the Host randomly selects r unmarked ids id(i)(j)(r) from the id(i)(j) set of F(i)(j) and sends them to the Guest.
Here, as shown in FIG. 9, the Host sends id(i)(j)(r) to the Guest, and it must be guaranteed that, after receiving id(i)(j)(r), the Guest cannot derive Xi or F(i)(j) from it.
Step S7045: after receiving id(i)(j)(r), the Guest uses the label Y it holds to count the label distribution of id(i)(j)(r), {n(i)(j)(1)(r), n(i)(j)(0)(r)}.
Here 1 and 0 represent the two values of the binary label Y. The Guest verifies the label distribution to ensure that it does not leak the label information of any single sample, and the Host cannot derive the label of a single sample by other means.
When verifying the label distribution, it is determined whether at least one of n(i)(j)(1)(r) and n(i)(j)(0)(r) in {n(i)(j)(1)(r), n(i)(j)(0)(r)} is 0; if either is 0, the verification fails and (0,0) is returned to the Host to notify it of the failure. Otherwise the verification succeeds, the flow proceeds to step S7046, and {n(i)(j)(1)(r), n(i)(j)(0)(r)} is sent to the Host.
Step S7046: the Guest returns the label distribution {n(i)(j)(1)(r), n(i)(j)(0)(r)} to the Host, as shown in FIG. 9.
Step S7047: if the verification succeeds, the Host saves {n(i)(j)(1)(r), n(i)(j)(0)(r)} and marks id(i)(j)(r) as processed.
After step S7047 the flow returns to step S7043 until all ids id(i)(j) of F(i)(j) have been counted; it then jumps to step S7042 until all feature values of Xi have been processed, and then to step S7041 until all features X held by the Host have been processed.
In a payment scenario, after sample alignment each side has a sample size of about 170,000, with 25 feature dimensions on the Host side and 29 on the Guest side. As shown in Table 2, the KS value of the baseline LR is 37.98%, and the KS value of LR after optimal binning is 38.19%, a relative improvement of about 0.5%. For large-scale financial scenarios, this improvement can translate into tens of millions in revenue.
Table 2. Comparison of the model KS value of the baseline LR and the binning-based LR
                    Host side   Guest side   Baseline LR   Binning-based LR
Data volume         176892      176892       -             -
Feature dimensions  25          29           -             -
Model KS value      -           -            37.98%        38.19%
In the feature binning method provided by the embodiments of this application, the Host sends the IDSet to the Guest instead of the Guest sending label information to the Host, which avoids the risk of leaking the Guest's label information and improves information security; in addition, the positive/negative sample distribution check on each IDSet guards against credential-stuffing risks.
The following continues the description of the exemplary structure in which the data processing apparatus 455 provided by the embodiments of this application is implemented as software modules. In some embodiments, as shown in FIG. 2A, the software modules of the data processing apparatus 455 stored in the memory 450 may include:
a first acquisition module 4551, configured to acquire data to be processed, the data to be processed including multiple object identifiers and the feature values corresponding to the respective object identifiers; a binning module 4552, configured to perform initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers to obtain a preset number of bins; a first sending module 4553, configured to determine multiple target identifier sets from the bins and send each target identifier set in turn to the label-side device, each target identifier set including at least three object identifiers; a first receiving module 4554, configured to receive the set label distribution information corresponding to each target identifier set sent by the label-side device and determine the bin label distribution information corresponding to each bin based on the set label distribution information; and a merging module 4555, configured to merge the bins by using the preset binning strategy and the bin label distribution information to obtain a final binning result.
In some embodiments, the binning module is further configured to: determine a maximum feature value and a minimum feature value based on the feature values corresponding to the object identifiers; determine (M-1) feature quantile values based on the maximum feature value, the minimum feature value, and the preset number M; determine M feature intervals based on the minimum feature value, the maximum feature value, and the (M-1) feature quantile values; and perform initial binning on the data to be processed based on the feature values in it and the M feature intervals to obtain M bins, where the ith bin includes multiple pieces of feature data whose feature values fall within the ith feature interval, i=1,2,...,M.
In some embodiments, the apparatus further includes: a third acquisition module, configured to acquire a preset partition rule and a number of partitions N; a second determining module, configured to determine, based on the object identifiers in the ith bin and the partition rule, the partition identifier of each piece of feature data in the ith bin; and a first partition module, configured to assign each piece of feature data in the ith bin to the ith bin of the partition corresponding to that piece's partition identifier.
In some embodiments, the first sending module is further configured to: determine the number S of feature values in the ith bin, where i=1,2,...,M, M is the preset number, and S is a positive integer; randomly determine R unprocessed object identifiers corresponding to the jth feature value, R being a positive integer greater than 2, j=1,2,...,S; and determine the R unprocessed object identifiers as one target identifier set.
In some embodiments, the first receiving module is further configured to: when the set label distribution information corresponding to the ith bin is not preset invalid information, acquire the numbers of positive and negative samples in the set label distribution information, where i=1,2,...,M and M is the preset number; determine the object identifiers in the target identifier set corresponding to the set label distribution information as processed object identifiers; and update the bin label distribution information corresponding to the ith bin based on the numbers of positive and negative samples in the set label distribution information, until no unprocessed object identifier remains in the ith bin.
In some embodiments, the apparatus further includes: a deletion module, configured to delete the set label distribution information when it is preset invalid information; and a third determining module, configured to determine the object identifiers in the target identifier set corresponding to the set label distribution information as unprocessed object identifiers.
In some embodiments, the merging module is further configured to: merge each bin with its adjacent bins to obtain multiple candidate bins; determine the candidate label attribute information of each candidate bin, the label attribute information including the number of positive samples, the number of negative samples, and the percentage of positive samples; when it is determined, based on the candidate label attribute information, that the candidate bins satisfy the merging conditions, determine the information value of each candidate bin; determine target bins based on the information values of the candidate bins; and merge each target bin with its adjacent bins again until the optimization goal is reached, obtaining the final bins.
In some embodiments, the apparatus further includes: a fourth acquisition module, configured to obtain the final bins of each feature dimension; a fourth determining module, configured to determine the information value of each final bin of each feature dimension and the total information value corresponding to each feature dimension; a feature selection module, configured to perform feature selection based on the information values of the final bins and the total information values to obtain multiple target final bins; and a fifth acquisition module, configured to obtain the label distribution information of each target final bin and perform modeling based on the feature data and label distribution information in each target final bin.
It should be pointed out here that the above description of the data processing apparatus embodiment is similar to the method description above and has the same beneficial effects as the method embodiments; for technical details not disclosed in the federated-learning-based data processing apparatus embodiment of this application, those skilled in the art may refer to the description of the method embodiments of this application.
FIG. 2B is a schematic structural diagram of the label-side device 200 provided by an embodiment of this application. The label-side device 200 shown in FIG. 2B includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The components of the label-side device 200 are coupled together by a bus system 240. For the functions and structures of the processor 210, the network interface 220, the user interface 230, the bus system 240, and the memory 250, reference may be made to the processor 410, the network interface 420, the user interface 430, the bus system 440, and the memory 450 of the feature-side device 400.
An embodiment of this application further provides a data processing apparatus 255, stored in the memory 250 of the label-side device 200; the software modules in the data processing apparatus 255 may include:
a second receiving module 2551, configured to receive the target identifier set sent by the feature-side device and acquire the multiple object identifiers in the target identifier set; a second acquisition module 2552, configured to acquire the label information corresponding to each object identifier; a first determining module 2553, configured to determine the set label distribution information of the target identifier set based on the label information; and a second sending module 2554, configured to send the set label distribution information to the feature-side device.
In some embodiments, the first determining module 2553 is further configured to: when the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and the number of negative samples is less than that total, determine the number of positive samples and the number of negative samples as the set label distribution information.
In some embodiments, the first determining module 2553 is further configured to: when the number of positive samples equals the total number of object identifiers contained in the target identifier set, or the number of negative samples equals that total, determine the preset invalid information as the set label distribution information of the target identifier set.
In some embodiments, the apparatus further includes: a fifth acquisition module, configured to acquire a preset partition rule and a number of partitions N; a fifth determining module, configured to determine, based on each object identifier and the partition rule, the partition identifier of each piece of label data, the label data including an object identifier and the label information corresponding to the object identifier; and a second partition module, configured to add each piece of label data to the corresponding partition based on the partition identifier.
It should be pointed out here that the above description of the data processing apparatus embodiment is similar to the method description above and has the same beneficial effects as the method embodiments; for technical details not disclosed in the data processing apparatus embodiment of this application, those skilled in the art may refer to the description of the method embodiments of this application.
An embodiment of this application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the federated-learning-based data processing method described above in the embodiments of this application.
An embodiment of this application provides a computer-readable storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to perform the methods provided by the embodiments of this application, for example the methods shown in FIG. 3, FIG. 4, and FIG. 5.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM, or any device including one of or any combination of the foregoing memories.
In some embodiments, the executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files that store one or more modules, subroutines, or code sections).
As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above is merely an embodiment of this application and is not intended to limit the protection scope of this application. Any modification, equivalent replacement, or improvement made within the spirit and scope of this application is included within the protection scope of this application.

Claims (17)

  1. A data processing method, applied to a feature-side device, comprising:
    acquiring data to be processed, the data to be processed comprising multiple object identifiers and feature values corresponding to the respective object identifiers;
    performing initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins;
    determining multiple target identifier sets from the respective bins, and sending each target identifier set to a label-side device;
    receiving set label distribution information corresponding to each target identifier set sent by the label-side device, and determining bin label distribution information corresponding to each bin based on the set label distribution information;
    merging the bins by using a preset binning strategy and the bin label distribution information, to obtain a final binning result.
  2. The method according to claim 1, wherein performing initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers to obtain a preset number of bins comprises:
    determining a maximum feature value and a minimum feature value based on the feature values corresponding to the object identifiers;
    determining (M-1) feature quantile values based on the maximum feature value, the minimum feature value, and a preset number M;
    determining M feature intervals based on the minimum feature value, the maximum feature value, and the (M-1) feature quantile values;
    performing initial binning on the data to be processed based on the feature values in the data to be processed and the M feature intervals, to obtain M bins, wherein the ith bin comprises multiple pieces of feature data whose feature values fall within the ith feature interval, i=1,2,...,M.
  3. The method according to claim 2, further comprising:
    acquiring a preset partition rule and a number of partitions N;
    determining, based on the object identifiers in the ith bin and the partition rule, a partition identifier of each piece of feature data in the ith bin, the partition identifier corresponding to one of the N partitions;
    assigning each piece of feature data in the ith bin to the ith bin of the partition corresponding to that piece's partition identifier.
  4. The method according to claim 1, wherein determining multiple target identifier sets from the respective bins comprises:
    determining the number S of feature values in the ith bin, wherein i=1,2,...,M, M is the preset number, and S is a positive integer;
    randomly determining R unprocessed object identifiers corresponding to the jth feature value, R being a positive integer greater than 2, j=1,2,...,S;
    determining the R unprocessed object identifiers as one target identifier set.
  5. The method according to claim 1, wherein determining the bin label distribution information corresponding to each bin based on the set label distribution information comprises:
    when the set label distribution information corresponding to the ith bin is not preset invalid information, acquiring the number of positive samples and the number of negative samples in the set label distribution information, wherein i=1,2,...,M and M is the preset number;
    determining the object identifiers in the target identifier set corresponding to the set label distribution information as processed object identifiers;
    updating the bin label distribution information corresponding to the ith bin based on the number of positive samples and the number of negative samples in the set label distribution information, until no unprocessed object identifier remains in the ith bin.
  6. The method according to claim 5, further comprising:
    deleting the set label distribution information when the set label distribution information is the preset invalid information;
    determining the object identifiers in the target identifier set corresponding to the set label distribution information as unprocessed object identifiers.
  7. The method according to claim 1, wherein merging the bins by using the preset binning strategy and the bin label distribution information to obtain a final binning result comprises:
    merging each bin with its adjacent bins to obtain multiple candidate bins;
    determining candidate label attribute information of each candidate bin, the candidate label attribute information comprising the number of positive samples, the number of negative samples, and the percentage of positive samples;
    when it is determined, based on the candidate label attribute information, that the candidate bins satisfy the merging conditions, determining the information value of each candidate bin;
    determining target bins based on the information values of the candidate bins;
    merging each target bin with its adjacent bins again until the optimization goal is reached, to obtain multiple final bins.
  8. The method according to any one of claims 1 to 7, further comprising:
    obtaining the final bins of each feature dimension;
    determining the information value of each final bin of each feature dimension and the total information value corresponding to each feature dimension;
    performing feature selection based on the information values of the final bins and the total information values, to obtain multiple target final bins;
    acquiring label distribution information of each target final bin, and performing modeling based on the feature data and label distribution information in each target final bin.
  9. A data processing method, applied to a label-side device, comprising:
    receiving a target identifier set sent by a feature-side device, and acquiring multiple object identifiers in the target identifier set;
    acquiring label information corresponding to each object identifier;
    determining set label distribution information of the target identifier set based on the label information;
    sending the set label distribution information to the feature-side device.
  10. The method according to claim 9, wherein determining the set label distribution information of the target identifier set based on the label information comprises:
    determining a number of positive samples and a number of negative samples based on the label information;
    when the number of positive samples is less than the total number of object identifiers contained in the target identifier set, and the number of negative samples is less than that total, determining the number of positive samples and the number of negative samples as the set label distribution information.
  11. The method according to claim 10, wherein determining the set label distribution information of the target identifier set based on the label information comprises:
    when the number of positive samples equals the total number of object identifiers contained in the target identifier set, or the number of negative samples equals that total, determining preset invalid information as the set label distribution information of the target identifier set.
  12. The method according to any one of claims 9 to 11, further comprising:
    acquiring a preset partition rule and a number of partitions N;
    determining, based on each object identifier and the partition rule, a partition identifier of each piece of label data, the label data comprising an object identifier and the label information corresponding to the object identifier, the partition identifier corresponding to one of the N partitions;
    adding each piece of label data to the corresponding partition based on the partition identifier.
  13. A data processing apparatus, comprising:
    a first acquisition module, configured to acquire data to be processed, the data to be processed comprising multiple object identifiers and feature values corresponding to the respective object identifiers;
    a binning module, configured to perform initial binning on the data to be processed based on the feature values corresponding to the respective object identifiers, to obtain a preset number of bins;
    a first sending module, configured to determine multiple target identifier sets from the respective bins and send each target identifier set to a label-side device;
    a first receiving module, configured to receive set label distribution information corresponding to each target identifier set sent by the label-side device, and determine bin label distribution information corresponding to each bin based on the set label distribution information;
    a merging module, configured to merge the bins by using a preset binning strategy and the bin label distribution information, to obtain a final binning result.
  14. A data processing apparatus, comprising:
    a second receiving module, configured to receive a target identifier set sent by a feature-side device and acquire multiple object identifiers in the target identifier set;
    a second acquisition module, configured to acquire label information corresponding to each object identifier;
    a first determining module, configured to determine set label distribution information of the target identifier set based on the label information;
    a second sending module, configured to send the set label distribution information to the feature-side device.
  15. A data processing device, comprising:
    a memory, configured to store executable instructions;
    a processor, configured to implement the method according to any one of claims 1 to 8 or any one of claims 9 to 12 when executing the executable instructions stored in the memory.
  16. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 8 or any one of claims 9 to 12.
  17. A computer program product or computer program, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium;
    when a processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, the method according to any one of claims 1 to 8 or any one of claims 9 to 12 is implemented.
PCT/CN2022/078282 2021-03-10 2022-02-28 数据处理方法、装置、设备、计算机可读存储介质及计算机程序产品 WO2022188648A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22766174.1A EP4216074A4 (en) 2021-03-10 2022-02-28 DATA PROCESSING METHOD AND APPARATUS, DEVICE, COMPUTER-READABLE STORAGE MEDIUM AND COMPUTER PROGRAM PRODUCT
US18/073,333 US20230100679A1 (en) 2021-03-10 2022-12-01 Data processing method, apparatus, and device, computer-readable storage medium, and computer program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110258944.9A CN112632045B (zh) 2021-03-10 2021-03-10 数据处理方法、装置、设备及计算机可读存储介质
CN202110258944.9 2021-03-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/073,333 Continuation US20230100679A1 (en) 2021-03-10 2022-12-01 Data processing method, apparatus, and device, computer-readable storage medium, and computer program product

Publications (1)

Publication Number Publication Date
WO2022188648A1 true WO2022188648A1 (zh) 2022-09-15

Family

ID=75297829

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078282 WO2022188648A1 (zh) 2021-03-10 2022-02-28 数据处理方法、装置、设备、计算机可读存储介质及计算机程序产品

Country Status (4)

Country Link
US (1) US20230100679A1 (zh)
EP (1) EP4216074A4 (zh)
CN (1) CN112632045B (zh)
WO (1) WO2022188648A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632045B (zh) * 2021-03-10 2021-06-04 腾讯科技(深圳)有限公司 数据处理方法、装置、设备及计算机可读存储介质
CN112990487B (zh) * 2021-05-13 2021-08-03 上海冰鉴信息科技有限公司 一种快速卡方分箱的方法及装置
CN113344626A (zh) * 2021-06-03 2021-09-03 上海冰鉴信息科技有限公司 一种基于广告推送的数据特征优化方法及装置
CN113535697B (zh) * 2021-07-07 2024-05-24 广州三叠纪元智能科技有限公司 爬架数据清理方法、爬架控制装置及存储介质
CN113362048B (zh) * 2021-08-11 2021-11-30 腾讯科技(深圳)有限公司 数据标签分布确定方法、装置、计算机设备和存储介质
CN113449048B (zh) * 2021-08-31 2021-11-09 腾讯科技(深圳)有限公司 数据标签分布确定方法、装置、计算机设备和存储介质
CN113722744A (zh) * 2021-09-15 2021-11-30 京东科技信息技术有限公司 用于联邦特征工程的数据处理方法、装置、设备以及介质
CN113923006B (zh) * 2021-09-30 2024-02-02 北京淇瑀信息科技有限公司 设备数据认证方法、装置及电子设备
CN114329127B (zh) * 2021-12-30 2023-06-20 北京瑞莱智慧科技有限公司 特征分箱方法、装置及存储介质
CN114401079B (zh) * 2022-03-25 2022-06-14 腾讯科技(深圳)有限公司 多方联合信息价值计算方法、相关设备及存储介质
CN115983636B (zh) * 2022-12-26 2023-11-17 深圳市中政汇智管理咨询有限公司 风险评估方法、装置、设备及存储介质
CN116628428B (zh) * 2023-07-24 2023-10-31 华能信息技术有限公司 一种数据加工方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200111022A1 (en) * 2018-10-03 2020-04-09 Cerebri AI Inc. Collaborative multi-parties/multi-sources machine learning for affinity assessment, performance scoring, and recommendation making
CN111242244A (zh) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 特征值分箱方法、系统及装置
CN111523679A (zh) * 2020-04-26 2020-08-11 深圳前海微众银行股份有限公司 特征分箱方法、设备及可读存储介质
CN111539535A (zh) * 2020-06-05 2020-08-14 支付宝(杭州)信息技术有限公司 基于隐私保护的联合特征分箱方法及装置
CN112632045A (zh) * 2021-03-10 2021-04-09 腾讯科技(深圳)有限公司 数据处理方法、装置、设备及计算机可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032878B (zh) * 2019-03-04 2021-11-02 创新先进技术有限公司 一种安全的特征工程方法和装置
US20210049473A1 (en) * 2019-08-14 2021-02-18 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Robust Federated Training of Neural Networks
CN111507479B (zh) * 2020-04-15 2021-08-10 深圳前海微众银行股份有限公司 特征分箱方法、装置、设备及计算机可读存储介质
CN111506485B (zh) * 2020-04-15 2021-07-27 深圳前海微众银行股份有限公司 特征分箱方法、装置、设备及计算机可读存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200111022A1 (en) * 2018-10-03 2020-04-09 Cerebri AI Inc. Collaborative multi-parties/multi-sources machine learning for affinity assessment, performance scoring, and recommendation making
CN111242244A (zh) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 特征值分箱方法、系统及装置
CN111523679A (zh) * 2020-04-26 2020-08-11 深圳前海微众银行股份有限公司 特征分箱方法、设备及可读存储介质
CN111539535A (zh) * 2020-06-05 2020-08-14 支付宝(杭州)信息技术有限公司 基于隐私保护的联合特征分箱方法及装置
CN112632045A (zh) * 2021-03-10 2021-04-09 腾讯科技(深圳)有限公司 数据处理方法、装置、设备及计算机可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4216074A4 *

Also Published As

Publication number Publication date
EP4216074A4 (en) 2024-04-24
US20230100679A1 (en) 2023-03-30
CN112632045A (zh) 2021-04-09
EP4216074A1 (en) 2023-07-26
CN112632045B (zh) 2021-06-04

Similar Documents

Publication Publication Date Title
WO2022188648A1 (zh) 数据处理方法、装置、设备、计算机可读存储介质及计算机程序产品
Liu et al. Alleviating the inconsistency problem of applying graph neural network to fraud detection
US10963786B1 (en) Establishing a trained machine learning classifier in a blockchain network
Kadziński et al. Integrated framework for preference modeling and robustness analysis for outranking-based multiple criteria sorting with ELECTRE and PROMETHEE
US11113293B2 (en) Latent network summarization
Kocaguneli et al. Software effort models should be assessed via leave-one-out validation
Maleki et al. A comprehensive literature review of the rank reversal phenomenon in the analytic hierarchy process
Asim et al. Significance of machine learning algorithms in professional blogger's classification
JP6892454B2 (ja) データの秘匿性−実用性間のトレードオフを算出するためのシステムおよび方法
Romsaiyud et al. Automated cyberbullying detection using clustering appearance patterns
Hong et al. Using group genetic algorithm to improve performance of attribute clustering
US20210256115A1 (en) Method and electronic device for generating semantic representation of document to determine data security risk
Zhong et al. Toward automated multiparty privacy conflict detection
Mejía Template iterations with non-definable ccc forcing notions
Clark et al. Global and saturated probabilistic approximations based on generalized maximal consistent blocks
Huynh-Van et al. Classifying the lung images for people infected with COVID-19 based on the extracted feature interval
Liu et al. Multimodal learning based approaches for link prediction in social networks
Vezzetti et al. Similarity measures for face recognition
Garg et al. A comparative study of clustering algorithms using mapreduce in hadoop
CN114357177A (zh) 知识超图的生成方法、装置、终端设备及存储介质
Patil et al. Classification of text documents
Huo et al. Sparse embedding for interpretable hospital admission prediction
Clark et al. Complexity of rule sets induced from data with many lost values and “do not care” conditions
Chen Classification algorithm on gene expression profiles of tumor using neighborhood rough set and support vector machine
Mao et al. [Retracted] Research on the Popularity Prediction of Multimedia Network Information Based on Fast K Neighbor Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22766174

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022766174

Country of ref document: EP

Effective date: 20230419

NENP Non-entry into the national phase

Ref country code: DE