CN116502255A - Feature extraction method and device based on secret sharing - Google Patents

Feature extraction method and device based on secret sharing

Info

Publication number
CN116502255A
CN116502255A (application CN202310791344.8A)
Authority
CN
China
Prior art keywords: sample, data, target, segmentation, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310791344.8A
Other languages
Chinese (zh)
Other versions
CN116502255B (en)
Inventor
周凯明
巫锡斌
陈超超
郑小林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Jinzhita Technology Co ltd
Original Assignee
Hangzhou Jinzhita Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Jinzhita Technology Co ltd filed Critical Hangzhou Jinzhita Technology Co ltd
Priority to CN202310791344.8A
Publication of CN116502255A
Application granted
Publication of CN116502255B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/602: Providing cryptographic facilities or services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes


Abstract

The embodiments of this specification provide a feature extraction method and device based on secret sharing. The method, applied to a first data end, comprises the following steps: acquiring a plurality of sample data and a plurality of sample features of each sample data; constructing a tree model according to the plurality of sample data and the sample features, wherein the tree model comprises a plurality of splitting nodes, each splitting node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on the segmentation indexes of the sample features, and the segmentation indexes are obtained based on the safety sample labels of the plurality of sample data and the safety sample features on a second data end; determining an importance coefficient for each target sample feature according to the segmentation indexes of the target sample feature at its corresponding splitting nodes; and sending the importance coefficient of each target sample feature to the second data end so that the second data end performs feature extraction according to the importance coefficients. This improves the feature extraction efficiency for the sample data.

Description

Feature extraction method and device based on secret sharing
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to a feature extraction method based on secret sharing.
Background
With the development of computer technology, machine learning, as a technology for analyzing and processing massive data, has been widely applied in various fields. In model training, a large number of data features are typically used as inputs to a machine learning model. However, introducing these features incurs substantial model computation, and some features do not improve the prediction accuracy of the model while also creating a risk of privacy disclosure. How to perform feature extraction while protecting privacy has therefore gradually become an important research topic in privacy-preserving machine learning.
Currently, feature extraction is typically performed manually by experienced personnel, or with common statistical methods such as filtering and Pearson correlation coefficients. Further, to protect privacy, homomorphic encryption algorithms may be used in the feature extraction process. However, these schemes need to encrypt and decrypt massive numbers of data features, resulting in extremely low feature extraction efficiency, so an efficient feature extraction scheme that protects the security of private data is needed.
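As an illustration of the statistical screening mentioned above, the sketch below (an illustrative assumption, not taken from this patent) filters feature columns by the absolute Pearson correlation between each column and the label; the column names, data, and threshold are made up:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_features(columns, labels, threshold=0.5):
    """Keep the names of feature columns whose |correlation| with the
    label reaches the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, labels)) >= threshold]

columns = {
    "age":   [23, 45, 31, 52, 40],   # strongly correlated with the label
    "noise": [2, 1, 3, 1, 3],        # weakly correlated
}
labels = [0, 1, 0, 1, 1]
print(filter_features(columns, labels))  # → ['age']
```

Homomorphic-encryption variants of such screening must run the same arithmetic under encryption for every feature, which is the efficiency bottleneck the patent targets.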
Disclosure of Invention
In view of this, the embodiments of this specification provide a feature extraction method based on secret sharing. One or more embodiments of this specification further relate to a secret-sharing-based feature extraction apparatus, a computing device, a computer-readable storage medium, and a computer program, to address the technical shortcomings of the prior art.
According to a first aspect of embodiments of the present disclosure, there is provided a feature extraction method based on secret sharing, applied to a first data end, the method including:
acquiring a plurality of sample data and a plurality of sample features of each sample data;
constructing a tree model according to the plurality of sample data and sample characteristics, wherein the tree model comprises a plurality of splitting nodes, the splitting nodes split based on target sample characteristics and target segmentation values, the target sample characteristics and the target segmentation values are determined based on segmentation indexes of the sample characteristics, and the segmentation indexes are obtained based on safety sample labels of the plurality of sample data and safety sample characteristics on a second data end;
determining importance coefficients of the target sample characteristics according to the segmentation indexes of the target sample characteristics corresponding to the split nodes;
and sending the importance coefficient of each target sample feature to the second data end so that the second data end performs feature extraction according to the importance coefficients.
According to a second aspect of the embodiments of this specification, there is provided a feature extraction method based on secret sharing, applied to a second data end, the method including:
receiving importance coefficients of target sample features sent by a first data end, wherein the importance coefficients are determined according to the segmentation indexes of the target sample features corresponding to the splitting nodes in a tree model, the segmentation indexes are obtained based on the safety sample labels of a plurality of sample data and the safety sample features on the second data end, the splitting nodes split based on the target sample features and target segmentation values, the target sample features and the target segmentation values are determined based on the segmentation indexes of the sample features, and the tree model is constructed according to the plurality of sample data and the sample features of each sample data;
and performing feature extraction according to the importance coefficient of each target sample feature to obtain a feature extraction result.
According to a third aspect of embodiments of the present disclosure, there is provided a feature extraction device based on secret sharing, applied to a first data end, the device including:
an acquisition module configured to acquire a plurality of sample data and a plurality of sample features of each sample data;
the construction module is configured to construct a tree model according to the plurality of sample data and sample characteristics, wherein the tree model comprises a plurality of splitting nodes, the splitting nodes split based on target sample characteristics and target segmentation values, the target sample characteristics and the target segmentation values are determined based on segmentation indexes of the sample characteristics, and the segmentation indexes are obtained based on security sample labels of the plurality of sample data and security sample characteristics on a second data end;
the determining module is configured to determine importance coefficients of the target sample features according to the segmentation indexes of the target sample features corresponding to the split nodes;
and the sending module is configured to send the importance coefficient of each target sample feature to the second data end so that the second data end performs feature extraction according to the importance coefficients.
According to a fourth aspect of embodiments of the present disclosure, there is provided a secret sharing-based feature extraction apparatus, applied to a second data terminal, the apparatus comprising:
a receiving module, configured to receive importance coefficients of target sample features sent by the first data end, wherein the importance coefficients are determined according to the segmentation indexes of the target sample features corresponding to the splitting nodes in a tree model, the segmentation indexes are obtained based on the safety sample labels of a plurality of sample data and the safety sample features on the second data end, the splitting nodes split based on the target sample features and target segmentation values, the target sample features and the target segmentation values are determined based on the segmentation indexes of the sample features, and the tree model is constructed according to the plurality of sample data and the sample features of each sample data;
and the extraction module is configured to perform feature extraction according to the importance coefficient of each target sample feature to obtain a feature extraction result.
According to a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of the secret sharing-based feature extraction method provided in the first aspect or the second aspect.
According to a sixth aspect of embodiments of the present specification, there is provided a computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the secret sharing based feature extraction method provided in the first or second aspect above.
According to a seventh aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the secret sharing based feature extraction method provided in the first aspect or the second aspect described above.
According to the secret-sharing-based feature extraction method applied to the first data end provided by the embodiments of this specification, a plurality of sample data and a plurality of sample features of each sample data are obtained; a tree model is constructed according to the plurality of sample data and the sample features, wherein the tree model comprises a plurality of splitting nodes, each splitting node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on the segmentation indexes of the sample features, and the segmentation indexes are obtained based on the safety sample labels of the plurality of sample data and the safety sample features on a second data end; importance coefficients of the target sample features are determined according to the segmentation indexes of the target sample features at the corresponding splitting nodes; and the importance coefficient of each target sample feature is sent to the second data end so that the second data end performs feature extraction according to the importance coefficients. When each splitting node splits, the target sample feature and target segmentation value of the node are determined through the safety sample features secret-shared by the second data end, so the data on the first and second data ends need not be encrypted and decrypted with a key; this saves tree-model construction time while protecting the security of private data, and further improves the feature extraction efficiency of the sample data.
Drawings
FIG. 1 is an architecture diagram of a secret sharing based feature extraction system provided by one embodiment of the present description;
FIG. 2 is an architecture diagram of another secret sharing based feature extraction system provided by one embodiment of the present description;
FIG. 3 is a flow chart of a feature extraction method based on secret sharing according to one embodiment of the present disclosure;
FIG. 4 is a flow chart of another feature extraction method based on secret sharing provided by one embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for constructing a tree model in a feature extraction method based on secret sharing according to an embodiment of the present disclosure;
FIG. 6 is a process flow diagram of a feature extraction method based on secret sharing according to one embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a feature extraction device based on secret sharing according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of another feature extraction device based on secret sharing according to one embodiment of the present disclosure;
FIG. 9 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of this specification. However, this specification may be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; therefore, this specification is not limited to the specific implementations disclosed below.
The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in this specification, in one or more embodiments, and in the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of this specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various kinds of information, the information should not be limited by these terms. These terms are only used to distinguish one kind of information from another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second," and similarly "second" may also be referred to as "first." Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."
Furthermore, it should be noted that the user information (including, but not limited to, user device information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this specification are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse.
First, terms related to one or more embodiments of the present specification will be explained.
Random forest: a random forest is composed of multiple decision trees, and the individual decision trees are not correlated with one another. When performing a classification task, each decision tree in the random forest independently judges and classifies the input sample, so each decision tree produces its own classification result; the forest's final prediction is typically the majority vote of these results.
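The voting behavior described above can be sketched in plain Python; the toy "trees" below are hand-written rules standing in for trained decision trees, purely for illustration:

```python
from collections import Counter

# Toy "decision trees": each is just a function mapping a sample to a class.
# In a real random forest these would be trees trained on bootstrap samples.
tree_1 = lambda sample: 1 if sample["age"] > 30 else 0
tree_2 = lambda sample: 1 if sample["income"] > 5000 else 0
tree_3 = lambda sample: 0  # a degenerate tree that always predicts class 0

def forest_predict(trees, sample):
    """Each tree votes independently; the majority class wins."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

sample = {"age": 42, "income": 6000}
print(forest_predict([tree_1, tree_2, tree_3], sample))  # → 1 (two of three trees vote 1)
```

Because each tree votes independently, the forest tolerates individual weak trees, which is why randomly restricting the features each tree sees (as in the scheme below) still yields useful importance estimates.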
With the growth of computing power, machine learning, as a technology for analyzing and processing massive data, has been widely applied in various fields, such as business risk recognition models, business classification models, and business decision models. During training, machine learning models typically use a large number of features as inputs; in a real scenario a model may use up to ten thousand features. The more features a model has, the greater its computation cost, yet only a subset of the features truly affects model accuracy, so feeding in a large number of features incurs heavy computation without improving prediction accuracy.
Two major challenges face the development of machine learning technology. First, data security is hard to guarantee, and the problem of private-data leakage must be solved. Second, due to network security isolation and industry privacy requirements, data barriers exist between different industries and departments, so data form islands that cannot be shared safely, and a machine learning model trained only on each department's own data cannot reach a global optimum. Privacy-preserving machine learning techniques have therefore emerged. In recent years, scholars and researchers have continually applied privacy-protection techniques to machine learning algorithms, such as federated machine learning and secure multi-party computation. Meanwhile, aggregating multi-party data yields a huge number of features and brings greater computation cost, so privacy-preserving feature extraction has become an important research topic in privacy-preserving machine learning.
Currently, feature extraction is typically performed manually by experienced personnel, or with common statistical methods such as filtering, Pearson correlation coefficients, and Information Value (IV). Further, to protect privacy, homomorphic encryption algorithms may be used in the feature extraction process. However, these schemes need to encrypt and decrypt massive numbers of data features, resulting in extremely low feature extraction efficiency, so an efficient feature extraction scheme that protects the security of private data is needed.
In one or more embodiments of this specification, a subset of features may be randomly selected based on the random forest algorithm, and at each node split the optimal feature is chosen for splitting; then, according to the segmentation index of each feature at the node splits, the importance coefficient of the feature is determined, thereby implementing feature extraction. Specifically, a plurality of sample data and a plurality of sample features of each sample data are acquired; a tree model is constructed according to the plurality of sample data and the sample features, wherein the tree model comprises a plurality of splitting nodes, each splitting node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on the segmentation indexes of the sample features, and the segmentation indexes are obtained based on the safety sample labels of the plurality of sample data and the safety sample features on a second data end; importance coefficients of the target sample features are determined according to the segmentation indexes of the target sample features at the corresponding splitting nodes; and the importance coefficient of each target sample feature is sent to the second data end so that the second data end performs feature extraction according to the importance coefficients.
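As a rough illustration of turning segmentation indexes into importance coefficients, the following sketch assumes (an assumption for illustration, not the patent's stated formula) that each splitting node records an impurity-reduction gain for its target feature, and that a feature's importance is its normalized total gain:

```python
from collections import defaultdict

# Each splitting node records (target feature, segmentation index).
# Here the segmentation index is assumed to be the node's impurity gain.
split_nodes = [
    ("age", 0.30),
    ("income", 0.20),
    ("age", 0.10),
    ("city", 0.05),
]

def importance_coefficients(split_nodes):
    """Sum each feature's segmentation indexes over all nodes where it
    was chosen, then normalize so the coefficients sum to 1."""
    totals = defaultdict(float)
    for feature, gain in split_nodes:
        totals[feature] += gain
    norm = sum(totals.values())
    return {feature: gain / norm for feature, gain in totals.items()}

coeffs = importance_coefficients(split_nodes)
# age: 0.40/0.65, income: 0.20/0.65, city: 0.05/0.65
print(max(coeffs, key=coeffs.get))  # → age
```

The second data end would then keep only the features whose coefficient clears some threshold, which is the "feature extraction according to the importance coefficients" step.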
In particular, this scheme implements random-forest feature screening without disclosing each data end's features, labels, or other information. When each splitting node splits, the target sample feature and target segmentation value of the node are determined through the safety sample features secret-shared by the second data end; the data on the first and second data ends need not be encrypted and decrypted with a key, which saves tree-model construction time while protecting the security of private data, and further improves the feature extraction efficiency of the sample data.
In the present specification, a feature extraction method based on secret sharing is provided, and the present specification relates to a feature extraction apparatus based on secret sharing, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to fig. 1, fig. 1 illustrates an architecture diagram of a secret sharing-based feature extraction system provided in one embodiment of the present description, which may include a first data side 100 and a second data side 200;
a first data terminal 100, configured to obtain a plurality of sample data and a plurality of sample features of each sample data; constructing a tree model according to the plurality of sample data and sample characteristics, wherein the tree model comprises a plurality of splitting nodes, the splitting nodes split based on target sample characteristics and target segmentation values, the target sample characteristics and the target segmentation values are determined based on segmentation indexes of the sample characteristics, and the segmentation indexes are obtained based on safety sample labels of the plurality of sample data and safety sample characteristics on a second data end; determining importance coefficients of the target sample characteristics according to the segmentation indexes of the target sample characteristics corresponding to the split nodes; transmitting the importance coefficients of each target sample feature to the second data terminal 200;
and the second data end 200 is configured to perform feature extraction according to the importance coefficient of each target sample feature, to obtain a feature extraction result.
By applying the scheme of this embodiment, when each splitting node splits, the target sample feature and target segmentation value of the node are determined through the safety sample features secret-shared by the second data end, so the data on the first and second data ends need not be encrypted and decrypted with a key; this saves tree-model construction time while protecting the security of private data, and further improves the feature extraction efficiency of the sample data.
Referring to fig. 2, fig. 2 illustrates an architecture diagram of another secret sharing-based feature extraction system provided in an embodiment of this specification. The system may include a first data terminal 100 and a plurality of second data terminals 200, where the first data terminal 100 may be referred to as a master node and each second data terminal 200 as a client node. The second data terminals 200 communicate with one another through the first data terminal 100; in the feature extraction scenario, the first data terminal 100 provides feature extraction services among the second data terminals 200, and each second data terminal 200 may act as a sender or a receiver, with communication realized through the first data terminal 100.
The second data terminal 200 establishes a connection with the first data terminal 100 through a network. The network provides the medium for the communication link between them and may include various connection types, such as wired or wireless communication links, or fiber-optic cables. Data transmitted by the second data terminal 200 may need to be encoded, transcoded, or compressed before being distributed to the first data terminal 100.
The second data terminal 200 may be a browser, an APP (Application), a web application such as an H5 (HTML5, HyperText Markup Language version 5) application, a light application (also called an applet, a lightweight application), a cloud application, or the like. The second data terminal 200 may be developed based on a software development kit (SDK) of the corresponding service provided by the first data terminal 100, such as an SDK for real-time communication (RTC). The second data terminal 200 may be deployed in an electronic device and may need to run depending on the device or some APP in the device. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer, or a personal computer. Various other types of applications are also commonly deployed in electronic devices, such as human-machine dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, and social platform software.
The first data terminal 100 may include a server that provides various services, such as a server providing communication services for multiple clients, a background training server supporting a model used on a client, or a server processing data sent by a client. The first data terminal 100 may be implemented as a distributed server cluster formed by multiple servers or as a single server. The server may also be a server of a distributed system, or a server combined with a blockchain. The server may also be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
Referring to fig. 3, fig. 3 shows a flowchart of a feature extraction method based on secret sharing according to an embodiment of the present disclosure, where the feature extraction method based on secret sharing is applied to a first data end, and specifically includes the following steps:
Step 302: a plurality of sample data and a plurality of sample features for each sample data are acquired.
In one or more embodiments of the present disclosure, the first data end may acquire a plurality of sample data and a plurality of sample features of each sample data, so as to construct a tree model based on the plurality of sample data and the plurality of sample features of each sample data, to determine importance coefficients of each sample feature by using the tree model, and further implement feature extraction according to the importance coefficients.
Specifically, the first data end refers to the data end that holds the sample labels corresponding to the sample data, and may also be referred to as the master node. A second data end refers to any data end other than the first data end; it may also be called a client node, and there may be one or more second data ends. Secret sharing means that the data interaction between the first data end and the second data end is performed in a secret-sharing manner, and the data exchanged between them may be called secret-shared data or safety data.
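Additive secret sharing is one standard way to realize the kind of sharing described here; the two-party setup and the modulus below are illustrative assumptions, not details taken from this patent:

```python
import random

MODULUS = 2 ** 32  # illustrative ring size

def share(value):
    """Split a value into two additive shares with
    share_1 + share_2 = value (mod MODULUS).
    Neither share alone reveals anything about the value."""
    share_1 = random.randrange(MODULUS)
    share_2 = (value - share_1) % MODULUS
    return share_1, share_2

def reconstruct(share_1, share_2):
    return (share_1 + share_2) % MODULUS

s1, s2 = share(12345)
assert reconstruct(s1, s2) == 12345

# Shares are additively homomorphic: each party adds its shares locally,
# and the local sums reconstruct to the sum of the secrets.
a1, a2 = share(100)
b1, b2 = share(200)
print(reconstruct((a1 + b1) % MODULUS, (a2 + b2) % MODULUS))  # → 300
```

This homomorphism is what lets the data ends compute split statistics jointly without ever exchanging keys or decrypting each other's raw features.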
Sample data refers to data carrying sample labels and used for building the tree model. A sample label is the real label of a piece of sample data, and sample features are obtained by encoding the sample data. The sample data may come from different scenarios, such as e-commerce or financial scenarios, and from different tasks, such as emotion analysis, text translation, or entity recognition. The data format of the sample data may be text, image, audio, etc., selected according to the actual situation; the embodiments of this specification impose no limitation on this.
In practical applications, there are various ways to obtain multiple sample data, and the method is specifically selected according to practical situations, which is not limited in any way in the embodiments of the present disclosure. In one possible implementation of the present disclosure, a large amount of sample data carrying sample tags may be read from other data acquisition devices or databases. In another possible implementation manner of the present disclosure, a large amount of sample data carrying a sample tag input by a user may be received.
Further, there are various ways of generating the sample characteristics of the sample data, and the sample characteristics are specifically selected according to the actual situation, which is not limited in any way in the embodiments of the present specification. In One possible implementation of the present specification, the sample features of the sample data may be generated by means of One-Hot encoding (One-Hot). In another possible implementation of the present description, the encoder may be used to generate sample features of the sample data.
It should be noted that the sample features in the sample data set participating in feature extraction may be longitudinally (vertically) split and distributed between the first data end and at least one second data end, where each data end holds a portion of the sample features of the sample data. For example, the sample features include an age feature and an emotion feature, second data end A holds the age feature, and second data end B holds the emotion feature. The first data end can acquire a plurality of sample data and a plurality of sample features of each sample data from the sample data set, and notify each second data end of the selected sample data identifiers and sample feature identifiers. When the plurality of sample data and the plurality of sample features of each sample data are acquired, random selection can be performed based on the idea of the bootstrap aggregating (bagging) algorithm, so as to improve the generalization of feature extraction.
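As an illustration of the random selection described above, the following sketch (the helper name is hypothetical, not the patent's implementation) draws sample identifiers with replacement and a feature subset without replacement, in the spirit of bootstrap aggregating:

```python
import random

def bootstrap_select(sample_ids, feature_ids, n_samples, n_features, seed=None):
    """Randomly draw sample identifiers (with replacement, as in bagging)
    and a feature subset (without replacement) for building one tree."""
    rng = random.Random(seed)
    chosen_samples = [rng.choice(sample_ids) for _ in range(n_samples)]
    chosen_features = rng.sample(feature_ids, n_features)
    return chosen_samples, chosen_features
```

Repeating such a draw for each tree yields differently composed training sets, which is what gives the resulting forest its generalization.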
Step 304: and constructing a tree model according to the plurality of sample data and the sample characteristics, wherein the tree model comprises a plurality of splitting nodes, the splitting nodes split based on the target sample characteristics and the target segmentation values, the target sample characteristics and the target segmentation values are determined based on segmentation indexes of the sample characteristics, and the segmentation indexes are obtained based on safety sample labels of the plurality of sample data and safety sample characteristics on a second data end.
In one or more embodiments of the present disclosure, after obtaining a plurality of sample data and a plurality of sample features of each sample data, further, the first data end may perform multiparty security processing in conjunction with the second data end, and construct a tree model according to the plurality of sample data and the sample features.
Specifically, the Tree model may be called a Decision Tree (Decision Tree), where each node in the Tree model of the Tree structure represents a judgment on an attribute, each branch represents an output of a judgment result, and each leaf node represents a classification result.
In the embodiment of the present disclosure, a plurality of sample data and a plurality of sample features of each sample data may be randomly obtained from a sample data set repeatedly, so as to construct a plurality of tree models, and the plurality of tree models are fused to obtain a random forest model.
In an alternative embodiment of the present disclosure, the building a tree model according to the plurality of sample data and sample features may include the following steps:
obtaining a plurality of candidate segmentation values of each sample feature;
determining the target sample feature of the current splitting node and the target segmentation value of the target sample feature according to the plurality of candidate segmentation values of each sample feature;
the method comprises the steps of sending target sample characteristics and target segmentation values to a target second data end, and receiving a splitting strategy of a current splitting node sent by the target second data end, wherein the target second data end is a second data end comprising the target sample characteristics, and the splitting strategy is used for determining sample data on each sub-node obtained by splitting the current splitting node from a plurality of sample data;
and dividing the plurality of sample data according to the splitting strategy corresponding to each splitting node until the splitting stopping condition is reached, so as to obtain a tree model.
Specifically, a candidate segmentation value refers to a feature value, among the sample feature values corresponding to a sample feature, that is used to divide the sample data. For example, the sample feature is age, and the sample feature values include 10, 20 and 30. Assuming that the candidate segmentation value is 10, the sample feature value corresponding to sample data A is 36, and the sample feature value corresponding to sample data B is 8, it can be determined according to the candidate segmentation value that sample data B belongs to the left subtree and sample data A belongs to the right subtree.
Splitting nodes are tree nodes in the tree model, and splitting refers to distributing the sample data on the current tree node to each child node obtained by splitting that node. The split stop conditions include, but are not limited to, the depth of the tree model reaching a preset depth threshold, and the sample data on the minimum leaf node of the tree model being fewer than a preset sample threshold. By setting the preset depth threshold, over-fitting of the tree model can be prevented; by setting the preset sample threshold, once the number of sample data on the minimum leaf node falls below the preset sample threshold, the node already contains few enough samples that further node splitting is unnecessary, and redundant splitting is avoided.
For example, assuming that the preset depth threshold is 5 and the current tree model has five layers, the depth of the tree model reaches the preset depth threshold, and the recursive creation of the tree model can be stopped, so as to obtain the constructed tree model. Assuming that the preset sample threshold is 10 and there are 8 sample data on the minimum leaf node of the current tree model, it can be determined that the sample data on the minimum leaf node are fewer than the preset sample threshold, the recursive creation of the tree model can be stopped, and the constructed tree model is obtained.
The plurality of candidate segmentation values of a sample feature may be obtained in various manners, specifically selected according to practical situations, which are not limited in any way in the embodiments of the present disclosure. In one possible implementation manner of the present disclosure, each sample feature value of the sample feature may be used as a candidate segmentation value. In another possible implementation of the present disclosure, the candidate segmentation values may be selected randomly from among the sample feature values of the sample feature, or may be selected using the gradient boosting decision tree algorithm (XGBoost, eXtreme Gradient Boosting). For example, for sample feature values of 1-100, quantile points are determined using the gradient boosting decision tree algorithm, whereby 18, 48, 79 and 91 are taken as candidate segmentation values.
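As a rough illustration of quantile-based candidate selection (a simple stand-in for XGBoost's weighted quantile sketch; the helper name is hypothetical):

```python
def candidate_split_values(values, n_candidates):
    """Pick candidate split values at evenly spaced quantiles of the sorted
    feature values, rather than trying every distinct value."""
    vs = sorted(values)
    step = len(vs) / (n_candidates + 1)
    return [vs[int(step * (i + 1))] for i in range(n_candidates)]
```

For feature values 1-100 and three candidates, this yields the quartile-like points 26, 51 and 76; the exact points depend on the quantile scheme chosen.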
After obtaining the multiple candidate segmentation values of each sample feature, secure matrix multiplication can be used to determine the segmentation index of each candidate segmentation value according to the multiple candidate segmentation values of the sample feature, and the target segmentation value of the sample feature can then be determined according to the segmentation index of each candidate segmentation value. The target sample feature is determined from the sample features according to the feature segmentation indexes of the sample features.
By applying the scheme of the embodiment of the specification, the target sample characteristics and the target segmentation values are sent to the target second data end, the splitting strategy of the current splitting node sent by the target second data end is received, and the plurality of sample data are divided according to the splitting strategy corresponding to each splitting node until the splitting stopping condition is reached, so that the tree model is obtained. And the splitting strategy is determined through the second data terminal, so that the pressure of the first data terminal is reduced, and the stability of the tree model construction process is ensured.
In an alternative embodiment of the present disclosure, the determining the target sample feature of the current splitting node and the target segmentation value of the target sample feature according to the plurality of candidate segmentation values of each sample feature may include the following steps:
determining a segmentation index of each candidate segmentation value of a first sample feature aiming at the current split node, wherein the first sample feature is any one of a plurality of sample features;
determining a first target segmentation value of the first sample feature and a first feature segmentation index of the first sample feature according to the segmentation index of each candidate segmentation value;
and determining the target sample characteristics of the current splitting node according to the characteristic segmentation indexes of the sample characteristics.
Specifically, the segmentation index may be referred to as the Gini coefficient; in a binary classification problem, the segmentation index characterizes how reasonably the positive and negative labels are distributed.
After determining the segmentation index of each candidate segmentation value of the first sample feature, the segmentation indexes of each candidate segmentation value may be summed to obtain a first feature segmentation index of the first sample feature.
In practical applications, there are various ways to determine the first target segmentation value of the first sample feature according to the segmentation index of each candidate segmentation value, and the embodiments of the present disclosure do not limit this in any way.
In one possible implementation manner of the present disclosure, the segmentation indexes of the candidate segmentation values of the first sample feature may be ranked, and the candidate segmentation value with the largest segmentation index is used as the first target segmentation value of the first sample feature.
In another possible implementation manner of the present disclosure, a segmentation threshold may be obtained, a segmentation index of each candidate segmentation value of the first sample feature is compared with the segmentation threshold, and a first target segmentation value of the first sample feature is selected from candidate segmentation values with the segmentation index being greater than or equal to the segmentation threshold.
Further, according to the segmentation index of each sample feature, there are various ways of determining the target sample feature of the current splitting node, and the method is specifically selected according to the actual situation, which is not limited in any way in the embodiment of the present specification.
In one possible implementation manner of the present disclosure, feature segmentation indexes of sample features on a current split node may be ranked, and a sample feature with a maximum feature segmentation index is used as a target sample feature of the current split node.
In another possible implementation manner of the present disclosure, a feature segmentation threshold may be obtained, a feature segmentation index of each sample feature is compared with the feature segmentation threshold, and a target sample feature of the current split node is selected from sample features with feature segmentation indexes greater than or equal to the feature segmentation threshold.
By applying the scheme of the embodiments of the present specification, the target sample feature of the current split node is determined according to the feature segmentation index of each sample feature, and, for the target sample feature, the target segmentation value is determined according to the segmentation index of each of its candidate segmentation values. The target sample feature and target segmentation value of the current split node are thus accurately determined, providing a basis for the splitting process of the current split node and ensuring the accuracy of tree model construction.
In an optional embodiment of the present disclosure, when calculating the segmentation index, the distribution of the various labels in the current split node and in each of its child nodes after splitting, that is, the sample class distribution matrix, needs to be known. After the sample class distribution matrix is determined, the segmentation index of each candidate segmentation value can be calculated from it. That is, the determining the segmentation index of each candidate segmentation value of the first sample feature may include the following steps:
encoding sample tags of the plurality of sample data to generate a security sample tag;
determining a sample class distribution matrix of the current split node according to the security sample tag and the security sample features sent by the second data end, wherein the security sample features are obtained by the second data end securely processing the second sample features it holds, and the sample class distribution matrix includes the number of samples on the child nodes corresponding to each candidate segmentation value;
And determining the segmentation index of each candidate segmentation value according to the number of samples on the child node corresponding to each candidate segmentation value.
Specifically, the security sample tag refers to a secret sharing value of the sample tag. The secure sample feature refers to a secret sharing value of the sample feature on the second data side.
There are various ways of encoding the sample tags of the plurality of sample data to generate the security sample tag, specifically selected according to the actual situation, which are not limited in any way in the embodiments of the present disclosure. In one possible implementation manner of the present disclosure, the sample tags may be encoded by one-hot encoding to generate encoded sample tags, which are then transposed and securely processed to generate the security sample tag. In another possible implementation manner of the present disclosure, an encoder may be used to encode the sample tags to generate encoded sample tags, which are then transposed and securely processed to generate the security sample tag, where the security processing may subtract a random value from, or add a random value to, the encoded sample tags, specifically selected according to the actual situation; the embodiments of the present disclosure are not limited in any way.
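A minimal sketch of the one-hot encoding and mask-based security processing described above, assuming a simple two-party additive sharing over integers (function names are illustrative, not the patent's API):

```python
import random

def one_hot_transpose(labels, n_classes):
    """One-hot encode the sample tags and return the transpose Y_oh^T
    (shape: n_classes x n_samples)."""
    return [[1 if y == c else 0 for y in labels] for c in range(n_classes)]

def additive_shares(matrix, seed=None):
    """Split a matrix into two additive secret shares by subtracting a
    random mask: share + mask reconstructs the original matrix."""
    rng = random.Random(seed)
    mask = [[rng.randint(-1000, 1000) for _ in row] for row in matrix]
    share = [[v - m for v, m in zip(vrow, mrow)]
             for vrow, mrow in zip(matrix, mask)]
    return share, mask
```

Neither share alone reveals the labels; only their sum does, which is what lets the second data end compute on its share without a key-based encryption round trip.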
By applying the scheme of the embodiments of the present specification, the sample tags of the plurality of sample data are encoded to generate a security sample tag; the sample class distribution matrix of the current split node is determined according to the security sample tag and the security sample features sent by the second data end; and the segmentation index of each candidate segmentation value is determined according to the number of samples on the child nodes corresponding to each candidate segmentation value. Because the security sample feature is a secret sharing value obtained by the second data end securely processing the second sample feature, the first data end cannot learn the exact second sample feature, yet can compute directly with the security sample feature without encrypting and decrypting the second sample feature through a secret key, which saves tree model construction time on the basis of protecting the security of private data and further improves the feature extraction efficiency for each sample data.
In an alternative embodiment of the present disclosure, to avoid leakage of the sample tags on the first data end and the feature values on the second data end, the sample class distribution matrix of each current split node is determined by secure matrix multiplication. That is, the determining the sample class distribution matrix of the current split node according to the security sample tag and the security sample features sent by the second data end may include the following steps:
Transmitting the security sample tag to a second data terminal so that the second data terminal generates a first sample class distribution matrix according to the security sample tag;
receiving a first sample class distribution matrix and a security sample characteristic sent by a second data terminal;
generating a second sample category distribution matrix according to the security sample characteristics and the sample labels;
and generating a sample category distribution matrix according to the first sample category distribution matrix and the second sample category distribution matrix.
It should be noted that, assume the first data end performs one-hot encoding on the sample tags to obtain an encoded sample tag Y_oh, takes the transpose Y_oh^T, and then performs security processing on Y_oh^T to obtain a security sample tag <Y_oh^T>. The first data end sends the security sample tag <Y_oh^T> to the second data end, so that the second data end generates a first sample class distribution matrix C1 according to <Y_oh^T>. The first data end receives the first sample class distribution matrix C1 and the security sample feature <W> sent by the second data end, generates a second sample class distribution matrix C2 according to <W> and the encoded sample tags corresponding to the sample tags, and combines C1 and C2 to generate the sample class distribution matrix C. In C, N represents the number of candidate segmentation values, C_ij represents the number of samples of class i in the first child node after the current split node is segmented by candidate segmentation value x_j, and C_i0 is the total number of samples of class i in the current parent node, so the number of samples of class i in the second child node can be represented by C_i0 - C_ij (j ≠ 0).
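The construction of the sample class distribution matrix can be illustrated in plaintext as the matrix product C = Y_oh^T · W, where W is a 0/1 indicator matrix whose column j marks the samples falling into the first child under candidate segmentation value x_j (column 0 is all ones, giving the parent totals); under secret sharing each party would compute such a product on its own shares. A plaintext sketch with a hypothetical helper name:

```python
def class_distribution(y_oh_t, w):
    """C = Y_oh^T @ W. C[i][j] counts the samples of class i that fall into
    the first child when splitting by candidate value x_j; column j = 0 of W
    is all ones, so C[i][0] is the class-i total in the parent node."""
    n_samples = len(y_oh_t[0])
    return [[sum(y_row[s] * w[s][j] for s in range(n_samples))
             for j in range(len(w[0]))] for y_row in y_oh_t]
```

The second-child counts then follow without any extra communication as C[i][0] - C[i][j].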
In one possible implementation manner of the present disclosure, the first data end may encrypt the security sample tag and transmit the security sample tag to the second data end, where the second data end calculates an encrypted first sample class distribution matrix and transmits the encrypted first sample class distribution matrix to the first data end, and the first data end decrypts the encrypted first sample class distribution matrix to obtain a real first sample class distribution matrix, and regenerates the sample class distribution matrix.
By applying the scheme of the embodiments of the present specification, the first sample class distribution matrix and the security sample features sent by the second data end are received; a second sample class distribution matrix is generated according to the security sample features and the sample tags; and the sample class distribution matrix is generated according to the first sample class distribution matrix and the second sample class distribution matrix, so that the number of samples on the child nodes corresponding to each candidate segmentation value is accurately determined.
In an alternative embodiment of the present disclosure, the child nodes include a first child node and a second child node; determining the segmentation index of each candidate segmentation value according to the number of samples on the child node corresponding to each candidate segmentation value may include the following steps:
Aiming at any candidate segmentation value, calculating a left segmentation index of a first sub-node according to the number of samples on the first sub-node corresponding to the candidate segmentation value;
calculating a right segmentation index of the second sub-node according to the number of samples on the second sub-node corresponding to the candidate segmentation value;
and determining the segmentation index of the candidate segmentation value according to the left segmentation index and the right segmentation index.
It should be noted that, according to the sample class distribution matrix, the number of samples on the first sub-node and the number of samples on the second sub-node corresponding to each candidate segmentation value may be determined.
The left segmentation index of the first child node may be calculated by the following formula (1), the right segmentation index of the second child node may be calculated by the following formula (2), and the segmentation index of the candidate segmentation value may be calculated by the following formula (3):

Gini(D_L) = 1 − Σ_k ( |C_k^L| / |D_L| )²    (1)

Gini(D_R) = 1 − Σ_k ( |C_k^R| / |D_R| )²    (2)

Gini(D, x_j) = ( |D_L| / |D| ) Gini(D_L) + ( |D_R| / |D| ) Gini(D_R)    (3)

wherein D_L is the sample data set in the first child node, D_R is the sample data set in the second child node, D is the sample data set in the parent node of the current split node, and |C_k^L| and |C_k^R| are the numbers of samples of class k in the first and second child nodes, respectively.
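Formulas (1)-(3) can be sketched as follows (a plaintext illustration assuming the standard CART Gini impurity, not the secret-shared computation; helper names are illustrative):

```python
def gini(counts):
    """Gini impurity of a node from its per-class sample counts: 1 - sum(p_k^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Weighted Gini of a candidate split, as in formula (3): the child
    impurities weighted by their share of the parent's samples."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return n_left / n * gini(left_counts) + n_right / n * gini(right_counts)
```

A pure node ([10, 0]) scores 0, an evenly mixed one ([5, 5]) scores 0.5, and a split that separates the classes perfectly drives the weighted value of formula (3) to 0.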
By applying the scheme of the embodiment of the specification, aiming at any candidate segmentation value, calculating a left segmentation index of a first sub-node according to the number of samples on the first sub-node corresponding to the candidate segmentation value; calculating a right segmentation index of the second sub-node according to the number of samples on the second sub-node corresponding to the candidate segmentation value; and determining the segmentation index of the candidate segmentation value according to the left segmentation index and the right segmentation index, so that the characteristic extraction can be performed according to the segmentation index in the follow-up process.
Step 306: and determining the importance coefficient of the target sample characteristic according to the segmentation index of the target sample characteristic corresponding to each split node.
In one or more embodiments of the present disclosure, a plurality of sample data and a plurality of sample features of each sample data are obtained, and after a tree model is constructed according to the plurality of sample data and the sample features, further, an importance coefficient of a target sample feature may be determined according to a segmentation index of the target sample feature corresponding to each split node.
Specifically, the importance coefficient is used to guide the feature extraction process, and the importance coefficient characterizes the importance degree of the target sample feature.
In an optional embodiment of the present disclosure, the determining the importance coefficient of the target sample feature according to the segmentation index of the target sample feature corresponding to each split node may include the following steps:
for any split node, determining a parent node and a child node of the split node;
determining the segmentation index gain of the target sample characteristic corresponding to the splitting node according to the segmentation index corresponding to the father node and the segmentation index corresponding to the child node;
and determining the importance coefficient of the target sample characteristic according to the segmentation index gain of the target sample characteristic corresponding to each split node.
Specifically, the segmentation index gain characterizes the change in the segmentation index of a split node before and after splitting, that is, the importance of the sample feature at that split node. The segmentation index gain of the target sample feature corresponding to a split node may be determined by the following formula (4):

VIM_jq = GI_q − GI_l − GI_r    (4)

wherein VIM_jq is the importance of target sample feature X_j at split node q, GI_q is the splitting index of split node q, GI_l is the splitting index of the first child node (left child node) obtained after splitting split node q, and GI_r is the splitting index of the second child node (right child node) obtained after splitting split node q.
It should be noted that, when determining the importance coefficient of the target sample feature according to the segmentation index gain of the target sample feature corresponding to each split node, the node set Q of the target sample feature in the tree model t may be determined, so as to determine the importance index of the target sample feature according to the segmentation index gain of the target sample feature corresponding to each split node Q in the node set Q, and further normalize the importance index of the target sample feature to obtain the importance coefficient of the target sample feature.
Specifically, the importance index of the target sample feature can be determined by the following formula (5):

VIM_j = Σ_{q∈Q} VIM_jq    (5)
In one possible implementation manner of the present disclosure, if the first data end constructs T tree models, the importance index of the target sample feature may be determined by the following formula (6):

VIM_j = Σ_{t=1}^{T} Σ_{q∈Q_t} VIM_jq^{(t)}    (6)
Further, the importance coefficient of the target sample feature may be determined by the following formula (7):

VIM_i* = VIM_i / Σ_{i'=1}^{m} VIM_{i'}    (7)

where i is the i-th target sample feature and m is the number of all sample features.
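Formulas (4) and (7) can be illustrated as follows (a plaintext sketch with hypothetical helper names, showing the per-node gain and the final normalization):

```python
def node_gain(gini_parent, gini_left, gini_right):
    """Formula (4): importance contribution of the target sample feature
    at one split node."""
    return gini_parent - gini_left - gini_right

def importance_coefficients(raw_importance):
    """Formula (7): normalize the per-feature importance indexes so the
    resulting coefficients sum to 1."""
    total = sum(raw_importance.values())
    return {f: v / total for f, v in raw_importance.items()}
```

Summing `node_gain` over every split node where a feature appears (and over all T trees) gives its importance index per formulas (5) and (6); `importance_coefficients` then normalizes across features.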
By applying the scheme of the embodiments of the present specification, for any split node, the parent node and child nodes of the split node are determined; the segmentation index gain of the target sample feature corresponding to the split node is determined according to the segmentation index corresponding to the parent node and the segmentation indexes corresponding to the child nodes; and the importance coefficient of the target sample feature is determined according to the segmentation index gain of the target sample feature corresponding to each split node, ensuring the accuracy of the importance coefficient of the target sample feature.
Step 308: and sending the importance coefficient of each target sample characteristic to a second data end so that the second data end performs characteristic extraction according to the importance coefficient.
In one or more embodiments of the present disclosure, a plurality of sample data and a plurality of sample features of each sample data are obtained, a tree model is constructed according to the plurality of sample data and the sample features, after determining an importance coefficient of a target sample feature according to a segmentation index of the target sample feature corresponding to each split node, the importance coefficient of each target sample feature may be further sent to a second data end, so that the second data end performs feature extraction according to the importance coefficient.
In practical applications, the importance coefficient of each target sample feature is sent to the second data end in various manners, and specifically, the importance coefficient is selected according to practical situations, which is not limited in any way in the embodiment of the present disclosure. In one possible implementation of the present disclosure, the importance coefficients of each target sample feature may be sent to all second data terminals. In another possible implementation manner of the present disclosure, the importance coefficient of each target sample feature may be sent to the second data end that holds the corresponding target sample feature, so as to reduce the data transmission amount.
By applying the scheme of the embodiments of the present specification, when each split node splits, the target sample feature and target segmentation value of each node are determined through the security sample features secret-shared by the second data end, and the data on the first data end and the second data end do not need to be encrypted and decrypted with a secret key, which saves tree model construction time on the basis of protecting the security of private data and further improves the feature extraction efficiency for each sample data.
Referring to fig. 4, fig. 4 shows a flowchart of another feature extraction method based on secret sharing according to an embodiment of the present disclosure, where the feature extraction method based on secret sharing is applied to a second data terminal, and specifically includes the following steps:
Step 402: and receiving importance coefficients of all target sample characteristics sent by a first data end, wherein the importance coefficients are determined according to segmentation indexes of the target sample characteristics corresponding to all splitting nodes in a tree model, the segmentation indexes are obtained based on safe sample labels of a plurality of sample data and the safe sample characteristics on a second data end, the splitting nodes split based on the target sample characteristics and target segmentation values, the target sample characteristics and the target segmentation values are determined based on the segmentation indexes of all the sample characteristics, and the tree model is constructed according to a plurality of sample data and a plurality of sample characteristics of all the sample data.
Step 404: and carrying out feature extraction according to the importance coefficient of each target sample feature to obtain a feature extraction result.
In practical applications, there are various ways of extracting features according to the importance coefficient of each target sample feature, and the method is specifically selected according to the practical situation, which is not limited in any way in the embodiments of the present disclosure.
In one possible implementation manner of the present disclosure, the target sample features may be ranked according to the importance coefficient of each target sample feature, and the top-Y target sample features extracted to obtain the feature extraction result, where Y is specifically selected according to the actual situation and is not limited in any way in the embodiments of the present disclosure.
In another possible implementation manner of the present disclosure, a preset importance coefficient threshold may be obtained, so that, according to an importance coefficient of each target sample feature and the preset importance coefficient threshold, a target sample feature with an importance coefficient greater than or equal to the importance coefficient threshold is extracted, and a feature extraction result is obtained.
Illustratively, assume that the target sample features include an age feature, an emotion feature and an academic feature, where the importance coefficient corresponding to the age feature is 0.6, that corresponding to the emotion feature is 0.3, and that corresponding to the academic feature is 0.5. The target sample features are sorted from largest to smallest importance coefficient, and the top-ranked age feature is extracted as the feature extraction result.
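The two extraction strategies above (top-Y ranking and importance-coefficient threshold) can be sketched as follows (the helper name is illustrative):

```python
def extract_features(importance, top_y=None, threshold=None):
    """Extract either the top-Y features by importance coefficient, or all
    features whose coefficient reaches the given threshold."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    if top_y is not None:
        return ranked[:top_y]
    return [f for f in ranked if importance[f] >= threshold]
```

With the coefficients from the example (age 0.6, emotion 0.3, academic 0.5), top-1 extraction returns only the age feature, while a threshold of 0.5 keeps both age and academic features.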
By applying the scheme of the embodiment of the present specification, the importance coefficient of each target sample feature sent by the first data end is received, and feature extraction is performed according to the importance coefficient of each target sample feature to obtain a feature extraction result. In the process of determining the importance coefficients, the first data end determines the target sample feature and target segmentation value of each node through the security sample features secret-shared by the second data end, so that data on the first data end and the second data end do not need to be encrypted and decrypted with a secret key; on the basis of protecting the security of private data, the construction time of the tree model is saved, and the feature extraction efficiency of each sample data is further improved.
In an alternative embodiment of the present disclosure, the second data end includes a target sample feature and a feature value of the target sample feature; before the importance coefficients of the target sample features sent by the first data end are received, the method may further include the following steps:
receiving the target sample feature of the current split node and the target segmentation value of the target sample feature sent by the first data end;
determining a splitting strategy of the current split node according to the target segmentation value and the feature value of the target sample feature, and sending the splitting strategy to the first data end.
It should be noted that, in addition to the target sample feature, the second sample features included in the second data end may further include other sample features. After receiving the target sample feature and the target segmentation value sent by the first data end, the second data end can select the target sample feature from its second sample features and compare the feature value of the target sample feature with the target segmentation value to obtain the splitting strategy of the current split node: if the feature value is smaller than the target segmentation value, the sample data corresponding to the target sample feature in the current split node is determined to belong to the left subtree; if the feature value is greater than the target segmentation value, the sample data is determined to belong to the right subtree.
In practical application, when the second data end sends the splitting strategy to the first data end, the splitting strategy can carry the sample identifiers of the sample data, so that the first data end can split nodes efficiently. In the embodiments of the present disclosure, only the second data end that provides the target sample feature may store the splitting strategy; the other second data ends only know that the target sample feature is not provided by themselves.
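A minimal sketch of this comparison at the second data end, assuming a plain dictionary keyed by sample identifier (the function name, sample identifiers, and values are hypothetical):

```python
# Hypothetical sketch: the second data end compares its local feature
# values against the target segmentation value received from the first
# data end and returns a splitting strategy keyed by sample identifier
# (True -> left subtree, False -> right subtree).
def make_splitting_strategy(feature_values, target_segmentation_value):
    # feature_values: {sample_id: value of the target sample feature}
    return {sample_id: value < target_segmentation_value
            for sample_id, value in feature_values.items()}

strategy = make_splitting_strategy({"s1": 20, "s2": 35, "s3": 42}, 30)
```

Carrying the sample identifiers in the strategy, as the text suggests, lets the first data end route each sample to a child node without learning the underlying feature values.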
By applying the scheme of the embodiment of the present specification, the second data end receives the target sample feature of the current split node and the target segmentation value of the target sample feature sent by the first data end, determines the splitting strategy of the current split node according to the target segmentation value and the feature value of the target sample feature, and sends the splitting strategy to the first data end, without requiring the first data end to generate the splitting strategy; this ensures the privacy of data on the second data end while reducing the pressure on the first data end.
In another alternative embodiment of the present disclosure, the second data end includes a second sample feature; before receiving the importance coefficient of each target sample feature sent by the first data end, the method may further include the following steps:
receiving a security sample label sent by a first data end;
segmenting the second sample feature according to a plurality of candidate segmentation values of the second sample feature to obtain an updated second sample feature;
generating a first sample class distribution matrix according to the security sample label and the updated second sample feature;
performing security processing on the updated second sample feature to obtain a security sample feature;
sending the first sample class distribution matrix and the security sample feature to the first data end.
It should be noted that, assuming the second sample features are {X_1, X_2, ..., X_k}, segmenting the second sample features according to n candidate segmentation values yields the updated second sample feature W = {W_0, W_1, ..., W_nk}, which may also be referred to as a bit matrix. The element W_0 in the bit matrix is an all-ones vector, and W_j (j ≠ 0) = {W_ij, i_1 ≤ i ≤ i_k}, where W_ij takes the value 0 or 1: 1 indicates that sample data i is divided to the left child node by the corresponding candidate segmentation value, and 0 indicates that sample data i is divided to the right child node.
Further, the security sample label <Y_oh^T> is multiplied by the updated second sample feature W to obtain the first sample class distribution matrix. The security processing of the updated second sample feature may be performed by subtracting a random value from, or adding a random value to, the updated second sample feature, which is specifically selected according to the actual situation and is not limited in the embodiments of the present disclosure.
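The bit matrix and the first sample class distribution matrix can be sketched in the clear as follows; in the actual protocol this product is protected by secret sharing, and the sizes, labels, and routing pattern below are illustrative assumptions:

```python
# Assumed toy setting: 4 samples, 2 classes, one second sample feature
# with 2 candidate segmentation values. Column 0 of the bit matrix W is
# the all-ones vector W_0; column j > 0 holds 1 if the sample is routed
# to the left child under candidate segmentation value j, and 0 otherwise.
Y_oh = [[1, 0], [0, 1], [1, 0], [0, 1]]   # one-hot sample labels (4 x 2)
W = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0],
     [1, 0, 0]]                           # bit matrix (4 x 3)

# First sample class distribution matrix C1 = Y_oh^T * W: entry (c, j)
# counts the class-c samples routed left by candidate segmentation value
# j; column 0 gives the per-class totals at the current node.
num_classes, num_cols = len(Y_oh[0]), len(W[0])
C1 = [[sum(Y_oh[i][c] * W[i][j] for i in range(len(W)))
       for j in range(num_cols)]
      for c in range(num_classes)]
```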
By applying the scheme of the embodiment of the present specification, the security sample label sent by the first data end is received; the second sample feature is segmented according to the plurality of candidate segmentation values of the second sample feature to obtain an updated second sample feature; a first sample class distribution matrix is generated according to the security sample label and the updated second sample feature; security processing is performed on the updated second sample feature to obtain a security sample feature; and the first sample class distribution matrix and the security sample feature are sent to the first data end, thereby ensuring the privacy of data on the second data end.
Referring to fig. 5, fig. 5 shows a flowchart of a method for constructing a tree model in a feature extraction method based on secret sharing according to an embodiment of the present disclosure, which specifically includes the following steps:
step 502: a plurality of sample data and a plurality of sample features for each sample data are acquired.
Step 504: sample tags of the plurality of sample data are encoded to generate a security sample tag.
Step 506: sending the security sample tag to the second data end, so that the second data end generates a first sample class distribution matrix according to the security sample tag.
Step 508: receiving the first sample class distribution matrix and the security sample features sent by the second data end.
Step 510: generating a second sample class distribution matrix according to the security sample features and the sample tags.
Step 512: generating a sample class distribution matrix according to the first sample class distribution matrix and the second sample class distribution matrix, wherein the sample class distribution matrix includes the number of samples on the child nodes corresponding to the candidate segmentation values of the sample features.
Step 514: for any candidate segmentation value, calculating a left segmentation index of a first child node according to the number of samples on the first child node corresponding to the candidate segmentation value; calculating a right segmentation index of a second child node according to the number of samples on the second child node corresponding to the candidate segmentation value; and determining the segmentation index of the candidate segmentation value according to the left segmentation index and the right segmentation index.
Step 516: determining a first target segmentation value of a first sample feature and a first feature segmentation index of the first sample feature according to the segmentation index of each candidate segmentation value, wherein the first sample feature is any one of the plurality of sample features.
Step 518: determining the target sample feature of the current split node according to the feature segmentation indexes of the sample features.
Step 520: sending the target sample feature and the target segmentation value of the target sample feature to a target second data end, and receiving the splitting strategy of the current split node sent by the target second data end, wherein the target second data end is the second data end that includes the target sample feature, and the splitting strategy is used for determining, from the plurality of sample data, the sample data on each child node obtained by splitting the current split node.
Step 522: dividing the plurality of sample data according to the splitting strategy corresponding to each split node until a splitting stop condition is reached, so as to obtain the tree model.
It should be noted that, the implementation manners of steps 502 to 522 are the same as the implementation manner of the feature extraction method based on secret sharing provided in fig. 3 and 4, and will not be repeated in this specification.
In practical application, according to the splitting strategy corresponding to each splitting node, a plurality of sample data are divided, and each time a layer of tree structure is built, whether a splitting stop condition is reached can be checked. If the split stop condition is reached, leaf nodes are created accordingly. If the split stop condition is not reached, all split nodes enter a branching state, and a subtree is recursively created.
By applying the scheme of the embodiment of the present specification, a vertical federated random forest feature screening method based on secret sharing is provided, which realizes random forest screening and feature extraction more efficiently while ensuring the security of the private data of a plurality of data ends.
Referring to fig. 6, fig. 6 shows a process flow chart of a feature extraction method based on secret sharing according to an embodiment of the present disclosure, which specifically includes:
Assume that the first data end A holds sample data XA and sample labels Y, and the second data end holds sample data XB. After model construction starts, the t-th tree model can be constructed; after the t-th tree model is constructed, it is judged whether all tree models have been constructed or have converged: if not, the process returns to the step of starting model construction; if yes, it is determined that the tree model is built, the first data end determines the importance coefficient of each sample feature of the sample data and sends the importance coefficients to the second data ends holding the corresponding sample features, and the second data ends can then perform feature extraction according to the importance coefficients.
It should be noted that the t-th tree model may be constructed by:
the first data end encodes the sample label Y and transposes it to obtain the security sample label <Y_oh^T>. The second data end performs feature segmentation to obtain a flag bit matrix W, and performs security processing on the flag bit matrix W to obtain the security sample feature <W>. The first data end secret-shares <Y_oh^T> with the second data end, and the second data end secret-shares <W> with the first data end;
the second data end generates a first sample class distribution matrix C1 according to the security sample label <Y_oh^T> and the flag bit matrix W, and sends C1 to the first data end; the first data end generates a second sample class distribution matrix C2 based on secure matrix multiplication, according to the security sample feature <W> and the sample label.
The first data end calculates the sample class distribution matrix C = C1 + C2, and counts the sample classes of the current split node according to the sample class distribution matrix to determine the segmentation index of each candidate segmentation value;
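If C1 and C2 are additive secret shares of the true class-count matrix — one plausible reading of the protocol, stated here as an assumption — the reconstruction step is just an element-wise sum:

```python
import random

# Assumed sketch: C1 and C2 are additive shares of the true sample class
# distribution matrix, so the first data end recovers it as C = C1 + C2.
random.seed(0)
C_true = [[2, 1, 1], [2, 1, 0]]

# One party's share is random; the other party's share is the difference.
C2 = [[random.randint(-100, 100) for _ in row] for row in C_true]
C1 = [[t - s for t, s in zip(true_row, share_row)]
      for true_row, share_row in zip(C_true, C2)]

# Reconstruction at the first data end.
C = [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(C1, C2)]
```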
the first data end selects the candidate segmentation value corresponding to the maximum segmentation index as the target segmentation value of the current split node, and sends the index of the target segmentation value in the sample class distribution matrix to the second data end holding the target sample feature corresponding to the target segmentation value; the first data end stores the target sample feature corresponding to the current split node and the segmentation index gain of the target sample feature;
the first data end judges whether the construction of the t-th tree model meets the splitting stop condition. If not, the step of encoding the sample label Y and transposing it to obtain the security sample label <Y_oh^T> is executed again to continue constructing the tree model; if yes, it is determined that the construction of the t-th tree is completed.
By applying the scheme of the embodiment of the present specification, in the vertical federated random forest feature extraction method based on secret sharing, the first data end randomly selects sample features and sample data and distributes them to each second data end, thereby ensuring randomness. Secondly, when the target sample feature for tree node splitting is selected, secure matrix multiplication is used so that each second data end calculates the segmentation index of the current tree node's samples according to the selected candidate segmentation values, which improves the construction efficiency of the tree model while leaking neither the feature data of each second data end nor the label data of the first data end.
Corresponding to the embodiment of the feature extraction method based on secret sharing applied to the first data end, the present disclosure further provides an embodiment of a feature extraction device based on secret sharing, and fig. 7 shows a schematic structural diagram of the feature extraction device based on secret sharing provided in one embodiment of the present disclosure, where the feature extraction device based on secret sharing is applied to the first data end. As shown in fig. 7, the apparatus includes:
an acquisition module 702 configured to acquire a plurality of sample data and a plurality of sample features for each sample data;
a building module 704 configured to build a tree model according to the plurality of sample data and the sample features, wherein the tree model includes a plurality of split nodes, the split nodes split based on target sample features and target segmentation values, the target sample features and the target segmentation values are determined based on the segmentation indexes of the sample features, and the segmentation indexes are obtained based on the security sample labels of the plurality of sample data and the security sample features on the second data end;
a determining module 706, configured to determine an importance coefficient of the target sample feature according to the segmentation index of the target sample feature corresponding to each split node;
a sending module 708 is configured to send the importance coefficient of each target sample feature to the second data end, so that the second data end performs feature extraction according to the importance coefficient.
Optionally, the building module 704 is further configured to obtain a plurality of candidate segmentation values of each sample feature; determine the target sample feature of the current split node and the target segmentation value of the target sample feature according to the plurality of candidate segmentation values of each sample feature; send the target sample feature and the target segmentation value to a target second data end, and receive the splitting strategy of the current split node sent by the target second data end, wherein the target second data end is the second data end that includes the target sample feature, and the splitting strategy is used for determining, from the plurality of sample data, the sample data on each child node obtained by splitting the current split node; and divide the plurality of sample data according to the splitting strategy corresponding to each split node until a splitting stop condition is reached, so as to obtain the tree model.
Optionally, the building module 704 is further configured to determine, for the current split node, the segmentation index of each candidate segmentation value of a first sample feature, wherein the first sample feature is any one of the plurality of sample features; determine a first target segmentation value of the first sample feature and a first feature segmentation index of the first sample feature according to the segmentation index of each candidate segmentation value; and determine the target sample feature of the current split node according to the feature segmentation indexes of the sample features.
Optionally, the building module 704 is further configured to encode the sample tags of the plurality of sample data to generate a security sample tag; determine the sample class distribution matrix of the current split node according to the security sample tag and the security sample features sent by the second data end, wherein the security sample features are obtained by the second data end performing security processing on the second sample features on the second data end, and the sample class distribution matrix includes the number of samples on the child nodes corresponding to each candidate segmentation value; and determine the segmentation index of each candidate segmentation value according to the number of samples on the child nodes corresponding to each candidate segmentation value.
Optionally, the building module 704 is further configured to send the security sample tag to the second data end, so that the second data end generates the first sample class distribution matrix according to the security sample tag; receive the first sample class distribution matrix and the security sample features sent by the second data end; generate a second sample class distribution matrix according to the security sample features and the sample tags; and generate the sample class distribution matrix according to the first sample class distribution matrix and the second sample class distribution matrix.
Optionally, the child nodes include a first child node and a second child node; the building module 704 is further configured to calculate, for any candidate segmentation value, a left segmentation index of the first child node according to the number of samples on the first child node corresponding to the candidate segmentation value; calculate a right segmentation index of the second child node according to the number of samples on the second child node corresponding to the candidate segmentation value; and determine the segmentation index of the candidate segmentation value according to the left segmentation index and the right segmentation index.
Optionally, the determining module 706 is further configured to determine, for any split node, the parent node and the child nodes of the split node; determine the segmentation index gain of the target sample feature corresponding to the split node according to the segmentation index corresponding to the parent node and the segmentation indexes corresponding to the child nodes; and determine the importance coefficient of the target sample feature according to the segmentation index gain of the target sample feature corresponding to each split node.
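A minimal sketch of aggregating per-node segmentation index gains into an importance coefficient; the exact gain formula and any normalization are not fixed by the text, so a plain sum over the nodes that split on a feature is assumed here:

```python
# Hypothetical sketch: each split node contributes the segmentation index
# gain of its target sample feature, and the importance coefficient of a
# feature is taken as the sum of gains over the nodes splitting on it.
def importance_coefficients(split_records):
    # split_records: list of (target_feature_name, gain_at_that_node)
    coefficients = {}
    for feature, gain in split_records:
        coefficients[feature] = coefficients.get(feature, 0.0) + gain
    return coefficients

# Assumed example: two nodes split on the age feature, one on emotion.
coefficients = importance_coefficients(
    [("age", 0.4), ("emotion", 0.1), ("age", 0.2)])
```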
By applying the scheme of the embodiment of the present specification, when each split node splits, the target sample feature and target segmentation value of each node are determined through the security sample features secret-shared by the second data end, so that data on the first data end and the second data end do not need to be encrypted and decrypted with a secret key; on the basis of protecting the security of private data, the construction time of the tree model is saved, and the feature extraction efficiency of each sample data is further improved.
The above is an exemplary scheme of the feature extraction device based on secret sharing in this embodiment. It should be noted that, the technical solution of the feature extraction device based on secret sharing and the technical solution of the feature extraction method based on secret sharing applied to the first data end belong to the same concept, and details of the technical solution of the feature extraction device based on secret sharing, which are not described in detail, can be referred to the description of the technical solution of the feature extraction method based on secret sharing applied to the first data end.
Corresponding to the above embodiment of the feature extraction method based on secret sharing applied to the second data end, the present disclosure further provides an embodiment of a feature extraction device based on secret sharing, and fig. 8 shows a schematic structural diagram of another feature extraction device based on secret sharing provided in one embodiment of the present disclosure, where the feature extraction device based on secret sharing is applied to the second data end. As shown in fig. 8, the apparatus includes:
the receiving module 802 is configured to receive the importance coefficient of each target sample feature sent by the first data end, where the importance coefficients are determined according to the segmentation indexes of the target sample features corresponding to the splitting nodes in the tree model, the segmentation indexes are obtained based on the security sample labels of the plurality of sample data and the security sample features on the second data end, the splitting nodes split based on the target sample features and target segmentation values, the target sample features and the target segmentation values are determined based on the segmentation indexes of the sample features, and the tree model is constructed according to the plurality of sample data and the plurality of sample features of each sample data;
the extraction module 804 is configured to perform feature extraction according to the importance coefficient of each target sample feature, so as to obtain a feature extraction result.
Optionally, the second data end includes a target sample feature and a feature value of the target sample feature; the apparatus further includes: a splitting strategy sending module configured to receive the target sample feature of the current split node and the target segmentation value of the target sample feature sent by the first data end; determine the splitting strategy of the current split node according to the target segmentation value and the feature value of the target sample feature; and send the splitting strategy to the first data end.
Optionally, the second data end includes a second sample feature; the apparatus further includes: a processing module configured to receive the security sample tag sent by the first data end; segment the second sample feature according to the plurality of candidate segmentation values of the second sample feature to obtain an updated second sample feature; generate a first sample class distribution matrix according to the security sample tag and the updated second sample feature; perform security processing on the updated second sample feature to obtain a security sample feature; and send the first sample class distribution matrix and the security sample feature to the first data end.
By applying the scheme of the embodiment of the present specification, the importance coefficient of each target sample feature sent by the first data end is received, and feature extraction is performed according to the importance coefficient of each target sample feature to obtain a feature extraction result. In the process of determining the importance coefficients, the first data end determines the target sample feature and target segmentation value of each node through the security sample features secret-shared by the second data end, so that data on the first data end and the second data end do not need to be encrypted and decrypted with a secret key; on the basis of protecting the security of private data, the construction time of the tree model is saved, and the feature extraction efficiency of each sample data is further improved.
The above is an exemplary scheme of the feature extraction device based on secret sharing in this embodiment. It should be noted that, the technical solution of the feature extraction device based on secret sharing and the technical solution of the feature extraction method based on secret sharing applied to the second data end belong to the same concept, and details of the technical solution of the feature extraction device based on secret sharing, which are not described in detail, can be referred to the description of the technical solution of the feature extraction method based on secret sharing applied to the second data end.
FIG. 9 illustrates a block diagram of a computing device provided by one embodiment of the present description. The components of computing device 900 include, but are not limited to, memory 910 and processor 920. Processor 920 is coupled to memory 910 via bus 930, and database 950 is configured to hold data.
Computing device 900 also includes an access device 940, which enables computing device 900 to communicate via one or more networks 960. Examples of such networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. Access device 940 may include one or more of any type of wired or wireless network interface, such as a network interface card (NIC), e.g., an IEEE 802.11 wireless local area network (WLAN) wireless interface, a worldwide interoperability for microwave access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 900 and other components not shown in FIG. 9 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 9 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 900 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 900 may also be a mobile or stationary server.
The processor 920 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the secret sharing-based feature extraction method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the feature extraction method based on secret sharing belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the feature extraction method based on secret sharing.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the secret sharing-based feature extraction method described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the feature extraction method based on secret sharing belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the feature extraction method based on secret sharing.
An embodiment of the present disclosure further provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the above-mentioned feature extraction method based on secret sharing.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solution of the feature extraction method based on secret sharing belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the feature extraction method based on secret sharing.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely intended to help clarify the specification. The alternative embodiments are not exhaustive, nor do they limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the teachings of the embodiments. The embodiments were chosen and described in order to best explain their principles and practical application, thereby enabling others skilled in the art to understand and use the invention. The specification is to be limited only by the claims and their full scope and equivalents.

Claims (14)

1. A feature extraction method based on secret sharing, characterized in that the method is applied to a first data end and comprises:
acquiring a plurality of sample data and a plurality of sample features of each sample data;
constructing a tree model according to the plurality of sample data and the sample features, wherein the tree model comprises a plurality of split nodes, each split node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on segmentation indexes of the sample features, and the segmentation indexes are obtained based on security sample labels of the plurality of sample data and security sample features on a second data end;
determining an importance coefficient of each target sample feature according to the segmentation index of the target sample feature corresponding to each split node;
and sending the importance coefficient of each target sample feature to the second data end, so that the second data end performs feature extraction according to the importance coefficients.
2. The method according to claim 1, wherein said constructing a tree model according to the plurality of sample data and the sample features comprises:
obtaining a plurality of candidate segmentation values of each sample feature;
determining a target sample feature of the current split node and a target segmentation value of the target sample feature according to the candidate segmentation values of the sample features;
sending the target sample feature and the target segmentation value to a target second data end, and receiving a splitting strategy of the current split node sent by the target second data end, wherein the target second data end is a second data end that comprises the target sample feature, and the splitting strategy is used for determining, from the plurality of sample data, the sample data on each child node obtained by splitting the current split node;
and dividing the plurality of sample data according to the splitting strategy corresponding to each split node until a splitting stop condition is reached, to obtain the tree model.
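Setting the secret-sharing layer aside, the per-node selection loop of this claim — scan every (sample feature, candidate segmentation value) pair, score each split, and keep the best — can be sketched in plaintext. The column names, toy data, and the use of Gini impurity as the segmentation index are illustrative assumptions, not the patent's exact protocol:

```python
def gini(ys):
    """Gini impurity of a list of class labels."""
    if not ys:
        return 0.0
    total = len(ys)
    return 1.0 - sum((ys.count(c) / total) ** 2 for c in set(ys))

def split_score(left, right):
    """Sample-weighted impurity of the two children (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(columns, labels, candidates):
    """Scan every (feature, candidate segmentation value) pair, return the best.

    columns:    feature name -> per-sample values
    candidates: feature name -> candidate segmentation values
    """
    best = None
    for feat, cuts in candidates.items():
        for cut in cuts:
            left = [y for x, y in zip(columns[feat], labels) if x <= cut]
            right = [y for x, y in zip(columns[feat], labels) if x > cut]
            score = split_score(left, right)
            if best is None or score < best[0]:
                best = (score, feat, cut)
    return best

cols = {"f1": [1, 2, 3, 4], "f2": [5, 1, 5, 1]}
labels = [0, 0, 1, 1]
print(best_split(cols, labels, {"f1": [2.5], "f2": [3.0]}))  # f1 at 2.5 separates the classes perfectly
```

In the protocol of claims 4 and 5 the `left`/`right` label counts are never computed in the clear; they are reconstructed from secret-shared class distribution matrices.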
3. The method according to claim 2, wherein determining the target sample feature of the current split node and the target segmentation value of the target sample feature according to the candidate segmentation values of the sample features comprises:
determining a segmentation index of each candidate segmentation value of a first sample feature for the current split node, wherein the first sample feature is any one of the plurality of sample features;
determining a first target segmentation value of the first sample feature and a first feature segmentation index of the first sample feature according to the segmentation indexes of the candidate segmentation values;
and determining the target sample feature of the current split node according to the feature segmentation indexes of the sample features.
4. The method according to claim 3, wherein determining the segmentation index of each candidate segmentation value of the first sample feature comprises:
encoding sample labels of the plurality of sample data to generate a security sample label;
determining a sample class distribution matrix of the current split node according to the security sample label and security sample features sent by the second data end, wherein the security sample features are obtained by the second data end by securely processing second sample features held on the second data end, and the sample class distribution matrix comprises the number of samples on the child nodes corresponding to each candidate segmentation value;
and determining the segmentation index of each candidate segmentation value according to the number of samples on the child nodes corresponding to the candidate segmentation value.
5. The method according to claim 4, wherein determining the sample class distribution matrix of the current split node according to the security sample label and the security sample features sent by the second data end comprises:
sending the security sample label to the second data end, so that the second data end generates a first sample class distribution matrix according to the security sample label;
receiving the first sample class distribution matrix and the security sample features sent by the second data end;
generating a second sample class distribution matrix according to the security sample features and the sample labels;
and generating the sample class distribution matrix according to the first sample class distribution matrix and the second sample class distribution matrix.
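The two-matrix combination in this claim can be illustrated with additive secret sharing: the label holder splits a one-hot label matrix into two shares, each party multiplies its share by the per-cut left-child indicator matrix, and the two partial matrices sum to the true class counts. The ring size, party roles, and toy data below are illustrative assumptions (in the real protocol the indicator matrix itself is also masked, per claim 10):

```python
import numpy as np

MOD = 2**32  # illustrative ring for additive secret sharing

def share(x, rng):
    """Split an unsigned integer array into two additive shares modulo MOD."""
    r = rng.integers(0, MOD, size=x.shape, dtype=np.uint64)
    return r, (x - r) % MOD

rng = np.random.default_rng(0)
# hypothetical toy data: 6 samples, 2 classes, 1 feature, 2 candidate cuts
labels = np.array([0, 1, 1, 0, 1, 0])
onehot = np.eye(2, dtype=np.uint64)[labels]        # samples x classes
feature = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])
cuts = [2.5, 4.5]
# indicator matrix: samples x cuts, 1 if the sample falls in the left child
left = np.stack([(feature <= c).astype(np.uint64) for c in cuts], axis=1)

# label holder shares its one-hot label matrix
y_a, y_b = share(onehot, rng)
# each party computes a partial class-distribution matrix (cuts x classes)
m_b = (left.T @ y_b) % MOD   # at the feature holder, from its share of the labels
m_a = (left.T @ y_a) % MOD   # at the label holder, from the other share
# reconstruction: per-cut class counts in the left child, with no raw labels exchanged
dist = (m_a + m_b) % MOD
print(dist)
```

Because matrix multiplication is linear over the ring, the reconstructed `dist` equals the plaintext product `left.T @ onehot`.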
6. The method according to claim 4, wherein the child nodes comprise a first child node and a second child node, and determining the segmentation index of each candidate segmentation value according to the number of samples on the child nodes corresponding to the candidate segmentation value comprises:
for any candidate segmentation value, calculating a left segmentation index of the first child node according to the number of samples on the first child node corresponding to the candidate segmentation value;
calculating a right segmentation index of the second child node according to the number of samples on the second child node corresponding to the candidate segmentation value;
and determining the segmentation index of the candidate segmentation value according to the left segmentation index and the right segmentation index.
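The claim does not fix a particular segmentation index; Gini impurity, as in CART-style trees, is one plausible instantiation. Under that assumption, the left index, right index, and their sample-weighted combination look like this (the function names and weighting are mine, not quoted from the claims):

```python
def gini(counts):
    """Gini impurity of one child node from its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_index(left_counts, right_counts):
    """Segmentation index of a candidate segmentation value: the
    sample-weighted sum of the left and right child impurities
    (lower means a purer, better split)."""
    n = sum(left_counts) + sum(right_counts)
    return (sum(left_counts) / n) * gini(left_counts) + \
           (sum(right_counts) / n) * gini(right_counts)

# toy example: 3/1 class split on the left, a pure 0/4 split on the right
print(round(split_index([3, 1], [0, 4]), 4))
```

The per-class counts fed to `gini` are exactly what the sample class distribution matrix of claims 4 and 5 provides, one row per candidate segmentation value.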
7. The method according to claim 1, wherein determining the importance coefficient of each target sample feature according to the segmentation index of the target sample feature corresponding to each split node comprises:
for any split node, determining a parent node and child nodes of the split node;
determining a segmentation index gain of the target sample feature corresponding to the split node according to the segmentation index corresponding to the parent node and the segmentation indexes corresponding to the child nodes;
and determining the importance coefficient of the target sample feature according to the segmentation index gains of the target sample feature corresponding to the split nodes.
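A standard way to realise this claim — assumed here, since the claim only names the steps — is the impurity-gain importance used by decision-tree libraries: the gain at one split is the parent index minus the weighted child indices, and a feature's importance coefficient is the sum of its gains, normalised across features:

```python
def index_gain(parent_index, child_indices, child_weights):
    """Gain of one split: parent impurity minus sample-weighted child impurities."""
    return parent_index - sum(w * g for w, g in zip(child_weights, child_indices))

def importance(gains_per_feature):
    """Importance coefficient per feature: the sum of that feature's split
    gains, normalised over all features so the coefficients sum to 1."""
    total = sum(sum(g) for g in gains_per_feature.values())
    return {f: sum(g) / total for f, g in gains_per_feature.items()}

# hypothetical gains: feat_a splits twice in the tree, feat_b once
gains = {"feat_a": [0.2, 0.1], "feat_b": [0.1]}
coeffs = importance(gains)
print({f: round(v, 2) for f, v in coeffs.items()})
```

Only these normalised coefficients — not the underlying labels or features — are what the first data end sends to the second data end.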
8. A feature extraction method based on secret sharing, characterized in that the method is applied to a second data end and comprises:
receiving importance coefficients of target sample features sent by a first data end, wherein the importance coefficients are determined according to segmentation indexes of the target sample features corresponding to split nodes in a tree model, the segmentation indexes are obtained based on security sample labels of a plurality of sample data and security sample features on the second data end, each split node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on the segmentation indexes of the sample features, and the tree model is constructed according to the plurality of sample data and the plurality of sample features of each sample data;
and performing feature extraction according to the importance coefficient of each target sample feature to obtain a feature extraction result.
9. The method according to claim 8, wherein the second data end comprises a target sample feature and a feature value of the target sample feature, and before receiving the importance coefficients of the target sample features sent by the first data end, the method further comprises:
receiving a target sample feature of a current split node and a target segmentation value of the target sample feature sent by the first data end;
and determining a splitting strategy of the current split node according to the target segmentation value and the feature value of the target sample feature, and sending the splitting strategy to the first data end.
10. The method according to claim 8, wherein the second data end comprises a second sample feature, and before receiving the importance coefficients of the target sample features sent by the first data end, the method further comprises:
receiving a security sample label sent by the first data end;
segmenting the second sample feature according to a plurality of candidate segmentation values of the second sample feature to obtain an updated second sample feature;
generating a first sample class distribution matrix according to the security sample label and the updated second sample feature;
securely processing the updated second sample feature to obtain a security sample feature;
and sending the first sample class distribution matrix and the security sample feature to the first data end.
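The "segment, then securely process" steps on the second data end can be sketched as follows: the raw feature column is expanded into one left/right indicator column per candidate segmentation value (the updated second sample feature), then additively masked so the raw indicators never leave this party. The ring size, masking scheme, and toy data are illustrative assumptions:

```python
import numpy as np

MOD = 2**32  # illustrative ring size for additive masking

def preprocess(feature, cuts, rng):
    """Second data end side of claim 10: expand a raw feature column into
    per-cut left-child indicators, then mask them additively. Only the
    masked share ('security sample feature') is sent to the first data end."""
    updated = np.stack([(feature <= c).astype(np.uint64) for c in cuts], axis=1)
    mask = rng.integers(0, MOD, size=updated.shape, dtype=np.uint64)
    secure = (updated - mask) % MOD  # share that would be transmitted
    return updated, mask, secure

rng = np.random.default_rng(1)
feature = np.array([0.2, 1.5, 0.7, 2.4])
updated, mask, secure = preprocess(feature, cuts=[1.0, 2.0], rng=rng)
# the retained mask and the transmitted share reconstruct the indicators exactly
assert ((secure + mask) % MOD == updated).all()
```

Each masked share is uniformly random on its own, so the first data end learns nothing about individual feature values, yet linear operations on the shares (as in the class-distribution matrices of claim 5) still reconstruct correctly.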
11. A feature extraction apparatus based on secret sharing, characterized in that the apparatus is applied to a first data end and comprises:
an acquisition module configured to acquire a plurality of sample data and a plurality of sample features of each sample data;
a construction module configured to construct a tree model according to the plurality of sample data and the sample features, wherein the tree model comprises a plurality of split nodes, each split node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on segmentation indexes of the sample features, and the segmentation indexes are obtained based on security sample labels of the plurality of sample data and security sample features on a second data end;
a determining module configured to determine an importance coefficient of each target sample feature according to the segmentation index of the target sample feature corresponding to each split node;
and a sending module configured to send the importance coefficient of each target sample feature to the second data end, so that the second data end performs feature extraction according to the importance coefficients.
12. A feature extraction apparatus based on secret sharing, characterized in that the apparatus is applied to a second data end and comprises:
a receiving module configured to receive importance coefficients of target sample features sent by a first data end, wherein the importance coefficients are determined according to segmentation indexes of the target sample features corresponding to split nodes in a tree model, the segmentation indexes are obtained based on security sample labels of a plurality of sample data and security sample features on the second data end, each split node splits based on a target sample feature and a target segmentation value, the target sample feature and the target segmentation value are determined based on the segmentation indexes of the sample features, and the tree model is constructed according to the plurality of sample data and the sample features of each sample data;
and an extraction module configured to perform feature extraction according to the importance coefficient of each target sample feature to obtain a feature extraction result.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer executable instructions, and the processor is configured to execute the computer executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 7 or any one of claims 8 to 10.
14. A computer readable storage medium, characterized in that it stores computer executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7 or any one of claims 8 to 10.
CN202310791344.8A 2023-06-30 2023-06-30 Feature extraction method and device based on secret sharing Active CN116502255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310791344.8A CN116502255B (en) 2023-06-30 2023-06-30 Feature extraction method and device based on secret sharing


Publications (2)

Publication Number Publication Date
CN116502255A true CN116502255A (en) 2023-07-28
CN116502255B CN116502255B (en) 2023-09-19

Family

ID=87328919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310791344.8A Active CN116502255B (en) 2023-06-30 2023-06-30 Feature extraction method and device based on secret sharing

Country Status (1)

Country Link
CN (1) CN116502255B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018182442A1 (en) * 2017-03-27 2018-10-04 Huawei Technologies Co., Ltd. Machine learning system and method for generating a decision stream and automonously operating device using the decision stream
CN108665293A (en) * 2017-03-29 2018-10-16 华为技术有限公司 Feature importance acquisition methods and device
CN110717509A (en) * 2019-09-03 2020-01-21 中国平安人寿保险股份有限公司 Data sample analysis method and device based on tree splitting algorithm
US20200111030A1 (en) * 2018-10-05 2020-04-09 Cisco Technology, Inc. Distributed random forest training with a predictor trained to balance tasks
CN111340121A (en) * 2020-02-28 2020-06-26 支付宝(杭州)信息技术有限公司 Target feature determination method and device
CN111598186A (en) * 2020-06-05 2020-08-28 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
CN111931241A (en) * 2020-09-23 2020-11-13 支付宝(杭州)信息技术有限公司 Linear regression feature significance testing method and device based on privacy protection
CN113239336A (en) * 2021-06-02 2021-08-10 西安电子科技大学 Privacy protection biological characteristic authentication method based on decision tree
US20220036250A1 (en) * 2020-07-30 2022-02-03 Huakong Tsingjiao Information Science (Beijing) Limited Method and device for training tree model
US20220230071A1 (en) * 2019-04-30 2022-07-21 Jingdong City (Nanjing) Technology Co., Ltd. Method and device for constructing decision tree
CN115730333A (en) * 2022-11-11 2023-03-03 杭州博盾习言科技有限公司 Security tree model construction method and device based on secret sharing and homomorphic encryption
CN115796276A (en) * 2022-11-30 2023-03-14 绿盟科技集团股份有限公司 Federal learning-based decision tree construction method and device and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MA. CORAZON FERNANDO RAGURO et al.: "Extraction of LMS Student Engagement and Behavioral Patterns in Online Education Using Decision Tree and K-Means Algorithm", ACM, page 138 *
MUBARAK ALBARKA UMAR et al.: "Network Intrusion Detection Using Wrapper-based Decision Tree for Feature Selection", ACM, page 5 *
ZHANG Hongqiang; LIU Guangyuan; LAI Xiangwei: "Application of Random Forest Algorithm in Important Feature Selection of Electromyography", Computer Science, no. 01, pages 200-202 *
WANG Lili; DENG Li; YU; FEI Minrui: "Spark-based Hybrid Feature Selection Method for Tumor Genes", Computer Engineering, no. 11, pages 1-6 *

Also Published As

Publication number Publication date
CN116502255B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
EP3965023A1 (en) Method and device for constructing decision trees
CN112860932B (en) Image retrieval method, device, equipment and storage medium for resisting malicious sample attack
CN111241850B (en) Method and device for providing business model
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN114186694B (en) Efficient, safe and low-communication longitudinal federal learning method
CN109992978A (en) Transmission method, device and the storage medium of information
CN113761577A (en) Big data desensitization method and device, computer equipment and storage medium
CN116303558A (en) Query statement generation method, data query method and generation model training method
CN114676849A (en) Method and system for updating model parameters based on federal learning
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
Li et al. DVPPIR: privacy-preserving image retrieval based on DCNN and VHE
CN116502255B (en) Feature extraction method and device based on secret sharing
CN116401353A (en) Safe multi-hop question-answering method and system combining internal knowledge patterns and external knowledge patterns
CN115983275A (en) Named entity identification method, system and electronic equipment
CN115470361A (en) Data detection method and device
CN113904961B (en) User behavior identification method, system, equipment and storage medium
CN115599345A (en) Application security requirement analysis recommendation method based on knowledge graph
CN115510116A (en) Data directory construction method, device, medium and equipment
Farhaoui et al. Big Data and Smart Digital Environment
CN114329127A (en) Characteristic box dividing method, device and storage medium
CN114338058A (en) Information processing method, device and storage medium
CN112187443A (en) Citizen data cross-domain security joint calculation method and system based on homomorphic encryption
CN116522014B (en) Data processing method and device
CN112580087A (en) Encrypted data searching method and device, storage medium and electronic equipment
CN116756296B (en) Consultation information management method and system based on privacy protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant