CN113792044A - Data fusion platform and neural network model hosting training method - Google Patents


Info

Publication number
CN113792044A
CN113792044A (application CN202110976705.7A)
Authority
CN
China
Prior art keywords
data
model
node
neural network
hosting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110976705.7A
Other languages
Chinese (zh)
Inventor
张金琳
高航
俞学劢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shuqin Technology Co Ltd
Original Assignee
Zhejiang Shuqin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shuqin Technology Co Ltd filed Critical Zhejiang Shuqin Technology Co Ltd
Priority to CN202110976705.7A priority Critical patent/CN113792044A/en
Publication of CN113792044A publication Critical patent/CN113792044A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of machine learning and provides a data fusion platform comprising a data receiving node, a plurality of data storage nodes and a model node. Data lines are stored in a scattered manner; a data demand side sends a data call request, and the data storage nodes substitute the data lines into a data processing model to obtain the model's output. The invention also provides a neural network model hosting training method comprising the following steps: establishing a hosting node; receiving a target neural network model and tokens; obtaining a model number; obtaining an input field set and an output field; obtaining line numbers; sending a data call request to the data storage nodes; substituting the data lines into the submodels, feeding the results back to the hosting node and entering the line numbers into a bill; substituting the results into the main model to obtain a loss value and gradient values; selecting the next line number; and periodically checking the accuracy of the target neural network model. The substantial effects of the invention are as follows: the privacy of the data is protected, and training of the neural network model adapts to data that is obtained gradually.

Description

Data fusion platform and neural network model hosting training method
Technical Field
The invention relates to the technical field of machine learning, in particular to a data fusion platform and a neural network model hosting training method.
Background
Artificial intelligence was proposed as early as the 1960s. However, only in the last few years has it begun to grow rapidly, thanks to the rise in data volume, the increase in computing power, and the advent and development of machine learning. Its application fields keep expanding: expert systems, machine learning, evolutionary computation, fuzzy logic, computer vision, natural language processing, recommendation systems and so on. Machine learning, one of the most fundamental ways of implementing artificial intelligence, uses algorithms to parse data, learn from it, and then make decisions and predictions about events in the real world. Neural network models show good accuracy and robustness when handling control or prediction problems under complex and uncertain conditions, play an important role in solving new problems and improving life and production efficiency, and have become a subject of intense research attracting a large number of scholars and institutions. In practice, however, they face a shortage of data. Owing to business competition, privacy protection and similar reasons, data cannot be widely exchanged between enterprises; although various industries have accumulated large amounts of data, these accumulations have become large data islands. The shortage of data limits the application and further development of machine learning technology, yet a technical scheme for effectively fusing and circulating data is currently lacking.
For example, Chinese patent CN104809597A, published on July 29, 2015, discloses a data resource management platform based on data fusion, comprising the front and rear ends of the data resource management platform, a data resource management platform database, data resources of four data centers, and data resources of each business system of an electric power company, interfacing with the database of each business system of the power company. The data resource management platform database contains the combed data resources of the four data centers and the data resources of each business system of the power company. The front and rear ends of the platform comprise a data resource view module, a data resource application module, a data resource maintenance module, a data resource acquisition module, a data center demand management module, an online data resource operation and maintenance module, and a data interface configuration and management module. That scheme realizes unified management of the data of all business systems and effective management and comprehensive utilization of enterprise-level data resources. However, it is only suitable for data fusion within a single enterprise and cannot solve the technical problem of data fusion between different enterprises.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: at present there is no technical scheme for effectively fusing multi-source data, so training of neural network models lacks sufficient data. A data fusion platform and a neural network model hosting training method are therefore provided: sufficient data is obtained through the data fusion platform, and a new technical scheme is provided so that training of a neural network model can adapt to multi-source data.
In order to solve the above technical problems, the invention adopts the following technical scheme. The data fusion platform comprises a data receiving node, a plurality of data storage nodes and a model node. A data source side and a data demand side register with the data receiving node to obtain a user identifier and a virtual account. The data source side requests a batch number and line numbers from the data receiving node, sends a data introduction text, a quotation and a field structure associated with the batch number to the data receiving node, and sends data line hash values associated with the line numbers to the data receiving node; the data receiving node discloses the introduction text, the quotation, the field structure, the data line hash values and the line numbers. Data lines are stored dispersedly on the data storage nodes. The data demand side submits a data processing model to the model node, which allocates a model number to it, and sends a data call request to the data storage nodes; the data call request comprises a line number and a model number. The plurality of data storage nodes establish multi-party computation, substitute the data line corresponding to the line number into the data processing model to obtain its output, and feed that output back to the data demand side. The data storage nodes enter the line numbers of the data lines substituted into the data processing model into a bill and periodically submit the bill to the data receiving node, which allocates a bill number to the bill and pays the corresponding number of tokens to the data source side out of funds pre-paid by the data demand side.
Preferably, the data receiving node discloses a standard field table; the data source side modifies the field names of the data line's field structure according to the standard field table and submits the field structure to the data receiving node, which allocates field numbers to the data fields. The data demand side specifies a line number and field number for each input of the data processing model and submits them to the data storage nodes, which generate bills separately for the data lines involved in the data processing model.
Preferably, the model node establishes a history table for each data processing model; the history table records the line numbers and corresponding bill numbers of the data lines substituted into the data processing model. The model node periodically extracts a hash value over each history table's newly added records, extracts a hash value again over all the history tables' hash values, and uploads it to the blockchain for storage. Before a data line is substituted into a data processing model, the model node checks whether the line number of the data line already exists in the history table of that data processing model: if so, the data line is not entered into the bill; if not, it is.
Preferably, the data demander generates a substitution number table for each non-numeric field, replaces the real values of the non-numeric fields of the data lines with substitution numbers, and submits the substitution number table to the data receiving node for disclosure; multiple data demanders agree on the same substitution number table for the same non-numeric field.
Preferably, the data demand side generates several copies of a data line, the number of copies being the same as the number of data storage nodes. An obfuscation number is generated for each field; for each field, two copies store the real value and the remaining copies store obfuscation numbers, the obfuscation numbers differing from one another. When the data processing model is executed, the data storage nodes construct a secure multi-party computation, distinguish the real values from the obfuscated values, and restore the real values to substitute into the data processing model.
Preferably, the data demand side generates several copies of a data line, the number of copies being the same as the number of data storage nodes, generates several addends and obfuscation numbers for each field, and allocates the addends and obfuscation numbers among the several copies for storage, the sum of all the obfuscation numbers used being 0. The data processing model is converted into a neural network model, which is split into several submodels and a main model. Each neuron of layer 1 of the neural network model corresponds to one submodel: the inputs of the submodel are the input-layer neurons connected to that layer-1 neuron, and the output of the submodel is the input number of the activation function of the layer-1 neuron, i.e. the weighted sum of the inputs plus the bias. The main model is the neural network model with the input layer deleted and the input of each layer-1 neuron changed to the output of the corresponding submodel. The data demand side submits the several submodels to the several data storage nodes, which each compute the outputs of the submodels and submit them to the data demand side; the data demand side sums the submodel outputs from all the data storage nodes to obtain the final output of each submodel, and substitutes the outputs of all the submodels into the main model to solve the data processing model.
Preferably, when a data storage node receives the several submodels, it performs a privacy security check on them. The privacy security check method is as follows: delete the connections with weight coefficient 0 in the submodel; check whether the output of the submodel involves the connection of only one input-layer neuron; if so, the submodel fails the privacy security check, otherwise it passes. If all submodels pass the privacy security check, the check as a whole is judged to pass, and the submodel outputs are computed and submitted to the data demand side.
The neural network model hosting training method comprises the following steps: establishing a hosting node, which registers with the data receiving node as a data demand side; the hosting node receives a target neural network model and a number of tokens submitted by a hosting user; the hosting node submits the target neural network model to the model node to obtain a model number; the hosting node analyzes the target neural network model to obtain its input field set and output field; the hosting node obtains, from the field structures and line numbers disclosed by the data receiving node, the line numbers of the data lines covering the input field set and output field of the target neural network model; a data call request is sent to the data storage nodes, the data call request comprising a line number, a model number and several submodels; the data storage nodes substitute the corresponding data lines into the submodels, feed the submodel results back to the hosting node, and enter the line numbers into the bill; the hosting node substitutes the submodel outputs into the main model to obtain a loss value and updates the target neural network model, the main model and the submodels using the gradient values, until a training stop condition is reached; the accuracy of the target neural network model is checked periodically, and if it meets a preset condition, the hosting user is informed that the hosted training is finished.
Preferably, the hosting node periodically establishes a backup and records the accuracy of the target neural network model at the time of backup; when the accuracy of the target neural network model is periodically checked, if it is lower than that of the backup, the target neural network model is rolled back to the backup.
Preferably, the hosting user also submits a normalization operator for each field; when the hosting node sends a submodel to a data storage node, it sends the normalization operators for the corresponding fields along with it. The data storage node normalizes the data of the data line using the normalization operators before substituting it into the submodel.
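As a concrete illustration, min-max scaling over a field's published value range could serve as such a per-field normalization operator. The sketch below is illustrative only: the field names, the ranges (age 0-150, deposit balance 0-100 million yuan, taken from the embodiment's example data) and the function names are assumptions, not part of the patent.

```python
def make_minmax_normalizer(lo, hi):
    """Return a normalization operator mapping the range [lo, hi] to [0, 1]."""
    span = hi - lo
    def normalize(x):
        return (x - lo) / span
    return normalize

# Hypothetical operators for two fields of the example data.
normalizers = {
    "age": make_minmax_normalizer(0, 150),
    "deposit_balance": make_minmax_normalizer(0, 1e8),  # 0-100 million yuan
}

# Normalize one data line field-by-field before it enters a submodel.
line = {"age": 30, "deposit_balance": 250000.0}
normalized = {field: normalizers[field](v) for field, v in line.items()}
```

In this sketch the hosting user would ship the `(lo, hi)` pair per field; the data storage node applies the operator locally, so raw values never need to leave it unscaled.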
The substantial effects of the invention are as follows: 1) the data processing model establishes isolation between the data demand side and the data, so that the data demand side cannot obtain the data directly when using it, protecting the privacy of the data and removing the data source side's concern about data privacy; 2) according to the data call records, bills are paid automatically with tokens on the blockchain, guaranteeing the rights and interests of the data source side and encouraging it to provide data; 3) the data is stored scattered across a plurality of data storage nodes, ensuring its safety; 4) with hosted training, the training of a neural network model can adapt to training data that is not all prepared at the same time: training proceeds passively as data enters the fusion platform and runs automatically without manual supervision.
Drawings
Fig. 1 is a schematic structural diagram of a data fusion platform according to an embodiment.
FIG. 2 is a diagram illustrating a history table according to an embodiment.
FIG. 3 is a diagram illustrating a data line storage according to an embodiment.
FIG. 4 is a diagram illustrating a neural network model according to an embodiment.
Fig. 5 is a schematic flowchart of a neural network model hosting training method according to an embodiment.
Fig. 6 is a schematic backup flow chart according to an embodiment.
Wherein: 10. data source side, 11. introduction text, 12. quotation, 13. field structure, 14. line number, 20. data receiving node, 21. bill number, 30. data storage node, 31. copy, 40. model node, 50. data demander, 51. data processing model, 52. data call request, 53. history table, 60. blockchain, 70. data line, 71. addend, 72. obfuscation number, 511. layer 0, 512. layer 1, 513. output layer.
Detailed Description
The following provides a more detailed description of the present invention, with reference to the accompanying drawings.
The first embodiment is as follows:
referring to fig. 1, the data fusion platform includes a data receiving node 20, a plurality of data storage nodes 30, and a model node 40. A data source 10 and a data demander 50 register with the data receiving node 20 to obtain a user identifier and a virtual account. The data source 10 requests a batch number and line numbers 14 from the data receiving node 20, sends a data introduction text 11, a quotation 12 and a field structure 13 associated with the batch number to the data receiving node 20, and sends hash values of data lines 70 associated with the line numbers 14 to the data receiving node 20; the data receiving node 20 discloses the introduction text 11, the quotation 12, the field structure 13, the data line 70 hash values and the line numbers 14. The data lines 70 are stored dispersedly on the plurality of data storage nodes 30. The data demander 50 submits a data processing model 51 to the model node 40, which allocates a model number to it, and sends a data call request 52 to the data storage nodes 30; the data call request 52 comprises a line number 14 and a model number. The plurality of data storage nodes 30 establish multi-party computation, substitute the data line 70 corresponding to the line number 14 into the data processing model 51 to obtain its output, and feed that output back to the data demander 50. The data storage nodes 30 enter the line numbers 14 of the data lines 70 substituted into the data processing model 51 into a bill and periodically submit the bill to the data receiving node 20, which allocates a bill number 21 to the bill and pays the corresponding number of tokens to the data source 10 out of funds pre-paid by the data demander 50.
As shown in table 1, the introduction text 11 submitted by Bank A records information about the data. The data fields are used for matching with data demanders 50: when the data fields contain data that a data demander 50 needs, the data demander 50 may initiate a data call request 52, so that the data generates value.
Table 1 Introduction text 11 submitted by Bank A

Source: a branch of Bank A
Content: card-usage statistics over the last two years for the branch's ordinary user accounts
Data volume: 100,000 rows
Data fields: name | age | deposit balance | monthly income in the last two years | annual income in the last two years | monthly consumption in the last two years | annual consumption in the last two years | academic degree | loan
Introduction: the data is produced by simple statistics over the card flow data of ordinary-account users of the branch. The branch is located on XX road, XX district, XX city; its users are mainly residents within 8 km and employees of nearby companies holding payroll cards. Users with extremely low account activity have been screened out, where extremely low means fewer than 10 transactions per year and less than 1,000 yuan of flow in total. The data is complete with no missing values, and the card usage is real. The fields include name, age, deposit balance, monthly and annual income in the last two years, monthly and annual consumption in the last two years, academic degree, and loan data. Here, consumption means money paid to goods and service providers; transfers between a user's own cards, credit-card repayments, loan repayments and purchases of financial products are not counted in monthly or annual consumption. The value range of the deposit balance is 0-100 million yuan, of age 0-150, of monthly income 0-10 million yuan, of annual income 0-100 million yuan, of monthly consumption 0-10 million yuan, and of annual consumption 0-100 million yuan …
Data from multiple data sources often has heterogeneity problems. In this embodiment, the data processing model 51 is used to solve them: according to the field structures 13 disclosed in the introduction texts 11, the data demander 50 creates different data processing models 51 for different data sources so as to obtain the required output.
Further, the data receiving node 20 of this embodiment discloses a standard field table. The data source side 10 modifies the field names of the field structure 13 of the data lines 70 according to the standard field table and submits the field structure 13 to the data receiving node 20, which assigns a field number to each data field. The data demander 50 specifies a line number 14 and field number for each input of the data processing model 51 and submits them to the data storage nodes 30, which generate bills separately for the data lines 70 involved in the data processing model 51.
The model node 40 creates a history table 53 for each data processing model 51; referring to fig. 2, the history table 53 records the line numbers 14 of the data lines 70 substituted into the data processing model 51 and the corresponding bill numbers 21. The model node 40 periodically extracts a hash value over each history table 53's newly added records, extracts a hash value again over all the history tables' hash values, and uploads it to the blockchain 60 for storage. Before substituting a data line 70 into a data processing model 51, the data storage node 30 checks with the model node 40 whether the line number 14 of the data line 70 already exists in the history table 53 of that data processing model 51: if so, the data line 70 is not entered into the bill; if not, it is.
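The periodic two-level hashing of the history tables can be sketched as follows. This is an illustrative Python sketch: the record serialization and the use of SHA-256 are assumptions, since the patent does not fix a hash algorithm or record layout.

```python
import hashlib

def hash_new_records(records):
    """Hash the newly added (line number, bill number) records of one history table."""
    payload = "|".join(f"{line},{bill}" for line, bill in records).encode()
    return hashlib.sha256(payload).hexdigest()

def period_commitment(all_new_records):
    """Hash each history table's new records, then hash the concatenation of
    those per-table hashes; this final digest is what would be uploaded to
    the blockchain for the period."""
    table_hashes = [hash_new_records(r) for r in all_new_records]
    return hashlib.sha256("".join(table_hashes).encode()).hexdigest()

# Two history tables with this period's new (line number, bill number) records.
digest = period_commitment([[(101, "B7"), (102, "B7")], [(55, "B8")]])
```

Because only the final digest is stored on-chain, any later tampering with a period's billing records changes a per-table hash and therefore the committed digest.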
The data demander 50 generates a substitution number table for each non-numeric field, replaces the real values of the non-numeric fields of the data lines 70 with substitution numbers, and submits the substitution number table to the data receiving node 20 for disclosure; multiple data demanders 50 agree on the same substitution number table for the same non-numeric field. As shown in table 2, the substitution number table used in this embodiment replaces the values of the academic degree field with the numbers 0-3, converting the non-numeric field into a numeric field. Text fields with an indefinite value range, such as detailed addresses or remarks, must be deleted: their value ranges are indefinite, they sometimes allow null values, and it is difficult to extract valuable information from them, so deletion is adopted.
Table 2 A substitution number table used in this embodiment

Non-numeric field value      Substitution number
High school or below         0
Junior college               1
Bachelor's degree            2
Graduate degree              3
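A substitution number table like Table 2 amounts to a dictionary lookup applied before storage. The sketch below is illustrative; the field name, dictionary keys and function name are hypothetical.

```python
# Hypothetical substitution number table for the academic-degree field (cf. Table 2).
DEGREE_TABLE = {
    "high school or below": 0,
    "junior college": 1,
    "bachelor's degree": 2,
    "graduate degree": 3,
}

def apply_substitution(data_line, field, table):
    """Replace the real value of a non-numeric field with its substitution number."""
    out = dict(data_line)              # leave the original line untouched
    out[field] = table[data_line[field]]
    return out

line = {"age": 35, "degree": "bachelor's degree", "deposit_balance": 120000}
numeric_line = apply_substitution(line, "degree", DEGREE_TABLE)
```

Since all data demanders agree on the same table for the same field, lines from different sources remain comparable after substitution.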
The data demand side 50 generates several copies 31 of a data line 70, the number of copies 31 being the same as the number of data storage nodes 30. An addend 71 and an obfuscation number 72 are generated for each field; two copies 31 store the real value of each field, and the remaining copies store obfuscation numbers 72, the obfuscation numbers 72 differing from one another. When the data processing model 51 is executed, the data storage nodes 30 construct a secure multi-party computation, distinguish the real values from the obfuscated values, and restore the real values to substitute into the data processing model 51.
As an alternative, referring to fig. 3, the data demander 50 generates several copies 31 of a data line 70, the number of copies 31 being the same as the number of data storage nodes 30, generates several addends 71 and obfuscation numbers 72 for each field, and allocates the addends 71 and obfuscation numbers 72 among the several copies for storage, the sum of all the obfuscation numbers 72 used being 0. The data processing model 51 is converted into a neural network model, which is split into several submodels and a master model. Referring to fig. 4, each neuron of layer 1 512 of the neural network model corresponds to one submodel: the inputs of the submodel are the layer 0 511 neurons connected to that layer-1 neuron, and its output is the input number of the activation function of the layer-1 neuron, i.e. the weighted sum of the inputs plus the bias. The master model is the neural network model with layer 0 511 deleted and the input of each layer-1 neuron changed to the output of the corresponding submodel. The data demander 50 submits the several submodels to the several data storage nodes 30, which each compute the submodel outputs and submit them to the data demander 50; the data demander 50 sums the submodel outputs from all the data storage nodes 30 to obtain the final output of each submodel, and substitutes the outputs of all the submodels into the master model to solve the data processing model 51. The input number of a neuron equals the weighted sum of the output values of the previous-layer neurons connected to it, plus the bias value.
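The split can be illustrated on a toy network: each layer-1 neuron's submodel computes only the weighted sum of its layer-0 inputs plus the bias (the pre-activation), while the master model applies the activation function and the remaining layers. The weights, the tanh activation and all names below are illustrative assumptions, not taken from the patent.

```python
import math

# Toy network: 3 layer-0 inputs -> 2 layer-1 neurons -> 1 output.
W1 = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.2]]   # layer-1 weights
b1 = [0.1, -0.1]                            # layer-1 biases
W2 = [0.7, -0.3]                            # output-layer weights
b2 = 0.05

def submodel(j, x):
    """Submodel for layer-1 neuron j: weighted sum of layer-0 inputs plus
    the bias, i.e. the input number of the neuron's activation function."""
    return sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]

def master_model(pre_activations):
    """Master model: apply the activation to the submodel outputs, then
    evaluate the remaining layers of the network."""
    h = [math.tanh(z) for z in pre_activations]
    return sum(w * hi for w, hi in zip(W2, h)) + b2

def full_model(x):
    """Submodels plus master model together reproduce the original network."""
    return master_model([submodel(j, x) for j in range(2)])
```

The storage nodes only ever evaluate `submodel`, which is linear in the inputs; the nonlinearity and later layers stay with the demander inside `master_model`.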
Each addend is weighted and summed, and the results are then summed across copies. Because the obfuscation numbers 72 used for each field sum to 0, the obfuscation terms cancel, and summing the weighted sums of the several addends gives the same result as the weighted sum of the real values directly, completing the computation of the submodel.
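The cancellation can be demonstrated numerically: if each storage node computes the weighted sum over its own shares and the demander sums the partial results (adding the bias once), the obfuscation terms vanish and the true submodel output is recovered. This is an illustrative sketch under the patent's stated property (obfuscation numbers summing to 0); the share construction and all names are assumptions.

```python
import random

def make_shares(value, n):
    """Split a field value into n shares: addends that sum to the value plus
    obfuscation numbers that sum to 0, one share per copy/storage node."""
    addends = [random.uniform(-1.0, 1.0) for _ in range(n - 1)]
    addends.append(value - sum(addends))          # addends sum to the value
    obfuscation = [random.uniform(-1.0, 1.0) for _ in range(n - 1)]
    obfuscation.append(-sum(obfuscation))         # obfuscation numbers sum to 0
    return [a + o for a, o in zip(addends, obfuscation)]

weights, bias = [0.5, -1.2], 0.3                  # toy submodel parameters
real_inputs = [4.0, 2.5]
n_nodes = 3

# Each node k computes the weighted sum over its own shares only.
shares = [make_shares(v, n_nodes) for v in real_inputs]
node_partials = [sum(w * s[k] for w, s in zip(weights, shares))
                 for k in range(n_nodes)]

# The demander sums the partial results and adds the bias once;
# the obfuscation terms cancel, recovering the true submodel output.
recovered = sum(node_partials) + bias
direct = sum(w * v for w, v in zip(weights, real_inputs)) + bias
```

No single node's shares reveal the real values, yet the linearity of the submodel lets the demander recombine the partial sums exactly.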
When a data storage node 30 receives the several submodels, it performs a privacy security check on them as follows: delete the connections with weight coefficient 0 in the submodel; check whether the output of the submodel involves the connection of only one layer 0 511 neuron; if so, the submodel fails the privacy security check, otherwise it passes. If all submodels pass, the privacy security check as a whole is judged to pass, and the submodel outputs are computed and delivered to the data demander 50. In fig. 4, the layer 0 511 and layer 1 512 neurons connected to the function f1 constitute a submodel, but its output involves only one layer 0 511 neuron, so the privacy security check fails. If only the submodels formed by the functions f2, f3 and f4 in fig. 4 are considered, the privacy security check passes. Substituting the submodel outputs into the functions f5 and f6 in fig. 4 yields the output layer 513 of the data processing model 51.
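The check itself reduces to counting the non-zero weights of a submodel: after zero-weight connections are deleted, more than one layer-0 neuron must remain involved, otherwise the submodel's output would reveal a single input value. A minimal sketch, with hypothetical names:

```python
def privacy_check(weight_vector):
    """Privacy security check for one submodel: delete connections whose
    weight coefficient is 0, then require that more than one layer-0
    neuron remains involved in the submodel's output."""
    effective = [w for w in weight_vector if w != 0]
    return len(effective) > 1

# Like f1 in fig. 4: only one layer-0 neuron is effectively connected,
# so the submodel's output would expose that single input value.
leaky = privacy_check([0.7, 0.0, 0.0])      # fails the check
# A submodel mixing two or more inputs passes.
safe = privacy_check([0.2, -0.5, 0.0])
```

A node would run this over every received submodel and compute outputs only if all of them pass.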
Referring to fig. 5, the neural network model hosting training method includes the following steps:
step A01) establishing a hosting node, the hosting node registering with the data receiving node 20 as a data demander 50;
step A02) the hosting node receiving a target neural network model and a number of tokens submitted by a hosting user;
step A03) the hosting node submitting the target neural network model to the model node 40 to obtain a model number;
step A04) the hosting node analyzing the target neural network model to obtain an input field set and an output field;
step A05) the hosting node obtaining, from the field structure 13 and line numbers 14 disclosed by the data receiving node 20, the line numbers 14 of the data lines 70 covering the input field set and output field of the target neural network model;
step A06) sending a data call request 52 to the data storage nodes 30, the data call request 52 including a line number 14, a model number and a number of submodels;
step A07) the data storage nodes 30 substituting the corresponding data lines 70 into the submodels, feeding the submodel results back to the hosting node, and entering the line numbers 14 on the bill;
step A08) the hosting node substituting the outputs of the submodels into the main model to obtain a loss value, and updating the target neural network model, the main model and the submodels using the gradient values until a training stop condition is reached;
step A09) periodically checking the accuracy of the target neural network model, and if the accuracy meets a preset condition, notifying the hosting user that the hosted training is finished.
For example, a customer churn early-warning model established by bank A is used to judge whether a customer is at risk of churning, so as to remind bank staff to contact the customer in time, learn about their business situation, and win back departing customers. In this embodiment, customer churn means that a customer terminates all business with the bank and cancels the account; a particular business department may, however, separately define churn as the customer terminating all or some of the business handled by that department. This embodiment takes early warning of credit card customer churn as an example. Because credit cards carry an annual fee, once a customer no longer uses a bank's credit card, the customer usually cancels the card promptly. In this embodiment, a customer who cancels the credit card is therefore regarded as a churned customer. The input fields of the neural network are: age, gender, education level, city of residence, average interval between credit card transactions, average credit card transaction amount, number of credit card promotional campaigns joined, whether the customer holds a debit card at the same bank, and the time elapsed since the last credit card transaction. Bank A, bank B and bank C all hold data covering these input fields.
Bank A's credit card business has run for only a short time, so it has little credit card customer data and cannot train the customer churn early-warning model effectively. Bank A therefore establishes the structure of the target neural network model, determines the initial weight coefficients, the loss function and the gradient function, submits the target neural network model to the hosting node, and submits part of its existing credit card customer data to the hosting node as test data. If bank A submits no test data, the hosting node designates several data lines 70 from the data source party 10 for the accuracy test of the target neural network.
The hosting node analyzes the target neural network to obtain its input fields and output field, where the output field is whether the customer has cancelled the card. Bank B uploads its data to the data receiving node 20 to share it while earning revenue; pricing for the use of a single data row 70 is not discussed in this embodiment. After receiving the data submitted by bank B, the data receiving node 20 discloses its field structure 13 and line numbers 14. The hosting node queries the field structure 13 and finds that the data submitted by bank B can be used for training the target neural network, and with the disclosed line numbers 14 it can designate data lines 70 precisely. The hosting node requests a call to the data line 70 with line number 14 by sending line number 14 to the data storage nodes 30. The data storage nodes 30 restore the credit card data provided by bank B, which is stored in a scattered manner among them, and substitute the restored data into the target neural network model to obtain the model's output. This output is compared with the record in the data line 70 of whether the customer cancelled the card to obtain the loss function value, and from it a gradient value; the weight coefficients of the target neural network model are updated using the gradient value. The line number 14 of the next data line 70 is then designated and a call request is again sent to the data storage nodes 30. In this way the target neural network is continuously trained and optimized. When all the data provided by bank B has been substituted, a test is carried out, and the result shows that the preset accuracy has still not been reached, i.e. more data is needed for training.
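The per-row cycle described here (substitute a restored row, compare with the churn label, update by the gradient) can be illustrated with a toy logistic model; the model choice, learning rate and names are assumptions for illustration, not the patent's method:

```python
import math

def sgd_step(weights, bias, row, label, lr=0.1):
    """One hosted training step on a single restored data row.

    Returns updated weights, updated bias, and the log loss before the
    update (logistic model used purely as an illustration)."""
    z = sum(w * x for w, x in zip(weights, row)) + bias
    pred = 1.0 / (1.0 + math.exp(-z))
    loss = -(label * math.log(pred + 1e-12)
             + (1 - label) * math.log(1 - pred + 1e-12))
    grad = pred - label                        # dLoss/dz for log loss
    new_w = [w - lr * grad * x for w, x in zip(weights, row)]
    return new_w, bias - lr * grad, loss

# Repeatedly calling a churned customer's row drives the loss down.
w, b = [0.0, 0.0], 0.0
losses = []
for _ in range(50):
    w, b, loss = sgd_step(w, b, [1.0, 2.0], 1.0)
    losses.append(loss)
assert losses[-1] < losses[0]
```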
After some time, bank C uploads a batch of credit card customer data that again contains all the fields of the target neural network model. The hosting node calls this data for training in the manner described above. After all the data submitted by bank C has been called, the accuracy is tested and found to meet the preset accuracy requirement, and the target neural network model is delivered to the hosting user. If the hosting user wants to continue hosted training to further improve prediction accuracy, the hosting node can keep waiting for new eligible data to be submitted to the data receiving node 20.
To improve the training effect of the neural network model, the hosting node further performs the following steps, referring to fig. 6: step B01) the hosting node periodically creates a backup and records the accuracy of the target neural network model at the time of backup; step B02) periodically checking the accuracy of the target neural network model; step B03) rolling the target neural network model back to the backup if the accuracy is lower than the accuracy recorded at the backup. Only good-quality data can improve the accuracy of the target neural network model, but there is still no accurate way to judge data quality before training. This embodiment therefore verifies after substitution: if poor data substituted into the target neural network model causes its accuracy to stagnate or fall, the model is returned to the backup point. In this embodiment such data is not paid for; a specific embodiment may change this to full or discounted payment as required.
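Steps B01-B03 amount to checkpointing with conditional rollback. A minimal sketch, with illustrative class and method names:

```python
import copy

class CheckpointedModel:
    """Keep a backup and roll back when newly substituted data lowers
    accuracy (sketch of steps B01-B03; names are illustrative)."""

    def __init__(self, weights):
        self.weights = weights
        self.backup = None
        self.backup_acc = -1.0

    def make_backup(self, accuracy):
        """B01: snapshot the model and record its accuracy."""
        self.backup = copy.deepcopy(self.weights)
        self.backup_acc = accuracy

    def check(self, accuracy):
        """B02/B03: roll back if accuracy fell below the backup's.

        Returns True when a rollback happened; the rows that caused it
        would not be paid for in this embodiment."""
        if self.backup is not None and accuracy < self.backup_acc:
            self.weights = copy.deepcopy(self.backup)
            return True
        return False

m = CheckpointedModel([0.1, 0.2])
m.make_backup(accuracy=0.8)
m.weights = [9.9, 9.9]          # poor data degraded the model
assert m.check(accuracy=0.7) is True and m.weights == [0.1, 0.2]
assert m.check(accuracy=0.85) is False
```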
The hosting user also submits a normalization operator for each field. When the hosting node sends a submodel to a data storage node 30, it simultaneously sends the normalization operators for the corresponding fields; the data storage node 30 normalizes the data of the data line 70 using the operators and substitutes the normalized data into the submodel.
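A per-field normalization operator of the kind the hosting user might submit could look like this; min-max scaling and all field names here are illustrative assumptions:

```python
def make_minmax_operator(lo, hi):
    """Build a per-field normalization operator (min-max scaling is just
    one possible choice the hosting user might supply)."""
    span = hi - lo
    return lambda x: (x - lo) / span if span else 0.0

# The storage node applies each field's operator before substituting
# the row into the submodel.
operators = {"age": make_minmax_operator(18, 80),
             "avg_amount": make_minmax_operator(0, 50000)}
row = {"age": 49, "avg_amount": 25000}
normalized = {f: operators[f](v) for f, v in row.items()}
assert normalized == {"age": 0.5, "avg_amount": 0.5}
```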
The beneficial technical effects of this embodiment are: the data demander 50 cannot obtain the data directly when using it, which protects data privacy and removes the data source party's 10 concerns about privacy protection. Bills are paid automatically with tokens on the block chain 60 according to the data call records, which safeguards the rights and interests of the data source party 10 and increases its willingness to provide data. The data is scattered across a plurality of data storage nodes 30 for storage, ensuring its security. Hosting the training of the neural network model allows training to adapt to training data that is not all available at the same time: training proceeds passively as data enters the fusion platform and runs automatically without manual supervision.
The above embodiment is only a preferred embodiment of the present invention and is not intended to limit it in any way; other variations and modifications may be made without departing from the technical scope set forth in the claims.

Claims (10)

1. A data fusion platform is characterized in that,
comprising a data receiving node, a plurality of data storage nodes and a model node, wherein a data source side and a data demand side register with the data receiving node to obtain a user identifier and a virtual account; the data source side requests a batch number and line numbers from the data receiving node; the data source side sends a data introduction text, a quotation and a field structure associated with the batch number to the data receiving node; the data source side sends data line hash values associated with line numbers to the data receiving node; the data receiving node discloses the introduction text, the quotation, the field structure, the data line hash values and the line numbers; the data demand side dispersedly stores data lines on the data storage nodes; the data demand side submits a data processing model to the model node, and the model node allocates a model number to the data processing model; the data demand side sends a data call request to the data storage nodes, the data call request comprising a line number and a model number; a plurality of data storage nodes establish multi-party computation, substitute the data line corresponding to the line number into the data processing model to obtain the output of the data processing model, and feed the output of the data processing model back to the data demand side; the data storage nodes enter the line numbers of the data lines substituted into the data processing model on a bill and periodically submit the bill to the data receiving node; and the data receiving node allocates a bill number to the bill and pays the corresponding number of tokens to the data source side using spare money prepaid by the data demand side.
2. The data fusion platform of claim 1,
the data receiving node discloses a standard field table; the data source side modifies the field names of the data line field structure according to the standard field table and submits the field structure to the data receiving node; the data receiving node allocates field numbers to the data fields; the data demand side specifies a line number and field numbers for each input of the data processing model and submits them to the data storage nodes; and the data storage nodes respectively generate bills for the data lines involved in the data processing model.
3. The data fusion platform of claim 2,
the model node establishes a history record table for each data processing model, the records of which are the line numbers of the data lines substituted into the data processing model and the corresponding bill numbers; the model node periodically extracts a hash value of the records newly added to each history record table in the period, extracts a further hash value over the hash values of all the history record tables together, and uploads it to the block chain for storage; before a data line is substituted into a data processing model, the model node checks whether the line number of the data line exists in the history record table of the corresponding data processing model, and if so, the data line is not entered on the bill; otherwise it is entered on the bill.
4. The data fusion platform of any one of claims 1 to 3,
the data demand side generates a substitution number table for each non-numerical field; the real values of the non-numerical fields of the data rows are replaced by substitution numbers; the substitution number table is submitted to the data receiving node for disclosure; and data demand sides agree on the same substitution number table for the same non-numerical field.
5. The data fusion platform of claim 4,
the data demand side generates a plurality of copies for a data line, the number of copies being the same as the number of data storage nodes, and generates confusion numbers for each field; for each field, two copies store the real value and the remaining copies store confusion numbers that differ from one another; when the data processing model is executed, the data storage nodes construct secure multi-party computation, distinguish the real values from the confusion values, and restore the real values for substitution into the data processing model.
6. The data fusion platform of claim 4,
the data demand side generates a plurality of copies for a data line, the number of copies being the same as the number of data storage nodes; a plurality of addends and confusion numbers are generated for each field and distributed among the plurality of copies for storage, the sum of all used confusion numbers being 0; the data processing model is converted into a neural network model, and the neural network model is split into a plurality of submodels and a main model; each neuron of layer 1 of the neural network model corresponds to one submodel, the inputs of the submodel being the layer 0 neurons connected to that layer 1 neuron, and the output of the submodel being the input number of the activation function of the layer 1 neuron, namely the sum of the weighted sum of the inputs and the offset number; the main model is the neural network model with layer 0 deleted and the inputs of the layer 1 neurons changed to the outputs of the corresponding submodels; the data demand side submits the plurality of submodels to the plurality of data storage nodes, the plurality of data storage nodes respectively calculate the outputs of the submodels and submit the obtained outputs to the data demand side, and the data demand side sums the outputs of the submodels of all the data storage nodes to obtain the final output of each submodel and substitutes the outputs of all the submodels into the main model to solve the data processing model.
7. The data fusion platform of claim 6,
when the data storage node receives the plurality of sub-models, privacy security check is carried out on the plurality of sub-models, and the privacy security check method comprises the following steps:
deleting the connection with the weight coefficient of 0 in the sub-model;
checking whether the output of the submodel involves connections from only one layer 0 neuron, and if so, the submodel fails the privacy security check, otherwise it passes;
and if all submodels pass the privacy security check, judging that the privacy security check passes, calculating the outputs of the submodels, and submitting them to the data demand side.
8. A neural network model hosting training method for hosting training of a neural network model on the data fusion platform according to any one of claims 1 to 7,
the method comprises the following steps:
establishing a hosting node, wherein the hosting node is registered as a data demand party with a data receiving node;
the hosting node receives a target neural network model and a plurality of tokens submitted by a hosting user;
the hosting node submits the target neural network model to the model node to obtain a model number;
the hosting node analyzes the target neural network model to obtain an input field set and an output field;
the hosting node obtains the line number of the data line covering the input field set and the output field of the target neural network model according to the field structure and the line number disclosed by the data receiving node;
sending a data calling request to a data storage node, wherein the data calling request comprises a line number, a model number and a plurality of sub-models;
the data storage nodes substituting the corresponding data rows into the submodels, feeding the submodel results back to the hosting node, and entering the row numbers on the bill;
the hosting node substitutes the output of the sub-model into the main model to obtain a loss value, and updates the target neural network model, the main model and the sub-model by using the gradient value until a training stopping condition is reached;
and periodically checking the accuracy of the target neural network model, and if the accuracy meets a preset condition, informing the hosting user that the hosting training is finished.
9. The neural network model hosting training method of claim 8,
the hosting node periodically establishes backup and records the accuracy of a target neural network model at the backup position;
and when periodically checking the accuracy of the target neural network model, if the accuracy is lower than the accuracy recorded at the backup, rolling the target neural network model back to the backup.
10. The neural network model hosting training method of claim 9,
the hosting user also submits a normalization operator for each field, and when the hosting node sends a submodel to a data storage node, it also sends the normalization operators for the corresponding fields to the data storage node;
and the data storage nodes normalize the data of the data rows by using a normalization operator and then substitute the data into the sub-model.
CN202110976705.7A 2021-08-24 2021-08-24 Data fusion platform and neural network model hosting training method Pending CN113792044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110976705.7A CN113792044A (en) 2021-08-24 2021-08-24 Data fusion platform and neural network model hosting training method


Publications (1)

Publication Number Publication Date
CN113792044A true CN113792044A (en) 2021-12-14

Family

ID=79182003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110976705.7A Pending CN113792044A (en) 2021-08-24 2021-08-24 Data fusion platform and neural network model hosting training method

Country Status (1)

Country Link
CN (1) CN113792044A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490779A (en) * 2022-02-21 2022-05-13 中航信移动科技有限公司 Server
CN114490779B (en) * 2022-02-21 2022-11-04 中航信移动科技有限公司 Server

Similar Documents

Publication Publication Date Title
CN109034915B (en) Artificial intelligent electronic commerce system capable of using digital assets or points as transaction media
JP4701510B2 (en) Apparatus and method for aggregating transaction information relating to financial transactions
CN103123712A (en) Method and system for monitoring network behavior data
CN113268760B (en) Distributed data fusion platform based on block chain
CN108734457A (en) A kind of Withdrawing method unified under cash register system
CN111476460B (en) Method, equipment and medium for intelligent operation scheduling of self-service equipment of bank
CN113343284B (en) Private data sharing method based on block chain
CN108830697A (en) A kind of industry wealth integral system and method
CN108765106A (en) A kind of integrated financial affairs receipt generation method of industry wealth
Zhang et al. An equity fund recommendation system by combing transfer learning and the utility function of the prospect theory
CN108711045A (en) A kind of cash register system and cash method
CN115456745A (en) Small and micro enterprise portrait construction method and device
CN113792044A (en) Data fusion platform and neural network model hosting training method
US20180053204A1 (en) Auto-population of discount information into an e-invoice
CN117094764A (en) Bank integral processing method and device
CN108762727A (en) A kind of event driven Financial information processing method and system
CN108694660A (en) A kind of industry wealth integration account checking method
CN111708897A (en) Target information determination method, device and equipment
CN115719275A (en) Risk monitoring method and device based on transaction data and electronic equipment
CN108765108A (en) A kind of financial data system and method under industry wealth integration
CN108765107A (en) A kind of data save method under industry wealth integration
CN114049213A (en) Informatization financial data analysis system and analysis method
CN111079992A (en) Data processing method, device and storage medium
CN113792873A (en) Neural network model trusteeship training system based on block chain
US20230196453A1 (en) Deduplication of accounts using account data collision detected by machine learning models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination