WO2019200738A1 - Method, apparatus, computer device and storage medium for data feature extraction - Google Patents

Method, apparatus, computer device and storage medium for data feature extraction

Info

Publication number
WO2019200738A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
calculate
input
original
Prior art date
Application number
PCT/CN2018/095388
Other languages
English (en)
French (fr)
Inventor
王义文
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200738A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/38Payment protocols; Details thereof
    • G06Q20/382Payment protocols; Details thereof insuring higher security of transaction
    • G06Q20/3829Payment protocols; Details thereof insuring higher security of transaction involving key management

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method, an apparatus, a computer device, and a storage medium for extracting data features.
  • Blockchain is a decentralized, trust-free new data architecture that is owned, managed, and supervised by all nodes in the network and is not controlled by any single party.
  • Blockchain is a newly emerging technology, and enterprises are carrying out early-stage research, development and planning around it. Analyzing the data on the blockchain is therefore a necessary process, but as the volume of data on the blockchain grows, how to quickly extract the feature data of the original data on the blockchain is an urgent problem to be solved.
  • the main purpose of the present application is to provide a method, an apparatus, a computer device and a storage medium for data feature extraction, which are intended to quickly extract feature data of original data on a blockchain.
  • the present application provides a data feature extraction method for performing data feature extraction on data on a blockchain, the method comprising:
  • the raw data is input into a CCIPCA algorithm to calculate feature data of the original data.
  • the present application further provides an apparatus for extracting data features for performing data feature extraction on data on a blockchain, the apparatus comprising:
  • An obtaining unit configured to obtain raw data on a blockchain
  • a feature extraction unit configured to input the original data into a CCIPCA algorithm to calculate feature data of the original data.
  • The present application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, wherein the processor implements the steps of any of the above methods when executing the computer readable instructions.
  • The present application also provides a non-transitory computer readable storage medium having computer readable instructions stored thereon, wherein the computer readable instructions, when executed by a processor, implement the steps of any of the above methods.
  • The method, apparatus, computer device and storage medium for data feature extraction of the present application exploit the fact that data downloaded from the blockchain cannot be tampered with, so no outlier (discrete point) processing is performed during data feature extraction.
  • Data features are extracted directly with the CCIPCA algorithm, which makes data feature extraction faster.
  • FIG. 1 is a schematic flowchart diagram of a method for extracting data features according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for data feature extraction according to an embodiment of the present application
  • FIG. 3 is a schematic flowchart diagram of a method for extracting data features according to an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for data feature extraction according to an embodiment of the present application.
  • FIG. 5 is a schematic structural block diagram of an apparatus for extracting data features according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram showing the structure of a feature extraction unit according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram showing the structure of a feature extraction unit according to an embodiment of the present application.
  • FIG. 8 is a schematic block diagram showing the structure of a feature extraction unit according to an embodiment of the present application.
  • FIG. 9 is a schematic structural block diagram of an apparatus for extracting data features according to an embodiment of the present application.
  • FIG. 10 is a schematic structural block diagram of an apparatus for extracting data features according to an embodiment of the present application.
  • FIG. 11 is a schematic block diagram showing the structure of an apparatus for extracting data features according to an embodiment of the present application.
  • FIG. 12 is a schematic block diagram showing the structure of a computer device according to an embodiment of the present application.
  • an embodiment of the present application provides a data feature extraction method for performing data feature extraction on data on a blockchain, where the method includes:
  • S2 Input the original data into a CCIPCA algorithm to calculate feature data of the original data.
  • the above-mentioned original data refers to data directly downloaded from the blockchain, and data that has not undergone any data processing.
  • The method of obtaining the original data from the blockchain includes entering a search term, such as a keyword of the data to be downloaded, and then downloading the data related to that search term.
  • block downloading may also be set, that is, as long as there is data update in the designated block, the updated data is downloaded to achieve high efficiency of real-time analysis processing.
  • the above block refers to a block in a specified area or an enterprise.
  • The CCIPCA (Candid Covariance-free Incremental Principal Component Analysis) algorithm is relatively sensitive to abnormal points in the data stream, and its dimensionality reduction accuracy is greatly affected by such points.
  • the feature that the data on the blockchain is not falsified is fully utilized. Therefore, the process of outlier processing is not required before the dimension reduction by the CCIPCA algorithm, and the efficiency of extracting data features is improved.
  • $A = E\{u(n)u^{T}(n)\}$ is the $d \times d$ covariance matrix, where $T$ denotes matrix transposition.
  • $u_1(n) = u(n)$
  • $u_2(n)$ is used as the input to the next iteration.
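The fragments above match the usual presentation of CCIPCA, so the iteration can be illustrated with a minimal sketch. The update rule below follows the commonly published form of the algorithm rather than anything stated in this text, and the function name `ccipca`, the `amnesic` parameter and the assumption of zero-mean samples are all illustrative choices.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ccipca(samples, k, amnesic=0.0):
    """Incrementally estimate the first k eigenvectors of zero-mean samples.

    Returns unnormalized estimates; the direction of each one approximates
    an eigenvector of the covariance matrix A = E{u u^T}.
    """
    v = []
    for n, sample in enumerate(samples, start=1):
        u = list(sample)                       # u_1(n) = u(n)
        for i in range(min(k, n)):
            if i == n - 1:
                v.append(u[:])                 # first estimate: the current residual
                break
            norm = math.sqrt(dot(v[i], v[i]))
            if norm == 0.0:
                v[i] = u[:]                    # degenerate estimate, restart it
                break
            proj = dot(u, v[i]) / norm
            w_old = (n - 1 - amnesic) / n
            w_new = (1 + amnesic) / n
            v[i] = [w_old * vj + w_new * proj * uj for vj, uj in zip(v[i], u)]
            new_norm = math.sqrt(dot(v[i], v[i]))
            if new_norm == 0.0:
                break
            # Residual u_{i+1}(n): remove the component along the refreshed v_i.
            coeff = dot(u, v[i]) / new_norm
            u = [uj - coeff * vj / new_norm for uj, vj in zip(u, v[i])]
    return v
```

No covariance matrix is ever formed: each sample refreshes the estimates directly, and the residual left after one eigenvector becomes the input for the next.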
  • the step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes:
  • the windowing process refers to adding a sliding window to the data for discarding part of the historical data, and processing only the data in the sliding window, so that the present application pays more attention to the feature extraction of the new data.
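A sliding window of this kind can be sketched as follows. The helper name `sliding_windows` and the `window_size` parameter are hypothetical; the text does not specify how large the window is or how it advances.

```python
from collections import deque


def sliding_windows(stream, window_size):
    """Yield the most recent `window_size` items after each new item arrives."""
    window = deque(maxlen=window_size)
    for item in stream:
        window.append(item)   # once full, the oldest item is discarded automatically
        yield list(window)
```

Each yielded window would then be the only data passed on to the feature extraction step, so older history stops influencing the result.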
  • the step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes:
  • the obtained original data is stored in a buffer area.
  • the original data in the buffer area is input into the CCIPCA algorithm in batches. After the input of the original data of one batch is completed, the iterative calculation is started to obtain the feature data of the original data.
  • the above buffer area refers to a storage space for storing original data.
  • After the original data on the blockchain is obtained, it is not input directly into the CCIPCA algorithm; it is first stored in the buffer area, and the original data in the buffer is then processed in batches in time order.
  • The original data in the buffer area is divided according to a fixed rule, for example every X items of data form one batch. Specifically, the original data in the buffer area is split into batches of equal size, which are then input into the CCIPCA algorithm batch by batch in the order in which the data was acquired.
  • The algorithm runs an iteration only after all samples of a batch have been input; at other times, the raw data already obtained is held in the buffer, waiting for the rest of the raw data to arrive.
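The buffering behaviour described above can be sketched as a small class. The name `BatchBuffer`, its methods and the use of an explicit timestamp per record are assumptions made for illustration; the text only states that batches are equal in size and are released in acquisition order.

```python
class BatchBuffer:
    """Hold raw records until a full, equal-sized batch is available."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._items = []          # (timestamp, record) pairs

    def add(self, timestamp, record):
        self._items.append((timestamp, record))

    def pop_ready_batches(self):
        """Return complete batches in time order; leftovers stay buffered."""
        self._items.sort(key=lambda pair: pair[0])
        ready = []
        while len(self._items) >= self.batch_size:
            chunk = self._items[:self.batch_size]
            del self._items[:self.batch_size]
            ready.append([record for _, record in chunk])
        return ready
```

An iteration of the algorithm would be triggered only for the batches returned by `pop_ready_batches`, while incomplete data waits in the buffer.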
  • The iterative process is as follows: during the CCIPCA calculation, after a batch of original data is received, the i-th eigenvector is refreshed to obtain a new estimate of the i-th eigenvector, and a residual operation is then performed with this new estimate. The (i+1)-th eigenvector is refreshed with the resulting residual sample.
  • To keep the eigenvector error in the early stage of the calculation relatively small, the residual operation can be applied to the samples only after convergence has become stable, which controls the accumulation of errors, as follows:
  • step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes:
  • the first sample refers to the original data corresponding to the first feature data to be sought.
  • The above judgment is based on the following: when the distance between r successive refreshes of the i-th feature vector (the distance being defined as the absolute difference between the inner product and 1) is less than a threshold q (where q is less than $10^{-4}$), the feature vector is considered to have converged, giving the best convergence value the algorithm can obtain.
  • The original data is then subjected, in sequence, to a residual operation against the convergence values obtained for the first to the i-th feature vectors, and the (i+1)-th feature vector is refreshed.
  • An additional termination condition can be added: if convergence has not been reached after m items of original data have been input (m greater than $10^{4}$), the loop is terminated.
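One possible reading of this convergence test is sketched below. It assumes that the inner product is taken between normalized estimates and that "r refreshes" means the r most recent estimates compared pairwise; the function names and the default values of `r`, `q` and `m` are illustrative and are not fixed by the text.

```python
import math


def direction_distance(a, b):
    """Distance between two estimates: | |cos(a, b)| - 1 |."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 1.0                # treat a zero vector as maximally distant
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    return abs(abs(cos) - 1.0)


def has_converged(history, r=3, q=1e-4):
    """True if the last r estimates (r >= 2) differ pairwise by less than q."""
    if r < 2 or len(history) < r:
        return False
    recent = history[-r:]
    return all(direction_distance(a, b) < q for a, b in zip(recent, recent[1:]))


def should_stop(history, n_inputs, r=3, q=1e-4, m=10**4):
    """Stop on convergence, or once more than m inputs have been consumed."""
    return has_converged(history, r, q) or n_inputs > m
```

Taking the absolute value of the cosine makes the test insensitive to sign flips, which do not change the direction an eigenvector represents.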
  • the method includes:
  • S4 Processing irrelevant feature data in the same batch that is unrelated to other feature data in the batch according to a preset rule.
  • the above correlation matrix is also called a correlation coefficient matrix, which is composed of correlation coefficients between columns of the matrix. That is to say, the elements of the i-th row and the j-th column of the correlation matrix are the correlation coefficients of the i-th column and the j-th column of the original matrix.
  • A covariance matrix is generally used for the analysis. Covariance measures how two variables vary together. If the two variables trend in the same direction, the covariance is positive, indicating that they are positively correlated. If they change in opposite directions, the covariance is negative, indicating that they are negatively correlated. If the two variables are independent of each other, the covariance is 0, indicating that they are uncorrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
  • The above irrelevant feature data may be fraudulent data. Such fraud data is not falsified data; it is data generated through formally legitimate channels, similar to order brushing on Taobao (placing orders for one's own goods and then posting positive reviews in the comment area). Such fraud data can be identified at this point, which is the above-mentioned processing according to the preset rule.
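The correlation matrix analysis can be illustrated with a small sketch. It assumes each feature is given as a column of values within one batch and flags a column as unrelated when its absolute correlation with every other column stays below a cutoff; the function names and the `threshold` value are hypothetical, since the text does not state the preset rule.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0                # a constant column carries no correlation
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Element (i, j) is the correlation of column i with column j."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def unrelated_columns(columns, threshold=0.3):
    """Indices of columns weakly correlated with every other column."""
    corr = correlation_matrix(columns)
    flagged = []
    for i, row in enumerate(corr):
        others = [abs(r) for j, r in enumerate(row) if j != i]
        if others and max(others) < threshold:
            flagged.append(i)
    return flagged
```

For example, with columns `[1, 2, 3, 4]`, `[2, 4, 6, 8]` and `[1, -1, -1, 1]`, only the third column is flagged, because the first two are perfectly correlated with each other.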
  • the above irrelevant feature data may be identified by the Voronoi algorithm for outliers to obtain fraud data.
  • the specific process includes:
  • a. Construct the Voronoi diagram of the point set S formed by the above irrelevant feature data;
  • b. Calculate the V-anomaly factor of each point in the point set S and find the V-adjacent points of each point. Specifically: b1. for a point pi in the point set S, determine its neighboring points from its Voronoi polygon V(pi), calculate the average distance from pi to these neighbors, and use the reciprocal of the average distance to measure the degree of abnormality of pi;
  • V-adjacent point of p: a neighboring point of p determined by an edge of the Voronoi polygon of p.
  • V(p): the set of all V-adjacent points of point p.
  • Vd(p): the reciprocal of the average distance from point p to all of its V-adjacent points, called the V-anomaly factor of point p: $Vd(p) = 1 \big/ \left( \frac{1}{|V(p)|} \sum_{o \in V(p)} d(p, o) \right)$, where $d(p, o)$ is the distance between p and o.
  • $|V(p)|$ is the number of all V-adjacent points of p.
  • Vd(p) reflects the distribution density of points around point p.
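Once the V-adjacent points are known, the V-anomaly factor is a short calculation. The sketch below assumes the adjacency has already been derived from a Voronoi diagram by some geometry library and is passed in as a plain mapping; the function name and the data layout are illustrative.

```python
import math


def v_anomaly_factors(points, neighbors):
    """Compute Vd(p) for each point.

    points:    mapping of point id -> coordinate tuple
    neighbors: mapping of point id -> ids of its V-adjacent points
               (assumed to come from a precomputed Voronoi diagram)
    """
    factors = {}
    for p, adjacent in neighbors.items():
        adjacent = list(adjacent)
        if not adjacent:
            continue              # no V-adjacent points, factor undefined
        avg = sum(math.dist(points[p], points[o]) for o in adjacent) / len(adjacent)
        factors[p] = 1.0 / avg if avg > 0.0 else math.inf
    return factors
```

Because Vd(p) is a reciprocal of distance, a point in a sparse region receives a small factor and a point in a dense region receives a large one.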
  • The desired action can be taken according to the specific circumstances of the fraud data. For example, if the fraudulent data was generated by a partner enterprise, an alert email is automatically sent to the company's senior executives, so that they remain vigilant when cooperating with that enterprise.
  • the method includes:
  • The scatter diagram refers to the distribution of data points on a Cartesian coordinate plane in regression analysis. The more data a scatter plot contains, the better the comparison it supports.
  • The extracted feature data is plotted in real time as points on the scatter plot, so that outlying (discrete) points can be spotted promptly by eye and the data corresponding to them can be analyzed.
  • the method includes:
  • The above classification of feature data refers to grouping feature data of the same type together. For example, the feature data may include multiple types, such as financial, logistics, export, crop and livestock. As a specific classification method, the data can be classified according to its source: for example, if the raw data corresponding to the feature data comes from a financial enterprise, the feature data is classified as financial feature data.
  • the classification of the feature data may be that the designer pre-selects the classified categories, or may automatically classify, for example, according to the attributes of the feature data.
  • the corresponding computing models mentioned above include various types, such as a short-term profit model, an export volume prediction model, and a logistics speed prediction model.
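Classification by data source followed by dispatch to a matching model can be sketched as below. The function names, the `"unclassified"` fallback and the idea of representing each operation model as a callable are assumptions for illustration only.

```python
from collections import defaultdict


def classify_by_source(records, source_to_category):
    """Group (source, features) records by the category of their source."""
    groups = defaultdict(list)
    for source, features in records:
        category = source_to_category.get(source, "unclassified")
        groups[category].append(features)
    return dict(groups)


def run_models(groups, models):
    """Feed each category's feature data to its model, if one is registered."""
    return {
        category: models[category](features)
        for category, features in groups.items()
        if category in models
    }
```

A category without a registered model is simply skipped, which leaves room for the designer to add models for new categories later.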
  • The above-mentioned classified feature data sets may be invoked. For example, if the short-term profitability of the financial industry needs to be predicted, the financial feature data is invoked and input into the short-term profit model for prediction. Specifically: the financial feature data is input into the K-means algorithm for a first clustering calculation; the clusters obtained from the first clustering calculation are input into a preset SVR prediction model for regression prediction; and the short-term profitability of the financial industry is determined from the prediction results. If the short-term profitability of the financial industry is relatively high, the loans and financial services corresponding to the financial industry are launched.
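The first clustering step can be illustrated with a plain K-means sketch. The function signature, the caller-supplied initial centroids and the fixed iteration count are assumptions; the subsequent SVR regression is not shown, since it would normally rely on a machine learning library that the text does not name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=20):
    """Cluster points around the given starting centroids.

    Returns (assignment, centroids), where assignment[i] is the index of
    the cluster that points[i] belongs to.
    """
    centroids = [list(c) for c in initial_centroids]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [squared_distance(p, c) for c in centroids]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

The resulting clusters would then serve as the input to the regression model mentioned in the text.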
  • The data feature extraction method of the present application exploits the fact that data downloaded from the blockchain cannot be falsified, so no outlier (discrete point) processing is performed during data feature extraction; data features are extracted directly with the CCIPCA algorithm, which makes extraction faster.
  • an embodiment of the present application provides a device for extracting data features for performing data feature extraction on data on a blockchain.
  • the device includes:
  • the obtaining unit 10 is configured to obtain original data on the blockchain
  • the feature extraction unit 20 is configured to input the original data into the CCIPCA algorithm to calculate feature data of the original data.
  • the above-mentioned original data refers to data directly downloaded from the blockchain, and data that has not undergone any data processing.
  • The method of obtaining the original data from the blockchain includes entering a search term, such as a keyword of the data to be downloaded, and then downloading the data related to that search term.
  • block downloading may also be set, that is, as long as there is data update in the designated block, the updated data is downloaded to achieve high efficiency of real-time analysis processing.
  • the above block refers to a block in a specified area or an enterprise.
  • The CCIPCA (Candid Covariance-free Incremental Principal Component Analysis) algorithm is relatively sensitive to abnormal points in the data stream, and its dimensionality reduction accuracy is greatly affected by such points.
  • the feature that the data on the blockchain is not falsified is fully utilized. Therefore, the process of outlier processing is not required before the dimension reduction by the CCIPCA algorithm, and the efficiency of extracting data features is improved.
  • $A = E\{u(n)u^{T}(n)\}$ is the $d \times d$ covariance matrix, where $T$ denotes matrix transposition.
  • $u_1(n) = u(n)$
  • $u_2(n)$ is used as the input to the next iteration.
  • the feature extraction unit 20 includes:
  • a windowing module 21 configured to perform windowing processing on the original data
  • the first calculating module 22 is configured to input the original data in the window into the CCIPCA algorithm to calculate the feature data of the original data.
  • The windowing processing refers to adding a sliding window to the data so that part of the historical data is discarded and only the data inside the sliding window is processed. This makes the present application focus on feature extraction of new data and achieves real-time processing. Although adding the sliding window has some effect on the accuracy of feature extraction, reducing the dependence on historical data greatly reduces the amount of calculation, which speeds up feature extraction for raw data acquired in real time.
  • the feature extraction unit 20 includes:
  • the cache module 201 is configured to store the acquired original data into a buffer area
  • The second calculating module 202 is configured to input the original data in the buffer area into the CCIPCA algorithm in batches; after the raw data of one batch has been input, iterative calculation starts to obtain the feature data of the original data.
  • the buffer area refers to a storage space for storing original data.
  • After the original data on the blockchain is obtained, it is not input directly into the CCIPCA algorithm; it is first stored in the buffer area, and the original data in the buffer is then processed in batches in time order.
  • The original data in the buffer area is divided according to a fixed rule, for example every X items of data form one batch. Specifically, the original data in the buffer area is split into batches of equal size, which are then input into the CCIPCA algorithm batch by batch in the order in which the data was acquired.
  • The algorithm runs an iteration only after all samples of a batch have been input; at other times, the raw data already obtained is held in the buffer, waiting for the rest of the raw data to arrive.
  • The iterative process is as follows: during the CCIPCA calculation, after a batch of original data is received, the i-th eigenvector is refreshed to obtain a new estimate of the i-th eigenvector, and a residual operation is then performed with this new estimate. The (i+1)-th eigenvector is refreshed with the resulting residual sample.
  • To keep the eigenvector error in the early stage of the calculation relatively small, the residual operation can be applied to the samples only after convergence has become stable, which controls the accumulation of errors, as follows:
  • the feature extraction unit 20 includes:
  • The third calculating unit 203 is configured so that, during the CCIPCA calculation, the first samples are input for the first feature data to be sought until convergence; the residual is then calculated for subsequently input samples to calculate the next feature data, and so on, so that the feature data is calculated one by one.
  • the first sample refers to the original data corresponding to the first feature data to be sought.
  • The above judgment is based on the following: when the distance between r successive refreshes of the i-th feature vector (the distance being defined as the absolute difference between the inner product and 1) is less than a threshold q (where q is less than $10^{-4}$), the feature vector is considered to have converged, giving the best convergence value the algorithm can obtain.
  • The original data is then subjected, in sequence, to a residual operation against the convergence values obtained for the first to the i-th feature vectors, and the (i+1)-th feature vector is refreshed.
  • An additional termination condition can be added: if convergence has not been reached after m items of original data have been input (m greater than $10^{4}$), the loop is terminated.
  • the apparatus for extracting data features further includes:
  • a correlation analysis unit 30 configured to perform correlation matrix analysis on the acquired feature data in batches
  • the processing unit 40 is configured to process irrelevant feature data in the same batch that is not related to other feature data in the batch according to a preset rule.
  • the correlation matrix is also called a correlation coefficient matrix, which is composed of correlation coefficients between columns of the matrix. That is to say, the elements of the i-th row and the j-th column of the correlation matrix are the correlation coefficients of the i-th column and the j-th column of the original matrix.
  • A covariance matrix is generally used for the analysis. Covariance measures how two variables vary together. If the two variables trend in the same direction, the covariance is positive, indicating that they are positively correlated. If they change in opposite directions, the covariance is negative, indicating that they are negatively correlated. If the two variables are independent of each other, the covariance is 0, indicating that they are uncorrelated. When there are three or more groups of variables, the corresponding covariance matrix is used.
  • The above irrelevant feature data may be fraudulent data. Such fraud data is not falsified data; it is data generated through formally legitimate channels, similar to order brushing on Taobao (placing orders for one's own goods and then posting positive reviews in the comment area). Such fraud data can be identified at this point, which is the above-mentioned processing according to the preset rule.
  • the above irrelevant feature data may be identified by the Voronoi algorithm for outliers to obtain fraud data.
  • the specific process includes:
  • a. Construct the Voronoi diagram of the point set S formed by the above irrelevant feature data;
  • b. Calculate the V-anomaly factor of each point in the point set S and find the V-adjacent points of each point. Specifically: b1. for a point pi in the point set S, determine its neighboring points from its Voronoi polygon V(pi), calculate the average distance from pi to these neighbors, and use the reciprocal of the average distance to measure the degree of abnormality of pi;
  • V-adjacent point of p: a neighboring point of p determined by an edge of the Voronoi polygon of p.
  • V(p): the set of all V-adjacent points of point p.
  • Vd(p): the reciprocal of the average distance from point p to all of its V-adjacent points, called the V-anomaly factor of point p: $Vd(p) = 1 \big/ \left( \frac{1}{|V(p)|} \sum_{o \in V(p)} d(p, o) \right)$, where $d(p, o)$ is the distance between p and o.
  • $|V(p)|$ is the number of all V-adjacent points of p.
  • Vd(p) reflects the distribution density of points around point p.
  • The desired action can be taken according to the specific circumstances of the fraud data. For example, if the fraudulent data was generated by a partner enterprise, an alert email is automatically sent to the company's senior executives, so that they remain vigilant when cooperating with that enterprise.
  • the apparatus for extracting data features further includes:
  • the adding unit 50 is configured to add the output feature data to the visualized scattergram in real time.
  • The above-described scatter diagram refers to the distribution of data points on a Cartesian coordinate plane in regression analysis. The more data a scatter plot contains, the better the comparison it supports.
  • The extracted feature data is plotted in real time as points on the scatter plot, so that outlying (discrete) points can be spotted promptly by eye and the data corresponding to them can be analyzed.
  • the apparatus for extracting data features further includes:
  • a classifying unit 60 configured to classify the output feature data
  • the operation unit 70 is configured to input the classified feature data into a corresponding operation model for calculation.
  • The classification of the feature data refers to grouping feature data of the same type together. For example, the feature data may include multiple types, such as financial, logistics, export, crop and livestock. As a specific classification method, the data can be classified according to its source: for example, if the raw data corresponding to the feature data comes from a financial enterprise, the feature data is classified as financial feature data.
  • the classification of the feature data may be that the designer pre-selects the classified categories, or may automatically classify, for example, according to the attributes of the feature data.
  • the corresponding computing models mentioned above include various types, such as a short-term profit model, an export volume prediction model, and a logistics speed prediction model.
  • The above-mentioned classified feature data sets may be invoked. For example, if the short-term profitability of the financial industry needs to be predicted, the financial feature data is invoked and input into the short-term profit model for prediction. Specifically: the financial feature data is input into the K-means algorithm for a first clustering calculation; the clusters obtained from the first clustering calculation are input into a preset SVR prediction model for regression prediction; and the short-term profitability of the financial industry is determined from the prediction results. If the short-term profitability of the financial industry is relatively high, the loans and financial services corresponding to the financial industry are launched.
  • the computer device may be a server, and its internal structure may be as shown in FIG.
  • The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • The internal memory provides an environment for running the operating system and the computer readable instructions stored in the non-volatile storage medium.
  • the database of the computer device is used to store data such as the CCIPCA algorithm and the derived feature data.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by a processor to implement a method of data feature extraction.
  • The processor performs the foregoing method for data feature extraction on data on a blockchain, the method comprising: acquiring original data on a blockchain; and inputting the original data into a CCIPCA algorithm to calculate feature data of the original data.
  • The step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: windowing the original data; and inputting the original data in the window into the CCIPCA algorithm to calculate the feature data of the original data.
  • The step of inputting the original data into the CCIPCA algorithm to calculate feature data of the original data includes: storing the acquired original data into a buffer area; and inputting the original data in the buffer area into the CCIPCA algorithm in batches, where iterative calculation starts after the raw data of one batch has been input, to obtain the feature data of the original data.
  • The step of inputting the original data into the CCIPCA algorithm to calculate feature data of the original data includes: during the CCIPCA calculation, inputting the first samples for the first feature data to be sought and calculating until convergence, then calculating the residual for subsequently input samples to calculate the next feature data, and so on, so that the feature data is calculated one by one.
  • The method comprises: performing correlation matrix analysis on the acquired feature data in batches; and processing, according to a preset rule, irrelevant feature data in a batch that is unrelated to the other feature data in the same batch.
  • the step of inputting the raw data into the CCIPCA algorithm to calculate the feature data of the original data includes: adding the output feature data to the visualized scattergram in real time.
  • The method includes: classifying the output feature data; and inputting the classified feature data into the corresponding operation model for calculation.
  • FIG. 7 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation of the computer device to which the present application is applied.
  • An embodiment of the present invention further provides a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement a data feature extraction method for extracting data features from data on a blockchain, the method including: acquiring original data on the blockchain; and inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
  • The step in which the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: windowing the original data; and inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
  • The step in which the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: storing the acquired original data in a buffer; and inputting the original data in the buffer into the CCIPCA algorithm in batches, where iterative calculation starts once a whole batch has been input, to obtain the feature data of the original data.
  • The step in which the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: when calculating with the CCIPCA algorithm, for the first feature data to be obtained, inputting the first samples and calculating until it converges; then calculating the residual of the subsequent input samples to calculate the next feature data; and so on, calculating the feature data one by one.
  • After the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: performing correlation matrix analysis on the acquired feature data in batches; and processing, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
  • After the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: adding the output feature data to a visualized scatter plot in real time.
  • After the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: classifying the output feature data; and inputting the classified feature data into the corresponding operation model for calculation.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

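This application discloses a method, an apparatus, a computer device, and a storage medium for data feature extraction. The method includes: acquiring original data on a blockchain; and inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data. Because the data is downloaded from a blockchain and cannot be tampered with, the feature extraction process omits outlier handling and applies the CCIPCA algorithm directly, which makes feature extraction faster.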

Description

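Method, apparatus, computer device, and storage medium for data feature extraction
This application claims priority to Chinese patent application No. 2018103627855, filed with the Chinese Patent Office on April 20, 2018 and entitled "Method, apparatus, computer device, and storage medium for data feature extraction", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a method, an apparatus, a computer device, and a storage medium for data feature extraction.
Background
A blockchain is a decentralized, trustless data architecture that is jointly owned, managed, and supervised by all nodes in the network and is not controlled by any single party.
Blockchain is an emerging technology, and enterprises are carrying out early-stage research, development, and strategic planning, so analyzing the data on a blockchain is a necessary step. As the volume of data on the blockchain keeps growing, however, how to quickly extract the feature data of the original data on a blockchain has become a problem that urgently needs solving.
Technical Problem
The main purpose of this application is to provide a method, an apparatus, a computer device, and a storage medium for data feature extraction that can quickly extract the feature data of the original data on a blockchain.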
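Technical Solution
This application provides a method for data feature extraction, used to extract data features from data on a blockchain. The method includes:
acquiring original data on the blockchain;
inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
This application further provides an apparatus for data feature extraction, used to extract data features from data on a blockchain. The apparatus includes:
an acquisition unit, configured to acquire original data on the blockchain;
a feature extraction unit, configured to input the original data into the CCIPCA algorithm to calculate the feature data of the original data.
This application further provides a computer device including a memory and a processor, the memory storing computer-readable instructions, wherein the processor implements the steps of any of the methods described above when executing the computer-readable instructions.
This application further provides a non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions implement the steps of any of the methods described above when executed by a processor.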
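Beneficial Effects
The method, apparatus, computer device, and storage medium for data feature extraction of this application rely on the fact that the data is downloaded from a blockchain and cannot be tampered with. The feature extraction process therefore omits outlier handling and applies the CCIPCA algorithm directly, which makes feature extraction faster.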
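Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for data feature extraction according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a method for data feature extraction according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a method for data feature extraction according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a method for data feature extraction according to an embodiment of this application;
FIG. 5 is a schematic structural block diagram of an apparatus for data feature extraction according to an embodiment of this application;
FIG. 6 is a schematic structural block diagram of a feature extraction unit according to an embodiment of this application;
FIG. 7 is a schematic structural block diagram of a feature extraction unit according to an embodiment of this application;
FIG. 8 is a schematic structural block diagram of a feature extraction unit according to an embodiment of this application;
FIG. 9 is a schematic structural block diagram of an apparatus for data feature extraction according to an embodiment of this application;
FIG. 10 is a schematic structural block diagram of an apparatus for data feature extraction according to an embodiment of this application;
FIG. 11 is a schematic structural block diagram of an apparatus for data feature extraction according to an embodiment of this application;
FIG. 12 is a schematic structural block diagram of a computer device according to an embodiment of this application.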
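Best Mode for Carrying Out the Invention
Referring to FIG. 1, an embodiment of this application provides a method for data feature extraction, used to extract data features from data on a blockchain. The method includes:
S1. acquiring original data on the blockchain;
S2. inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
As described in step S1, the original data is data downloaded directly from the blockchain without any processing. One way to acquire original data from the blockchain is to enter search terms, such as keywords, for the data to be downloaded and then download the data related to those terms. In other embodiments, block-based downloading can be configured: whenever a designated block has a data update, the updated data is downloaded, which keeps real-time analysis efficient. Here a block refers to the block of a designated domain or of a particular enterprise.
As described in step S2, the CCIPCA (Candid Covariance-free Incremental Principal Component Analysis) algorithm can be used for dimensionality reduction of online data streams. The algorithm is sensitive to outliers in the data stream, and the accuracy of the reduction is strongly affected by them. This embodiment takes advantage of the fact that data on a blockchain cannot be tampered with, so no outlier handling is needed before the CCIPCA algorithm is applied, which improves the efficiency of feature extraction.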
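In this embodiment, the CCIPCA algorithm calculates the feature data of the original data as follows:
Suppose the data stream is collected as sample vectors u(1), u(2), …, a sequence that may be infinitely long. Each u(n), n = 1, 2, …, is a d-dimensional vector. Without loss of generality, assume that u(n) has zero mean. A = E{u(n)u^T(n)} is the d×d covariance matrix, where T denotes the matrix transpose. The covariance matrix is computed by incremental updating:
[Equation image PCTCN2018095388-appb-000001]
Let v(0) = v(1), the first direction of the data distribution, where v stands in for the covariance matrix. For incremental estimation, the expression above can be written in recursive form:
[Equation image PCTCN2018095388-appb-000002]
Here v = λx is estimated from the sample covariance matrix, and the eigenvector x and the eigenvalue λ are obtained as x = v/||v|| and λ = ||v||. This yields the first-order eigenvector; the second-order one is obtained as follows:
[Equation image PCTCN2018095388-appb-000003]
where u_1(n) = u(n). In the complementary (residual) space, u_2(n) is used as the input to the next iteration.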
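For orientation, the update rules usually given for CCIPCA in the published literature are sketched below. The equation images of this application are not reproduced in this text, so treating them as these standard forms is an assumption; the sketch only uses the symbols defined above (u(n), v(n), u_1(n), u_2(n)) and omits the optional amnesic weighting parameter found in some formulations.

```latex
\begin{aligned}
v(n) &= \frac{1}{n}\sum_{i=1}^{n} u(i)\,u^{T}(i)\,\frac{v(i-1)}{\lVert v(i-1)\rVert} \\[4pt]
v(n) &= \frac{n-1}{n}\,v(n-1) + \frac{1}{n}\,u(n)\,u^{T}(n)\,\frac{v(n-1)}{\lVert v(n-1)\rVert} \\[4pt]
u_{2}(n) &= u_{1}(n) - u_{1}^{T}(n)\,\frac{v_{1}(n)}{\lVert v_{1}(n)\rVert}\,\frac{v_{1}(n)}{\lVert v_{1}(n)\rVert}
\end{aligned}
```

The second line is the first line with the n-th term split off from the sum, which is why no covariance matrix ever has to be stored. The third line removes from the sample the part already explained by the first eigenvector estimate.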
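In this embodiment, step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes:
S21. windowing the original data;
S22. inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
As described in steps S21 and S22, windowing means applying a sliding window to the data so that part of the historical data is discarded and only the data inside the window is processed. This focuses the feature extraction on new data and achieves real-time processing. Adding the sliding window has some impact on the accuracy of feature extraction, but it reduces the dependence on historical data and greatly lowers the amount of computation, which speeds up feature extraction on original data acquired in real time.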
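The windowing step can be sketched as follows. This is a minimal illustration that assumes a fixed-size window; the class name `SlidingWindow`, the window size, and the sample values are hypothetical and are not taken from the application.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the most recent `size` samples; older samples are discarded."""

    def __init__(self, size):
        self._samples = deque(maxlen=size)

    def push(self, sample):
        # When the deque is full, appending silently drops the oldest sample.
        self._samples.append(list(sample))

    def samples(self):
        # Snapshot of the window contents, oldest first.
        return list(self._samples)


window = SlidingWindow(size=3)
for record in [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [5.0, 6.0]]:
    window.push(record)
current = window.samples()  # only these samples would be passed on to CCIPCA
```

A bounded `deque` makes the trade-off in the text explicit: memory and per-step work stay constant, at the cost of forgetting everything that has left the window.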
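In one embodiment, step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes:
S201. storing the acquired original data in a buffer;
S202. inputting the original data in the buffer into the CCIPCA algorithm in batches, where iterative calculation starts once a whole batch has been input, to obtain the feature data of the original data.
As described in steps S201 and S202, the buffer is storage space for holding original data. In this embodiment, original data acquired from the blockchain is not fed straight into the CCIPCA algorithm. It is first stored in the buffer and then processed in batches in time order: the original data in the buffer is partitioned by some rule, for example one batch per X units of data, and the batches are fed to the CCIPCA algorithm in chronological order. Specifically, the original data in the buffer is split into batches of equal size, which are input to the CCIPCA algorithm batch by batch, in order of acquisition time, for iteration. Let a batch contain p samples. The algorithm iterates only after a whole batch of samples has been input; at other times, newly acquired original data is placed in the buffer to wait for the rest of the batch. The iteration proceeds as follows: after a batch of original data has been received, the i-th eigenvector is refreshed with the batch to obtain its new estimate, the residual is computed against this new estimate, and the resulting samples are then used to refresh the (i+1)-th eigenvector. Compared with refreshing on every single sample, this keeps the eigenvector error relatively small early in the calculation; the residual operation is applied only after the estimate has stabilized, which limits the accumulation of error. In detail:
For each batch of p original-data sample vectors u(1), u(2), …, u(p), the first k principal components v_1(n), v_2(n), …, v_k(n) are refreshed as follows:
For i = 1, 2, …, k:
1) v_i(n) = u_i(n)
2) For n = 1, 2, …, p:
[Equation image PCTCN2018095388-appb-000004]
3) For n = 1, 2, …, p:
[Equation image PCTCN2018095388-appb-000005]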
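The buffering behaviour can be illustrated with a short sketch. It is an assumption-laden simplification: samples are held until a batch of `batch_size` (p in the text) is complete, and the standard per-sample CCIPCA update is then run over the batch. It does not reproduce the exact per-component batch refresh of the equation images, and the class name `BatchCCIPCA` and the synthetic data are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


class BatchCCIPCA:
    """Buffers samples and runs CCIPCA updates once a full batch is available."""

    def __init__(self, k, batch_size):
        self.k = k                    # number of components to track
        self.batch_size = batch_size  # p in the text
        self.buffer = []              # samples waiting for a full batch
        self.v = []                   # unnormalised component estimates v_i
        self.n = 0                    # samples consumed by the update rule

    def add(self, sample):
        """Store one zero-mean sample; iterate only when the batch is complete."""
        self.buffer.append(list(sample))
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            for u in batch:
                self._update(u)

    def _update(self, u):
        self.n += 1
        n = self.n
        for i in range(min(self.k, n)):
            if i == len(self.v):
                self.v.append(list(u))  # initialise with the current residual
            else:
                v = self.v[i]
                nv = _norm(v)
                if nv == 0.0:
                    self.v[i] = list(u)  # re-initialise a degenerate estimate
                else:
                    proj = _dot(u, v) / nv  # u^T v / ||v||
                    self.v[i] = [(n - 1) / n * vj + proj / n * uj
                                 for vj, uj in zip(v, u)]
            nv = _norm(self.v[i])
            if nv == 0.0:
                break
            unit = [x / nv for x in self.v[i]]
            coeff = _dot(u, unit)
            # Residual: remove the part of u explained by component i.
            u = [uj - coeff * xj for uj, xj in zip(u, unit)]

    def components(self):
        """Unit-length estimates of the principal directions found so far."""
        return [[x / _norm(v) for x in v] for v in self.v if _norm(v) > 0.0]


model = BatchCCIPCA(k=2, batch_size=50)
for t in range(200):
    # Synthetic, roughly zero-mean data whose variance is dominated by the first axis.
    model.add([3.0 * math.cos(t), 0.5 * math.sin(2.3 * t)])
first = model.components()[0]
```

With this synthetic stream the first estimated direction should line up closely with the first coordinate axis, since that axis carries most of the variance.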
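In another embodiment, step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes:
S203. when calculating with the CCIPCA algorithm, for the first feature data to be obtained, inputting the first samples and calculating until it converges; then calculating the residual of the subsequent input samples to calculate the next feature data; and so on, calculating the feature data one by one.
As described in step S203, the first sample refers to the original data corresponding to the first feature data to be obtained. Convergence is judged as follows: when the distances between r consecutive estimates of the i-th eigenvector (the distance being defined as how far the absolute value of their inner product is from 1) are all below a threshold q (with q less than 10^-4), the eigenvector is considered converged, which gives the best converged value the algorithm can reach. The residual of the original data is then computed in turn against the final converged values of eigenvectors 1 through i, and the (i+1)-th eigenvector is refreshed. To prevent an eigenvector that never converges from making the algorithm loop for a long time, an additional termination condition can be attached: if convergence has still not been reached after m original-data samples have been input (with m greater than 10^4), the loop is terminated.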
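The stopping rule described above can be sketched as a small helper. The names `direction_distance` and `ConvergenceMonitor` are hypothetical, the default values of `r`, `q`, and `max_samples` are illustrative choices at the edge of the ranges given in the text, and counting r consecutive small distances (rather than r estimates) is an assumption about how the criterion is meant.

```python
import math


def direction_distance(a, b):
    """|1 - |<a, b>|| for the unit-normalised versions of a and b."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    inner = sum(x * y for x, y in zip(a, b)) / (na * nb)
    return abs(1.0 - abs(inner))


class ConvergenceMonitor:
    """Tracks successive estimates of one eigenvector and reports its status."""

    def __init__(self, r=5, q=1e-4, max_samples=10_000):
        self.r = r                      # required run of consecutive small distances
        self.q = q                      # distance threshold
        self.max_samples = max_samples  # safety cap on the number of updates
        self.prev = None
        self.streak = 0
        self.seen = 0

    def update(self, estimate):
        self.seen += 1
        if self.prev is not None:
            close = direction_distance(self.prev, estimate) < self.q
            self.streak = self.streak + 1 if close else 0
        self.prev = list(estimate)
        if self.streak >= self.r:
            return "converged"
        if self.seen >= self.max_samples:
            return "aborted"
        return "running"


monitor = ConvergenceMonitor(r=3)
states = [monitor.update([1.0, 0.0]) for _ in range(4)]
```

Using the absolute value of the inner product makes the test insensitive to sign flips, which matters because an eigenvector and its negation describe the same direction.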
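Referring to FIG. 2, in this embodiment, after step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes:
S3. performing correlation matrix analysis on the acquired feature data in batches;
S4. processing, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
As described in step S3, the correlation matrix, also called the correlation coefficient matrix, is made up of the correlation coefficients between the columns of a matrix. That is, the element in row i and column j of the correlation matrix is the correlation coefficient between column i and column j of the original matrix. This embodiment generally uses the covariance matrix for the analysis. Covariance measures how two variables vary together: if the two variables tend to change in the same direction, the covariance is positive and the variables are positively correlated; if they tend to change in opposite directions, the covariance is negative and the variables are negatively correlated; if the two variables are independent, the covariance is 0 and the variables are uncorrelated. When there are three or more variables, the corresponding covariance matrix is used.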
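A minimal sketch of the batch correlation analysis follows. It assumes each feature in a batch is available as a column of numbers and flags a feature as uncorrelated when its absolute Pearson correlation with every other feature is below a threshold; the function names and the threshold of 0.3 are hypothetical choices, since the application does not state a cut-off.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_matrix(columns):
    """Entry [i][j] is the correlation between column i and column j."""
    return [[pearson(a, b) for b in columns] for a in columns]


def uncorrelated(columns, threshold=0.3):
    """Indices of columns only weakly correlated with every other column."""
    corr = correlation_matrix(columns)
    k = len(columns)
    return [i for i in range(k)
            if all(abs(corr[i][j]) < threshold for j in range(k) if j != i)]


batch = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [1.0, -1.0, 0.0, -1.0, 1.0],
]
flagged = uncorrelated(batch)
```

In this toy batch the first two columns move together, while the third has zero correlation with both and is therefore the one that would be handed to the preset rule.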
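As described in step S4, the uncorrelated feature data may be fraudulent data. Such data has not been tampered with; it is fraud carried out through legitimate channels, similar to order brushing on Taobao (placing orders for one's own goods and then leaving positive reviews). Fraudulent data can then be identified, which is the processing according to a preset rule mentioned above. In one embodiment, outliers among the uncorrelated feature data are identified with a Voronoi-based algorithm to obtain the fraudulent data. The procedure is:
a. Build the Voronoi diagram of the point set S formed from the uncorrelated feature data;
b. Compute the V-outlier factor of every point in S and find the V-neighbors of each point, specifically: b1. For a point p_i in S, use its Voronoi polygon V(p_i) to determine its neighbors, compute the mean distance from p_i to those neighbors, and use the reciprocal of the mean distance to measure how anomalous p_i is;
b2. For any point p in S, the neighbors of p determined by the edges of V(p) are called the V-neighbors of p, and the set of all V-neighbors of p is denoted V(p).
b3. The reciprocal of the mean distance from all V-neighbors of p to p is called the V-outlier factor of p, denoted Vd(p):
$$V_d(p) = \left( \frac{1}{|V(p)|} \sum_{o \in V(p)} d(p, o) \right)^{-1}$$
where |V(p)| is the number of V-neighbors of p and d(p, o) is the distance between p and o;
Vd(p) reflects the density of the points around p. Because it is the reciprocal of the mean distance, a smaller Vd(p) means that the points around p are more sparsely distributed, so p is more anomalous.
c. Sort the points by V-outlier factor in ascending order;
d. Output the V-outlier factor of each point together with the n points that have the smallest factors; the data corresponding to these n points is judged to carry the highest risk of being fraudulent.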
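Steps b to d can be sketched as below. The sketch assumes the Voronoi neighbor relation has already been computed by some geometry routine and is passed in as a plain mapping, so building the Voronoi diagram itself (step a) is not shown. The function names and the four sample points are hypothetical.

```python
import math


def v_outlier_factors(points, neighbors):
    """Reciprocal of the mean distance from each point to its V-neighbors."""
    factors = {}
    for idx, nbrs in neighbors.items():
        mean_dist = sum(math.dist(points[idx], points[j]) for j in nbrs) / len(nbrs)
        factors[idx] = 1.0 / mean_dist
    return factors


def most_suspicious(points, neighbors, n):
    """Indices of the n points with the smallest V-outlier factor."""
    factors = v_outlier_factors(points, neighbors)
    return sorted(factors, key=factors.get)[:n]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
# Neighbor lists as they would come out of a Voronoi diagram of these points
# (assumed input; not computed here).
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
factors = v_outlier_factors(points, neighbors)
suspects = most_suspicious(points, neighbors, n=1)
```

The isolated point at (10, 10) has the largest mean distance to its neighbors and therefore the smallest factor, so it is the one returned.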
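Once fraudulent data has been identified, an appropriate action can be taken depending on the circumstances. For example, if the fraudulent data was produced by a partner enterprise, an alert email or similar notice is sent automatically to the executives of this enterprise, so that they stay vigilant when cooperating with that partner.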
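Referring to FIG. 3, in one embodiment, after step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes:
S5. adding the output feature data to a visualized scatter plot in real time.
As described in step S5, a scatter plot (scatter diagram) in regression analysis is a plot of the distribution of data points on a Cartesian coordinate plane. The more data a scatter plot contains, the better the comparison. In this embodiment, the extracted feature data is shown as points in the scatter plot in real time, so that people can spot outlying points by eye in good time and analyze the data corresponding to those points.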
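Referring to FIG. 4, in this embodiment, after step S2 of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes:
S6. classifying the output feature data;
S7. inputting the classified feature data into the corresponding operation model for calculation.
As described in steps S6 and S7, classifying the feature data means grouping the feature data by type. For example, the feature data may cover several types, such as finance, logistics, export, crops, and livestock. The classification can be based on, for example, the source of the data: if the original data corresponding to a piece of feature data comes from a financial enterprise, it is classified as financial feature data. The categories can be defined in advance by the designer, or the classification can be automatic, for example based on the attributes of the feature data. There are several corresponding operation models, such as a short-term profit model, an export volume prediction model, and a logistics speed prediction model. In a specific embodiment, scenario-specific predictions are needed for businesses such as insurance and loans in various industries, and the classified feature data sets can be called on for this purpose. For example, to predict the short-term profitability of the financial industry, the financial feature data is retrieved and input into the short-term profit model, specifically: the financial feature data is input into the K-means algorithm for a first clustering calculation; the clusters obtained from the first clustering calculation are input into a preset SVR prediction model for regression prediction; and the short-term profitability of the financial industry is determined from the prediction result. If the short-term profitability of the financial industry is relatively high, loan, wealth management, and other businesses for the financial industry are launched.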
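The classify-then-dispatch flow of steps S6 and S7 can be sketched as follows. Everything named here is hypothetical: records are assumed to carry a `source` label, and `mean_first_feature` is a trivial placeholder standing in for a real operation model. It is not the K-means plus SVR short-term profit model described above.

```python
from collections import defaultdict


def classify(records):
    """Group feature vectors by the source category of their original data."""
    groups = defaultdict(list)
    for record in records:
        groups[record["source"]].append(record["features"])
    return dict(groups)


def mean_first_feature(rows):
    # Placeholder "model": NOT the K-means + SVR pipeline described in the text.
    return sum(row[0] for row in rows) / len(rows)


# One model per category; categories without a model are skipped.
MODELS = {"finance": mean_first_feature, "logistics": mean_first_feature}


def run_models(groups, models):
    """Feed each category's feature data to the model registered for it."""
    return {name: models[name](rows) for name, rows in groups.items() if name in models}


records = [
    {"source": "finance", "features": [0.8, 0.1]},
    {"source": "finance", "features": [0.4, 0.3]},
    {"source": "logistics", "features": [0.2, 0.9]},
    {"source": "livestock", "features": [0.5, 0.5]},
]
results = run_models(classify(records), MODELS)
```

Keeping the models in a lookup table keyed by category means a new industry can be supported by registering one more entry, without touching the classification step.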
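The method for data feature extraction of this application relies on the fact that the data is downloaded from a blockchain and cannot be tampered with. The feature extraction process therefore omits outlier handling and applies the CCIPCA algorithm directly, which makes feature extraction faster.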
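Referring to FIG. 5, an embodiment of this application provides an apparatus for data feature extraction, used to extract data features from data on a blockchain. The apparatus includes:
an acquisition unit 10, configured to acquire original data on the blockchain;
a feature extraction unit 20, configured to input the original data into the CCIPCA algorithm to calculate the feature data of the original data.
In the acquisition unit 10, the original data is data downloaded directly from the blockchain without any processing. One way to acquire original data from the blockchain is to enter search terms, such as keywords, for the data to be downloaded and then download the data related to those terms. In other embodiments, block-based downloading can be configured: whenever a designated block has a data update, the updated data is downloaded, which keeps real-time analysis efficient. Here a block refers to the block of a designated domain or of a particular enterprise.
In the feature extraction unit 20, the CCIPCA (Candid Covariance-free Incremental Principal Component Analysis) algorithm can be used for dimensionality reduction of online data streams. The algorithm is sensitive to outliers in the data stream, and the accuracy of the reduction is strongly affected by them. This embodiment takes advantage of the fact that data on a blockchain cannot be tampered with, so no outlier handling is needed before the CCIPCA algorithm is applied, which improves the efficiency of feature extraction.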
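In this embodiment, the CCIPCA algorithm calculates the feature data of the original data as follows:
Suppose the data stream is collected as sample vectors u(1), u(2), …, a sequence that may be infinitely long. Each u(n), n = 1, 2, …, is a d-dimensional vector. Without loss of generality, assume that u(n) has zero mean. A = E{u(n)u^T(n)} is the d×d covariance matrix, where T denotes the matrix transpose. The covariance matrix is computed by incremental updating:
[Equation image PCTCN2018095388-appb-000007]
Let v(0) = v(1), the first direction of the data distribution, where v stands in for the covariance matrix. For incremental estimation, the expression above can be written in recursive form:
[Equation image PCTCN2018095388-appb-000008]
Here v = λx is estimated from the sample covariance matrix, and the eigenvector x and the eigenvalue λ are obtained as x = v/||v|| and λ = ||v||. This yields the first-order eigenvector; the second-order one is obtained as follows:
[Equation image PCTCN2018095388-appb-000009]
where u_1(n) = u(n). In the complementary (residual) space, u_2(n) is used as the input to the next iteration.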
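Referring to FIG. 6, in this embodiment, the feature extraction unit 20 includes:
a windowing module 21, configured to window the original data;
a first calculation module 22, configured to input the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
In the windowing module 21 and the first calculation module 22, windowing means applying a sliding window to the data so that part of the historical data is discarded and only the data inside the window is processed. This focuses the feature extraction on new data and achieves real-time processing. Adding the sliding window has some impact on the accuracy of feature extraction, but it reduces the dependence on historical data and greatly lowers the amount of computation, which speeds up feature extraction on original data acquired in real time.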
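Referring to FIG. 7, in one embodiment, the feature extraction unit 20 includes:
a buffer module 201, configured to store the acquired original data in a buffer;
a second calculation module 202, configured to input the original data in the buffer into the CCIPCA algorithm in batches, where iterative calculation starts once a whole batch has been input, to obtain the feature data of the original data.
In the buffer module 201 and the second calculation module 202, the buffer is storage space for holding original data. In this embodiment, original data acquired from the blockchain is not fed straight into the CCIPCA algorithm. It is first stored in the buffer and then processed in batches in time order: the original data in the buffer is partitioned by some rule, for example one batch per X units of data, and the batches are fed to the CCIPCA algorithm in chronological order. Specifically, the original data in the buffer is split into batches of equal size, which are input to the CCIPCA algorithm batch by batch, in order of acquisition time, for iteration. Let a batch contain p samples. The algorithm iterates only after a whole batch of samples has been input; at other times, newly acquired original data is placed in the buffer to wait for the rest of the batch. The iteration proceeds as follows: after a batch of original data has been received, the i-th eigenvector is refreshed with the batch to obtain its new estimate, the residual is computed against this new estimate, and the resulting samples are then used to refresh the (i+1)-th eigenvector. Compared with refreshing on every single sample, this keeps the eigenvector error relatively small early in the calculation; the residual operation is applied only after the estimate has stabilized, which limits the accumulation of error. In detail:
For each batch of p original-data sample vectors u(1), u(2), …, u(p), the first k principal components v_1(n), v_2(n), …, v_k(n) are refreshed as follows:
For i = 1, 2, …, k:
1) v_i(n) = u_i(n)
2) For n = 1, 2, …, p:
[Equation image PCTCN2018095388-appb-000010]
3) For n = 1, 2, …, p:
[Equation image PCTCN2018095388-appb-000011]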
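Referring to FIG. 8, in another embodiment, the feature extraction unit 20 includes:
a third calculation unit 203, configured to, when calculating with the CCIPCA algorithm, for the first feature data to be obtained, input the first samples and calculate until it converges; then calculate the residual of the subsequent input samples to calculate the next feature data; and so on, calculating the feature data one by one.
In the third calculation unit 203, the first sample refers to the original data corresponding to the first feature data to be obtained. Convergence is judged as follows: when the distances between r consecutive estimates of the i-th eigenvector (the distance being defined as how far the absolute value of their inner product is from 1) are all below a threshold q (with q less than 10^-4), the eigenvector is considered converged, which gives the best converged value the algorithm can reach. The residual of the original data is then computed in turn against the final converged values of eigenvectors 1 through i, and the (i+1)-th eigenvector is refreshed. To prevent an eigenvector that never converges from making the algorithm loop for a long time, an additional termination condition can be attached: if convergence has still not been reached after m original-data samples have been input (with m greater than 10^4), the loop is terminated.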
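Referring to FIG. 9, in this embodiment, the apparatus for data feature extraction further includes:
a correlation analysis unit 30, configured to perform correlation matrix analysis on the acquired feature data in batches;
a processing unit 40, configured to process, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
In the correlation analysis unit 30, the correlation matrix, also called the correlation coefficient matrix, is made up of the correlation coefficients between the columns of a matrix. That is, the element in row i and column j of the correlation matrix is the correlation coefficient between column i and column j of the original matrix. This embodiment generally uses the covariance matrix for the analysis. Covariance measures how two variables vary together: if the two variables tend to change in the same direction, the covariance is positive and the variables are positively correlated; if they tend to change in opposite directions, the covariance is negative and the variables are negatively correlated; if the two variables are independent, the covariance is 0 and the variables are uncorrelated. When there are three or more variables, the corresponding covariance matrix is used.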
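In the processing unit 40, the uncorrelated feature data may be fraudulent data. Such data has not been tampered with; it is fraud carried out through legitimate channels, similar to order brushing on Taobao (placing orders for one's own goods and then leaving positive reviews). Fraudulent data can then be identified, which is the processing according to a preset rule mentioned above. In one embodiment, outliers among the uncorrelated feature data are identified with a Voronoi-based algorithm to obtain the fraudulent data. The procedure is:
a. Build the Voronoi diagram of the point set S formed from the uncorrelated feature data;
b. Compute the V-outlier factor of every point in S and find the V-neighbors of each point, specifically: b1. For a point p_i in S, use its Voronoi polygon V(p_i) to determine its neighbors, compute the mean distance from p_i to those neighbors, and use the reciprocal of the mean distance to measure how anomalous p_i is;
b2. For any point p in S, the neighbors of p determined by the edges of V(p) are called the V-neighbors of p, and the set of all V-neighbors of p is denoted V(p).
b3. The reciprocal of the mean distance from all V-neighbors of p to p is called the V-outlier factor of p, denoted Vd(p):
$$V_d(p) = \left( \frac{1}{|V(p)|} \sum_{o \in V(p)} d(p, o) \right)^{-1}$$
where |V(p)| is the number of V-neighbors of p and d(p, o) is the distance between p and o;
Vd(p) reflects the density of the points around p. Because it is the reciprocal of the mean distance, a smaller Vd(p) means that the points around p are more sparsely distributed, so p is more anomalous.
c. Sort the points by V-outlier factor in ascending order;
d. Output the V-outlier factor of each point together with the n points that have the smallest factors; the data corresponding to these n points is judged to carry the highest risk of being fraudulent.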
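Once fraudulent data has been identified, an appropriate action can be taken depending on the circumstances. For example, if the fraudulent data was produced by a partner enterprise, an alert email or similar notice is sent automatically to the executives of this enterprise, so that they stay vigilant when cooperating with that partner.
Referring to FIG. 10, in one embodiment, the apparatus for data feature extraction further includes:
an adding unit 50, configured to add the output feature data to a visualized scatter plot in real time.
In the adding unit 50, a scatter plot (scatter diagram) in regression analysis is a plot of the distribution of data points on a Cartesian coordinate plane. The more data a scatter plot contains, the better the comparison. In this embodiment, the extracted feature data is shown as points in the scatter plot in real time, so that people can spot outlying points by eye in good time and analyze the data corresponding to those points.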
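Referring to FIG. 11, in this embodiment, the apparatus for data feature extraction further includes:
a classification unit 60, configured to classify the output feature data;
an operation unit 70, configured to input the classified feature data into the corresponding operation model for calculation.
In the classification unit 60 and the operation unit 70, classifying the feature data means grouping the feature data by type. For example, the feature data may cover several types, such as finance, logistics, export, crops, and livestock. The classification can be based on, for example, the source of the data: if the original data corresponding to a piece of feature data comes from a financial enterprise, it is classified as financial feature data. The categories can be defined in advance by the designer, or the classification can be automatic, for example based on the attributes of the feature data. There are several corresponding operation models, such as a short-term profit model, an export volume prediction model, and a logistics speed prediction model. In a specific embodiment, scenario-specific predictions are needed for businesses such as insurance and loans in various industries, and the classified feature data sets can be called on for this purpose. For example, to predict the short-term profitability of the financial industry, the financial feature data is retrieved and input into the short-term profit model, specifically: the financial feature data is input into the K-means algorithm for a first clustering calculation; the clusters obtained from the first clustering calculation are input into a preset SVR prediction model for regression prediction; and the short-term profitability of the financial industry is determined from the prediction result. If the short-term profitability of the financial industry is relatively high, loan, wealth management, and other businesses for the financial industry are launched.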
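Referring to FIG. 12, an embodiment of the present invention further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 12. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device stores data such as the CCIPCA algorithm and the resulting feature data. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a method for data feature extraction.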
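The processor executes the method for data feature extraction, which is used to extract data features from data on a blockchain. The method includes: acquiring original data on the blockchain; and inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
In one embodiment, the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: windowing the original data; and inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
In one embodiment, the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: storing the acquired original data in a buffer; and inputting the original data in the buffer into the CCIPCA algorithm in batches, where iterative calculation starts once a whole batch has been input, to obtain the feature data of the original data.
In one embodiment, the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: when calculating with the CCIPCA algorithm, for the first feature data to be obtained, inputting the first samples and calculating until it converges; then calculating the residual of the subsequent input samples to calculate the next feature data; and so on, calculating the feature data one by one.
In one embodiment, after the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: performing correlation matrix analysis on the acquired feature data in batches; and processing, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
In one embodiment, after the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: adding the output feature data to a visualized scatter plot in real time.
In one embodiment, after the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: classifying the output feature data; and inputting the classified feature data into the corresponding operation model for calculation.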
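A person skilled in the art will understand that the structure shown in FIG. 12 is only a block diagram of the part of the structure related to the solution of this application, and does not limit the computer device to which the solution is applied.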
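An embodiment of the present invention further provides a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement a method for data feature extraction, used to extract data features from data on a blockchain. The method includes: acquiring original data on the blockchain; and inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
In one embodiment, the step in which the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: windowing the original data; and inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
In one embodiment, the step in which the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: storing the acquired original data in a buffer; and inputting the original data in the buffer into the CCIPCA algorithm in batches, where iterative calculation starts once a whole batch has been input, to obtain the feature data of the original data.
In one embodiment, the step in which the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data includes: when calculating with the CCIPCA algorithm, for the first feature data to be obtained, inputting the first samples and calculating until it converges; then calculating the residual of the subsequent input samples to calculate the next feature data; and so on, calculating the feature data one by one.
In one embodiment, after the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: performing correlation matrix analysis on the acquired feature data in batches; and processing, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
In one embodiment, after the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: adding the output feature data to a visualized scatter plot in real time.
In one embodiment, after the processor inputs the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method includes: classifying the output feature data; and inputting the classified feature data into the corresponding operation model for calculation.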
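A person of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by computer-readable instructions that instruct the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media provided in this application and used in the embodiments can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.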

Claims (20)

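  1. A method for data feature extraction, characterized in that it is used to extract data features from data on a blockchain, the method comprising:
    acquiring original data on the blockchain;
    inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
  2. The method for data feature extraction according to claim 1, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    windowing the original data;
    inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
  3. The method for data feature extraction according to claim 1, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    storing the acquired original data in a buffer;
    inputting the original data in the buffer into the CCIPCA algorithm in batches, and starting iterative calculation once a whole batch of original data has been input, to obtain the feature data of the original data.
  4. The method for data feature extraction according to claim 1, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    when calculating with the CCIPCA algorithm, for the first feature data to be obtained, inputting the first samples and calculating until it converges, calculating the residual of the subsequent input samples to calculate the next feature data, and so on, calculating the feature data one by one.
  5. The method for data feature extraction according to claim 1, characterized in that, after the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method comprises:
    performing correlation matrix analysis on the acquired feature data in batches;
    processing, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
  6. The method for data feature extraction according to claim 1, characterized in that, after the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method comprises:
    adding the output feature data to a visualized scatter plot in real time.
  7. The method for data feature extraction according to claim 1, characterized in that, after the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data, the method comprises:
    classifying the output feature data;
    inputting the classified feature data into the corresponding operation model for calculation.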
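  8. An apparatus for data feature extraction, characterized in that it is used to extract data features from data on a blockchain, the apparatus comprising:
    an acquisition unit, configured to acquire original data on the blockchain;
    a feature extraction unit, configured to input the original data into the CCIPCA algorithm to calculate the feature data of the original data.
  9. The apparatus for data feature extraction according to claim 8, characterized in that the feature extraction unit comprises:
    a windowing module, configured to window the original data;
    a first calculation module, configured to input the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
  10. The apparatus for data feature extraction according to claim 8, characterized in that the feature extraction unit comprises:
    a buffer module, configured to store the acquired original data in a buffer;
    a second calculation module, configured to input the original data in the buffer into the CCIPCA algorithm in batches and to start iterative calculation once a whole batch of original data has been input, to obtain the feature data of the original data.
  11. The apparatus for data feature extraction according to claim 8, characterized in that the feature extraction unit comprises:
    a third calculation unit, configured to, when calculating with the CCIPCA algorithm, for the first feature data to be obtained, input the first samples and calculate until it converges, calculate the residual of the subsequent input samples to calculate the next feature data, and so on, calculating the feature data one by one.
  12. The apparatus for data feature extraction according to claim 8, characterized in that the apparatus for data feature extraction further comprises:
    a correlation analysis unit, configured to perform correlation matrix analysis on the acquired feature data in batches;
    a processing unit, configured to process, according to a preset rule, the feature data in a batch that is uncorrelated with the other feature data in that batch.
  13. The apparatus for data feature extraction according to claim 8, characterized in that the apparatus for data feature extraction further comprises:
    an adding unit, configured to add the output feature data to a visualized scatter plot in real time.
  14. The apparatus for data feature extraction according to claim 8, characterized in that the apparatus for data feature extraction further comprises:
    a classification unit, configured to classify the output feature data;
    an operation unit, configured to input the classified feature data into the corresponding operation model for calculation.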
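  15. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, characterized in that the processor, when executing the computer-readable instructions, implements a method for data feature extraction used to extract data features from data on a blockchain, the method comprising:
    acquiring original data on the blockchain;
    inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
  16. The computer device according to claim 15, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    windowing the original data;
    inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.
  17. The computer device according to claim 15, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    storing the acquired original data in a buffer;
    inputting the original data in the buffer into the CCIPCA algorithm in batches, and starting iterative calculation once a whole batch of original data has been input, to obtain the feature data of the original data.
  18. The computer device according to claim 15, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    when calculating with the CCIPCA algorithm, for the first feature data to be obtained, inputting the first samples and calculating until it converges, calculating the residual of the subsequent input samples to calculate the next feature data, and so on, calculating the feature data one by one.
  19. A non-volatile computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement a method for data feature extraction used to extract data features from data on a blockchain, the method comprising:
    acquiring original data on the blockchain;
    inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data.
  20. The non-volatile computer-readable storage medium according to claim 19, characterized in that the step of inputting the original data into the CCIPCA algorithm to calculate the feature data of the original data comprises:
    windowing the original data;
    inputting the original data within the window into the CCIPCA algorithm to calculate the feature data of the original data.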
PCT/CN2018/095388 2018-04-20 2018-07-12 数据特征提取的方法、装置、计算机设备和存储介质 WO2019200738A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810362785.5 2018-04-20
CN201810362785.5A CN108763305A (zh) 2018-04-20 2018-04-20 数据特征提取的方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2019200738A1 true WO2019200738A1 (zh) 2019-10-24

Family

ID=64011024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095388 WO2019200738A1 (zh) 2018-04-20 2018-07-12 数据特征提取的方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN108763305A (zh)
WO (1) WO2019200738A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245514B (zh) * 2019-04-30 2021-09-03 清华大学 一种基于区块链的分布式计算方法及系统
US11164658B2 (en) 2019-05-28 2021-11-02 International Business Machines Corporation Identifying salient features for instances of data
CN110569654B (zh) * 2019-08-30 2020-05-12 广州奇化有限公司 供应链快速响应模式的区块链可信数据处理方法及装置
CN110705321B (zh) * 2019-10-16 2023-02-28 榆林学院 计算机辅助翻译系统
CN115048278A (zh) * 2019-12-13 2022-09-13 厦门华厦学院 一种移动终端通信故障采集系统
CN111008227A (zh) * 2019-12-27 2020-04-14 广西民族师范学院 一种数据分析处理平台
CN117310348B (zh) * 2023-11-23 2024-03-12 东莞市时实电子有限公司 一种电源适配器故障实时监测方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758283A (zh) * 2005-11-03 2006-04-12 复旦大学 模拟多尺度交叠感受野的神经网络及其建立方法和应用
US20120170659A1 (en) * 2009-09-04 2012-07-05 Stmicroelectronics Pvt. Ltd. Advance video coding with perceptual quality scalability for regions of interest
CN104933089A (zh) * 2015-05-15 2015-09-23 江苏博智软件科技有限公司 一种基于加速迭代的大数据集谱聚类的方法
CN107194950A (zh) * 2017-04-26 2017-09-22 天津大学 一种基于慢特征分析的多人跟踪方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563260A (zh) * 2016-06-30 2018-01-09 中国矿业大学 一种基于主成分分析和最近邻图的密度峰值聚类方法及系统
CN107633254A (zh) * 2017-07-25 2018-01-26 平安科技(深圳)有限公司 建立预测模型的装置、方法及计算机可读存储介质
CN107483969A (zh) * 2017-09-19 2017-12-15 上海爱优威软件开发有限公司 一种基于pca的数据传输方法及系统

Also Published As

Publication number Publication date
CN108763305A (zh) 2018-11-06

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 23/02/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18915660

Country of ref document: EP

Kind code of ref document: A1