CN115687907A - Data processing method, apparatus, device, medium, and program product - Google Patents


Info

Publication number
CN115687907A
CN115687907A
Authority
CN
China
Prior art keywords
data
view
dimension
split
splitting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211043758.4A
Other languages
Chinese (zh)
Inventor
李熠
杨晓然
邬子庄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211043758.4A priority Critical patent/CN115687907A/en
Publication of CN115687907A publication Critical patent/CN115687907A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data processing method applicable to the field of big data technology. The method comprises: acquiring raw data, preprocessing the raw data, and obtaining view data to be split, where the view data to be split is single-view data; performing view splitting on the view data to be split by dimension based on a principal component analysis method, thereby converting it into multi-view data; and building a model using a single-view algorithm or a multi-view algorithm based on the multi-view data. Here, the single-view data is high-dimensional data whose dimensions correspond to entity features, and the multi-view data is data distributed across multiple views, where the number of dimensions per view may be the same or different and the dimensions of all views together equal the dimensionality of the single-view data. The present disclosure also provides a data processing apparatus, a device, a storage medium, and a program product.

Description

Data processing method, apparatus, device, medium, and program product
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data processing method, apparatus, device, medium, and program product.
Background
With the rapid development of big data technology, big data modeling is widely applied in fields such as finance, commerce, and government affairs. The essence of big data modeling is to extract useful information from data in its feature space and use it for purposes such as classification and regression. Data modeling methods can be divided into two categories: single-view and multi-view. Single-view means that all samples are drawn from the same data distribution, and it is the basis of modeling research; multi-view means that samples are drawn from two or more different sources, so their distribution is more complex, but the information they contain is richer and its structure is reflected more clearly.
When modeling with single-view data, the ever-increasing complexity of application scenarios means that much of the single-view data used for modeling is high-dimensional. In single-view modeling with high-dimensional data, excessive dimensionality easily leads to problems such as high redundancy among features and the Hughes phenomenon (the deterioration of classifier performance as dimensionality grows while the sample size stays fixed). Moreover, when the data dimensionality is high, structural information in the data is hard to mine, and traditional dimension reduction methods lose information. How to reduce feature redundancy and the Hughes phenomenon when modeling with single-view data, while losing no original data information and mining the associations between data more fully, is a problem that urgently needs to be solved.
Disclosure of Invention
In view of the foregoing, embodiments of the present disclosure provide a data processing method, apparatus, device, medium, and program product that improve data utilization in single-view data modeling, reduce the feature redundancy rate, and mitigate the Hughes phenomenon.
According to a first aspect of the present disclosure, there is provided a data processing method including: acquiring raw data, preprocessing the raw data, and obtaining view data to be split, where the view data to be split is single-view data; performing view splitting on the view data to be split by dimension based on a principal component analysis method, thereby converting it into multi-view data; and building a model using a single-view algorithm or a multi-view algorithm based on the multi-view data, where the single-view data is high-dimensional data whose dimensions correspond to entity features, and the multi-view data is data distributed across multiple views, the number of dimensions per view being the same or different, with the dimensions of all views together equal to the dimensionality of the single-view data.
According to an embodiment of the present disclosure, performing view splitting on the view data to be split by dimension based on a principal component analysis method and converting it into multi-view data comprises a dimension splitting step, which includes: constructing a principal component analysis classification indicator function, where the function is built on a feature classification indicator matrix, an n × k matrix in which n is the dimensionality of the data to be split and k is the preset number of view splits; solving the optimization problem of the principal component analysis classification indicator function to obtain a solution result, where the solution result includes a weight distribution result of the feature classification indicator matrix, namely a matrix formed by the eigenvectors corresponding to the k smallest eigenvalues; and performing view splitting on the dimensions of the view data to be split based on this weight distribution result to obtain a data dimension splitting result, where the view data to be split has m dimensions and splitting the ith data dimension comprises: assigning the ith data dimension to the view corresponding to the largest value in the row of the feature classification indicator matrix corresponding to the ith data dimension, where i ∈ [1, m].
According to an embodiment of the present disclosure, after the dimension splitting result for the m-dimensional data is obtained, the method further includes: distributing the view data to be split according to the dimension splitting result to obtain the multi-view data.
According to an embodiment of the present disclosure, the preset number of view splits is obtained by processing the dimensions of the data to be split with an automatic clustering algorithm or a similarity algorithm.
According to an embodiment of the present disclosure, the automatic clustering algorithm comprises one of density-based spatial clustering of applications with noise (DBSCAN), fuzzy clustering, and K-means clustering; and/or the similarity algorithm comprises one of cosine similarity, distance-based similarity, and the Pearson correlation coefficient.
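As an illustration of the similarity option above, the pairwise similarity between data dimensions can be computed and then fed to a clustering step whose cluster count suggests the preset number of view splits k. The patent does not fix the exact formula, so a plain (uncentered) cosine similarity is assumed in this sketch:

```python
import numpy as np

def dimension_cosine_similarity(X):
    """Cosine similarity between the dimensions (columns) of a data
    matrix X of shape (n_samples, n_dims). Illustrative sketch only;
    the exact similarity measure used is an assumption."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0            # guard against all-zero columns
    U = X / norms                      # unit-length columns
    return U.T @ U                     # (n_dims, n_dims) similarity matrix
```

A clustering algorithm such as K-means could then group the rows of this matrix; the number of well-separated groups would serve as a candidate for k.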
According to an embodiment of the present disclosure, preprocessing the raw data includes: normalizing the raw data.
According to an embodiment of the present disclosure, solving the optimization problem of the principal component analysis classification indicator function includes: solving it with a generalized eigenvalue method.
According to an embodiment of the present disclosure, the multi-view algorithm includes one of canonical correlation analysis, multiple canonical correlation analysis, kernel canonical correlation analysis, locality-preserving canonical correlation analysis, discriminative canonical correlation analysis, generalized multi-view analysis, multi-view discriminant analysis, and multi-view dimension reduction models.
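Of the multi-view algorithms listed above, canonical correlation analysis (CCA) is the most basic. A minimal two-view sketch follows; the function names and the small ridge term added for numerical stability are illustrative assumptions, not part of the patent:

```python
import numpy as np

def _inv_sqrt(S):
    """Symmetric inverse square root via eigendecomposition."""
    d, U = np.linalg.eigh(S)
    return U @ np.diag(1.0 / np.sqrt(d)) @ U.T

def first_canonical_correlation(X, Y, reg=1e-6):
    """Minimal CCA sketch for two views X and Y (each n_samples x
    n_features): whiten both covariances, then take the largest
    singular value of the whitened cross-covariance."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    M = _inv_sqrt(Sxx) @ Sxy @ _inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)[0]
```

If one view is an exact linear function of the other, the first canonical correlation approaches 1, reflecting the consistency between views that multi-view methods exploit.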
According to an embodiment of the present disclosure, the raw data contains user characteristic data, and the model is used to construct a user profile.
A second aspect of the present disclosure provides a data processing apparatus comprising: an acquisition module configured to acquire raw data, preprocess it, and obtain view data to be split, where the view data to be split is single-view data, the single-view data is high-dimensional data, and the dimensions of the high-dimensional data correspond to entity features; a data splitting module configured to perform view splitting on the view data to be split by dimension based on a principal component analysis method and convert it into multi-view data, where the multi-view data is data distributed across multiple views, the number of dimensions per view being the same or different, with the dimensions of all views together equal to the dimensionality of the single-view data; and a model building module configured to build a model using a single-view algorithm or a multi-view algorithm based on the multi-view data.
According to an embodiment of the present disclosure, the data splitting module may include a construction submodule, a solving submodule, and a dimension splitting submodule. The construction submodule is configured to construct a principal component analysis classification indicator function, where the function is built on a feature classification indicator matrix, an n × k matrix in which n is the dimensionality of the data to be split and k is the preset number of view splits. The solving submodule is configured to solve the optimization problem of the principal component analysis classification indicator function and obtain a solution result, where the solution result includes a weight distribution result of the feature classification indicator matrix, namely a matrix formed by the eigenvectors corresponding to the k smallest eigenvalues. The dimension splitting submodule is configured to perform view splitting on the dimensions of the view data to be split based on this weight distribution result and obtain a data dimension splitting result, where the view data to be split has m dimensions and splitting the ith data dimension comprises: assigning the ith data dimension to the view corresponding to the largest value in the row of the feature classification indicator matrix corresponding to the ith data dimension, where i ∈ [1, m].
According to an embodiment of the present disclosure, the data splitting module may include a construction submodule, a solving submodule, a dimension splitting submodule, and a data allocation submodule. The data allocation submodule is configured to distribute the view data to be split according to the dimension splitting result to obtain the multi-view data.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above-described data processing method.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described data processing method.
The fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described data processing method.
According to the method provided by the embodiments of the present disclosure, drawing on the idea of multi-view modeling, the dimensions of single-view data are split with a principal component analysis method and the single-view data is converted into multi-view data without losing any dimension information. This increases the structural information of the data, enlarges the range of models selectable during data modeling, and to a certain extent alleviates problems such as high redundancy among features and the Hughes phenomenon caused by excessive data dimensionality.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, taken in conjunction with the accompanying drawings of which:
fig. 1 schematically illustrates an application scenario diagram of a data processing method, apparatus, device, medium, and program product according to embodiments of the present disclosure.
Fig. 2 schematically shows a flow chart of a data processing method according to an embodiment of the present disclosure.
Fig. 3 schematically shows a flow chart of a dimension splitting method according to an embodiment of the present disclosure.
Fig. 4 schematically shows a flowchart of a method of splitting an ith data dimension according to an embodiment of the present disclosure.
Fig. 5 schematically illustrates a flow chart of a method of view data allocation according to an embodiment of the present disclosure.
Fig. 6 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
FIG. 7 is a block diagram that schematically illustrates a data splitting module, in accordance with an embodiment of the present disclosure.
FIG. 8 is a block diagram schematically illustrating a structure of a data splitting module according to an embodiment of the present disclosure.
Fig. 9 schematically shows a block diagram of an electronic device adapted to implement a data processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
With the rapid development of big data technology, big data modeling is widely applied in fields such as finance, commerce, and government affairs. The essence of big data modeling is to extract useful information from data in its feature space and use it for purposes such as classification and regression. Data modeling methods can be divided into two categories: single-view and multi-view. Single-view means that all samples are drawn from the same data distribution, and it is the basis of modeling research. Multi-view means that samples are drawn from two or more different sources; their distribution is more complex, but the information they contain is richer and its structure is reflected more clearly. For example, when one recognizes a person only by appearance, the data can be considered single-view; when one recognizes a person from multiple perspectives such as appearance, voice, and motion, the data can be considered multi-view. Typical examples of multi-view data include multimedia video, which can be represented by both its video and audio signals, and a web page, which can be represented by both its hyperlinks and its textual content. The hidden characteristics behind multi-view data are consistent, but because the statistical properties of the views differ, each view contains information unique to one aspect of the object, and thus the views differ from one another. Using the consistency and complementarity of multi-view data describes the data more comprehensively, enabling deeper understanding and analysis of the target, and this has proved effective in practical applications.
In many scenarios, however, the data used for modeling is still single-view. When modeling with single-view data, the ever-increasing complexity of application scenarios means that much of the single-view data used for modeling is high-dimensional. In single-view modeling with high-dimensional data, excessive dimensionality easily leads to problems such as high redundancy among features and the Hughes phenomenon. Moreover, when the data dimensionality is high, structural information in the data is hard to mine, and traditional dimension reduction methods lose information. How to reduce feature redundancy and the Hughes phenomenon when modeling with single-view data, while losing no original data information and mining the associations between data more fully, is a problem that urgently needs to be solved.
To solve the above problems in the prior art, an embodiment of the present disclosure provides a data processing method, including: acquiring raw data, preprocessing the raw data, and obtaining view data to be split, where the view data to be split is single-view data; performing view splitting on the view data to be split by dimension based on a principal component analysis method, thereby converting it into multi-view data; and building a model using a single-view algorithm or a multi-view algorithm based on the multi-view data, where the single-view data is high-dimensional data whose dimensions correspond to entity features, and the multi-view data is data distributed across multiple views, the number of dimensions per view being the same or different, with the dimensions of all views together equal to the dimensionality of the single-view data.
According to the method provided by the embodiments of the present disclosure, drawing on the idea of multi-view modeling, the dimensions of single-view data are split with a principal component analysis method, converting the single-view data into multi-view data without losing dimension information and increasing the structural information of the data. The split multi-view data can be modeled with a multi-view algorithm, or merged back into single-view data and modeled with a single-view algorithm. This enlarges the range of models selectable during data modeling and, to a certain extent, alleviates problems such as high redundancy among features and the Hughes phenomenon caused by excessive data dimensionality.
It should be noted that the data processing method, apparatus, device, medium, and program product provided in the embodiments of the present disclosure may be used for data preprocessing before modeling in big data technology, and may also be used in various fields other than big data technology, such as finance. The fields of application of the data processing method, apparatus, device, medium, and program product provided by the embodiments of the present disclosure are not limited.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, application, and other handling of the personal information of the users involved all comply with the relevant laws and regulations, necessary confidentiality measures are taken, and public order and good customs are not violated.
In the technical scheme of the disclosure, before the personal information of the user is obtained or collected, the authorization or the consent of the user is obtained.
The above-described operations for carrying out at least one of the objects of the present disclosure will be described with reference to the accompanying drawings and description thereof.
Fig. 1 schematically illustrates an application scenario diagram of a data processing method, apparatus, device, medium, and program product according to embodiments of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only). For example, a user may use terminal devices 101, 102, 103 to send raw data to server 105 over network 104 for modeling.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device. For example, after receiving the raw data sent by the terminal devices 101, 102, and 103 through the network 104, the server 105 may complete view data splitting and model building based on the data processing method of the embodiment of the present disclosure, and send the model processing result to the terminal devices 101, 102, and 103 through the network 104.
It should be noted that the data processing method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the data processing apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The data processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the data processing apparatus provided in the embodiments of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The data processing method, apparatus, device, medium, and program product of the disclosed embodiments will be described in detail below with reference to figs. 2-5, based on the scenario described in fig. 1.
Fig. 2 schematically shows a flow chart of a data processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the data processing method of the embodiment includes operations S210 to S230, and the data processing method may be performed by a processor, or may be performed by any electronic device including a processor.
In operation S210, original data is obtained, the original data is preprocessed, and view data to be split is obtained, where the view data to be split is single view data.
According to an embodiment of the present disclosure, the raw data is single-view data. The single-view data is high-dimensional data whose dimensions correspond to the features of an entity; that is, each dimension can describe an attribute of some aspect of the entity. In embodiments of the present disclosure, high-dimensional data may be data with tens of dimensions or more; in some application scenarios it may be data with more than 50 dimensions. Before view splitting, the raw data may be preprocessed to optimize the data samples. Typical preprocessing includes data cleansing and data normalization, for example removing unique attributes, filling in missing and abnormal values, and assigning meaning to feature values. In some embodiments, the preprocessing preferably includes normalizing the raw data. Specifically, common normalization methods such as the maximum-minimum value method, 0-mean normalization, logarithmic transformation, inverse cotangent transformation, or the Z-score method can be used to scale all dimensions into the range [0, 1], eliminating the dimensional (unit) influence of the features so that they become comparable, which improves the soundness of the data splitting. After preprocessing, the view data to be split is obtained for view splitting.
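The maximum-minimum value normalization mentioned above can be sketched as follows; mapping constant columns to 0 is an assumption for illustration, as the patent does not specify that edge case:

```python
import numpy as np

def min_max_normalize(X):
    """Scale every dimension (column) of X into [0, 1] with the
    maximum-minimum value method. X has shape (n_samples, n_dims)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0              # constant columns map to 0
    return (X - lo) / span
```

After this step, every feature lies in the same [0, 1] range, so no single dimension dominates the subsequent splitting purely because of its units.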
In operation S220, view splitting is performed on the view data to be split according to dimensions based on a principal component analysis method, and the view data to be split is converted into multi-view data.
In operation S230, a model is built using a single-view algorithm or a multi-view algorithm based on the multi-view data.
In the embodiments of the present disclosure, view splitting is performed on the view data to be split, by dimension, using the idea of principal component analysis. Traditional principal component analysis is a typical data analysis method: it linearly transforms the original data into a set of linearly independent representations of the dimensions, extracts the principal feature components of the data, and is commonly used to reduce the dimensionality of high-dimensional data. However, because dimension reduction extracts only the principal components or transforms many dimensions, dimension information is inevitably lost. To avoid this loss, the embodiments of the present disclosure borrow the idea of principal component analysis: a feature classification indicator matrix is obtained by maximizing the dispersion among samples, each dimension feature is then allocated to the view with the largest projection weight via this matrix, and the original single-view data is thereby split across multiple views to form multi-view data. After splitting, the data across views are intrinsically consistent and complementary, enabling deeper understanding and analysis of the target. Consistency means that the data in different views are internally related; complementarity means that the view data differ, so each view contains information unique to some aspect of the target object described by the original data. This mutually complementary information describes the data more comprehensively and can be used to mine the associations among features deeply and build better-performing models. After the multi-view data is obtained, it may be modeled with a multi-view algorithm.
It can be understood that the multi-view data, once aggregated, is exactly the original single-view data, so modeling can also be done with a single-view algorithm. By splitting the data along its dimensions, the data processing method enlarges the range of usable data and selectable models and suits more modeling scenarios. The number of dimensions per view may be the same or different, and the dimensions of all views together equal the dimensionality of the single-view data, so the information of every dimension of the original data is preserved. In addition, because the data dimensionality within each view is reduced, problems caused by the excessive dimensionality of the original single-view data, such as high redundancy among features and the Hughes phenomenon, are alleviated to a certain extent, and computational resources are saved when modeling on the view data.
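The allocation rule described above — each dimension goes to the view holding its largest projection weight, and no dimension is dropped — can be sketched as follows. The function names are illustrative, and taking the absolute value of the weights is an assumption:

```python
import numpy as np

def split_dimensions(W):
    """Assign each data dimension to one view: dimension i goes to the
    view with the largest (absolute) projection weight in row i of the
    feature classification indicator matrix W, shape (n_dims, k)."""
    return np.argmax(np.abs(W), axis=1)

def to_multi_view(X, assignment, k):
    """Distribute the columns of single-view data X (n_samples x n_dims)
    into k views; the per-view dimension counts sum to the original
    dimensionality, so no information is lost."""
    return [X[:, assignment == j] for j in range(k)]
```

Because the views partition the columns of X, concatenating them recovers (up to column order) the original single-view data, which is why single-view algorithms remain applicable after splitting.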
In the embodiments of the present disclosure, the view data to be split is view-split by dimension based on a principal component analysis method, and the key step in converting the view data to be split into multi-view data is dimension splitting.
Fig. 3 schematically shows a flow chart of a dimension splitting method according to an embodiment of the present disclosure.
As shown in fig. 3, the dimension splitting method of this embodiment includes operations S310 to S330.
In operation S310, a principal component analysis classification indication function is constructed, where the principal component analysis classification indication function is constructed based on a feature classification indication matrix, and the feature classification indication matrix is an n × k matrix, where n is a dimension of data to be split, and k is a preset view splitting number.
In some embodiments, the preset number of view splits may be determined based on expert experience.
In some embodiments, solving the optimization problem of the principal component analysis classification indication function comprises: solving the optimization problem based on a generalized eigenvalue solution method. Different methods, including generalized eigenvalue solution and singular value decomposition, may be applied here. Solving the optimization problem by the generalized eigenvalue method classifies the features and thereby completes the view splitting.
Specifically, the optimization problem of the principal component analysis classification indication function can be represented by formula (1):
max_W tr(W^T X X^T W),  s.t. W^T W = I        (1)
In formula (1), W ∈ R^(n×k) is the feature classification indication matrix, n is the original data dimension, k is the number of views, and tr(W^T XX^T W) is the principal component term used to maximize the divergence between the variables after feature aggregation. The meaning of solving the optimization problem of the principal component analysis classification indication function in formula (1) is as follows: the original n-dimensional feature data is projected onto a group of k mutually orthogonal bases by maximizing the divergence between the variables after feature aggregation (principal component analysis), which yields the projection weight of each feature onto the k mutually orthogonal dimensions, namely W.
In operation S320, the optimization problem of the principal component analysis classification indication function is solved to obtain a solution result, where the solution result includes the weight distribution result of the feature classification indication matrix, and this weight distribution result is the matrix formed by the eigenvectors corresponding to the first k smallest eigenvalues, k being the view splitting number. Specifically, the eigenvalue problem XX^T W = (−α)W can be solved; the solution of W is the matrix formed by the eigenvectors corresponding to the first k smallest eigenvalues α (equivalently, the k largest eigenvalues of XX^T), namely the weight distribution result of the feature classification indication matrix, and the features can be split using this weight distribution result.
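As a hedged illustration, the solving step of operation S320 can be sketched with an eigendecomposition of the feature scatter matrix. The function name, the features-as-rows array layout, and the use of `numpy.linalg.eigh` are assumptions of this sketch, not part of the disclosure.

```python
import numpy as np

def feature_classification_matrix(X, k):
    """Sketch of solving the principal component analysis classification
    indication function for the feature classification indication matrix W.

    X : (n, s) array with n feature dimensions (rows) and s samples.
    k : preset view splitting number.
    Returns W with shape (n, k).
    """
    S = X @ X.T                            # feature scatter matrix XX^T, shape (n, n)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues returned in ascending order
    # Maximizing tr(W^T XX^T W) selects the k largest eigenvalues of XX^T,
    # i.e. the k smallest eigenvalues of -XX^T as written in the text above.
    W = eigvecs[:, -k:]                    # (n, k) projection weights
    return W
```

The columns of the returned W are orthonormal eigenvectors, matching the orthogonality constraint W^T W = I of the optimization problem.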
In operation S330, view splitting is performed on the dimension of the view data to be split based on the weight distribution result of the feature classification indication matrix, so as to obtain a data dimension splitting result, where the dimension of the view data to be split is m.
Fig. 4 schematically shows a flowchart of a method of splitting an ith data dimension according to an embodiment of the present disclosure.
As shown in fig. 4, the method for splitting the ith data dimension of this embodiment includes operation S410.
In operation S410, the ith data dimension is split into a view corresponding to the largest eigenvalue in the eigenvector corresponding to the ith data dimension in the feature classification indication matrix, where i ∈ [1, m ].
According to the solution result of the feature classification indication matrix, each row of the matrix represents the projection weights of one feature onto the different views; within a row, the column holding the maximum value corresponds to the view with the largest projection weight after feature aggregation, that is, the view carrying the most information of that feature after projection. Therefore, the feature of the corresponding dimension should be split into the view corresponding to the column where the maximum value is located. It should be understood that each of the m data dimensions may be split to its corresponding view according to this method of splitting the ith data dimension, so as to complete the data dimension splitting.
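The row-wise assignment of operation S410 amounts to an argmax over each row of W. In the sketch below (function name assumed), the absolute values of the weights are compared, since eigenvector signs are arbitrary — a choice made for this illustration rather than stated in the disclosure.

```python
import numpy as np

def split_dimensions(W):
    """Assign each of the n feature dimensions to the view with the largest
    projection weight, i.e. the row-wise argmax over the n x k matrix W."""
    # Absolute values are compared because eigenvector signs are arbitrary.
    return np.argmax(np.abs(W), axis=1)   # length-n array of view indices in [0, k)
```

For example, a row [0.7, 0.2, 0.1] sends its feature to the first view, as in the bank example later in this disclosure.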
The embodiment of the disclosure skillfully applies an optimization problem solving method of principal component analysis, constructs a principal component analysis optimization function based on the number of views, and solves a characteristic classification indication matrix. And finally, searching for the features with the maximum projection weight for each view in the aggregation process based on the solving result of the feature classification indication matrix, extracting the corresponding feature dimension as the splitting dimension of the current view, and completing dimension splitting.
According to the embodiment of the disclosure, after obtaining the dimension splitting result of the m dimension data, the method further includes a step of view data allocation.
Fig. 5 schematically illustrates a flow chart of a method of view data allocation according to an embodiment of the present disclosure.
As shown in fig. 5, the method of view data allocation of this embodiment includes operation S510.
In operation S510, view data is allocated to the view data to be split according to the dimension splitting result to obtain the multi-view data. After the dimension splitting is completed, the data can be recombined based on the dimension splitting rule: the original single-view data with n features is split into data of k views, and the sum of the feature counts of all views is n, thereby completing the data splitting. This data splitting method ensures that the features within a view are consistent, namely that the current view contains the most information of the corresponding features, while the views as a whole retain consistency and complementarity, so that data association information is mined to a greater extent and the utilization rate of the data is improved.
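The recombination of operation S510 can be sketched as slicing the sample matrix by the per-dimension view assignment; the helper name and the samples-by-features layout are assumptions of this sketch.

```python
import numpy as np

def allocate_views(data, assignment, k):
    """Recombine single-view data (s samples x n features) into k views
    according to the dimension splitting result `assignment`, where
    assignment[j] is the view index of feature j."""
    views = [data[:, np.flatnonzero(assignment == v)] for v in range(k)]
    # The feature counts of the views sum back to n: no dimension is lost.
    assert sum(v.shape[1] for v in views) == data.shape[1]
    return views
```

Every view keeps all s samples and a disjoint subset of the n features, so the aggregation of the views is exactly the original single-view data.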
In some preferred embodiments, the preset number of view splits is obtained by processing the dimensions of the data to be split with an automatic clustering algorithm or a similarity algorithm. Specifically, the associations between the feature dimensions can be preliminarily mined by automatically clustering the feature dimensions with an automatic clustering algorithm, or by computing the similarity between feature dimensions with a similarity algorithm. Using the clustering or similarity result as the basis for setting the view splitting number facilitates optimization of the view splitting result, further improves the rigor and accuracy of the data processing method of the embodiments of the present disclosure, and improves the model performance of subsequent modeling.
In some specific embodiments, the automatic clustering algorithm comprises one of a density-based spatial clustering of applications with noise algorithm, a fuzzy clustering algorithm, and a K-means clustering algorithm; and/or the similarity algorithm comprises one of a cosine similarity algorithm, a distance similarity algorithm, and the Pearson correlation coefficient. Among the automatic clustering algorithms, density-based spatial clustering of applications with noise (DBSCAN) is preferred. DBSCAN is an unsupervised machine-learning clustering algorithm: it clusters data points without pre-labeled targets, does not require the number of clusters to be specified, is robust to outliers, and clusters well on clusters of arbitrary shape and size. DBSCAN has no centroids; clusters are formed by connecting neighboring points together. In a specific embodiment of the present disclosure, the DBSCAN clustering algorithm or the cosine similarity algorithm may be applied to automatically perform preliminary mining of feature correlations so as to preset a more accurate number of views.
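One way to realize this preliminary mining is with scikit-learn's DBSCAN applied to the feature dimensions themselves. The function name and the `eps`/`min_samples` defaults below are illustrative assumptions and would be tuned per data set.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def preset_view_count(data, eps=0.5, min_samples=2):
    """Preliminarily mine feature associations by clustering the feature
    dimensions (columns of `data`, shape samples x features) with DBSCAN;
    the number of clusters found can serve as a setting basis for the
    preset view splitting number k."""
    # Treat each feature (column) as a point in sample space and cluster them.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data.T)
    # Noise points are labelled -1 and excluded from the cluster count.
    return max(len(set(labels) - {-1}), 1)
```

Because DBSCAN needs no preset cluster count, the number of dense feature groups it discovers can be read off directly as a candidate k.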
In some embodiments, the multi-view algorithm includes one of canonical correlation analysis (CCA), locality-preserving canonical correlation analysis, multiple canonical correlation analysis (MCCA), kernel canonical correlation analysis, discriminative canonical correlation analysis, generalized multi-view analysis, multi-view discriminant analysis, and the multi-view dimensionality reduction model. The idea of CCA is to reduce the dimensionality of centered (zero-mean) two-view data, extracting a number of pairs of canonical variables during the reduction and maximizing the correlation between the variables in each pair. Locality-preserving canonical correlation analysis (LPCCA) removes non-adjacent points when canonical correlation analysis computes the correlation matrix, so as to preserve the local manifold structure of the sample points, thereby improving the discrimination capability of the model to a certain extent. In addition, when the overall correlation is computed by considering not only the correlation of the same point across different views but also the correlations of adjacent points across different views, another locality-preserving variant (ALPCCA) is formed; in this way the local neighborhood information of the data is added into the model, so that the local manifold structure of the data is preserved during dimension reduction. Multiple canonical correlation analysis (MCCA) can reduce the dimensionality of data with more views, extending the method from two views to many.
Kernel canonical correlation analysis (KCCA) introduces the idea of kernel functions on the basis of CCA: sample points are first mapped nonlinearly into a high-dimensional kernel space, so that nonlinearly distributed sample points become linearly distributed there, and traditional canonical correlation analysis is then used to reduce the dimensionality of the two-view data in that kernel space. The goal of discriminative canonical correlation analysis (DCCA) is to find a new subspace in which same-class points from different views have the greatest correlation while different-class points have less correlation. Generalized multi-view analysis (GMA) considers both each single view itself and the relationships between different views; its goal is to find a dimension reduction direction for each view such that, after reduction, heterogeneous points within a single view are far apart and different views are strongly correlated. The goal of multi-view discriminant analysis (MvDA) is to find a common subspace while maintaining intra-class compactness and inter-class separation. The multi-view dimensionality reduction model (MDcR) maximizes, in a kernel space, the similarity between the dimension-reduced data of different views together with the correlation within each single view, without requiring that the data of every view be reduced into the same space. In practical applications, the multi-view model is selected according to the data characteristics of the corresponding application scenario, so that the correlations among different views can be maximized, the data information can be fully mined, and the expressive effect of the model improved.
In some embodiments, the raw data includes user characteristic data, and the model is used to construct a user portrait. It can be understood that the data processing method of the embodiments of the present disclosure can be used for view splitting of user characteristic data, and the user characteristic data can then be modeled with a multi-view algorithm to construct the user portrait. When the dimensionality of the user characteristic data is high, the traditional single-view modeling method suffers from modeling problems caused by the excessive data dimensionality, such as a high redundancy rate between features and the Hughes phenomenon. By applying the data processing method of the embodiments of the present disclosure, the dimensions of the user characteristic data are view-split and the model is constructed based on the split views, so that the relevance among the user characteristic data can be better mined, the accuracy and efficiency of subsequent model learning are effectively improved, the over-fitting phenomenon is avoided, and a more accurate user portrait is constructed.
In embodiments of the present disclosure, prior to obtaining user characteristic data, consent or authorization of the user may be obtained. For example, a request may be issued to a user to obtain user characteristic data. The user characteristic data is obtained in case the user agrees or authorizes to obtain the user characteristic data.
It should be understood that the user portrait construction described above is only an exemplary applicable specific scenario of the data processing method of the embodiment of the present disclosure, and does not constitute a limitation to the practical application scenario of the data processing method of the embodiment of the present disclosure.
In one specific example, a bank builds a user portrait. Suppose there are currently 3,000 pieces of customer data, each with 500-dimensional features such as customer age, occupation, assets, deposits, and consumer transactions; the data processing of the embodiments of the present disclosure is applied to split the raw data into 3-view data and build a model to construct the user portrait. Specifically, normalization is performed first: all dimensions are scaled into the range [0, 1] using the max-min method, eliminating the unit effects of the features. Then, the classification indication function based on principal component analysis is constructed and solved, where the feature classification indication matrix W ∈ R^(500×3) is used to split the feature dimensions. Solving the optimization problem yields the weight distribution result of the matrix W. Assuming the first row vector of the weight distribution result matrix is [0.7, 0.2, 0.1], the first view contains the most information for this feature, so the feature corresponding to this row is assigned to the first view. All feature dimensions are distributed in the same way to complete the dimension splitting. Further, the original data can be distributed to each view based on the dimension splitting rule to complete the view splitting, and a model is then built using a multi-view algorithm, so that the user portrait can be established.
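The bank example can be sketched end to end as follows. The data here is synthetic and the feature count is reduced from 500 to 50 purely so the sketch runs quickly; the steps mirror the normalization, solving, and assignment described above.

```python
import numpy as np

# Synthetic stand-in for the bank data: 3,000 customers, 50 features
# (reduced from 500 for illustration), to be split into k = 3 views.
rng = np.random.default_rng(42)
raw = rng.normal(size=(3000, 50)) * rng.uniform(1.0, 100.0, size=50)

# 1. Max-min normalization: scale every dimension into [0, 1].
mins, maxs = raw.min(axis=0), raw.max(axis=0)
data = (raw - mins) / (maxs - mins)

# 2. Solve for the feature classification indication matrix W (n x k).
k = 3
eigvals, eigvecs = np.linalg.eigh(data.T @ data)   # scatter of the 50 features
W = eigvecs[:, -k:]                                # eigenvectors of the k largest eigenvalues

# 3. Assign each feature to the view with the largest projection weight
#    (absolute values, since eigenvector signs are arbitrary).
assignment = np.argmax(np.abs(W), axis=1)
views = [data[:, assignment == v] for v in range(k)]
```

The resulting views partition the 50 features, after which a multi-view (or single-view) algorithm can be applied for modeling.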
Based on the data processing method, the embodiment of the disclosure also provides a data processing device. The apparatus will be described in detail below with reference to fig. 6.
Fig. 6 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the data processing apparatus 600 of this embodiment includes an obtaining module 610, a data splitting module 620, and a model building module 630.
The obtaining module 610 is configured to obtain original data, preprocess the original data, and obtain view data to be split, where the view data to be split is single-view data, the single-view data is high-dimensional data, and dimensions in the high-dimensional data correspond to entity features. In an embodiment, the obtaining module 610 may be configured to perform the operation S210 described above, which is not described herein again.
The data splitting module 620 is configured to perform view splitting on the view data to be split according to dimensions based on a principal component analysis method, and convert the view data to be split into multi-view data, where the multi-view data are data allocated to multiple views, the number of dimensions of the data in each view is the same or different, and the aggregation of the number of dimensions of the data in each view is equal to the dimension of the single-view data. In an embodiment, the data splitting module 620 may be configured to perform the operation S220 described above, which is not described herein again.
The model building module 630 is configured to build a model using a single-view algorithm or a multi-view algorithm based on the multi-view data. In an embodiment, the model building module 630 may be configured to perform the operation S230 described above, which is not described herein again.
According to an embodiment of the present disclosure, the data splitting module may further include a construction submodule, a solution submodule, and a dimension splitting submodule.
FIG. 7 is a block diagram that schematically illustrates a data splitting module, in accordance with an embodiment of the present disclosure.
As shown in FIG. 7, the data splitting module 620 of this embodiment includes a construction sub-module 6201, a solution sub-module 6202, and a dimension splitting sub-module 6203.
The constructing submodule 6201 is configured to construct a principal component analysis classification indication function, where the principal component analysis classification indication function is constructed based on a feature classification indication matrix, the feature classification indication matrix is an n × k matrix, where n is a dimension of data to be split, and k is a preset view splitting number.
The solving submodule 6202 is configured to solve the optimization problem of the principal component analysis classification indicating function, and obtain a solution result, where the solution result includes a weight distribution result of a feature classification indicating matrix, and the weight distribution result of the feature classification indicating matrix is a matrix formed by feature vectors corresponding to minimum feature values corresponding to the view split numbers.
The dimension splitting sub-module 6203 is configured to, based on the weight assignment result of the feature classification indication matrix, perform view splitting on the dimension of the view data to be split, and obtain a data dimension splitting result, where the dimension of the view data to be split is m, and splitting the ith data dimension includes: and splitting the ith data dimension into a view corresponding to the maximum eigenvalue in the eigenvector corresponding to the ith data dimension in the characteristic classification indication matrix, wherein i belongs to [1, m ].
According to an embodiment of the present disclosure, the data splitting module may further include a data allocation submodule.
FIG. 8 is a block diagram schematically illustrating a structure of a data splitting module according to an embodiment of the present disclosure.
As shown in fig. 8, the data splitting module 620 of this embodiment may further include a data allocation sub-module 6204 in addition to the constructing sub-module 6201, the solving sub-module 6202 and the dimension splitting sub-module 6203.
The data allocation submodule 6204 is configured to allocate the view data to be split according to the dimension splitting result, so as to obtain the multi-view data.
According to an embodiment of the present disclosure, any multiple modules of the obtaining module 610, the data splitting module 620, the model building module 630, the constructing sub-module 6201, the solving sub-module 6202, the dimension splitting sub-module 6203, and the data allocating sub-module 6204 may be combined and implemented in one module, or any one module may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 610, the data splitting module 620, the model building module 630, the constructing submodule 6201, the solving submodule 6202, the dimension splitting submodule 6203, and the data allocating submodule 6204 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementation manners of software, hardware, and firmware, or by a suitable combination of any of them. Alternatively, at least one of the acquisition module 610, the data splitting module 620, the model building module 630, the construction sub-module 6201, the solving sub-module 6202, the dimension splitting sub-module 6203, and the data allocation sub-module 6204 may be at least partially implemented as a computer program module that, when executed, may perform corresponding functions.
Fig. 9 schematically shows a block diagram of an electronic device adapted to implement a data processing method according to an embodiment of the present disclosure.
As shown in fig. 9, an electronic apparatus 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. Processor 901 can include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or related chipset(s) and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 901 may also include on-board memory for caching purposes. The processor 901 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
Electronic device 900 may also include input/output (I/O) interface 905, input/output (I/O) interface 905 also connected to bus 904, according to an embodiment of the present disclosure. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be embodied in the device/apparatus/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement a method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the method provided by the embodiment of the disclosure.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 901. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, and downloaded and installed through the communication section 909 and/or installed from the removable medium 911. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the processor 901, performs the above-described functions defined in the system of the embodiment of the present disclosure. The above described systems, devices, apparatuses, modules, units, etc. may be implemented by computer program modules according to embodiments of the present disclosure.
In accordance with embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments of the present disclosure and/or the claims may be made without departing from the spirit and teachings of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (13)

1. A data processing method, comprising:
acquiring original data, preprocessing the original data, and obtaining view data to be split, wherein the view data to be split is single-view data;
performing view splitting on the view data to be split by dimension based on a principal component analysis method, so as to convert the view data to be split into multi-view data; and
building a model using a single-view algorithm or a multi-view algorithm based on the multi-view data,
wherein the single-view data is high-dimensional data whose dimensions correspond to entity features, and the multi-view data is data distributed across a plurality of views, wherein the numbers of dimensions of the data in the respective views are the same or different, and the numbers of dimensions of the data in the views sum to the dimensionality of the single-view data.
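The relationship in claim 1 between single-view and multi-view data can be sketched in a few lines of numpy. This is purely illustrative: the dimension-to-view assignment is fixed by hand here, whereas the patent derives it via principal component analysis; all names and sizes are assumptions, not from the patent.

```python
import numpy as np

# Single-view data: one high-dimensional matrix
# (rows = entities, columns = feature dimensions).
rng = np.random.default_rng(0)
single_view = rng.normal(size=(100, 6))          # 100 entities, 6 dimensions

# Assumed assignment of the 6 dimensions to 3 views (hand-picked here;
# the claimed method would derive it via PCA).
assignment = [0, 0, 1, 1, 1, 2]
views = [single_view[:, [i for i, v in enumerate(assignment) if v == k]]
         for k in range(3)]

dims = [v.shape[1] for v in views]               # dimensions per view
print(dims)                                      # [2, 3, 1]
# The view dimension counts sum to the single-view dimensionality:
print(sum(dims) == single_view.shape[1])         # True
```

Note that each view keeps all 100 entities and only partitions the columns, which is exactly the "same or different dimension numbers, summing to the original dimensionality" condition of the claim.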
2. The method according to claim 1, wherein performing view splitting on the view data to be split by dimension based on the principal component analysis method and converting the view data to be split into multi-view data comprises a dimension splitting step, the dimension splitting step comprising:
constructing a principal component analysis classification indication function, wherein the principal component analysis classification indication function is constructed based on a feature classification indication matrix, the feature classification indication matrix is an n x k matrix, n is the dimensionality of the data to be split, and k is a preset number of view splits;
solving an optimization problem of the principal component analysis classification indication function to obtain a solution result, wherein the solution result comprises a weight distribution result of the feature classification indication matrix, and the weight distribution result is a matrix formed by the eigenvectors corresponding to the smallest eigenvalues, as many as the number of view splits; and
performing view splitting on the dimensions of the view data to be split based on the weight distribution result of the feature classification indication matrix to obtain a data dimension splitting result, wherein the dimensionality of the view data to be split is m, and splitting the ith data dimension comprises:
splitting the ith data dimension into the view corresponding to the maximum value in the vector corresponding to the ith data dimension in the feature classification indication matrix, where i ∈ [1, m].
3. The method according to claim 2, wherein after the dimension splitting result for the m data dimensions is obtained, the method further comprises:
distributing the view data to be split according to the dimension splitting result to obtain the multi-view data.
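The assignment rule of claims 2-3 — each dimension goes to the view whose entry is largest in that dimension's row of the feature classification indication matrix — reduces to a per-row argmax. The sketch below assumes the weight matrix `W` (n dimensions x k views) has already been obtained from the claimed optimization; the values of `W` and `X` here are invented for illustration.

```python
import numpy as np

# Assumed weight distribution result of the feature classification
# indication matrix: n = 4 dimensions, k = 2 views (illustrative values,
# not a real solution of the optimization problem).
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3],
              [0.1, 0.9]])

# Claim 2: split the ith dimension into the view with the maximum value
# in the ith row of W.
view_of_dim = W.argmax(axis=1)
print(view_of_dim)                               # [0 1 0 1]

# Claim 3: distribute the single-view columns per the split result.
X = np.arange(20).reshape(5, 4)                  # 5 entities, 4 dimensions
multi_view = [X[:, view_of_dim == k] for k in range(W.shape[1])]
print([v.shape for v in multi_view])             # [(5, 2), (5, 2)]
```

The distribution step is a pure column partition, so no data values are changed — only regrouped into views.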
4. The method according to claim 1, wherein the preset number of view splits is obtained by processing the dimensions of the data to be split based on one of an automatic clustering algorithm and a similarity algorithm.
5. The method according to claim 4, wherein the automatic clustering algorithm comprises one of a density-based spatial clustering of applications with noise (DBSCAN) algorithm, a fuzzy clustering algorithm, and a K-means clustering algorithm; and/or the similarity algorithm comprises one of a cosine similarity algorithm, a distance similarity algorithm, and the Pearson correlation coefficient.
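One way to realize claims 4-5 is to measure similarity between feature dimensions and count the resulting groups. The sketch below uses cosine similarity with a simple greedy grouping; the threshold, the synthetic data, and the grouping heuristic are all assumptions for illustration — the claims equally allow DBSCAN, fuzzy or K-means clustering, distance similarity, or the Pearson correlation coefficient.

```python
import numpy as np

# Synthetic data: dimensions 0 and 1 share a latent signal, dimension 2
# is independent, so two views would be a natural split.
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.01 * rng.normal(size=(50, 1)),
               base + 0.01 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])

Xn = X / np.linalg.norm(X, axis=0)               # unit-normalize each column
cos_sim = Xn.T @ Xn                              # pairwise cosine similarity

# Greedy grouping: a dimension joins the first group whose representative
# it resembles above the (assumed) threshold, else starts a new group.
threshold = 0.9
groups = []
for i in range(X.shape[1]):
    for g in groups:
        if abs(cos_sim[i, g[0]]) > threshold:
            g.append(i)
            break
    else:
        groups.append([i])

print(len(groups))   # estimated view split number (2 for this construction)
```

The group count then serves as the preset number of view splits k fed into the claim 2 optimization.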
6. The method according to claim 1, wherein preprocessing the original data comprises:
performing normalization processing on the original data.
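Claims 6 does not fix a normalization scheme, so the sketch below shows per-dimension z-score normalization as one plausible choice (min-max scaling would serve equally well); the data values are invented.

```python
import numpy as np

# Raw data with very different column scales, which would otherwise let
# the large-scale dimension dominate the PCA-based splitting.
raw = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

# Z-score normalization: each column to zero mean and unit variance.
normalised = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print(normalised.mean(axis=0))                   # per-column mean ≈ 0
print(normalised.std(axis=0))                    # per-column std = 1
```

After normalization both columns contribute on an equal footing to the covariance structure used downstream.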
7. The method according to claim 2, wherein solving the optimization problem of the principal component analysis classification indication function comprises:
solving the optimization problem of the principal component analysis classification indication function based on a generalized eigenvalue solution method.
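A generalized eigenvalue problem A v = λ B v, as invoked in claim 7, can be reduced to a standard symmetric eigenproblem when B is positive definite. The matrices below are small illustrative stand-ins, not the patent's actual operators.

```python
import numpy as np

# Illustrative symmetric matrices for A v = lambda * B v.
A = np.array([[2.0, 0.0],
              [0.0, 5.0]])
B = np.array([[2.0, 0.0],
              [0.0, 1.0]])

# With B = L L^T (Cholesky), transform to the standard problem
# (L^-1 A L^-T) u = lambda * u, with v = L^-T u.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
M = Linv @ A @ Linv.T
eigvals, U = np.linalg.eigh(M)                   # ascending eigenvalues
V = Linv.T @ U                                   # generalized eigenvectors
print(eigvals)                                   # [1. 5.]
```

Per claim 2, the eigenvectors of the k smallest generalized eigenvalues — here the leading columns `V[:, :k]` — would form the weight distribution result of the feature classification indication matrix.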
8. The method according to claim 1, wherein the multi-view algorithm comprises one of canonical correlation analysis, multiple canonical correlation analysis, kernel canonical correlation analysis, locality-preserving canonical correlation analysis, discriminant canonical correlation analysis, generalized multi-view analysis, multi-view discriminant analysis, and a multi-view dimensionality reduction model.
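Of the multi-view algorithms listed in claim 8, canonical correlation analysis (CCA) is the simplest to sketch. The numpy-only function below recovers the first canonical correlation between two views sharing a latent signal; the data construction and the regularization constant are assumptions, and a production system might instead use a library implementation such as `sklearn.cross_decomposition.CCA`.

```python
import numpy as np

# Two views that share one latent signal in their first columns.
rng = np.random.default_rng(2)
z = rng.normal(size=(200, 1))
view_a = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 1))])
view_b = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
                    rng.normal(size=(200, 1))])

def first_canonical_corr(X, Y, reg=1e-8):
    """First canonical correlation between views X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Whiten both views; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    s = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
    return s[0]

rho = first_canonical_corr(view_a, view_b)
print(rho)   # close to 1.0 for this shared-signal construction
```

A high first canonical correlation indicates the views carry a common signal, which is the property such multi-view models exploit after the split.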
9. The method according to claim 1, wherein the original data contains user characteristic data and the model is used to construct a user profile.
10. A data processing apparatus comprising:
an acquisition module configured to acquire original data, preprocess the original data, and obtain view data to be split, wherein the view data to be split is single-view data, the single-view data is high-dimensional data, and the dimensions in the high-dimensional data correspond to entity features;
a data splitting module configured to perform view splitting on the view data to be split by dimension based on a principal component analysis method and convert the view data to be split into multi-view data, wherein the multi-view data is data distributed across a plurality of views, the numbers of dimensions of the data in the respective views are the same or different, and the numbers of dimensions of the data in the views sum to the dimensionality of the single-view data; and
a model building module configured to build a model using a single-view algorithm or a multi-view algorithm based on the multi-view data.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-9.
12. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202211043758.4A — filed 2022-08-29 — priority 2022-08-29 — Data processing method, apparatus, device, medium, and program product — Pending

Priority Applications (1)

Application Number: CN202211043758.4A — Priority Date: 2022-08-29 — Filing Date: 2022-08-29 — Title: Data processing method, apparatus, device, medium, and program product


Publications (1)

Publication Number: CN115687907A — Publication Date: 2023-02-03

Family

ID=85060775

Family Applications (1)

Application Number: CN202211043758.4A — Title: Data processing method, apparatus, device, medium, and program product — Priority Date / Filing Date: 2022-08-29 — Status: Pending

Country Status (1)

Country: CN — Publication: CN115687907A

Similar Documents

Publication Publication Date Title
CN108229419B (en) Method and apparatus for clustering images
US9536201B2 (en) Identifying associations in data and performing data analysis using a normalized highest mutual information score
CN109492772B (en) Method and device for generating information
CN111932386B (en) User account determining method and device, information pushing method and device, and electronic equipment
CN111600874B (en) User account detection method and device, electronic equipment and medium
Nair et al. An introduction to clustering algorithms in big data
CN111125266A (en) Data processing method, device, equipment and storage medium
WO2022043798A1 (en) Automated query predicate selectivity prediction using machine learning models
CN111062431A (en) Image clustering method, image clustering device, electronic device, and storage medium
CN115238815A (en) Abnormal transaction data acquisition method, device, equipment, medium and program product
CN110895706B (en) Method and device for acquiring target cluster number and computer system
CN116155628A (en) Network security detection method, training device, electronic equipment and medium
CN116308704A (en) Product recommendation method, device, electronic equipment, medium and computer program product
CN115687907A (en) Data processing method, apparatus, device, medium, and program product
CN111488479A (en) Hypergraph construction method, hypergraph construction device, computer system and medium
CN114268625B (en) Feature selection method, device, equipment and storage medium
CN115495606A (en) Image gathering method and system
CN114139059A (en) Resource recommendation model training method, resource recommendation method and device
CN113947431A (en) User behavior quality evaluation method, device, equipment and storage medium
US11544240B1 (en) Featurization for columnar databases
CN113779370B (en) Address retrieval method and device
CN118051769A (en) Training method, training device, training equipment, training medium and training program product for feature extraction network
CN115659167A (en) Multi-feature library merging method, device and equipment and computer readable storage medium
CN116385115A (en) Product recommendation method, device, equipment and medium based on matrix filling
CN116011560A (en) Model training method, data reconstruction method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination