CN112016581A - Multidimensional data processing method and device, computer equipment and storage medium - Google Patents

Multidimensional data processing method and device, computer equipment and storage medium

Info

Publication number
CN112016581A
Authority
CN
China
Prior art keywords
data
category
processed
determining
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910472215.6A
Other languages
Chinese (zh)
Inventor
盛捷来
季纺纺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910472215.6A priority Critical patent/CN112016581A/en
Publication of CN112016581A publication Critical patent/CN112016581A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multidimensional data processing method, a multidimensional data processing device, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes; performing dimensionality reduction on the data to be processed to obtain target dimensionality data, and determining the category attribute number of the data to be processed according to the target dimensionality data, so that the category attribute number of the data to be processed can be accurately determined through the dimensionality reduction; further, performing cluster analysis on the data to be processed according to the category attribute number to obtain a cluster center and a data category corresponding to the category attribute number; and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.

Description

Multidimensional data processing method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of logistics data processing technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
With the development of communication technology, various data flood all aspects of people's life, and analysis and processing of large-scale data are more and more important in the field of scientific research. The high dimensionality and the complex structure of the data bring certain difficulties to the analysis and processing of the data. How to effectively find out the characteristic information of the high-dimensional data is a basic problem in the fields of information science and statistical science, and is also a main challenge faced by high-dimensional data analysis.
At present, analyzing high-dimensional data often requires extensive manual sorting and analysis in advance, in which the dimensions and categories of the data are prejudged and classification rules are formed. This requires considerable effort, and the accuracy of the classification result depends heavily on how reasonable the classification rules are. The approach therefore consumes substantial labor cost and can easily produce inaccurate classification results that cannot be applied effectively in the field of scientific research.
Disclosure of Invention
In view of the above, the present invention provides a multidimensional data processing method, an apparatus, a computer device and a storage medium, which can analyze a data category and determine a category attribute according to objective data characteristics to achieve objective and accurate classification of multidimensional data.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a multidimensional data processing method, which comprises the following steps:
acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes;
performing dimensionality reduction on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data;
performing clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
The method for obtaining target dimension data by performing dimensionality reduction on the data to be processed and determining the category attribute number of the data to be processed according to the target dimension data includes:
reducing the dimension of the data to be processed to obtain corresponding two-dimensional data, and determining the category attribute number of the data to be processed according to the two-dimensional data; or, alternatively,
and reducing the dimension of the data to be processed to obtain corresponding three-dimensional data, and determining the category attribute number of the data to be processed according to the three-dimensional data.
The method for obtaining target dimension data by performing dimensionality reduction on the data to be processed and determining the category attribute number of the data to be processed according to the target dimension data includes:
and performing dimensionality reduction on the data to be processed through a t-SNE algorithm to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data.
The dimensionality reduction of the data to be processed through a t-SNE algorithm to obtain target dimensionality data comprises the following steps:
mapping the data to be processed to a high-dimensional space through Gaussian distribution to obtain high-dimensional data, and determining a first probability distribution parameter corresponding to the Gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter.
Determining the category attribute corresponding to each data category according to the clustering center corresponding to each data category, wherein the determining the category attribute corresponding to each data category comprises:
determining the position coordinates of the clustering center corresponding to each data category;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
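The distance calculation described above can be illustrated with a minimal sketch. The "set conditions" are not specified in this passage, so the sketch assumes a hypothetical condition (distance not exceeding a hypothetical `max_distance`); the center coordinates and the names `distance_to_origin` and `centers_meeting_condition` are likewise illustrative assumptions.

```python
import math

def distance_to_origin(center):
    """Euclidean distance between a cluster center and the coordinate origin."""
    return math.hypot(*center)

def centers_meeting_condition(centers, max_distance):
    """Return the data categories whose center lies within max_distance of the origin.

    The threshold rule is an assumed stand-in for the unspecified "set conditions".
    """
    return [name for name, center in centers.items()
            if distance_to_origin(center) <= max_distance]

# Illustrative cluster centers, one per data category (hypothetical values).
centers = {"A": [3.0, 4.0], "B": [0.6, 0.8], "C": [6.0, 8.0]}

distances = {name: distance_to_origin(c) for name, c in centers.items()}
selected = centers_meeting_condition(centers, max_distance=6.0)
```

With these values, categories A and B satisfy the assumed condition, and a category attribute could then be assigned to each selected data category.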
The clustering analysis is performed on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number, and the method comprises the following steps:
and clustering the data to be processed through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number.
The clustering of the data to be processed according to the category attribute number by a spectral clustering algorithm to obtain a clustering center and a data category corresponding to the category attribute number comprises the following steps:
determining a class attribute weight value between each piece of data to be processed and other pieces of data to be processed in the data to be processed;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
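The weight-threshold step above can be sketched as follows. This is a simplified illustration only: a complete spectral clustering algorithm typically also works with the eigenvectors of a graph Laplacian built from these weights, which is omitted here. The Gaussian kernel used as the weight value, the parameter `sigma`, the threshold, and all function names are assumptions not taken from the source.

```python
import math
from collections import deque

def gaussian_weight(a, b, sigma=1.0):
    """Assumed weight value between two records: a Gaussian kernel of their distance."""
    return math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2))

def group_by_weight(points, threshold, sigma=1.0):
    """Mark records as the same category when their weight exceeds the threshold.

    Categories are grown by breadth-first search over above-threshold pairs.
    """
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if labels[j] == -1 and gaussian_weight(points[i], points[j], sigma) > threshold:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels

def cluster_centers(points, labels):
    """Compute each category's center as the mean of its member records."""
    centers = {}
    for label in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == label]
        centers[label] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Illustrative data forming three well-separated groups (hypothetical values).
points = [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 5.0], [10.0, 0.0], [10.0, 0.5]]
labels = group_by_weight(points, threshold=0.5)
centers = cluster_centers(points, labels)
```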
The method for determining the category attribute number of the data to be processed according to the target dimension data comprises the following steps:
performing dimensionality reduction on the logistics service data to obtain target dimension data, and determining that the logistics service data comprises three category attributes according to the target dimension data;
determining the category attribute corresponding to each data category according to the clustering center corresponding to each data category, including:
and determining the category attributes corresponding to the data categories to be a price attribute, a service attribute and an aging attribute respectively according to the clustering center corresponding to each data category.
Before the dimension reduction processing is performed on the logistics business data to obtain target dimension data, the method comprises the following steps:
and screening the logistics service data according to preset parameters, and deleting the logistics service data which do not meet preset conditions.
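A minimal sketch of such a screening step is given below. The source does not state the preset parameters or conditions, so the field names (`avg_price`, `delivery_hours`, `complaint_rate`) and the rules (no missing or negative values, complaint rate not above 1.0) are hypothetical and serve only as an illustration.

```python
def screen_records(records, required_fields, max_complaint_rate=1.0):
    """Keep only records that satisfy the assumed preset conditions."""
    kept = []
    for record in records:
        values = [record.get(field) for field in required_fields]
        # Delete records with missing or negative values in any required field.
        if any(v is None or v < 0 for v in values):
            continue
        # Delete records whose complaint rate is outside the assumed valid range.
        if record["complaint_rate"] > max_complaint_rate:
            continue
        kept.append(record)
    return kept

FIELDS = ("avg_price", "delivery_hours", "complaint_rate")

# Illustrative logistics service records (hypothetical values).
records = [
    {"merchant": "m1", "avg_price": 8.5, "delivery_hours": 36.0, "complaint_rate": 0.01},
    {"merchant": "m2", "avg_price": None, "delivery_hours": 40.0, "complaint_rate": 0.02},
    {"merchant": "m3", "avg_price": 7.0, "delivery_hours": -1.0, "complaint_rate": 0.00},
    {"merchant": "m4", "avg_price": 9.2, "delivery_hours": 24.0, "complaint_rate": 1.50},
]

clean = screen_records(records, FIELDS)
```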
An embodiment of the present invention provides a multidimensional data processing apparatus, including:
an acquisition module, configured to acquire data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes;
the dimensionality reduction module is used for carrying out dimensionality reduction on the data to be processed to obtain target dimensionality data and determining the category attribute number of the data to be processed according to the target dimensionality data;
the clustering module is used for carrying out clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
and the determining module is used for determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
The dimension reduction module is further used for reducing the dimension of the data to be processed to obtain corresponding two-dimensional data, and determining the category attribute number of the data to be processed according to the two-dimensional data; or, alternatively,
the dimension reduction module is further configured to perform dimension reduction on the data to be processed to obtain corresponding three-dimensional data, and determine the category attribute number of the data to be processed according to the three-dimensional data.
The dimension reduction module is further configured to perform dimension reduction on the data to be processed through a t-SNE algorithm to obtain target dimension data, and determine the category attribute number of the data to be processed according to the target dimension data.
The dimensionality reduction module is further configured to map the data to be processed to a high-dimensional space through gaussian distribution to obtain high-dimensional data, and determine a first probability distribution parameter corresponding to the gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter, and determining the category attribute number of the data to be processed according to the target dimension data.
The determining module is further configured to determine a position coordinate of a clustering center corresponding to each data category;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
The clustering module is further configured to cluster the data to be processed through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number.
The clustering module is further configured to determine a class attribute weight value between each piece of data to be processed and other pieces of data to be processed in the pieces of data to be processed;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
The dimension reduction module is further used for performing dimension reduction processing on the logistics service data to obtain target dimension data, and determining that the logistics service data comprises three category attributes according to the target dimension data;
the determining module is further configured to determine, according to the clustering center corresponding to each data category, that the category attributes corresponding to the data categories are a price attribute, a service attribute, and an aging attribute, respectively.
The acquisition module is further used for screening the logistics service data according to preset parameters and deleting the logistics service data which do not meet preset conditions.
An embodiment of the present invention provides a computer device, including: a processor and a memory for storing a computer program capable of running on the processor;
wherein, when the processor is used for running the computer program, the multidimensional data processing method of any embodiment of the invention is realized.
The embodiment of the invention provides a computer storage medium, wherein a computer program is stored in the computer storage medium, and when being executed by a processor, the computer program realizes the multidimensional data processing method provided by any embodiment of the invention.
The embodiments of the invention provide a multidimensional data processing method, a multidimensional data processing device, computer equipment and a storage medium. The method comprises: acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes; performing dimensionality reduction on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data, so that the category attribute number can be accurately determined through the dimensionality reduction; performing cluster analysis on the data to be processed according to the category attribute number to obtain a cluster center and a data category corresponding to the category attribute number; and determining the category attribute corresponding to each data category according to the clustering center corresponding to that data category. In this way, the number of category attributes is first determined by reducing the dimension of the data, and that number is then combined with clustering to analyze the data categories and category attributes. The data categories and category attributes are thus derived from objective characteristics of the data: the dimensions and categories of the data need not be prejudged manually, no classification rule needs to be formed in advance, and the classification result is not influenced by such a rule. This reduces cost and achieves objective and accurate classification of the multidimensional data.
Drawings
FIG. 1 is a flowchart illustrating a multidimensional data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multi-dimensional data processing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a multidimensional data processing method according to another embodiment of the present invention.
Detailed Description
The present disclosure will be described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the examples provided herein are merely illustrative of the present disclosure and are not intended to limit the present disclosure. In addition, the embodiments provided below are some embodiments for implementing the disclosure, not all embodiments for implementing the disclosure, and the technical solutions described in the embodiments of the disclosure may be implemented in any combination without conflict.
It should be noted that, in the embodiments of the present disclosure, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other related elements in a method or apparatus including the element (e.g., steps in a method or elements in an apparatus, such as units that may be part of a circuit, part of a processor, part of a program or software, etc.).
For example, although the multidimensional data processing method provided by the embodiment of the present disclosure includes a series of steps, the multidimensional data processing method provided by the embodiment of the present disclosure is not limited to the described steps, and similarly, the multidimensional data processing device provided by the embodiment of the present disclosure includes a series of modules, but the multidimensional data processing device provided by the embodiment of the present disclosure is not limited to include the explicitly described modules, and may further include a unit that is required to be provided for acquiring relevant information or performing processing based on the information.
Before further detailed description of the embodiments of the present disclosure, terms and expressions referred to in the embodiments of the present disclosure are explained, and the terms and expressions referred to in the embodiments of the present disclosure are applied to the following explanations.
Dimension: a specific angle from which people observe data. It is a class of attributes considered when examining a problem, and a set of attributes forms a dimension (for example, a time dimension or a geographical dimension).
Level of dimension (Level): for a particular angle (i.e. a dimension) from which people observe data, there may be several descriptive aspects that differ in degree of detail (time dimension: date, month, quarter, year).
Member of dimension (Member): one value of a dimension, which describes the location of a data item in that dimension (for example, "a given day of a given month of a given year" describes a location in the time dimension).
Multidimensional data: generally refers to data with three or more dimensions.
High dimension: generally refers to more than three dimensions.
Low dimension: generally refers to two or three dimensions.
Relative entropy, also known as KL divergence, is a measure of the difference between two probability distributions P and Q. It is asymmetric, which means that D(P||Q) ≠ D(Q||P) in general. Specifically, in information theory, D(P||Q) represents the information loss incurred when a probability distribution Q is used to fit a true distribution P, where P represents the true distribution and Q represents the fitted distribution of P. For example, in the embodiment of the present application, the t distribution is obtained by fitting based on the Gaussian distribution and the relative entropy.
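The asymmetry of relative entropy can be checked numerically. The following is a minimal sketch for discrete distributions using the standard definition D(P||Q) = Σ p_i · ln(p_i / q_i); the two distributions and the function name `kl_divergence` are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P||Q) in nats for two discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.4, 0.1]  # illustrative "true" distribution
Q = [0.3, 0.3, 0.4]  # illustrative fitted distribution

d_pq = kl_divergence(P, Q)  # information loss when Q is used to fit P
d_qp = kl_divergence(Q, P)  # differs from d_pq, showing the asymmetry
```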
Logistics business data: historical data on the use of logistics services by merchants on the e-commerce platform, including the price of the waybill, the time taken for the waybill to reach the customer, customer complaints about the transaction, and the like.
Category attribute: after processing, the various types of data in the logistics service data are classified into several categories, for example 3 categories, and each category has a corresponding category attribute. For example, the logistics service data can be classified into a price attribute, a service attribute and an aging (i.e. timeliness) attribute.
Loss: the decrease in a merchant's usage rate of a certain logistics service exceeds a set value.
No loss: the decrease in a merchant's usage rate of a certain logistics service does not exceed a set value.
As shown in fig. 1, an embodiment of the present invention provides a multidimensional data processing method, which includes the following steps:
step 101: acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes;
Here, the number of types of data in the data to be processed is the dimension of the data to be processed. The data to be processed may include the following categories: average per-unit freight price, per-unit delivery time, complaint rate, value-added services, etc. Further, these categories may be grouped into a plurality of category attributes, where the multi-class attributes refer to the categories into which the data to be processed can be divided.
Step 102: performing dimensionality reduction on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data;
here, the target dimension data generally refers to low-dimensional data, and the data to be processed is subjected to dimension reduction processing to reduce the multidimensional data into the low-dimensional data, which may be two-dimensional data or three-dimensional data.
The dimensionality reduction processing refers to displaying multidimensional data in a low dimension, and dimensionality reduction methods can include the t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm, the PCA (Principal Component Analysis) algorithm and the like.
Here, performing dimension reduction on the data to be processed to obtain target dimension data refers to displaying multidimensional data with multiple category attributes in a low dimension. Specifically, for example, taking t-SNE algorithm as an example, for data to be processed, mapping the data to be processed to a high-dimensional space through gaussian distribution to obtain high-dimensional data, determining a first probability distribution parameter corresponding to the gaussian distribution, fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and relative entropy, and obtaining target dimensional data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter.
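The roles of the Gaussian distribution, the t distribution and the relative entropy can be illustrated with a minimal sketch based on the commonly published formulation of t-SNE, which is not necessarily the exact procedure of this embodiment. Similarities between the original records are modeled with a Gaussian kernel, similarities between their low-dimensional images with a Student t kernel (one degree of freedom), and the relative entropy between the two measures how well an embedding preserves the structure. The fixed bandwidth `sigma` (instead of a perplexity search), the sample data and all function names are assumptions; the gradient descent that actually finds the embedding is omitted.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gaussian_affinities(points, sigma=1.0):
    """Symmetrised Gaussian similarities p_ij between the original records."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [0.0 if i == j else
                   math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
                   for j in range(n)]
        total = sum(weights)
        cond.append([w / total for w in weights])  # conditional p_{j|i}
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def student_t_affinities(embedding):
    """Student t similarities q_ij between the low-dimensional points."""
    n = len(embedding)
    raw = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
            for j in range(n)] for i in range(n)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]

def kl_cost(P, Q):
    """Relative entropy D(P||Q) summed over all pairs with p_ij > 0."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Four illustrative 4-dimensional records forming two groups (hypothetical values).
X = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.5, 0.0],
     [5.0, 5.0, 5.0, 5.0], [5.5, 5.0, 5.5, 5.0]]
P = gaussian_affinities(X)

good = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]  # keeps the groups together
bad = [[0.0, 0.0], [10.0, 10.0], [1.0, 0.0], [11.0, 10.0]]   # mixes the groups

cost_good = kl_cost(P, student_t_affinities(good))
cost_bad = kl_cost(P, student_t_affinities(bad))
```

An embedding that keeps neighboring records together yields a lower relative entropy than one that mixes the groups, which is the quantity a t-SNE optimizer reduces when fitting the low-dimensional coordinates.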
Here, the target dimension data is visualized data. Determining the category attribute number of the data to be processed according to the target dimension data means that, when the distribution of the target dimension data in the target dimension is separable, the number of separable categories is taken as the category attribute number. For example, the dimension of the data to be processed is reduced to obtain two-dimensional data, and the data is projected in a two-dimensional coordinate system. According to the distribution of the points of the two-dimensional data in that coordinate system, the data to be processed can be divided into several categories, for example three categories, in which case the corresponding category attribute number is determined to be 3.
Step 103: performing clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
here, the cluster analysis may include a spectral clustering algorithm, a K-means clustering algorithm, and the like; here, performing cluster analysis on the to-be-processed data according to the category attribute number means clustering the to-be-processed data according to the corresponding category attribute number to obtain a cluster center and a data category corresponding to the category attribute number. For example, when the number of the category attributes is determined to be 3, the data to be processed is clustered according to the clustering number of 3, and the data to be processed is clustered into 3 data categories and 3 clustering centers.
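Clustering with a cluster number of 3 can be sketched with a minimal K-means implementation, one of the algorithms named above. The sample records, the deterministic initialization from the first k records, and the function name `kmeans` are assumptions made for this illustration; practical implementations usually use randomized or smarter initialization.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal K-means: returns (cluster centers, label of each point)."""
    # Deterministic start for this sketch: the first k points are the initial centers.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Illustrative records, e.g. (freight price, delivery time, complaint rate); hypothetical values.
data = [[1.0, 1.0, 0.1], [8.0, 1.0, 0.2], [1.0, 9.0, 0.9],
        [1.2, 0.8, 0.1], [7.8, 1.2, 0.2], [0.9, 9.1, 0.8]]

centers, labels = kmeans(data, k=3)  # k is the category attribute number
```

The call returns 3 clustering centers and assigns every record to one of 3 data categories, matching the example in which the category attribute number is 3.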
Step 104: and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
Here, determining the category attribute corresponding to the data category refers to determining a data category finally obtained after clustering analysis of the data to be processed and a category attribute corresponding to each data category.
In the above embodiments of the present application, data to be processed is obtained, where the data to be processed is multidimensional data with multiple category attributes. Dimensionality reduction is performed on the data to be processed to obtain target dimension data, and the category attribute number of the data to be processed is determined according to the target dimension data, so that the category attribute number can be accurately determined through the dimensionality reduction. Cluster analysis is then performed on the data to be processed according to the category attribute number to obtain a cluster center and a data category corresponding to the category attribute number, and the category attribute corresponding to each data category is determined according to the clustering center corresponding to that data category. In this way, the number of category attributes is first determined by reducing the dimension of the data, and that number is then combined with clustering to analyze the data categories and category attributes. The data categories and category attributes are thus derived from objective characteristics of the data: the dimensions and categories of the data need not be prejudged manually, no classification rule needs to be formed in advance, and the classification result is not influenced by such a rule. This reduces cost and achieves objective and accurate classification of the multidimensional data.
In an embodiment, the performing dimension reduction processing on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data includes:
reducing the dimension of the data to be processed to obtain corresponding two-dimensional data, and determining the category attribute number of the data to be processed according to the two-dimensional data; or, alternatively,
and reducing the dimension of the data to be processed to obtain corresponding three-dimensional data, and determining the category attribute number of the data to be processed according to the three-dimensional data.
Here, performing dimensionality reduction on the data to be processed to obtain corresponding three-dimensional data means that multidimensional data with multiple category attributes are displayed in three dimensions by dimensionality reduction into three-dimensional data. Specifically, the dimension of the data to be processed is reduced to obtain three-dimensional data, the data to be processed is projected in a three-dimensional coordinate system, and the data to be processed can be divided into several categories, for example, three categories according to the distribution of the points of the three-dimensional data in the three-dimensional coordinate system, that is, the number of corresponding category attributes is determined to be 3.
Here, performing dimensionality reduction on the data to be processed to obtain corresponding two-dimensional data means that multidimensional data with multiple category attributes are displayed in two dimensions by performing dimensionality reduction on the multidimensional data into two-dimensional data. Specifically, the dimension of the data to be processed is reduced to obtain two-dimensional data, the data to be processed is projected in a two-dimensional coordinate system, and the data to be processed can be divided into several categories, for example, three categories according to the distribution of points of the two-dimensional data in the two-dimensional coordinate system, that is, the number of corresponding category attributes is determined to be 3.
In the above embodiment, the dimension reduction is performed on the data to be processed to obtain corresponding two-dimensional data or three-dimensional data, and the category attribute number of the data to be processed is determined according to the two-dimensional data or the three-dimensional data, so that the category attribute number of the data to be processed can be accurately determined through the dimension reduction, that is, the clustering number K value of the subsequent data to be processed during cluster analysis is determined.
In an embodiment, the performing dimension reduction processing on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data includes:
and performing dimensionality reduction on the data to be processed through a t-SNE algorithm to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data.
The step of performing dimensionality reduction on the data to be processed by the t-SNE algorithm refers to performing dimensionality reduction and visualization analysis on the data to be processed, wherein the multidimensional data with the multi-class attributes are displayed in a target dimensionality through the dimensionality reduction processing of the t-SNE algorithm to obtain target dimensionality data.
Here, the target dimension data is visualized data. Determining the category attribute number of the data to be processed according to the target dimension data means that, when the distribution of the target dimension data in the target dimension is separable, the number of separable groups is taken as the category attribute number. For example, the dimension of the data to be processed is reduced to obtain two-dimensional data, the data to be processed is projected in a two-dimensional coordinate system, and the data to be processed can be divided into several categories, for example three categories, according to the distribution of the points of the two-dimensional data in the two-dimensional coordinate system; that is, the corresponding category attribute number is determined to be 3.
In the above embodiment, the target dimension data is obtained by performing dimension reduction through the t-SNE algorithm, and the category attribute number of the data to be processed is determined according to the target dimension data. In this way, the similarity probabilities of corresponding points are kept as close as possible between the multidimensional space and the target dimension, so that the data to be processed is mapped from the multidimensional data to the target dimension data. At the same time, dissimilar data points that lie at a small distance from each other in the target dimension produce a large gradient that pushes them apart, so that the data to be processed is accurately classified.
In an embodiment, the performing dimension reduction on the data to be processed by a t-SNE algorithm to obtain target dimension data includes:
mapping the data to be processed to a high-dimensional space through Gaussian distribution to obtain high-dimensional data, and determining a first probability distribution parameter corresponding to the Gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter.
Here, mapping the data to be processed to the high-dimensional space through the gaussian distribution to obtain the high-dimensional data means converting a high-dimensional euclidean distance between every two data points in the data to be processed into a conditional probability representing similarity, specifically, converting the data to be processed into the conditional probability representing similarity through the gaussian distribution, and determining a first probability distribution parameter corresponding to the gaussian distribution.
Further, fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter. Here, the target dimension data generally refers to low-dimensional data.
$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \qquad (1)$$
Formula (1) is the calculation formula of the relative entropy, where D(p||q) is the relative entropy of the Gaussian distribution with respect to the t distribution, p(x) is the first probability distribution parameter, and q(x) is the second probability distribution parameter. Fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy means calculating the second probability distribution parameter for which the relative entropy between the probability distribution of the data to be processed in the high-dimensional space and that in the low-dimensional space is minimal.
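The relative entropy described above can be illustrated with a minimal sketch, assuming the standard discrete form D(p||q) = Σ p(x) log(p(x)/q(x)). The function name `relative_entropy` and the sample distributions are hypothetical and are not taken from the embodiment.

```python
import math


def relative_entropy(p, q):
    """Return D(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # a term with p(x) = 0 contributes nothing
        if qx == 0.0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


# Hypothetical values: p plays the role of the first (high-dimensional)
# distribution, the q_* lists are candidate second distributions.
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]
q_far = [0.1, 0.1, 0.8]

print(relative_entropy(p, q_close))
print(relative_entropy(p, q_far))
```

Under these assumptions, the candidate that resembles p more closely yields the smaller relative entropy, which is the quantity the fitting step seeks to minimize.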
Further, obtaining the target dimension data corresponding to the data to be processed through the t distribution corresponding to the second probability distribution parameter means obtaining the target dimension data by reducing the dimension of the data to be processed through the t distribution corresponding to the second probability distribution parameter. Here, the target dimension data generally refers to low-dimensional data.
In this way, distances are converted into a probability distribution by using a Gaussian distribution in the high-dimensional space, and a heavier-tailed (long-tailed) t distribution is used in the low-dimensional space, so that points at small and moderate distances in the high-dimensional space are placed at larger distances in the low-dimensional space after the transformation. The relative entropy between the probability distribution in the high-dimensional space and that in the low-dimensional space is kept minimal, and the data to be processed is thereby separated into groups from which the category attribute number is determined.
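The contrast between the two distributions can be sketched as follows. This is a simplified illustration rather than a full t-SNE implementation: it assumes a single fixed Gaussian bandwidth `sigma` (practical t-SNE implementations tune a per-point bandwidth from a perplexity setting) and a Student t kernel with one degree of freedom. All function names and sample points are hypothetical.

```python
import math


def gaussian_kernel(dist, sigma=1.0):
    """Unnormalized Gaussian similarity used in the high-dimensional space."""
    return math.exp(-dist ** 2 / (2.0 * sigma ** 2))


def student_t_kernel(dist):
    """Unnormalized Student t similarity (one degree of freedom), low-dimensional space."""
    return 1.0 / (1.0 + dist ** 2)


def pairwise_affinities(points, kernel):
    """Normalize kernel values over all unordered pairs so that they sum to 1."""
    weights = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            weights[(i, j)] = kernel(math.dist(points[i], points[j]))
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical data: three 4-dimensional points and a 2-dimensional embedding.
high_dim = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), (4.0, 4.0, 4.0, 4.0)]
low_dim = [(0.0, 0.0), (1.0, 0.0), (6.0, 6.0)]

P = pairwise_affinities(high_dim, gaussian_kernel)
Q = pairwise_affinities(low_dim, student_t_kernel)

# The t kernel decays much more slowly than the Gaussian kernel.
print(gaussian_kernel(3.0), student_t_kernel(3.0))
```

At a distance of 3, the Gaussian kernel has fallen to roughly 0.011 while the t kernel is still 0.1. This slower decay is what allows moderately distant points to be placed farther apart in the low-dimensional space without a large penalty.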
In an embodiment, the determining, according to the clustering center corresponding to each of the data categories, the category attribute corresponding to the data category includes:
determining the position coordinates of the clustering center corresponding to each data category;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
Here, there is a cluster center for each data category, and each cluster center has a corresponding position coordinate in the target dimension. For example, in a three-dimensional coordinate system, cluster center A has the position coordinate (x1, y1, z1), and the distance between the position coordinate of the cluster center and the coordinate origin is
$$d = \sqrt{x_1^2 + y_1^2 + z_1^2}$$
Here, the set condition may be, for example, the one cluster center or the two cluster centers farthest from the coordinate origin, and can be configured as required. The distance between each cluster center and the coordinate origin is calculated, and the category attribute corresponding to each data category that meets the set condition is determined, that is, the category attribute corresponding to the data category of the one or two cluster centers farthest from the coordinate origin. For example, three cluster centers are determined for three data categories, and the distances d1, d2 and d3 between their position coordinates and the coordinate origin are obtained. If the set condition is the cluster center farthest from the coordinate origin, the maximum of d1, d2 and d3 is taken; if this is d1, the cluster center corresponding to d1 is selected, and the category attribute of the corresponding data category is determined accordingly.
Therefore, after clustering, the set condition on the distance between each cluster center and the coordinate origin is used to screen out the category attributes of the data categories whose cluster centers deviate from the coordinate origin, so that the specific reason for merchant loss is accurately obtained.
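The selection of cluster centers by their distance to the coordinate origin can be sketched as follows. The center names, coordinates, and the `top_n` parameter (standing in for the set condition of one or two farthest centers) are hypothetical.

```python
import math


def distance_to_origin(center):
    """Euclidean distance between a cluster center and the coordinate origin."""
    return math.sqrt(sum(c * c for c in center))


def farthest_centers(centers, top_n=1):
    """Return the top_n (name, distance) pairs ranked by distance from the origin."""
    ranked = sorted(
        centers.items(),
        key=lambda item: distance_to_origin(item[1]),
        reverse=True,
    )
    return [(name, distance_to_origin(pos)) for name, pos in ranked[:top_n]]


# Hypothetical cluster centers in a three-dimensional coordinate system.
centers = {
    "A": (1.0, 2.0, 2.0),
    "B": (0.5, 0.5, 0.5),
    "C": (3.0, 4.0, 0.0),
}

print(farthest_centers(centers))           # set condition: the single farthest center
print(farthest_centers(centers, top_n=2))  # set condition: the two farthest centers
```

With these sample coordinates the distances are 3.0 for A, about 0.87 for B, and 5.0 for C, so C is selected when the set condition is the single farthest center.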
In an embodiment, the performing cluster analysis on the to-be-processed data according to the category attribute number to obtain a cluster center and a data category corresponding to the category attribute number includes:
and clustering the data to be processed through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number.
Here, the performing of the cluster analysis on the data to be processed according to the category attribute number means that the data to be processed is clustered according to the corresponding category attribute number by using a spectral clustering algorithm, and further, a cluster center and a data category corresponding to the category attribute number are obtained. For example, when the number of the category attributes is determined to be 3, clustering the data to be processed by using a spectral clustering algorithm according to the clustering number of 3, and clustering the data to be processed into 3 data categories and 3 clustering centers.
In the above embodiment of the present application, the data to be processed is clustered by a spectral clustering algorithm according to the category attribute number, and the multidimensional data with multiple category attributes is clustered according to the determined clustering number, so that clearly distinguishable reasons for merchant loss are accurately given.
In an embodiment, the clustering the data to be processed according to the category attribute number by a spectral clustering algorithm to obtain a clustering center and a data category corresponding to the category attribute number includes:
determining a class attribute weight value between each piece of data to be processed and other pieces of data to be processed in the data to be processed;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
Here, the class attribute weight value between each piece of data to be processed and every other piece of data to be processed is calculated, that is, the similarity between that piece of data and each of the other pieces is obtained.
For data to be processed containing n data points, there are n(n-1)/2 category attribute weight values in total. A set value α is chosen; when the category attribute weight value of any two pieces of data to be processed is greater than the set value, the two pieces are determined to belong to the same category and are marked for clustering. Pieces of data with high similarity have a high weight value between them, and the corresponding cluster centers and data categories are determined, according to the category attribute number, from the data grouped by weight value.
In the above embodiment of the application, the class attribute weight value between each piece of data to be processed and the other pieces of data to be processed is determined through a spectral clustering algorithm, so that multidimensional data with multiple class attributes is clustered according to the determined clustering number, and clearly distinguishable reasons for merchant loss are accurately given.
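The weight-and-threshold step described above can be sketched as follows. This only illustrates the thresholding of pairwise weights; a complete spectral clustering implementation would additionally work with the eigenvectors of a graph Laplacian built from these weights. The Gaussian similarity, the `sigma` and `alpha` values, the union-find grouping, and the use of the group mean as the cluster center are all assumptions made for illustration.

```python
import math
from itertools import combinations


def similarity(a, b, sigma=1.0):
    """Assumed weight function: Gaussian similarity of two data points."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))


def group_by_threshold(points, alpha, sigma=1.0):
    """Put two points in the same category when their weight exceeds alpha."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    weights = {}
    for i, j in combinations(range(n), 2):  # n(n-1)/2 weight values
        weights[(i, j)] = similarity(points[i], points[j], sigma)
        if weights[(i, j)] > alpha:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(groups.values())

    dims = len(points[0])
    centers = [
        tuple(sum(points[i][d] for i in members) / len(members) for d in range(dims))
        for members in clusters
    ]
    return weights, clusters, centers


# Hypothetical two-dimensional data forming three well-separated groups.
points = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (10.0, 0.0), (10.0, 0.2)]
weights, clusters, centers = group_by_threshold(points, alpha=0.5)

print(len(weights), clusters, centers)
```

In this sketch the number of resulting groups depends on `alpha`; the value 0.5 is chosen so that the sample data falls into three groups, matching a category attribute number of 3.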
In an embodiment, the determining the category attribute number of the to-be-processed data according to the target dimension data includes:
performing dimensionality reduction on the logistics service data to obtain target dimension data, and determining that the logistics service data comprises three category attributes according to the target dimension data;
determining the category attribute corresponding to each data category according to the clustering center corresponding to each data category, including:
and determining the category attributes corresponding to the data categories to be a price attribute, a service attribute and an aging attribute respectively according to the clustering center corresponding to each data category.
Here, goods transported in the logistics business may be referred to as packages. Taking logistics business data as the data to be processed, the logistics business data of a package may include, but is not limited to: the transaction order number, the warehousing order number, the geographical position of the package, the weight of the package, the number of products in the package, service-customized package information, and the logistics node where the package is located, such as warehousing, pickup, trunk transportation, or delivery. The logistics business data may also include the merchant's delivery time data, the delivery cost per package, the merchant's complaint and positive-review data, and the like.
Here, the target dimension data is visualized data. Determining that the logistics service data includes three category attributes according to the target dimension data means that the distribution of the target dimension data in the target dimension is separable, and three category attributes are determined from the three separable groups. For example, the dimension of the data to be processed is reduced to obtain two-dimensional data, the data to be processed is projected in a two-dimensional coordinate system, and three category attributes can be determined according to the distribution of the points of the two-dimensional data in the two-dimensional coordinate system; that is, the corresponding category attribute number is determined to be 3.
Here, the category attribute of the data category corresponding to each cluster center is determined according to three cluster centers corresponding to the three category attributes, which are respectively a price attribute, a service attribute and an aging attribute.
In the above embodiment, for logistics service data, cluster analysis is performed on the data to be processed according to the category attribute number to obtain the cluster centers and data categories corresponding to the category attribute number, and the category attribute corresponding to each data category is determined according to the cluster center corresponding to that data category. For example, the category attributes are a price attribute, a service attribute and an aging attribute, and the three corresponding cluster centers are C1, C2 and C3. In a three-dimensional coordinate system, for example, the position coordinate of C1 is (x1, y1, z1), and the distance between the position coordinate of the cluster center and the coordinate origin is
$$d = \sqrt{x_1^2 + y_1^2 + z_1^2}$$
According to the method, the distances from C1, C2 and C3 to the coordinate origin are calculated respectively, and the category attribute of the cluster center with the farthest distance is determined to be the category attribute with the greatest influence on merchant loss. For example, if the farthest cluster center corresponds to the price attribute, the price attribute is determined to be the category attribute with the greatest influence on merchant loss. The specific reason for merchant loss can thereby be accurately analyzed.
In an embodiment, before performing the dimension reduction processing on the logistics service data to obtain the target dimension data, the method includes:
and screening the logistics service data according to preset parameters, and deleting the logistics service data which do not meet preset conditions.
Here, the input data includes logistics business data of all merchants.
Further, the preset parameter may be a seasonal index D, see formula (2), where $S_1$ denotes the sales of goods in a given month and $\bar{S}$ denotes the average monthly sales over the year.
$$D = \frac{S_1}{\bar{S}} \qquad (2)$$
Here, when D is greater than 1, the commodity sales for that month are greater than the monthly average. A threshold, for example 4, may be set: if the number of months in which D is greater than 1 does not exceed 4, the merchant's products are considered to have seasonal sales, and such merchants are screened out and not included in the data to be processed.
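The seasonal screening can be sketched as follows, assuming the seasonal index of a month is that month's sales divided by the average monthly sales of the year, as the definitions around formula (2) indicate. The function names, the `max_peak_months` parameter and the sales figures are hypothetical.

```python
def seasonal_indices(monthly_sales):
    """Seasonal index D for each month: monthly sales over mean monthly sales."""
    mean_sales = sum(monthly_sales) / len(monthly_sales)
    if mean_sales == 0:
        raise ValueError("average monthly sales must be non-zero")
    return [s / mean_sales for s in monthly_sales]


def is_seasonal(monthly_sales, max_peak_months=4):
    """Treat a merchant as seasonal when at most max_peak_months months have D > 1."""
    peak_months = sum(1 for d in seasonal_indices(monthly_sales) if d > 1.0)
    return peak_months <= max_peak_months


# Hypothetical yearly sales for two merchants.
summer_only = [10, 10, 10, 10, 10, 100, 120, 110, 10, 10, 10, 10]
steady = [100, 105, 95, 102, 98, 101, 99, 103, 97, 104, 96, 100]

merchants = {"summer_only": summer_only, "steady": steady}
kept = [name for name, sales in merchants.items() if not is_seasonal(sales)]
print(kept)
```

In this sample, the first merchant exceeds the monthly average in only three months and is screened out, while the second exceeds it in five months and is kept for further processing.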
Alternatively, the preset parameter may be the merchant decline index W, see formula (3), where $S_k$ denotes the sales of goods in month k and $S_{k-1}$ denotes the sales of goods in the month immediately preceding month k.
$$W = \frac{S_k}{S_{k-1}} \qquad (3)$$
Here, when W is less than 1, the commodity sales of the month are less than those of the previous month. The merchant decline index may be evaluated over a one-year period: for example, if 10 of a merchant's W values within one year are less than 1, meaning that the merchant's commodity sales declined month over month in 10 months, the merchant is determined to be in a decline period, and such merchants are screened out and not included in the data to be processed.
Therefore, the logistics service data is obtained after the merchants have been screened in advance with the preset parameters, so that the influence of seasonal factors and of the merchants' own operating conditions is eliminated, and the specific reasons for merchant loss are analyzed more accurately.
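The decline-index screening can be sketched in the same way, assuming W for month k is the sales of month k divided by the sales of the preceding month, as the definitions around formula (3) indicate. The function names, the `min_declining_months` parameter (read here as "at least 10") and the sales figures are hypothetical; sales are assumed to be positive.

```python
def decline_indices(monthly_sales):
    """Decline index W for each month k: sales of month k over sales of month k-1."""
    return [cur / prev for prev, cur in zip(monthly_sales, monthly_sales[1:])]


def is_declining(monthly_sales, min_declining_months=10):
    """Treat a merchant as in decline when at least min_declining_months have W < 1."""
    drops = sum(1 for w in decline_indices(monthly_sales) if w < 1.0)
    return drops >= min_declining_months


# Hypothetical yearly sales for two merchants (positive values assumed).
shrinking = [120, 110, 100, 95, 90, 85, 80, 75, 70, 65, 60, 55]
fluctuating = [100, 98, 101, 99, 102, 100, 103, 101, 104, 102, 105, 103]

merchants = {"shrinking": shrinking, "fluctuating": fluctuating}
kept = [name for name, sales in merchants.items() if not is_declining(sales)]
print(kept)
```

Twelve months of sales yield eleven W values. The first merchant declines in all eleven and is screened out, while the second declines in only six and remains in the data to be processed.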
In another embodiment, as shown in fig. 2, there is also provided a multi-dimensional data processing apparatus including: an obtaining module 21, a dimension reduction module 22, a clustering module 23 and a determining module 24; wherein
the obtaining module 21 is configured to obtain data to be processed, where the data to be processed is multidimensional data with multiple category attributes;
the dimensionality reduction module 22 is configured to perform dimensionality reduction on the data to be processed to obtain target dimensionality data, and determine a category attribute number of the data to be processed according to the target dimensionality data;
the clustering module 23 is configured to perform clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
the determining module 24 is configured to determine a category attribute corresponding to each data category according to a clustering center corresponding to each data category.
In the above embodiments of the present application, data to be processed is obtained, where the data to be processed is multidimensional data with multiple category attributes. Dimensionality reduction is performed on the data to be processed to obtain target dimension data, and the category attribute number of the data to be processed is determined according to the target dimension data, so that the category attribute number can be accurately determined through the dimensionality reduction. Further, cluster analysis is performed on the data to be processed according to the category attribute number to obtain the cluster centers and data categories corresponding to the category attribute number, and the category attribute corresponding to each data category is determined according to the cluster center corresponding to that data category. The cost and inaccuracy of manual return visits are thereby avoided, and the specific reason for merchant loss is obtained by performing cluster analysis on the data to be processed with the determined category attribute number, that is, the clustering number K.
Optionally, the dimension reduction module 22 is further configured to perform dimension reduction on the data to be processed to obtain corresponding two-dimensional data, and determine the category attribute number of the data to be processed according to the two-dimensional data; or
the dimension reduction module 22 is further configured to perform dimension reduction on the data to be processed to obtain corresponding three-dimensional data, and determine the category attribute number of the data to be processed according to the three-dimensional data.
Optionally, the dimension reduction module 22 is further configured to perform dimension reduction on the data to be processed through a t-SNE algorithm to obtain target dimension data, and determine the category attribute number of the data to be processed according to the target dimension data.
Optionally, the dimension reduction module 22 is further configured to map the data to be processed to a high-dimensional space through gaussian distribution to obtain high-dimensional data, and determine a first probability distribution parameter corresponding to the gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter, and determining the category attribute number of the data to be processed according to the target dimension data.
Optionally, the determining module 24 is further configured to determine a position coordinate of a cluster center corresponding to each of the data categories;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
Optionally, the clustering module 23 is further configured to cluster the to-be-processed data according to the category attribute number by using a spectral clustering algorithm, so as to obtain a clustering center and a data category corresponding to the category attribute number.
Optionally, the clustering module 23 is further configured to determine a class attribute weight value between each piece of to-be-processed data in the to-be-processed data and other pieces of to-be-processed data;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
Optionally, the dimension reduction module 22 is further configured to perform dimension reduction processing on the logistics service data to obtain target dimension data, and determine that the logistics service data includes three category attributes according to the target dimension data;
the determining module 24 is further configured to determine, according to the clustering center corresponding to each data category, that the category attributes corresponding to the data category are a price attribute, a service attribute, and an aging attribute, respectively.
Optionally, the obtaining module 21 is further configured to screen the logistics service data according to preset parameters, and delete the logistics service data that does not meet preset conditions.
In another embodiment, as shown in fig. 3, there is also provided a multi-dimensional data processing apparatus including: at least one processor 210 and a memory 211 for storing computer programs capable of running on the processor 210; the processor 210 illustrated in fig. 3 is not used to refer to the number of processors as one, but is only used to refer to the position relationship of the processor with respect to other devices, and in practical applications, the number of processors may be one or more; similarly, the memory 211 illustrated in fig. 3 is also used in the same sense, i.e. it is only used to refer to the position relationship of the memory with respect to other devices, and in practical applications, the number of the memory may be one or more.
Wherein, when the processor 210 is used for running the computer program, the following steps are executed:
acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes;
performing dimensionality reduction on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data;
performing clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
reducing the dimension of the data to be processed to obtain corresponding two-dimensional data, and determining the category attribute number of the data to be processed according to the two-dimensional data; or
and reducing the dimension of the data to be processed to obtain corresponding three-dimensional data, and determining the category attribute number of the data to be processed according to the three-dimensional data.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
and performing dimensionality reduction on the data to be processed through a t-SNE algorithm to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
mapping the data to be processed to a high-dimensional space through Gaussian distribution to obtain high-dimensional data, and determining a first probability distribution parameter corresponding to the Gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
determining the position coordinates of the clustering center corresponding to each data category;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
and clustering the data to be processed through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
determining a class attribute weight value between each piece of data to be processed and other pieces of data to be processed in the data to be processed;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
performing dimensionality reduction on the logistics service data to obtain target dimension data, and determining that the logistics service data comprises three category attributes according to the target dimension data;
determining the category attribute corresponding to each data category according to the clustering center corresponding to each data category, including:
and determining the category attributes corresponding to the data categories to be a price attribute, a service attribute and an aging attribute respectively according to the clustering center corresponding to each data category.
In an alternative embodiment, the processor 210 is further configured to execute the following steps when the computer program runs:
and screening the logistics service data according to preset parameters, and deleting the logistics service data which do not meet preset conditions.
The multi-dimensional data processing apparatus further includes at least one network interface 212. The various components of the apparatus are coupled together by a bus system 213. It will be appreciated that the bus system 213 is used to enable connection and communication among these components. In addition to a data bus, the bus system 213 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 213 in fig. 3.
The memory 211 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 211 described in the embodiments of the present invention is intended to include, without being limited to, these and any other suitable types of memory.
The memory 211 in the embodiment of the present invention is used to store various types of data to support the operation of the transmitting end. Examples of such data include any computer program that operates on the transmitting end, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is used to implement various basic services and to process hardware-based tasks. The application programs may include various programs for implementing various application services. Here, the program that implements the method of the embodiment of the present invention may be included in an application program.
The embodiment further provides a computer storage medium, for example, including a memory 211 storing a computer program, which can be executed by a processor 210 in the transmitting end to perform the steps of the foregoing method. The computer storage medium can be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM; or various devices including one or any combination of the above memories, such as a smart phone, a tablet computer, a notebook computer, and the like. A computer storage medium having a computer program stored therein, the computer program, when executed by a processor, performing the steps of:
acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes;
performing dimensionality reduction on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data;
performing clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
reducing the dimension of the data to be processed to obtain corresponding two-dimensional data, and determining the category attribute number of the data to be processed according to the two-dimensional data; or
and reducing the dimension of the data to be processed to obtain corresponding three-dimensional data, and determining the category attribute number of the data to be processed according to the three-dimensional data.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
and performing dimensionality reduction on the data to be processed through a t-SNE algorithm to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
mapping the data to be processed to a high-dimensional space through Gaussian distribution to obtain high-dimensional data, and determining a first probability distribution parameter corresponding to the Gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
determining the position coordinates of the clustering center corresponding to each data category;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
and clustering the data to be processed through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
determining a class attribute weight value between each piece of data to be processed and other pieces of data to be processed in the data to be processed;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
performing dimensionality reduction on the logistics service data to obtain target dimension data, and determining that the logistics service data comprises three category attributes according to the target dimension data;
determining the category attribute corresponding to each data category according to the clustering center corresponding to each data category, including:
and determining the category attributes corresponding to the data categories to be a price attribute, a service attribute and an aging attribute respectively according to the clustering center corresponding to each data category.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
and screening the logistics service data according to preset parameters, and deleting the logistics service data which do not meet preset conditions.
Taking logistics service data as an example of the data to be processed, the working process of the multidimensional data processing method is described in further detail below through an alternative embodiment, with reference to FIG. 4. The multidimensional data processing method comprises the following steps:
S1: screening logistics service data according to preset parameters, and deleting the logistics service data which do not meet preset conditions, wherein the logistics service data are multidimensional data with multi-class attributes;
Here, the initial logistics service data includes the logistics service data of all merchants.
Further, the preset parameter may be a seasonal index D, see formula (2), where $S_1$ denotes the commodity sales for a certain month and $\bar{S}$ denotes the average monthly sales of the year.
$$D = \frac{S_1}{\bar{S}} \qquad (2)$$
Here, D greater than 1 means that the commodity sales for that month exceed the monthly average. A threshold, for example 4, may be set: if the number of months in which D is greater than 1 does not exceed 4, the merchant's goods are considered to have seasonal sales, and such merchants are screened out and excluded from the logistics service data to be processed.
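The seasonal screening rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `seasonal_indices` and `is_seasonal`, the parameter `max_peak_months`, and the use of twelve monthly sales figures per merchant are all hypothetical.

```python
from statistics import mean


def seasonal_indices(monthly_sales):
    """Return D = S_i / mean(S) for every month, following formula (2)."""
    average = mean(monthly_sales)
    if average == 0:
        raise ValueError("average monthly sales must be non-zero")
    return [sales / average for sales in monthly_sales]


def is_seasonal(monthly_sales, max_peak_months=4):
    """Assumed rule: a merchant is seasonal when the number of months
    with D > 1 does not exceed the threshold (4 in the example above)."""
    peak_months = sum(1 for d in seasonal_indices(monthly_sales) if d > 1)
    return peak_months <= max_peak_months


# Hypothetical data: sales concentrated in three months of the year.
concentrated = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100, 100, 100]
# Hypothetical data: sales alternating around a stable level all year.
spread_out = [90, 110] * 6

print(is_seasonal(concentrated))  # merchant would be screened out
print(is_seasonal(spread_out))    # merchant would be kept
```

Under this literal reading, perfectly flat sales (no month with D greater than 1) also satisfy the condition; the description does not address that case, so a practical screen might need an additional rule.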
Alternatively, the preset parameter may be the merchant decline index W, see formula (3), where $S_k$ denotes the commodity sales in month k and $S_{k-1}$ denotes the commodity sales in the month immediately preceding month k.
$$W = \frac{S_k}{S_{k-1}} \qquad (3)$$
Here, W less than 1 means that the commodity sales for the month are lower than those of the previous month. The merchant decline index may be evaluated over a one-year period. For example, if a merchant has 10 values of W less than 1 within the year, that is, its commodity sales declined month over month in 10 months, the merchant is determined to be in a decline period, and such merchants are screened out and excluded from the logistics service data to be processed.
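A comparable sketch for the decline-index screen is given below. The names `decline_indices`, `is_declining` and `min_decline_months` are hypothetical, and skipping months that follow zero sales (where the ratio is undefined) is an assumption not stated in the description. Note that twelve monthly figures yield only eleven ratios unless the last month of the previous year is also supplied.

```python
def decline_indices(monthly_sales):
    """Return W_k = S_k / S_(k-1) for each consecutive pair of months,
    following formula (3). Pairs whose earlier month is zero are skipped
    because the ratio is undefined (an assumption of this sketch)."""
    pairs = zip(monthly_sales, monthly_sales[1:])
    return [current / previous for previous, current in pairs if previous != 0]


def is_declining(monthly_sales, min_decline_months=10):
    """Assumed rule: a merchant is in a decline period when at least
    `min_decline_months` values of W are below 1 (10 in the example)."""
    drops = sum(1 for w in decline_indices(monthly_sales) if w < 1)
    return drops >= min_decline_months


# Hypothetical data: sales falling every month versus rising every month.
falling = list(range(120, 0, -10))   # 120, 110, ..., 10
rising = list(range(10, 130, 10))    # 10, 20, ..., 120

print(is_declining(falling))  # merchant would be screened out
print(is_declining(rising))   # merchant would be kept
```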
S2: reducing the dimensions of the logistics service data through a t-SNE algorithm to obtain corresponding two-dimensional data/three-dimensional data, and determining the category attribute number of the logistics service data;
Here, performing dimensionality reduction on the data to be processed through the t-SNE algorithm means performing dimensionality reduction and visual analysis on the data: the multidimensional data with multi-class attributes are reduced by the t-SNE algorithm to two-dimensional or three-dimensional data, which are then displayed in two or three dimensions.
Here, the two-dimensional data or the three-dimensional data is visualized data, and is mapped in two-dimensional coordinates or three-dimensional coordinates. For example, the dimension of the data to be processed is reduced to obtain two-dimensional data, the data to be processed is projected in a two-dimensional coordinate system, and the data to be processed can be divided into several categories, for example, three categories according to the distribution of the points of the two-dimensional data in the two-dimensional coordinate system, that is, the number of corresponding category attributes is determined to be 3.
Specifically, the obtaining of the two-dimensional data or the three-dimensional data by performing the dimension reduction on the logistics service data through a t-SNE algorithm includes:
mapping the logistics service data to a high-dimensional space through Gaussian distribution to obtain high-dimensional data, and determining a first probability distribution parameter corresponding to the Gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining two-dimensional data or three-dimensional data corresponding to the logistics service data through t distribution corresponding to the second probability distribution parameter.
Here, mapping the logistics service data to a high-dimensional space through a Gaussian distribution to obtain high-dimensional data means converting the high-dimensional Euclidean distance between every two data points in the logistics service data into a conditional probability that represents their similarity. Specifically, the distances are converted into conditional probabilities through a Gaussian distribution, and the first probability distribution parameter corresponding to the Gaussian distribution is determined.
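The conversion of pairwise Euclidean distances into Gaussian conditional probabilities can be illustrated with the sketch below. It follows the usual t-SNE form, in which the similarity of point j to point i is a Gaussian kernel on their squared distance, normalized over all other points. A single fixed bandwidth `sigma` is an assumption made for brevity; t-SNE as commonly implemented tunes a separate bandwidth for each point from a perplexity setting. The function names are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian_conditional_probabilities(points, sigma=1.0):
    """Return a matrix P where P[i][j] is the conditional probability that
    point i would pick point j as its neighbor under a Gaussian centered
    at point i. P[i][i] is defined as 0 and every row sums to 1."""
    if len(points) < 2:
        raise ValueError("at least two points are required")
    matrix = []
    for i, center in enumerate(points):
        weights = [
            0.0 if j == i
            else math.exp(-squared_distance(center, other) / (2 * sigma ** 2))
            for j, other in enumerate(points)
        ]
        total = sum(weights)
        if total == 0:
            raise ValueError("sigma is too small for the given distances")
        matrix.append([w / total for w in weights])
    return matrix


# Hypothetical four-dimensional records: two near each other, one far away.
records = [(0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0)]
P = gaussian_conditional_probabilities(records)
print(P[0][1] > P[0][2])  # the nearby record is the far more likely neighbor
```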
Further, a corresponding second probability distribution parameter is fitted according to the first probability distribution parameter and the relative entropy, and two-dimensional data or three-dimensional data corresponding to the logistics service data are obtained through the t distribution corresponding to the second probability distribution parameter.
$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \qquad (1)$$
Formula (1) is the calculation formula of the relative entropy, where D(p||q) is the relative entropy of the Gaussian distribution with respect to the t distribution, p(x) is the first probability distribution parameter, and q(x) is the second probability distribution parameter. Fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy means finding the second probability distribution parameter for which the relative entropy between the probability distribution of the logistics service data in the high-dimensional space and that in the low-dimensional space is minimal.
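To make the role of the relative entropy concrete, the sketch below evaluates formula (1) for a candidate low-dimensional layout. The low-dimensional similarities use the Student t kernel with one degree of freedom, as in standard t-SNE; that choice, the function names, and the toy probabilities are assumptions of this illustration. The sketch only scores a given layout and does not perform the gradient-based fitting itself.

```python
import math


def relative_entropy(p, q):
    """D(p || q) = sum of p(x) * log(p(x) / q(x)), as in formula (1).
    Terms with p(x) == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def student_t_similarities(embedding):
    """Return {(i, j): q_ij} for all i != j, where q_ij is proportional to
    1 / (1 + squared distance) and all values together sum to 1."""
    kernel = {}
    for i, a in enumerate(embedding):
        for j, b in enumerate(embedding):
            if i != j:
                d2 = sum((x - y) ** 2 for x, y in zip(a, b))
                kernel[(i, j)] = 1.0 / (1.0 + d2)
    total = sum(kernel.values())
    return {pair: value / total for pair, value in kernel.items()}


def embedding_cost(p_joint, embedding):
    """Relative entropy between high-dimensional similarities `p_joint`
    (a dict keyed by index pairs) and the similarities of `embedding`."""
    q_joint = student_t_similarities(embedding)
    pairs = sorted(p_joint)
    return relative_entropy([p_joint[k] for k in pairs],
                            [q_joint[k] for k in pairs])


# Hypothetical similarities: records 0 and 1 are close in the original space.
p_joint = {(0, 1): 0.4, (1, 0): 0.4, (0, 2): 0.05,
           (2, 0): 0.05, (1, 2): 0.05, (2, 1): 0.05}
faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]   # keeps 0 and 1 together
distorted = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0)]  # separates 0 and 1

print(embedding_cost(p_joint, faithful) < embedding_cost(p_joint, distorted))
```

A layout that preserves the neighborhood structure yields the smaller relative entropy, which is why minimizing it drives the low-dimensional points toward a faithful arrangement.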
Further, obtaining the two-dimensional data or the three-dimensional data corresponding to the logistics service data through the t distribution corresponding to the second probability distribution parameter means obtaining the two-dimensional data or the three-dimensional data by reducing the dimension of the logistics service data through the t distribution corresponding to the second probability distribution parameter.
S3: clustering the logistics service data through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
Here, after the category attribute number of the logistics service data is determined through the t-SNE algorithm, the logistics service data are clustered through a spectral clustering algorithm according to the category attribute number to obtain the cluster centers and data categories corresponding to the category attribute number. For example, when the category attribute number is determined to be 3, the data to be processed are clustered by the spectral clustering algorithm with the number of clusters set to 3, yielding 3 data categories and 3 cluster centers.
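The weight-value embodiment described earlier (pairs of records whose weight exceeds a set value are placed in the same category) can be read as grouping the connected components of a thresholded similarity graph. The sketch below illustrates that reading only. It is a simplification and not a full spectral clustering algorithm, which would additionally work with the eigenvectors of a graph Laplacian. The Gaussian weight, the threshold value, and all function names are assumptions.

```python
import math


def gaussian_weight(a, b, sigma=1.0):
    """Similarity weight between two records (assumed Gaussian kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))


def group_by_weight(points, threshold, sigma=1.0):
    """Label records so that any two whose weight exceeds `threshold`
    share a label (connected components via union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if gaussian_weight(points[i], points[j], sigma) > threshold:
                parent[find(i)] = find(j)

    label_of_root = {}
    return [label_of_root.setdefault(find(i), len(label_of_root))
            for i in range(len(points))]


def cluster_centers(points, labels):
    """Return {label: mean coordinate tuple} for each group."""
    sums = {}
    for point, label in zip(points, labels):
        total, count = sums.get(label, ([0.0] * len(point), 0))
        sums[label] = ([t + x for t, x in zip(total, point)], count + 1)
    return {label: tuple(t / count for t in total)
            for label, (total, count) in sums.items()}


# Hypothetical two-dimensional records forming three well-separated groups.
records = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0),
           (5.2, 5.0), (10.0, 0.0), (10.0, 0.2)]
labels = group_by_weight(records, threshold=0.5)
print(labels)
print(cluster_centers(records, labels))
```

In this toy data the number of groups found equals the category attribute number of 3 used in the example above; with real data the threshold would have to be chosen so that the grouping matches the number determined in step S2.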
S4: and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
Here, each data category has a cluster center, and each cluster center has corresponding position coordinates in the target dimension. For example, in a three-dimensional coordinate system, a cluster center A has position coordinates (x1, y1, z1), and the distance between the cluster center and the coordinate origin is
$$d = \sqrt{x_1^2 + y_1^2 + z_1^2}$$
Here, the set condition may be, for example, the one cluster center or the two cluster centers farthest from the coordinate origin, and may be configured as needed. The distance between each cluster center and the coordinate origin is calculated, and the category attribute is determined for each data category whose cluster center meets the set condition, that is, for the data category of the one or two cluster centers farthest from the coordinate origin. For example, suppose three data categories with three cluster centers are determined, and the position coordinates of the three cluster centers give the distances d1, d2 and d3. If the set condition is the cluster center farthest from the coordinate origin, the maximum of d1, d2 and d3 is taken; if the maximum is d1, the cluster center corresponding to d1 is selected, and the category attribute of the corresponding data category is determined accordingly.
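The selection of cluster centers by their distance from the coordinate origin can be sketched as follows. The center names and coordinates are hypothetical, and `count` stands in for the configurable set condition (one or two farthest centers).

```python
import math


def distance_from_origin(center):
    """Euclidean distance between a cluster center and the origin."""
    return math.sqrt(sum(c * c for c in center))


def farthest_centers(centers, count=1):
    """Return the names of the `count` cluster centers farthest from the
    origin, ordered from farthest to nearest."""
    ranked = sorted(centers,
                    key=lambda name: distance_from_origin(centers[name]),
                    reverse=True)
    return ranked[:count]


# Hypothetical three-dimensional cluster centers.
centers = {"A": (1.0, 2.0, 2.0), "B": (3.0, 4.0, 0.0), "C": (0.0, 0.0, 1.0)}

print({name: distance_from_origin(c) for name, c in centers.items()})
print(farthest_centers(centers))           # single farthest center
print(farthest_centers(centers, count=2))  # two farthest centers
```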
Thus, the logistics service data are acquired and reduced in dimension through the t-SNE algorithm to obtain corresponding two-dimensional or three-dimensional data, from which the category attribute number of the logistics service data is determined. On the one hand, screening the logistics service data eliminates the influence of seasonal factors and of the merchants' own operating factors; on the other hand, the t-SNE algorithm allows the category attribute number of the logistics service data to be determined accurately. Further, the logistics service data are clustered through a spectral clustering algorithm according to the category attribute number to obtain the cluster centers and data categories corresponding to the category attribute number, and the category attribute corresponding to each data category is determined according to its cluster center. In this way, the data categories and category attributes are derived from objective data characteristics, and the valuable information in the data is determined from them. Taking logistics service data as the data to be processed, analyzing the data categories and category attributes makes it possible to identify the specific reasons for the loss of merchants.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (20)

1. A method of multidimensional data processing, the method comprising:
acquiring data to be processed, wherein the data to be processed is multidimensional data with multi-class attributes;
performing dimensionality reduction on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data;
performing clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
and determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
2. The multidimensional data processing method of claim 1, wherein the performing the dimensionality reduction on the data to be processed to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data comprises:
reducing the dimension of the data to be processed to obtain corresponding two-dimensional data, and determining the category attribute number of the data to be processed according to the two-dimensional data; or
and reducing the dimension of the data to be processed to obtain corresponding three-dimensional data, and determining the category attribute number of the data to be processed according to the three-dimensional data.
3. The multidimensional data processing method of claim 1, wherein the performing the dimensionality reduction on the data to be processed to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data comprises:
and performing dimensionality reduction on the data to be processed through a t-SNE algorithm to obtain target dimensional data, and determining the category attribute number of the data to be processed according to the target dimensional data.
4. The multidimensional data processing method of claim 3, wherein the reducing the dimension of the data to be processed by the t-SNE algorithm to obtain the target dimension data comprises:
mapping the data to be processed to a high-dimensional space through Gaussian distribution to obtain high-dimensional data, and determining a first probability distribution parameter corresponding to the Gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, and obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter.
5. The multidimensional data processing method of claim 1, wherein determining the category attribute corresponding to each of the data categories according to the cluster center corresponding to the data category comprises:
determining the position coordinates of the clustering center corresponding to each data category;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
6. The multidimensional data processing method of claim 1, wherein the performing cluster analysis on the data to be processed according to the category attribute number to obtain a cluster center and a data category corresponding to the category attribute number comprises:
and clustering the data to be processed through a spectral clustering algorithm according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number.
7. The multi-dimensional data processing method according to claim 6, wherein the clustering the data to be processed according to the class attribute number by a spectral clustering algorithm to obtain a cluster center and a data class corresponding to the class attribute number comprises:
determining a class attribute weight value between each piece of data to be processed and other pieces of data to be processed in the data to be processed;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
8. The multidimensional data processing method of claim 1, wherein the data to be processed comprises logistics service data, the performing dimension reduction processing on the data to be processed to obtain target dimension data, and determining the category attribute number of the data to be processed according to the target dimension data comprises:
performing dimensionality reduction on the logistics service data to obtain target dimension data, and determining that the logistics service data comprises three category attributes according to the target dimension data;
determining the category attribute corresponding to each data category according to the clustering center corresponding to each data category, including:
and determining the category attributes corresponding to the data categories to be a price attribute, a service attribute and an aging attribute respectively according to the clustering center corresponding to each data category.
9. The multidimensional data processing method of claim 8, wherein before the dimension reduction processing is performed on the logistics service data to obtain target dimension data, the method comprises:
and screening the logistics service data according to preset parameters, and deleting the logistics service data which do not meet preset conditions.
10. A multi-dimensional data processing apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring data to be processed, and the data to be processed is multidimensional data with multi-class attributes;
the dimensionality reduction module is used for carrying out dimensionality reduction on the data to be processed to obtain target dimensionality data and determining the category attribute number of the data to be processed according to the target dimensionality data;
the clustering module is used for carrying out clustering analysis on the data to be processed according to the category attribute number to obtain a clustering center and a data category corresponding to the category attribute number;
and the determining module is used for determining the category attribute corresponding to the data category according to the clustering center corresponding to each data category.
11. The multidimensional data processing device of claim 10, wherein the dimension reduction module is further configured to perform dimension reduction on the data to be processed to obtain corresponding two-dimensional data, and determine the category attribute number of the data to be processed according to the two-dimensional data; or
the dimension reduction module is further configured to perform dimension reduction on the data to be processed to obtain corresponding three-dimensional data, and determine the category attribute number of the data to be processed according to the three-dimensional data.
12. The multidimensional data processing apparatus of claim 10, wherein the dimension reduction module is further configured to perform dimension reduction on the data to be processed through a t-SNE algorithm to obtain target dimension data, and determine the category attribute number of the data to be processed according to the target dimension data.
13. The multidimensional data processing apparatus of claim 10, wherein the dimension reduction module is further configured to map the data to be processed to a high-dimensional space through a gaussian distribution to obtain high-dimensional data, and determine a first probability distribution parameter corresponding to the gaussian distribution;
and fitting a corresponding second probability distribution parameter according to the first probability distribution parameter and the relative entropy, obtaining target dimension data corresponding to the data to be processed through t distribution corresponding to the second probability distribution parameter, and determining the category attribute number of the data to be processed according to the target dimension data.
14. The multidimensional data processing apparatus of claim 10, wherein the determining module is further configured to determine a location coordinate of a cluster center corresponding to each of the data categories;
and calculating the distance between the position coordinate of the clustering center corresponding to each data category and the coordinate origin, and determining the category attribute corresponding to each data category according to the data category of the clustering center, the distance between which and the coordinate origin accords with the set conditions.
15. The multidimensional data processing device of claim 10, wherein the clustering module is further configured to cluster the data to be processed by a spectral clustering algorithm according to the category attribute number to obtain a cluster center and a data category corresponding to the category attribute number.
16. The multi-dimensional data processing apparatus of claim 15, wherein the clustering module is further configured to determine a class attribute weight value between each of the to-be-processed data and other to-be-processed data;
and when the weight value of the category attributes of any two data to be processed is greater than a set value, determining the data to be the same category and marking the data to be clustered, obtaining a clustering result containing the data category corresponding to the category attribute number, and determining a clustering center corresponding to the data category according to the category attribute number.
17. The multidimensional data processing device of claim 10, wherein the dimension reduction module is further configured to perform dimension reduction processing on the logistics service data to obtain target dimension data, and determine that the logistics service data includes three category attributes according to the target dimension data;
the determining module is further configured to determine, according to the clustering center corresponding to each data category, that the category attributes corresponding to the data categories are a price attribute, a service attribute, and an aging attribute, respectively.
18. The multidimensional data processing device of claim 17, wherein the obtaining module is further configured to filter the logistics service data according to a preset parameter, and delete the logistics service data that does not satisfy a preset condition.
19. A computer device, comprising: a processor and a memory for storing a computer program capable of running on the processor;
wherein the processor is configured to implement the multi-dimensional data processing method of any one of claims 1 to 9 when running the computer program.
20. A computer storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the multidimensional data processing method of any one of claims 1 to 9.
CN201910472215.6A 2019-05-31 2019-05-31 Multidimensional data processing method and device, computer equipment and storage medium Pending CN112016581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910472215.6A CN112016581A (en) 2019-05-31 2019-05-31 Multidimensional data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910472215.6A CN112016581A (en) 2019-05-31 2019-05-31 Multidimensional data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112016581A true CN112016581A (en) 2020-12-01

Family

ID=73506174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910472215.6A Pending CN112016581A (en) 2019-05-31 2019-05-31 Multidimensional data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112016581A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885080A (en) * 2021-01-11 2021-06-01 重庆长安新能源汽车科技有限公司 Construction method for driving condition of new energy automobile
CN112885080B (en) * 2021-01-11 2022-06-21 重庆长安新能源汽车科技有限公司 Construction method for driving condition of new energy automobile
CN114510525A (en) * 2022-04-18 2022-05-17 深圳丰尚智慧农牧科技有限公司 Data format conversion method and device, computer equipment and storage medium
CN116679888A (en) * 2023-07-27 2023-09-01 申合信科技集团有限公司 E-commerce data optimized storage method based on manifold learning
CN116679888B (en) * 2023-07-27 2023-10-10 申合信科技集团有限公司 E-commerce data optimized storage method based on manifold learning

Similar Documents

Publication Publication Date Title
US10504120B2 (en) Determining a temporary transaction limit
Gan et al. Regression modeling for the valuation of large variable annuity portfolios
US20180349324A1 (en) Real-time and computationally efficent prediction of values for a quote variable in a pricing application
CN106952072A (en) A kind of method and system of data processing
US9990597B2 (en) System and method for forecast driven replenishment of merchandise
CN112016581A (en) Multidimensional data processing method and device, computer equipment and storage medium
CN107622326B (en) User classification and available resource prediction method, device and equipment
US20170193538A1 (en) System and method for determining the priority of mixed-type attributes for customer segmentation
WO2020140681A1 (en) Numerical value calculation method and apparatus, computer device, and storage medium
US20170169447A1 (en) System and method for segmenting customers with mixed attribute types using a targeted clustering approach
CN110942392A (en) Service data processing method, device, equipment and medium
CN111932188A (en) Method, electronic device and storage medium for inventory management
CN116757779A (en) Recommendation method based on user portrait
CN110650170A (en) Method and device for pushing information
US11216761B2 (en) System and method for supply chain optimization
Liu et al. Real-time valuation of large variable annuity portfolios: a green mesh approach
US20220351051A1 (en) Analysis system, apparatus, control method, and program
CN111581296B (en) Data correlation analysis method and device, computer system and readable storage medium
CN113780912A (en) Method and device for determining safety stock
US11803868B2 (en) System and method for segmenting customers with mixed attribute types using a targeted clustering approach
CN115827994A (en) Data processing method, device, equipment and storage medium
CN110826579A (en) Commodity classification method and device
CN115116080A (en) Table analysis method and device, electronic equipment and storage medium
CN110837604B (en) Data analysis method and device based on housing monitoring platform
Rojas Time dependence in joint replacement to multi-products grouped. The case of hospital food service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination