CN116679888B - E-commerce data optimized storage method based on manifold learning - Google Patents
- Publication number
- CN116679888B (application CN202310927504.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- intersection
- category
- matching
- dimension reduction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0604—Improving or facilitating administration, e.g. storage management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of electronic commerce data processing, and in particular to an electronic commerce data optimized storage method based on manifold learning. For any K value, the method reduces the dimension of the e-commerce data with an equidistant mapping algorithm to obtain dimension reduction data; clusters the e-commerce data and the dimension reduction data to obtain original categories and dimension reduction categories, and matches them into matching category pairs; iteratively updates the intersection of the original category and the dimension reduction category in each matching category pair to obtain the adjacency of the matching category pair to the data in the intersection produced by each iteration; obtains a reserved set of each matching category pair according to the overlap between the dimension reduction category and the data sets corresponding to the intersections produced by the iterations of the matching category pair; obtains an optimal K value according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction under different K values; and compresses and stores the dimension reduction data corresponding to the optimal K value. The invention enables the dimension reduction data to preserve the shape of the e-commerce data to the greatest extent, improving both the dimension reduction effect and the data storage effect.
Description
Technical Field
The invention relates to the technical field of electronic commerce data processing, in particular to an electronic commerce data optimized storage method based on manifold learning.
Background
With the rapid development of electronic commerce, more and more data needs to be stored and managed. Data storage is a critical link in electronic commerce: it determines whether an enterprise can operate successfully, provide high-quality services, and earn the trust and loyalty of its customers. E-commerce data storage refers to storing the various data of an e-commerce enterprise, such as orders, customer information, and product information, in a reliable database. Scalability is also very important in e-commerce data storage. As the size and business of an enterprise grow, the amount of data to be stored increases, and e-commerce data is generally high-dimensional. The scalability and performance of the data storage must therefore be considered, and a suitable storage technology and architecture must be adopted so that storage and access performance does not become a bottleneck as the data volume keeps growing.
A common current method for storing e-commerce data is to reduce the dimension of the data with an equidistant mapping algorithm (Isometric Mapping, ISOMAP) and then store the dimension-reduced data, which makes the data storage scalable. However, the choice of the K value strongly influences the dimension reduction result of the equidistant mapping algorithm. The K value is usually set to an empirical value without regard to the actual e-commerce data, so the dimension reduction effect is poor, and the storage effect is correspondingly poor when the e-commerce data is stored.
Disclosure of Invention
In order to solve the technical problem that the storage effect is poor when e-commerce data is reduced in dimension by an equidistant mapping algorithm and the dimension-reduced data is stored, the invention provides an electronic commerce data optimized storage method based on manifold learning. The adopted technical scheme is as follows:
acquiring electronic commerce data;
for any K value, reducing the dimension of the e-commerce data by using an equidistant mapping algorithm to obtain dimension reduction data; clustering the e-commerce data and the dimension reduction data respectively to obtain original categories and dimension reduction categories; matching the original categories with the dimension reduction categories to obtain matching category pairs; iteratively updating the intersection of the original category and the dimension reduction category in the same matching category pair by a data removing method, and obtaining, from the proportion of the data in the intersection produced by each iterative update, the adjacency of the matching category pair to the data in that intersection; obtaining the data set corresponding to each intersection from the intersection produced by each iterative update; obtaining a reserved set of each matching category pair according to the overlap between the dimension reduction category and the data sets corresponding to the intersections produced by the iterative updates of the matching category pair;
obtaining an optimal K value according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction under different K values;
and compressing and storing the dimension reduction data corresponding to the optimal K value.
Preferably, the data removing method comprises the following steps:
for the corresponding matching category pairs under any K value, taking the intersection of the original category and the dimension-reduction category in each matching category pair as the initial intersection of the matching category pair;
taking the intersection of the initial intersections of all matching class pairs as a first intersection; deleting the data belonging to the first intersection from the initial intersection to obtain a first intersection to be deleted of each matching class pair;
calculating, for each data item, the sum of its occurrence frequencies over all first intersections to be deleted, denoted the first frequency; deleting the data with the maximum first frequency from all first intersections to be deleted to obtain second intersections to be deleted, the set formed by the deleted data being called the second intersection; calculating, for each data item, the sum of its occurrence frequencies over all second intersections to be deleted, denoted the second frequency; deleting the data with the maximum second frequency from all second intersections to be deleted to obtain third intersections to be deleted, the set formed by the deleted data being called the third intersection; and so on, continuously updating the intersections to be deleted and the intersections; the iteration stops when the sum of the occurrence frequencies of every data item in the latest intersections to be deleted is smaller than a preset data quantity.
Preferably, the obtaining the adjacency of the matching category to the data in the intersection updated in each iteration according to the ratio of the data in the intersection updated in each iteration includes:
for the matching category pairs obtained under any K value, selecting any iterative update as the target iterative update, and taking the ratio of the number of occurrences of the data in all intersections obtained by the target iterative update to the number of data in the initial intersection as the adjacency of the matching category pair to the data in the intersection obtained by the target iterative update.
Preferably, the obtaining the data set corresponding to the intersection according to the intersection updated in each iteration includes:
taking the first intersection as a first data set corresponding to the first intersection; adding the data in the second intersection into the first data set to obtain a second data set corresponding to the second intersection; adding the data in the third intersection into the second data set to obtain a third data set corresponding to the third intersection; and so on, obtaining a data set corresponding to each intersection.
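The accumulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cumulative_data_sets` is hypothetical, and each intersection is assumed to be a Python set of record identifiers, listed in the order in which the iterations produced them.

```python
from itertools import accumulate


def cumulative_data_sets(intersections):
    """Return the data set for each intersection (hypothetical helper).

    intersections: list of sets, ordered as first, second, third, ...
    The n-th data set is the union of the first n intersections, so the
    first data set equals the first intersection itself.
    """
    return list(accumulate(intersections, lambda acc, s: acc | s))
```

Each data set contains the previous one, so the sequence grows monotonically with the iterations.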
Preferably, the clustering the e-commerce data and the dimension reduction data to obtain an original category and a dimension reduction category includes:
for the E-commerce data, taking a negative correlation normalization value of the similarity of any two E-commerce data as the distance characteristic of the two E-commerce data; clustering the electronic commerce data according to the distance characteristics of the two electronic commerce data to obtain an original category;
regarding the dimension reduction data, taking the negative correlation normalization value of the similarity of any two dimension reduction data as the distance characteristic of the two dimension reduction data; and clustering the dimension reduction data according to the distance characteristics of the two dimension reduction data to obtain the dimension reduction category.
Preferably, the matching the original category and the dimension reduction category to obtain a matching category pair includes:
matching the original category and the dimension reduction category by using an optimal matching algorithm to obtain a matching category pair; wherein each matching class pair includes an original class and a reduced-dimension class.
Preferably, the obtaining a reserved set of the matching category pair according to the overlap between the dimension reduction category and the data sets corresponding to the intersections produced by the iterative updates of the matching category pair includes:
for the matching category pairs obtained under any K value, when the data in the dimension reduction category of a matching category pair is the same as the data in the data set corresponding to the intersection of an iterative update, taking that data set as the reserved set of the matching category pair.
Preferably, the obtaining the optimal K value according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction under different K values includes:
for any K value, taking the sum of the adjacencies of the data in the reserved sets of all corresponding matching category pairs as the retention of the K value;
and taking the K value corresponding to the maximum retention as the optimal K value.
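The selection rule above amounts to a sum followed by an arg-max. The following Python sketch illustrates it under assumptions: the function name `optimal_k` and the input layout (a dictionary mapping each K value to the list of adjacency values of the data in its reserved sets) are hypothetical, and computing the adjacency values themselves is not shown.

```python
def optimal_k(adjacency_by_k):
    """Pick the K value with the largest retention (hypothetical helper).

    adjacency_by_k: dict mapping a K value to the adjacency values of the
    data in the reserved sets of all matching category pairs for that K.
    Returns the best K and the retention computed for every K.
    """
    # Retention of a K value = sum of the adjacencies in its reserved sets.
    retention = {k: sum(values) for k, values in adjacency_by_k.items()}
    # The optimal K value is the one with the maximum retention.
    best = max(retention, key=retention.get)
    return best, retention
```

If several K values tie for the maximum retention, this sketch returns the first one encountered; the source does not state a tie-breaking rule.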
The embodiment of the invention has at least the following beneficial effects:
The invention analyzes the dimension reduction data and the original e-commerce data under different K values. First, for any K value, the e-commerce data and the dimension reduction data are clustered to obtain original categories and dimension reduction categories; clustering makes it possible to analyze category consistency and the adjacency of the dimension reduction data in the subsequent steps. The original categories are then matched with the dimension reduction categories, and the intersection of the original category and the dimension reduction category in the same matching category pair is iteratively updated by the data removing method to obtain the adjacency of the matching category pair to the data in the intersection produced by each update. If certain data remain adjacent in the dimension reduction data obtained under different K values, those data are relatively more important and represent the characteristics of the data, and the dimension reduction data should retain this information as far as possible; the adjacency of the data is therefore calculated so that the optimal K value can be found later. A reserved set of each matching category pair is obtained according to the overlap between the dimension reduction category and the data sets corresponding to the intersections produced by the iterative updates of the matching category pair; the adjacency of the data in the reserved set further reflects the importance of the data, and the optimal K value is obtained from the adjacency of the data in the reserved sets obtained by dimension reduction under different K values. Finally, the dimension reduction data corresponding to the optimal K value is compressed and stored.
By finding the K value whose dimension reduction data preserves the most adjacency relations among the elements of the e-commerce data, and performing ISOMAP dimension reduction with that K value, the invention enables the dimension reduction data to preserve the shape of the e-commerce data to the greatest extent, improving both the dimension reduction effect and the data storage effect.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for optimizing and storing e-commerce data based on manifold learning according to an embodiment of the present invention.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following detailed description is given below of the detailed implementation, structure, characteristics and effects of the method for optimizing and storing electronic commerce data based on manifold learning according to the invention by combining the accompanying drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The embodiment of the invention provides a specific implementation of the e-commerce data optimized storage method based on manifold learning, which is suitable for data optimized storage scenarios. The technical problem addressed in this scenario is that the storage effect is poor when e-commerce data is reduced in dimension by an equidistant mapping algorithm and the dimension-reduced data is stored. By finding the K value whose dimension reduction data preserves the most adjacency relations among the elements of the e-commerce data, and performing ISOMAP dimension reduction with that K value, the invention enables the dimension reduction data to preserve the shape of the e-commerce data to the greatest extent, improving both the dimension reduction effect and the data storage effect.
The following specifically describes a specific scheme of the e-commerce data optimized storage method based on manifold learning provided by the invention with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of steps of a method for optimizing and storing e-commerce data based on manifold learning according to an embodiment of the present invention is shown, where the method includes the following steps:
step S100, acquiring electronic commerce data.
First, the e-commerce data is acquired. In the embodiment of the present invention, the e-commerce data includes the payment method, payment amount, payment time, refund record, and similar fields. Each e-commerce data record contains all of these fields, and a plurality of e-commerce data records are acquired; for example, e-commerce data A = {payment method, payment amount, payment time, refund record}.
Step S200, for any K value, reducing the dimension of the e-commerce data by using an equidistant mapping algorithm to obtain dimension reduction data; clustering the e-commerce data and the dimension reduction data respectively to obtain original categories and dimension reduction categories; matching the original categories with the dimension reduction categories to obtain matching category pairs; iteratively updating the intersection of the original category and the dimension reduction category in the same matching category pair by a data removing method, and obtaining, from the proportion of the data in the intersection produced by each iterative update, the adjacency of the matching category pair to the data in that intersection; obtaining the data set corresponding to each intersection from the intersection produced by each iterative update; and obtaining a reserved set of each matching category pair according to the overlap between the dimension reduction category and the data sets corresponding to the intersections produced by the iterative updates of the matching category pair.
The e-commerce data is reduced in dimension by the equidistant mapping algorithm under different K values, giving one set of dimension reduction data for each K value.
The original e-commerce data is clustered, and then the dimension-reduced data is clustered. The adjacency of the data is obtained from the consistency of the clusterings, which gives the importance of each data item and in turn the retention of each K value, from which the optimal K value is obtained.
First, the e-commerce data is reduced in dimension with the ISOMAP algorithm under different K values to obtain multiple sets of dimension reduction data.
The ISOMAP algorithm is a nonlinear dimension reduction technique that maps high-dimensional data into a low-dimensional space. Its basic steps are as follows:
(1) Define a distance matrix: calculate the distance between each pair of sample points in the high-dimensional data, using the Euclidean distance, a cosine-similarity-based distance, or another distance measure.
(2) Construct an adjacency graph: determine the K nearest neighbors of each sample from the distance matrix to form an adjacency graph. This determines the local connectivity of each sample point.
(3) Calculate the shortest paths: calculate the shortest-path distance between each pair of sample points in the adjacency graph using a graph algorithm such as Dijkstra's algorithm.
(4) Construct a low-dimensional embedded representation: use the shortest-path distances to reconstruct the topology between the sample points, and map them to a low-dimensional space by multidimensional scaling (Multidimensional Scaling, MDS) or a similar method.
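Steps (1) to (3) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the Euclidean distance is assumed, the neighbor graph is made symmetric, and the Floyd-Warshall algorithm is used for the shortest paths in place of the Dijkstra algorithm mentioned above because it is shorter to write. Step (4), the MDS embedding, is omitted because it requires an eigendecomposition.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geodesic_distances(points, k):
    """Approximate geodesic distances over a K-nearest-neighbor graph."""
    n = len(points)
    # Step (1): pairwise distance matrix.
    d = [[euclidean(p, q) for q in points] for p in points]
    # Step (2): connect every point to its k nearest neighbors.
    g = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: d[i][j])
        for j in others[:k]:
            g[i][j] = d[i][j]
            g[j][i] = d[i][j]  # assumption: undirected neighbor graph
    # Step (3): all-pairs shortest paths (Floyd-Warshall).
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g
```

A pair of points that the neighbor graph leaves disconnected keeps the distance `math.inf`. Because the graph depends on K, so do the geodesic distances and the resulting embedding.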
If certain data remain adjacent in the ISOMAP dimension reduction results under different K values, they are treated as important data that reflect the characteristics of the data, and this information should be retained as far as possible after dimension reduction.
Each e-commerce data record is taken as a data point, each data point being high-dimensional. All data points are reduced in dimension with the ISOMAP algorithm under different K values to obtain the dimension reduction data corresponding to each K value. For example, the original data (100, 1100, 1230, 1245) becomes the two-dimensional data (16, 13) after dimension reduction; this example is purely illustrative and only shows the form of the data before and after dimension reduction. In the embodiment of the invention, the K value ranges from 5 to 20; in other embodiments, the implementer may adjust the range according to the actual situation.
The data points before dimension reduction are called data, and the data points obtained after dimension reduction are called dimension reduction data points. The original data point corresponding to each dimension reduction data point is known from the dimension reduction process and is called its corresponding data, so the dimension reduction data points corresponding to the data under each K value can be obtained. In this way, dimension reduction data is obtained by reducing the dimension of the e-commerce data with an equidistant mapping algorithm for any K value.
Further, the electronic commerce data and the dimension reduction data are clustered respectively to obtain an original category and a dimension reduction category.
The original e-commerce data is clustered, and then the dimension-reduced data is clustered. The adjacency of the data points is obtained from the consistency of the clusterings, which in turn gives the retention of each K value and the optimal K value.
If certain data remain adjacent in the ISOMAP dimension reduction results under different K values, they are treated as important data that represent the characteristics of the data, and the dimension reduction data needs to retain this information as far as possible.
The adjacency of the data points can be obtained by overlaying the clustering results: if certain data points fall in the same cluster category in the dimension reduction results under different K values, those data points have a larger adjacency.
The original categories and the dimension reduction categories are obtained as follows. For the e-commerce data, the negative correlation normalization value of the similarity of any two e-commerce data is taken as the distance feature of the two e-commerce data, and the e-commerce data is clustered according to these distance features to obtain the original categories. For the dimension reduction data, the negative correlation normalization value of the similarity of any two dimension reduction data is taken as the distance feature of the two dimension reduction data, and the dimension reduction data is clustered according to these distance features to obtain the dimension reduction categories. In the embodiment of the invention, the cosine similarity between two data is used as their similarity, and the difference between the constant 1 and the cosine similarity is used as the negative correlation normalization value of the similarity; in other embodiments, other similarity measures and other negative correlation normalization methods may be used.
In other words, in the embodiment of the present invention, the cosine similarity of every two e-commerce data is calculated, and the difference between the constant 1 and the cosine similarity is used as the distance feature of the two e-commerce data, that is, the distance between the corresponding data points. The e-commerce data is then hierarchically clustered by the complete-linkage method to obtain a plurality of categories, namely the original categories. The dimension reduction data obtained with the ISOMAP algorithm under each K value is likewise hierarchically clustered to obtain a plurality of categories, called the dimension reduction categories.
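The distance feature and the complete-linkage clustering can be sketched as follows. This is a naive illustration under assumptions: the function names are hypothetical, every data point is assumed to be a non-zero vector, and the stopping rule (a target number of clusters, `n_clusters`) is an assumption because the text does not state how the number of categories is chosen.

```python
import math


def cosine_distance(a, b):
    """Distance feature: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def complete_linkage(points, n_clusters):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                link = max(cosine_distance(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        # Merge the two closest clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

The same function can be applied to the original data points and to the dimension reduction data points, which yields the two groups of categories as sets of record indices that can be compared directly.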
The original categories are then matched with the dimension reduction categories to obtain matching category pairs. Specifically, the original categories and the dimension reduction categories are matched by using an optimal matching algorithm (the KM algorithm), where each matching category pair includes one original category and one dimension reduction category. The nodes on the two sides of the KM matching are the original categories and the dimension reduction categories respectively. Every node on the left side has an edge to every node on the right side, and the edge value is the intersection-over-union ratio of the original category and the dimension reduction category. By the KM maximum matching principle, the original categories formed by the e-commerce data are put in one-to-one correspondence with the dimension reduction categories formed by the dimension reduction data, and each matched pair consisting of one original category and one dimension reduction category is a matching category pair. In this way, the original category and the dimension reduction category corresponding to the e-commerce data under each K value can be obtained.
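The matching can be illustrated with a small Python sketch. It is not the KM algorithm: for brevity it searches all one-to-one assignments by brute force, which gives the same maximum-weight matching on small inputs but scales factorially. The function names are hypothetical, categories are assumed to be sets of record identifiers, and both sides are assumed to contain the same number of categories.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union ratio of two sets (the edge value)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_categories(original, reduced):
    """Return (original_index, reduced_index) pairs maximizing total IoU.

    Brute-force stand-in for the KM algorithm; assumes
    len(original) == len(reduced) and a small number of categories.
    """
    best_perm, best_score = None, -1.0
    for perm in permutations(range(len(reduced))):
        score = sum(iou(original[i], reduced[j])
                    for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(enumerate(best_perm))
```

For more than a handful of categories, a proper Kuhn-Munkres (Hungarian) implementation would be the natural replacement for the permutation search.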
After the original categories of the e-commerce data, the dimension reduction categories of the dimension reduction data, and the matching category pairs under each K value have been obtained, the intersection of the original category and the dimension reduction category in the same matching category pair is iteratively updated by the data removing method, and the adjacency of the matching category pair to the data in the intersection produced by each iterative update is obtained from the proportion of the data in that intersection. It should be noted that the intersection of the original category and the dimension reduction category consists of data; that is, the data in the intersection are the data present in both the original category and the dimension reduction category.
The data removal method is as follows:
For the matching category pairs obtained under any K value, the intersection of the original category and the dimension-reduction category in each matching category pair is taken as the initial intersection of that pair. The intersection of the initial intersections of all matching category pairs is taken as the first intersection, and the data belonging to the first intersection are deleted from each initial intersection to obtain the first to-be-deleted intersection of each matching category pair.
The sum of the occurrence frequencies of each datum over all first to-be-deleted intersections is computed and recorded as the first frequency; the data with the maximum first frequency are deleted from all first to-be-deleted intersections to obtain the second to-be-deleted intersections, and the set of deleted data is called the second intersection. The sum of the occurrence frequencies of each datum over all second to-be-deleted intersections is computed and recorded as the second frequency; the data with the maximum second frequency are deleted from all second to-be-deleted intersections to obtain the third to-be-deleted intersections, and the set of deleted data is called the third intersection. The sum of the occurrence frequencies of each datum over all third to-be-deleted intersections is computed and recorded as the third frequency; the data with the maximum third frequency are deleted from all third to-be-deleted intersections to obtain the fourth to-be-deleted intersections, and the set of deleted data is called the fourth intersection.
The to-be-deleted intersections and the intersections continue to be updated in this way, and the iteration stops when the sum of the occurrence frequencies of each datum in the latest to-be-deleted intersections is smaller than the preset data quantity.
Note that, at each iterative update, more than one datum may attain the maximum sum of occurrence frequencies.
In the embodiment of the invention, the preset data quantity is the number of data in the initial intersection multiplied by a preset multiple.
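The data removal loop can be sketched in Python as follows. Several points are assumptions made only for illustration: each initial intersection is modelled as a set of data identifiers, the occurrence frequency of a datum is the number of to-be-deleted intersections that still contain it, and the stopping rule is read as "the total number of remaining occurrences is below `multiple` times the total size of the initial intersections". The name `data_removal`, the parameter `multiple`, and the toy sets (which overlap on purpose so that frequencies differ) are hypothetical.

```python
from collections import Counter


def data_removal(initial, multiple=0.25):
    """Return the intersections [first, second, ...] in removal order.

    initial: list of sets, one initial intersection per matching category pair.
    multiple: assumed preset multiple used to derive the stopping threshold.
    """
    first = set.intersection(*initial) if initial else set()
    remaining = [s - first for s in initial]  # first to-be-deleted intersections
    intersections = [first]
    threshold = multiple * sum(len(s) for s in initial)  # assumed reading
    while True:
        # Occurrence frequency of each datum over all to-be-deleted intersections.
        freq = Counter(x for s in remaining for x in s)
        if not freq or sum(freq.values()) < threshold:
            break
        top = max(freq.values())
        # Several data may share the maximum frequency; all are removed together.
        removed = {x for x, c in freq.items() if c == top}
        intersections.append(removed)
        remaining = [s - removed for s in remaining]
    return intersections


initial = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}]
print(data_removal(initial, multiple=0.25))
print(data_removal(initial, multiple=0.5))
```

With the smaller multiple every datum is eventually removed; with the larger one the loop stops after the second intersection, which shows how the preset multiple controls the number of iterative updates.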
Further, the adjacency of the data in the intersection obtained at each iterative update of a matching category pair is derived from the proportion of the data in that intersection. Specifically, for the matching category pairs obtained under any K value, any iterative update is selected as the target iterative update, and the ratio of the number of occurrences of a datum in all intersections obtained by the target iterative update to the number of data in the initial intersection is taken as the adjacency of that datum in the intersection of the matching category pair obtained by the target iterative update.
The more often a datum occurs in all intersections obtained by the target iterative update, the greater its adjacency; conversely, the less often it occurs, the smaller its adjacency.
In this way, the adjacency of the data in the intersection of each iterative update of each matching category pair can be computed. A greater adjacency means that the datum is more likely to occur together with other data.
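As a small worked example of this ratio, the Python sketch below divides the occurrence count of each datum by the size of the initial intersection. Which sets the occurrences are counted over is not spelled out in detail in the text, so the sketch assumes they are the to-be-deleted intersections in force just before the datum is removed; `datum_adjacency`, `counted_sets` and `initial_size` are illustrative names.

```python
from collections import Counter


def datum_adjacency(intersection, counted_sets, initial_size):
    """Adjacency of each datum in one iteratively updated intersection.

    intersection: the data removed at the target iterative update.
    counted_sets: sets over which occurrences are counted (assumed to be the
        to-be-deleted intersections just before the removal).
    initial_size: number of data in the initial intersection.
    """
    if initial_size <= 0:
        raise ValueError("the initial intersection must be non-empty")
    counts = Counter(x for s in counted_sets for x in s if x in intersection)
    return {x: counts[x] / initial_size for x in intersection}


# Datum 3 occurs in two of the three sets and the initial intersection has 4 data.
adj_second = datum_adjacency({3}, [{3, 4}, {3, 5}, {6, 7}], initial_size=4)
# Data 4 and 5 each occur once in the following round.
adj_third = datum_adjacency({4, 5}, [{4}, {5}, {6, 7}], initial_size=4)
print(adj_second, adj_third)
```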
A data set corresponding to each intersection is obtained from the intersection of each iterative update. Specifically, the first intersection is taken as the first data set, corresponding to the first intersection; the data in the second intersection are added to the first data set to obtain the second data set, corresponding to the second intersection; the data in the third intersection are added to the second data set to obtain the third data set, corresponding to the third intersection; and so on, until a data set has been obtained for every intersection. The adjacency of each data set is the sum of the adjacencies of the data in the intersection corresponding to that data set.
That is, the data in the first intersection are taken as the initial data to form the first data set, and the sum of the adjacencies of the data in the first intersection is taken as the adjacency of the first data set;
the data in the second intersection are added to the first data set to obtain the second data set, and the sum of the adjacencies of the data in the second intersection is taken as the adjacency of the second data set;
the data in the third intersection are added to the second data set to obtain the third data set, and the sum of the adjacencies of the data in the third intersection is taken as the adjacency of the third data set;
and so on, yielding a plurality of data sets and their corresponding adjacencies.
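The cumulative construction above can be sketched in Python with a running union. The function name `build_data_sets` and the toy adjacency values are assumptions for illustration; the adjacency of each data set is computed, as the text states, from the data of the corresponding intersection only, not from the whole accumulated set.

```python
from itertools import accumulate


def build_data_sets(intersections, adjacency):
    """Return (data_sets, set_adjacencies) for the ordered intersections.

    intersections: [first, second, ...] sets in iterative-update order.
    adjacency: mapping from datum to its adjacency.
    """
    # The n-th data set is the union of the first n intersections.
    data_sets = list(accumulate(intersections, lambda acc, s: acc | s))
    # Each data set's adjacency sums only over its own intersection.
    set_adjacencies = [sum(adjacency[x] for x in s) for s in intersections]
    return data_sets, set_adjacencies


intersections = [{1, 2}, {3}, {4, 5}]
adjacency = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.25, 5: 0.25}
data_sets, set_adjacencies = build_data_sets(intersections, adjacency)
print(data_sets, set_adjacencies)
```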
The more closely the adjacency relationships of the dimension-reduction data obtained with a given K value agree with the data adjacencies computed above, the better the dimension-reduction effect achieved with that K value.
For the matching category pairs to which the dimension-reduction categories of the dimension-reduction data obtained under each K value belong, the data set corresponding to the intersection of each matching category pair at each iterative update is obtained.
A reserved set of each matching category pair is then obtained from the overlap between the dimension-reduction category and the data sets corresponding to the iteratively updated intersections of that pair. Specifically, for the matching category pairs obtained under any K value, when the data in the dimension-reduction category of a matching category pair are identical to the data in the data set of an iteratively updated intersection, that data set is taken as a reserved set of the matching category pair. That is, for each dimension-reduction category, the intersection-over-union (IoU) between the dimension-reduction category and each iteratively updated data set of the matching category pair to which it belongs is computed, and the data sets whose IoU equals 1 are kept as reserved sets. Further, each dimension-reduction category corresponds to a plurality of reserved sets.
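The IoU test for reserved sets can be written as a short Python sketch. The names `iou` and `reserved_sets` and the toy data are illustrative assumptions; an IoU of exactly 1 is equivalent to the two non-empty sets containing the same data.

```python
def iou(a, b):
    """Intersection-over-union of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reserved_sets(reduced_category, data_sets):
    """Keep the data sets whose IoU with the dimension-reduction category is 1."""
    return [ds for ds in data_sets if iou(reduced_category, ds) == 1.0]


reduced_category = {1, 2, 3}
data_sets = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4}]
kept = reserved_sets(reduced_category, data_sets)
print(kept)
```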
Step S300: obtain the optimal K value according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction with different K values.
For any K value, the sum of the adjacencies of the data in the reserved sets of all corresponding matching category pairs is taken as the retention of that K value; the retention corresponding to each K value is obtained in this way.
The K value with the maximum retention is taken as the optimal K value.
Step S400: compress and store the dimension-reduction data corresponding to the optimal K value.
The dimension-reduction data obtained by dimension reduction with the optimal K value are taken as the final dimension-reduction data and are stored with ZIP compression.
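Selecting the optimal K value reduces to a sum followed by an argmax, as the following Python sketch shows. The input layout (a mapping from each candidate K to the list of adjacencies of the data in its reserved sets), the function name `optimal_k`, and the sample numbers are assumptions for illustration only.

```python
def optimal_k(adjacencies_by_k):
    """Return (best_k, retention) where retention[k] is the adjacency sum.

    adjacencies_by_k: {K: [adjacency of each datum in the reserved sets of
        all matching category pairs obtained with that K]}.
    """
    if not adjacencies_by_k:
        raise ValueError("at least one candidate K value is required")
    retention = {k: sum(vals) for k, vals in adjacencies_by_k.items()}
    # On ties, max() keeps the first K encountered in the mapping.
    best_k = max(retention, key=retention.get)
    return best_k, retention


candidates = {5: [0.5, 0.25], 8: [0.5, 0.5, 0.25], 12: [0.25]}
best_k, retention = optimal_k(candidates)
print(best_k, retention)
```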
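A minimal sketch of the ZIP storage step, using only the Python standard library, is given below. The patent does not specify a serialization format, so writing the dimension-reduction data as CSV rows inside the archive, the member name `reduced.csv`, and the helper names `zip_reduced_data` and `unzip_reduced_data` are all assumptions.

```python
import csv
import io
import zipfile


def zip_reduced_data(rows, member_name="reduced.csv"):
    """Serialize rows of floats as CSV and return the ZIP archive as bytes."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, text.getvalue())
    return buf.getvalue()


def unzip_reduced_data(blob, member_name="reduced.csv"):
    """Read the CSV member back into a list of rows of floats."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        text = zf.read(member_name).decode("utf-8")
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text))]


# Toy two-dimensional dimension-reduction data.
rows = [[0.5, -1.25], [2.0, 3.75]]
blob = zip_reduced_data(rows)
restored = unzip_reduced_data(blob)
print(len(blob), restored)
```

The round trip through `unzip_reduced_data` shows that the ZIP step itself is lossless; any loss of information comes from the dimension reduction, not from the compression.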
The invention provides a method for optimized storage of e-commerce data based on manifold learning. By determining which K value yields dimension-reduction data that preserve the adjacency relationships among the elements of the e-commerce data to the greatest extent, and performing ISOMAP dimension reduction with that K value, the dimension-reduction data retain the shape of the e-commerce data as far as possible and the dimension-reduction effect is improved.
In summary, the present invention relates to the technical field of e-commerce data processing. First, e-commerce data are acquired; based on any K value, the dimension of the e-commerce data is reduced by using an isometric mapping (ISOMAP) algorithm to obtain dimension-reduction data; the e-commerce data and the dimension-reduction data are clustered separately to obtain original categories and dimension-reduction categories; the original categories are matched with the dimension-reduction categories to obtain matching category pairs; the intersection of the original category and the dimension-reduction category in the same matching category pair is iteratively updated by a data removal method, and the adjacency of the data in the intersection of the matching category pair at each iterative update is obtained from the proportion of the data in that intersection; a data set corresponding to each intersection is obtained from the intersection of each iterative update; a reserved set of each matching category pair is obtained from the overlap between the dimension-reduction category and the data sets corresponding to the iteratively updated intersections of the pair; the optimal K value is obtained according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction with different K values; and the dimension-reduction data corresponding to the optimal K value are compressed and stored. The invention reduces the use of storage space while retaining the key information of the data, and improves the efficiency of data queries.
It should be noted that the order of the embodiments of the present invention is for description only and does not indicate their relative merits. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments.
Claims (7)
1. A method for optimized storage of e-commerce data based on manifold learning, characterized by comprising the following steps:
acquiring e-commerce data;
based on any K value, reducing the dimension of the e-commerce data by using an isometric mapping algorithm to obtain dimension-reduction data; clustering the e-commerce data and the dimension-reduction data separately to obtain original categories and dimension-reduction categories; matching the original categories with the dimension-reduction categories to obtain matching category pairs; iteratively updating the intersection of the original category and the dimension-reduction category in the same matching category pair by a data removal method, and obtaining the adjacency of the data in the intersection of the matching category pair at each iterative update from the proportion of the data in the intersection obtained by that iterative update; obtaining a data set corresponding to each intersection from the intersection of each iterative update; obtaining a reserved set of each matching category pair from the overlap between the dimension-reduction category and the data sets corresponding to the iteratively updated intersections of the matching category pair;
obtaining an optimal K value according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction with different K values;
compressing and storing the dimension-reduction data corresponding to the optimal K value;
wherein the data removal method comprises:
for the matching category pairs obtained under any K value, taking the intersection of the original category and the dimension-reduction category in each matching category pair as the initial intersection of that matching category pair;
taking the intersection of the initial intersections of all matching category pairs as a first intersection; deleting the data belonging to the first intersection from each initial intersection to obtain a first to-be-deleted intersection of each matching category pair;
computing the sum of the occurrence frequencies of each datum over all first to-be-deleted intersections and recording it as a first frequency, deleting the data with the maximum first frequency from all first to-be-deleted intersections to obtain second to-be-deleted intersections, the set of deleted data being called a second intersection; computing the sum of the occurrence frequencies of each datum over all second to-be-deleted intersections and recording it as a second frequency, deleting the data with the maximum second frequency from all second to-be-deleted intersections to obtain third to-be-deleted intersections, the set of deleted data being called a third intersection; and so on, continuing to update the to-be-deleted intersections and the intersections; and stopping the iteration when the sum of the occurrence frequencies of each datum in the latest to-be-deleted intersections is smaller than a preset data quantity.
2. The method for optimized storage of e-commerce data based on manifold learning according to claim 1, wherein obtaining the adjacency of the data in the intersection of the matching category pair at each iterative update from the proportion of the data in the intersection obtained by that iterative update comprises:
for the matching category pairs obtained under any K value, selecting any iterative update as a target iterative update, and taking the ratio of the number of occurrences of a datum in all intersections obtained by the target iterative update to the number of data in the initial intersection as the adjacency of that datum in the intersection of the matching category pair obtained by the target iterative update.
3. The method for optimized storage of e-commerce data based on manifold learning according to claim 1, wherein obtaining the data set corresponding to each intersection from the intersection of each iterative update comprises:
taking the first intersection as a first data set corresponding to the first intersection; adding the data in the second intersection to the first data set to obtain a second data set corresponding to the second intersection; adding the data in the third intersection to the second data set to obtain a third data set corresponding to the third intersection; and so on, obtaining a data set corresponding to each intersection.
4. The method for optimized storage of e-commerce data based on manifold learning according to claim 1, wherein clustering the e-commerce data and the dimension-reduction data separately to obtain original categories and dimension-reduction categories comprises:
for the e-commerce data, taking a negatively correlated normalized value of the similarity of any two e-commerce data as the distance feature of the two e-commerce data; clustering the e-commerce data according to the distance features to obtain the original categories;
for the dimension-reduction data, taking a negatively correlated normalized value of the similarity of any two dimension-reduction data as the distance feature of the two dimension-reduction data; and clustering the dimension-reduction data according to the distance features to obtain the dimension-reduction categories.
5. The method for optimized storage of e-commerce data based on manifold learning according to claim 1, wherein matching the original categories with the dimension-reduction categories to obtain matching category pairs comprises:
matching the original categories with the dimension-reduction categories by using an optimal matching algorithm to obtain the matching category pairs, wherein each matching category pair includes one original category and one dimension-reduction category.
6. The method for optimized storage of e-commerce data based on manifold learning according to claim 1, wherein obtaining the reserved set of each matching category pair from the overlap between the dimension-reduction category and the data sets corresponding to the iteratively updated intersections of the matching category pair comprises:
for the matching category pairs obtained under any K value, when the data in the dimension-reduction category of a matching category pair are identical to the data in the data set of an iteratively updated intersection, taking that data set as a reserved set of the matching category pair.
7. The method for optimized storage of e-commerce data based on manifold learning according to claim 1, wherein obtaining the optimal K value according to the adjacency of the data in the reserved sets of the matching category pairs obtained by dimension reduction with different K values comprises:
for any K value, taking the sum of the adjacencies of the data in the reserved sets of all corresponding matching category pairs as the retention of that K value;
and taking the K value with the maximum retention as the optimal K value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310927504.7A CN116679888B (en) | 2023-07-27 | 2023-07-27 | E-commerce data optimized storage method based on manifold learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116679888A (en) | 2023-09-01 |
CN116679888B (en) | 2023-10-10 |
Family
ID=87779440
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310927504.7A Active CN116679888B (en) | 2023-07-27 | 2023-07-27 | E-commerce data optimized storage method based on manifold learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116679888B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461339A (en) * | 2020-03-09 | 2020-07-28 | 中国人民解放军陆军工程大学 | Manifold dimension reduction method based on optimal density direction |
CN112016581A (en) * | 2019-05-31 | 2020-12-01 | 北京京东尚科信息技术有限公司 | Multidimensional data processing method and device, computer equipment and storage medium |
CN114595741A (en) * | 2022-01-17 | 2022-06-07 | 中国人民解放军国防科技大学 | High-dimensional data rapid dimension reduction method and system based on neighborhood relationship |
CN116304768A (en) * | 2023-03-01 | 2023-06-23 | 桂林理工大学 | High-dimensional density peak clustering method based on improved equidistant mapping |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210256538A1 (en) * | 2020-02-14 | 2021-08-19 | Actimize Ltd. | Computer Methods and Systems for Dimensionality Reduction in Conjunction with Spectral Clustering of Financial or Other Data |
Also Published As
Publication number | Publication date |
---|---|
CN116679888A (en) | 2023-09-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||