CN113434471A - Data processing method, device, equipment and computer storage medium - Google Patents

Publication number
CN113434471A
Authority
CN
China
Prior art keywords
data
target
data group
central value
group
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110708164.XA
Other languages
Chinese (zh)
Inventor
余玉霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202110708164.XA
Publication of CN113434471A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of computer data processing and provides a data processing method, apparatus, device and computer-readable storage medium. The method comprises: acquiring target data to be processed, wherein the target data comprises a plurality of data items; classifying the plurality of data items into at least one data group based on a clustering algorithm model; determining a current central value of each data group according to the data items currently contained in each data group; acquiring a historical central value of each data group, and determining a target data group according to the current central value and the historical central value of each data group; and storing the data items in the target data group to a target storage address. The method reduces the storage space occupied by changing data while retaining historical data. The application also relates to blockchain technology: the target data group may be stored in a blockchain.

Description

Data processing method, device, equipment and computer storage medium
Technical Field
The present application relates to the field of computer data processing technologies, and in particular, to a data processing method, an apparatus, a device, and a computer-readable storage medium.
Background
With the explosive growth of data and ever faster data transmission, data volumes keep increasing, and so does the amount of data that must be stored. Storing a full copy of the data at regular time intervals occupies a large amount of storage space, while deleting historical data destroys the integrity of the data and leaves users unable to retrieve historical data when they want to look it up. For data that changes at regular time intervals, most of the change is slight. For example, the label values calculated in a user portrait differ between time periods, but most of them differ only a little. Such data therefore faces two problems at once: full-volume storage occupies a large amount of space, and the historical data has value and cannot simply be deleted.
Disclosure of Invention
The present application mainly aims to provide a data processing method, an apparatus, a device and a computer readable storage medium, which aim to save the storage space required by data storage.
In a first aspect, the present application provides a data processing method, including the steps of:
acquiring target data to be processed, wherein the target data comprises a plurality of data items;
classifying the plurality of data items into at least one data group based on a clustering algorithm model;
determining a current central value of each data group according to the data items currently contained in each data group;
acquiring a historical central value of each data group, and determining a target data group according to the current central value and the historical central value of each data group;
and storing the data items in the target data group to a target storage address.
In a second aspect, the present application further provides a data processing apparatus, comprising:
the data acquisition module is used for acquiring target data to be processed, and the target data comprises a plurality of data items;
a data item classification module for classifying the plurality of data items into at least one data group based on a clustering algorithm model;
the central value determining module is used for determining the current central value of each data group according to the data item currently contained in each data group;
the target data group determining module is used for acquiring a historical central value of each data group and determining a target data group according to the current central value and the historical central value of each data group;
and the data item storage module is used for storing the data items in the target data group to a target storage address.
In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the data processing method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the data processing method as described above.
The application provides a data processing method, a device, equipment and a computer readable storage medium, and the data processing method comprises the steps of obtaining target data to be processed, wherein the target data comprises a plurality of data items; classifying the plurality of data items into at least one data group based on a clustering algorithm model; determining a current central value of each data group according to the data items currently contained in each data group; acquiring a historical central value of each data group, and determining a target data group according to the current central value and the historical central value of each data group; and storing the data items in the target data group to a target storage address. Because only part of data items in the target data are stored, the storage pressure of the server is reduced and more storage space is released while historical data and data continuity are kept.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
Fig. 2 is a usage scenario diagram of a data processing method according to an embodiment of the present application;
Fig. 3 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 4 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The embodiment of the application provides a data processing method, a data processing device, computer equipment and a computer readable storage medium. The data processing method can be applied to terminal equipment, and the terminal equipment can be electronic equipment such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and wearable equipment.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application.
As shown in fig. 1, the data processing method includes steps S101 to S105.
Step S101, target data to be processed is obtained, and the target data comprises a plurality of data items.
For example, the target data to be processed may be data generated by operation in a certain hard disk of the computer within a preset time period, or may be original data in a certain hard disk of the computer or data in a certain subfolder.
For example, target data to be processed may be obtained in a Hadoop Distributed File System (HDFS), and it can be understood that data in the whole Hadoop Distributed File System may be obtained as the target data to be processed, and data of a certain subfile of the Hadoop Distributed File System may also be obtained as the target data to be processed.
For example, the target data to be processed may be obtained from a storage address where the target data is stored through a target data obtaining request, where the target data obtaining request may include the storage address where the target data is stored, and may also include an address identifier for identifying the storage address where the target data is stored, so as to obtain the storage address where the target data is stored according to a mapping relationship between the address identifier and the storage address where the target data is stored. The storage address of the stored target data is used for indicating the storage position of the target data in the computer, such as a subfolder in a folder on a D disk of the computer; storage locations in a Hadoop distributed file system may also be indicated; the computer knows the storage position of the target data which needs to be acquired currently in the computer according to the storage address of the stored target data so as to extract the required target data.
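The address lookup described above can be sketched as follows. This is a minimal illustration only: the request field names, the `ADDRESS_MAP` table, and the example paths are hypothetical and do not come from the application.

```python
# Hypothetical mapping from address identifiers to storage addresses.
ADDRESS_MAP = {
    "user_portrait": "hdfs:///data/user_portrait",  # assumed HDFS location
    "local_backup": "D:/data/backup",               # assumed local subfolder
}


def resolve_storage_address(request: dict) -> str:
    """Return the storage address named by a target data acquisition request.

    The request either carries the storage address directly or carries an
    address identifier that is looked up in the mapping table.
    """
    if "storage_address" in request:
        return request["storage_address"]
    return ADDRESS_MAP[request["address_id"]]


direct = resolve_storage_address({"storage_address": "D:/data/raw"})
mapped = resolve_storage_address({"address_id": "user_portrait"})
```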
Fig. 2 is a usage scenario diagram provided in an embodiment of the present application. As shown in fig. 2, the target data to be processed may also be obtained from a terminal device used by a user. For example, the terminal device of the user records that the user has modified data, and sends the target data to be processed to a server in response to a target data obtaining instruction from the server.
For example, the target data to be processed may include a plurality of data items, such as the hobbies, habits, and identity information in a user portrait.
For example, the target data to be processed is data that needs to be stored to the target storage address, and the target data is processed to reduce the storage space of the target storage address occupied by the target data.
Step S102, classifying the plurality of data items into at least one data group based on a clustering algorithm model.
Illustratively, the clustering algorithm model may be a k-means clustering algorithm model that classifies a plurality of data items into at least one data group, and it is understood that a data group may include a plurality of data items, wherein the number of data groups is not greater than the number of data items, such that at least one data item may be classified into each data group.
For example, the data group into which each data item should be classified may be determined by the relationship between the data item and each data group. This relationship may be characterized by the Euclidean distance: the Euclidean distance between the data item and each data group is calculated to determine the data group into which the data item should be classified.
In some embodiments, the method further comprises: determining the number of data groups according to the target data to be processed; and classifying the plurality of data items into the determined number of data groups based on the clustering algorithm model.
Illustratively, the number of data groups is determined according to the acquired target data to be processed, and it can be understood that the data groups are used for classifying a plurality of data items in the target data to be processed, and classifying related or similar data items into the same data group, so as to uniformly change storage addresses or perform data compression and the like.
In some embodiments, determining the number of data groups according to the target data to be processed comprises: calculating the size of the target data to obtain the storage space occupation information of the target data; determining compression degree information of the target data according to preset occupation information and the storage space occupation information of the target data; and determining the number of data groups according to the compression degree information.
For example, the preset occupancy information is used to indicate space information occupied by target data that a user desires to process, and the preset occupancy information may be obtained through an input operation of the user.
Illustratively, the size of the target data to be processed is calculated to obtain the storage space occupation information of the target data, and the compression degree information of the target data (referred to below as reduction degree information) is determined according to the storage space occupation information and the preset occupation information. For example, if the calculated storage space occupation information of the target data is 50 GB and the preset occupation information is 30 GB, the target data needs to be reduced so that it occupies less storage space. In this case, the reduction degree information determined from the storage space occupation information and the preset occupation information may be, for example, a reduction of 20 GB. These sizes and the reduction amount are merely examples; other situations may exist and are not limited herein.
Illustratively, the number of data groups is determined based on the reduction degree information, and it is understood that the reduction degree indicated by the reduction degree information is linearly inversely related to the number of data groups, i.e., the larger the reduction degree of the target data is, the smaller the number of data groups is determined. The smaller the reduction degree of the target data is, the larger the number of the determined data groups is.
For example, the data items in the target data are stored in groups by the data groups, that is, continuous data are discretized, and the storage space occupied by the target data can be adjusted by adjusting the degree of discretization, that is, adjusting the number of the data groups.
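As a rough sketch of the two steps above, the reduction amount can be computed as the difference between the occupied and preset sizes, and the number of data groups can then be derived from it. The application only states that the two are linearly inversely related; the specific linear rule, the `max_groups` and `min_groups` bounds, and the function names below are assumptions made for illustration.

```python
def reduction_needed(current_gb: float, preset_gb: float) -> float:
    """Amount by which the target data must shrink (0 if it already fits)."""
    return max(0.0, current_gb - preset_gb)


def number_of_groups(current_gb: float, preset_gb: float,
                     max_groups: int, min_groups: int = 1) -> int:
    """Hypothetical linear rule: the larger the reduction, the fewer groups.

    No reduction maps to max_groups; a reduction of 100% maps to min_groups.
    """
    if current_gb <= 0:
        raise ValueError("current_gb must be positive")
    ratio = reduction_needed(current_gb, preset_gb) / current_gb
    k = round(max_groups - ratio * (max_groups - min_groups))
    return max(min_groups, min(max_groups, k))


# Example from the text: 50 GB occupied, 30 GB preset.
needed = reduction_needed(50, 30)
groups_for_example = number_of_groups(50, 30, max_groups=10)
```

With this rule, a stricter preset size (for instance 10 GB instead of 30 GB) yields fewer groups, which matches the direction of the relation described above.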
In some embodiments, classifying the plurality of data items into the determined number of data groups comprises: calculating the Euclidean distance from each data item to each data group; and taking each data item in turn as a target data item, and classifying the target data item into the data group with the minimum Euclidean distance to the target data item.
Illustratively, each data item is sequentially used as a target data item, the Euclidean distance from the target data item to each data group is calculated, the Euclidean distance between the target data item and each data group is compared, the data group with the minimum Euclidean distance value is determined in each data group, and the target data item is classified into the data group corresponding to the minimum Euclidean distance value.
For example, if the Euclidean distance from a data item to data group A is 4, to data group B is 6, and to data group C is 5, the minimum Euclidean distance is 4 and corresponds to data group A, so the data item is classified into data group A.
Illustratively, each data item in the target data is subjected to a classification process as described above to classify the data items into a determined number of respective data groups.
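The selection step in the example above amounts to taking the group with the smallest distance. A minimal sketch, assuming the distances have already been computed and are held in a dictionary keyed by group name (the function name is illustrative):

```python
def nearest_group(distances: dict) -> str:
    """Return the name of the data group with the minimum distance."""
    return min(distances, key=distances.get)


# Distances from the example: A = 4, B = 6, C = 5.
chosen = nearest_group({"A": 4, "B": 6, "C": 5})
```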
In some embodiments, calculating the Euclidean distance from each data item to each data group comprises: randomly assigning a different random center value to each data group; and calculating the Euclidean distance between each data item and each data group according to the random center value of each data group and a preset calculation formula.
Illustratively, each of the determined number of data groups is assigned a different random center value; for example, data group A is assigned a random center value of 3, data group B a random center value of 6, and data group C a random center value of 9.
For example, assigning a random center value to each data group allows the plurality of data items to be classified into different data groups. The smaller the number of data groups, the larger the difference between the random center values of the data groups and the more strongly the data items are discretized; more storage space can be released after each data group is processed, but the granularity of data change that a user can view is coarser. The larger the number of data groups, the smaller the difference between the random center values and the weaker the discretization; the granularity of data change that the user can view is finer, and small changes in the data items can be monitored.
Illustratively, according to the random center value of each data set, the Euclidean distance between each data item and each data set is calculated according to a preset calculation formula so as to determine the data set of each data item classification.
For example, the preset calculation formula may be as follows:
$$\mathrm{dis}(X_i, C_j) = \sqrt{\sum_{t=1}^{m} \left(X_{it} - C_{jt}\right)^2}$$
where dis denotes the distance, $X_i$ denotes the i-th data item, $C_j$ denotes the j-th data group, m denotes the number of attribute dimensions of the data items, $X_{it}$ denotes the t-th attribute value of the i-th data item, $C_{jt}$ denotes the t-th attribute value of the random center value corresponding to the j-th data group, and $1 \le t \le m$.
Illustratively, after the euclidean distance between the data item and each data set is obtained through the above formula calculation, the data item is classified into the corresponding data set with the smallest euclidean distance with the data item.
Illustratively, by calculating the euclidean distance of a data item from each data set to classify multiple data items, different data items can be efficiently classified, and the processing rate for retaining the desired data items can be increased.
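The distance-based classification described above can be sketched as follows. The random center values 3, 6 and 9 are taken from the example in the text; the sample data items and the function names are assumptions, and each data item is represented as a tuple of attribute values.

```python
import math


def euclidean(item, center):
    """Euclidean distance between a data item and a group's center value."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(item, center)))


def classify(items, centers):
    """Place every data item into the group whose center is nearest."""
    groups = {name: [] for name in centers}
    for item in items:
        name = min(centers, key=lambda n: euclidean(item, centers[n]))
        groups[name].append(item)
    return groups


# Random center values from the example, as one-dimensional tuples.
centers = {"A": (3.0,), "B": (6.0,), "C": (9.0,)}
# Illustrative data items (not from the application).
items = [(2.5,), (4.0,), (5.5,), (8.0,), (10.0,)]
groups = classify(items, centers)
```

Ties are resolved in favor of the group listed first, which is a choice of this sketch rather than something the text specifies.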
Step S103, determining the current central value of each data group according to the data items currently contained in each data group.
For example, after classifying the data items into data groups, each data group may calculate a current center value of the data group according to all data items currently contained in the group.
For example, the current center value of the data set is used to characterize the mean value of all data items currently contained in the data set in each dimension, and the current center value of the data set can be calculated by the following formula:
$$C_I = \frac{1}{S_I} \sum_{i=1}^{S_I} X_i$$
where $C_I$ denotes the current center value of the I-th data group, $S_I$ denotes the number of data items in the I-th data group, and $X_i$ denotes the i-th data item in the I-th data group, the mean being taken in each dimension.
For example, the current central value of the data group in which the data item is located may be obtained by calculating the average value of all data items currently contained in the data group in each dimension.
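A minimal sketch of this per-dimension mean, assuming each data item is a tuple of numeric attribute values of equal length (the function name and sample values are illustrative):

```python
def center_value(items):
    """Mean of the data items of one group, computed in each dimension."""
    if not items:
        raise ValueError("a data group must contain at least one data item")
    dims = len(items[0])
    return tuple(sum(item[t] for item in items) / len(items) for t in range(dims))


center = center_value([(1.0, 2.0), (3.0, 6.0)])
```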
In some embodiments, the method further comprises: and updating the random center value of each data group according to the current center value of each data group.
For example, when each data item is subjected to euclidean distance calculation with each data group, a random center value is randomly assigned to each data group, and after the data items are classified, a current center value of a corresponding data group may be calculated according to all data items currently included in each data group, so as to update the random center value of the data group.
For example, after all data items are classified, each data group includes at least one data item, and the random center value randomly assigned to the data group is updated to the current center value calculated from the attribute value of the data item currently included in the data group.
For example, the random center value of a data group may be updated, using the above formula and method for calculating the center value, according to the attribute values of the data items currently included in the data group, so as to obtain the current center value of the data group. For example, if data group A is randomly assigned a center value of 3 and the current center value calculated from the attribute values of the data items currently included in the group is 3.6, the random center value of data group A is updated to 3.6.
Illustratively, the accuracy of determining the target data set may be improved by updating the random center value of the data set with the data items currently included in the data set.
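The update step can be sketched as replacing each group's random center value with the mean of its current members. One-dimensional values are used to mirror the 3 to 3.6 example; the member values and the rule that an empty group keeps its old center are assumptions of this sketch.

```python
def update_centers(groups, centers):
    """Replace each random center value with the mean of the group's items.

    Assumption: a group with no members keeps its previous center value.
    """
    updated = {}
    for name, old_center in centers.items():
        members = groups.get(name, [])
        updated[name] = sum(members) / len(members) if members else old_center
    return updated


random_centers = {"A": 3.0, "B": 6.0}
members = {"A": [3.2, 4.0], "B": []}  # illustrative members of each group
current_centers = update_centers(members, random_centers)
```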
Step S104, acquiring a historical center value of each data group, and determining a target data group according to the current center value and the historical center value of each data group.
For example, each data group has a corresponding historical center value. The center value of each data group is recorded every preset time period, and the historical center value is the center value recorded in the last preset time period. By comparing the current center value of a data group with its historical center value, it can be determined whether a data item in the data group has changed.
For example, when the current center value is compared with the historical center value, the current center value and the historical center value corresponding to the data set may be compared, or the current center value and the historical center value may be sorted and then compared.
For example, suppose the current center value of data group A is 3.6 and its historical center value is 3.5, while the current center value of data group B is 8 and its historical center value is 8. Comparing the current and historical center values of each data group shows that a data item in data group A has changed and that the data items in data group B have not changed. It can be understood that the historical center values of data group A and data group B may also be sorted first and then compared with the sorted current center values of data group A and data group B.
For example, if the current center value of the data group is the same as the historical center value of the data group, the data items in the data group are not changed, and the data group is not processed.
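The group-by-group comparison can be sketched as below, using the center values from the example. The function name is illustrative, and treating a group with no recorded historical value as changed is an assumption of this sketch.

```python
def changed_groups(current, historical):
    """Names of the data groups whose current and historical centers differ.

    Assumption: a group missing from the historical record counts as changed.
    """
    return [name for name, value in current.items()
            if historical.get(name) != value]


current = {"A": 3.6, "B": 8}
historical = {"A": 3.5, "B": 8}
targets = changed_groups(current, historical)
```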
In some embodiments, the method further comprises: sorting the current central value of at least one data group according to the size of the current central value of at least one data group to obtain a current central value sequence; and acquiring a historical center value sequence, wherein the historical center value sequence is formed by sequencing corresponding historical center values.
For example, the current center values of the data groups may be sorted to obtain a current center value sequence. If the current center value of data group A is 3.6, that of data group B is 8, and that of data group C is 6, sorting the current center values by size gives the current center value sequence (3.6, 6, 8), which is then compared with the historical center value sequence. It can be understood that the historical center value sequence should likewise be sorted by the size of the historical center values of the data groups. Comparing the two sorted sequences avoids a misjudgment in which data items fall into differently labeled data groups under the two classifications, so that a data group's historical center value differs from its current center value even though the data items themselves have not changed.
Determining a target data set according to the current center value and the historical center value of each data set, including: judging whether the current central value sequence is the same as the historical central value sequence or not; and if the current central value in the current central value sequence is different from the historical central value at the corresponding position in the historical central value sequence, determining the data group corresponding to the different current central value as the target data group.
For example, whether the current central value sequence is the same as the historical central value sequence or not is judged, and if not, at least one data item in the data group is changed. Specifically, if the current central value in the current central value sequence is different from the historical central value at the corresponding position in the historical central value sequence, the data item in the data group corresponding to the different current central value is changed, and the data group can be determined as the target data group.
For example, if the current central value in the second position of the current central value sequence differs from the historical central value in the second position of the historical central value sequence, the data group corresponding to the current central value in the second position of the current central value sequence is determined as the target data group.
For example, the target data set is determined by the current central value sequence and the historical central value sequence, so that the target data set can be determined more accurately, the probability of wrong judgment is reduced, and the data items in the target data set are processed.
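The sequence-based comparison can be sketched as sorting the current center values, pairing them position by position with the sorted historical sequence, and reporting the groups at positions that differ. The current values are those of the example; the historical sequence (3.5, 6, 8) and the function name are assumptions, and the sketch assumes both sequences have the same length.

```python
def target_groups_by_sequence(current, historical_sequence):
    """Compare sorted current centers with a sorted historical sequence.

    Returns the names of the groups whose current center differs from the
    historical center at the same position.
    """
    ordered = sorted(current.items(), key=lambda pair: pair[1])
    return [name
            for (name, value), old in zip(ordered, historical_sequence)
            if value != old]


current = {"A": 3.6, "B": 8, "C": 6}       # sorts to (3.6, 6, 8)
historical_sequence = [3.5, 6, 8]           # assumed, already sorted
targets = target_groups_by_sequence(current, historical_sequence)

# Groups that were merely relabeled between two runs are not reported.
relabeled = target_groups_by_sequence({"X": 8, "Y": 3.5, "Z": 6}, historical_sequence)
```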
Illustratively, the target data group may also be stored in a blockchain, and when the target data group needs to be used, it is obtained by broadcasting to the blockchain. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S105, storing the data items in the target data group to a target storage address.
For example, the data items in the target data group are stored to the target storage address, that is, the changed data items are stored to the target storage address, and the data items that are not changed are not stored or otherwise processed.
The target storage address may be a computer hard disk or a subfolder in the computer hard disk.
Specifically, at intervals, part of the data items may be updated or increased or decreased, if all the data items are saved according to the time nodes, the same unchanged data items are saved again, so that the storage space is repeatedly occupied, and more storage space is occupied by the data.
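Illustratively, storing only the changed data groups can be sketched as follows. The sketch assumes the target storage address is a directory, that each data group is JSON-serializable, and that one file is written per changed group under a per-snapshot subfolder; the function name, file naming, and `snapshot_label` parameter are hypothetical.

```python
import json
from pathlib import Path

def store_target_groups(groups, target_ids, target_dir, snapshot_label):
    """Write only the changed (target) groups under target_dir/snapshot_label.

    groups: dict mapping group id to a JSON-serializable list of data items.
    target_ids: ids of the groups judged to have changed.
    """
    out_dir = Path(target_dir) / snapshot_label
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for group_id in target_ids:
        path = out_dir / f"group_{group_id}.json"
        path.write_text(json.dumps(groups[group_id]), encoding="utf-8")
        written.append(path)
    # Groups not listed in target_ids are left untouched and not re-saved.
    return written
```

Because unchanged groups are skipped, each snapshot folder holds only the data that differs from the previous time node.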
In some embodiments, the method further comprises: acquiring storage space information of the target storage address after the target storage address stores the data items of the target data group; judging, according to the storage space information, whether the storage space occupied by the data items of the target data group conforms to preset occupied space information; if it does not conform, deleting the data items of the target data group from the target storage address and adjusting the number of data groups; classifying the data items into the adjusted number of data groups so as to update the target data group according to the adjusted number of data groups; and storing the data items in the updated target data group to the target storage address.
Illustratively, after the data items in the target data group are stored to the target storage address, the storage space information of the target storage address is acquired and compared with the preset occupied space information to judge whether it conforms, that is, whether the preset occupied space is exceeded. For example, if the preset occupied space is 1G, acquired storage space information not greater than 1G conforms to the preset occupied space information, and acquired storage space information greater than 1G does not.
For example, if the storage space information does not conform to the preset occupied space information, the data items of the target data group are deleted from the target storage address, and the number of data groups is adjusted.
For example, the number of data groups may be determined again according to the target data, the acquired storage space information, and the preset occupied space information. After the number of data groups is determined again, the operations of steps S103 to S105 are performed (not described again here): each data item is classified again into the adjusted number of data groups, and the target data group is updated. It can be understood that the target data group may be different after the number of data groups is adjusted; the data items in the updated target data group are then stored to the target storage address.
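Illustratively, the store, check, delete, and adjust cycle can be sketched as the loop below. The text does not specify how the number of data groups is adjusted, so the adjustment rule is passed in as a hypothetical callback `next_k`; `store_fn` and `delete_fn` are likewise placeholder callbacks, "1G" is taken to mean 1 GiB, and `max_rounds` is an added safeguard.

```python
GIB = 1024 ** 3  # the "1G" example, taken here as 1 GiB (assumption)

def conforms(used_bytes, budget_bytes=GIB):
    """Space conforms when it does not exceed the preset occupied space."""
    return used_bytes <= budget_bytes

def store_with_budget(k, store_fn, delete_fn, next_k,
                      budget_bytes=GIB, max_rounds=10):
    """Repeat grouping and storing until the stored data fits the budget.

    store_fn(k): groups the data into k groups, stores the target groups,
                 and returns the bytes now occupied (hypothetical callback).
    delete_fn(): deletes the stored target-group data items.
    next_k(k, used, budget): returns the adjusted number of groups.
    """
    for _ in range(max_rounds):
        used = store_fn(k)
        if conforms(used, budget_bytes):
            return k, used
        delete_fn()                         # remove the oversized result
        k = next_k(k, used, budget_bytes)   # adjust the number of groups
    raise RuntimeError("no group count met the budget within max_rounds")
```

Passing the adjustment rule as a callback keeps the sketch neutral about whether the number of data groups should grow or shrink in a given deployment.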
In other embodiments, after the plurality of data items are classified into the data groups, it is determined whether the data granularity of each data group meets a target data granularity, where the data granularity indicates the amount of information of the data items in the data group; if the target data granularity is not met, the number of data groups is adjusted and the data items are classified into the adjusted number of data groups.
Illustratively, after the classification of the data items is completed, the data granularity of each data group is calculated. The data granularity indicates the amount of information of the data items in the data group: the greater the amount of information, the smaller the data granularity and the more detailed the data changes the user can obtain, but the larger the storage space occupied when storing. If the data granularity of a data group does not meet the target data granularity, the number of data groups is adjusted according to the target data, the data granularity of the data group, and the target data granularity, and the data items are classified into the adjusted number of data groups so as to adjust the data granularity of each data group.
For example, if the data granularity of the data groups meets the target data granularity, the operations of steps S103 to S105 continue (not described again here). It can be understood that the target data granularity may be preset or may be generated automatically according to the size of the storage space.
Adjusting the number of data groups by comparing the data granularity of the data groups with the target data granularity makes the method easier to operate: the granularity of data change can be adjusted through a simple operation, which in turn adjusts both how closely data changes are monitored and how much space the target data occupies when stored.
In the data processing method provided in the above embodiments, target data to be processed is obtained, where the target data includes a plurality of data items; the plurality of data items are classified into at least one data group based on a clustering algorithm model; a current central value of each data group is determined according to the data items currently contained in each data group; a historical central value of each data group is acquired, and a target data group is determined according to the current central value and the historical central value of each data group; and the data items in the target data group are stored to a target storage address. The method effectively reduces the storage pressure on the server while retaining historical data, frees more storage space, and allows the granularity of target data changes during storage to be adjusted conveniently.
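Illustratively, one pass through the steps summarized above can be sketched end to end for one-dimensional numeric data items. The sketch assumes that the central value of a group is the arithmetic mean of its items, that group centers are supplied by the caller, and that central values are compared exactly; these are illustrative assumptions, and the function names are hypothetical.

```python
from statistics import mean

def classify(items, centers):
    """Assign each scalar item to the group with the nearest center."""
    groups = [[] for _ in centers]
    for x in items:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        groups[nearest].append(x)
    return groups

def process(items, centers, historical_sequence):
    """Return (changed groups, sorted current central values)."""
    groups = classify(items, centers)
    # Current central value: mean of the group (assumption); an empty
    # group keeps the center it was given.
    current = [mean(g) if g else c for g, c in zip(groups, centers)]
    # Order group indices by current central value, then compare
    # position by position with the sorted historical sequence.
    order = sorted(range(len(groups)), key=lambda i: current[i])
    targets = [groups[i] for pos, i in enumerate(order)
               if current[i] != historical_sequence[pos]]
    return targets, sorted(current)
```

The second return value can be kept as the historical central value sequence for the next time node, so that only groups changed since then are stored.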
Referring to fig. 3, fig. 3 is a schematic diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus may be configured in a server or a terminal for executing the foregoing data processing method.
As shown in fig. 3, the data processing apparatus includes: a data acquisition module 110, a data item classification module 120, a central value determining module 130, a target data group determining module 140, and a data item storage module 150.
a data acquisition module 110, configured to obtain target data to be processed, where the target data includes a plurality of data items;
a data item classification module 120 for classifying the plurality of data items into at least one data group based on a clustering algorithm model;
a central value determining module 130, configured to determine a current central value of each data group according to a data item currently included in each data group;
a target data set determining module 140, configured to obtain a historical center value of each data set, and determine a target data set according to the current center value and the historical center value of each data set;
a data item storage module 150, configured to store the data items in the target data group to the target storage address.
The data processing apparatus further comprises a current central value sequence determining module and a historical central value sequence acquisition module.
The current central value sequence determining module is configured to sort the current central values of the at least one data group by magnitude to obtain a current central value sequence.
The historical central value sequence acquisition module is configured to acquire a historical central value sequence, which is formed by sorting the historical central values corresponding to the at least one data group.
The target data group determining module 140 is further configured to determine whether the current central value sequence is the same as the historical central value sequence; and if a current central value in the current central value sequence is different from the historical central value at the corresponding position in the historical central value sequence, to determine the data group corresponding to the differing current central value as the target data group.
Illustratively, the data processing apparatus further comprises a data group quantity determining module.
The data group quantity determining module is configured to determine the number of data groups according to the target data to be processed.
The data item classification module 120 is further configured to classify the plurality of data items into the determined number of data groups based on the clustering algorithm model.
Illustratively, the data group quantity determining module further comprises a space occupation information determining submodule, a compression degree information determining submodule, and a quantity determining submodule.
The space occupation information determining submodule is configured to calculate the size of the target data to obtain the storage space occupation information of the target data.
The compression degree information determining submodule is configured to determine the compression degree information of the target data according to preset occupation information and the storage space occupation information of the target data.
The quantity determining submodule is configured to determine the number of data groups according to the compression degree information.
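Illustratively, the chain from data size to compression degree to number of data groups can be sketched as follows. This passage does not give the formulas, so the ratio of preset space to actual size and the rule "number of groups = ceiling of the reciprocal of the ratio", as well as the `min_groups` and `max_groups` bounds, are purely hypothetical placeholders.

```python
import math

def compression_degree(data_bytes, preset_bytes):
    """Ratio of the preset occupied space to the actual data size.

    A value below 1 means the data must shrink to fit (assumed definition).
    """
    if data_bytes <= 0 or preset_bytes <= 0:
        raise ValueError("sizes must be positive")
    return preset_bytes / data_bytes

def group_count(degree, min_groups=1, max_groups=64):
    """Map a compression degree to a number of groups (assumed rule)."""
    k = math.ceil(1 / degree)
    # Clamp to a sensible range so extreme ratios stay usable.
    return max(min_groups, min(max_groups, k))
```

Under these assumed rules, data four times larger than the preset space yields a compression degree of 0.25 and four data groups, while data that already fits yields a single group.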
Illustratively, the target data group determining module 140 further includes a Euclidean distance calculation submodule and a data item classification submodule.
The Euclidean distance calculation submodule is configured to calculate the Euclidean distance between each data item and each data group.
The data item classification submodule is configured to take each data item in turn as a target data item and classify the target data item into the data group with the minimum Euclidean distance from the target data item.
Illustratively, the Euclidean distance calculation submodule further comprises a random central value determining submodule and a distance calculation submodule.
The random central value determining submodule is configured to randomly assign a different random central value to each data group.
The distance calculation submodule is configured to calculate the Euclidean distance between each data item and each data group according to a preset calculation formula and the random central value of each data group.
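Illustratively, random central value assignment followed by nearest-group classification can be sketched as follows. The sketch assumes data items are numeric tuples of equal length, draws the random central values from the data items themselves (a common k-means style initialization, not stated in the text), and takes the standard Euclidean distance as the preset calculation formula.

```python
import math
import random

def init_random_centers(items, k, seed=None):
    """Pick k items at distinct positions as initial centers (assumption).

    Centers are distinct values as long as the items themselves are distinct.
    """
    rng = random.Random(seed)
    return rng.sample(items, k)

def assign(items, centers):
    """Classify each item into the group whose center is nearest."""
    groups = [[] for _ in centers]
    for item in items:
        # math.dist computes the Euclidean distance between two points.
        nearest = min(range(len(centers)),
                      key=lambda j: math.dist(item, centers[j]))
        groups[nearest].append(item)
    return groups
```

For items `[(0, 0), (1, 0), (10, 10), (11, 10)]` and centers `[(0, 0), (10, 10)]`, the first two items fall into the first group and the last two into the second.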
Illustratively, the data processing apparatus further comprises a storage space information acquisition submodule, a storage space judgment submodule, a data group quantity adjusting submodule, and a target data group updating submodule.
The storage space information acquisition submodule is configured to acquire the storage space information of the target storage address after the data items of the target data group are stored.
The storage space judgment submodule is configured to judge, according to the storage space information, whether the storage space occupied by the data items of the target data group conforms to the preset occupied space information.
The data group quantity adjusting submodule is configured to delete the data items of the target data group from the target storage address and adjust the number of data groups if the storage space is judged not to conform.
The target data group updating submodule is configured to classify the data items into the adjusted number of data groups so as to update the target data group according to the adjusted number of data groups.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus, the modules and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The methods, apparatus, and devices of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above-described methods and apparatuses may be implemented, for example, in the form of a computer program that can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
As shown in fig. 4, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a storage medium and an internal memory.
The storage medium may store an operating system and a computer program. The computer program comprises program instructions which, when executed, cause a processor to perform any of the data processing methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a storage medium, which when executed by a processor causes the processor to perform any of the data processing methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to execute a computer program stored in the memory to implement the following steps:
acquiring target data to be processed, wherein the target data comprises a plurality of data items;
classifying the plurality of data items into at least one data group based on a clustering algorithm model;
determining a current central value of each data group according to the data items currently contained in each data group;
acquiring a historical central value of each data group, and determining a target data group according to the current central value and the historical central value of each data group;
and storing the data items in the target data group to a target storage address.
In one embodiment, the processor, when implementing the data processing method, is configured to implement:
sorting the current central value of at least one data group according to the size of the current central value of at least one data group to obtain a current central value sequence;
obtaining a historical central value sequence, wherein the historical central value sequence is formed by sorting the historical central values corresponding to the at least one data group;
when determining the target data group according to the current central value and the historical central value of each data group, the processor is configured to implement:
judging whether the current central value sequence is the same as the historical central value sequence or not;
and if the current central value in the current central value sequence is different from the historical central value at the corresponding position in the historical central value sequence, determining the data group corresponding to the different current central value as the target data group.
In one embodiment, the processor, when implementing the data processing method, is configured to implement:
determining the number of data groups according to the target data to be processed;
when classifying the plurality of data items into at least one data group based on the clustering algorithm model, the processor is configured to implement:
classifying the plurality of data items into the determined number of data groups based on the clustering algorithm model.
In one embodiment, the processor, when determining the number of data groups according to the target data to be processed, is configured to implement:
calculating the size of the target data to obtain the storage space occupation information of the target data;
determining compression degree information of the target data according to preset occupation information and storage space occupation information of the target data;
and determining the number of the data groups according to the compression degree information.
In one embodiment, the processor, when classifying the plurality of data items into the determined number of data groups, is configured to implement:
calculating the Euclidean distance between each data item and each data group;
and taking each data item in turn as a target data item, and classifying the target data item into the data group with the minimum Euclidean distance from the target data item.
In one embodiment, the processor, when calculating the Euclidean distance between each data item and each data group, is configured to implement:
randomly assigning a different random central value to each of the data groups;
and calculating the Euclidean distance between each data item and each data group according to the random central value of each data group and a preset calculation formula.
In one embodiment, the processor, when implementing the data processing method, is configured to implement:
acquiring storage space information of the target storage address after the target storage address stores the data items of the target data group;
judging whether the storage space occupied by the data items of the target data group conforms to the preset occupied space information or not according to the storage space information;
if it does not conform, deleting the data items of the target data group from the target storage address, and adjusting the number of data groups;
classifying the data items into the data groups with the adjusted number so as to update the target data group according to the data groups with the adjusted number;
and storing the data items in the updated target data group to the target storage address.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the data processing described above may refer to the corresponding process in the foregoing data processing method embodiments, and is not described herein again.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to various embodiments of the data processing method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A data processing method, comprising:
acquiring target data to be processed, wherein the target data comprises a plurality of data items;
classifying the plurality of data items into at least one data group based on a clustering algorithm model;
determining a current central value of each data group according to the data items currently contained in each data group;
acquiring a historical central value of each data group, and determining a target data group according to the current central value and the historical central value of each data group;
and storing the data items in the target data group to a target storage address.
2. The data processing method of claim 1, wherein the method further comprises:
sorting the current central value of at least one data group according to the size of the current central value of at least one data group to obtain a current central value sequence;
obtaining a historical central value sequence, wherein the historical central value sequence is formed by sorting the historical central values corresponding to the at least one data group;
the determining a target data group according to the current central value and the historical central value of each data group comprises:
judging whether the current central value sequence is the same as the historical central value sequence or not;
and if the current central value in the current central value sequence is different from the historical central value at the corresponding position in the historical central value sequence, determining the data group corresponding to the different current central value as the target data group.
3. The data processing method of claim 1 or 2, wherein after the obtaining of the target data to be processed, the method further comprises:
determining the number of data groups according to the target data to be processed;
the classifying the plurality of data items into at least one data group based on a clustering algorithm model comprises:
classifying the plurality of data items into the determined number of data groups based on the clustering algorithm model.
4. The data processing method of claim 3, wherein the determining the number of data groups according to the target data to be processed comprises:
calculating the size of the target data to obtain the storage space occupation information of the target data;
determining compression degree information of the target data according to preset occupation information and storage space occupation information of the target data;
and determining the number of the data groups according to the compression degree information.
5. The data processing method of claim 3, wherein the classifying the plurality of data items into the determined number of data groups comprises:
calculating the Euclidean distance between each data item and each data group;
and taking each data item in turn as a target data item, and classifying the target data item into the data group with the minimum Euclidean distance from the target data item.
6. The data processing method of claim 5, wherein the calculating the Euclidean distance between each data item and each data group comprises:
randomly assigning a different random central value to each of the data groups;
and respectively calculating the Euclidean distance between each data item and each data group according to the random central value of each data group and a preset calculation formula.
7. The data processing method of claim 4, wherein the method further comprises:
acquiring storage space information of the target storage address after the target storage address stores the data items of the target data group;
judging whether the storage space occupied by the data items of the target data group conforms to the preset occupied space information or not according to the storage space information;
if it does not conform, deleting the data items of the target data group from the target storage address, and adjusting the number of data groups;
classifying the data items into the data groups with the adjusted number so as to update the target data group according to the data groups with the adjusted number;
and storing the data items in the updated target data group to the target storage address.
8. A data processing apparatus, characterized in that the data processing apparatus comprises:
the data acquisition module is used for acquiring target data to be processed, and the target data comprises a plurality of data items;
a data item classification module for classifying the plurality of data items into at least one data group based on a clustering algorithm model;
the central value determining module is used for determining the current central value of each data group according to the data item currently contained in each data group;
the target data group determining module is used for acquiring a historical central value of each data group and determining a target data group according to the current central value and the historical central value of each data group;
and the data item storage module is used for storing the data items in the target data group to a target storage address.
9. A computer device, characterized in that the computer device comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the data processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 7.
CN202110708164.XA 2021-06-24 2021-06-24 Data processing method, device, equipment and computer storage medium Pending CN113434471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708164.XA CN113434471A (en) 2021-06-24 2021-06-24 Data processing method, device, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN113434471A true CN113434471A (en) 2021-09-24

Family

ID=77754195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110708164.XA Pending CN113434471A (en) 2021-06-24 2021-06-24 Data processing method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113434471A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination