CN111209997A - Data analysis method and device

Info

Publication number: CN111209997A
Application number: CN201811399481.2A
Authority: CN
Inventors: 李毫
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2020-05-29
Anticipated expiration: 2038-11-22
Also published as: CN111209997B

Abstract

The embodiment of the invention provides a data analysis method and device for massive irregular data, and belongs to the field of data analysis. The data analysis method comprises the following steps: determining a dimension set of data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; setting, in the data to be analyzed, a plurality of data sets to which data processing is applied, according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony walks through all data in each data set, and selecting data with pheromones higher than a set threshold value from each data set, wherein the higher the pheromone of a piece of data, the higher its correlation with preset required data. The method and the device can select the optimal data from the massive irregular data and avoid discarding valuable data.

Description

Data analysis method and device

Technical Field

The invention relates to the field of data analysis, in particular to a data analysis method and device.

Background

At present, a large amount of disordered and irregular data (hereinafter referred to as massive irregular data) exists in many data application fields, such as content marketing. Content marketing refers to conveying content related to an enterprise to customers through media such as pictures, text and animations in order to promote sales; that is, valuable information is delivered to users through appropriate content creation, publishing and dissemination, thereby achieving the purpose of network marketing. From this definition, it follows that data meeting the need must be extracted from massive irregular data for display and/or marketing. For example, when content marketing is performed via mobile phones, data on various aspects of a user's life and work, such as personality preferences, investment preferences, clothing preferences, professional expertise, emotional characteristics, physical characteristics, and personal likes and dislikes, needs to be acquired from the operation records of the user's mobile phone; a user profile is then built through data analysis so that personalized content marketing can be performed and the user's personalized requirements met.

However, the user data involved here is irregular, and the data generated by a single user each day is messy and voluminous. Therefore, if the user base for content marketing is large, massive irregular data is generated. For such massive irregular data, the prior art processes the data through a data modeling scheme: a created data model converts the irregular data into regular, ordered data, which is then analyzed. However, in the process of regularizing and ordering the data, data that is not covered by the data model is often discarded. In other words, once conversion by the data model fails, the unordered data, the irregular data, and even ordered data converted from part of the unordered data lose their meaning and are discarded. It should be noted that even though these data are ultimately discarded, the data model still analyzes them, which increases the data analysis load on the server and the like, occupies more data analysis resources, and ultimately reduces the efficiency of the whole data analysis process. In addition, the pace of society today is fast and user behavior habits change relatively quickly, whereas creating a data model usually takes a certain amount of time, so the rate at which data models are created in the prior art may be unable to keep up with the rapid change of user data, and the product eventually loses competitiveness.

Disclosure of Invention

The embodiment of the invention aims to provide a data analysis method and a data analysis device, which are used for solving the problem that massive irregular data are difficult to process in the prior art.

In order to achieve the above object, an embodiment of the present invention provides a data analysis method, including: determining a dimension set of data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; setting, in the data to be analyzed, a plurality of data sets to which data processing is applied, according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony walks through all data in each data set, and selecting data with pheromones higher than a set threshold value from each data set, wherein the higher the pheromone of a piece of data, the higher its correlation with preset required data.

Optionally, the performing data processing on each data set by using the ant colony algorithm includes: setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, finishing one iteration, and selecting the data with the pheromone higher than the set threshold.

Optionally, the calculating the correlation between the current data and the required data, and updating the pheromone of the current data according to the correlation calculation result includes: comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the dimension set and the index set of the current data that is covered by field names contained in the data header of the required data, and taking the percentage as a correlation calculation result; and updating the pheromone of the current data by using the corresponding relation between the correlation and the pheromone according to the correlation calculation result, wherein the higher the correlation between a piece of data and the required data, the higher the pheromone corresponding to that data.

Optionally, after the ant colony algorithm is used to perform data processing on each data set, the data analysis method further includes: after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by the two adjacent iterations, and selecting the data with higher pheromone in the two iterations until the iterations of the preset times are completed so as to select the optimal data.

Optionally, the data analysis method further includes: establishing a data map for the data to be analyzed; and determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed.

On the other hand, an embodiment of the present invention further provides a data analysis device, where the data analysis device includes: a first data processing unit, used for determining a dimension set of data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; a second data processing unit, used for setting, in the data to be analyzed, a plurality of data sets to which data processing is applied, according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and a third data processing unit, used for performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony walks through all data in each data set, and data with pheromones higher than a set threshold value is selected from each data set, wherein the higher the pheromone of a piece of data, the higher its correlation with preset required data.

Optionally, the third data processing unit includes: an initialization module, configured to set initialization parameters of the ant colony algorithm, where the initialization parameters include a data number in each data set, an initial pheromone of each data, a heuristic factor, and an expectation factor, where the expectation factor includes information of the required data; the calculation module is used for enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, and updating the pheromone of the current data according to the correlation calculation result; and the first selection module is used for selecting the data of which the pheromone is higher than the set threshold value when all ants finish walking all the data in each data set and finishing one iteration.

Optionally, the calculation module includes: a transfer probability calculation submodule, used for calculating the probability of each ant transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor; a correlation calculation submodule, configured to compare the dimension set and the index set corresponding to the current data with a data header of the required data, so as to obtain the percentage of the dimension set and the index set of the current data that is covered by field names contained in the data header of the required data, and to use the percentage as a correlation calculation result; and a pheromone calculation submodule, used for updating the pheromone of the current data by using the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein the higher the correlation between a piece of data and the required data, the higher the pheromone corresponding to that data.

Optionally, the third data processing unit further includes: the pheromone global updating module is used for updating a global pheromone table based on the pheromone of the selected data after one iteration is finished and the data of which the pheromone is higher than the set threshold is selected, and applying the updated global pheromone table to the next iteration; and the second selection module is used for comparing the data obtained by two adjacent iterations and selecting the data with higher pheromone until the iteration of the preset times is completed so as to select the optimal data.

Optionally, the data analysis apparatus further includes: the data map establishing unit is used for establishing a data map aiming at the data to be analyzed; the third data processing unit is further configured to determine the initialization parameter and/or determine the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is used for performing data processing on the data to be analyzed.

In another aspect, the present invention also provides a machine-readable storage medium, where the machine-readable storage medium has instructions stored thereon, and the instructions are used to enable a machine to execute the data analysis method described above.

In another aspect, an embodiment of the present invention further provides a processor configured to execute a program, where the program, when executed, performs the data analysis method described above.

By the technical scheme, the ant colony algorithm is applied to data analysis, optimal data can be selected from massive irregular data, and valuable data are prevented from being discarded. Moreover, the scheme of the invention is suitable for data analysis in the fields of content marketing and the like, is convenient for selecting the optimal data from mass data to carry out content marketing, and can better ensure the accuracy of content marketing.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of data processing performed on data to be analyzed by using an ant colony algorithm according to an embodiment of the present invention; and

FIG. 3 is a schematic structural diagram of a data analysis apparatus according to another embodiment of the present invention.

Description of the reference numerals

310 first data processing unit 320 second data processing unit

330 third data processing unit 331 initialization module

332 calculation module 333 first selection module

334 pheromone global update module 335 second selection module

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.

In the embodiments of the present invention, the terms "first", "second", "third", and the like are used for distinguishing similar objects, and are not necessarily used for describing a particular order or sequence. It is to be understood that the data so used is interchangeable under appropriate circumstances such that embodiments of the invention may be practiced otherwise than as specifically illustrated and described herein. Additionally, it should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a server, for example, as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than the flowchart.

The embodiment of the invention uses the ant colony algorithm to solve the problem, mentioned in the Background, of analyzing massive, irregular and disordered data. The ant colony algorithm is an evolutionary computing method proposed by the Italian scholar Dorigo et al., inspired by the foraging mechanism of ant colonies. By releasing pheromone on foraging paths and sensing it, a real ant colony can eventually find a shortest path between the nest and the food source. The ant colony algorithm works by simulating this mechanism of real ant colonies.

The most typical application of the ant colony algorithm is solving path selection problems. For example, when ants search for the shortest path between a path starting point (the nest) and a path ending point (the food source), the ant colony releases pheromone on the foraging path, a single ant probabilistically selects its next travel direction by sensing the intensity of the pheromone on the path, and indirect information transmission between ants is accomplished by sensing and releasing pheromone. If a new obstacle appears on the foraging path, the pheromone trail is temporarily cut off and the ants select the next travel direction at random, so that ants that happen to take the new shortest path around the obstacle are the first to rebuild a continuous pheromone trail. When the number of ants on the path reaches a certain level, the intensity of the pheromone on the shorter path exceeds that on the longer path, so subsequent ants select the shorter path with higher probability, and the positive feedback mechanism formed in this process enables the ants to find the new shortest path.

According to the basic principle of the ant colony algorithm, the embodiment of the invention provides a scheme for applying the ant colony algorithm to the data analysis process of massive irregular data, and the massive irregular data involved in content marketing is taken as an example in the embodiment of the invention, wherein the content marketing is to recommend commodities which can be purchased to a user according to the shopping condition of the mobile phone of the user.

Fig. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention. As shown in fig. 1, the data analysis method may include the following steps:

Step S110, determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range.

The dimension set and the index set refer to sets of dimensions and indexes respectively, and dimensions and indexes are two vital parameters in big data processing. A dimension is a characteristic of a thing or phenomenon, such as gender, region, time or consumption type. Time is a common and special dimension: by comparing values before and after a point in time, the development of a thing or phenomenon can be understood, for example, that the number of users increased by 10% compared with the previous month and by 20% compared with the same period of the previous year. An index is a unit or method for measuring the degree of development of a thing or phenomenon, and can also be called a measure; examples of indexes include population, GDP, income, number of users, profit margin, retention rate, coverage rate, consumption amount and consumption growth rate.

The data feature limit range is used to limit a data range in which data analysis is performed, and examples of the data feature limit range include data in a certain area having the same IP address and a group in a certain province. By setting the data characteristic limit range, the reasonability and the legality of the data source can be ensured.

In the embodiment of the present invention, a dimension set is set as D, an index set is set as M, a data feature limit range is set as W, and a variable (D, M, W) is obtained, where the variable (D, M, W) is an input variable for data analysis in the embodiment of the present invention.
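As an illustration only, the input variable (D, M, W) could be held in a small immutable structure. The class name `AnalysisInput`, its field names and the sample values are hypothetical and are not part of the described embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisInput:
    """Hypothetical container for the input variable (D, M, W)."""
    dimensions: frozenset  # D: dimension set, e.g. time, region
    indexes: frozenset     # M: index (measure) set, e.g. consumption amount
    limit_range: dict      # W: data feature limit range, e.g. one province

    def field_names(self) -> frozenset:
        # Union of D and M: the field names the analysis is interested in.
        return self.dimensions | self.indexes


# Sample values are invented for the example.
example_input = AnalysisInput(
    dimensions=frozenset({"time", "region"}),
    indexes=frozenset({"consumption amount"}),
    limit_range={"province": "Hebei"},
)
```

Keeping D and M as sets makes the later comparison against data header fields a plain set operation.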

Step S120, setting a plurality of data sets for data processing in the data to be analyzed according to the dimension set, the index set and the data feature limit range of the data to be analyzed. Wherein the data set may include a plurality of data.

For example, according to the variables (D, M, W), the data sets that meet the requirements of the variables (D, M, W) can be filtered from the data to be analyzed through a preset function f(D, M, W), so as to determine the data sets to which the ant colony algorithm is subsequently applied and the data in those data sets.
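The text does not specify the preset function f(D, M, W). The following is a minimal sketch under assumed rules: a record is kept if it lies inside W and exposes at least one field named in D or M, and kept records are grouped into data sets by a hypothetical `source` key. The record layout and the grouping key are assumptions made for illustration.

```python
from collections import defaultdict


def f(records, D, M, W, group_key="source"):
    """Hypothetical filter: keep records inside W that expose at least one
    field from D or M, then group them into data sets by `group_key`."""
    wanted = set(D) | set(M)
    data_sets = defaultdict(list)
    for rec in records:
        in_range = all(rec.get(k) == v for k, v in W.items())
        has_field = not wanted.isdisjoint(rec)  # any key of rec in D or M
        if in_range and has_field:
            data_sets[rec.get(group_key, "unknown")].append(rec)
    return dict(data_sets)


# Invented sample records.
records = [
    {"source": "A", "province": "Hebei", "time": "2018-11-01", "consumption amount": 30},
    {"source": "B", "province": "Hebei", "time": "2018-11-02"},
    {"source": "A", "province": "Hunan", "time": "2018-11-03"},
    {"source": "B", "province": "Hebei", "note": "no matching field"},
]
data_sets = f(records, D={"time"}, M={"consumption amount"}, W={"province": "Hebei"})
```

Here the third record falls outside W and the fourth has no field from D or M, so each of the two resulting data sets holds one record.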

Step S130, performing data processing on each data set by using an ant colony algorithm, so that each ant walks through all data in each data set, and selecting data with pheromones higher than a set threshold from each data set, wherein the higher the pheromones, the higher the correlation between the data and preset required data is.

For example, in the process of performing data processing on each data set by using the ant colony algorithm, the correlation between each piece of data and the preset required data is verified, and the pheromone of the corresponding data is updated according to the correlation. As a result, the higher the pheromone of a piece of data, the higher its correlation with the preset required data and the more likely it is to be the required data. Further, after the data processing by the ant colony algorithm is completed, the data with pheromones higher than the set threshold may be selected by sorting the pheromones from high to low.
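The final selection described here (sort by pheromone from high to low, keep what exceeds the threshold) can be sketched as follows. The function name and the sample pheromone values are hypothetical.

```python
def select_above_threshold(pheromone, threshold):
    """Return (data_id, pheromone) pairs above `threshold`, highest first."""
    ranked = sorted(pheromone.items(), key=lambda kv: kv[1], reverse=True)
    return [(data_id, value) for data_id, value in ranked if value > threshold]


# Invented pheromone values after one run over a data set.
pheromone = {"C1": 0.5, "C2": 3.0, "C3": 1.5, "C4": 2.0}
selected = select_above_threshold(pheromone, threshold=1.0)
```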

As described above, in the ant colony algorithm conventionally applied to path selection, a path start point and a path end point need to be set, whereas in the ant colony algorithm applied to data analysis according to the embodiment of the present invention, no path end point is set; instead, an ant having walked all data in a data set serves as the flag that it has completed its walk of that data set. For example, let data set A be (C1, C2, C3, C4, ..., Cn), where C1-Cn represent data. Assuming ant k starts from C1, its second step may select one of C2-Cn (C1 cannot be selected again); assuming C3 is selected, its third step may select one of C2 and C4-Cn (C1 and C3 cannot be selected again), and so on, until ant k has walked all data in C1-Cn, indicating that ant k has completed its walk. One iteration, as referred to herein, means that all ants in the ant colony have walked all data in one data set, and the number of iterations can be set as needed.
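The walk without an end point can be sketched as below. For brevity the next item is chosen uniformly at random, which is a simplification and not the pheromone-based choice of the embodiment; the function name `walk_all` is hypothetical.

```python
import random


def walk_all(data_ids, start, rng):
    """One ant's walk: begin at `start`, then keep picking an unvisited
    item until every item in the data set has been walked exactly once."""
    unvisited = [d for d in data_ids if d != start]
    path = [start]
    while unvisited:
        nxt = rng.choice(unvisited)  # uniform choice, for illustration only
        unvisited.remove(nxt)
        path.append(nxt)
    return path


data_set_a = ["C1", "C2", "C3", "C4", "C5"]
path = walk_all(data_set_a, start="C1", rng=random.Random(0))
```

The walk ends when the list of unvisited data is empty, which plays the role that reaching the path end point plays in the path selection setting.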

The following describes a specific process of applying the ant colony algorithm to perform data processing in step S130.

Fig. 2 is a schematic flow chart of data processing on data to be analyzed by using an ant colony algorithm in the embodiment of the present invention. As shown in fig. 2, the data processing procedure may include the steps of:

Step S131, setting initialization parameters of the ant colony algorithm.

The initialization parameters mainly include data numbers in each data set, initial pheromones of the data, heuristic factors and expectation factors. Where the data number and initial pheromone may be determined from a data map corresponding to the data set, the heuristic is used to calculate the probability that an ant will transfer from one data to another, both of which are described below. In addition, the expectation factor includes information of the required data, and in the embodiment of the present invention, the expectation factor may be understood as the required data.

Specifically, the heuristic factor is denoted η_ij; in the embodiment of the invention, the heuristic factor is determined during initial setting and does not change subsequently. The pheromone is denoted τ_ij(t). In the path selection problem it reflects the amount of pheromone on the path, whereas in the embodiment of the present invention it expresses the correlation between the current data i and the expectation factor (e.g., the required data meeting the requirement of (D, M, W)) when an ant moves from the current data i to the data j in the t-th iteration. It should be noted that the ant colony algorithm of the embodiment of the present invention may also have other initialization parameters, which can be understood by referring to the conventional ant colony algorithm in the prior art and will not be described in detail herein.

Step S132, each ant selects data according to initialization parameters to start walking, the probability of each ant transferring from the current data to the next data in the walking process is calculated according to the initial pheromone and the heuristic factors, the correlation between the current data and the required data is calculated when data transfer occurs each time, the pheromone of the current data is updated according to the correlation calculation result until all ants finish walking all data in each data set, one iteration is completed, and the data with the pheromone higher than the set threshold value is selected.

In a preferred embodiment, each ant can select data to start walking according to the starting point of the data given in the initialization parameters, and the selected data is in accordance with the data requirements of other parameters in the initialization parameters.

In a preferred embodiment, the following formula may be used to calculate the probability that each ant will move from the current data i to the next data j in the walk corresponding to the t-th iteration:
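A transition rule consistent with the symbols defined for this step can be written in the standard ant colony form. The notation p_ij^k(t) for the probability, and the convention that the probability is zero for data outside J_k(i), are assumptions taken from the conventional algorithm rather than from the text itself:

```latex
p_{ij}^{k}(t) =
\begin{cases}
\dfrac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}
      {\sum_{s \in J_k(i)} [\tau_{is}(t)]^{\alpha}\,[\eta_{is}]^{\beta}}, & j \in J_k(i) \\[2ex]
0, & \text{otherwise}
\end{cases}
```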

In the formula, j ∈ J_k(i), where J_k(i) is the set of data that ant k is allowed to select next in the t-th iteration; generally, the data allowed to be selected next is the set of data the ant has not yet walked. α represents the relative importance of the pheromone τ_ij(t), and β represents the relative importance of the heuristic factor η_ij. α and β are selected based on the degree of association between data i and data j; for example, for similar data obtained from the same data source, the values may be selected from [1, 10], and the higher the degree of association, the larger the value selected.
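A minimal sketch of how such transition probabilities might be computed, assuming the conventional rule in which each allowed item is weighted by pheromone raised to α times heuristic factor raised to β and the weights are then normalized. The nested-dictionary layout of `tau` and `eta`, the default values of α and β, and the uniform fallback when all weights are zero are assumptions made for illustration.

```python
def transition_probabilities(i, allowed, tau, eta, alpha=1.0, beta=2.0):
    """Probability of moving from data i to each j in `allowed`, weighting
    pheromone tau[i][j] by alpha and heuristic factor eta[i][j] by beta."""
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in allowed}
    total = sum(weights.values())
    if total == 0:
        # No information at all: fall back to a uniform choice (assumption).
        return {j: 1.0 / len(allowed) for j in allowed}
    return {j: w / total for j, w in weights.items()}


# Invented values: C3 carries three times the pheromone of C2.
tau = {"C1": {"C2": 1.0, "C3": 3.0}}
eta = {"C1": {"C2": 1.0, "C3": 1.0}}
probs = transition_probabilities("C1", ["C2", "C3"], tau, eta, alpha=1.0, beta=1.0)
```

With equal heuristic factors, the probabilities are proportional to the pheromone values, so C3 is three times as likely to be chosen as C2.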

In a preferred embodiment, in step S132, the calculating of the correlation between the current data and the required data and the updating of the pheromone of the current data according to the correlation calculation result include two parts:

1) Correlation algorithm part.

Specifically, the dimension set and the index set corresponding to the current data are compared with the data header of the required data to obtain the percentage of the dimension set and the index set of the current data that is covered by field names contained in the data header of the required data, and the percentage is used as a correlation calculation result.

The data header is the header of a piece of data. For example, for a piece of data related to the user's current consumption, the data header generally records header fields such as "time" and "amount", and the data portion other than the header records the specific time value and amount value. For example, let (D, M) be (time, consumption amount). With the percentage defined as above, the correlation is r = Data/(D, M), where Data denotes the fields of (D, M) found in the corresponding data header: if the data header contains only the field "time", r is 50%; if it contains only "consumption amount", r is also 50%; and if it contains both "time" and "consumption amount", r is 100%. Note that the calculation of the correlation r here is based on fuzzy matching; in practice (D, M) includes many parameters, so r does not easily reach 100%. Generally, the value of the correlation r lies in the range [0, 1]. It should also be noted that correlation algorithms are well established in the art and are readily combined with ant colony algorithms, so that other correlation algorithms known in the art can also be used to determine correlations between data.
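The worked example above can be reproduced with a short sketch. Exact, case-insensitive string matching stands in for the fuzzy matching mentioned in the text, and the function name `correlation` is hypothetical.

```python
def correlation(header_fields, D, M):
    """Share of the fields in (D, M) that appear in the data header.
    Exact, case-insensitive matching stands in for fuzzy matching here."""
    wanted = {name.strip().lower() for name in set(D) | set(M)}
    if not wanted:
        return 0.0
    header = {name.strip().lower() for name in header_fields}
    return len(wanted & header) / len(wanted)


D, M = {"time"}, {"consumption amount"}
r_time_only = correlation(["Time", "user id"], D, M)
r_both = correlation(["time", "consumption amount"], D, M)
r_none = correlation(["user id"], D, M)
```

The three calls correspond to a header containing one of the two fields, both fields, and neither field.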

2) Pheromone updating part.

The pheromone of the current data is updated according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein the higher the correlation between a piece of data and the required data, the higher the pheromone corresponding to that data.

For example, the correspondence between the correlation r and the pheromone can be configured in advance; for instance, once the correlation r exceeds 50%, the value of the pheromone is increased by 1 for every further 10% of correlation. It should be noted that in some embodiments, the correlation may instead be accumulated directly as the value of the pheromone.
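One possible reading of this example mapping is sketched below: the pheromone grows by 1 for each full 10 percentage points of correlation above 50%. The handling of values between steps (rounding down) and the function names are assumptions, since the text gives only the general rule.

```python
def pheromone_increment(r):
    """Assumed mapping: +1 pheromone for each full 10 percentage points
    of correlation above 50%; no increment at or below 50%."""
    pct = round(r * 100)  # work in whole percent to avoid float drift
    return max(0, (pct - 50) // 10)


def update_pheromone(pheromone, data_id, r):
    """Add the increment for correlation r to the pheromone of `data_id`."""
    pheromone[data_id] = pheromone.get(data_id, 0) + pheromone_increment(r)
    return pheromone[data_id]


pheromone = {"C1": 1}
update_pheromone(pheromone, "C1", 0.7)  # two full steps above 50%
update_pheromone(pheromone, "C2", 0.4)  # below 50%, no increase
```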

In a more preferred embodiment, step S132 may further include: after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by the two adjacent iterations, and selecting the data with higher pheromone in the two iterations until the iterations of the preset times are completed so as to select the optimal data.

Here, existing ant colony algorithms mostly perform local updating of the pheromone, which is carried out after each search step of each ant and consumes a large amount of computing time. In addition, for the embodiment of the present invention, if every ant updated the pheromone each time it passed a given piece of data, the pheromone on that data would rise rapidly and its difference from the pheromone on other paths would grow, so the pheromone on a locally optimal path would increase too fast and the algorithm would become trapped in a local optimum. Therefore, in the embodiment of the present invention, the global pheromone table is updated based on the pheromone after one iteration. For example, after the first iteration the pheromone is not cleared but retained, and the global pheromone table can be updated by taking the highest pheromone among all ants in the first iteration as the global optimal solution for the next iteration. Thus, as the number of iterations increases, the selection result comes closer to the optimal data, and after the set number of iterations (e.g., 10), the optimal data closest to the required data is finally obtained.
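A minimal sketch of a global update across iterations, under an assumed merge rule: after each iteration, items whose pheromone exceeds the threshold are folded into a global table, and for each item the higher of the stored and the new value is kept. The table layout, the merge rule and the function names are illustrative assumptions, not the embodiment's exact procedure.

```python
def merge_into_global(global_table, iteration_pheromone, threshold):
    """After one iteration, fold the selected items (pheromone above
    `threshold`) into the global table, keeping the higher value per item."""
    for data_id, value in iteration_pheromone.items():
        if value > threshold and value > global_table.get(data_id, 0.0):
            global_table[data_id] = value
    return global_table


def best_of(global_table):
    """Item with the highest pheromone in the global table, or None."""
    return max(global_table, key=global_table.get) if global_table else None


# Invented per-iteration pheromone results for one data set.
iterations = [
    {"C1": 1.0, "C2": 3.0, "C3": 0.5},
    {"C1": 4.0, "C2": 2.0, "C3": 0.8},
]
global_table = {}
for result in iterations:
    merge_into_global(global_table, result, threshold=0.9)
best = best_of(global_table)
```

Because the table is updated once per iteration rather than after every step of every ant, the number of updates is bounded by the number of data items per iteration.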

Further, after the selection of the optimal data from one dataset is completed, other datasets may continue to be selected for application of the ant colony algorithm.

In a more preferred embodiment, the data analysis method according to the embodiment of the present invention may further include: establishing a data map for the data to be analyzed; and determining the initialization parameters by referring to the data map when the data to be analyzed is subjected to data processing by the ant colony algorithm, and/or determining the plurality of data sets to which data processing is applied when the data to be analyzed is subjected to data processing by the ant colony algorithm. The data map is used for displaying data to be analyzed in a map mode, and clearly and concisely showing data storage positions, data storage modes, data sources and the like through the map. The data map may simulate a two-dimensional space, for example, which may determine data storage locations by horizontal and vertical coordinates (x, y).

It is readily appreciated that the data map is similar to the path format and can simply illustrate the association between portions of data and the association of data with desired data. Therefore, in the embodiment of the present invention, when the ant colony algorithm is used to perform data processing on the data to be analyzed in step S130, the ant heuristic factor, the initial pheromone, the initial placement position, the ant walking path, and the like are determined by referring to the data map, for example, the initial placement position and the walking path of the ant may be consistent with the data storage position and the data storage path shown by the data map. It should be noted that the initial placement position of the ants may also be randomly selected, and the embodiment of the present invention is not limited thereto.

In addition, the data map shows data sources and the like, so the different data sets to which the ant colony algorithm is applied may be determined based on, for example, the different data sources: data originating from a first service provider may form data set A, and data originating from a second service provider may form data set B.
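
As an illustration of how a data map might drive both the choice of data sets and the initial ant positions, consider the following sketch. The `MapEntry` record and its field names are assumptions made for this example; the embodiment does not prescribe a concrete data structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MapEntry:
    """One data item on the data map (all field names are illustrative)."""
    name: str
    position: Tuple[int, int]  # (x, y) storage location on the 2-D map
    source: str                # e.g. the service provider the data came from


def split_by_source(entries: List[MapEntry]) -> Dict[str, List[MapEntry]]:
    """Group map entries into data sets, one data set per data source."""
    data_sets: Dict[str, List[MapEntry]] = defaultdict(list)
    for entry in entries:
        data_sets[entry.source].append(entry)
    return dict(data_sets)


def initial_positions(data_set: List[MapEntry]) -> List[Tuple[int, int]]:
    """Use the stored (x, y) locations as candidate ant starting positions."""
    return [entry.position for entry in data_set]
```

With entries from two providers, `split_by_source` yields two data sets corresponding to data set A and data set B in the text, and `initial_positions` gives the coordinates at which ants could be placed if random placement is not used.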

In summary, the data analysis method for massive irregular data according to the embodiment of the present invention has the following advantages:

1) The ant colony algorithm is applied to data analysis of massive irregular data, so that optimal data can be selected from the massive irregular data and valuable data are not discarded.

2) The method is suitable for data analysis in the fields of content marketing and the like, so that the optimal data is selected from the mass data for content marketing, and the accuracy of the content marketing can be better ensured.

3) The ant colony algorithm helps reduce the amount of data that hardware such as servers must analyze, thereby reducing hardware research and development cost.

4) The data that the ants pass through while walking is verified by the correlation algorithm to screen out the required data, which ensures the accuracy of the final selection result of the ant colony algorithm. Moreover, the correlation algorithm is easy to implement and easy to combine with the ant colony algorithm.

5) Global pheromone updating is adopted in the ant colony algorithm, which avoids falling into a local optimum and improves the efficiency of the algorithm.

6) A data map is established to assist in determining the initial placement position and the walking path of the ants, which simplifies the algorithm and brings it closer to the regularities of the data.

Fig. 3 is a schematic structural diagram of a data analysis device according to another embodiment of the present invention, which is based on the same inventive concept as the data analysis method according to the above-described embodiment. As shown in Fig. 3, the data analysis device includes: a first data processing unit 310, configured to determine a dimension set of data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; a second data processing unit 320, configured to set, according to the dimension set, the index set, and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is applied; and a third data processing unit 330, configured to perform data processing on each data set by using the ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and to select data with pheromones higher than a set threshold from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.

In a preferred embodiment, the third data processing unit 330 comprises:

an initialization module 331, configured to set initialization parameters of the ant colony algorithm, where the initialization parameters include the data number in each data set, the initial pheromone of each data, a heuristic factor, and an expectation factor, and the expectation factor includes information of the required data;

a calculation module 332, configured to enable each ant to select data according to the initialization parameters and start walking, calculate, according to the initial pheromone and the heuristic factor, the probability that each ant transfers from the current data to the next data during walking, calculate the correlation between the current data and the required data each time a data transfer occurs, and update the pheromone of the current data according to the correlation calculation result; and

a first selection module 333, configured to select data with pheromones higher than the set threshold when all ants have walked through all data in each data set and one iteration is complete.
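
The transfer probability computed by the calculation module can be illustrated with the transition rule commonly used in ant colony optimization, in which the weight of a candidate is its pheromone raised to one exponent times its heuristic value raised to another. The sketch below assumes that rule; the exponents `alpha` and `beta`, the per-data `heuristic` values, and the function names are assumptions for illustration and are not taken from this section.

```python
import random
from typing import Dict, List, Optional


def transition_probabilities(pheromone: Dict[str, float],
                             heuristic: Dict[str, float],
                             unvisited: List[str],
                             alpha: float = 1.0,
                             beta: float = 2.0) -> Dict[str, float]:
    """Probability of moving to each unvisited data item (assumed ACO rule)."""
    if not unvisited:
        return {}
    weights = {d: (pheromone[d] ** alpha) * (heuristic[d] ** beta)
               for d in unvisited}
    total = sum(weights.values())
    if total == 0:
        # No information yet: fall back to a uniform choice.
        return {d: 1.0 / len(unvisited) for d in unvisited}
    return {d: w / total for d, w in weights.items()}


def choose_next(probabilities: Dict[str, float],
                rng: Optional[random.Random] = None) -> str:
    """Draw the next data item according to the transfer probabilities."""
    rng = rng or random.Random()
    items = list(probabilities)
    return rng.choices(items, weights=[probabilities[d] for d in items], k=1)[0]
```

Restricting the candidates to `unvisited` is what lets each ant eventually walk through all data in a data set, after which the first selection module applies the threshold.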

In a preferred embodiment, the calculation module 332 may include: a transfer probability calculation submodule, configured to calculate, according to the initial pheromone and the heuristic factor, the probability that each ant transfers from the current data to the next data during walking; a correlation calculation submodule, configured to compare the dimension set and the index set corresponding to the current data with the data header of the required data, so as to obtain the percentage of the field names in the data header of the required data that are contained in the dimension set and the index set of the current data, and to use the percentage as the correlation calculation result; and a pheromone calculation submodule, configured to update the pheromone of the current data according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein data having a higher correlation with the required data has a higher corresponding pheromone.
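
The correlation calculation can be read as a simple coverage percentage over header field names. The following sketch assumes exact string matching of field names and a linear correspondence between correlation and pheromone; the `gain` parameter and both function names are hypothetical, since the section only requires that higher correlation map to higher pheromone.

```python
from typing import Iterable


def correlation_percentage(dimension_set: Iterable[str],
                           index_set: Iterable[str],
                           required_header: Iterable[str]) -> float:
    """Percentage of required header field names covered by the current data."""
    header = set(required_header)
    if not header:
        return 0.0
    covered = header & (set(dimension_set) | set(index_set))
    return 100.0 * len(covered) / len(header)


def updated_pheromone(current: float, correlation: float,
                      gain: float = 0.01) -> float:
    """Assumed linear correspondence: higher correlation, higher pheromone."""
    return current + gain * correlation
```

For example, if the required data header has the fields `region`, `date`, `clicks` and `spend`, and the current data has dimensions `region`, `date` and index `clicks`, three of the four header fields are covered and the correlation is 75 percent.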

In a preferred embodiment, the third data processing unit 330 further comprises: the pheromone global updating module 334 is configured to, after one iteration is completed and data with pheromones higher than the set threshold is selected, update a global pheromone table based on the pheromones of the selected data, and apply the updated global pheromone table to the next iteration; and a second selecting module 335, configured to compare data obtained from two adjacent iterations, and select data with higher pheromone between the two iterations until the predetermined number of iterations is completed, so as to select optimal data.
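
One possible reading of the second selecting module is sketched below: after each iteration the data above the threshold is selected, and when the selections of two adjacent iterations are compared, the higher pheromone is kept for each data item. The per-item comparison and all function names are assumptions; the text does not fix how the comparison is carried out.

```python
from typing import Dict, List

PheromoneTable = Dict[str, float]


def select_above_threshold(table: PheromoneTable,
                           threshold: float) -> PheromoneTable:
    """Keep only data whose pheromone is higher than the set threshold."""
    return {d: p for d, p in table.items() if p > threshold}


def merge_adjacent(previous: PheromoneTable,
                   current: PheromoneTable) -> PheromoneTable:
    """For data selected in either iteration, keep the higher pheromone."""
    merged = dict(previous)
    for datum, value in current.items():
        if value > merged.get(datum, float("-inf")):
            merged[datum] = value
    return merged


def best_after_iterations(tables: List[PheromoneTable],
                          threshold: float) -> PheromoneTable:
    """Fold the per-iteration selections over the predetermined iterations."""
    best: PheromoneTable = {}
    for table in tables:
        best = merge_adjacent(best, select_above_threshold(table, threshold))
    return best
```

Here `tables` stands for the pheromone tables produced by successive iterations, so its length plays the role of the predetermined number of iterations.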

In a more preferred embodiment, the data analysis apparatus further includes: a data map establishing unit (not shown in the figure) for establishing a data map for the data to be analyzed. The third data processing unit 330 is further configured to determine an initial placement position and a walking path of an ant and/or determine the multiple data sets to which data processing is applied, with reference to the data map, when the ant colony algorithm is used to perform data processing on the data to be analyzed.

For details and advantages of the data analysis apparatus according to the embodiment of the present invention, reference may be made to the above-mentioned embodiments related to the data analysis method, which are not described herein again.

In other embodiments, the data analysis apparatus includes a processor and a memory, the first data processing unit, the second data processing unit, the third data processing unit, the data map establishing unit, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more than one kernel can be set, and data analysis of massive irregular data is realized by adjusting kernel parameters.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium, on which a program is stored, where the program, when executed by a processor, implements the data analysis method for massive irregular data.

The embodiment of the invention provides a processor, which is used for running a program, wherein the data analysis method aiming at massive irregular data is executed when the program runs.

An embodiment of the present invention provides a device comprising a processor, a memory, and a program that is stored in the memory and can run on the processor, wherein the processor, when executing the program, implements the following steps: determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range; setting, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is applied; and performing data processing on each data set by adopting the ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and selecting data with pheromones higher than a set threshold from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.

When executing the program, the processor further implements the following steps: setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factors, calculating the correlation between the current data and the required data each time a data transfer occurs, and updating the pheromone of the current data according to the correlation calculation result, until all ants have walked through all data in each data set and one iteration is complete, and then selecting the data with pheromones higher than the set threshold.

When executing the program, the processor further implements the following steps: comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the data header of the required data that are contained in the dimension set and the index set of the current data, and taking the percentage as the correlation calculation result; and updating the pheromone of the current data according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein data having a higher correlation with the required data has a higher corresponding pheromone.

When executing the program, the processor further implements the following steps: after one iteration is completed and the data with pheromones higher than the set threshold is selected, updating a global pheromone table based on the pheromones of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by two adjacent iterations, and selecting the data with the higher pheromone of the two iterations, until the predetermined number of iterations is completed, so as to select the optimal data.

When executing the program, the processor further implements the following steps: establishing a data map for the data to be analyzed; and determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed, and/or determining the initialization parameters with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed. The device herein may be a server, a PC, a PAD, a mobile phone, etc.

An embodiment of the present invention further provides a computer program product which, when executed on a data processing apparatus, is adapted to execute a program initialized with the following method steps:

1) Determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range; setting, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is applied; and performing data processing on each data set by adopting the ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and selecting data with pheromones higher than a set threshold from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.

2) Setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, finishing one iteration, and selecting the data with the pheromone higher than the set threshold.

3) Comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the data header of the required data that are contained in the dimension set and the index set of the current data, and taking the percentage as the correlation calculation result; and updating the pheromone of the current data according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein data having a higher correlation with the required data has a higher corresponding pheromone.

4) After one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by the two adjacent iterations, and selecting the data with higher pheromone in the two iterations until the iterations of the preset times are completed so as to select the optimal data.

5) Establishing a data map for the data to be analyzed; and determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A data analysis method, characterized in that the data analysis method comprises:

determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range;

setting, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is applied; and

performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and selecting data with pheromones higher than a set threshold from each data set, wherein the higher the pheromone of the data, the higher the correlation between the data and preset required data.

2. The data analysis method of claim 1, wherein the data processing of each of the data sets using the ant colony algorithm comprises:

setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and

enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from current data to next data during walking according to the initial pheromone and the heuristic factors, calculating the correlation between the current data and the required data when data transfer occurs each time, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, finishing one iteration, and selecting the data with the pheromone higher than the set threshold.

3. The data analysis method according to claim 2, wherein the calculating of the correlation between the current data and the required data and the updating of the pheromone of the current data according to the correlation calculation result includes:

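comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the data header of the required data that are contained in the dimension set and the index set of the current data, and taking the percentage as the correlation calculation result; and

updating the pheromone of the current data according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein data having a higher correlation with the required data has a higher corresponding pheromone.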

4. The data analysis method of claim 2, wherein after the data processing of each data set using the ant colony algorithm, the data analysis method further comprises:

after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and

comparing the data obtained by two adjacent iterations, and selecting the data with the higher pheromone of the two iterations, until the predetermined number of iterations is completed, so as to select the optimal data.

5. The data analysis method of claim 2, further comprising:

establishing a data map for the data to be analyzed; and

determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed.

6. A data analysis device, characterized in that the data analysis device comprises:

the first data processing unit is used for determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range;

the second data processing unit is used for setting, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is applied; and

the third data processing unit is used for performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and for selecting data with pheromones higher than a set threshold from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.

7. The data analysis device according to claim 6, wherein the third data processing unit includes:

an initialization module, configured to set initialization parameters of the ant colony algorithm, where the initialization parameters include a data number in each data set, an initial pheromone of each data, a heuristic factor, and an expectation factor, where the expectation factor includes information of the required data;

a calculation module, configured to enable each ant to select data according to the initialization parameters to start walking, calculate the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculate the correlation between the current data and the required data each time a data transfer occurs, and update the pheromone of the current data according to the correlation calculation result; and

a first selection module, configured to select the data with pheromones higher than the set threshold when all ants have walked through all the data in each data set and one iteration is complete.

8. The data analysis device of claim 7, wherein the calculation module comprises:

a transfer probability calculation submodule, configured to calculate, according to the initial pheromone and the heuristic factor, the probability that each ant transfers from the current data to the next data during walking;

a correlation calculation submodule, configured to compare the dimension set and the index set corresponding to the current data with the data header of the required data, so as to obtain the percentage of the field names in the data header of the required data that are contained in the dimension set and the index set of the current data, and to use the percentage as the correlation calculation result; and

a pheromone calculation submodule, configured to update the pheromone of the current data according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein data having a higher correlation with the required data has a higher corresponding pheromone.

9. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the data analysis method of any one of claims 1 to 5.

10. A processor configured to run a program, wherein the program, when run, performs the data analysis method of any one of claims 1 to 5.