CN111209997A - Data analysis method and device - Google Patents

Info

Publication number
CN111209997A
Authority
CN
China
Prior art keywords
data
pheromone
correlation
ant
analyzed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811399481.2A
Other languages
Chinese (zh)
Other versions
CN111209997B (en)
Inventor
李毫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201811399481.2A priority Critical patent/CN111209997B/en
Publication of CN111209997A publication Critical patent/CN111209997A/en
Application granted granted Critical
Publication of CN111209997B publication Critical patent/CN111209997B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations

Abstract

An embodiment of the invention provides a data analysis method and device for massive irregular data, and belongs to the field of data analysis. The data analysis method comprises the following steps: determining a dimension set of the data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; setting, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is to be applied; and performing data processing on each data set by using an ant colony algorithm, so that each ant in the ant colony walks through all data in each data set, and selecting from each data set the data whose pheromone is higher than a set threshold, wherein the higher the pheromone of a piece of data, the higher the correlation between that data and preset required data. The method and the device can select the optimal data from massive irregular data and avoid discarding valuable data.

Description

Data analysis method and device
Technical Field
The invention relates to the field of data analysis, in particular to a data analysis method and device.
Background
At present, a large amount of disordered and irregular data (hereinafter referred to as massive irregular data) exists in many data application fields, such as content marketing. Content marketing refers to conveying content about an enterprise to customers through media such as pictures, text and animation in order to promote sales; that is, valuable information is conveyed to users through well-planned content creation, publishing and propagation, thereby achieving the purpose of network marketing. From this definition it follows that data meeting a given need must be extracted from massive irregular data for display and/or marketing. For example, when content marketing is carried out via mobile phones, data on many aspects of a user's life and work, such as personality tendencies, investment preferences, clothing preferences, professional specialty, emotional characteristics, physical characteristics and personal likes and dislikes, need to be acquired from the operation records of the user's mobile phone. A user profile is then built through data analysis, and personalized content marketing is carried out to meet the user's personalized requirements.
However, the user data involved here is irregular, and the data generated by a single user each day is messy and voluminous. Therefore, if the user base for content marketing is large, massive irregular data is generated. In the prior art, such data is processed through a data modeling scheme: a data model is created that converts the irregular data into regular, ordered data, which is then analyzed. In the course of regularizing and ordering the data, however, data not covered by the data model is often discarded. In other words, once conversion by the data model fails, the unordered data, the irregular data, and even ordered data converted from partially unordered data lose their meaning and are discarded. It should be noted that even though such data is finally discarded, the data model still analyzes it, which increases the data analysis load on servers and the like, occupies more data analysis resources, and ultimately reduces the efficiency of the whole data analysis process. In addition, the pace of today's society is fast and user behavior habits change quickly, whereas creating a data model usually takes a certain amount of time. The rate at which data models are created in the prior art may therefore fail to keep up with rapid changes in user data, and the product eventually loses competitiveness.
Disclosure of Invention
The embodiment of the invention aims to provide a data analysis method and a data analysis device, which are used for solving the problem that massive irregular data are difficult to process in the prior art.
In order to achieve the above object, an embodiment of the present invention provides a data analysis method, including: determining a dimension set of the data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; setting, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is to be applied; and performing data processing on each data set by using an ant colony algorithm, so that each ant in the ant colony walks through all data in each data set, and selecting from each data set the data whose pheromone is higher than a set threshold, wherein the higher the pheromone of a piece of data, the higher the correlation between that data and preset required data.
Optionally, the performing data processing on each data set by using the ant colony algorithm includes: setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, finishing one iteration, and selecting the data with the pheromone higher than the set threshold.
Optionally, the calculating the correlation between the current data and the required data, and updating the pheromone of the current data according to the correlation calculation result includes: comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the dimension set and the index set of the current data that are contained in the data header of the required data, and taking the percentage as the correlation calculation result; and updating the pheromone of the current data according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein the higher the correlation between a piece of data and the required data, the higher the pheromone corresponding to that data.
Optionally, after the ant colony algorithm is used to perform data processing on each data set, the data analysis method further includes: after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by the two adjacent iterations, and selecting the data with higher pheromone in the two iterations until the iterations of the preset times are completed so as to select the optimal data.
Optionally, the data analysis method further includes: establishing a data map for the data to be analyzed; and determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed.
In another aspect, an embodiment of the present invention further provides a data analysis device, where the data analysis device includes: a first data processing unit, configured to determine a dimension set of the data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; a second data processing unit, configured to set, according to the dimension set, the index set and the data feature limit range of the data to be analyzed, a plurality of data sets in the data to be analyzed to which data processing is to be applied; and a third data processing unit, configured to perform data processing on each data set by using an ant colony algorithm, so that each ant in the ant colony walks through all data in each data set, and to select from each data set the data whose pheromone is higher than a set threshold, wherein the higher the pheromone of a piece of data, the higher the correlation between that data and preset required data.
Optionally, the third data processing unit includes: an initialization module, configured to set initialization parameters of the ant colony algorithm, where the initialization parameters include a data number in each data set, an initial pheromone of each data, a heuristic factor, and an expectation factor, where the expectation factor includes information of the required data; the calculation module is used for enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, and updating the pheromone of the current data according to the correlation calculation result; and the first selection module is used for selecting the data of which the pheromone is higher than the set threshold value when all ants finish walking all the data in each data set and finishing one iteration.
Optionally, the calculation module includes: the transfer probability calculation submodule is used for calculating the probability of transferring the current data to the next data of each ant in the walking process according to the initial pheromone and the heuristic factor; a correlation calculation submodule, configured to compare the dimension set and the index set corresponding to the current data with a data header of the required data, so as to obtain a percentage of a field name included in the data header of the required data, where the field name includes the dimension set and the index set of the current data, and use the percentage as a correlation calculation result; and a pheromone calculation submodule for updating the pheromone of the current data by using the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein the pheromone corresponding to the data having the higher correlation with the required data is higher.
Optionally, the third data processing unit further includes: the pheromone global updating module is used for updating a global pheromone table based on the pheromone of the selected data after one iteration is finished and the data of which the pheromone is higher than the set threshold is selected, and applying the updated global pheromone table to the next iteration; and the second selection module is used for comparing the data obtained by two adjacent iterations and selecting the data with higher pheromone until the iteration of the preset times is completed so as to select the optimal data.
Optionally, the data analysis apparatus further includes: the data map establishing unit is used for establishing a data map aiming at the data to be analyzed; the third data processing unit is further configured to determine the initialization parameter and/or determine the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is used for performing data processing on the data to be analyzed.
In another aspect, the present invention also provides a machine-readable storage medium, where the machine-readable storage medium has instructions stored thereon, and the instructions are used to enable a machine to execute the data analysis method described above.
In another aspect, an embodiment of the present invention further provides a processor configured to run a program, wherein the program, when run, performs the data analysis method described above.
By the technical scheme, the ant colony algorithm is applied to data analysis, optimal data can be selected from massive irregular data, and valuable data are prevented from being discarded. Moreover, the scheme of the invention is suitable for data analysis in the fields of content marketing and the like, is convenient for selecting the optimal data from mass data to carry out content marketing, and can better ensure the accuracy of content marketing.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of data processing performed on data to be analyzed by using an ant colony algorithm according to an embodiment of the present invention; and
fig. 3 is a schematic structural diagram of a data analysis apparatus according to another embodiment of the present invention.
Description of the reference numerals
310 first data processing unit          320 second data processing unit
330 third data processing unit          331 initialization module
332 calculation module                  333 first selection module
334 pheromone global update module      335 second selection module
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
In the embodiments of the present invention, the terms "first", "second", "third", and the like are used for distinguishing similar objects, and are not necessarily used for describing a particular order or sequence. It is to be understood that the data so used is interchangeable under appropriate circumstances such that embodiments of the invention may be practiced otherwise than as specifically illustrated and described herein. Additionally, it should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a server, for example, as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than the flowchart.
The embodiment of the invention uses the ant colony algorithm to solve the problem, mentioned in the Background, of analyzing massive, irregular and disordered data. The ant colony algorithm is an evolutionary computing method proposed by the Italian scholar Dorigo et al., inspired by the foraging mechanism of ant colonies. By releasing pheromone on their foraging paths and sensing that pheromone, a real ant colony can eventually find the shortest path between the nest and a food source. The ant colony algorithm works by simulating this mechanism of real ant colonies.
The most typical application of the ant colony algorithm is path selection, for example finding the shortest path between a path starting point (the nest) and a path ending point (the food source). The ant colony releases pheromone on the foraging path, a single ant chooses its next direction of travel probabilistically by sensing the intensity of the pheromone on the path, and the ants exchange information indirectly by sensing and releasing pheromone. If a new obstacle appears on the foraging path, the pheromone trail is temporarily cut off and the ants choose their next direction at random; the ants that happen to take the new shortest path around the obstacle are the first to rebuild a continuous pheromone trail. Once enough ants have travelled the paths, the pheromone on the shorter path becomes stronger than that on the longer path, so subsequent ants choose the shorter path with higher probability. The positive feedback formed in this process enables the ants to find the new shortest path.
According to the basic principle of the ant colony algorithm, the embodiment of the invention provides a scheme for applying the ant colony algorithm to the data analysis process of massive irregular data, and the massive irregular data involved in content marketing is taken as an example in the embodiment of the invention, wherein the content marketing is to recommend commodities which can be purchased to a user according to the shopping condition of the mobile phone of the user.
Fig. 1 is a schematic flow chart of a data analysis method according to an embodiment of the present invention. As shown in fig. 1, the data analysis method may include the following steps:
step S110, determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range.
The dimension set and the index set are, respectively, sets of dimensions and sets of indexes; dimensions and indexes are two vital parameters in big data processing. A dimension is a characteristic of a thing or phenomenon, such as gender, region, time or consumption type. Time is a common and special dimension: comparing values across time shows how a thing or phenomenon is developing, for example that the number of users increased by 10% compared with the previous month and by 20% compared with the same period of the previous year. An index is a unit or method for measuring the degree of development of a thing or phenomenon, and may also be called a measure; examples of indexes include population, GDP, income, number of users, profit margin, retention rate, coverage rate, consumption amount and consumption growth rate.
The data feature limit range is used to limit a data range in which data analysis is performed, and examples of the data feature limit range include data in a certain area having the same IP address and a group in a certain province. By setting the data characteristic limit range, the reasonability and the legality of the data source can be ensured.
In the embodiment of the present invention, a dimension set is set as D, an index set is set as M, a data feature limit range is set as W, and a variable (D, M, W) is obtained, where the variable (D, M, W) is an input variable for data analysis in the embodiment of the present invention.
Step S120, setting a plurality of data sets for data processing in the data to be analyzed according to the dimension set, the index set and the data feature limit range of the data to be analyzed. Wherein the data set may include a plurality of data.
For example, according to the variables (D, M, W), the data sets required by the variables (D, M, W) can be filtered out of the data to be analyzed by a preset function f(D, M, W), so as to determine the data sets to which the ant colony algorithm is subsequently applied and the data in those data sets.
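The text does not specify the form of f(D, M, W), so the following is only a minimal sketch of one possible filter. The record layout (`source`, `region`, `fields`), the rule "at least one dimension and at least one index must be present", and the grouping by data source are all assumptions made for illustration.

```python
from collections import defaultdict

def f(records, D, M, W):
    """Hypothetical filter: keep records inside the limit range W that carry
    at least one dimension from D and one index from M, grouped by source."""
    data_sets = defaultdict(list)
    for rec in records:
        if not W(rec):                     # outside the data feature limit range
            continue
        fields = set(rec["fields"])
        if fields & D and fields & M:      # has a dimension and an index
            data_sets[rec["source"]].append(rec)
    return dict(data_sets)

# Illustrative records; all values are made up for this sketch.
records = [
    {"source": "A", "region": "Beijing", "fields": ["time", "amount"]},
    {"source": "A", "region": "Shanghai", "fields": ["time", "amount"]},
    {"source": "B", "region": "Beijing", "fields": ["time", "retention"]},
    {"source": "B", "region": "Beijing", "fields": ["colour"]},
]
D = {"time", "region"}                     # dimension set
M = {"amount", "retention"}                # index set
W = lambda rec: rec["region"] == "Beijing" # data feature limit range

data_sets = f(records, D, M, W)
```

Here two data sets result, one per source, each holding the single record that lies within W and matches both D and M.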
Step S130, performing data processing on each data set by using an ant colony algorithm, so that each ant walks through all data in each data set, and selecting from each data set the data whose pheromone is higher than a set threshold, wherein the higher the pheromone of a piece of data, the higher the correlation between that data and preset required data.
For example, in the process of performing data processing on each data set by using the ant colony algorithm, the correlation between each piece of data and the preset required data is checked, and the pheromone of the corresponding data is updated according to that correlation. As a result, the higher the pheromone of a piece of data, the higher its correlation with the preset required data and the more likely it is to be the required data. Further, after the data processing by the ant colony algorithm is completed, the pheromones may be sorted from high to low in order to select the data whose pheromone is higher than the set threshold.
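The selection step at the end of this paragraph can be sketched as follows. This is an illustration only; the function name, the dictionary layout and the sample values are assumptions rather than anything given in the text.

```python
def select_above_threshold(pheromone, threshold):
    """Sort data by pheromone from high to low and keep those above the threshold."""
    ranked = sorted(pheromone.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, tau in ranked if tau > threshold]

# Illustrative pheromone values after processing one data set.
pheromone = {"C1": 4.0, "C2": 1.5, "C3": 6.0, "C4": 3.0}
selected = select_above_threshold(pheromone, threshold=3.0)
```

With these values the result lists C3 before C1; C4 sits exactly on the threshold and is therefore not selected.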
As described above, the ant colony algorithm conventionally applied to path selection requires a path starting point and a path ending point to be set. In the ant colony algorithm applied to data analysis according to the embodiment of the present invention, no path ending point is set; instead, an ant having walked through all data in a data set serves as the sign that it has completed its walk of that data set. For example, let data set A be (C1, C2, C3, C4, ..., Cn), where C1 to Cn represent data. Assuming ant k starts from C1, in its second step it may select any one of C2 to Cn (C1 cannot be selected again); assuming C3 is selected, in its third step it may select C2 or any one of C4 to Cn (C3 cannot be selected again); and so on, until ant k has walked through all of C1 to Cn, at which point ant k has completed its walk. One iteration, as referred to herein, means that all ants in the ant colony have walked through all data in one data set, and the number of iterations can be set as needed.
The following describes a specific process of applying the ant colony algorithm to perform data processing in step S130.
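The walk described above, with no end point and no revisiting, can be sketched as follows. For brevity the next data item is drawn uniformly at random here, which is an assumption of this sketch; the embodiment uses a pheromone-based transition probability instead.

```python
import random

def walk(data_set, start, rng):
    """Visit every item exactly once, starting from `start`; no end point is fixed."""
    unvisited = [c for c in data_set if c != start]
    path = [start]
    while unvisited:
        nxt = rng.choice(unvisited)   # simplification: uniform choice among unvisited data
        unvisited.remove(nxt)         # data already walked cannot be selected again
        path.append(nxt)
    return path

data_set_a = ["C1", "C2", "C3", "C4", "C5"]
path = walk(data_set_a, "C1", random.Random(0))
```

The walk ends when the unvisited list is empty, which plays the role that reaching the path ending point plays in the conventional algorithm.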
Fig. 2 is a schematic flow chart of data processing on data to be analyzed by using an ant colony algorithm in the embodiment of the present invention. As shown in fig. 2, the data processing procedure may include the steps of:
step S131, setting initialization parameters of the ant colony algorithm.
The initialization parameters mainly include data numbers in each data set, initial pheromones of the data, heuristic factors and expectation factors. Where the data number and initial pheromone may be determined from a data map corresponding to the data set, the heuristic is used to calculate the probability that an ant will transfer from one data to another, both of which are described below. In addition, the expectation factor includes information of the required data, and in the embodiment of the present invention, the expectation factor may be understood as the required data.
Specifically, the heuristic factor is denoted η_ij(t); in the embodiment of the invention, the heuristic factor is determined during initial setting and does not change subsequently. The pheromone is denoted τ_ij(t). In the path selection problem it reflects the amount of information on the path; in the embodiment of the present invention it expresses the correlation between the current data i and the expectation factor (e.g. the required data meeting the requirement of (D, M, W)) when an ant moves from the current data i to data j in the t-th iteration. It should be noted that the ant colony algorithm of the embodiment of the present invention may also have other initialization parameters, which can be understood by referring to the conventional ant colony algorithm in the prior art and are not described in detail herein.
Step S132, each ant selects data according to initialization parameters to start walking, the probability of each ant transferring from the current data to the next data in the walking process is calculated according to the initial pheromone and the heuristic factors, the correlation between the current data and the required data is calculated when data transfer occurs each time, the pheromone of the current data is updated according to the correlation calculation result until all ants finish walking all data in each data set, one iteration is completed, and the data with the pheromone higher than the set threshold value is selected.
In a preferred embodiment, each ant can select data to start walking according to the starting point of the data given in the initialization parameters, and the selected data is in accordance with the data requirements of other parameters in the initialization parameters.
In a preferred embodiment, the following formula may be used to calculate the probability that each ant will move from the current data i to the next data j in the walk corresponding to the t-th iteration:
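\[
p_{ij}^{k}(t) =
\begin{cases}
\dfrac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}(t)]^{\beta}}{\sum_{s \in J_k(i)} [\tau_{is}(t)]^{\alpha}\,[\eta_{is}(t)]^{\beta}}, & j \in J_k(i) \\
0, & \text{otherwise}
\end{cases}
\]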
In the formula, J_k(i) is the set of data that ant k is allowed to select next in the t-th iteration; generally, this is the set of data that the ant has not yet walked. α represents the relative importance of the pheromone τ_ij(t), and β represents the relative importance of the heuristic factor η_ij(t). α and β are selected based on the degree of association between data i and data j; for example, for similar data obtained from the same data source, the values may be selected from [1, 10], and the higher the degree of association, the larger the value selected.
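As a rough illustration of how such a transition probability can be evaluated, the sketch below assumes the rule takes the standard ant colony form, in which the probability of moving from i to j is proportional to τ_ij raised to α times η_ij raised to β, normalized over the data the ant may still select. The function name, the dictionary keyed by (i, j) pairs and the sample values are assumptions.

```python
def transition_probabilities(i, allowed, tau, eta, alpha=1.0, beta=1.0):
    """Probability of moving from data i to each data j in `allowed`
    (the data not yet walked), assuming the standard ant colony form."""
    weights = {j: (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Illustrative pheromone and heuristic values on the transitions out of C1.
tau = {("C1", "C2"): 1.0, ("C1", "C3"): 2.0, ("C1", "C4"): 1.0}
eta = {("C1", "C2"): 1.0, ("C1", "C3"): 1.0, ("C1", "C4"): 2.0}

probs = transition_probabilities("C1", ["C2", "C3", "C4"], tau, eta)
```

With α = β = 1 the weights are 1, 2 and 2, so C3 and C4 are each twice as likely to be chosen as C2. Raising α shifts the choice toward transitions with more pheromone, while raising β shifts it toward transitions with a larger heuristic factor.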
In a preferred embodiment, in step S132, the calculating of the correlation between the current data and the required data and the updating of the pheromone of the current data according to the correlation calculation result include two parts:
1) Correlation calculation part.
Specifically, the dimension set and the index set corresponding to the current data are compared with the data header of the required data to obtain the percentage of the field names in the dimension set and the index set of the current data that are contained in the data header of the required data, and the percentage is used as the correlation calculation result.
The data header is the header portion of a piece of data. For example, for a piece of data related to the user's current consumption, the data header generally records field names such as "time" and "amount", while the data portion outside the header records the specific time value and amount value. For example, let (D, M) be (time, amount of consumption). If the field names in the corresponding data header include "time", then, with the percentage defined as above, the correlation is r = Data/(D, M) = 50%; if the data header includes "amount of consumption" instead, the correlation r is also 50%; and if the data header includes both "time" and "amount of consumption", the correlation r is 100%. Note that the calculation of the correlation r here is based on fuzzy matching; in practice (D, M) includes many parameters, so the correlation r does not easily reach 100%. Generally, the range of the correlation r is [0, 1]. It should also be noted that correlation algorithms are well established in the art and are readily combined with ant colony algorithms, so other correlation algorithms known in the art can also be used to determine the correlation between data.
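The percentage in this example can be sketched as follows. The sketch uses exact, case-insensitive matching of field names, which is a simplifying assumption; the text describes fuzzy matching without specifying it. The function and parameter names are also assumptions.

```python
def correlation(header_fields, dm_fields):
    """Share of the (D, M) field names that appear in the data header, in [0, 1]."""
    wanted = {name.strip().lower() for name in dm_fields}
    if not wanted:
        return 0.0
    present = {name.strip().lower() for name in header_fields}
    return len(wanted & present) / len(wanted)

dm = ["time", "amount of consumption"]          # (D, M) from the example

r_time_only = correlation(["time", "user id"], dm)
r_both = correlation(["Time", "Amount of consumption"], dm)
r_none = correlation(["user id"], dm)
```

A header containing only "time" gives 0.5, a header containing both field names gives 1.0, and a header containing neither gives 0.0, matching the 50% and 100% cases in the text.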
2) Pheromone updating part.
The pheromone of the current data is updated according to the correlation calculation result by using the correspondence between correlation and pheromone, wherein the higher the correlation between a piece of data and the required data, the higher the pheromone corresponding to that data.
For example, the correspondence between the correlation r and the pheromone can be configured in advance; for instance, once the correlation r exceeds 50%, the value of the pheromone is increased by 1 for every further 10%. It should be noted that in some embodiments the correlation may instead be accumulated directly as the value of the pheromone.
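One possible reading of this example correspondence is sketched below: nothing is added at or below 50%, and 1 is added for each full 10 percentage points above 50%. This interpretation, the function names and the use of whole percentages (to avoid floating-point rounding at the step boundaries) are assumptions.

```python
def pheromone_increment(r):
    """Assumed step rule: +1 for each full 10 percentage points of correlation above 50%."""
    pct = round(r * 100)               # work in whole percent to avoid float artefacts
    return max(0, (pct - 50) // 10)

def update_pheromone(pheromone, item, r):
    """Add the increment for correlation r to the pheromone of `item`."""
    pheromone[item] = pheromone.get(item, 0) + pheromone_increment(r)
    return pheromone[item]

pheromone = {"C1": 1, "C2": 1}
update_pheromone(pheromone, "C1", 0.8)  # 30 points above 50%
update_pheromone(pheromone, "C2", 0.4)  # below 50%
```

After these two updates C1 has gained 3 and C2 is unchanged. The alternative mentioned in the text, accumulating the correlation directly, would replace the increment with r itself.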
In a more preferred embodiment, step S132 may further include: after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by the two adjacent iterations, and selecting the data with higher pheromone in the two iterations until the iterations of the preset times are completed so as to select the optimal data.
Here, most existing ant colony algorithms update the pheromone locally, that is, after each search step of an ant, which consumes a large amount of computing time. In addition, in the embodiment of the present invention, if every ant updated the pheromone whenever it passed through a piece of data, the pheromone on that data would rise rapidly and its difference from the pheromone on other paths would grow; the pheromone on a locally optimal path would then increase too fast and the search would fall into a local optimum. Therefore, in the embodiment of the present invention, the global pheromone table is updated based on the pheromone after each iteration. For example, after the first iteration the pheromone is not cleared but retained, and the highest pheromone among all ants in the first iteration can be taken as the global optimal solution with which to update the global pheromone table for the next iteration. Thus, as the number of iterations increases, the selection result comes closer to the optimal data, and after the set number of iterations (e.g., 10) the optimal data closest to the required data is finally obtained.
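The iteration scheme can be sketched roughly as follows. Several simplifications are assumed: the correlation is accumulated directly as pheromone, each transition is weighted by the pheromone of the candidate data only (no heuristic factor), every piece of data starts with pheromone 1.0, and "comparing two adjacent iterations" is read as keeping, for each selected item, the higher of its pheromone values so far. All names and numbers are hypothetical.

```python
import random

def run_iterations(items, correlation_of, n_ants=5, n_iterations=10,
                   threshold=3.0, seed=0):
    rng = random.Random(seed)
    global_tau = {c: 1.0 for c in items}   # global pheromone table (assumed initial value)
    best = {}                              # highest pheromone seen per selected item
    for _ in range(n_iterations):
        tau = dict(global_tau)             # working copy for this iteration
        for _ant in range(n_ants):
            unvisited = list(items)
            while unvisited:               # each ant walks through all data once
                weights = [tau[c] for c in unvisited]
                current = rng.choices(unvisited, weights=weights, k=1)[0]
                unvisited.remove(current)
                tau[current] += correlation_of[current]
        selected = {c: tau[c] for c in items if tau[c] > threshold}
        global_tau.update(selected)        # global table updated once per iteration
        for c, value in selected.items():  # keep the higher value across iterations
            best[c] = max(best.get(c, 0.0), value)
    return sorted(best, key=best.get, reverse=True)

correlation_of = {"C1": 0.9, "C2": 0.2, "C3": 0.6}   # illustrative correlations
optimal = run_iterations(["C1", "C2", "C3"], correlation_of)
```

In this simplified form the walk order does not change the totals, because every ant visits every item once; the point of the sketch is only that the global table is written once per iteration and carried into the next, so data with high correlation pulls further ahead while C2 never crosses the threshold.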
Further, after the selection of the optimal data from one dataset is completed, other datasets may continue to be selected for application of the ant colony algorithm.
In a more preferred embodiment, the data analysis method according to the embodiment of the present invention may further include: establishing a data map for the data to be analyzed; and determining the initialization parameters by referring to the data map when the data to be analyzed is subjected to data processing by the ant colony algorithm, and/or determining the plurality of data sets to which data processing is applied when the data to be analyzed is subjected to data processing by the ant colony algorithm. The data map is used for displaying data to be analyzed in a map mode, and clearly and concisely showing data storage positions, data storage modes, data sources and the like through the map. The data map may simulate a two-dimensional space, for example, which may determine data storage locations by horizontal and vertical coordinates (x, y).
It is readily appreciated that the data map is similar to the path format and can simply illustrate the association between portions of data and the association of data with desired data. Therefore, in the embodiment of the present invention, when the ant colony algorithm is used to perform data processing on the data to be analyzed in step S130, the ant heuristic factor, the initial pheromone, the initial placement position, the ant walking path, and the like are determined by referring to the data map, for example, the initial placement position and the walking path of the ant may be consistent with the data storage position and the data storage path shown by the data map. It should be noted that the initial placement position of the ants may also be randomly selected, and the embodiment of the present invention is not limited thereto.
In addition, the data map shows data sources and the like, so that the different data sets to which the ant colony algorithm is applied may be determined based on, for example, the different data sources, such as taking data originating from a first service provider as data set A and data originating from a second service provider as data set B.
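The use of the data map can be illustrated with a short sketch. All names here (`MapEntry`, `group_by_source`, `initial_positions`) are hypothetical, and ordering entries by their (x, y) coordinates to place ants is only one possible choice; as noted above, the initial positions may also be chosen at random.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MapEntry:
    """One piece of data on the two-dimensional data map."""
    data_id: str
    x: int          # horizontal coordinate of the storage location
    y: int          # vertical coordinate of the storage location
    source: str     # where the data comes from, e.g. a service provider


def group_by_source(entries: List[MapEntry]) -> Dict[str, List[MapEntry]]:
    """Split the data to be analyzed into data sets, one per data source."""
    data_sets: Dict[str, List[MapEntry]] = defaultdict(list)
    for entry in entries:
        data_sets[entry.source].append(entry)
    return dict(data_sets)


def initial_positions(data_set: List[MapEntry], n_ants: int) -> List[MapEntry]:
    """Place ants on map entries in coordinate order, wrapping around."""
    ordered = sorted(data_set, key=lambda e: (e.x, e.y))
    return [ordered[i % len(ordered)] for i in range(n_ants)]


entries = [
    MapEntry("d1", 0, 0, "provider_1"),
    MapEntry("d2", 1, 0, "provider_2"),
    MapEntry("d3", 0, 1, "provider_1"),
]
data_sets = group_by_source(entries)          # data set A and data set B
starts = initial_positions(data_sets["provider_1"], n_ants=3)
```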
In summary, the data analysis method for mass irregular data according to the embodiment of the present invention has the following advantages:
1) The ant colony algorithm is applied to the data analysis of massive irregular data, so that optimal data can be selected from the massive irregular data and valuable data is not discarded.
2) The method is suitable for data analysis in fields such as content marketing, so that the optimal data is selected from the mass data for content marketing, and the accuracy of the content marketing can be better ensured.
3) The ant colony algorithm helps reduce the amount of data that hardware such as servers must analyze, thereby reducing hardware research and development costs.
4) The data passed through while the ants walk is verified by the correlation algorithm and the required data is screened out, which guarantees the accuracy of the final selection result of the ant colony algorithm. Moreover, the correlation algorithm is easy to implement and easy to combine with the ant colony algorithm.
5) Global pheromone updating is adopted in the ant colony algorithm, so that falling into a local optimum is avoided and the algorithm efficiency is improved.
6) A data map is established to assist in determining the initial placement position and the walking path of the ants, so that the algorithm difficulty is simplified, and the algorithm is closer to the data rule.
Fig. 3 is a schematic structural diagram of a data analysis device according to another embodiment of the present invention, which is based on the same inventive concept as the data analysis method according to the above-described embodiment. As shown in fig. 3, the data analysis device includes: a first data processing unit 310, configured to determine a dimension set of data to be analyzed, an index set corresponding to the dimension set, and a data feature limit range; a second data processing unit 320, configured to set, in the data to be analyzed, multiple data sets to which data processing is applied, according to the dimension set, the index set, and the data feature limit range of the data to be analyzed; and a third data processing unit 330, configured to perform data processing on each data set by using the ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, so as to select data with pheromones higher than a set threshold from each data set, where data with a higher pheromone has a higher correlation with preset required data.
In a preferred embodiment, the third data processing unit 330 comprises:
the initialization module 331 is configured to set initialization parameters of the ant colony algorithm, where the initialization parameters include a data number in each data set, an initial pheromone of each data, a heuristic factor, and an expectation factor, where the expectation factor includes information of the required data.
A calculating module 332, configured to enable each ant to select data according to the initialization parameter to start walking, calculate a probability that each ant transfers from current data to next data during walking according to the initial pheromone and the heuristic factor, calculate a correlation between the current data and the required data each time data transfer occurs, and update the pheromone of the current data according to a correlation calculation result.
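For illustration, the transfer probability can be sketched with the textbook ant colony transition rule, in which the weight of a candidate is its pheromone raised to a power alpha multiplied by a heuristic desirability raised to a power beta. This section does not give the exact formula used by the calculating module, so the rule itself, the names `transition_probabilities` and `choose_next`, the parameters `alpha` and `beta`, and the per-data `desirability` values are all assumptions.

```python
import random
from typing import Dict, Hashable, Iterable


def transition_probabilities(pheromone: Dict[Hashable, float],
                             desirability: Dict[Hashable, float],
                             unvisited: Iterable[Hashable],
                             alpha: float = 1.0,
                             beta: float = 2.0) -> Dict[Hashable, float]:
    """Probability of moving from the current data to each unvisited data."""
    weights = {j: (pheromone[j] ** alpha) * (desirability[j] ** beta)
               for j in unvisited}
    total = sum(weights.values())
    if total == 0:
        # No information yet: fall back to a uniform choice.
        return {j: 1.0 / len(weights) for j in weights}
    return {j: w / total for j, w in weights.items()}


def choose_next(probabilities: Dict[Hashable, float],
                rng: random.Random) -> Hashable:
    """Draw the next data for one ant according to the probabilities."""
    candidates = list(probabilities)
    weights = [probabilities[j] for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


probs = transition_probabilities(
    pheromone={"a": 1.0, "b": 3.0},
    desirability={"a": 1.0, "b": 1.0},
    unvisited=["a", "b"],
    alpha=1.0,
    beta=1.0,
)
next_data = choose_next(probs, random.Random(0))
```

Restricting the candidates to unvisited data is what makes each ant walk through all data in a data set exactly once per iteration.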
The first selecting module 333 is configured to select data with pheromones higher than the set threshold when all ants finish walking all data in each data set and finish one iteration.
In a preferred embodiment, the calculation module 332 may include: a transfer probability calculation submodule, configured to calculate the probability that each ant transfers from the current data to the next data during walking according to the initial pheromone and the heuristic factor; a correlation calculation submodule, configured to compare the dimension set and the index set corresponding to the current data with the data header of the required data, so as to obtain the percentage of the field names in the data header of the required data that are included in the dimension set and the index set of the current data, and to use the percentage as the correlation calculation result; and a pheromone calculation submodule, configured to update the pheromone of the current data by using the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein data having a higher correlation with the required data has a higher pheromone.
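The header comparison performed by the correlation calculation submodule can be sketched as a set overlap. The function name `header_correlation` is hypothetical, and two points are assumptions rather than statements of the text: the percentage is taken relative to the field names of the required data's header, and names are compared case-insensitively after trimming whitespace.

```python
from typing import Iterable


def header_correlation(dimensions: Iterable[str],
                       indexes: Iterable[str],
                       required_header: Iterable[str]) -> float:
    """Percentage of required header fields covered by the current data.

    `dimensions` and `indexes` are the field names of the current data;
    `required_header` is the data header of the required data.
    """
    # Assumption: normalise names so that "Date" and " date" match.
    required = {name.strip().lower() for name in required_header}
    if not required:
        return 0.0
    available = {name.strip().lower() for name in (*dimensions, *indexes)}
    return 100.0 * len(required & available) / len(required)


r = header_correlation(
    dimensions=["date", "region"],
    indexes=["clicks"],
    required_header=["Date", "Region", "Clicks", "Cost"],
)
```

In this example three of the four required field names are covered, so the correlation is 75%.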
In a preferred embodiment, the third data processing unit 330 further comprises: the pheromone global updating module 334 is configured to, after one iteration is completed and data with pheromones higher than the set threshold is selected, update a global pheromone table based on the pheromones of the selected data, and apply the updated global pheromone table to the next iteration; and a second selecting module 335, configured to compare data obtained from two adjacent iterations, and select data with higher pheromone between the two iterations until the predetermined number of iterations is completed, so as to select optimal data.
In a more preferred embodiment, the data analysis apparatus further includes: a data map establishing unit (not shown in the figure) for establishing a data map for the data to be analyzed. The third data processing unit 330 is further configured to determine an initial placement position and a walking path of an ant and/or determine the multiple data sets to which data processing is applied, with reference to the data map, when the ant colony algorithm is used to perform data processing on the data to be analyzed.
For details and advantages of the data analysis apparatus according to the embodiment of the present invention, reference may be made to the above-mentioned embodiments related to the data analysis method, which are not described herein again.
In other embodiments, the data analysis apparatus includes a processor and a memory, the first data processing unit, the second data processing unit, the third data processing unit, the data map establishing unit, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more than one kernel can be set, and data analysis of massive irregular data is realized by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium, on which a program is stored, where the program, when executed by a processor, implements the data analysis method for massive irregular data.
The embodiment of the invention provides a processor, which is used for running a program, wherein the data analysis method aiming at massive irregular data is executed when the program runs.
An embodiment of the present invention provides a device comprising a processor, a memory, and a program that is stored in the memory and can run on the processor. When executing the program, the processor implements the following steps: determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range; setting, in the data to be analyzed, a plurality of data sets to which data processing is applied according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and performing data processing on each data set by adopting the ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and selecting data with pheromones higher than a set threshold value from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.
When executing the program, the processor further implements the following steps: setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from current data to next data during walking according to the initial pheromone and the heuristic factors, calculating the correlation between the current data and the required data each time data transfer occurs, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, which completes one iteration, and selecting the data with the pheromone higher than the set threshold.
When executing the program, the processor further implements the following steps: comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the data header of the required data that are included in the dimension set and the index set of the current data, and taking the percentage as the correlation calculation result; and updating the pheromone of the current data by using the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein the higher the correlation of data with the required data is, the higher the pheromone corresponding to that data is.
When executing the program, the processor further implements the following steps: after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by two adjacent iterations, and selecting the data with the higher pheromone in the two iterations until the preset number of iterations is completed, so as to select the optimal data.
When executing the program, the processor further implements the following steps: establishing a data map for the data to be analyzed; and determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
An embodiment of the present invention further provides a computer program product which, when executed on a data processing apparatus, is adapted to execute a program initialized with the following method steps:
1) Determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range; setting, in the data to be analyzed, a plurality of data sets to which data processing is applied according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and performing data processing on each data set by adopting the ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and selecting data with pheromones higher than a set threshold value from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.
2) Setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, finishing one iteration, and selecting the data with the pheromone higher than the set threshold.
3) Comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the data header of the required data that are included in the dimension set and the index set of the current data, and taking the percentage as the correlation calculation result; and updating the pheromone of the current data by using the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein the higher the correlation of data with the required data is, the higher the pheromone corresponding to that data is.
4) After one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and comparing the data obtained by the two adjacent iterations, and selecting the data with higher pheromone in the two iterations until the iterations of the preset times are completed so as to select the optimal data.
5) Establishing a data map for the data to be analyzed; and determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A data analysis method, characterized in that the data analysis method comprises:
determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range;
setting, in the data to be analyzed, a plurality of data sets to which data processing is applied according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and
performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and selecting data with pheromones higher than a set threshold value from each data set, wherein the higher the pheromone of data is, the higher the correlation between that data and preset required data is.
2. The data analysis method of claim 1, wherein the data processing of each of the data sets using the ant colony algorithm comprises:
setting initialization parameters of the ant colony algorithm, wherein the initialization parameters comprise data numbers in each data set, initial pheromones of each data, heuristic factors and expectation factors, and the expectation factors comprise information of the required data; and
enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from current data to next data during walking according to the initial pheromone and the heuristic factors, calculating the correlation between the current data and the required data when data transfer occurs each time, updating the pheromone of the current data according to the correlation calculation result until all ants finish walking all data in each data set, finishing one iteration, and selecting the data with the pheromone higher than the set threshold.
3. The data analysis method according to claim 2, wherein the calculating of the correlation between the current data and the required data and the updating of the pheromone of the current data according to the correlation calculation result includes:
comparing the dimension set and the index set corresponding to the current data with the data header of the required data to obtain the percentage of the field names in the data header of the required data that are included in the dimension set and the index set of the current data, and taking the percentage as the correlation calculation result; and
updating the pheromone of the current data by utilizing the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein the higher the correlation of data with the required data is, the higher the pheromone corresponding to that data is.
4. The data analysis method of claim 2, wherein after the data processing of each data set using the ant colony algorithm, the data analysis method further comprises:
after one iteration is completed and the data with the pheromone higher than the set threshold value is selected, updating a global pheromone table based on the pheromone of the selected data, and applying the updated global pheromone table to the next iteration; and
comparing the data obtained by two adjacent iterations, and selecting the data with the higher pheromone in the two iterations until the preset number of iterations is completed, so as to select the optimal data.
5. The data analysis method of claim 2, further comprising:
establishing a data map for the data to be analyzed; and
determining the plurality of data sets to which data processing is applied with reference to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed; and/or determining the initialization parameters by referring to the data map when the ant colony algorithm is adopted to perform data processing on the data to be analyzed.
6. A data analysis device, characterized in that the data analysis device comprises:
the first data processing unit is used for determining a dimension set of data to be analyzed, an index set corresponding to the dimension set and a data feature limit range;
the second data processing unit is used for setting a plurality of data sets for applying data processing in the data to be analyzed according to the dimension set, the index set and the data feature limit range of the data to be analyzed; and
the third data processing unit is used for performing data processing on each data set by adopting an ant colony algorithm, so that each ant in the ant colony traverses all data in each data set, and data with pheromones higher than a set threshold value is selected from each data set, wherein data with a higher pheromone has a higher correlation with preset required data.
7. The data analysis device according to claim 6, wherein the third data processing unit includes:
an initialization module, configured to set initialization parameters of the ant colony algorithm, where the initialization parameters include a data number in each data set, an initial pheromone of each data, a heuristic factor, and an expectation factor, where the expectation factor includes information of the required data;
the calculation module is used for enabling each ant to select data according to the initialization parameters to start walking, calculating the probability of transferring from the current data to the next data during walking according to the initial pheromone and the heuristic factor, calculating the correlation between the current data and the required data when data transfer occurs each time, and updating the pheromone of the current data according to the correlation calculation result; and
the first selection module is used for selecting the data of which the pheromone is higher than the set threshold value when all ants have finished walking all the data in each data set, which completes one iteration.
8. The data analysis device of claim 7, wherein the calculation module comprises:
the transfer probability calculation submodule is used for calculating the probability of transferring the current data to the next data of each ant in the walking process according to the initial pheromone and the heuristic factor;
a correlation calculation submodule, configured to compare the dimension set and the index set corresponding to the current data with the data header of the required data, so as to obtain the percentage of the field names in the data header of the required data that are included in the dimension set and the index set of the current data, and to use the percentage as the correlation calculation result; and
the pheromone calculation submodule is used for updating the pheromone of the current data by utilizing the correspondence between the correlation and the pheromone according to the correlation calculation result, wherein the higher the correlation of data with the required data is, the higher the pheromone corresponding to that data is.
9. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the data analysis method of any one of claims 1 to 5.
10. A processor configured to execute a program, wherein the program, when executed, performs the data analysis method of any one of claims 1 to 5.
CN201811399481.2A 2018-11-22 2018-11-22 Data analysis method and device Active CN111209997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811399481.2A CN111209997B (en) 2018-11-22 2018-11-22 Data analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811399481.2A CN111209997B (en) 2018-11-22 2018-11-22 Data analysis method and device

Publications (2)

Publication Number Publication Date
CN111209997A true CN111209997A (en) 2020-05-29
CN111209997B CN111209997B (en) 2023-04-07

Family

ID=70789315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811399481.2A Active CN111209997B (en) 2018-11-22 2018-11-22 Data analysis method and device

Country Status (1)

Country Link
CN (1) CN111209997B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356247A (en) * 2022-03-18 2022-04-15 闪捷信息科技有限公司 Hierarchical storage scheduling method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310275A (en) * 2013-06-25 2013-09-18 北京航空航天大学 Novel codebook design method based on ant colony clustering and genetic algorithm
US20160171366A1 (en) * 2014-06-23 2016-06-16 International Business Machines Corporation Solving vehicle routing problems using evolutionary computing techniques
CN106940836A (en) * 2017-02-27 2017-07-11 北京因果树网络科技有限公司 A kind of data analysing method and device
CN107222834A (en) * 2017-06-15 2017-09-29 深圳市创艺工业技术有限公司 A kind of effective building safety monitoring system

Also Published As

Publication number Publication date
CN111209997B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN107403335A (en) System and implementation method for precision marketing based on deep user profiling
US11768893B2 (en) Concept networks and systems and methods for the creation, update and use of same in artificial intelligence systems
US10574766B2 (en) Clickstream analysis methods and systems related to determining actionable insights relating to a path to purchase
CN103593392B (en) Method and system for generating recommendations
US20190108286A1 (en) Concept networks and systems and methods for the creation, update and use of same to select images, including the selection of images corresponding to destinations in artificial intelligence systems
CN110413877A (en) Resource recommendation method, device and electronic equipment
US9390142B2 (en) Guided predictive analysis with the use of templates
US20140032264A1 (en) Data refining engine for high performance analysis system and method
US10878058B2 (en) Systems and methods for optimizing and simulating webpage ranking and traffic
US10089405B2 (en) Addressable network resource selection management
US11461332B2 (en) Methods and apparatus to search datasets
US11244332B2 (en) Segments of contacts
US11138249B1 (en) Systems and methods for the creation, update and use of concept networks to select destinations in artificial intelligence systems
CN108205775A (en) Recommendation method, apparatus and client for a business object
US20210103925A1 (en) Feature subspace isolation and disentanglement in merchant embeddings
US11295154B2 (en) Physical item optimization using velocity factors
CN107092609A (en) Information pushing method and device
CN111127074B (en) Data recommendation method
US20160063594A1 (en) Data refining engine for high performance analysis system and method
CN111209997B (en) Data analysis method and device
CN107066582A (en) Method and device for implementing virtual resource recommendation
Wu et al. Temporal bipartite projection and link prediction for online social networks
CN111967970A (en) Bank product recommendation method and device based on spark platform
CN115860865A (en) Commodity combination construction method and device, equipment, medium and product thereof
CN106886546B (en) Construction method and equipment of data website

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant