CN110955642A - Data acquisition optimization method, device and equipment and readable storage medium

Info

Publication number: CN110955642A
Application number: CN201910968760.4A
Authority: CN
Inventors: 任熊
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-04-03
Also published as: WO2021068568A1

Abstract

The invention relates to the technical field of big data and discloses a data acquisition optimization method comprising the following steps: sending a user behavior log file at regular intervals to a sub-cloud storage space of a cloud storage space used for collecting data; traversing the user behavior data in the user behavior log file of the sub-cloud storage space, and counting the storage space occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data; calculating the variance of the size of the storage space occupied by the user behavior data; judging whether the variance is greater than a preset threshold value; and if so, adjusting the data collection frequency of the sub-cloud storage space by adopting a weighted polling algorithm until the variance is less than or equal to the preset threshold value. The invention also discloses a data acquisition optimization device, equipment and a computer-readable storage medium. The data acquisition optimization method provided by the invention addresses the technical problem of low storage space utilization in the prior art and improves the utilization rate of the storage space.

Description

Data acquisition optimization method, device and equipment and readable storage medium

Technical Field

The invention relates to the technical field of big data, in particular to a data acquisition optimization method, a data acquisition optimization device, data acquisition optimization equipment and a computer-readable storage medium.

Background

At present, with the rapid development of computer technology, people have entered the information age, and information and data storage become important parts of people's daily life. The data storage capacity of enterprises and personal users is greatly increased, and storage space resources are greatly occupied. How to optimize the data acquisition process to solve the technical problem of low utilization rate of storage space resources is a problem to be urgently solved by technical personnel in the field at present.

Disclosure of Invention

The invention mainly aims to provide a data acquisition optimization method, a data acquisition optimization device, data acquisition optimization equipment and a computer-readable storage medium, and aims to solve the technical problem of low utilization rate of storage space resources.

In order to achieve the above object, the present invention provides a data acquisition optimization method, which comprises the following steps:

executing a Linux Shell script through a cron of Linux, and regularly sending a user behavior log file to a sub-cloud storage space of a cloud storage space for collecting data, wherein the cloud storage space comprises a plurality of sub-cloud storage spaces;

traversing the user behavior data in the user behavior log file of the sub-cloud storage space, and counting the storage space occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data;

calculating the variance of the size of the storage space occupied by the user behavior data through the following formula:

$$V(X) = \frac{1}{n} \sum_{t=1}^{n} \left( X(t) - \mu \right)^2$$

where μ is the mean size of the sub-cloud storage space used for storing the user behavior data X, V(X) is the variance, X(t) is the size of the storage space occupied by the user behavior data, t is the identification of the different user behavior data, and n is the number of the user behavior data;

judging whether the variance is larger than a preset threshold value or not;

if the variance is greater than the preset threshold value, adjusting the data collection frequency of the sub-cloud storage space by adopting a weighted polling algorithm until the variance is less than or equal to the preset threshold value; if the variance is less than or equal to the preset threshold value, executing the Linux Shell script through the cron of Linux and regularly sending the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data.

Optionally, after the step of executing the Linux Shell script through the cron of Linux and periodically sending the user behavior log file to the child cloud storage space of the cloud storage space for collecting data, the method further includes the following steps:

monitoring a user behavior log file in the cloud storage space in real time through a Flume plug-in, and collecting user behavior data in the user behavior log file;

and storing the user behavior data in real time through a distributed file system.

Optionally, before the step of traversing the user behavior data in the user behavior log file of the sub-cloud storage space and counting the storage space occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data, the method further includes the following steps:

setting different identifications for the user behavior log files of the sub-cloud storage space to obtain user behavior log files with the identifications set;

traversing the user behavior data in the user behavior log files with the identifications set through a binary search tree, and counting the traversed user behavior data to obtain the number of the user behavior data corresponding to each identification.

judging whether an acquisition request of user behavior data exists or not;

if the request for acquiring the user behavior data exists at present, acquiring the user behavior data from a cloud storage space, and judging whether redundant user behavior data exists in the user behavior data;

if the user behavior data has redundant user behavior data, clearing the redundant user behavior data existing in the user behavior data through a preset redundancy strategy to obtain the user behavior data after clearing the redundant user behavior data, and if the user behavior data does not have the redundant user behavior data, not processing the user behavior data.

Optionally, if there is a request for acquiring user behavior data currently, acquiring the user behavior data from a cloud storage space, and determining whether redundant user behavior data exists in the user behavior data includes the following steps:

if the request for acquiring the user behavior data exists at present, acquiring the user behavior data from the cloud storage space through a Flume plug-in and monitoring the user behavior data to obtain a monitoring result;

and comparing a preset monitoring index with the monitoring result to judge whether redundant user behavior data exist in the user behavior data, wherein the redundant user behavior data are the user behavior data exceeding the monitoring index.

Optionally, the step of, if there is redundant user behavior data in the user behavior data, clearing the redundant user behavior data through a preset redundancy policy to obtain the user behavior data with the redundant user behavior data cleared, and, if there is no redundant user behavior data in the user behavior data, not processing the user behavior data, includes the following steps:

if the redundant user behavior data exists in the user behavior data, judging whether the redundant user behavior data exists in the user behavior data of the data collection end;

if the redundant user behavior data exists in the user behavior data of the data collection end, the redundant user behavior data existing in the user behavior data is eliminated through a mean shift algorithm to obtain the user behavior data after the redundant user behavior data is eliminated, and if the redundant user behavior data does not exist in the user behavior data of the data collection end, the user behavior data is not processed.

Further, in order to achieve the above object, the present invention further provides a data acquisition optimization apparatus, including:

the sending module is used for executing a Linux Shell script through a cron of Linux and regularly sending a user behavior log file to a sub-cloud storage space of a cloud storage space for collecting data, wherein the cloud storage space comprises a plurality of sub-cloud storage spaces;

the first traversal module is used for traversing the user behavior data in the user behavior log file of the sub-cloud storage space, and counting the storage space occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data;

the calculation module is used for calculating the variance of the size of the storage space occupied by the user behavior data through the following formula;

the first judgment module is used for judging whether the variance is larger than a preset threshold value or not;

and the adjusting module is used for adjusting the data collection frequency of the sub-cloud storage space by adopting a weighted polling algorithm if the variance is greater than a preset threshold value until the variance is less than or equal to the preset threshold value, executing a Linux Shell script through a cron of Linux if the variance is less than or equal to the preset threshold value, and sending the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data at regular time.

Optionally, the data acquisition optimization apparatus further includes the following modules:

the monitoring acquisition module is used for monitoring the user behavior log file in the cloud storage space in real time through a Flume plug-in and acquiring user behavior data in the user behavior log file;

and the storage module is used for storing the user behavior data in real time through a distributed file system.

the setting module is used for setting different identifications for the user behavior log files of the sub-cloud storage space to obtain the user behavior log files with the identifications set;

and the second traversal module is used for traversing the user behavior data in the user behavior log file with the set identifier through a binary search tree, and counting the traversed user behavior data to obtain the number of the user behavior data corresponding to different identifiers.

the second judgment module is used for judging whether an acquisition request of user behavior data exists or not;

the third judgment module is used for acquiring the user behavior data from the cloud storage space if the acquisition request of the user behavior data currently exists, and judging whether redundant user behavior data exists in the user behavior data;

and the clearing module is used for clearing the redundant user behavior data in the user behavior data through a preset redundancy strategy to obtain the user behavior data after clearing the redundant user behavior data if the redundant user behavior data exists in the user behavior data, and not processing the user behavior data if the redundant user behavior data does not exist in the user behavior data of the data collection end.

Optionally, the third determining module includes the following units:

the monitoring unit is used for acquiring the user behavior data from the cloud storage space through a Flume plug-in and monitoring the user behavior data to obtain a monitoring result if the request for acquiring the user behavior data exists at present;

and the first judgment unit is used for judging whether redundant user behavior data exist in the user behavior data or not by comparing a preset monitoring index with the monitoring result, wherein the redundant user behavior data are the user behavior data exceeding the monitoring index.

Optionally, the clearing module comprises the following units:

the second judgment unit is used for judging whether redundant user behavior data exist in the user behavior data of the data collection end or not if the redundant user behavior data exist in the user behavior data;

and the clearing unit is used for clearing the redundant user behavior data in the user behavior data through a mean shift algorithm if the redundant user behavior data exists in the user behavior data of the data collection end to obtain the user behavior data after clearing the redundant user behavior data, and not processing the user behavior data if the redundant user behavior data does not exist in the user behavior data of the data collection end.

Further, in order to achieve the above object, the present invention also provides data acquisition optimization equipment, which includes a memory, a processor, and a data acquisition optimization program stored in the memory and executable on the processor, wherein the data acquisition optimization program, when executed by the processor, implements the steps of the data acquisition optimization method according to any one of the above items.

Further, to achieve the above object, the present invention also provides a computer-readable storage medium on which a data acquisition optimization program is stored, wherein the data acquisition optimization program, when executed by a processor, implements the steps of the data acquisition optimization method according to any one of the above items.

The invention has the beneficial effects that: the invention aims to solve the technical problem of low utilization rate of storage space in the prior art. A data acquisition optimization method is provided. The realization process of the invention is as follows: the data collection frequency of the sub-cloud storage spaces is adjusted by adopting a weighted polling algorithm, so that load balance among the sub-cloud storage spaces is realized, the storage space resources are saved, redundant data are eliminated by adopting a mean shift algorithm, the redundant data are prevented from occupying the storage resources, and the utilization rate of the storage space is improved.

Drawings

FIG. 1 is a schematic structural diagram of an operating environment of a data acquisition optimization device according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of the data acquisition optimization method of the present invention;

FIG. 3 is a schematic flow chart of a data acquisition optimization method according to a second embodiment of the present invention;

FIG. 4 is a schematic flow chart of a data acquisition optimization method according to a third embodiment of the present invention;

FIG. 5 is a detailed flowchart of step S80 in FIG. 4;

FIG. 6 is a detailed flowchart of step S90 in FIG. 4;

FIG. 7 is a schematic flow chart of a data acquisition optimization method according to a fourth embodiment of the present invention;

FIG. 8 is a functional block diagram of a first embodiment of the data collection optimization device of the present invention;

FIG. 9 is a schematic functional block diagram of a data acquisition optimizing apparatus according to a second embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a data acquisition optimization device.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a data acquisition optimization device according to an embodiment of the present invention.

As shown in fig. 1, the data acquisition optimization device includes: a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not limit the data acquisition optimization device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.

As shown in fig. 1, the memory 1005, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a data acquisition optimization program. The operating system is a program for managing and controlling the hardware and software resources of the data acquisition optimization device, and supports the operation of the data acquisition optimization program and other software and/or programs.

In the hardware structure of the data acquisition optimization device shown in fig. 1, the network interface 1004 is mainly used for accessing a network, and the user interface 1003 is mainly used for detecting a confirmation instruction, an editing instruction, and the like. The processor 1001 may be configured to invoke the data acquisition optimization program stored in the memory 1005 and perform the operations of the various embodiments of the data acquisition optimization method below.

Based on the hardware structure of the data acquisition optimization equipment, the data acquisition optimization method provided by the invention has various embodiments.

Referring to fig. 2, fig. 2 is a schematic flow chart of a data acquisition optimization method according to a first embodiment of the present invention. In this embodiment, the data acquisition optimization method includes the following steps:

step S10, executing a Linux Shell script through a cron of Linux, and regularly sending a user behavior log file to a sub-cloud storage space of a cloud storage space for collecting data, wherein the cloud storage space comprises a plurality of sub-cloud storage spaces;

in this embodiment, multiple commands and control statements are set in the Shell script, such as the conditional control statements if and else and the loop control statements for and select. The commands built into a Shell script are executed in a single run, and no information is continuously returned to the user; this characteristic of the Shell script is used to send the user behavior log file at regular intervals to the sub-cloud storage space of the cloud storage space for data collection. Before this, the cloud storage space needs to be divided into a plurality of sub-cloud storage spaces for storing different user behavior log files.

Step S20, traversing the user behavior data in the user behavior log file of the sub-cloud storage space, and counting the storage space occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data;

in this embodiment, in order to fully utilize the storage resources of each sub-cloud storage space and avoid wasting resources, the user behavior data in the user behavior log file sent to the sub-cloud storage space is obtained in real time. The user behavior data in the user behavior log file may be traversed, and the storage space occupied by the traversed user behavior data summed, to obtain the size of the storage space occupied by the user behavior data.
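The counting in step S20 can be sketched as follows. This is only a minimal illustration, assuming that each non-empty line of a log file is one piece of user behavior data and that its storage size is its UTF-8 byte length; the function names `record_sizes` and `occupied_size` are hypothetical and do not come from the original disclosure.

```python
def record_sizes(lines):
    """Return the byte size of every non-empty record in an iterable of text lines.

    Assumption: one line equals one piece of user behavior data.
    """
    sizes = []
    for line in lines:
        record = line.rstrip("\r\n")
        if record:
            sizes.append(len(record.encode("utf-8")))
    return sizes


def occupied_size(lines):
    """Total storage space, in bytes, occupied by the traversed records."""
    return sum(record_sizes(lines))


if __name__ == "__main__":
    sample = ["click\n", "view page\n", "\n", "buy\n"]
    print(record_sizes(sample))
    print(occupied_size(sample))
```

For a real log file, the same functions could be fed an open file handle, for example `with open(path, encoding="utf-8") as fh: occupied_size(fh)`, where `path` is a placeholder.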

Step S30, calculating the variance of the size of the storage space occupied by the user behavior data through the following formula;

in this embodiment, the variance of the size of the storage space occupied by the user behavior data is calculated by the following formula:

$$V(X) = \frac{1}{n} \sum_{t=1}^{n} \left( X(t) - \mu \right)^2$$

The variance describes the degree to which a variable deviates from its mean, where μ is the mean size of the sub-cloud storage space used for storing the user behavior data X, V(X) is the variance, X(t) is the size of the storage space occupied by the user behavior data, t is the identification of the different user behavior data, and n is the number of the user behavior data. Variances of different magnitudes can be obtained from this formula: if the variance is large, the storage space occupied by the user behavior data differs greatly from the mean size of the sub-cloud storage space; if the variance is zero, the storage space occupied by the user behavior data exactly matches the mean size of the sub-cloud storage space.
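As a worked illustration of steps S30 and S40, the following sketch computes the variance and compares it with a threshold. It assumes the population form of the variance (dividing by n), which matches the symbols μ, X(t) and n defined above; the function names and the sample sizes are hypothetical.

```python
def variance(sizes):
    """Population variance V(X) = (1/n) * sum((X(t) - mu)^2)."""
    n = len(sizes)
    if n == 0:
        raise ValueError("at least one size is required")
    mu = sum(sizes) / n
    return sum((x - mu) ** 2 for x in sizes) / n


def needs_adjustment(sizes, threshold):
    """Step S40: True when the variance is greater than the preset threshold."""
    return variance(sizes) > threshold


if __name__ == "__main__":
    sizes = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative storage sizes
    print(variance(sizes))
    print(needs_adjustment(sizes, threshold=3.0))
```

With the sample sizes above the mean is 5 and the squared deviations sum to 32, so the variance is 32 / 8 = 4.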

Step S40, judging whether the variance is larger than a preset threshold value;

in this embodiment, the preset threshold refers to a value set according to the amount of user behavior data that the sub-cloud storage space can store. It is set to prevent a situation in which some sub-cloud storage spaces hold a large amount of data while others hold only a small amount.

Step S50, if the variance is larger than a preset threshold, adjusting the data collection frequency of the sub-cloud storage space by adopting a weighted polling algorithm until the variance is smaller than or equal to the preset threshold, if the variance is smaller than or equal to the preset threshold, executing a Linux Shell script through a cron of Linux, and sending the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data at regular time, wherein the cloud storage space comprises a plurality of sub-cloud storage spaces.

In this embodiment, if the variance is greater than the preset threshold, the amount of user behavior data deviates too far from the mean size of the sub-cloud storage spaces, and if the acquisition process is not adjusted, the load may become unbalanced. Therefore, if the variance is greater than the preset threshold, a weighted polling algorithm is adopted to adjust the data collection frequency of the sub-cloud storage space. The difference D = X(t) - μ may be greater than zero, less than zero, or equal to zero; only the first two cases need to be considered in this embodiment. If D is greater than zero, the amount of user behavior data is greater than the mean size of the sub-cloud storage spaces; if D is less than zero, the amount of user behavior data is smaller than the mean size of the sub-cloud storage spaces.
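The sign test on the difference D can be sketched as follows. This is an illustrative assumption about how the three cases might be labelled; the function name `classify_spaces` and the label strings are hypothetical.

```python
def classify_spaces(sizes):
    """Label each sub-cloud storage space by the sign of D = X(t) - mu."""
    if not sizes:
        raise ValueError("at least one size is required")
    mu = sum(sizes) / len(sizes)
    labels = []
    for size in sizes:
        d = size - mu
        if d > 0:
            labels.append("above mean")   # holds more data than the mean
        elif d < 0:
            labels.append("below mean")   # holds less data than the mean
        else:
            labels.append("at mean")      # case not considered further
    return labels


if __name__ == "__main__":
    print(classify_spaces([10, 20, 30]))
```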

If there are N sub-cloud storage spaces in the cloud storage space, $S = \{S_1, S_2, \ldots, S_N\}$, the initial weights of the sub-cloud storage spaces are $W = \{W_1, W_2, \ldots, W_N\}$ and the current effective weights of the sub-cloud storage spaces are $CW = \{CW_1, CW_2, \ldots, CW_N\}$. For example, before any data is stored, the first and second sub-cloud storage spaces have the same initial weight W, that is, $W_{first} = W_{second}$. When the amount of user behavior data stored in the first sub-cloud storage space is larger than that stored in the second sub-cloud storage space, $W_{first}$ is adjusted upward to obtain $CW_{first}$ and $W_{second}$ is adjusted downward to obtain $CW_{second}$, so that $CW_{first}$ is greater than $CW_{second}$.

In addition to its initial weight $W_i$, each sub-cloud storage space i has a current effective weight $CW_i$, which is initialized to $W_i$. The mean M of the initial weights of all the sub-cloud storage spaces is obtained by the following formula:

$$M = \frac{1}{N} \sum_{i=1}^{N} W_i$$

The difference P between the current effective weight of sub-cloud storage space i and the mean M is then obtained by the formula $P = CW_i - M$, and weights are set for all sub-cloud storage spaces in the cloud storage space according to this difference. If the difference P is large and the difference D is less than zero, the acquisition frequency is adjusted by a large amount; if the difference P is small and the difference D is greater than zero, the acquisition frequency is adjusted by a small amount. The sub-cloud storage spaces with their different weights are arranged in a queue, and when there is an instruction to send the user behavior log file to a sub-cloud storage space of the cloud storage space for data collection, the weighted sub-cloud storage spaces are used in queue order.
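The disclosure does not spell out the exact scheduling rule, so the sketch below uses smooth weighted round-robin, one widely used form of weighted polling, as a stand-in. It assumes that weights have already been assigned to the sub-cloud storage spaces; the class name `SmoothWeightedPoller` and the space names `s1`, `s2`, `s3` are hypothetical.

```python
class SmoothWeightedPoller:
    """Smooth weighted round-robin over named sub-cloud storage spaces."""

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive numbers")
        self.weights = dict(weights)                  # initial weights W_i
        self.current = {name: 0 for name in weights}  # running weights, start at 0
        self.total = sum(weights.values())

    def next(self):
        """Return the name of the space that should receive the next log file."""
        # Raise every running weight by its initial weight.
        for name, weight in self.weights.items():
            self.current[name] += weight
        # Pick the space with the largest running weight (first one wins ties).
        chosen = max(self.current, key=self.current.get)
        # Lower the chosen space by the total so the others get their turn.
        self.current[chosen] -= self.total
        return chosen


if __name__ == "__main__":
    poller = SmoothWeightedPoller({"s1": 5, "s2": 1, "s3": 1})
    print([poller.next() for _ in range(7)])
```

Over one full cycle each space is chosen in proportion to its weight, and the selections of the heaviest space are spread out rather than issued back to back, which is one way to realize a per-space collection frequency.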

Referring to fig. 3, fig. 3 is a schematic flow chart of a data acquisition optimization method according to a second embodiment of the present invention. In this embodiment, after step S10 in fig. 2, the method further includes the following steps:

step S60, monitoring a user behavior log file in the cloud storage space in real time through a Flume plug-in, and collecting user behavior data in the user behavior log file;

in this embodiment, the user behavior log file in the cloud storage space is monitored in real time through the Flume plug-in, and the user behavior data in the user behavior log file is collected. The monitoring modes of the Flume plug-in are http and ganglia: http monitoring obtains monitoring data in JSON format through a single http address, while ganglia monitoring displays the data in a graphical interface after obtaining it, which makes the monitoring relatively intuitive.
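To illustrate what the JSON returned by http monitoring might be used for, the sketch below parses a payload and reports how full each channel is. The payload shape (component names mapping to string-valued metrics such as `ChannelSize` and `ChannelCapacity`) follows Flume's JSON reporting as far as it is known here; the sample payload, the component name `c1`, and the function name `channel_fill_ratios` are illustrative assumptions, not part of the original disclosure.

```python
import json

# Illustrative payload only; a real agent reports many more metrics.
SAMPLE_PAYLOAD = (
    '{"CHANNEL.c1": {"Type": "CHANNEL", "ChannelSize": "120", '
    '"ChannelCapacity": "1000"}, '
    '"SOURCE.r1": {"Type": "SOURCE", "EventReceivedCount": "500"}}'
)


def channel_fill_ratios(payload):
    """Return {component: ChannelSize / ChannelCapacity} for channel components."""
    metrics = json.loads(payload)
    ratios = {}
    for component, values in metrics.items():
        if values.get("Type") != "CHANNEL":
            continue  # skip sources and sinks
        capacity = int(values["ChannelCapacity"])
        if capacity > 0:
            ratios[component] = int(values["ChannelSize"]) / capacity
    return ratios


if __name__ == "__main__":
    print(channel_fill_ratios(SAMPLE_PAYLOAD))
```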

And step S70, storing the user behavior data in real time through a distributed file system.

In the embodiment, the Linux Shell script is executed through the cron of Linux, and the user behavior log file is sent to the sub-cloud storage space of the cloud storage space for data collection at regular time.

Referring to fig. 4, fig. 4 is a schematic flow chart of a data acquisition optimization method according to a third embodiment of the present invention. In this embodiment, after step S10 in fig. 2, the method further includes the following steps:

step S80, judging whether there is a request for acquiring user behavior data;

in this embodiment, only when there is a request for acquiring user behavior data, the data is acquired, and therefore it is necessary to determine whether there is a request for acquiring user behavior data.

Step S90, if there is a request for acquiring user behavior data, acquiring user behavior data from a cloud storage space, and judging whether redundant user behavior data exists in the user behavior data, if not, not processing;

in this embodiment, if there is a request for acquiring user behavior data, the user behavior data is acquired from the cloud storage space, and it is determined whether redundant user behavior data exists in the user behavior data. The same user behavior data may be acquired repeatedly when user behavior data is acquired from the cloud storage space. If the acquired data set contains a large amount of repeated user behavior data, that data occupies a large amount of storage space and may affect user experience, so it is necessary to determine whether redundant user behavior data exists in the user behavior data.

Step S100, if the redundant user behavior data exists in the user behavior data, clearing the redundant user behavior data existing in the user behavior data through a preset redundancy strategy to obtain the user behavior data with the redundant user behavior data cleared, and if the redundant user behavior data does not exist in the user behavior data of the data collection end, not processing the user behavior data.

In this embodiment, if there is redundant user behavior data in the user behavior data, the redundant user behavior data existing in the user behavior data is cleared by using a preset redundancy policy, so as to obtain the user behavior data from which the redundant user behavior data is cleared, and the purpose of processing the data with redundancy is to prevent a large amount of repeated data from entering a data storage space. The preset redundancy strategy refers to processing data with redundancy through a preset algorithm, for example, the preset algorithm may be a mean shift algorithm.

Referring to fig. 5, fig. 5 is a detailed flowchart of step S80 in fig. 4. In this embodiment, step S80 includes the following steps:

step S801, if a request for acquiring user behavior data exists currently, acquiring the user behavior data from the cloud storage space through a Flume plug-in and monitoring the user behavior data to obtain a monitoring result, and if not, performing no processing;

in this embodiment, if there is a request for acquiring user behavior data currently, the user behavior data is collected from the cloud storage space through the Flume plug-in, and the user behavior data is monitored in real time through the Flume plug-in to obtain a monitoring result. The monitoring modes of the Flume plug-in are http and ganglia: for example, http monitoring obtains monitoring data in JSON format by accessing an http address, and ganglia monitoring shows the monitoring result in a graphical interface after obtaining the data.

Step S802, comparing a preset monitoring index with the monitoring result, and judging whether redundant user behavior data exists in the user behavior data, wherein the redundant user behavior data is the user behavior data exceeding the monitoring index.

In this embodiment, the preset monitoring index is an index set in advance for evaluating whether redundant data exists in the data, for example, the same data appearing ten times. The user behavior data is monitored in real time through the Flume plug-in to obtain a monitoring result, and if the same data appears ten times in the monitoring result, the monitoring index is exceeded. Whether redundant user behavior data exists in the user behavior data is judged according to the monitoring result: if the same data appears repeatedly in this way, redundant user behavior data exists in the user behavior data.
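The comparison against the monitoring index can be sketched as a simple repetition count. This is a minimal illustration, assuming that records are compared for exact equality and that the index is reached once a record has appeared ten times, as in the example above; the function name `redundant_records` is hypothetical.

```python
from collections import Counter


def redundant_records(records, monitoring_index=10):
    """Return {record: count} for records seen at least `monitoring_index` times."""
    counts = Counter(records)
    return {rec: n for rec, n in counts.items() if n >= monitoring_index}


if __name__ == "__main__":
    observed = ["login"] * 10 + ["logout"] * 3
    print(redundant_records(observed))
```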

Referring to fig. 6, fig. 6 is a detailed flowchart of step S90 in fig. 4. In this embodiment, step S90 includes the following steps:

step S901, if there is redundant user behavior data in the user behavior data, determining whether there is redundant user behavior data in the user behavior data of the data collection end;

in this embodiment, if there is redundant user behavior data, it is determined whether there is redundant user behavior data in the user behavior data of the data collection end, where the redundant user behavior data in the user behavior data of the data collection end indicates that the same user behavior data is repeatedly collected by the data collection end. In this embodiment, only whether redundant user behavior data exists for the data collection end is determined, and if the same user behavior data is repeatedly collected by the data collection end, it is indicated that the redundant user behavior data exists for the user behavior data of the data collection end.

Step S902, if the redundant user behavior data exists in the user behavior data of the data collection end, the redundant user behavior data existing in the user behavior data is eliminated through a mean shift algorithm to obtain the user behavior data after the redundant user behavior data is eliminated, and if the redundant user behavior data does not exist in the user behavior data of the data collection end, the user behavior data is not processed.

In this embodiment, if the data collection end collects repeated user behavior data, redundant user behavior data exists in the user behavior data collected by the data collection end. For a set D of K user behavior data in a given N-dimensional space, the first formula (the mean shift vector) may be:

M_h(x) = \frac{1}{k} \sum_{x_i \in S_h} (x_i - x)

acting on arbitrary user behavior data x in the space, where S_h is the high-dimensional sphere region of radius h centered on x, k is the number of user behavior data within S_h, and x_i is user behavior data within S_h. The center point may be moved to the shifted mean position using a second formula:

x^{t+1} = M^t + x^t

where M^t is the shift mean in state t and x^t is the center in state t. The high-dimensional sphere region S_h is shifted through the data space by the second formula to judge whether redundant user behavior data exists in the current high-dimensional sphere region; if so, the shift mean M^t is adjusted until no redundant user behavior data remains in the current high-dimensional sphere region. To clean up redundant user behavior data with the mean shift algorithm, the user behavior data needs to be converted into feature vectors before these steps. Through these steps, the non-redundant data is mapped into the high-dimensional sphere region and the redundant data is excluded, so that the redundant user behavior data in the user behavior data is cleared.
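A minimal Python sketch of the two mean shift steps is given below. It assumes a flat kernel (the shift is the average offset to the points inside the sphere of radius h) and that each record has already been converted to a numeric feature vector. The final deduplication rule, keeping one record per converged center, is an assumption made for illustration and is not stated in the text; all function names are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean_shift_vector(x, data, h):
    """Average offset (x_i - x) over the points x_i inside the sphere S_h."""
    neighbours = [p for p in data if distance(p, x) <= h]
    k = len(neighbours)
    if k == 0:
        return [0.0] * len(x)
    return [sum(p[d] - x[d] for p in neighbours) / k for d in range(len(x))]


def shift_to_mode(x, data, h, tol=1e-6, max_iter=100):
    """Repeat x^{t+1} = M^t + x^t until the shift becomes negligible."""
    x = list(x)
    for _ in range(max_iter):
        m = mean_shift_vector(x, data, h)
        x = [xi + mi for xi, mi in zip(x, m)]
        if math.sqrt(sum(mi * mi for mi in m)) < tol:
            break
    return x


def drop_redundant(data, h):
    """Assumed cleanup rule: keep the first record of each converged center."""
    kept, modes = [], []
    for point in data:
        mode = shift_to_mode(point, data, h)
        if all(distance(mode, seen) > h / 2 for seen in modes):
            modes.append(mode)
            kept.append(point)
    return kept
```

The radius h controls how close two feature vectors must be before they are treated as the same record, so it would need to be tuned to the chosen feature encoding.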

Referring to fig. 7, fig. 7 is a schematic flow chart of a data acquisition optimization method according to a fourth embodiment of the present invention. In this embodiment, before step S20 in fig. 2, the method further includes the following steps:

step S110, setting different identifications for the user behavior log files of the child cloud storage space to obtain user behavior log files with the identifications set;

in this embodiment, since the user behavior log files differ from one another, a different identifier is set for each user behavior log file so that the amount of user behavior data in each file can be calculated separately. The identifiers make the log files easier to manage; for example, it becomes possible to count only the user behavior data in the log file identified as "a".

Step S120, traversing the user behavior data in the user behavior log file after the setting of the identification through a binary search tree, and counting the traversed user behavior data to obtain the number of the user behavior data corresponding to different identifications.

In this embodiment, the user behavior data of the user behavior log files with different identifiers is searched through binary search trees, and the number of user behavior data follows from the number of nodes in each binary search tree according to the formula: M = N × K, where M is the number of user behavior data, N is the number of user behavior data that a single binary search tree can hold, K is the number of binary search trees, and N comprises only the number X of root nodes, the number Y of left subtrees and the number Z of right subtrees, that is, N = X + Y + Z. When data is queried, pre-order traversal, in-order traversal or post-order traversal may be adopted; whichever is adopted, the user behavior data in the user behavior log file is traversed in sequence. For example, in pre-order traversal the order is root node, left subtree, right subtree, and the traversal steps are as follows:

The first step: judging whether user behavior data exists at the current root node; if yes, judging whether user behavior data exists in the current left subtree; if not, obtaining the number of user behavior data: M = X × K, where M is the number of user behavior data and X is the number of root nodes.

The second step: if user behavior data exists in the current left subtree, obtaining the number of user behavior data: M = (X + Y) × K, where Y is the number of left subtrees; if user behavior data does not exist in the current left subtree, obtaining the number of user behavior data: M = X × K.

The third step: if user behavior data exists in the current left subtree, judging whether user behavior data exists in the current right subtree; if user behavior data exists in the current right subtree, obtaining the number of user behavior data: M = (X + Y + Z) × K, where Z is the number of right subtrees; if user behavior data does not exist in the current right subtree, obtaining the number of user behavior data: M = (X + Y) × K.

The number of the user behavior data in the user behavior log file with different identification marks is obtained through the method.
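The traversal and counting can be illustrated with the Python sketch below. It builds one binary search tree per identifier and counts the records by walking the tree in pre-order (root, left subtree, right subtree). Counting visited nodes directly, instead of applying the formula M = (X + Y + Z) × K, is a simplification for illustration; the `Node` class, the numeric keys and the function names are hypothetical.

```python
class Node:
    """One binary search tree node holding a single user behavior record."""

    def __init__(self, key, record):
        self.key = key
        self.record = record
        self.left = None
        self.right = None


def insert(root, key, record):
    """Insert a record into the tree rooted at `root` and return the root."""
    if root is None:
        return Node(key, record)
    if key < root.key:
        root.left = insert(root.left, key, record)
    else:
        root.right = insert(root.right, key, record)
    return root


def preorder(root):
    """Yield records in pre-order: root, then left subtree, then right subtree."""
    if root is not None:
        yield root.record
        yield from preorder(root.left)
        yield from preorder(root.right)


def count_by_identifier(logs):
    """Map each log file identifier to the number of records traversed.

    `logs` has the form {identifier: [(key, record), ...]}.
    """
    counts = {}
    for identifier, entries in logs.items():
        root = None
        for key, record in entries:
            root = insert(root, key, record)
        counts[identifier] = sum(1 for _ in preorder(root))
    return counts
```

To count only the log file identified as "a", a caller would read `count_by_identifier(logs)["a"]`.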

Referring to fig. 8, fig. 8 is a functional module schematic diagram of the first embodiment of the data acquisition optimization device of the present invention. In this embodiment, the data acquisition optimizing apparatus includes:

the sending module 10 is configured to execute a Linux Shell script through a cron of Linux, and send a user behavior log file to a sub-cloud storage space of a cloud storage space for collecting data at regular time, where the cloud storage space includes a plurality of sub-cloud storage spaces;

the traversal module 20 is configured to traverse the user behavior data in the user behavior log file of the child cloud storage space, and count a storage space that can be occupied by the traversed user behavior data to obtain a size of the storage space occupied by the user behavior data;

a calculating module 30, configured to calculate the variance of the size of the storage space occupied by the user behavior data according to the formula V(X) = \frac{1}{n} \sum_{t=1}^{n} (X(t) - \mu)^2;

the judging module 40 is used for judging whether the variance is larger than a preset threshold value;

and the adjusting module 50 is configured to adjust the data collection frequency of the sub-cloud storage space by using a weighted polling algorithm if the variance is greater than a preset threshold value until the variance is less than or equal to the preset threshold value, execute a Linux Shell script through a cron of Linux if the variance is less than or equal to the preset threshold value, and periodically send the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data.
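The work of the calculating module and the judging module can be sketched in Python as follows. The sketch assumes the population form of the variance (division by n) and takes the occupied sizes as a plain list of numbers in any consistent unit; the function names and the threshold value are hypothetical.

```python
def storage_variance(sizes):
    """Population variance of the storage sizes occupied by user behavior data."""
    n = len(sizes)
    mu = sum(sizes) / n  # mean occupied size
    return sum((x - mu) ** 2 for x in sizes) / n


def needs_rebalancing(sizes, threshold):
    """True when the variance is greater than the preset threshold."""
    return storage_variance(sizes) > threshold
```

When `needs_rebalancing` returns True, the adjusting module would change the collection frequency; otherwise the scheduled sending continues unchanged.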

In this embodiment, the sending module 10 is configured to execute a Linux Shell script through the cron of Linux and periodically send a user behavior log file to a sub-cloud storage space of a cloud storage space for collecting data, where the cloud storage space includes a plurality of sub-cloud storage spaces; the traversal module 20 is configured to traverse the user behavior data in the user behavior log file of the sub-cloud storage space and count the storage space occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data; the calculating module 30 is configured to calculate the variance of the size of the storage space occupied by the user behavior data according to the following formula:

V(X) = \frac{1}{n} \sum_{t=1}^{n} (X(t) - \mu)^2

where \mu is the mean size of the sub-cloud storage space occupied by the user behavior data X, V(X) is the variance, X(t) is the size of the storage space occupied by the user behavior data, t is the identifier of the different user behavior data, and n is the number of user behavior data; the judging module 40 is configured to judge whether the variance is greater than a preset threshold; the adjusting module 50 is configured to adjust the data collection frequency of the sub-cloud storage space by using a weighted polling algorithm if the variance is greater than the preset threshold, until the variance is less than or equal to the preset threshold, and, if the variance is less than or equal to the preset threshold, to execute a Linux Shell script through the cron of Linux and periodically send the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data. The device adjusts the data collection frequency of the sub-cloud storage spaces through the weighted polling algorithm in the adjusting module, so that load balance among the sub-cloud storage spaces is achieved, storage space resources are saved, and the utilization rate of the storage space is improved.
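One way to picture the weighted polling (weighted round-robin) adjustment is the Python sketch below. It is an assumption-laden illustration, not the method as claimed: the rule that emptier sub-cloud storage spaces receive larger weights, the smooth weighted round-robin scheduling variant, and the names `weights_from_usage` and `weighted_round_robin` are all choices made here for the example.

```python
def weights_from_usage(usage):
    """Assumed rule: the less space a sub-cloud uses, the larger its weight."""
    largest = max(usage.values())
    return {name: largest - used + 1 for name, used in usage.items()}


def weighted_round_robin(weights, rounds):
    """Smooth weighted round-robin: names are chosen in proportion to weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    schedule = []
    for _ in range(rounds):
        # Every candidate gains its own weight each round.
        for name, weight in weights.items():
            current[name] += weight
        # The candidate with the highest running score receives the next
        # collection slot and is pushed back by the total weight.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        schedule.append(chosen)
    return schedule
```

The returned schedule lists which sub-cloud storage space receives each upcoming collection slot, so a space with a larger weight is sent log files more often until the variance falls back to the threshold.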

Referring to fig. 9, fig. 9 is a functional module schematic diagram of a data acquisition optimization device according to a second embodiment of the present invention. In this embodiment, the data acquisition optimizing apparatus includes:

the adjusting module 50 is configured to adjust the data collection frequency of the sub-cloud storage space by using a weighted polling algorithm if the variance is greater than a preset threshold value until the variance is less than or equal to the preset threshold value, execute a Linux Shell script through a cron of Linux if the variance is less than or equal to the preset threshold value, and periodically send the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data;

the monitoring acquisition module 60 is used for monitoring the user behavior log file in the cloud storage space in real time through a Flume plug-in and acquiring user behavior data in the user behavior log file;

a storage module 70, configured to store the user behavior data in real time through a distributed file system.

The invention also provides a computer readable storage medium.

In this embodiment, the computer readable storage medium has a data acquisition optimization program stored thereon, and the data acquisition optimization program, when executed by a processor, implements the steps of the data acquisition optimization method as described in any one of the above embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The present invention has been described in connection with the accompanying drawings, but it is not limited to the above embodiments, which are illustrative rather than restrictive. Those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all equivalent changes made on the basis of the specification and drawings are intended to be embraced therein.

Claims

1. A data acquisition optimization method is characterized by comprising the following steps:

judging whether the variance is larger than a preset threshold value or not;

if the variance is greater than the preset threshold, adjusting the data collection frequency of the sub-cloud storage space by adopting a weighted polling algorithm until the variance is less than or equal to the preset threshold; otherwise, executing a Linux Shell script through a cron of Linux, and periodically sending the user behavior log file to the sub-cloud storage space of the cloud storage space for collecting data.

2. The data collection optimization method of claim 1, wherein after the step of executing a Linux Shell script through a cron of Linux and periodically sending the user behavior log file to a child cloud storage space of a cloud storage space for collecting data, further comprising the steps of:

and storing the user behavior data in real time through a distributed file system (HDFS).

3. The data collection optimization method of claim 1, wherein before the step of traversing the user behavior data in the user behavior log file of the child cloud storage space, and counting the storage space that can be occupied by the traversed user behavior data to obtain the size of the storage space that is occupied by the user behavior data, the method further comprises the following steps:

4. The data collection optimization method of claim 1, wherein after the step of executing a Linux Shell script through a cron of Linux and periodically sending the user behavior log file to a child cloud storage space of a cloud storage space for collecting data, further comprising the steps of:

judging whether an acquisition request of user behavior data exists or not;

and if the redundant user behavior data exists in the user behavior data, clearing the redundant user behavior data existing in the user behavior data through a preset redundancy strategy to obtain the user behavior data with the redundant user behavior data cleared.

5. The data acquisition optimization method of claim 4, wherein if there is a request for acquiring user behavior data currently, acquiring the user behavior data from a cloud storage space, and determining whether there is redundant user behavior data in the user behavior data comprises the following steps:

6. The data acquisition optimization method of claim 4, wherein if the user behavior data includes redundant user behavior data, clearing the redundant user behavior data included in the user behavior data by using a preset redundancy policy to obtain the user behavior data with the redundant user behavior data cleared includes the following steps:

and if redundant user behavior data exists in the user behavior data of the data collection end, clearing the redundant user behavior data existing in the user behavior data through a mean shift algorithm to obtain the user behavior data with the redundant user behavior data cleared.

7. A data collection optimization device, comprising:

the traversal module is used for traversing the user behavior data in the user behavior log file of the child cloud storage space, and counting the storage space which can be occupied by the traversed user behavior data to obtain the size of the storage space occupied by the user behavior data;

the judging module is used for judging whether the variance is larger than a preset threshold value or not;

8. The data acquisition optimization device of claim 7, further comprising the following modules:

9. A data acquisition optimization device, characterized in that the data acquisition optimization device comprises a memory, a processor and a data acquisition optimization program stored on the memory and executable on the processor, which data acquisition optimization program, when executed by the processor, implements the steps of the data acquisition optimization method according to any one of claims 1 to 6.

10. A computer-readable storage medium, having stored thereon a data acquisition optimization program which, when executed by a processor, implements the steps of the data acquisition optimization method of any one of claims 1-6.