CN108984700B - Data processing method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN108984700B
Authority
CN
China
Prior art keywords
sampling
data
function
processed
sampling function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810729988.3A
Other languages
Chinese (zh)
Other versions
CN108984700A (en)
Inventor
王炼
吕远方
卢力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810729988.3A
Publication of CN108984700A
Application granted
Publication of CN108984700B

Landscapes

  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data processing method and apparatus, a computer device, and a storage medium are provided. The data processing method includes: acquiring data to be processed; acquiring a sampling expression corresponding to the data to be processed; analyzing the sampling expression to obtain a sampling function and a sampling parameter value corresponding to the sampling function; and sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine sampled data. The method can improve sampling efficiency.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology and mobile internet technology, the variety of applications is increasing, and convenience is provided for users. For example, a mobile phone application download platform provides convenience for a user to download applications. In the development process of various applications, in order to ensure normal and stable operation of the applications, various functions of the applications need to be tested, and in the process, developers need to use data for testing. In order to avoid the influence of a large amount of data on the testing efficiency, the data needs to be sampled, so that the data volume in the testing process is reduced, and the testing efficiency is improved.
In the sampling process, the data to be processed is read by data input code. Currently, sampling logic (which can be understood as sampling code) is added to the data input code, that is, the data input code is modified, and the data read by the input code is sampled by that sampling logic to obtain sampled data.
However, with this approach, every sampling run requires writing sampling logic, modifying the data input code, and then sampling through the sampling logic, which makes sampling inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a data processing method and apparatus, a computer device, and a storage medium to solve the problem of inefficient sampling described above.
A method of data processing comprising the steps of:
acquiring data to be processed;
acquiring a sampling expression;
analyzing the sampling expression to obtain a sampling function and a sampling parameter value corresponding to the sampling function;
and sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine sampling data.
A data processing apparatus comprising:
the data acquisition module is used for acquiring data to be processed;
the expression acquisition module is used for acquiring a sampling expression;
the analysis module is used for analyzing the sampling expression to obtain a sampling function and a sampling parameter value corresponding to the sampling function;
and the sampling module is used for sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine sampling data.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
With the above data processing method and apparatus, computer device, and storage medium, the data to be processed is sampled by acquiring a sampling expression and applying the sampling function in that expression together with its corresponding sampling parameter value. There is no need to rewrite sampling logic or modify the data input code for each sampling run; acquiring the sampling expression and using the corresponding sampling function and sampling parameter value is sufficient. This simplifies the sampling steps and improves sampling efficiency.
Drawings
FIG. 1 is a diagram of an application environment of a data processing method according to one embodiment;
FIG. 2 is a flow diagram illustrating a data processing method according to one embodiment;
FIG. 3 is a prior art sampling schematic;
FIG. 4 is a sampling schematic diagram corresponding to a data processing method of an embodiment;
FIG. 5 is an expression configuration interface diagram of one embodiment;
FIG. 6 is a schematic diagram of combining samples using a skip sampling function and a number-limiting sampling function according to one embodiment;
FIG. 7 is a schematic diagram of sampling with an interval sampling function according to one embodiment;
FIG. 8 is a schematic diagram of sampling with a random sampling function according to one embodiment;
FIG. 9 is a block diagram of a data processing apparatus according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application and do not limit its scope.
The data processing method of the embodiments provided in the present application can be applied to the application environment shown in FIG. 1, which involves the terminal 10 and the server 20; the terminal 10 communicates with the server 20 through a network. The data processing method can be applied to the server 20. In the server 20, data is sampled by the data processing method to determine sampled data, and the functions of an application to be tested are tested on the sampled data to determine a test result. After the test is passed, the functions of the application to be tested can be brought online for users; that is, the terminal 10 can download the application by accessing the server 20, and the user can use the functions the application provides on the terminal 10. The server 20 may be implemented as a stand-alone server or as a server cluster comprising a plurality of servers.
In one embodiment, as shown in FIG. 2, a data processing method is provided. The method is described using the example of its application to the server 20 in FIG. 1, and includes the following steps S210 to S240.
S210: Acquire data to be processed.
The data to be processed can be understood as the data to be sampled, that is, the data from which samples are to be extracted. The data to be processed may include article content, commodity information, data received over the network, and the like. The article content may be an introduction to an application, usage instructions, review content, and so on, and the review content may be crawled from the network by a web crawler. The article content can be stored in a database in advance and acquired by reading the database. The commodity information may be provided by a third-party platform, for example, information on flash-sale commodities on an e-commerce platform. The third-party platform refers to a server device that belongs to a different platform from the server 20 currently executing the data processing method. Data received over the network can be understood as data pulled through a network access interface. For example, an information application may provide, on a daily basis, selected content (such as selected answers or featured columns) from an online question-and-answer community, in which users from various industries can share knowledge, experience, and insights, or hold discussions around a topic. The information application provides an open interface, and the selected content can be pulled by accessing that open interface.
S220: Acquire a sampling expression.
S230: Analyze the sampling expression to obtain a sampling function and a sampling parameter value corresponding to the sampling function.
A sampling expression may be understood as a sampling condition. The sampling expression corresponds to the data to be processed, that is, it represents the condition under which the data to be processed is sampled. The sampling expression may include a sampling function and a sampling parameter value. The sampling function is a function that implements a sampling operation, and may also be understood as a sampling rule, that is, a requirement to be followed during sampling. During sampling, data satisfying the sampling rule is extracted from the data to be processed. The sampling parameter value is a parameter value supplied to the sampling process; it adds a further constraint on top of the sampling rule. Together, the sampling function and its corresponding sampling parameter value constitute the sampling condition for the data to be processed.
For example, suppose one item needs to be extracted from the data to be processed at every interval value. The sampling function in the acquired sampling expression must then support sampling at a fixed interval, and an interval value must be set to fix the size of the interval used by the sampling function, that is, to constrain the sampling rule. The sampling function and the interval value constitute the sampling condition for the data to be processed: the data to be processed is sampled once every interval value.
S240: Sample the data to be processed based on the sampling function and the corresponding sampling parameter value, and determine the sampled data.
The server stores, in advance, the code corresponding to each sampling function, that is, the sampling code that carries out the sampling process of that function. After the sampling function and the sampling parameter value are determined by analyzing the sampling expression, the corresponding sampling code is executed with that sampling parameter value, which samples the data to be processed and yields the sampled data.
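The analysis step can be illustrated with a minimal sketch. The source does not specify the syntax of a sampling expression, so the form `name(value)|name(value)` and the function names `skip`, `limit`, `interval` and `random` below are assumptions made purely for illustration.

```python
import re

# Hypothetical syntax: calls separated by "|", e.g. "skip(5)|limit(3)".
_CALL = re.compile(r"^\s*([a-z_]+)\s*\(\s*([0-9]*\.?[0-9]+)\s*\)\s*$")

# Names assumed for illustration; the text only describes four kinds of function.
KNOWN_FUNCTIONS = {"skip", "limit", "interval", "random"}


def parse_sampling_expression(expression):
    """Return a list of (function_name, parameter_value) in expression order."""
    parsed = []
    for part in expression.split("|"):
        match = _CALL.match(part)
        if match is None:
            raise ValueError(f"cannot parse sampling call: {part!r}")
        name, raw_value = match.groups()
        if name not in KNOWN_FUNCTIONS:
            raise ValueError(f"unknown sampling function: {name!r}")
        # The random function takes a probability; the others take a count.
        value = float(raw_value) if name == "random" else int(raw_value)
        parsed.append((name, value))
    return parsed
```

Each returned pair could then be used to look up the pre-stored sampling code for that function and run it with the parsed parameter value.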
With the above data processing method, the data to be processed is sampled by acquiring a sampling expression and applying the sampling function in that expression together with its corresponding sampling parameter value. There is no need to rewrite sampling logic or modify the data input code for each sampling run; acquiring the sampling expression and using the corresponding sampling function and sampling parameter value is sufficient. This simplifies the sampling steps and improves sampling efficiency.
In one embodiment, after determining the sampled data, the method may further include the step of filtering each item of sampled data and determining the filtered sampled data.
In one example, after the filtered sampled data is determined, statistical processing may be performed on it to obtain a statistical result.
To ensure the accuracy of the data, each item of sampled data needs to be filtered, that is, noise in the sampled data needs to be removed. After filtering, statistics can be computed on the data to obtain a statistical result, which makes the condition of the data easy to understand. For example, word count statistics can be computed for data to be processed that consists of article content. It will be appreciated that the filtering and statistical processing described above may be applied each time a sample is obtained.
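A minimal sketch of the filtering and word count steps follows. The source does not define what counts as noise, so treating empty or whitespace-only records as noise, and counting words by splitting on whitespace, are assumptions of this example; the function names are hypothetical.

```python
def filter_samples(samples):
    """Drop records treated as noise (assumed here: empty or whitespace-only text)."""
    return [text for text in samples if text.strip()]


def word_count_statistics(samples):
    """Return per-record word counts and their total, splitting on whitespace."""
    counts = [len(text.split()) for text in samples]
    return {"per_record": counts, "total": sum(counts)}


cleaned = filter_samples(["good app", "  ", "", "easy to use"])
stats = word_count_statistics(cleaned)
```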
In one embodiment, acquiring a sampling expression corresponding to the data to be processed includes: receiving a sampling expression obtained in response to an interactive operation on an expression input box in an expression configuration interface.
In other words, the present application uses a B/S (Browser/Server) mode to implement the data sampling process. The server sends an expression configuration interface to the browser, and the browser displays it. A user can then perform an interactive operation on the expression input box in the displayed interface. For example, the user may perform an input operation, typing a sampling expression into the input box. Alternatively, the user may perform a selection operation: historical sampling expressions are recorded in the browser, and one of them can be selected from the drop-down box of the input box as the sampling expression for the current data to be processed. The browser responds to the interactive operation to obtain the sampling expression corresponding to the data to be processed and transmits it to the server, so that the server acquires the sampling expression by receiving it from the browser. Thus a simple interactive operation on the expression configuration interface is enough for the browser to obtain the sampling expression and pass it to the server, which can then perform the sampling.
In one embodiment, acquiring the data to be processed includes: reading each data source with an iterator to obtain the data to be processed.
An iterator, also called a cursor, is an interface for traversing a container (for example, a linked list or an array) without concern for the contents of the container. An iterator can be understood as an object used to traverse some or all of the elements of a container, such as a standard template library container, with each iterator object representing a definite address in the container. An iterator generalizes the interface of a conventional pointer; it is a conceptual abstraction, and anything that behaves like an iterator can be called one. Iterators differ in their capabilities, and they tie abstract containers and generic algorithms together.
An iterator traverses and selects the objects in a sequence, providing a way to access individual elements of a container object without exposing the container's internal details. With an iterator, the container can be traversed without knowing its underlying structure. Iterators are often called lightweight objects because they are cheap to create.
Each read by the iterator returns one piece of data, which is how data input is realized. A data source can be regarded as an ordered sequence whose length need not be known in advance; the next piece of data is read on demand through the iterator's next() function, only when it is needed. For example, the iterator can check whether the container still holds data and, if so, call next() to obtain the next piece.
Because data sources differ, the data structures of their data differ as well, so the data from the various sources may be heterogeneous. During sampling, data sources of various types must be read to determine the data to be processed, which supplies the data for the subsequent sampling process. In this embodiment, the data to be processed is obtained by reading each data source through an iterator. The data sources may include a data source for article content, a data source for commodity information, and a data source for data received over the network.
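The on-demand reading described above can be sketched with Python iterators. The three sources below are hypothetical stand-ins for a database reader, a third-party commodity feed and a network pull; the helper name `read_sources` is likewise an assumption of this example.

```python
from itertools import chain


def read_sources(*sources):
    """Lazily yield records from each data source, one source after another."""
    return chain.from_iterable(sources)


# Hypothetical stand-ins for heterogeneous data sources.
articles = iter(["article-1", "article-2"])
products = iter([{"sku": "p-1"}])
network = (f"net-{i}" for i in range(2))

records = read_sources(articles, products, network)
first = next(records)   # read a single record on demand
rest = list(records)    # drain whatever remains
```

Nothing is read from a source until `next()` asks for it, so the total length of the sources never has to be known in advance.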
In one embodiment, the sampling function includes a skip sampling function, and the sampling parameter value corresponding to the skip sampling function is a number of hops.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampled data, includes: determining the sampled data based on the data order of the data to be processed, from the data whose position in that order is after the number of hops.
That is, in this embodiment, the sampled data is determined by sampling the data to be processed with the skip sampling function. Each item in the data to be processed has a position in the data order, so the data can be understood as a sequence. The skip sampling function corresponds to a skip sampling rule: the leading items, equal in number to the number of hops, are skipped and not sampled, and the items after them are taken as the sampled data. Skipping the leading items in this way reduces the amount of data to process. The number of items in the data to be processed may be greater than the number of hops, or less than or equal to it. In one example, when the number of items is greater than the number of hops, the items positioned after the number of hops are taken as the sampled data; when the number of items is less than or equal to the number of hops, the data to be processed does not meet the requirement of skip sampling, and sampling stops.
For example, the data to be processed includes 10 items, in order A1, A2, A3, A4, A5, A6, A7, A8, A9, and A10, and the number of hops is 5. The first 5 items in the data order, A1, A2, A3, A4, and A5, are skipped and not sampled. The items after position 5 are taken as the sampled data, which therefore includes A6, A7, A8, A9, and A10.
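The skip sampling rule can be sketched as follows. The function name `skip_sample` is hypothetical, and returning an empty list when there are no more items than the number of hops is this sketch's way of representing "sampling stops".

```python
from itertools import islice


def skip_sample(data, hops):
    """Return the items that come after the first `hops` items, in order."""
    return list(islice(data, hops, None))


items = [f"A{i}" for i in range(1, 11)]
sampled = skip_sample(items, 5)
```

Because `islice` works on any iterable, the same function can consume an iterator over a data source without loading the skipped items into a list first.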
In one embodiment, the sampling function includes a number-limited sampling function, and the sampling parameter value corresponding to the number-limited sampling function is a sampling number threshold.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampled data, includes: determining, based on the data order of the data to be processed, the leading items, equal in number to the sampling number threshold, as the sampled data.
That is, in this embodiment, the sampled data is determined by sampling the data to be processed with the number-limited sampling function. An excessive sample size would burden subsequent processing, so the sample size can be limited during sampling. The sampling number threshold corresponding to the number-limited sampling function is the upper limit on the number of samples. The number-limited sampling function corresponds to a number-limited sampling rule: the number of samples must not exceed the sampling number threshold. When no other sampling function is combined with the number-limited sampling function, taking the leading items of the data to be processed, equal in number to the sampling number threshold, as the sampled data implements sampling under this rule.
For example, the data to be processed includes 10 items, in order A1, A2, A3, A4, A5, A6, A7, A8, A9, and A10, and the sampling number threshold is 5. Sampling under the number-limited sampling rule yields sampled data that includes A1, A2, A3, A4, and A5.
In one embodiment, the step of determining, based on the data order of the data to be processed, the leading items equal in number to the sampling number threshold as the sampled data includes: when the number of items in the data to be processed is greater than or equal to the sampling number threshold, determining the leading items, equal in number to the threshold, as the sampled data; and when the number of items is less than the sampling number threshold, determining all of the data to be processed as the sampled data.
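A minimal sketch of the number-limited rule, covering both cases above. The function name `limit_sample` is hypothetical.

```python
from itertools import islice


def limit_sample(data, threshold):
    """Return at most `threshold` leading items; fewer if the data runs out."""
    return list(islice(data, threshold))


items = [f"A{i}" for i in range(1, 11)]
first_five = limit_sample(items, 5)
everything = limit_sample(items, 20)  # threshold larger than the data
```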
In one embodiment, the sampling function includes an interval sampling function, and the sampling parameter value corresponding to the interval sampling function is an interval value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampled data, includes: determining, based on the data order of the data to be processed, the items spaced apart by the interval value as the sampled data, where the sampled data includes the first item in the data order.
That is, in this embodiment, the sampled data is determined by sampling the data to be processed with the interval sampling function. The interval sampling function corresponds to an interval sampling rule, that is, a rule of sampling the data to be processed once every interval value, so that consecutive samples are separated by the interval value in the data order. In other words, the data to be processed is sampled at equal intervals, with a fixed interval value. Specifically, based on the data order, the first item in the data to be processed is taken as the current data and determined as sampled data. Then, while the number of items following the current data is greater than or equal to the interval value, the item whose position differs from that of the current data by the interval value is taken as the new current data, that is, the current data is updated, and the step of determining the current data as sampled data is repeated. When the number of items following the current data is less than the interval value, sampling ends and the sampled data is determined.
For example, the data to be processed includes 10 items, in order A1, A2, A3, A4, A5, A6, A7, A8, A9, and A10. The interval value is 3, and after sampling under the interval sampling rule, the obtained sampled data includes A1, A4 and A7.
Sampling with the interval sampling function traverses the whole data set, namely the data to be processed, so the sampled data can be expected to be well distributed. For the same data to be processed, sampling several times with the interval sampling function yields the same sampled data each time, which makes it easier to locate problem data and resolve the problem.
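The interval rule can be sketched as below; the function name `interval_sample` is hypothetical. The sketch follows the step-by-step description literally: it keeps the first item and then every item whose position is a multiple of the interval value further on. Under that reading the tenth item A10 is also selected in the ten-item example, since three items still follow A7, whereas the example in the text lists A1, A4 and A7.

```python
def interval_sample(data, interval):
    """Keep the first item, then every item `interval` positions after the last kept one."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    sampled = []
    for position, item in enumerate(data):
        # Positions 0, interval, 2 * interval, ... are kept.
        if position % interval == 0:
            sampled.append(item)
    return sampled


items = [f"A{i}" for i in range(1, 11)]
sampled = interval_sample(items, 3)
```

The function has no random component, so repeated calls on the same input return the same items, which matches the reproducibility property noted above.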
In one embodiment, the sampling function includes a random sampling function, and the sampling parameter value corresponding to the random sampling function is a random probability value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampled data, includes: randomly sampling the data to be processed based on the random probability value, and determining the sampled data.
That is, in this embodiment, the sampled data is determined by sampling the data to be processed with the random sampling function. The random sampling function corresponds to a random sampling rule, that is, a rule of sampling the data to be processed at random, specifically with a fixed random probability value. Sampling with the random sampling function traverses the whole data set, and whether any given item is sampled is unpredictable, which helps the sampled data to be uniformly distributed.
For example, the data to be processed includes 25 items, in order A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15, A16, A17, A18, A19, A20, A21, A22, A23, A24, and A25. The random probability value is 0.5, and after sampling under the random sampling rule, the obtained sampled data may include A2, A3, A6, A8, A10, A11, A12, A14, A19, A21, A22 and A25.
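One common reading of sampling with a fixed random probability value is to keep each item independently with that probability. The sketch below takes that reading as an assumption; the function name `random_sample` and the optional `rng` parameter (used only to make runs repeatable) are hypothetical.

```python
import random


def random_sample(data, probability, rng=None):
    """Keep each item independently with the given probability, preserving order."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1), so 0.0 keeps nothing and 1.0 keeps everything.
    return [item for item in data if rng.random() < probability]


items = [f"A{i}" for i in range(1, 26)]
sampled = random_sample(items, 0.5, rng=random.Random(0))
```

With a probability of 0.5 the number of sampled items varies from run to run around half of the input, which is consistent with the 12 items out of 25 in the example above.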
In one embodiment, the sampling function includes at least one of a skip sampling function, a number-limited sampling function, an interval sampling function, and a random sampling function.
Besides sampling with a single one of the skip sampling function, the number-limited sampling function, the interval sampling function, and the random sampling function, combined sampling may be performed. These four sampling functions can be combined freely to sample the data to be processed, which further improves the diversity and uniformity of the sampled data.
If a sampling expression contains more than one sampling function, the sampling functions appear in it in a definite order, and analyzing the sampling expression yields that order in addition to the sampling functions and their corresponding sampling parameter values. In that case, the step of sampling the data to be processed based on the sampling function and the corresponding sampling parameter value may include: sampling the data to be processed based on the sampling functions, the corresponding sampling parameter values, and the order of the sampling functions, to determine the sampled data.
The data obtained by one sampling function serves as the input of the next sampling function; that is, each sampling function samples the output of the preceding one, and applying the functions in order yields the sampled data. When there are at least two sampling functions, the sampling process is therefore serial: the first sampling function in the order samples the data to be processed, the next sampling function samples the data obtained by the first, and so on, until every sampling function in the sampling expression has been applied.
Whichever of the skip sampling function, the number-limited sampling function, the interval sampling function and the random sampling function the sampling expression contains, each function applied in turn works on the same principle as when it is the only sampling function in the expression. For example, if the sampling expression contains a skip sampling function followed by a number-limited sampling function, the skip sampling function samples the data to be processed in the same way as if the expression contained only the skip sampling function, and the number-limited sampling function then samples the resulting data in the same way as if the expression contained only the number-limited sampling function.
In one embodiment, the number of sampling functions is 2, that is, the sampling functions are any two of a skip sampling function, a number-limited sampling function, an interval sampling function and a random sampling function. Sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values then includes: sampling the data to be processed with the earlier sampling function and its sampling parameter value to determine first data to be sampled, and then sampling the first data to be sampled with the later sampling function and its sampling parameter value to determine the sampled data.
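The serial combination can be sketched as a small pipeline in which each sampling function consumes the output of the previous one. The names `skip`, `limit`, `interval`, `SAMPLERS` and `apply_in_order` are hypothetical, and the random sampling function is left out so that the results are deterministic.

```python
from itertools import islice


def skip(data, hops):
    """Drop the first `hops` items."""
    return islice(data, hops, None)


def limit(data, threshold):
    """Keep at most `threshold` leading items."""
    return islice(data, threshold)


def interval(data, step):
    """Keep the first item and every `step`-th item after it."""
    return islice(data, 0, None, step)


SAMPLERS = {"skip": skip, "limit": limit, "interval": interval}


def apply_in_order(data, calls):
    """Apply (name, value) pairs one after another; each samples the previous output."""
    stream = iter(data)
    for name, value in calls:
        stream = SAMPLERS[name](stream, value)
    return list(stream)


items = [f"A{i}" for i in range(1, 11)]
skip_first = apply_in_order(items, [("skip", 2), ("interval", 3)])
interval_first = apply_in_order(items, [("interval", 3), ("skip", 2)])
```

The two calls use the same functions and parameter values in a different order and return different results, which is why the order of the sampling functions has to be recovered from the expression.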
In one embodiment, the sampling functions include a skip sampling function and a number limit sampling function, the skip sampling function is ordered before the number limit sampling function, the sampling parameter value corresponding to the skip sampling function is a number of hops, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold.
In this embodiment, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampling data includes: sampling the data to be processed based on the skip sampling function and the number of hops to determine first to-be-determined sampling data; and sampling the first to-be-determined sampling data based on the number limit sampling function and the sampling number threshold to determine the sampling data.
In one embodiment, when the data to be processed contains more than the number of hops of not-yet-extracted data, the data ordered after the number of hops in the data to be processed may be determined as the first to-be-determined sampling data. Sampling data is then extracted from the first to-be-determined sampling data, the number of sampling data being less than or equal to the sampling number threshold.
That is, in this embodiment, sampling is performed using a combination of the skip sampling function and the number limit sampling function. During sampling, data is skipped first: it is judged whether the data to be processed contains more than the number of hops of not-yet-extracted data. If not, sampling stops. If so, the sampling data is determined from the data ordered after the number of hops in the data to be processed (namely the first to-be-determined sampling data), so that both the skip sampling rule and the number limit sampling rule are satisfied. Specifically, when the number of data after the number of hops is greater than or equal to the sampling number threshold, the first sampling-number-threshold data ordered after the number of hops in the data to be processed, that is, the first sampling-number-threshold data in the first to-be-determined sampling data, is determined as the sampling data. When the number of data after the number of hops is less than the sampling number threshold, all of the data ordered after the number of hops in the data to be processed, that is, the whole of the first to-be-determined sampling data, is taken as the sampling data. In one example, the sampling data includes the data whose order in the data to be processed is the number of hops plus one.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 5, and the sampling number threshold is 2. The first 5 data in data order, namely A1, A2, A3, A4 and A5, are skipped and not sampled. The first to-be-determined sampling data is A6, A7, A8, A9 and A10 (the data ordered after the 5th), and the first 2 of these are taken as the sampling data, so the sampling data includes A6 and A7.
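The skip-then-limit combination and the A1 to A10 example above can be illustrated with a minimal Python sketch. The function name `skip_then_limit` and its parameter names are assumptions made for illustration and are not taken from the original text.

```python
from typing import List, Sequence


def skip_then_limit(data: Sequence[str], hops: int, threshold: int) -> List[str]:
    """Skip the first `hops` items, then keep at most `threshold` of the rest."""
    # Sampling stops when the data does not contain more than `hops` items.
    if len(data) <= hops:
        return []
    first_pending = list(data[hops:])  # the first to-be-determined sampling data
    return first_pending[:threshold]   # at most `threshold` items, in data order


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
sampled = skip_then_limit(DATA, hops=5, threshold=2)
```

With a threshold larger than the number of remaining items (for example 8), the whole of the first to-be-determined sampling data is returned.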
In one embodiment, extracting sampling data from the first to-be-determined sampling data includes: initializing the number of samples to zero; based on the data order of the first to-be-determined sampling data, selecting the first data in it (namely the first data after the number of hops) as the current processing data; when the number of samples is less than the sampling number threshold and the current processing data meets a preset requirement, determining the current processing data as sampling data and adding one to the number of samples; and, when data exists after the current processing data in the first to-be-determined sampling data, taking the data immediately following the current processing data in data order as the new current processing data (that is, updating the current processing data) and returning to the step of determining the current processing data as sampling data when the number of samples is less than the sampling number threshold and the current processing data meets the preset requirement. This repeats until the number of samples equals the sampling number threshold or no data exists after the current processing data in the first to-be-determined sampling data (that is, the number of data after the current processing data is zero). In addition, when the current processing data does not meet the preset requirement, it is discarded; that is, when data exists after it in the first to-be-determined sampling data, the data immediately following it in data order is taken as the new current processing data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 5, and the sampling number threshold is 2. First, the 6th data, namely A6, is taken as the current processing data. The number of samples is zero, which is less than 2, and A6 meets the preset requirement, so A6 is determined as sampling data and the number of samples is increased by one, that is, updated to 1. The data after A6 further includes A7, A8, A9 and A10, so A7 is taken as the current processing data and the next sampling data is determined. At this point the number of samples is less than 2 and A7 meets the preset requirement, so A7 is determined as sampling data and the number of samples is updated to 2. The data after A7 still includes A8, A9 and A10, but the number of samples is now 2, which equals the sampling number threshold, so the sampling end condition is satisfied and sampling ends. The determined sampling data are A6 and A7.
If instead the current processing data A7 does not meet the preset requirement, A8 is taken as the current processing data. The number of samples is still less than 2 and A8 meets the preset requirement, so A8 is determined as sampling data and the number of samples is updated to 2. The data after A8 still includes A9 and A10, but the number of samples is now 2, which equals the sampling number threshold, so the sampling end condition is satisfied and sampling ends. The determined sampling data are A6 and A8.
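The item-by-item extraction loop and the two outcomes above (A6, A7 and A6, A8) can be sketched as follows. This is a simplified illustration: `limited_scan` and `meets_requirement` are hypothetical names, and the preset requirement is modelled as an arbitrary predicate supplied by the caller.

```python
from typing import Callable, List, Sequence


def limited_scan(
    pending: Sequence[str],
    threshold: int,
    meets_requirement: Callable[[str], bool],
) -> List[str]:
    """Walk the first to-be-determined sampling data in order and keep items
    that meet the requirement until `threshold` items have been kept."""
    sampled: List[str] = []
    sample_count = 0  # the number of samples starts at zero
    for current in pending:
        if sample_count >= threshold:
            break  # end condition: the sampling number threshold is reached
        if meets_requirement(current):
            sampled.append(current)
            sample_count += 1
        # Otherwise the current item is discarded and the next one is examined.
    return sampled


PENDING = ["A6", "A7", "A8", "A9", "A10"]  # data after a number of hops of 5
all_pass = limited_scan(PENDING, 2, lambda item: True)
a7_rejected = limited_scan(PENDING, 2, lambda item: item != "A7")
```

The loop also ends naturally when no data remains, in which case fewer than `threshold` items may be returned.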
In this embodiment, after the current data to be processed is determined as the sample data, the above filtering process and statistical process may be performed on the sample data.
In one embodiment, the preset requirement may be that the number of data in the already determined sampling data that is of the same type as the current processing data is less than a preset number threshold. That is, when that number is less than the preset number threshold, the current processing data meets the preset requirement. If the number of data of the same type as the current processing data in the determined sampling data is greater than or equal to the preset number threshold, enough data of that type has already been extracted; to ensure the diversity of the sampling data, data of that type is no longer extracted. In this case, the current processing data may be discarded, the current processing data is updated, and the next round of sampling data determination is performed.
For example, when the current processing data is article content, that is, it corresponds to the article type, and a preset-number-threshold quantity of data of the article type has already been extracted as sampling data, the article content may be discarded without being extracted. As another example, when the current processing data is commodity information, that is, it corresponds to the commodity type, and a preset-number-threshold quantity of data of the commodity type has already been extracted as sampling data, the commodity information may be discarded without being extracted. As a further example, when the current processing data is network received data, that is, it corresponds to the network data type, and a preset-number-threshold quantity of data of the network data type has already been extracted as sampling data, the network received data may be discarded without being extracted.
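The type-based preset requirement can be sketched as a per-type cap applied during the scan. In this sketch each item is assumed to be a `(type, content)` pair, and the names `sample_with_type_cap` and `per_type_cap` (standing for the preset number threshold) are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Item = Tuple[str, str]  # assumed layout: (type, content)


def sample_with_type_cap(
    items: Sequence[Item], threshold: int, per_type_cap: int
) -> List[Item]:
    """Keep at most `threshold` items, and at most `per_type_cap` per type."""
    sampled: List[Item] = []
    kept_per_type: Counter = Counter()
    for kind, content in items:
        if len(sampled) >= threshold:
            break
        # Preset requirement: fewer than `per_type_cap` items of this type
        # have been sampled so far.
        if kept_per_type[kind] < per_type_cap:
            sampled.append((kind, content))
            kept_per_type[kind] += 1
    return sampled


ITEMS = [
    ("article", "a1"), ("article", "a2"), ("article", "a3"),
    ("commodity", "c1"), ("article", "a4"), ("commodity", "c2"),
]
diverse = sample_with_type_cap(ITEMS, threshold=4, per_type_cap=2)
```

In this run the third and fourth articles are discarded because two articles have already been sampled, which leaves room for both commodity items.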
In one embodiment, the sampling functions include a number-limited sampling function and a skip sampling function, the order of the number-limited sampling function preceding the order of the skip sampling function.
In this embodiment, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampling data includes: sampling the data to be processed based on the number limit sampling function and the sampling number threshold to determine first to-be-determined sampling data; and sampling the first to-be-determined sampling data based on the skip sampling function and the number of hops to determine the sampling data. In this embodiment, the first to-be-determined sampling data is determined based on the number limit sampling function, so its specific content may differ from that of the first to-be-determined sampling data determined by the skip sampling function in the embodiment described above.
In one embodiment, based on the data order of the data to be processed, the first sampling-number-threshold data in the data to be processed may be determined as the first to-be-determined sampling data; and the data ordered after the number of hops in the first to-be-determined sampling data is determined as the sampling data.
That is, in this embodiment, sampling is first performed by the number limit sampling function: the first sampling-number-threshold data in the data to be processed is taken as the first to-be-determined sampling data. The first number-of-hops data in the first to-be-determined sampling data is then skipped, and the remaining data in the first to-be-determined sampling data is taken as the sampling data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 2, and the sampling number threshold is 5. The first 5 data are extracted as the first to-be-determined sampling data, which therefore includes A1, A2, A3, A4 and A5. Then the data ordered after the 2nd data (A2) in the first to-be-determined sampling data is determined as the sampling data, so the sampling data includes A3, A4 and A5.
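The limit-then-skip order can be sketched in Python as below, together with the opposite order for comparison under the same parameter values. The function name `limit_then_skip` is an assumption for illustration only.

```python
from typing import List, Sequence


def limit_then_skip(data: Sequence[str], threshold: int, hops: int) -> List[str]:
    """Keep the first `threshold` items, then drop the first `hops` of those."""
    first_pending = list(data[:threshold])  # first to-be-determined sampling data
    return first_pending[hops:]


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
limit_first = limit_then_skip(DATA, threshold=5, hops=2)

# For comparison: the same parameter values applied in the opposite order.
skip_first = list(DATA[2:])[:5]
```

The two orders give different results for the same parameter values, which is why the order of the sampling functions in the sampling expression matters.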
In one embodiment, the sampling functions include an interval sampling function and a number limit sampling function, the interval sampling function is ordered before the number limit sampling function, the sampling parameter value corresponding to the interval sampling function is an interval value, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on an interval sampling function, determining first to-be-determined sampling data, sampling the first to-be-determined sampling data based on a quantity limit sampling function, and determining sampling data.
In one embodiment, first, based on the data order of the data to be processed, data spaced apart by the interval value in the data to be processed may be determined as the first to-be-determined sampling data, and data is then taken in data order from the first to-be-determined sampling data as the sampling data. The number of sampling data is less than or equal to the sampling number threshold, and the sampling data includes the data ordered first in the data to be processed.
In one embodiment, first, based on the data order of the data to be processed, the data ordered first in the data to be processed may be taken as the current data, and the current data is determined as sampling data. Then, when the number of data after the current data in the data to be processed is greater than or equal to the interval value and the number of determined sampling data is less than the sampling number threshold, the data whose order differs from that of the current data by the interval value is taken as the new current data (that is, the current data is updated), and the process returns to the step of determining the current data as sampling data. This repeats until the number of data after the current data is less than the interval value or the number of determined sampling data is greater than or equal to the sampling number threshold, at which point the sampling data is determined. It can be understood that if sampling ends because the number of data after the current data is less than the interval value, the number of determined sampling data has not reached the sampling number threshold and the remaining data cannot satisfy the interval requirement; in this case the number of determined sampling data is less than the sampling number threshold. If sampling ends because the number of determined sampling data is greater than or equal to the sampling number threshold, the number of determined sampling data has reached, and is equal to, the sampling number threshold.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the interval value is 2, and the sampling number threshold is 4. First, the data to be processed is sampled based on the interval sampling function, and the determined first to-be-determined sampling data includes A1, A3, A5, A7 and A9. Number limit sampling is then performed on this data, that is, at most 4 sampling data are extracted. The number of the first to-be-determined sampling data is 5, which is greater than the sampling number threshold, so the first 4 data in data order in the first to-be-determined sampling data, namely A1, A3, A5 and A7, are taken as the sampling data. If the sampling number threshold were instead 6, the number of the first to-be-determined sampling data would be less than the sampling number threshold, and the first to-be-determined sampling data would be taken as the sampling data, that is, the sampling data would include A1, A3, A5, A7 and A9.
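The interval-then-limit loop described above can be sketched as follows. The name `interval_then_limit` is hypothetical, and the interval value is assumed to be a positive integer step (an interval value of 2 selects every second item, starting from the first), which matches the examples in the text.

```python
from typing import List, Sequence


def interval_then_limit(data: Sequence[str], interval: int, threshold: int) -> List[str]:
    """Take every `interval`-th item starting from the first, stopping once
    `threshold` items have been taken or the data runs out."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    sampled: List[str] = []
    index = 0  # the first item in data order is the initial current data
    while index < len(data) and len(sampled) < threshold:
        sampled.append(data[index])
        index += interval  # move to the item `interval` positions later
    return sampled


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
capped = interval_then_limit(DATA, interval=2, threshold=4)
uncapped = interval_then_limit(DATA, interval=2, threshold=6)
```

In the first call the loop ends because the threshold is reached; in the second it ends because fewer than `interval` items remain after A9.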
In one embodiment, the sampling functions include a number-limited sampling function and an interval sampling function, the order of the number-limited sampling function preceding the order of the interval sampling function.
In this embodiment, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampling data includes: sampling the data to be processed based on the number limit sampling function and the sampling number threshold to determine first to-be-determined sampling data; and sampling the first to-be-determined sampling data based on the interval sampling function and the interval value to determine the sampling data.
In one embodiment, first, based on the data order of the data to be processed, the first sampling-number-threshold data in the data to be processed may be determined as the first to-be-determined sampling data; then data spaced apart by the interval value in the first to-be-determined sampling data is determined as the sampling data, the sampling data including the data ordered first in the first to-be-determined sampling data.
That is, in this embodiment, number limit sampling is performed first: the first sampling-number-threshold data in the data to be processed is taken as the first to-be-determined sampling data. Interval sampling is then performed on the first to-be-determined sampling data using the interval value to determine the sampling data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the sampling number threshold is 6, and the interval value is 2. The first 6 data are extracted as the first to-be-determined sampling data, which therefore includes A1, A2, A3, A4, A5 and A6. Then data spaced apart by the interval value in the first to-be-determined sampling data is determined as the sampling data, which includes the data ordered first in the first to-be-determined sampling data, namely A1. The sampling data includes A1, A3 and A5.
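A minimal Python sketch of the limit-then-interval order, using list slicing, is shown below. The function name `limit_then_interval` is an assumption, and the interval value is again assumed to be a positive integer step.

```python
from typing import List, Sequence


def limit_then_interval(data: Sequence[str], threshold: int, interval: int) -> List[str]:
    """Keep the first `threshold` items, then take every `interval`-th of them,
    starting from the first."""
    first_pending = list(data[:threshold])  # first to-be-determined sampling data
    return first_pending[::interval]


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
sampled = limit_then_interval(DATA, threshold=6, interval=2)
```

With the same parameter values, interval sampling followed by number limit sampling would instead yield A1, A3, A5, A7 and A9, since the limit of 6 is then applied to the five interval-sampled items.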
In one embodiment, the sampling function includes a random sampling function and a number-limited sampling function, the order of the random sampling function precedes the order of the number-limited sampling function, the value of the sampling parameter corresponding to the random sampling function is a random probability value, and the value of the sampling parameter corresponding to the number-limited sampling function is a sample number threshold.
In this embodiment, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampling data includes: randomly sampling the data to be processed based on the random sampling function and the random probability value to determine first to-be-determined sampling data; and sampling the first to-be-determined sampling data based on the number limit sampling function and the sampling number threshold to determine the sampling data. The number of sampling data is less than or equal to the sampling number threshold.
Random sampling of the data to be processed traverses the whole data set, so the number of samples can be limited to avoid placing pressure on subsequent processing when too many samples are drawn. After the first to-be-determined sampling data is obtained by random sampling, sampling data whose number is less than or equal to the sampling number threshold is extracted from it. That is, the data to be processed is traversed for random sampling, and the number of data obtained by further sampling with the number limit sampling function is less than or equal to the sampling number threshold. Specifically, the first to-be-determined sampling data is obtained by sampling with the random sampling function. When the number of the first to-be-determined sampling data is less than the sampling number threshold, the first to-be-determined sampling data is taken as the sampling data, and the number of sampling data is then less than the sampling number threshold. When the number of the first to-be-determined sampling data is greater than or equal to the sampling number threshold, the first sampling-number-threshold data in data order is extracted from the first to-be-determined sampling data as the sampling data, and the number of determined sampling data is then equal to the sampling number threshold.
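The random-then-limit combination can be sketched as follows. This sketch assumes that random sampling keeps each item independently with the given random probability value; this section does not spell out that detail, so it is an assumption, as are the names `random_then_limit` and `rng`.

```python
import random
from typing import List, Optional, Sequence


def random_then_limit(
    data: Sequence[str],
    probability: float,
    threshold: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Keep each item with `probability`, then keep at most `threshold` of them."""
    if rng is None:
        rng = random.Random()
    # Assumption: every item is kept independently with the same probability.
    first_pending = [item for item in data if rng.random() < probability]
    return first_pending[:threshold]  # the earliest `threshold` items in data order


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
sampled = random_then_limit(DATA, probability=0.5, threshold=3, rng=random.Random(0))
```

Whatever the random draw, `sampled` never contains more than 3 items and keeps the original data order. Passing a seeded `random.Random` makes a run repeatable.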
In one embodiment, the sampling functions include a number-limited sampling function and a random sampling function, the order of the number-limited sampling function preceding the order of the random sampling function.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on a quantity limiting sampling function and a sampling quantity threshold value to determine first to-be-determined sampling data, sampling the first to-be-determined sampling data based on a random sampling function, and determining sampling data.
In one embodiment, based on the data order of the data to be processed, the first sampling-number-threshold data in the data to be processed may be determined as the first to-be-determined sampling data; the first to-be-determined sampling data is then randomly sampled based on the random sampling function and the random probability value to determine the sampling data. That is, in this embodiment, number limit sampling is performed first: the first sampling-number-threshold data in the data to be processed is taken as the first to-be-determined sampling data, and random sampling is then performed on it using the random probability value to determine the sampling data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the sampling number threshold is 6, and the random probability value is 0.05. The first 6 data are extracted as the first to-be-determined sampling data, which therefore includes A1, A2, A3, A4, A5 and A6. The first to-be-determined sampling data is then randomly sampled based on the random probability value to determine the sampling data, which may include, for example, A1, A2, A5 and A6.
In one embodiment, the sampling functions include a skip sampling function and an interval sampling function, the skip sampling function is ordered before the interval sampling function, the sampling parameter value corresponding to the skip sampling function is a number of hops, and the sampling parameter value corresponding to the interval sampling function is an interval value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on a skip sampling function and the number of hops, and determining first data to be sampled; the first to-be-sampled data is sampled based on an interval sampling function to determine sampled data.
That is, in this embodiment, based on the data order of the data to be processed, the data ordered after the number of hops in the data to be processed may first be determined as the first to-be-determined sampling data. Then, based on the data order of the first to-be-determined sampling data, data spaced apart by the interval value in the first to-be-determined sampling data is determined as the sampling data, the sampling data including the data ordered first in the first to-be-determined sampling data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 5, and the interval value is 2. The first 5 data are skipped and not sampled, so the first to-be-determined sampling data includes A6, A7, A8, A9 and A10. Interval sampling is then performed on it, and the sampling data includes A6, A8 and A10.
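Skip-then-interval sampling maps directly onto `itertools.islice`, which takes a start position and a step. The sketch below is illustrative only; the function name `skip_then_interval` is an assumption, and the interval value is assumed to be a positive integer step.

```python
from itertools import islice
from typing import Iterable, List


def skip_then_interval(data: Iterable[str], hops: int, interval: int) -> List[str]:
    """Skip the first `hops` items, then take every `interval`-th remaining item,
    starting with the first one after the skipped items."""
    # islice(iterable, start, stop, step): start after the skipped items,
    # read to the end, and step by the interval value.
    return list(islice(data, hops, None, interval))


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
sampled = skip_then_interval(DATA, hops=5, interval=2)
```

Because `islice` consumes its input lazily, the same function also works on an iterator or other one-pass data source, not only on a list.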
In one embodiment, the sampling functions include an interval sampling function and a skip sampling function, the order of the interval sampling function preceding the order of the skip sampling function.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on an interval sampling function and an interval value, and determining first data to be sampled; the first to-be-sampled data is sampled based on the skip sampling function and the number of hops to determine sampled data.
The first to-be-determined sampling data includes the data ordered first in the data to be processed. That is, in this embodiment, the data to be processed may be sampled using the interval sampling function and the interval value to determine the first to-be-determined sampling data; the first number-of-hops data in the first to-be-determined sampling data is then skipped, and the data ordered after the number of hops in the first to-be-determined sampling data is taken as the sampling data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 2, and the interval value is 2. The data to be processed is interval-sampled using the interval value, and the resulting first to-be-determined sampling data includes A1, A3, A5, A7 and A9. Then the first 2 data in the first to-be-determined sampling data, namely A1 and A3, are skipped, and the remaining A5, A7 and A9 are taken as the sampling data.
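The interval-then-skip order can be sketched with two list slices. The function name `interval_then_skip` is hypothetical, and the interval value is assumed to be a positive integer step.

```python
from typing import List, Sequence


def interval_then_skip(data: Sequence[str], interval: int, hops: int) -> List[str]:
    """Take every `interval`-th item starting from the first, then drop the
    first `hops` of the items taken."""
    first_pending = list(data[::interval])  # first to-be-determined sampling data
    return first_pending[hops:]


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
sampled = interval_then_skip(DATA, interval=2, hops=2)
```

Note that the number of hops here counts items of the interval-sampled data, not items of the original data to be processed.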
In one embodiment, the sampling functions include a skip sampling function and a random sampling function, the order of the skip sampling function precedes the order of the random sampling function, the value of the sampling parameter corresponding to the skip sampling function is a skip number, and the value of the sampling parameter corresponding to the random sampling function is a random probability value.
In this embodiment, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampling data includes: sampling the data to be processed based on the skip sampling function and the number of hops to determine first to-be-determined sampling data; and sampling the first to-be-determined sampling data based on the random sampling function and the random probability value to determine the sampling data.
That is, in the present embodiment, data whose data order is after the number of hops in the data to be processed may be determined as the first sample data to be determined based on the data order of the data to be processed; and randomly sampling the first to-be-sampled data based on the random sampling function and the random probability value to determine the sampled data.
That is, sampling according to the skip sampling rule is performed first to obtain the first to-be-determined sampling data, and random sampling is then performed on the first to-be-determined sampling data to obtain the sampling data. For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 5, and the random probability value is 0.5. The first 5 data are skipped and not sampled, so the first to-be-determined sampling data includes A6, A7, A8, A9 and A10. This data is then randomly sampled, and the sampling data may include, for example, A7 and A8.
In one embodiment, the sampling functions include a random sampling function and a skip sampling function, the order of the random sampling function preceding the order of the skip sampling function.
In this embodiment, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampling data includes: sampling the data to be processed based on the random sampling function and the random probability value to determine first to-be-determined sampling data; and sampling the first to-be-determined sampling data based on the skip sampling function and the number of hops to determine the sampling data.
That is, in this embodiment, the data to be processed may be sampled based on the random sampling function and the random probability value to determine the first data to be sampled, and then data with the number of hops sequentially preceding in the first data to be sampled may be skipped, and data sequentially following the number of hops in the first data to be sampled may be used as the sample data.
For example, the data to be processed includes 10 data, which are sequentially A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10, the number of hops is 2, and the random probability value is 0.5. The data to be processed is randomly sampled using the random probability value, and the resulting first to-be-determined sampling data includes, for example, A1, A2, A5, A7 and A9. Then the first 2 data in the first to-be-determined sampling data, namely A1 and A2, are skipped, and the remaining A5, A7 and A9 are taken as the sampling data.
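The random-then-skip order can be sketched as two small steps. As an assumption not stated in this section, random sampling is modelled as keeping each item independently with the random probability value; the names `random_sample`, `skip` and `random_then_skip` are hypothetical.

```python
import random
from typing import List, Sequence


def random_sample(data: Sequence[str], probability: float, rng: random.Random) -> List[str]:
    """Assumed random sampling: keep each item independently with `probability`."""
    return [item for item in data if rng.random() < probability]


def skip(data: Sequence[str], hops: int) -> List[str]:
    """Drop the first `hops` items and keep the rest."""
    return list(data[hops:])


def random_then_skip(
    data: Sequence[str], probability: float, hops: int, rng: random.Random
) -> List[str]:
    """Randomly sample first, then skip `hops` items of the random sample."""
    return skip(random_sample(data, probability, rng), hops)


DATA = [f"A{i}" for i in range(1, 11)]  # A1 .. A10

# Second step applied to the first to-be-determined sampling data from the example.
example_pending = ["A1", "A2", "A5", "A7", "A9"]
example_result = skip(example_pending, 2)

# A full run; the outcome depends on the random draw.
run = random_then_skip(DATA, probability=0.5, hops=2, rng=random.Random(0))
```

Since the skip step acts on whatever the random step happened to keep, the sampling data can be empty when the random sample contains no more items than the number of hops.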
In one embodiment, the sampling functions include an interval sampling function and a random sampling function, the interval sampling function is ordered before the random sampling function, the sampling parameter value corresponding to the interval sampling function is an interval value, and the sampling parameter value corresponding to the random sampling function is a random probability value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on the interval sampling function and the interval value, and determining first data to be sampled; and sampling the first to-be-determined sampling data based on the random sampling function and the random probability value to determine the sampling data.
That is, in this embodiment, sampling is not performed only by the interval sampling function and the interval value; the first to-be-determined sampling data obtained by interval sampling is further randomly sampled, which helps ensure that the sampling data is more comprehensive rather than concentrated on certain data.
In one embodiment, the sampling functions include a random sampling function and an interval sampling function, the order of the random sampling function preceding the order of the interval sampling function.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on a random sampling function and a random probability value, and determining first data to be sampled; and sampling the first to-be-determined sampling data based on the interval sampling function and the spacing value to determine the sampling data.
That is, in this embodiment, sampling is not performed only by the random sampling function and the random probability value; the first to-be-determined sampling data obtained by random sampling is further interval-sampled, which helps ensure that the sampling data is more comprehensive rather than concentrated on certain data.
In one embodiment, the number of sampling functions is three, that is, the sampling functions include any three of the skip sampling function, the number-limited sampling function, the interval sampling function and the random sampling function. In this case, sampling the data to be processed based on the sampling functions and the corresponding sampling parameter values to determine the sampled data proceeds as follows: the data to be processed is sampled based on the first sampling function in order and its sampling parameter value to determine second to-be-sampled data; the second to-be-sampled data is sampled based on the middle sampling function in order and its sampling parameter value to determine third to-be-sampled data; and finally the third to-be-sampled data is sampled based on the last sampling function in order and its sampling parameter value to determine the sampled data.
The principle of sampling with three sampling functions in order is similar to the principle of sampling with two sampling functions in order described above, except that one more sampling function adds one more sampling pass. Each of the three sampling functions samples in the same way as when the sampling expression contains only that single function. The difference is that only the first sampling function in order samples the data to be processed directly; every other sampling function samples the data determined by the preceding sampling pass.
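The chaining principle described above can be sketched as a small pipeline in which each sampling function receives the output of the previous one. This is a minimal illustration, not the patented implementation: the function names (skip, limit, interval, apply_in_order), the list-slicing realization and the example parameters are all assumptions made for this sketch.

```python
from typing import Callable, List, Tuple

Sampler = Callable[[list, int], list]


def skip(items: list, count: int) -> list:
    """Skip sampling: drop the first `count` items in order."""
    return items[count:]


def limit(items: list, threshold: int) -> list:
    """Number-limited sampling: keep at most the first `threshold` items."""
    return items[:threshold]


def interval(items: list, spacing: int) -> list:
    """Interval sampling: keep the first item and every `spacing`-th item after it."""
    return items[::spacing]


def apply_in_order(data: list, steps: List[Tuple[Sampler, int]]) -> List[list]:
    """Apply each (function, parameter) pair to the result of the previous pass.

    Returns every intermediate stage; stages[0] is the data to be processed
    and stages[-1] is the final sampled data.
    """
    stages = [list(data)]
    for sampler, parameter in steps:
        stages.append(sampler(stages[-1], parameter))
    return stages


data = list(range(1, 21))  # hypothetical data to be processed: 1..20
stages = apply_in_order(data, [(skip, 5), (interval, 3), (limit, 4)])
for stage in stages:
    print(stage)
```

Only the first function sees the data to be processed; each later function sees only what the preceding pass kept, which is why reordering the same three functions can change the result.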
The following description takes some of the ordered combinations of three sampling functions as examples (there are 24 ordered combinations of three sampling functions chosen from the skip sampling function, the number-limited sampling function, the interval sampling function and the random sampling function).
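The count of 24 follows from choosing an ordered triple out of four functions, 4 × 3 × 2 = 24. A short check, using placeholder labels for the four sampling functions (the labels are assumptions for illustration only):

```python
from itertools import permutations

# Placeholder labels for the four sampling functions named in the text.
FUNCTIONS = ("skip", "limit", "interval", "random")

# Every ordered selection of three distinct functions out of the four.
orderings_of_three = list(permutations(FUNCTIONS, 3))

print(len(orderings_of_three))
print(orderings_of_three[:3])
```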
For example, in one embodiment, the sampling functions include a skip sampling function, an interval sampling function, and a quantity-limited sampling function, the skip sampling function having a first order, the interval sampling function having a second order (i.e., an order between the skip sampling function and the quantity-limited sampling function), the quantity-limited sampling function having a last order, the skip sampling function having a skip number corresponding to a sampling parameter value, the interval sampling function having a pitch value corresponding to a sampling parameter value, and the quantity-limited sampling function having a sample quantity threshold value corresponding to a sampling parameter value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on a skip sampling function and the number of hops, and determining second data to be sampled; sampling the second to-be-sampled data based on the interval sampling function and the spacing value, and determining third to-be-sampled data; and sampling the third to-be-determined sampling data based on the number limit sampling function and the sampling number threshold value to determine the sampling data.
That is, in this embodiment, based on the data order of the data to be processed, the data whose order comes after the skip number may be determined as the second to-be-sampled data (this is the sampling performed by the skip sampling function). The second to-be-sampled data is then sampled based on the interval sampling function and the spacing value to determine the third to-be-sampled data, namely the data taken from the second to-be-sampled data at intervals of the spacing value, including the first item in order of the second to-be-sampled data. The third to-be-sampled data is then sampled based on the number-limited sampling function and the sampling number threshold to determine the sampled data.
The number-limited sampling here works as it does on its own; the difference lies in the data it operates on, which here is the third to-be-sampled data.
For example, the data to be processed includes 10 items, in order A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10; the skip number is 1, the spacing value is 2, and the sampling number threshold is 4. The data to be processed is first sampled using the skip sampling function and the skip number, that is, the first 1 item in order is skipped, so the second to-be-sampled data includes A2, A3, A4, A5, A6, A7, A8, A9 and A10. Interval sampling is then performed on the second to-be-sampled data using the interval sampling function and the spacing value, giving third to-be-sampled data that includes A2, A4, A6, A8 and A10. Finally, the third to-be-sampled data is sampled using the number-limited sampling function, that is, the first items in order in the third to-be-sampled data, up to the sampling number threshold, are selected as the sampled data, so the sampled data includes A2, A4, A6 and A8.
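The worked example above can be reproduced with list slicing. The variable names are chosen for this sketch only; the parameter values (skip number 1, spacing value 2, sampling number threshold 4) are those of the example.

```python
# Data to be processed: A1 .. A10, in order.
data = [f"A{i}" for i in range(1, 11)]

skip_number = 1
spacing_value = 2
sampling_number_threshold = 4

# Skip sampling: drop the first `skip_number` items.
second = data[skip_number:]

# Interval sampling: keep the first item and every `spacing_value`-th item after it.
third = second[::spacing_value]

# Number-limited sampling: keep at most the first `sampling_number_threshold` items.
sampled = third[:sampling_number_threshold]

print(second)
print(third)
print(sampled)
```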
When the sampling functions include the skip sampling function, the interval sampling function and the number-limited sampling function, their order can be changed freely: based on actual requirements, sampling can be performed with any ordered combination of these three functions. Each sampling function samples in the same way as when it is used alone; because the sampling order differs, each function operates on different data, and the sampling results obtained may differ.
Sampling is a common early-stage data processing technique for data analysis and mining. Generally, the full data set (i.e., the data to be processed) is very large, so running analysis on the full data not only consumes a large amount of resources but also significantly increases the analysis time, and may even crash the system during analysis. Sampling the full data and analyzing only the sampled data saves resources, shortens the analysis time and reduces the possibility of a system crash. In addition, one or more attributes of the sampled data allow the characteristics of the full data to be evaluated with a certain degree of reliability, so that the full data can be understood. After sampling is completed, the sampled data may be applied to various actual scenarios: for example, the sampled data may be analyzed to determine its characteristics, those characteristics may be applied to a data classification scenario, the sampled data may be used for software testing, and so on.
The sampled data may be required to be different at different stages (e.g., development or testing stages) or based on different requirements. The sampling function has different orders in the sampling expression and different sampling processes, so that the obtained sampling data has different results. Thus, sampling expressions with different sampling functions can be configured, and the order of the sampling functions in the sampling expressions can be changed to obtain different sampling data. Based on different requirements, sampling expressions of different sampling function sequences can be configured, sampling is carried out by using the sampling functions in different sequences, and different sampling data can be provided to meet different requirements.
For example, in the sampling described above, which is performed in the order of the skip sampling function, the interval sampling function and the number-limited sampling function, the skip sampling function is applied first: the first items in order in the data to be processed, up to the skip number, are skipped, which prevents them from being extracted, reduces the influence of early data on the sampled data as a whole, and reduces the data volume. Interval sampling is then performed on the data obtained by skip sampling, which avoids excessive concentration of the data and improves the uniformity of the sampled data. Finally, on the basis of the data obtained by interval sampling, sampling is performed with the number-limited sampling function to limit the amount of sampled data and reduce the data volume. Sampling the data to be processed in this order therefore reduces the unstable data at the front of the data to be processed, ensures data uniformity and reduces the data volume.
As another example, when sampling is performed in the order of the skip sampling function, the number-limited sampling function and the interval sampling function, the sampling order differs from the one described above. The first items in order in the data to be processed, up to the skip number, are skipped first, which prevents them from being extracted, reduces the influence of unstable early data on the sampled data as a whole, and reduces the data volume. Then, on the basis of the data obtained by skip sampling, number-limited sampling (rather than interval sampling) is performed, that is, the amount of data obtained by skip sampling is limited by extracting the first items in order, at most the sampling number threshold (when the amount of data obtained by skip sampling is greater than or equal to the sampling number threshold, the threshold number of items is extracted; otherwise, all of the data obtained by skip sampling is extracted). This realizes the second sampling pass and further reduces the data volume. Finally, interval sampling is performed, that is, on the basis of the data obtained by number-limited sampling, items are selected at intervals of the spacing value starting from the first item, to ensure the uniformity of the sampled data.
That is, in this embodiment, the data after the number-limited sampling is subjected to interval extraction, rather than the data obtained after the skip sampling, so that different sampling results can be obtained, that is, sampling in different orders is performed through different orders of sampling functions, and various sampling data can be obtained, and the diversity of the sampling data can be improved to meet different requirements.
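To illustrate how the order of the same sampling functions changes the result, the sketch below reuses the values of the earlier ten-item example (skip number 1, spacing value 2, sampling number threshold 4) and only swaps the last two passes. The slicing realization and the names are assumptions for illustration.

```python
data = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
SKIP, SPACING, THRESHOLD = 1, 2, 4

# Order 1: skip, then interval, then number limit.
interval_then_limit = data[SKIP:][::SPACING][:THRESHOLD]

# Order 2: skip, then number limit, then interval.
limit_then_interval = data[SKIP:][:THRESHOLD][::SPACING]

print(interval_then_limit)
print(limit_then_interval)
```

In the second order the number limit cuts the data down to A2..A5 before interval sampling runs, so fewer items remain at the end.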
In one embodiment, the sampling functions include a skip sampling function, an interval sampling function, and a random sampling function, the order of the skip sampling function is first, the order of the random sampling function is last, the order of the interval sampling function is between the order of the skip sampling function and the order of the random sampling function, the sampling parameter values corresponding to the skip sampling function are skip numbers, the sampling parameter values corresponding to the random sampling function are random probability values, and the sampling parameter values corresponding to the interval sampling function are pitch values.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on a skip sampling function and the number of hops, and determining second data to be sampled; sampling the second to-be-sampled data based on the interval sampling function and the spacing value, and determining third to-be-sampled data; and sampling the third to-be-sampled data based on a random sampling function and a random probability value to determine sampled data.
That is, in the present embodiment, data whose data order is after the number of hops in the data to be processed may be determined as the second data to be sampled based on the data order of the data to be processed. And sampling the second to-be-sampled data based on the interval sampling function and the interval value, and determining third to-be-sampled data (namely the data with the interval value in the second to-be-sampled data and including the data with the first sequence in the second to-be-sampled data). And then sampling the third to-be-sampled data based on a random sampling function and a random probability value to determine sampled data.
Therefore, the data that comes early in order in the data to be processed can be skipped, which avoids its influence on the overall sampling result. After skip sampling, sampling is performed with the interval sampling function to ensure that the data is evenly distributed and follows the pattern of the full data more closely. Finally, sampling is performed with the random sampling function: the sampling process traverses all the data obtained by interval sampling, and whether each item is sampled is unpredictable, which further ensures that the sampled data is evenly distributed.
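A minimal sketch of the skip, interval, random order follows. It assumes that random sampling keeps each item independently with the given random probability value, which is one plausible reading of the text rather than a confirmed detail; the helper name random_sample, the data and the parameter values are hypothetical.

```python
import random


def random_sample(items, probability, rng):
    """Traverse every item and keep it with the given probability (assumed behaviour)."""
    return [item for item in items if rng.random() < probability]


data = list(range(1, 101))  # hypothetical data to be processed: 1..100
rng = random.Random(0)      # seeded only so that repeated runs give the same output

second = data[10:]          # skip sampling, skip number 10
third = second[::5]         # interval sampling, spacing value 5
sampled = random_sample(third, 0.3, rng)  # random sampling, probability 0.3

print(third)
print(sampled)
```

Whatever the random outcome, the sampled data is an order-preserving subset of the interval-sampled data and never contains a skipped item.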
When the sampling functions include the skip sampling function, the interval sampling function and the random sampling function, their order can be changed freely: based on actual requirements, sampling can be performed with any ordered combination of these three functions. Each sampling function samples in the same way as when it is used alone; because the sampling order differs, each function operates on different data, and the sampling results obtained may differ, so that different requirements can be met.
In one embodiment, the sampling functions include a number-limited sampling function, an interval sampling function, and a random sampling function, the number-limited sampling function is in a first order, the interval sampling function is in a middle order (i.e., the order is between the order of the number-limited sampling function and the order of the random sampling function), the random sampling function is in a last order, the sampling parameter value corresponding to the number-limited sampling function is a sampling number threshold, the sampling parameter value corresponding to the interval sampling function is a spacing value, and the sampling parameter value corresponding to the random sampling function is a random probability value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling the data to be processed based on the number-limited sampling function and the sampling number threshold, and determining second data to be sampled; sampling the second to-be-sampled data based on the interval sampling function and the spacing value, and determining third to-be-sampled data; and sampling the third to-be-sampled data based on a random sampling function and a random probability value to determine sampled data.
That is, in this embodiment, based on the data order of the data to be processed, the first items in order in the data to be processed, up to the sampling number threshold, may be determined as the second to-be-sampled data. The second to-be-sampled data is then sampled based on the interval sampling function and the spacing value to determine the third to-be-sampled data, namely the data taken from the second to-be-sampled data at intervals of the spacing value, including the first item in order of the second to-be-sampled data. The third to-be-sampled data is then sampled based on the random sampling function and the random probability value to determine the sampled data.
Therefore, it can be ensured that the data volume is not too large and stays within the sampling number threshold, which reduces the data volume. After number-limited sampling, sampling is performed with the interval sampling function to ensure that the data is evenly distributed and follows the pattern of the full data more closely. Finally, sampling is performed with the random sampling function: the sampling process traverses all the data obtained by interval sampling, and whether each item is sampled is unpredictable, which further ensures that the sampled data is evenly distributed. That is, uniform data distribution is taken into account while the data volume is kept in check, so that the data volume is reduced and data uniformity is ensured.
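The bound on data volume in this order (number limit, interval, random) can be made concrete with a short sketch. The data, parameter values and the per-item retention model of random sampling are assumptions for illustration.

```python
import random

data = list(range(1, 1001))  # hypothetical data to be processed: 1..1000
threshold, spacing, probability = 100, 4, 0.5
rng = random.Random(7)       # seeded only for repeatable output

second = data[:threshold]    # number-limited sampling: at most 100 items
third = second[::spacing]    # interval sampling: every 4th item, first included
sampled = [x for x in third if rng.random() < probability]  # random sampling

print(len(second), len(third), len(sampled))
```

Because the number limit runs first, the later passes never look beyond the first 100 items, and the final sample can contain at most 25 of them regardless of the random outcome.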
When the sampling functions include the number-limited sampling function, the interval sampling function and the random sampling function, their order can be changed freely: based on actual requirements, sampling can be performed with any ordered combination of these three functions. Each sampling function samples in the same way as when it is used alone; because the sampling order differs, each function operates on different data, and the sampling results obtained may differ.
In one embodiment, the sampling functions include a skip sampling function, a quantity-limited sampling function, and a random sampling function, the skip sampling function having a first order, the quantity-limited sampling function having an intermediate order (i.e., an order between the order of the skip sampling function and the order of the random sampling function), the random sampling function having a last order, the quantity-limited sampling function having a corresponding sampling parameter value as a sampling quantity threshold, the skip sampling function having a corresponding sampling parameter value as a skip number, and the random sampling function having a corresponding sampling parameter value as a random probability value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling data to be processed based on a skip sampling function and the number of hops, and determining second data to be sampled; sampling the second to-be-sampled data based on a number limit sampling function and a sampling number threshold, and determining third to-be-sampled data; and sampling the third to-be-sampled data based on a random sampling function and a random probability value to determine sampled data.
That is, in the present embodiment, based on the data order of the data to be processed, the data whose order comes after the skip number may be determined as the second to-be-sampled data. The first items in order in the second to-be-sampled data, up to the sampling number threshold, are determined as the third to-be-sampled data. The third to-be-sampled data is then sampled based on the random sampling function and the random probability value to determine the sampled data.
Therefore, the data that comes early in order in the data to be processed can be skipped, which avoids its influence on the overall sampling result. After skip sampling, sampling is performed with the number-limited sampling function to ensure that the data volume is not too large and stays within the sampling number threshold, which reduces the data volume. Finally, sampling is performed with the random sampling function: the sampling process traverses all the data obtained by number-limited sampling, and whether each item is sampled is unpredictable, which further ensures that the sampled data is evenly distributed.
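In the skip, number limit, random order, the first two passes fix a window of candidate items and random sampling then thins that window. Under the assumption that each item is kept independently with the random probability value, the sample size varies from run to run but averages the window size times the probability. The sketch below checks this empirically; all names and values are hypothetical.

```python
import random


def random_sample(items, probability, rng):
    """Keep each item independently with the given probability (assumed behaviour)."""
    return [item for item in items if rng.random() < probability]


data = list(range(1, 51))   # hypothetical data to be processed: 1..50
window = data[5:][:10]      # skip number 5, then sampling number threshold 10

rng = random.Random(42)     # seeded only for repeatable output
sizes = [len(random_sample(window, 0.5, rng)) for _ in range(2000)]
mean_size = sum(sizes) / len(sizes)

print(window)
print(min(sizes), max(sizes), round(mean_size, 2))
```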
When the sampling functions include the skip sampling function, the number-limited sampling function and the random sampling function, their order can be changed freely: based on actual requirements, sampling can be performed with any ordered combination of these three functions. Each sampling function samples in the same way as when it is used alone; because the sampling order differs, each function operates on different data, and the sampling results obtained may differ.
In one embodiment, when the number of the sampling functions is 4, that is, the sampling functions include a skip sampling function, a number limit sampling function, an interval sampling function and a random sampling function, the data to be processed is sampled based on the sampling function with the first order, the fourth data to be sampled is determined, the fourth data to be sampled is sampled based on the sampling function with the second order, the fifth data to be sampled is determined, the fifth data to be sampled is sampled based on the sampling function with the third order, the sixth data to be sampled is determined, and the sixth data to be sampled is sampled based on the sampling function with the last order, and the sampled data is determined.
The following description takes some of the ordered combinations of the four sampling functions as examples (there are 24 ordered combinations of the skip sampling function, the number-limited sampling function, the interval sampling function and the random sampling function).
For example, in one embodiment, the sampling functions include a skip sampling function, an interval sampling function, a random sampling function, and a quantity limit sampling function, the skip sampling function has a first order, the interval sampling function has an order between the skip sampling function and the random sampling function, the random sampling function has an order between the interval sampling function and the quantity limit sampling function, the quantity limit sampling function has a last order, the skip sampling function has a skip number corresponding to a sampling parameter value, the interval sampling function has a pitch value corresponding to a sampling parameter value, the random sampling function has a random probability value corresponding to a sampling parameter value, and the quantity limit sampling function has a sampling quantity threshold value corresponding to a sampling parameter value.
In this embodiment, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampling data includes: sampling the data to be processed based on the skip sampling function and the skip number, and determining fourth data to be sampled; sampling the fourth to-be-determined sampling data based on the interval sampling function and the spacing value, and determining fifth to-be-determined sampling data (including the data ranked first in the fourth to-be-determined sampling data); sampling the fifth undetermined sampling data based on a random sampling function and a random probability value, and determining the sixth undetermined sampling data; and sampling the sixth to-be-sampled data based on the number limit sampling function and the sampling number threshold value to determine the sampled data.
That is, in this embodiment, based on the data order of the data to be processed, the data whose order comes after the skip number may be determined as the fourth to-be-sampled data. On the basis of the fourth to-be-sampled data, sampling is performed with the interval sampling function and the spacing value, then with the random sampling function and the random probability value, and finally with the number-limited sampling function and the sampling number threshold, to obtain the sampled data.
In this embodiment, the amount of sampled data is limited to the sampling number threshold, that is, the number of sampled items is less than or equal to the sampling number threshold. For example, the data to be processed includes 10 items, in order A1, A2, A3, A4, A5, A6, A7, A8, A9 and A10; the skip number is 2, the spacing value is 2, the random probability value is 0.5, and the sampling number threshold is 3. The first 2 items are skipped and not sampled, so the fourth to-be-sampled data includes A3, A4, A5, A6, A7, A8, A9 and A10. The fourth to-be-sampled data is sampled at intervals according to the spacing value 2, so the fifth to-be-sampled data includes A3, A5, A7 and A9. Random sampling then yields the sixth to-be-sampled data, which is a subset of the fifth to-be-sampled data. On this basis, number-limited sampling is performed: the first items in order in the sixth to-be-sampled data, up to the sampling number threshold, are taken as the sampled data. If the sixth to-be-sampled data obtained by random sampling includes A5 and A7, so that its size is less than the sampling number threshold, the sixth to-be-sampled data is taken as the sampled data, and the sampled data includes A5 and A7.
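The four-pass example above can be sketched as follows. The skip and interval passes are deterministic; the random pass is modelled here by keeping each item with probability 0.5, which is an assumption, so only the bounds on the final result are fixed. Variable names are chosen for this sketch.

```python
import random

data = [f"A{i}" for i in range(1, 11)]  # A1 .. A10
rng = random.Random()                   # unseeded: the random pass differs per run

fourth = data[2:]                       # skip sampling, skip number 2
fifth = fourth[::2]                     # interval sampling, spacing value 2
sixth = [item for item in fifth if rng.random() < 0.5]  # random sampling, 0.5
sampled = sixth[:3]                     # number-limited sampling, threshold 3

print(fourth)
print(fifth)
print(sixth)
print(sampled)
```

When the random pass keeps three or fewer items, the number limit leaves them unchanged; it only trims the result when all four interval-sampled items survive.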
Therefore, the data that comes early in order in the data to be processed can be skipped, which avoids its influence on the overall sampling result. After skip sampling, sampling is performed with the interval sampling function to ensure that the data is evenly distributed and follows the pattern of the full data more closely. Sampling is then performed with the random sampling function: the sampling process traverses all the data obtained by interval sampling, and whether each item is sampled is unpredictable, which further ensures that the sampled data is evenly distributed. Finally, sampling is performed with the number-limited sampling function to ensure that the data volume is not too large and stays within the sampling number threshold, which reduces the data volume. The sampled data obtained satisfies the sampling rules of all four sampling functions, and thus meets the sampling requirement.
When the sampling functions include the skip sampling function, the number-limited sampling function, the interval sampling function and the random sampling function, their order can be changed freely: based on actual requirements, sampling can be performed with any ordered combination of these four sampling functions, and each sampling function samples in the same way as when it is used alone.
In one embodiment, the method further includes the steps of: and when the sampling function meets the preset condition, sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and determining the sampled data, otherwise, giving out error reporting prompt information.
The sampling expression configured by the user may contain mistakes, that is, the configured sampling function may not satisfy the preset condition, in which case the sampling process cannot run normally. An error prompt can then be given to tell the user that the configured sampling expression is wrong, and the sampling expression can be reconfigured. When the sampling function in the sampling expression satisfies the preset condition, the step of sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampled data is executed.
In one embodiment, the preset condition may include belonging to the preset sampling functions, that is, when the sampling function is one of the preset sampling functions, the sampling function satisfies the preset condition.
That is, a sampling function obtained by parsing may not be one of the sampling functions preset in the server. In that case, even though the sampling function has been parsed out, the corresponding sampling operation cannot be executed, that is, the sampling process cannot run normally. An error prompt can then be given to tell the user that the configured sampling expression is wrong so that the sampling expression can be reconfigured. When the sampling functions in the sampling expression all belong to the preset sampling functions, the step of sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampled data is executed.
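The preset-condition check can be sketched as a membership test against the set of supported function names. The names skip, limit and sample appear in the example expressions of this description; the name random, the helper name and the wording of the error prompt are assumptions made for this sketch.

```python
# Assumed set of preset sampling function names ("random" is hypothetical).
PRESET_FUNCTIONS = {"skip", "limit", "sample", "random"}


def check_sampling_functions(function_names):
    """Return None when every parsed function is preset, else an error prompt."""
    unknown = [name for name in function_names if name not in PRESET_FUNCTIONS]
    if unknown:
        return "Error: unsupported sampling function(s): " + ", ".join(unknown)
    return None


ok = check_sampling_functions(["skip", "limit"])
error = check_sampling_functions(["skip", "shuffle"])

print(ok)
print(error)
```

Sampling would proceed only when the check returns no error; otherwise the prompt is shown and the user reconfigures the expression.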
The above data processing procedure is described in detail with an embodiment.
Please refer to Fig. 3, which is a schematic diagram of a conventional sampling method. The data sources include a database, a message queue, a network interface and the like; article content is stored in the database, commodity information is stored in the message queue, and the network interface corresponds to network interface data. Conventionally, sampling is performed in an intrusive manner: sampling code has to be written each time sampling is performed, and it is added to the data input code, which means the data input code is modified. The input data to be processed is sampled by the intrusive sampling code to obtain the sampled data. The sampled data can then undergo subsequent data processing such as filtering and statistics to obtain a statistical result. For example, a report on the commodity information can be obtained by compiling statistics on the commodity information.
Fig. 4 is a schematic diagram of sampling with the data processing method according to an embodiment of the present application. In the overall data processing scheme, the data flow framework and the business logic layer are separated. The data flow framework is a stable framework that is responsible for connecting the services in series and providing basic services. The business logic layer varies and is responsible for performing the actual work according to the usage scenario.
The data flow framework includes a data input layer, a sampling expression engine, data processing plug-ins and a data output layer. The data input layer reads data from the data source to realize data input. The sampling expression engine is responsible for parsing the configured sampling expression and applying it to the data to be processed: the engine executes the code corresponding to the sampling functions obtained by parsing, which completes the sampling without intruding into the input business logic. For example, for the sampling expression input.skip(100).limit(200).sample(2), the first 100 items of the data to be processed are skipped and the data whose order comes after the 100th item is taken as the second to-be-sampled data; 200 items are then extracted from the second to-be-sampled data as the third to-be-sampled data; and the third to-be-sampled data is sampled at intervals with a spacing value of 2 to determine the sampled data. After sampling is finished, the sampled data can be processed by the data processing plug-ins, for example by filtering and statistical processing, and the processed result is output to the business logic layer through the data output layer.
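A sampling expression engine of this kind can be sketched with a regular expression that splits the chained calls and a table that maps function names to code. This is an illustrative sketch only: the grammar (integer parameters, an input prefix), the helper names and the slicing implementations are assumptions, and random sampling with a fractional parameter is left out for brevity.

```python
import re

CALL = re.compile(r"\.(\w+)\((\d+)\)")

# Assumed implementations of the three functions used in the example expression.
SAMPLERS = {
    "skip": lambda items, n: items[n:],     # drop the first n items
    "limit": lambda items, n: items[:n],    # keep at most the first n items
    "sample": lambda items, n: items[::n],  # keep the first item, then every n-th
}


def parse_expression(expression):
    """Split 'input.f(a).g(b)...' into [(function name, parameter value), ...]."""
    text = "".join(expression.split())  # tolerate stray whitespace
    if not re.fullmatch(r"input(?:\.\w+\(\d+\))+", text):
        raise ValueError(f"cannot parse sampling expression: {expression!r}")
    return [(name, int(value)) for name, value in CALL.findall(text)]


def run(expression, data):
    """Apply each parsed sampling function, in order, to the previous result."""
    for name, value in parse_expression(expression):
        data = SAMPLERS[name](data, value)  # raises KeyError for unknown names
    return data


steps = parse_expression("input.skip(100).limit(200).sample(2)")
result = run("input.skip(100).limit(200).sample(2)", list(range(1, 1001)))

print(steps)
print(len(result), result[:3], result[-1])
```

Because the expression is plain text interpreted by the engine, changing the sampling behaviour means editing the expression rather than the data input code.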
Please refer to fig. 5, which is a diagram illustrating an interface for configuring an expression. The user may input the sampling expression in the expression input box of the expression configuration interface, for example, the user may enter the sampling expression of input. The sampling expression engine may parse the sampling expression to obtain the sampling functions and the corresponding sampling parameter values. In this example, the sampling functions obtained by parsing include the skip sampling function skip and the number limit sampling function limit. Referring to fig. 6, a schematic diagram of combined sampling using a skip sampling function and a number limit sampling function is shown. For skip(5) and limit(11), the first 5 items of the data to be processed are skipped, and then the 6th to 16th items are selected as the sampled data.
In fig. 5, the configured sampling expression has a skip number of 600 for the skip sampling function skip and a sampling number threshold of 300 for the number limit sampling function limit. The data to be processed obtained from the data source is then sampled as follows: the first 600 items of the data to be processed are skipped, and the sampling corresponding to the number limit sampling function limit starts from the 601st item; for example, the first 300 items after the 600th item are selected as the sampled data.
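The skip(5) and limit(11) example can be reproduced with a short Python sketch. The helper names skip and limit follow the function names in the text, but implementing them with itertools.islice is an assumption, and the 25-item input is made up for illustration.

```python
from itertools import islice


def skip(items, count):
    """Drop the first `count` items and pass the rest through lazily."""
    return islice(items, count, None)


def limit(items, threshold):
    """Keep at most the first `threshold` items."""
    return islice(items, threshold)


# Items are numbered from 1 so the values equal their position in data order.
data = range(1, 26)
sampled = list(limit(skip(data, 5), 11))
```

Here sampled holds the 6th to 16th items, which matches the result described for fig. 6.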
Based on different requirements, the sampling expression can be modified freely, so that different sampling processes are executed and different sampled data are obtained. Referring to fig. 7, a schematic diagram of sampling with an interval sampling function is shown. The interval sampling function sample extracts data at intervals of the spacing value. For example, in fig. 7 the sampling expression is sample(4) and the spacing value is 4, so one item is extracted from the data to be processed as sampled data at each interval of the spacing value.
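A possible reading of interval sampling is sketched below in Python. The text states that the sampled data includes the first item in data order; the interpretation that a spacing value of 4 means keeping every fourth item (positions 1, 5, 9, and so on) is an assumption, since fig. 7 is not reproduced here.

```python
def sample(items, spacing):
    """Yield the first item, then every `spacing`-th item after it."""
    if spacing < 1:
        raise ValueError("spacing must be a positive integer")
    for index, item in enumerate(items):
        # index 0 is the first item in data order, so it is always kept.
        if index % spacing == 0:
            yield item


picked = list(sample(range(1, 26), 4))
```

With 25 input items numbered from 1, this sketch keeps items 1, 5, 9, 13, 17, 21 and 25.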
Fig. 8 is a schematic diagram of sampling with a random sampling function. The random sampling function random extracts data at random. For example, in fig. 8 the sampling expression is random(0.5) and the random probability value is 0.5, so the data to be processed is randomly extracted according to the random probability value to determine the sampled data. In fig. 8, the sampled data randomly sampled by the random sampling function random includes the 2nd, 3rd, 6th, 8th, 10th to 12th, 14th, 19th, 21st to 22nd, and 25th data.
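Random sampling by probability could be sketched as follows in Python. The text does not say how the random probability value is applied; keeping each item independently with that probability is an assumption, and the name random_sample and the optional rng parameter are hypothetical.

```python
import random


def random_sample(items, probability, rng=None):
    """Keep each item independently with the given probability (assumed rule)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    for item in items:
        # rng.random() is in [0.0, 1.0), so 1.0 keeps everything and 0.0 nothing.
        if rng.random() < probability:
            yield item


# A seeded generator makes a run repeatable; the selection is still random.
picked = list(random_sample(range(1, 26), 0.5, random.Random(0)))
```

Under this rule the number of sampled items varies from run to run and is only about half of the input on average, which is consistent with the uneven selection described for fig. 8.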
When data sampling is performed, only one sampling expression needs to be configured for the input data (that is, the data to be processed), and the server executes the code corresponding to the sampling functions in the sampling expression to complete sampling. If the sampling strategy changes, only the sampling expression needs to be modified; for example, when the full amount of data (that is, all of the data to be processed) is required, full sampling can be realized simply by modifying the sampling expression.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a data processing apparatus including:
a data obtaining module 910, configured to obtain data to be processed;
an expression obtaining module 920, configured to obtain a sampling expression;
the analyzing module 930 is configured to analyze the sampling expression to obtain a sampling function and a sampling parameter value corresponding to the sampling function;
and a sampling module 940, configured to sample the data to be processed based on the sampling function and the corresponding sampling parameter value, and determine sampled data.
In one embodiment, the expression obtaining module is configured to receive a sampling expression obtained in response to an interactive operation on an expression input box in an expression configuration interface.
In one embodiment, the data obtaining module is configured to read each data source through the iterator to obtain the data to be processed.
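Reading several data sources through an iterator might look like the Python sketch below. The generator-based sources and the names read_sources and source are hypothetical stand-ins for a database, a message queue and so on; the point illustrated is only that records are pulled one at a time as they are needed.

```python
from itertools import chain


def read_sources(sources):
    """Lazily yield records from each source in turn."""
    return chain.from_iterable(sources)


reads = []  # records every item actually pulled from a source


def source(name, records):
    """Stand-in for a real data source; logs each read for demonstration."""
    for record in records:
        reads.append((name, record))
        yield record


stream = read_sources([source("db", ["a1", "a2"]), source("mq", ["b1"])])
first = next(stream)  # only one record has been read at this point
```

Because the stream is lazy, a later sampling stage that stops early never causes the remaining records to be read.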
In one embodiment, the sampling function comprises a skip sampling function, and the sampling parameter value corresponding to the skip sampling function is a skip number;
and the sampling module is used for determining, based on the data order of the data to be processed, the data whose data order is after the skip number in the data to be processed as the sampled data.
In one embodiment, the sampling function includes a number limit sampling function, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold;
and the sampling module is used for determining, based on the data order of the data to be processed, the first items in data order, up to the sampling number threshold, as the sampled data.
In one embodiment, the sampling function includes an interval sampling function, and the sampling parameter value corresponding to the interval sampling function is a spacing value;
the sampling module is used for determining, based on the data order of the data to be processed, the data separated by the spacing value in the data to be processed as the sampled data, and the sampled data includes the data that is first in data order in the data to be processed.
In one embodiment, the sampling function comprises a random sampling function, and the value of the sampling parameter corresponding to the random sampling function is a random probability value;
and the sampling module is used for randomly sampling the data to be processed based on the random probability value and determining the sampled data.
In one embodiment, the sampling function includes at least any one of a skip sampling function, a number-limited sampling function, an interval sampling function, and a random sampling function.
In one embodiment, the sampling function includes a skip sampling function and a number limit sampling function, the order of the skip sampling function precedes the order of the number limit sampling function, the sampling parameter value corresponding to the skip sampling function is a skip number, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold.
In this embodiment, the sampling module is configured to sample the data to be processed based on the skip sampling function and the skip number to determine first data to be sampled, and to sample the first data to be sampled based on the number limit sampling function and the sampling number threshold to determine the sampled data.
In one embodiment, the sampling module is configured to, when more items than the skip number remain unextracted in the data to be processed, determine the data whose data order is after the skip number as the first data to be sampled, and to extract the sampled data from the first data to be sampled, wherein the number of sampled data items is less than or equal to the sampling number threshold.
In one embodiment, the sampling module is configured to: initialize a sampling count to zero; based on the data order of the data to be processed and of the first data to be sampled, select the first item of the first data to be sampled (that is, the first item after the skip number) as the current processing data; when the sampling count is smaller than the sampling number threshold and the current processing data meets a preset requirement, determine the current processing data as sampled data and add one to the sampling count; when data exists after the current processing data in the first data to be sampled, take the adjacent item after the current processing data (that is, the next item) as the current processing data, and return to the step of determining the current processing data as sampled data when the sampling count is smaller than the sampling number threshold and the current processing data meets the preset requirement, until the sampling count equals the sampling number threshold or no data exists after the current processing data in the first data to be sampled (that is, the number of items after the current processing data is zero). In addition, when the current processing data does not meet the preset requirement, the current processing data is discarded; that is, when data exists after the current processing data in the first data to be sampled, the adjacent item after the current processing data is taken as the new current processing data.
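The loop described above can be sketched in Python as follows. The function name limit_with_requirement and the shape of the meets_requirement callback are assumptions. The example requirement caps the number of sampled items of the same type, which is the kind of preset requirement recited in claim 1; the record layout and the helper at_most_per_type are hypothetical.

```python
def limit_with_requirement(candidates, threshold, meets_requirement):
    """Collect up to `threshold` items that satisfy the preset requirement."""
    sampled = []  # the sampling count is len(sampled) and starts at zero
    if threshold <= 0:
        return sampled
    for current in candidates:  # current processing data, in data order
        if meets_requirement(current, sampled):
            sampled.append(current)  # count goes up by one
            if len(sampled) == threshold:
                break  # threshold reached, stop reading further data
        # otherwise the current item is discarded and the next one is taken
    return sampled


def at_most_per_type(max_per_type, type_of):
    """Requirement: fewer than `max_per_type` sampled items share this type."""
    def requirement(current, sampled):
        same = sum(1 for item in sampled if type_of(item) == type_of(current))
        return same < max_per_type
    return requirement


records = [("book", 1), ("book", 2), ("book", 3),
           ("toy", 4), ("toy", 5), ("food", 6)]
picked = limit_with_requirement(records, 4, at_most_per_type(2, lambda r: r[0]))
```

In this run the third book record is discarded because two book records are already sampled, and the loop stops after the second toy record because the threshold of 4 is reached.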
In one embodiment, the sampling function includes an interval sampling function and a number limit sampling function, the order of the interval sampling function precedes the order of the number limit sampling function, the sampling parameter value corresponding to the interval sampling function is a spacing value, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold.
In this embodiment, the sampling module is configured to sample data to be processed based on an interval sampling function, determine first data to be sampled, sample the first data to be sampled based on a number limit sampling function, and determine sampled data.
In one embodiment, the sampling module is used for determining, based on the data order of the data to be processed, the data separated by the spacing value in the data to be processed as the first data to be sampled, and for taking the data that comes first in data order in the first data to be sampled as the sampled data. The number of sampled data items is less than or equal to the sampling number threshold, and the sampled data includes the data that is first in data order in the data to be processed.
In one embodiment, the sampling function includes a random sampling function and a number-limited sampling function, the order of the random sampling function precedes the order of the number-limited sampling function, the value of the sampling parameter corresponding to the random sampling function is a random probability value, and the value of the sampling parameter corresponding to the number-limited sampling function is a sample number threshold.
In this embodiment, the sampling module is configured to perform random sampling on the data to be processed based on the random sampling function and the random probability value to determine first data to be sampled, and to sample the first data to be sampled based on the number limit sampling function and the sampling number threshold to determine the sampled data, wherein the number of sampled data items is less than or equal to the sampling number threshold.
In one embodiment, the sampling functions include a skip sampling function and an interval sampling function, the order of the skip sampling function precedes the order of the interval sampling function, the sampling parameter value corresponding to the skip sampling function is a skip number, and the sampling parameter value corresponding to the interval sampling function is a spacing value.
In this embodiment, the sampling module is configured to sample data to be processed based on a skip sampling function and a skip number, and determine first data to be sampled; the first to-be-sampled data is sampled based on an interval sampling function to determine sampled data.
In one embodiment, the sampling module is configured to determine, based on the data order of the data to be processed, the data whose data order is after the skip number in the data to be processed as the first data to be sampled, and then to determine, based on the data order of the first data to be sampled, the data separated by the spacing value in the first data to be sampled as the sampled data, wherein the sampled data includes the data that is first in data order in the first data to be sampled.
In one embodiment, the sampling functions include a skip sampling function and a random sampling function, the order of the skip sampling function precedes the order of the random sampling function, the value of the sampling parameter corresponding to the skip sampling function is a skip number, and the value of the sampling parameter corresponding to the random sampling function is a random probability value.
In this embodiment, the sampling module is configured to sample data to be processed based on a skip sampling function and a skip number, and determine first data to be sampled; and sampling the first to-be-determined sampling data based on the random sampling function and the random probability value to determine the sampling data.
In one embodiment, the sampling module is used for determining data of which the data sequence is after the jump number in the data to be processed as first to-be-sampled data based on the data sequence of the data to be processed; and randomly sampling the first to-be-sampled data based on the random sampling function and the random probability value to determine the sampled data.
In one embodiment, the sampling function includes an interval sampling function and a random sampling function, the order of the interval sampling function precedes the order of the random sampling function, the sampling parameter value corresponding to the interval sampling function is a spacing value, and the sampling parameter value corresponding to the random sampling function is a random probability value.
In this embodiment, the sampling module is configured to sample the data to be processed based on the interval sampling function and the spacing value to determine first data to be sampled, and to sample the first data to be sampled based on the random sampling function and the random probability value to determine the sampled data.
In one embodiment, the sampling functions include a skip sampling function, an interval sampling function, and a number limit sampling function, the skip sampling function being first in order, the interval sampling function being second in order (that is, between the skip sampling function and the number limit sampling function), and the number limit sampling function being last in order; the sampling parameter value corresponding to the skip sampling function is a skip number, the sampling parameter value corresponding to the interval sampling function is a spacing value, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold.
In this embodiment, the sampling module is configured to sample data to be processed based on a skip sampling function and a skip number, and determine second data to be sampled; sampling the second to-be-sampled data based on the interval sampling function and the spacing value, and determining third to-be-sampled data; and sampling the third to-be-determined sampling data based on the number limit sampling function and the sampling number threshold value to determine the sampling data.
In one embodiment, the sampling module is configured to determine, as the second data to be sampled (that is, a process of sampling based on a skip sampling function), data in the data to be processed, which is in the data order after the number of hops. And sampling the second to-be-sampled data based on the interval sampling function and the interval value, and determining third to-be-sampled data (namely the data with the interval value in the second to-be-sampled data and including the data with the first sequence in the second to-be-sampled data). And then sampling the third to-be-determined sampling data based on the number limit sampling function and the sampling number threshold value to determine the sampling data.
In one embodiment, the sampling functions include a skip sampling function, an interval sampling function and a random sampling function, the order of the skip sampling function is first, the order of the random sampling function is last, the order of the interval sampling function is between the order of the skip sampling function and the order of the random sampling function, the sampling parameter value corresponding to the skip sampling function is a skip number, the sampling parameter value corresponding to the interval sampling function is a spacing value, and the sampling parameter value corresponding to the random sampling function is a random probability value.
In this embodiment, the sampling module is configured to: sample the data to be processed based on the skip sampling function and the skip number to determine second data to be sampled; sample the second data to be sampled based on the interval sampling function and the spacing value to determine third data to be sampled; and sample the third data to be sampled based on the random sampling function and the random probability value to determine the sampled data.
In one embodiment, the sampling module is configured to determine, as the second data to be sampled, data in the data to be processed, which is subsequent to the number of hops in the data order, based on the data order of the data to be processed. And sampling the second to-be-sampled data based on the interval sampling function and the interval value, and determining third to-be-sampled data (namely the data with the interval value in the second to-be-sampled data and including the data with the first sequence in the second to-be-sampled data). And then sampling the third to-be-sampled data based on a random sampling function and a random probability value to determine sampled data.
In one embodiment, the sampling functions include a skip sampling function, an interval sampling function, a random sampling function, and a number limit sampling function, the order of the skip sampling function is first, the order of the interval sampling function is between the order of the skip sampling function and the order of the random sampling function, the order of the random sampling function is between the order of the interval sampling function and the order of the number limit sampling function, the order of the number limit sampling function is last, the sampling parameter value corresponding to the skip sampling function is a skip number, the sampling parameter value corresponding to the interval sampling function is a spacing value, the sampling parameter value corresponding to the random sampling function is a random probability value, and the sampling parameter value corresponding to the number limit sampling function is a sampling number threshold.
In this embodiment, the sampling module is configured to: sample the data to be processed based on the skip sampling function and the skip number to determine fourth data to be sampled; sample the fourth data to be sampled based on the interval sampling function and the spacing value to determine fifth data to be sampled (including the data ranked first in the fourth data to be sampled); sample the fifth data to be sampled based on the random sampling function and the random probability value to determine sixth data to be sampled; and sample the sixth data to be sampled based on the number limit sampling function and the sampling number threshold to determine the sampled data.
In one embodiment, the sampling module is configured to determine, based on the data order of the data to be processed, the data whose data order is after the skip number in the data to be processed as the fourth data to be sampled; the fourth data to be sampled is then sampled through the interval sampling function and the spacing value, then through the random sampling function and the random probability value, and finally through the number limit sampling function and the sampling number threshold, to obtain the sampled data.
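The four-stage chain (skip, then interval, then random, then number limit) can be sketched as a sequence of lazy stages in Python. Everything here is illustrative: the STAGES table, the apply_steps function and the stage implementations are assumptions, the interval stage assumes that every spacing-th item is kept starting from the first, the random stage assumes an independent keep decision per item, and the limit stage omits the preset requirement for brevity.

```python
import random
from itertools import islice


def _skip(items, count, rng):
    return islice(items, count, None)


def _sample(items, spacing, rng):
    # Keeps the first item, then every `spacing`-th item after it (assumed).
    return (x for i, x in enumerate(items) if i % spacing == 0)


def _random(items, probability, rng):
    # Keeps each item independently with the given probability (assumed).
    return (x for x in items if rng.random() < probability)


def _limit(items, threshold, rng):
    return islice(items, threshold)


STAGES = {"skip": _skip, "sample": _sample, "random": _random, "limit": _limit}


def apply_steps(data, steps, rng=None):
    """Apply each (name, value) stage to the output of the previous stage."""
    rng = rng if rng is not None else random.Random()
    stream = iter(data)
    for name, value in steps:
        stream = STAGES[name](stream, value, rng)
    return list(stream)


# A probability of 1.0 keeps the example deterministic.
result = apply_steps(
    range(1, 101),
    [("skip", 10), ("sample", 3), ("random", 1.0), ("limit", 5)],
)
```

Each stage consumes the output of the previous one, so the intermediate results correspond to the fourth, fifth and sixth data to be sampled in the text.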
In one embodiment, the sampling module is configured to, when the sampling function meets a preset condition, sample the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampled data, and otherwise to issue error prompt information.
In one embodiment, the preset condition may be that the sampling function belongs to the preset sampling functions; that is, when the sampling function is one of the preset sampling functions, the sampling function satisfies the preset condition.
For specific limitations of the data processing apparatus, including limitations of the sampling module in the data processing apparatus, reference may be made to the above limitations on the data processing method, and details are not described herein again. The modules in the data processing apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, and can also be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be the server 20 in fig. 1, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the steps of the method embodiments described above.
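The preset-condition check can be sketched in Python as below. The set PRESET_FUNCTIONS lists the four function names that appear in the text's examples; the exception class, the function name check_functions and the wording of the message are hypothetical.

```python
# Function names used in the text's examples; treated here as the preset set.
PRESET_FUNCTIONS = frozenset({"skip", "limit", "sample", "random"})


class SamplingExpressionError(ValueError):
    """Raised when an expression uses a function outside the preset set."""


def check_functions(steps):
    """Return `steps` unchanged if every function name is a preset function."""
    unknown = [name for name, _ in steps if name not in PRESET_FUNCTIONS]
    if unknown:
        raise SamplingExpressionError(
            "Unsupported sampling function(s): "
            + ", ".join(unknown)
            + ". Please reconfigure the sampling expression."
        )
    return steps


try:
    check_functions([("skip", 5), ("shuffle", 1)])
    error_message = None
except SamplingExpressionError as exc:
    error_message = str(exc)
```

Running the check before any sampling code executes lets the error prompt reach the user before data is read.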
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (11)

1. A data processing method is applied to a data stream framework comprising a data input layer, a sampling expression engine, a data processing plug-in and a data output layer, and comprises the following steps:
the data input layer is used for traversing each data source based on the iterator and reading data in each data source sequence as required to obtain data to be processed;
the sampling expression engine is used for receiving a sampling expression which is transmitted by a browser and acquired in response to interactive operation on an expression input box in an expression configuration interface based on a B/S mode, the sampling expression corresponds to data to be processed and is used for representing conditions for sampling the data to be processed, and the sampling expression comprises a sampling function and a sampling parameter value, wherein the sampling function is a function for realizing a sampling function, and the sampling parameter value is a parameter value provided in a sampling process;
the sampling expression engine is also used for analyzing the sampling expression to obtain a sampling function, the sequence of the sampling function and a sampling parameter value corresponding to the sampling function;
the sampling expression engine is further used for sampling the data to be processed based on the sampling function and the corresponding sampling parameter value when the sampling function meets a preset condition, determining the sampling data, and pushing error prompt information when the sampling function does not meet the preset condition, wherein the error prompt information is used for prompting a user that the sampling expression configured by the user is wrong and reconfiguring the sampling expression, and the preset condition is that the sampling function belongs to a preset sampling function preset in a server;
the data processing plug-in is used for filtering the sampling data, determining the filtered sampling data, and carrying out statistical processing on the filtered sampling data to obtain a data statistical result;
the data output layer is used for outputting the data statistical result to the service logic layer;
the sampling the data to be processed based on the sampling function and the corresponding sampling parameter value, and the determining the sampling data includes:
based on the sampling function, the sequence of the sampling function and the corresponding sampling parameter value, sequentially executing a preset sampling code corresponding to the sampling function according to a single sampling function in the sequence of the sampling function to sample the data to be processed, and determining sampling data;
the step of sampling the data to be processed by executing a preset sampling code corresponding to the sampling function according to a single sampling function in the sequence of the sampling function in sequence based on the sampling function, the sequence of the sampling function and the corresponding sampling parameter value, wherein the step of determining the sampling data comprises the following steps:
determining a sampling function ranked at the forefront according to the sequence of the sampling functions, sampling the data to be processed according to the sampling function ranked at the forefront, and determining first data to be sampled;
sampling the first to-be-determined sampling data based on the sampling function in the subsequent order and the corresponding sampling parameter value to determine sampling data;
the sampling function comprises a quantity limiting sampling function, the value of a sampling parameter corresponding to the quantity limiting sampling function is a sampling quantity threshold, the sampling of the first to-be-determined sampling data is carried out, and the determination of the sampling data comprises the following steps:
initializing the number of samples to zero;
selecting first data from the first to-be-determined sample data as current processing data based on the data sequence of the first to-be-determined sample data;
when the sampling quantity is smaller than the sampling quantity threshold value and the current processing data meets the preset requirement, determining the current processing data as the sampling data, adding one to the sampling quantity, and when the sampling quantity is smaller than the sampling quantity threshold value and the current processing data does not meet the preset requirement, abandoning the current processing data;
when data exists after the current processing data in the first to-be-determined sampling data, taking adjacent data after the current processing data in the first to-be-determined sampling data as the current processing data based on the data sequence of the first to-be-determined sampling data;
returning to the step of determining the current processing data as the sampling data when the sampling quantity is less than the sampling quantity threshold and the current processing data meets the preset requirement until the sampling quantity is equal to the sampling quantity threshold or no data exists after the current processing data in the first to-be-determined sampling data;
the preset requirement is that the number of data of the same type as the current processing data in the determined sampling data is smaller than a preset number threshold.
2. The method of claim 1, wherein the sampling function comprises a skip sampling function, and the sampling parameter value corresponding to the skip sampling function is a skip count;
sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampling data comprises:
determining, based on the data order of the data to be processed, the data positioned after the skip count in that order as the sampling data.
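As a minimal sketch of the skip sampling of claim 2, assuming the data to be processed is available as an ordinary Python iterable: the function name `skip_sample` is hypothetical and not taken from the specification.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_sample(items: Iterable[T], skip_count: int) -> Iterator[T]:
    """Hypothetical sketch: drop the first `skip_count` items and
    yield everything that follows, preserving the original order."""
    # islice consumes lazily, so the skipped items are never stored.
    return islice(items, skip_count, None)
```

Because `islice` is lazy, this form also works when the data is read from a source on demand rather than loaded in full.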
3. The method of claim 1, wherein the sampling function comprises a quantity-limiting sampling function, and the sampling parameter value corresponding to the quantity-limiting sampling function is a sampling quantity threshold;
sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampling data comprises:
determining, based on the data order of the data to be processed, the data items that come first in that order, up to the sampling quantity threshold in number, as the sampling data.
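One possible reading of claim 3 is that the first N items in order are kept, where N is the sampling quantity threshold. A short Python sketch under that assumption follows; `limit_sample` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def limit_sample(items: Iterable[T], quantity_threshold: int) -> List[T]:
    """Hypothetical sketch: keep only the first `quantity_threshold`
    items in the existing data order."""
    # Reading stops as soon as enough items have been taken.
    return list(islice(items, quantity_threshold))
```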
4. The method of claim 1, wherein the sampling function comprises an interval sampling function, and the sampling parameter value corresponding to the interval sampling function is an interval value;
sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampling data comprises:
determining, based on the data order of the data to be processed, the data items separated from one another by the interval value as the sampling data, wherein the sampling data includes the data item that is first in the data order of the data to be processed.
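The interval sampling of claim 4 can be illustrated with the sketch below. It assumes that the interval value is the step between consecutive sampled positions (an interval of 3 keeps positions 0, 3, 6, and so on); the claim could also be read as the number of items skipped between samples, in which case the step would be one larger. The name `interval_sample` is hypothetical.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def interval_sample(items: Iterable[T], interval: int) -> List[T]:
    """Hypothetical sketch: keep the first item and then every
    `interval`-th item after it (assumed meaning of the interval value)."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    # Start at position 0 so the first item is always included.
    return list(islice(items, 0, None, interval))
```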
5. The method of claim 1, wherein the sampling function comprises a random sampling function, and the sampling parameter value corresponding to the random sampling function is a random probability value;
sampling the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampling data comprises:
randomly sampling the data to be processed based on the random probability value to determine the sampling data.
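A minimal sketch of claim 5 follows, assuming that "random sampling based on a random probability value" means each item is kept independently with that probability. This interpretation, the name `random_sample`, and the optional `rng` argument (added only so results can be reproduced) are assumptions, not features stated in the claim.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def random_sample(
    items: Iterable[T],
    probability: float,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Hypothetical sketch: keep each item independently with the
    given probability, preserving the original order."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1), so 0.0 keeps nothing and 1.0 keeps all.
    return [item for item in items if rng.random() < probability]
```

Passing a seeded generator, for example `random.Random(7)`, makes repeated runs select the same items.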
6. The method of any of claims 1-5, wherein the sampling function comprises at least one of a skip sampling function, a quantity-limiting sampling function, an interval sampling function, and a random sampling function.
7. A data processing apparatus, comprising:
a data acquisition module, configured to traverse each data source based on an iterator and read the data in each data source sequentially and on demand, to acquire data to be processed;
an expression acquisition module, configured to receive a sampling expression that is transmitted by a browser and obtained, based on a B/S mode, in response to an interactive operation on an expression input box in an expression configuration interface, wherein the sampling expression corresponds to the data to be processed and represents a condition for sampling the data to be processed, and the sampling expression comprises a sampling function and a sampling parameter value, the sampling function being a function that implements sampling and the sampling parameter value being a parameter value supplied to the sampling process;
an analysis module, configured to parse the sampling expression to obtain the sampling functions, the order of the sampling functions, and the sampling parameter value corresponding to each sampling function;
a sampling module, configured to: when the sampling function meets a preset condition, sample the data to be processed based on the sampling function and the corresponding sampling parameter value to determine the sampling data; and when the sampling function does not meet the preset condition, push error prompt information, wherein the error prompt information prompts the user that the sampling expression configured by the user is wrong and should be reconfigured, and the preset condition is that the sampling function belongs to the preset sampling functions preconfigured in a server;
the sampling module being further configured to filter the sampling data to determine filtered sampling data, perform statistical processing on the filtered sampling data to obtain a data statistical result, and output the data statistical result;
the sampling module being further configured to, based on the sampling functions, the order of the sampling functions, and the corresponding sampling parameter values, execute in turn, for each single sampling function in that order, the preset sampling code corresponding to that sampling function, so as to sample the data to be processed and determine the sampling data; the sampling module being further configured to determine, according to the order of the sampling functions, the sampling function that is first in the order, sample the data to be processed according to that sampling function to determine first to-be-determined sampling data, and then sample the first to-be-determined sampling data based on the sampling function that is later in the order and its corresponding sampling parameter value, to determine the sampling data;
wherein the sampling function comprises a quantity-limiting sampling function, the sampling parameter value corresponding to the quantity-limiting sampling function is a sampling quantity threshold, and the sampling module is further configured to: initialize the sampling quantity to zero; select the first data item from the first to-be-determined sampling data as the current processing data, based on the data order of the first to-be-determined sampling data; when the sampling quantity is smaller than the sampling quantity threshold and the current processing data meets a preset requirement, determine the current processing data as sampling data and add one to the sampling quantity; when the sampling quantity is smaller than the sampling quantity threshold and the current processing data does not meet the preset requirement, discard the current processing data; when data exists after the current processing data in the first to-be-determined sampling data, take the data item immediately after the current processing data as the current processing data, based on the data order of the first to-be-determined sampling data; and return to the step of determining the current processing data as sampling data when the sampling quantity is smaller than the sampling quantity threshold and the current processing data meets the preset requirement, until the sampling quantity equals the sampling quantity threshold or no data exists after the current processing data in the first to-be-determined sampling data; wherein the preset requirement is that the number of data items, among the sampling data already determined, of the same type as the current processing data is smaller than a preset number threshold.
8. The apparatus of claim 7, wherein the expression acquisition module is configured to receive a sampling expression obtained in response to an interactive operation on an expression input box in the expression configuration interface.
9. The apparatus of claim 7, wherein the data acquisition module is configured to acquire the data to be processed by reading each data source through an iterator.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
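The overall flow recited in the claims (parse a sampling expression, check each sampling function against a preset set, then apply the functions in order) might be sketched as follows. The expression syntax `skip(2).limit(3)`, the registry name `PRESET_FUNCTIONS`, and the functions `parse_expression` and `run_sampling` are all hypothetical; this section does not specify the expression grammar, so the sketch assumes a dot-separated chain of calls with one non-negative integer parameter each.

```python
import re
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical registry standing in for the "preset sampling functions"
# configured on the server; each entry maps a name to its sampling code.
PRESET_FUNCTIONS: Dict[str, Callable[[Iterator, int], Iterator]] = {
    "skip": lambda items, n: islice(items, n, None),
    "limit": lambda items, n: islice(items, n),
    "interval": lambda items, n: islice(items, 0, None, n),
}

# Assumed syntax for one call: name(integer)
_CALL = re.compile(r"\s*([A-Za-z_]\w*)\s*\(\s*(\d+)\s*\)\s*")


def parse_expression(expression: str) -> List[Tuple[str, int]]:
    """Return (function name, parameter value) pairs in written order."""
    calls: List[Tuple[str, int]] = []
    for part in expression.split("."):
        match = _CALL.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed sampling expression near {part!r}")
        name, value = match.group(1), int(match.group(2))
        if name not in PRESET_FUNCTIONS:
            # Corresponds to pushing an error prompt to the user.
            raise ValueError(f"unknown sampling function {name!r}")
        calls.append((name, value))
    return calls


def run_sampling(expression: str, items: Iterable) -> list:
    """Apply each parsed sampling function in order; the output of one
    function is the input of the next."""
    stream: Iterator = iter(items)
    for name, value in parse_expression(expression):
        stream = PRESET_FUNCTIONS[name](stream, value)
    return list(stream)
```

Validating every function name before any sampling starts means a misconfigured expression fails early, which matches the claimed behavior of prompting the user to reconfigure rather than returning partial results.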
CN201810729988.3A 2018-07-05 2018-07-05 Data processing method and device, computer equipment and storage medium Active CN108984700B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810729988.3A / CN108984700B (en) | 2018-07-05 | 2018-07-05 | Data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810729988.3A / CN108984700B (en) | 2018-07-05 | 2018-07-05 | Data processing method and device, computer equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN108984700A (en) | 2018-12-11
CN108984700B (en) | 2021-07-27

Family

ID=64537230

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN201810729988.3A / CN108984700B (en) | Active | 2018-07-05 | 2018-07-05 | Data processing method and device, computer equipment and storage medium

Country Status (1)

Country | Link
CN (1) | CN108984700B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110569284A (en) * | 2019-09-09 | 2019-12-13 | 联想(北京)有限公司 | Information processing method and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN107506383A (en) * | 2017-07-25 | 2017-12-22 | 中国建设银行股份有限公司 | An audit data processing method and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103078772A (en) * | 2013-02-26 | 2013-05-01 | 南京理工大学常熟研究院有限公司 | Deep packet inspection (DPI) sampling peer-to-peer (P2P) flow detection system based on credibility
CN104572849A (en) * | 2014-12-17 | 2015-04-29 | 西安美林数据技术股份有限公司 | Automatic standardized filing method based on text semantic mining
CN105243127A (en) * | 2015-09-30 | 2016-01-13 | 海天水务集团股份公司 | Report data sampling method for wastewater treatment plant
CN106886535A (en) * | 2015-12-16 | 2017-06-23 | 大唐软件技术股份有限公司 | A data extraction method and apparatus adapted to multiple data sources
CN107181776B (en) * | 2016-03-10 | 2020-04-28 | 华为技术有限公司 | Data processing method and related equipment and system
CN107766486B (en) * | 2017-10-16 | 2021-04-20 | 浪潮通用软件有限公司 | Method, device, readable medium and storage controller for randomly extracting sample data

Also Published As

Publication number | Publication date
CN108984700A (en) | 2018-12-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant