CN111708979A - Method for judging big data discrete degree in real time - Google Patents

Publication number
CN111708979A
Authority
CN
China
Prior art keywords
variance
subset
components
computation
standard deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910204265.6A
Other languages
Chinese (zh)
Inventor
吕纪竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201910204265.6A
Publication of CN111708979A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

A method, a system and a computing system program product are disclosed for determining the degree of dispersion of big data in real time by iteratively calculating the variance and/or the standard deviation of a computation subset of a specified size over the big data. Embodiments of the present invention include iteratively calculating a plurality of components of the variance and/or standard deviation of the adjusted computation subset based on the plurality of components of the variance and/or standard deviation of the pre-adjustment computation subset, and then generating the variance and/or standard deviation of the adjusted computation subset from the iteratively calculated components as needed. Iteratively calculating the variance and/or the standard deviation updates the result in real time from the latest data and avoids accessing all data elements in the adjusted computation subset and repeating calculations, thereby improving computational efficiency, saving computing resources and reducing the energy consumption of the computing system. This makes efficient, low-cost real-time judgment of the dispersion of big data practical, and makes possible some real-time scenarios that were previously infeasible.

Description

Method for judging big data discrete degree in real time
Technical Field
Big data or streaming data analysis.
Background
The internet, mobile communication, navigation, online gaming, sensing technology and large-scale computing infrastructure generate massive amounts of data every day. Big data is data that, because of its large size and its rapid rate of change and growth, exceeds the processing power of traditional database systems and the analytical power of traditional analysis methods.
The variance and the standard deviation reflect the degree of dispersion of data, so the dispersion of big data can be judged once the variance or the standard deviation has been calculated. The difficulty and the challenge lie in calculating the variance and the standard deviation over big data in real time.
To be able to reach a judgment at any time using the latest data, the variance and/or standard deviation may need to be recalculated after the big data set changes. As a result, some (and possibly many) data elements may be accessed and used repeatedly. For example, suppose the variance or the standard deviation is calculated over a computation subset containing n data elements. When one data element is removed from the computation subset and one data element is added to it, the traditional approach accesses all n data elements in the computation subset to recalculate the variance and/or the standard deviation.
Depending on requirements, a computation subset may be very large; for example, its data elements may be distributed across thousands of computing devices of a cloud platform. The traditional method of recalculating the variance and/or the standard deviation over big data after each data change cannot run in real time, and it occupies and wastes a large amount of computing resources.
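For contrast with the iterative approach, the conventional recalculation described above can be sketched as follows. This is a minimal illustration and not code from the patent: the function names are hypothetical, and the population variance (dividing by n) is assumed.

```python
def variance_from_scratch(subset):
    """Population variance computed by visiting every data element (two passes)."""
    n = len(subset)
    mean = sum(subset) / n
    return sum((x - mean) ** 2 for x in subset) / n


def replace_and_recompute(subset, removed, added):
    """Swap one data element, then recompute over all n elements again."""
    adjusted = list(subset)
    adjusted.remove(removed)   # drop one occurrence of the removed element
    adjusted.append(added)     # add the new element
    return adjusted, variance_from_scratch(adjusted)
```

Every call to `replace_and_recompute` touches all n elements, which is the cost the background section identifies as impractical for very large or distributed computation subsets.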
Disclosure of Invention
The invention extends to methods, systems and computing system program products for iteratively calculating the variance and/or the standard deviation of an adjusted computation subset of big data, so that the degree of dispersion of the big data can be judged in real time. Iteratively calculating the variance and/or the standard deviation of an adjusted computation subset comprises iteratively calculating a plurality of components of the variance and/or the standard deviation of the adjusted computation subset based on the components of the pre-adjustment computation subset, and then generating the variance and/or the standard deviation of the adjusted computation subset from the iteratively calculated components as needed. The iterative calculation only needs to access and use the iteratively calculated components together with the newly added and removed data elements; it avoids accessing all data elements in the adjusted computation subset and repeating calculations. This reduces data access latency, improves computational efficiency, saves computing resources and reduces the energy consumption of the computing system, making efficient, low-cost real-time judgment of the dispersion of big data practical and making possible some real-time scenarios that were previously infeasible.
For a given iterative algorithm for the variance or the standard deviation, assume that the total number of components iteratively calculated in the same round (including the sum or the mean of the computation subset) is p (p > 1). If the number of directly iterated components is v (1 ≤ v ≤ p), then the number of indirectly iterated components is w = p - v (w ≥ 0). The sum or the mean of the computation subset is a special component that must be iteratively calculated; it may be iterated either directly or indirectly.
The computing system initializes two or more (p, p > 1) components of a pre-adjustment computation subset of a big data set stored on one or more storage media, including a sum and/or a mean. Initializing the two or more components includes receiving or accessing already calculated components from a computing-device-readable medium, or calculating them according to their definitions from the data elements in the pre-adjustment computation subset.
The computing system accesses a data element to be removed from the pre-adjusted computing subset and a data element to be added to the pre-adjusted computing subset.
The computing system adjusts the pre-adjustment computing subset by removing data elements to be removed from the pre-adjustment computing subset and adding data elements to be added to the pre-adjustment computing subset.
The computing system iteratively computes a sum or average or sum and average for the adjusted computation subset.
The computing system directly iteratively calculates one or more (v, 1 ≤ v ≤ p) components, other than the sum and the mean, of the variance or the standard deviation of the adjusted computation subset. Directly iterating these v components includes: accessing the v components of the pre-adjustment computation subset; mathematically removing the contribution of the removed data element from each of the v components; and mathematically adding the contribution of the added data element to each of the v components. This avoids accessing and using all data elements in the adjusted computation subset, which reduces data access latency, saves computing resources, reduces energy consumption and improves computational efficiency.
The computing system indirectly iteratively calculates, as needed, the w = p - v components of the variance and/or the standard deviation of the adjusted computation subset. Indirectly iterating the w components means indirectly iterating each of them one by one. Indirectly iterating a component includes accessing and using one or more components other than that component to calculate it. Those one or more components may have been initialized, directly iterated, or indirectly iterated.
The computing system generates a variance or a standard deviation of the adjusted computed subset based at least on one or more iteratively computed components of the variance or the standard deviation of the adjusted computed subset.
The computing system may continue to access one data element to be removed and one data element to be added, adjust the pre-adjustment computation subset, directly iterate v (1 ≤ v ≤ p) components, indirectly iterate w = p - v components, and generate the variance or the standard deviation as needed. The computing system may repeat this process as many times as necessary.
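The steps summarized above (initialize, adjust, generate) can be sketched as below. This is one possible choice of components, not necessarily any of the patent's iterative algorithms 1 to 3: the sum and the sum of squares are iterated directly, and the mean is derived indirectly from the sum. All names are hypothetical, and the population variance is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class Components:
    n: int          # computation subset size
    total: float    # sum of the data elements (directly iterated)
    sum_sq: float   # sum of squared data elements (directly iterated)


def initialize(subset):
    """Compute the components once, from their definitions."""
    return Components(len(subset), float(sum(subset)),
                      float(sum(x * x for x in subset)))


def adjust(c, removed, added):
    """Remove one element's contribution and add another's, in O(1)."""
    c.total += added - removed
    c.sum_sq += added * added - removed * removed


def variance(c):
    """Generate the variance on demand from the components."""
    mean = c.total / c.n                       # indirectly derived component
    return max(c.sum_sq / c.n - mean * mean, 0.0)


def standard_deviation(c):
    return math.sqrt(variance(c))
```

The sum-of-squares form is simple but can lose precision to cancellation when the mean is large relative to the spread; clamping at zero only guards against small negative results.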
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or the practice of the invention.
Drawings
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. These drawings depict only typical embodiments of the invention and are not therefore to be understood or interpreted as limiting the scope of the invention:
FIG. 1 illustrates a high-level overview of an example computing system that supports iterative calculation of the variance and/or standard deviation.
Fig. 1-1 illustrates an example computing system architecture in which the variance and/or standard deviation is iteratively calculated for big data, with all components iterated directly.
FIG. 1-2 illustrates an example computing system architecture in which the variance and/or standard deviation is iteratively calculated for big data, with some components iterated directly and some iterated indirectly.
FIG. 2 illustrates a flow chart of an example method for iteratively calculating the variance or the standard deviation for big data.
FIG. 3 illustrates the data elements accessed when iteratively calculating the variance and/or standard deviation over big data.
Fig. 4-1 illustrates the definition of variance or standard deviation and the conventional equations for calculating variance or standard deviation over a subset of calculations.
Fig. 4-2 shows a first iterative calculation algorithm of variance or standard deviation (iterative algorithm 1).
Fig. 4-3 shows a second iterative calculation algorithm for variance or standard deviation (iterative algorithm 2).
Fig. 4-4 shows a third iterative calculation algorithm for variance and/or standard deviation (iterative algorithm 3).
FIG. 5-1 shows a first subset of computations for an example computation.
Fig. 5-2 shows a second subset of computations for one example of computation.
Fig. 5-3 shows a third subset of computations for one example of computation.
Fig. 6-1 illustrates the comparison of computational efforts of the conventional variance algorithm and the iterative variance algorithm when the computational subset size is 8.
Fig. 6-2 illustrates the comparison of computational efforts of the conventional variance algorithm and the iterative variance algorithm when the computational subset size is 1,000,000.
Fig. 6-3 illustrates the comparison of computational efforts of the conventional standard deviation algorithm and the iterative standard deviation algorithm when the computation subset size is 8.
Fig. 6-4 illustrates the comparison of computational efforts of the conventional standard deviation algorithm and the iterative standard deviation algorithm when the computation subset size is 1,000,000.
Detailed description of the invention
The present invention extends to methods, systems and computing system program products for iteratively calculating the variance and/or standard deviation over big data by iteratively calculating two or more components of an adjusted computation subset of a given size n (n > 1), so that the degree of dispersion of the big data can be determined in real time. A computing system includes one or more processor-based computing devices. Each computing device contains one or more processors. The computing system includes one or more storage media. At least one of the one or more storage media has a data set on it. The data elements from the data set that are involved in the variance and/or standard deviation calculation constitute the pre-adjustment computation subset. The computation subset size n indicates the number of data elements in the computation subset. Embodiments of the present invention include iteratively calculating a plurality of components of the variance and/or standard deviation of the adjusted computation subset based on the plurality of components of the variance and/or standard deviation of the pre-adjustment computation subset, and then generating the variance and/or standard deviation of the adjusted computation subset from the iteratively calculated components as needed. Iteratively calculating the variance and/or the standard deviation avoids accessing all data elements in the adjusted computation subset and repeating calculations, thereby improving computational efficiency, saving computing resources and reducing the energy consumption of the computing system. This makes efficient, low-cost real-time judgment of the dispersion of big data practical, and makes possible some real-time scenarios that were previously infeasible.
In this context, a computation subset refers to a data set containing the data elements used for a variance or standard deviation calculation. A computation subset is analogous to the sliding window used when the variance or the standard deviation is calculated over streaming data or time series data. In the description of the embodiments of the present invention, the difference between a computation subset and a computation window is that the data elements in a computation window are ordered, whereas those in a computation subset are not.
The difference between real-time streaming data processing and streaming big data processing is that all historical data can be accessed when processing streaming big data, so no additional buffer is needed to store newly received data elements.
In this context, a component of the variance or the standard deviation is a quantity or expression that appears in the defining formula of the variance or the standard deviation, or in any transformation of that formula. The variance or the standard deviation is its own largest component. The variance and/or the standard deviation may be calculated from one or more components or combinations of them, so multiple algorithms support iterative variance and/or standard deviation calculation. The following are some examples of components of the variance or the standard deviation.
[Eight formula images (Figure BDA0001998468600000051 through Figure BDA0001998468600000058) listing the example components are not reproduced in this text.]
A component may be iterated directly or indirectly. The difference is that a directly iterated component is calculated from the value that same component had in the previous round, whereas an indirectly iterated component is calculated from components other than itself.
For a given component, it may be iteratively computed directly in one algorithm but indirectly in another algorithm.
The sum or the mean of the computation subset is a special component that must be iteratively calculated. Any algorithm iteratively calculates at least two components: one is the sum or the mean, which may be iterated directly or indirectly, and the other is iterated directly. For a given algorithm, assume that the total number of different components iteratively calculated in the same round is p (p ≥ 2). If the number of directly iterated components is v (1 ≤ v ≤ p), then the number of indirectly iterated components is w = p - v (0 ≤ w < p). All components may be iterated directly (in which case v = p > 1 and w = 0). However, directly iterated components must be calculated in every round, whether or not the variance or the standard deviation is needed and accessed in that round.
For a given algorithm, a directly iterated component must be calculated every time the subset changes (i.e., each time an existing data element is removed from the pre-adjustment computation subset and a data element is added to it). An indirectly iterated component, however, can be calculated as needed from one or more other components, i.e., only when the variance or the standard deviation needs to be calculated and accessed. Thus, when the variance or the standard deviation is not accessed in a given round, only a small number of components need to be iteratively calculated. An indirectly iterated component may be used in the direct iteration of another component, in which case its calculation cannot be omitted.
The variance or the standard deviation can be calculated as needed. When the variance or the standard deviation changes with each adjusted computation subset but does not need to be accessed, the computing system only needs to iteratively calculate the sum or the mean and one or more other components for each data change. Iteratively calculating these components avoids accessing all inputs and repeating calculations, and thus improves computational efficiency. When the variance or the standard deviation does need to be accessed, the computing system generates it from the iteratively calculated components.
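The on-demand behaviour described above can be illustrated with a small sketch in which directly iterated components are updated on every change, while the indirectly derived mean and variance are only evaluated when they are read. The class name, its attributes and the `derivations` counter are hypothetical and exist purely to make the deferred work visible; the population variance is assumed.

```python
class LazyComponents:
    """Directly iterate sum and sum of squares; derive mean and variance lazily."""

    def __init__(self, subset):
        self.n = len(subset)
        self.total = float(sum(subset))
        self.sum_sq = float(sum(x * x for x in subset))
        self._variance = None   # cached result of the indirect step
        self.derivations = 0    # how many times the indirect step actually ran

    def replace(self, removed, added):
        # Direct iteration: must run on every data change.
        self.total += added - removed
        self.sum_sq += added * added - removed * removed
        self._variance = None   # cached value is now stale

    @property
    def mean(self):
        # Indirect iteration: derived from another component (the sum).
        return self.total / self.n

    @property
    def variance(self):
        # Indirect iteration: only evaluated when someone asks for it.
        if self._variance is None:
            self.derivations += 1
            self._variance = max(self.sum_sq / self.n - self.mean ** 2, 0.0)
        return self._variance
```

If several data changes occur between two reads, the indirect step runs once rather than once per change.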
Embodiments of the present invention include iteratively calculating a plurality of components of the variance or the standard deviation of the adjusted computation subset based on a plurality of components calculated for the pre-adjustment computation subset.
The computing system initializes a sum, a mean, or both, and one or more other components of the variance or the standard deviation (p components in total, p > 1) for a pre-adjustment computation subset of a given size n (n > 1). Initializing the two or more components includes accessing or receiving already calculated components from one or more computing-device-readable media, or calculating them according to their definitions from the data elements in the pre-adjustment computation subset.
The computing system accesses a data element to be removed from the pre-adjustment computing subset and a data element to be added to the pre-adjustment computing subset.
The computing system adjusts the pre-adjustment computation subset by: removing data elements to be removed from the pre-adjustment computing subset and adding data elements to be added to the pre-adjustment computing subset.
The computing system iteratively computes a sum or average or sum and average for the adjusted computation subset.
The computing system directly iteratively calculates one or more (v, 1 ≤ v ≤ p) components, other than the sum and the mean, of the variance or the standard deviation of the adjusted computation subset. Directly iterating the v components includes: accessing the v components calculated for the pre-adjustment computation subset; mathematically removing any contribution of the removed data element from each accessed component; and mathematically adding any contribution of the added data element to each accessed component. This avoids accessing and using all data elements in the adjusted computation subset, which reduces data access latency, saves computing resources, reduces energy consumption and improves computational efficiency.
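As a worked illustration of removing and adding contributions, consider two candidate components, the sum S and the sum of squares SS. These symbols are chosen here for illustration only; the patent's own component definitions are given in its figures. With x_r the removed data element, x_a the added data element, k the round index, and the population variance assumed:

```latex
\begin{aligned}
S_{k+1}          &= S_k - x_r + x_a \\
SS_{k+1}         &= SS_k - x_r^2 + x_a^2 \\
\bar{x}_{k+1}    &= S_{k+1} / n \\
\sigma^2_{k+1}   &= SS_{k+1} / n - \bar{x}_{k+1}^2
\end{aligned}
```

For the subset {8, 3, 6, 1}, S = 18 and SS = 110, giving a variance of 110/4 - 4.5^2 = 7.25. Replacing 8 with 9 gives S = 19 and SS = 110 - 64 + 81 = 127, so the mean is 4.75 and the variance is 127/4 - 4.75^2 = 9.1875, without revisiting the unchanged elements 3, 6 and 1.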
The computing system indirectly iteratively calculates, as needed, the w = p - v components of the variance or the standard deviation for the adjusted computation subset. Indirectly iterating the w components means indirectly iterating each of them. Indirectly iterating a component includes accessing one or more components other than that component and calculating the component from the accessed components. Those one or more components may have been initialized, directly iterated, or indirectly iterated.
The computing system generates a variance and/or standard deviation for the adjusted computed subset based on at least one or more components of the iteratively computed variance and/or standard deviation for the adjusted computed subset, as needed.
The computing system may continue to access a data element to be removed from and a data element to be added to the pre-adjustment computation subset, adjust the pre-adjustment computation subset, iteratively calculate a sum, a mean, or both, directly iterate one or more (v, 1 ≤ v ≤ p) components, indirectly iterate w = p - v components as needed, generate the variance and/or the standard deviation from the one or more iteratively calculated components as needed, and repeat this process as many times as needed.
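The repeated process can also be sketched with a different pair of components: the mean and the sum of squared deviations from the mean, both iterated directly. This is an illustrative alternative under stated assumptions, not a reproduction of the patent's algorithms; the function name and the shape of the `adjustments` input (pairs of removed and added elements) are hypothetical, and the population standard deviation is assumed. Only the components are tracked here; the stored subset itself is not modeled.

```python
import math


def iterate_std(subset, adjustments):
    """Yield the standard deviation after each (removed, added) adjustment."""
    n = len(subset)
    mean = sum(subset) / n
    sx = sum((x - mean) ** 2 for x in subset)   # sum of squared deviations

    for removed, added in adjustments:
        new_mean = mean + (added - removed) / n
        # Update of sx obtained by expanding both sums of squared deviations;
        # it uses the old mean and the new mean, not the other n - 1 elements.
        sx += (added - removed) * (added - new_mean + removed - mean)
        mean = new_mean
        yield math.sqrt(max(sx, 0.0) / n)
```

Because `iterate_std` is a generator, each round's O(1) update runs only when the next result is requested. Here the mean feeds the update of the other component, so, as the description notes for such dependencies, its calculation cannot be skipped in any round.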
Embodiments of the present invention may include or utilize computing device hardware, such as one or more processors and storage devices described in greater detail below, special purpose or general computing devices. The scope of embodiments of the present invention also includes physical and other computing device-readable media for carrying or storing computing device-executable instructions and/or data structures. These computing device-readable media can be any media that can be accessed by a general purpose or special purpose computing device. A computing device readable medium storing instructions executable by a computing device is a storage medium (device). A computing device readable medium carrying computing device executable instructions is a transmission medium. Thus, by way of example, and not limitation, embodiments of the invention may include at least two different types of computing device-readable media: storage media (devices) and transmission media.
Storage media (devices) include Random Access Memory (RAM), read-only Memory (ROM), electrically erasable programmable read-only Memory (EEPROM), compact disc read-only Memory (CD-ROM), Solid State Disk (SSD), Flash Memory (Flash Memory), Phase Change Memory (PCM), other types of Memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computing device.
A "network" is defined as one or more data links that enable computing devices and or modules and or other electronic devices to transfer electronic data. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computing device, the computing device views the connection as a transmission medium. The transmission medium can include a network and or data links which carry program code in the form of computing device-executable instructions or data structures and which are accessible by a general purpose or special purpose computing device. Combinations of the above should also be included within the scope of computing device readable media.
Further, program code in the form of computing device executable instructions or data structures can be transferred automatically from transmission media to storage media (devices) (or vice versa) when different computing device components are employed. For example, computing device executable instructions or data structures received from a network or data link may be staged into random access memory in a network interface module (e.g., a NIC) and then ultimately transferred to random access memory of the computing device and/or to a less volatile storage medium (device) of the computing device. It should be understood, therefore, that a storage medium (device) can be included in a computing device component that also (or even primarily) employs a transmission medium.
Computing device executable instructions include, for example, instructions and data which, when executed by a processor, cause a general purpose computing device or special purpose computing device to perform a certain function or group of functions. The computing device executable instructions may be, for example, binaries, intermediate format instructions such as assembly code, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed only as examples of implementing the claims.
Embodiments of the invention may be practiced in network computing environments where many types of computing devices, including personal computers, desktop computers, notebook computers, information processors, hand-held devices, multi-processing systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, mainframe computers, supercomputers, mobile telephones, palmtops, tablets, pagers, routers, switches, and the like, may be deployed. Embodiments of the invention may also be practiced in distributed system environments where local and remote computing devices that perform tasks are interconnected by a network (i.e., either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links). In a distributed system environment, program modules may be stored in local or remote memory storage devices.
Embodiments of the invention may also be implemented in a cloud computing environment. In this description and in the following claims, "cloud computing" is defined as a model that enables on-demand access to a shared pool of configurable computing resources over a network. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to a shared pool of configurable computing resources. The shared pool of configurable computing resources can be provisioned quickly through virtualization, released with low management effort or little service provider interaction, and then scaled accordingly.
The cloud computing model may include various characteristics such as on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. The cloud computing model may also be embodied in various service models, for example, software as a service ("SaaS"), platform as a service ("PaaS"), and infrastructure as a service ("IaaS"). The cloud computing model may also be deployed through different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
Several examples will be given in the following section.
FIG. 1 illustrates a high-level overview of an example computing system 100 that iteratively calculates the variance and/or standard deviation for big data. Referring to FIG. 1, computing system 100 includes a number of devices connected by different networks, such as local area network 1021, wireless network 1022, and internet 1023, among others. The devices include, for example, a data analysis engine 1007, a storage system 1011, a real-time data stream 1006, and a plurality of distributed computing devices such as a personal computer 1016, a handheld device 1017, and a desktop 1018, among others, that may schedule data analysis tasks and/or query data analysis results.
Data analysis engine 1007 can include one or more processors, such as CPU 1009 and CPU 1010, one or more system memories, such as system memory 1008, and component calculation module 131, variance calculation module 191, and standard deviation calculation module 193. Module 131 is illustrated in greater detail in other figures (e.g., Fig. 1-1 and FIG. 1-2). Storage system 1011 may include one or more storage media, such as storage media 1012 and storage media 1014, which may be used to store big data sets. For example, storage media 1012 and/or 1014 may include data set 123. The data sets in storage system 1011 may be accessed by data analysis engine 1007.
In general, data stream 1006 may include streaming data from various data sources, such as stock prices, audio data, video data, geospatial data, internet data, mobile communications data, web-surfing data, banking data, sensor data, and/or closed captioning data, among others. Several of these are depicted here by way of example: real-time data 1000 may include data collected in real time from sensors 1001, stocks 1002, correspondence 1003, banks 1004, and the like. Data analysis engine 1007 may receive data elements from data stream 1006. Data from different data sources may be stored on storage system 1011 and accessed by the data analysis engine; for example, data set 123 may come from different data sources and be accessed by data analysis engine 1007.
It should be understood that FIG. 1 presents some concepts in a very simplified form. For example, distributed devices 1016 and 1017 may be coupled to data analysis engine 1007 through a firewall, and data accessed or received by data analysis engine 1007 from data stream 1006 and/or storage system 1011 may be filtered through a data filter, and so on.
FIG. 1-1 illustrates an example computing system architecture 100A in which the variance or the standard deviation is iteratively computed for large datasets, with all (v = p > 1) components being directly iteratively computed. With respect to computing system architecture 100A, only the functions and interrelationships of its major components are described here; how these components cooperate to perform iterative variance and/or standard deviation calculations is described later in conjunction with the flow chart depicted in FIG. 2. FIG. 1-1 illustrates 1006 and 1007 shown in FIG. 1. Referring to FIG. 1-1, computing system architecture 100A includes component calculation module 131, variance calculation module 191, and standard deviation calculation module 193. Component calculation module 131 may be tightly coupled to one or more storage media via a high-speed data bus, or loosely coupled to one or more storage media managed by the storage system via a network, such as a local area network, a wide area network, or even the internet. Accordingly, component calculation module 131, and any other connected computing devices and their components, can send and receive message-related data (e.g., internet protocol ("IP") datagrams and other higher-layer protocols that use IP datagrams, such as user datagram protocol ("UDP"), real-time streaming protocol ("RTSP"), real-time transport protocol ("RTP"), Microsoft Media Server ("MMS"), transmission control protocol ("TCP"), hypertext transfer protocol ("HTTP"), simple mail transfer protocol ("SMTP"), etc.) over a network. The output of component calculation module 131 is used as an input to variance calculation module 191, and variance calculation module 191 may generate variance 192.
The output of component calculation module 131 is also used as an input to standard deviation calculation module 193, and standard deviation calculation module 193 may generate standard deviation 194. Variance 192 may be a sample variance or a population variance. Standard deviation 194 may be a sample standard deviation or a population standard deviation.
As shown, storage medium 121 contains data set 123. Data set 123 comprises a plurality of data elements stored at a plurality of locations on storage medium 121. For example, data elements 101, 102, 103, 104, 105, 106, 107, 108, 109, and 110 are stored at locations 121A, 121B, 121C, 121D, 121E, 121F, 121G, 121H, 121I, and 121J, respectively, and so on. Further data elements are stored at other locations.
Referring to computing system architecture 100A, component calculation module 131 typically contains v component calculation modules for directly iteratively computing v (v = p > 1) components for the set of n data elements of the adjusted computation subset. v is the number of components that are directly iteratively computed in a given algorithm that iteratively computes the variance or the standard deviation, and it varies with the iterative algorithm used. As shown in FIG. 1-1, component calculation module 131 contains a component $Cd_1$ calculation module 161 and a component $Cd_v$ calculation module 162, with v - 2 other component calculation modules between them, which may be a component $Cd_2$ calculation module, a component $Cd_3$ calculation module, ..., and a component $Cd_{v-1}$ calculation module. Each component calculation module calculates a specific component. Each component calculation module includes an initialization module that initializes a component for the first pre-adjustment computation subset and an algorithm that directly iteratively computes the component for the adjusted computation subset. For example, component $Cd_1$ calculation module 161 includes initialization module 132 to initialize component $Cd_1$ and iterative algorithm 133 to iteratively compute component $Cd_1$, and component $Cd_v$ calculation module 162 includes initialization module 138 to initialize component $Cd_v$ and iterative algorithm 139 to iteratively compute component $Cd_v$.
Initialization module 132 may initialize component $Cd_1$ either when it is first used or when the variance or standard deviation calculation is reset. Likewise, initialization module 138 may initialize component $Cd_v$ either when it is first used or when the variance or standard deviation calculation is reset.
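The pairing of an initialization module with an iterative algorithm inside each component calculation module can be sketched roughly as follows. This is only an illustration: the class name `ComponentModule`, its methods, and the choice of the sum and the sum of squares as the two components are assumptions, not names or components taken from the figures, and the data values are borrowed from the worked example later in the description.

```python
from typing import Callable, Iterable, Optional


class ComponentModule:
    """Hypothetical component calculation module: initializer plus iterative update."""

    def __init__(self, contribution: Callable[[float], float]):
        # `contribution` maps one data element to its share of the component.
        self.contribution = contribution
        self.value: Optional[float] = None  # None means "not initialized yet"

    def initialize(self, subset: Iterable[float]) -> float:
        # Initialization module: used on first use and again whenever the
        # variance / standard deviation calculation is reset.
        self.value = sum(self.contribution(x) for x in subset)
        return self.value

    def iterate(self, x_removed: float, x_added: float) -> float:
        # Iterative algorithm: touches only the removed and the added element.
        if self.value is None:
            raise RuntimeError("component must be initialized before iterating")
        self.value += self.contribution(x_added) - self.contribution(x_removed)
        return self.value


# Two illustrative components standing in for Cd_1 and Cd_v.
sum_module = ComponentModule(lambda x: x)
square_module = ComponentModule(lambda x: x * x)

data = [8, 3, 6, 1, 9, 2, 5, 4]
sum_module.initialize(data)
square_module.initialize(data)

# One adjustment: element 8 leaves the subset, element -7 enters it.
sum_module.iterate(8, -7)
square_module.iterate(8, -7)
```

Keeping the initializer and the update in one object mirrors the description's point that a reset simply runs the initializer again on the current subset.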
Referring to FIG. 1-1, computing system architecture 100A also includes variance calculation module 191 and standard deviation calculation module 193. Variance calculation module 191 may calculate variance 192 based on one or more iteratively calculated components, as needed. Standard deviation calculation module 193 may calculate standard deviation 194 based on one or more iteratively calculated components, as needed. Variance 192 may be a sample variance or a population variance. Standard deviation 194 may be a sample standard deviation or a population standard deviation.
FIG. 1-2 illustrates an example computing system architecture 100B in which the variance or the standard deviation is iteratively computed for large datasets, with some (v, where 1 ≤ v < p and p > 1) components directly iteratively computed and the remaining (w = p - v) components indirectly iteratively computed. In some implementations, the difference between computing system architectures 100B and 100A is that architecture 100B includes a component calculation module 135. Otherwise, the same reference numerals as in 100A are used in the same manner. To avoid repeating what was explained in the description of 100A, only the differing parts are discussed here. The number v in 100B may differ from the number v in 100A because some components that are directly iteratively computed in 100A are indirectly iteratively computed in 100B. In 100A, v = p > 1, but in 100B, 1 ≤ v < p. Referring to FIG. 1-2, computing system architecture 100B includes component calculation module 135. The output of component calculation module 131 may be input to component calculation module 135; the outputs of calculation modules 131 and 135 may be input to variance calculation module 191, which may generate variance 192, and to standard deviation calculation module 193, which may generate standard deviation 194. Component calculation module 135 typically includes w = p - v component calculation modules to indirectly iteratively calculate the w components.
For example, component calculation module 135 includes component calculation module 163 for indirectly iteratively calculating component $Ci_1$, component calculation module 164 for indirectly iteratively calculating component $Ci_w$, and w - 2 other component calculation modules between them. Indirectly iteratively computing the w components includes indirectly iteratively computing each of the w components one by one. Indirectly iteratively computing a component includes accessing and using one or more components other than the component itself. The one or more components may have been initialized, directly iteratively calculated, or indirectly iteratively calculated.
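A minimal sketch of indirect iterative computation, assuming (purely for illustration) that the sum S and the sum of squares SS are the directly iterated components while the mean and the sum of squared deviations SSD are the indirect ones. The function names are hypothetical. The second function consumes the output of the first, showing an indirect component built from another indirect component.

```python
def indirect_mean(s: float, n: int) -> float:
    # Indirect component derived from the directly iterated sum S.
    return s / n


def indirect_ssd(ss: float, mean: float, n: int) -> float:
    # Indirect component derived from the directly iterated SS and from the
    # (itself indirectly computed) mean: SSD = SS - n * mean^2.
    return ss - n * mean * mean


# Direct components of the subset 8, 3, 6, 1, 9, 2, 5, 4 (values assumed
# from the worked example later in the description).
S, SS, n = 38, 236, 8
mean = indirect_mean(S, n)
ssd = indirect_ssd(SS, mean, n)
```

Neither function looks at the data elements themselves, only at other components, which is what distinguishes indirect from direct iterative computation in the text.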
FIG. 2 illustrates a flow diagram of an example method 200 for iteratively calculating a variance or a standard deviation for big data. Method 200 will be described with reference to the components and data of computing system architectures 100A and 100B.
Method 200 includes initializing p (p > 1) components of a variance or a standard deviation for a pre-adjustment computation subset of a specified size n (n > 1) (201). For example, in computing system architectures 100A and 100B, initialization module 132 may initialize component $Cd_1$ 141 with the values of contribution 151 (the contribution of data element 101), contribution 152 (the contribution of data element 102), and contribution 153 (the contributions of other data elements 103, 104, 105, 106, 107, 108, ...). Likewise, initialization module 138 may initialize component $Cd_v$ 145 with the values of contribution 181 (the contribution of data element 101), contribution 182 (the contribution of data element 102), and contribution 183 (the contributions of other data elements 103, 104, 105, 106, 107, 108, ...).
Method 200 includes accessing a data element to be removed from the pre-adjustment computation subset and a data element to be added to the pre-adjustment computation subset (202). For example, after data elements 102-108 have been accessed, data element 101 to be removed from the pre-adjustment computation subset and data element 109 to be added to it may be accessed.
Method 200 includes adjusting the pre-adjustment computation subset (203). Adjusting the pre-adjustment computation subset includes removing the data element to be removed from the pre-adjustment computation subset (204) and adding the data element to be added to the pre-adjustment computation subset (205). For example, after data element 101 is removed from pre-adjustment computation subset 122 and data element 109 is added to it, pre-adjustment computation subset 122 becomes adjusted computation subset 122A.
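Steps 203-205 can be pictured with an unordered multiset. The sketch below uses `collections.Counter` for that purpose; the helper name `adjust_subset` and the concrete values (8 standing in for data element 101, -7 for data element 109) are assumptions borrowed from the worked example later in the description, not values given for FIG. 1-1.

```python
from collections import Counter


def adjust_subset(subset: Counter, x_removed, x_added) -> Counter:
    """Return the adjusted computation subset; the input is left untouched."""
    if subset[x_removed] <= 0:
        raise ValueError(f"{x_removed!r} is not in the computation subset")
    adjusted = subset.copy()
    adjusted[x_removed] -= 1          # 204: remove the element to be removed
    if adjusted[x_removed] == 0:
        del adjusted[x_removed]
    adjusted[x_added] += 1            # 205: add the element to be added
    return adjusted


subset_122 = Counter([8, 3, 6, 1, 9, 2, 5, 4])     # pre-adjustment subset
subset_122a = adjust_subset(subset_122, 8, -7)     # adjusted subset
```

A multiset fits here because a computation subset has no ordering and its size n stays fixed across an adjustment.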
Method 200 includes directly iteratively calculating v components of the variance or the standard deviation for the adjusted computation subset based on v (1 ≤ v ≤ p) components of the variance or the standard deviation for the pre-adjustment computation subset (206), including: accessing the v components of the variance or the standard deviation of the pre-adjustment computation subset (207); mathematically removing from each accessed component any contribution of the data element removed from the pre-adjustment computation subset (208); and mathematically adding to each accessed component any contribution of the data element added to the pre-adjustment computation subset (209). The details are described below.
Directly iteratively computing the v components of the variance or the standard deviation for the adjusted computation subset includes accessing the v components of the variance or the standard deviation of the pre-adjustment computation subset (207). For example, iterative algorithm 133 may access component $Cd_1$ 141, and iterative algorithm 139 may access component $Cd_v$ 145.
Directly iteratively computing the v components of the variance or the standard deviation for the adjusted computation subset includes mathematically removing any contribution of the removed data element from each accessed component (208). For example, directly iteratively computing component $Cd_1$ 143 may include contribution removal module 133A mathematically removing contribution 151 (the contribution of data element 101) from component $Cd_1$ 141, and directly iteratively computing component $Cd_v$ 147 may include contribution removal module 139A mathematically removing contribution 181 (the contribution of data element 101) from component $Cd_v$ 145.
Directly iteratively computing the v components of the variance or the standard deviation for the adjusted computation subset includes mathematically adding the contribution of the added data element to each accessed component (209). For example, directly iteratively computing component $Cd_1$ 143 may include contribution addition module 133B mathematically adding contribution 154 to component $Cd_1$ 141, and directly iteratively computing component $Cd_v$ 147 may include contribution addition module 139B mathematically adding contribution 184 to component $Cd_v$ 145. Both contributions 154 and 184 are contributions of data element 109.
As shown in FIGS. 1-1 and 1-2, component $Cd_1$ 143 includes contribution 152 (the contribution of data element 102), other contributions 153 (the contributions of data elements 103-108), and contribution 154 (the contribution of data element 109). Likewise, component $Cd_v$ 147 includes contribution 182 (the contribution of data element 102), other contributions 183 (the contributions of data elements 103-108), and contribution 184 (the contribution of data element 109).
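Steps 207-209 can be traced in a few lines, under the assumption (for illustration only) that v = 2 and that the two directly iterated components are the sum S and the sum of squares SS. The dictionary `CONTRIBUTIONS` and the function names are hypothetical. The comparison at the end checks that the iterated components equal what a full re-initialization over the adjusted subset would give.

```python
# Contribution of a single data element to each directly iterated component.
CONTRIBUTIONS = {
    "S": lambda x: x,        # sum of the elements
    "SS": lambda x: x * x,   # sum of the squared elements
}


def initialize(subset):
    # Visit every element once to build each component from scratch.
    return {name: sum(f(x) for x in subset) for name, f in CONTRIBUTIONS.items()}


def iterate(components, x_removed, x_added):
    updated = dict(components)              # 207: access the v components
    for name, f in CONTRIBUTIONS.items():
        updated[name] -= f(x_removed)       # 208: remove the old contribution
        updated[name] += f(x_added)         # 209: add the new contribution
    return updated


before = initialize([8, 3, 6, 1, 9, 2, 5, 4])
after = iterate(before, x_removed=8, x_added=-7)
recomputed = initialize([3, 6, 1, 9, 2, 5, 4, -7])
```

The update loop never reads the six unchanged elements, which is why their contributions (153 and 183 in the figures) carry over untouched.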
The variance or the standard deviation may be calculated as needed, i.e., only when it is accessed, whereas the v components must be calculated every time an existing data element is removed from the pre-adjustment computation subset and a data element is added to it.
When the variance or the standard deviation is accessed and v < p (i.e., not all components are directly iteratively computed), method 200 includes indirectly iteratively computing the w = p - v components as needed (210). The w components are computed only when the variance or the standard deviation is accessed. For example, referring to FIG. 1-2, where some components are directly iteratively computed and some are indirectly iteratively computed, calculation module 163 may indirectly iteratively compute component $Ci_1$ based on one or more components other than $Ci_1$, and calculation module 164 may indirectly iteratively compute component $Ci_w$ based on one or more components other than $Ci_w$. The one or more components may have been initialized, directly iteratively calculated, or indirectly iteratively calculated.
Method 200 includes calculating the variance as needed using one or more initialized or iteratively calculated components (211). For example, referring to FIG. 1-1, variance calculation module 191 may calculate variance 192 based on one or more of components $Cd_1$ 143 through $Cd_v$ 147.
Method 200 includes calculating the standard deviation as needed using one or more initialized or iteratively calculated components (212). For example, referring to FIG. 1-1, standard deviation calculation module 193 may calculate standard deviation 194 based on one or more of components $Cd_1$ 143 through $Cd_v$ 147.
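As a rough illustration of steps 211 and 212, the sketch below derives either flavor of variance and standard deviation from a single component. It assumes the available component is the sum of squared deviations SSD; the function names and the `sample` flag are hypothetical, and the value 55.5 is the SSD of the subset 8, 3, 6, 1, 9, 2, 5, 4 used later in the description.

```python
import math


def variance_from_ssd(ssd: float, n: int, sample: bool = True) -> float:
    # Sample variance divides by n - 1, population variance by n.
    if n < 2:
        raise ValueError("computation subset size n must exceed 1")
    return ssd / (n - 1) if sample else ssd / n


def std_from_ssd(ssd: float, n: int, sample: bool = True) -> float:
    # The standard deviation costs one extra square root on top of the variance.
    return math.sqrt(variance_from_ssd(ssd, n, sample))


sample_var = variance_from_ssd(55.5, 8)
population_var = variance_from_ssd(55.5, 8, sample=False)
sample_std = std_from_ssd(55.5, 8)
```

Both results are cheap final steps, so deferring them until the variance or standard deviation is actually accessed costs nothing in the update path.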
202-209 may be repeated as more pairs of data elements are accessed. 210-212 may be repeated as needed. For example, after components $Cd_1$ 143 through $Cd_v$ 147 have been computed, data element 102 and data element 110 may be accessed (202). Data elements 102 and 110 may be accessed from locations 121B and 121J, respectively. Each time the next iteration starts, the previous adjusted computation subset becomes the pre-adjustment computation subset of the new round. By removing data element 102 (204) and adding data element 110 (205), adjusted computation subset 122A (i.e., the pre-adjustment computation subset of the new round) may become the new adjusted computation subset 122B (203).
Iterative algorithm 133 may use component $Cd_1$ 143 (of adjusted computation subset 122A) to directly iteratively compute component $Cd_1$ 144 (of adjusted computation subset 122B) (206). Iterative algorithm 133 may access component $Cd_1$ 143 (207). Directly iteratively computing component $Cd_1$ 144 may include contribution removal module 133A mathematically removing contribution 152 (i.e., the contribution of removed data element 102) from component $Cd_1$ 143 (208). Directly iteratively computing component $Cd_1$ 144 may include contribution addition module 133B mathematically adding contribution 155 (i.e., the contribution of added data element 110) to component $Cd_1$ 143 (209). Likewise, iterative algorithm 139 may use component $Cd_v$ 147 (of adjusted computation subset 122A) to directly iteratively compute component $Cd_v$ 148 (of adjusted computation subset 122B) (206). Iterative algorithm 139 may access component $Cd_v$ 147 (207). Directly iteratively computing component $Cd_v$ 148 may include contribution removal module 139A mathematically removing contribution 182 (i.e., the contribution of removed data element 102) from component $Cd_v$ 147 (208). Directly iteratively computing component $Cd_v$ 148 may include contribution addition module 139B mathematically adding contribution 185 (i.e., the contribution of added data element 110) to component $Cd_v$ 147 (209).
As shown in FIGS. 1-1 and 1-2, component $Cd_1$ 144 includes other contributions 153 (the contributions of data elements 103-108), contribution 154 (the contribution of data element 109), and contribution 155 (the contribution of data element 110). Likewise, component $Cd_v$ 148 includes other contributions 183 (the contributions of data elements 103-108), contribution 184 (the contribution of data element 109), and contribution 185 (the contribution of data element 110).
Method 200 includes indirectly iteratively calculating the w components and calculating the variance and/or standard deviation as needed, i.e., only when the variance and/or standard deviation is accessed. If the variance or the standard deviation is not accessed, method 200 continues by accessing the next data element to be removed and the next data element to be added for the next computation subset (202). If the variance and/or standard deviation is accessed, method 200 includes indirectly iteratively computing the w components (210) and computing the variance and/or standard deviation based on one or more iteratively computed components (211, 212).
When the next data element to be removed and the next data element to be added are accessed, component $Cd_1$ 144 may be used to directly iteratively compute the next component $Cd_1$, and component $Cd_v$ 148 may be used to directly iteratively compute the next component $Cd_v$.
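The overall control flow of method 200, with cheap updates on every adjustment and the remaining work deferred until a result is accessed, might look like the following sketch. It assumes v = 2 direct components (sum S and sum of squares SS) and one indirect component (SSD); the class `IterativeDispersion` and its method names are hypothetical, and the data values come from the worked example later in the description.

```python
import math


class IterativeDispersion:
    """Sketch of method 200 with two direct components, S and SS (an assumption)."""

    def __init__(self, subset):
        # 201: initialize the components from the first computation subset.
        self.n = len(subset)
        if self.n < 2:
            raise ValueError("computation subset needs at least 2 elements")
        self.s = sum(subset)
        self.ss = sum(x * x for x in subset)

    def replace(self, x_removed, x_added):
        # 202-209: constant-time direct update. The caller is assumed to
        # guarantee that x_removed is currently in the computation subset.
        self.s += x_added - x_removed
        self.ss += x_added * x_added - x_removed * x_removed

    def variance(self, sample=True):
        # 210-211: the indirect component SSD and the variance are computed
        # only here, i.e. only when the variance is accessed.
        ssd = self.ss - self.s * self.s / self.n
        return ssd / (self.n - 1 if sample else self.n)

    def std(self, sample=True):
        # 212: standard deviation on demand.
        return math.sqrt(self.variance(sample))


stats = IterativeDispersion([8, 3, 6, 1, 9, 2, 5, 4])
stats.replace(8, -7)    # no variance is computed for this round
stats.replace(3, 11)
latest_variance = stats.variance()
```

In this usage the second round's variance is never computed at all, which is the saving the "as needed" wording describes.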
FIG. 3 illustrates the data elements accessed for a computation subset 300 when iteratively computing a variance or a standard deviation over big data. The difference between a computation subset and a computation window is that the data elements of a computation subset are not ordered (e.g., an existing data element may be removed from any position of the computation subset and a data element may be added at any position of the computation subset). For example, referring to FIG. 3, an accessed data element may be removed from any position of computation subset 300 (marked "r") and a data element may be added at any position of computation subset 300 (marked "a"). For computation subset 300, the first n data elements are accessed to compute (initialize) one or more components for the first pre-adjustment computation subset, and the variance and/or standard deviation are then computed as needed. Over time, a data element $x_r$ to be removed from the pre-adjustment computation subset and a data element $x_a$ to be added to it are accessed in order to directly iteratively compute the v components of the variance or standard deviation for the adjusted computation subset and to indirectly iteratively compute the w = p - v components. One or more of these iteratively computed components may be used to compute the variance or the standard deviation. The v components may be directly iteratively computed from the removed data element, the added data element, and the v components of the pre-adjustment computation subset, without accessing the other data elements in computation subset 300. For a given iterative algorithm, v is a constant, so the number of operations for directly iteratively computing the v components is constant, as is the number of operations for indirectly iteratively computing the w = p - v components.
Therefore, after one or more components have been computed for the first pre-adjustment computation subset, the amount of computation needed to compute all p components for an adjusted computation subset of a given size n is reduced and remains constant. The larger n is, the more pronounced the reduction in computation.
The following sections have some examples of components of variance or standard deviation and examples of iterative variance or standard deviation calculation algorithms.
FIG. 4-1 illustrates the definitions of variance and standard deviation. Assume that the computation subset $X = \{x_i \mid i = 1, \dots, n\}$, consisting of n data elements, is a subset of a large data set and that its variance or standard deviation needs to be calculated. Assume that, after some time, a data element $x_r$ (1 ≤ r ≤ n) is to be removed from the pre-adjustment computation subset X and a data element $x_a$ is to be added to it. Whenever any component of the variance or the standard deviation needs to be recalculated because the data elements in the computation subset have changed, a new round of iterative calculation starts. In the new round, the previous adjusted computation subset becomes the pre-adjustment computation subset.
Equations 401 and 402 are the conventional equations for the sum $S_k$ and the mean $\bar{x}_k$, respectively, of all data elements of the pre-adjustment computation subset X for the k-th round of computation. Equations 403 and 404 are the conventional equations for the sample variance $vs_k$ and the population variance $vp_k$, respectively, of all data elements of X for the k-th round of computation. Equations 405 and 406 are the conventional equations for the sample standard deviation $s_k$ and the population standard deviation $\sigma_k$, respectively, of all data elements of X for the k-th round of computation. Equations 407 and 408 are the conventional equations for the sum $S_{k+1}$ and the mean $\bar{x}_{k+1}$, respectively, of all data elements of the adjusted computation subset X' for the (k+1)-th round of computation. Equations 409 and 410 are the conventional equations for the sample variance $vs_{k+1}$ and the population variance $vp_{k+1}$, respectively, of all data elements of the adjusted computation subset X' for the (k+1)-th round of computation. Equations 411 and 412 are the conventional equations for the sample standard deviation $s_{k+1}$ and the population standard deviation $\sigma_{k+1}$, respectively, of the adjusted computation subset X' for the (k+1)-th round of computation.
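FIG. 4-1 itself is not reproduced in this text. Assuming its equations follow the usual textbook definitions, equations 401-406 would read roughly as below. The equation tags follow the numbering used in the description; the exact typeset form in the figure is an assumption.

```latex
\begin{align}
S_k &= \sum_{i=1}^{n} x_i \tag{401}\\
\bar{x}_k &= \frac{S_k}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{402}\\
vs_k &= \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}_k\right)^2 \tag{403}\\
vp_k &= \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}_k\right)^2 \tag{404}\\
s_k &= \sqrt{vs_k} \tag{405}\\
\sigma_k &= \sqrt{vp_k} \tag{406}
\end{align}
```

Under the same assumption, equations 407-412 would have the same form with the sums taken over the adjusted computation subset X', i.e. over X with $x_r$ removed and $x_a$ added, and with $\bar{x}_{k+1}$ in place of $\bar{x}_k$.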
To demonstrate how the variance and standard deviation are iteratively calculated using components, three different iterative variance and standard deviation algorithms are provided as examples.
FIG. 4-2 illustrates equations that may be used by the first example iterative variance or standard deviation calculation algorithm (iterative algorithm 1). Equations 413 and 414 may be used to initialize $S_k$ and/or $\bar{x}_k$, respectively, for all data elements of the pre-adjustment computation subset X. Equations 415, 416, 417, and 418 may be used to calculate, as needed and based on the initialized components, the sample variance $vs_k$, the population variance $vp_k$, the sample standard deviation $s_k$, and the population standard deviation $\sigma_k$, respectively. Assume that, after some time, a data element $x_r$ is to be removed from the pre-adjustment computation subset X and a data element $x_a$ is to be added to it. Based on components $S_k$ and/or $\bar{x}_k$, equations 419 and 420 may be used to iteratively calculate $S_{k+1}$ and/or $\bar{x}_{k+1}$, respectively, for the adjusted computation subset X'. Based on $vs_k$, equation 421 may iteratively calculate the sample variance $vs_{k+1}$ of X'. Based on $vs_{k+1}$, equation 422 may calculate the sample standard deviation $s_{k+1}$ of X'. Based on $vp_k$, equation 423 may iteratively calculate the population variance $vp_{k+1}$ of X'. Based on $vp_{k+1}$, equation 424 may calculate the population standard deviation $\sigma_{k+1}$ of X'. Equations 419, 420, 421, and 423 each contain multiple equations, but only one of them is needed, depending on whether the sum, the mean, or both are available.
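FIG. 4-2 is not reproduced here, so the exact form of equations 420 and 421 is unknown. The sketch below uses one algebraically valid update for the mean and the sample variance, derived from the standard definitions, as a stand-in for them. The function names are hypothetical, and the data are the three computation subsets of the worked example later in the description.

```python
import math


def init_algorithm1(subset):
    # Initialization: mean and sample variance computed the conventional way.
    n = len(subset)
    mean = sum(subset) / n
    vs = sum((x - mean) ** 2 for x in subset) / (n - 1)
    return mean, vs


def iterate_algorithm1(mean_k, vs_k, n, x_r, x_a):
    # Assumed stand-in for equation 420: shift the mean by the net change.
    mean_k1 = mean_k + (x_a - x_r) / n
    # Assumed stand-in for equation 421, from
    # SSD_{k+1} - SSD_k = (x_a - x_r) * (x_a + x_r - mean_k - mean_{k+1}).
    vs_k1 = vs_k + (x_a - x_r) * (x_a + x_r - mean_k - mean_k1) / (n - 1)
    return mean_k1, vs_k1


mean_1, vs_1 = init_algorithm1([8, 3, 6, 1, 9, 2, 5, 4])
mean_2, vs_2 = iterate_algorithm1(mean_1, vs_1, 8, x_r=8, x_a=-7)
mean_3, vs_3 = iterate_algorithm1(mean_2, vs_2, 8, x_r=3, x_a=11)
s_3 = math.sqrt(vs_3)
```

Each iteration uses only the two previous components and the two changed elements, so its cost does not depend on n.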
FIG. 4-3 illustrates equations that may be used by the second example iterative variance or standard deviation calculation algorithm (iterative algorithm 2). Equations 425 and 426 may be used to initialize $S_k$ and/or $\bar{x}_k$, respectively, for all data elements of the pre-adjustment computation subset X. Equation 427 may be used to initialize component $SSD_k$. Equations 428 and 429 may calculate, as needed and based on $SSD_k$, the sample variance $vs_k$ and the population variance $vp_k$, respectively. Equation 430 may calculate the sample standard deviation $s_k$ as needed based on $SSD_k$ or $vs_k$. Equation 431 may calculate the population standard deviation $\sigma_k$ as needed based on $SSD_k$ or $vp_k$. Assume that, after some time, a data element $x_r$ is to be removed from the pre-adjustment computation subset X and a data element $x_a$ is to be added to it. Based on components $S_k$ and/or $\bar{x}_k$, equations 432 and 433 may be used to iteratively calculate $S_{k+1}$ and/or $\bar{x}_{k+1}$, respectively, for the adjusted computation subset X'. Based on component $SSD_k$, equation 434 may be used to iteratively calculate $SSD_{k+1}$ for X'. Based on component $SSD_{k+1}$, equations 435 and 436 may calculate the sample variance $vs_{k+1}$ and the population variance $vp_{k+1}$, respectively. Equation 437 may calculate the sample standard deviation $s_{k+1}$ as needed based on $SSD_{k+1}$ or $vs_{k+1}$. Equation 438 may calculate the population standard deviation $\sigma_{k+1}$ as needed based on $SSD_{k+1}$ or $vp_{k+1}$. Equations 432, 433, and 434 each contain multiple equations, but only one of them is needed, depending on whether the sum, the mean, or both are available.
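Since FIG. 4-3 is not reproduced here, the form of equation 434 cannot be quoted. One update for $SSD_{k+1}$ that follows directly from the standard definitions, and that equation 434 may or may not match term for term, can be derived as follows, taking $SSD_k = \sum_{i}(x_i - \bar{x}_k)^2$ (an assumed reading of the component name):

```latex
\begin{align*}
SSD_k &= \sum_{i=1}^{n} x_i^2 - n\,\bar{x}_k^2,
\qquad
\bar{x}_{k+1} = \bar{x}_k + \frac{x_a - x_r}{n}\\[4pt]
SSD_{k+1} - SSD_k
  &= \left(x_a^2 - x_r^2\right) - n\left(\bar{x}_{k+1}^2 - \bar{x}_k^2\right)\\
  &= (x_a - x_r)(x_a + x_r) - n\left(\bar{x}_{k+1} - \bar{x}_k\right)\left(\bar{x}_{k+1} + \bar{x}_k\right)\\
  &= (x_a - x_r)\left(x_a + x_r - \bar{x}_k - \bar{x}_{k+1}\right)
  \quad\text{since } n\left(\bar{x}_{k+1} - \bar{x}_k\right) = x_a - x_r .
\end{align*}
```

As a check against the example data used later in the description (subset 8, 3, 6, 1, 9, 2, 5, 4 with $x_r = 8$ and $x_a = -7$): $SSD_1 = 55.5$, $\bar{x}_1 = 4.75$, $\bar{x}_2 = 2.875$, so $SSD_2 = 55.5 + (-15)(1 - 4.75 - 2.875) = 154.875$ and $vs_2 = 154.875 / 7 = 22.125$.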
FIG. 4-4 illustrates equations that may be used by the third example iterative variance or standard deviation calculation algorithm (iterative algorithm 3). Equations 439 and 440 may be used to initialize $S_k$ and/or $\bar{x}_k$, respectively, for all data elements of the pre-adjustment computation subset X. Equation 441 may be used to initialize component $SS_k$. Equations 442 and 443 may calculate, as needed and based on $SS_k$, the sample variance $vs_k$ and the population variance $vp_k$, respectively. Equation 444 may calculate the sample standard deviation $s_k$ as needed based on $SS_k$ or $vs_k$. Equation 445 may calculate the population standard deviation $\sigma_k$ as needed based on $SS_k$ or $vp_k$. Assume that, after some time, a data element $x_r$ is to be removed from the pre-adjustment computation subset X and a data element $x_a$ is to be added to it. Based on components $S_k$ and/or $\bar{x}_k$, equations 446 and 447 may be used to iteratively calculate $S_{k+1}$ and/or $\bar{x}_{k+1}$, respectively, for the adjusted computation subset X'. Based on component $SS_k$, equation 448 may be used to iteratively calculate $SS_{k+1}$ for X'. Based on component $SS_{k+1}$, equations 449 and 450 may calculate the sample variance $vs_{k+1}$ and the population variance $vp_{k+1}$, respectively. Equation 451 may calculate the sample standard deviation $s_{k+1}$ as needed based on $SS_{k+1}$ or $vs_{k+1}$. Equation 452 may calculate the population standard deviation $\sigma_{k+1}$ as needed based on $SS_{k+1}$ or $vp_{k+1}$. Equations 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, and 452 each contain multiple equations, but only one of them is needed, depending on whether the sum, the mean, or both are available.
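As a worked illustration of iterative algorithm 3, assume that $SS_k$ denotes the sum of squares of the data elements, that equation 448 takes the natural form $SS_{k+1} = SS_k + x_a^2 - x_r^2$, and that the sample variance is obtained as $(SS - n\bar{x}^2)/(n-1)$. All three are assumptions, since FIG. 4-4 is not reproduced here. With the example data used later in the description (subset 8, 3, 6, 1, 9, 2, 5, 4, then 8 replaced by -7, then 3 replaced by 11), the three rounds would give:

```latex
\begin{align*}
SS_1 &= 8^2+3^2+6^2+1^2+9^2+2^2+5^2+4^2 = 236, &
\bar{x}_1 &= \tfrac{38}{8} = 4.75, &
vs_1 &= \tfrac{236 - 8\cdot 4.75^2}{7} = \tfrac{55.5}{7} \approx 7.9286\\
SS_2 &= 236 + (-7)^2 - 8^2 = 221, &
\bar{x}_2 &= 4.75 + \tfrac{-7-8}{8} = 2.875, &
vs_2 &= \tfrac{221 - 8\cdot 2.875^2}{7} = \tfrac{154.875}{7} = 22.125\\
SS_3 &= 221 + 11^2 - 3^2 = 333, &
\bar{x}_3 &= 2.875 + \tfrac{11-3}{8} = 3.875, &
vs_3 &= \tfrac{333 - 8\cdot 3.875^2}{7} = \tfrac{212.875}{7} \approx 30.4107
\end{align*}
```

Rounds 2 and 3 each need only two squarings, one addition, and one subtraction to update $SS$, independent of the subset size.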
To demonstrate the iterative variance and standard deviation algorithms and compare them with the conventional algorithm, three examples are given below. Data from three computation subsets are used. For the conventional algorithm, the computation process is exactly the same for all three computation subsets. For an iterative algorithm, the first computation subset is used to initialize the components, and the second and third computation subsets are computed iteratively.
FIGS. 5-1, 5-2, and 5-3 show first, second, and third subsets of computations, respectively, for an example computation. The computation subset 503 comprises 8 data elements of the large data set 501: 8,3,6,1,9,2,5,4. The computation subset 504 includes 8 data elements of the large data set 501: 3,6,1,9,2,5,4, -7. The computation subset 505 comprises 8 data elements of the large data set 501: 6,1,9,2,5,4, -7, 11. The calculation subset size 502(n) is 8.
The sample variance and the sample standard deviation of computation subsets 503, 504, and 505 are first calculated using the conventional algorithm.
The sample variance and sample standard deviation are calculated for computation subset 503:
[Equation images not reproduced: conventional computation of the mean, the sample variance, and the sample standard deviation of computation subset 503]
Without any optimization, calculating the sample variance for a computation subset of size 8 takes 2 divisions, 8 multiplications, 14 additions, and 9 subtractions. Calculating the sample standard deviation takes 1 additional square root.
The same equations and process can be used to calculate the sample variance and the sample standard deviation for computation subset 504 shown in FIG. 5-2 and computation subset 505 shown in FIG. 5-3. The sample variance and the sample standard deviation of computation subset 504 are:
[Equation images not reproduced: conventional computation of the sample variance and the sample standard deviation of computation subset 504]
Without optimization, calculating the sample variance takes 2 divisions, 8 multiplications, 14 additions, and 9 subtractions. Calculating the sample standard deviation takes 1 additional square root. The sample variance and the sample standard deviation of computation subset 505 are:
[Equation images not reproduced: conventional computation of the sample variance and the sample standard deviation of computation subset 505]
Without optimization, calculating the sample variance takes 2 divisions, 8 multiplications, 14 additions, and 9 subtractions. Calculating the sample standard deviation takes 1 additional square root. In general, without optimization, the conventional algorithm requires 2 divisions, n multiplications, 2(n-1) additions, and n+1 subtractions to calculate the sample variance for a computation subset of size n. Calculating the sample standard deviation takes 1 additional square root.
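The conventional computation can be reproduced in a few lines. The sketch below recomputes everything from scratch for each of the three computation subsets and encodes the general operation counts stated above as a small helper. The function names are hypothetical, and the printed values are derived from the data rather than copied from the equation images.

```python
import math


def conventional_sample_variance(subset):
    # Two passes over all n elements: one for the mean, one for the deviations.
    n = len(subset)
    mean = sum(subset) / n
    return sum((x - mean) * (x - mean) for x in subset) / (n - 1)


def conventional_op_count(n):
    # Operation counts stated in the text for the unoptimized sample variance.
    return {"div": 2, "mul": n, "add": 2 * (n - 1), "sub": n + 1}


subsets = {
    503: [8, 3, 6, 1, 9, 2, 5, 4],
    504: [3, 6, 1, 9, 2, 5, 4, -7],
    505: [6, 1, 9, 2, 5, 4, -7, 11],
}
variances = {k: conventional_sample_variance(v) for k, v in subsets.items()}
std_devs = {k: math.sqrt(v) for k, v in variances.items()}

for key in subsets:
    print(key, round(variances[key], 4), round(std_devs[key], 4))
```

Every subset costs the same here even though consecutive subsets differ in a single element, which is the redundancy the iterative algorithms aim to avoid.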
The sample variance and the sample standard deviation of computation subsets 503, 504, and 505 are calculated below using iterative algorithm 1.
The sample variance and sample standard deviation are calculated for computation subset 503 of size 8:
Initialize $\bar{x}_1$ for round 1 with equation 414:
[Equation images not reproduced: initialization of $\bar{x}_1$ for computation subset 503]
Calculate the sample variance $vs_1$ for round 1 using equation 415:
[Equation image not reproduced: computation of $vs_1$]
Calculate the sample standard deviation $s_1$ for round 1 using equation 417:
[Equation image not reproduced: computation of $s_1$]
Calculating the sample variance for computation subset 503 takes 2 divisions, 8 multiplications, 14 additions, and 9 subtractions. Calculating the sample standard deviation takes 1 additional square root.
The sample variance and sample standard deviation are calculated for computation subset 504 of size 8:
Iteratively calculate the component $\bar{x}_2$ for round 2 using equation 420:
[Equation images not reproduced: iterative computation of $\bar{x}_2$]
Calculate $vs_2$ and $s_2$ for round 2 using equations 421 and 422, respectively:
[Equation images not reproduced: computation of $vs_2$ and $s_2$]
Iteratively calculating the sample variance for adjusted computation subset 504 takes 2 divisions, 1 multiplication, 4 additions, and 4 subtractions. Calculating the sample standard deviation takes 1 additional square root.
Calculate the sample variance and sample standard deviation for computation subset 505 of size 8:
Iteratively calculate the round 3 mean using equation 420:
[Equation image not reproduced]
Calculate the round 3 sample variance $vs_3$ and sample standard deviation $s_3$ using equations 421 and 422, respectively:
[Equation image not reproduced]
Iteratively calculating the sample variance for the adjusted computation subset 505 takes 2 divisions, 1 multiplication, 4 additions, and 4 subtractions. The sample standard deviation takes 1 additional square root.
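The three rounds of iterative algorithm 1 can be sketched as follows. The exact form of equations 420 to 422 is not recoverable from this passage, so the update below uses a standard sliding-window identity for the mean and the sample variance; it is an assumption, not necessarily the patent's equation, and its operation count may differ slightly from the one stated. The data values are also assumed; they were chosen to agree with the removed and added elements (8 and -7, then 3 and 11) that appear in the algorithm 3 example.

```python
import math
import statistics

def slide_mean_variance(mean, vs, x_removed, x_added, n):
    """Update the mean and sample variance of a fixed-size window when
    x_removed leaves and x_added enters, without touching other elements."""
    delta = x_added - x_removed
    new_mean = mean + delta / n
    # SSD changes by delta * ((x_added - new_mean) + (x_removed - mean));
    # dividing that change by (n - 1) gives the change in sample variance.
    new_vs = vs + delta * ((x_added - new_mean) + (x_removed - mean)) / (n - 1)
    return new_mean, new_vs

# Assumed data stream; windows of size 8 play the role of subsets 503-505.
data = [8, 3, 6, 1, 9, 2, 5, 4, -7, 11]
n = 8

# Round 1: full initialization over the first window.
mean = sum(data[:n]) / n
vs = sum((x - mean) ** 2 for x in data[:n]) / (n - 1)
results = [(mean, vs, math.sqrt(vs))]

# Rounds 2 and 3: constant-work updates.
for k in range(len(data) - n):
    mean, vs = slide_mean_variance(mean, vs, data[k], data[k + n], n)
    results.append((mean, vs, math.sqrt(vs)))

# Cross-check against a from-scratch calculation of each window.
expected = [statistics.variance(data[k:k + n]) for k in range(len(results))]
```

Each update touches only the removed element, the added element, and the previous mean and variance, which is why the per-round cost stays constant regardless of n.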
The sample variance and sample standard deviation of computation subsets 503, 504, and 505 are calculated below using iterative algorithm 2.
Calculate the sample variance and sample standard deviation for computation subset 503 of size 8:
Initialize the round 1 mean with equation 426:
[Equation image not reproduced]
Initialize the round 1 $SSD_1$ with equation 427:
[Equation image not reproduced]
Calculate the round 1 sample variance $vs_1$ and sample standard deviation $s_1$ using equations 428 and 430, respectively:
[Equation image not reproduced]
Calculating the sample variance for computation subset 503 takes 2 divisions, 8 multiplications, 14 additions, and 9 subtractions. The sample standard deviation takes 1 additional square root.
Calculate the sample variance and sample standard deviation for computation subset 504 of size 8:
Iteratively calculate the round 2 components, the mean and $SSD_2$, using equations 433 and 434, respectively:
[Equation image not reproduced]
Calculate the round 2 sample variance $vs_2$ and sample standard deviation $s_2$ using equations 435 and 437, respectively:
[Equation image not reproduced]
Iteratively calculating the sample variance for the adjusted computation subset 504 takes 2 divisions, 1 multiplication, 4 additions, and 4 subtractions. The sample standard deviation takes 1 additional square root.
Calculate the sample variance and sample standard deviation for computation subset 505 of size 8:
Iteratively calculate the round 3 components, the mean and $SSD_3$, using equations 433 and 434, respectively:
[Equation image not reproduced]
Calculate the round 3 sample variance $vs_3$ and sample standard deviation $s_3$ using equations 435 and 437, respectively:
[Equation image not reproduced]
Iteratively calculating the sample variance for the adjusted computation subset 505 takes 2 divisions, 1 multiplication, 4 additions, and 4 subtractions. The sample standard deviation takes 1 additional square root.
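A minimal sketch of the idea behind iterative algorithm 2, which carries the mean and the sum of squared deviations (SSD) from round to round, might look like this. Equations 433 to 437 are not legible in this passage, so the SSD update below is a standard identity used as a stand-in; the function name and the data values are assumptions for illustration.

```python
import math

def slide_mean_ssd(mean, ssd, x_removed, x_added, n):
    """Update the mean and SSD of a fixed-size window in constant time."""
    delta = x_added - x_removed
    new_mean = mean + delta / n
    new_ssd = ssd + delta * ((x_added - new_mean) + (x_removed - mean))
    return new_mean, new_ssd

# Assumed data stream; each window of size 8 stands in for subsets 503-505.
data = [8, 3, 6, 1, 9, 2, 5, 4, -7, 11]
n = 8

# Round 1: initialize both components from the full first window.
mean = sum(data[:n]) / n
ssd = sum((x - mean) ** 2 for x in data[:n])

rounds = []
for k in range(len(data) - n + 1):
    if k > 0:
        # Rounds 2 and 3: update components from the removed/added pair only.
        mean, ssd = slide_mean_ssd(mean, ssd, data[k - 1], data[k - 1 + n], n)
    vs = ssd / (n - 1)   # sample variance from the SSD component
    s = math.sqrt(vs)    # sample standard deviation
    rounds.append((mean, ssd, vs, s))
```

Keeping SSD rather than the variance itself as the carried component means the division by n-1 is deferred until a variance is actually needed.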
Next, the sample variance and sample standard deviation of computation subsets 503, 504, and 505 are calculated using iterative algorithm 3.
Calculate the sample variance and sample standard deviation for computation subset 503 of size 8:
Initialize the round 1 mean with equation 440:
[Equation image not reproduced]
Initialize the round 1 $SS_1$ with equation 441:
[Equation image not reproduced]
Calculate the round 1 sample variance $vs_1$ and sample standard deviation $s_1$ using equations 442 and 444, respectively:
[Equation image not reproduced]
Calculating the sample variance for computation subset 503 takes 2 divisions, 10 multiplications, 14 additions, and 2 subtractions. The sample standard deviation takes 1 additional square root.
Calculate the sample variance and sample standard deviation for computation subset 504 of size 8:
Iteratively calculate the round 2 components, the mean and $SS_2$, using equations 447 and 448, respectively:
[Equation image not reproduced]
$SS_2 = SS_1 + x_a^2 - x_r^2 = 236 + (-7)^2 - 8^2 = 236 + 49 - 64 = 221$
Calculate the round 2 sample variance $vs_2$ and sample standard deviation $s_2$ using equations 449 and 451, respectively:
[Equation image not reproduced]
Iteratively calculating the sample variance for the adjusted computation subset 504 takes 2 divisions, 4 multiplications, 2 additions, and 4 subtractions. The sample standard deviation takes 1 additional square root.
Calculate the sample variance and sample standard deviation for computation subset 505 of size 8:
Iteratively calculate the round 3 components, the mean and $SS_3$, using equations 447 and 448, respectively:
[Equation image not reproduced]
$SS_3 = SS_2 + x_a^2 - x_r^2 = 221 + 11^2 - 3^2 = 221 + 121 - 9 = 333$
Calculate the round 3 sample variance $vs_3$ and sample standard deviation $s_3$ using equations 449 and 451, respectively:
[Equation image not reproduced]
Iteratively calculating the sample variance for the adjusted computation subset 505 takes 2 divisions, 4 multiplications, 2 additions, and 4 subtractions. The sample standard deviation takes 1 additional square root.
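The rounds of iterative algorithm 3 can be sketched as below. The sum-of-squares update is the one written out in the text (SS plus the square of the added element minus the square of the removed element). The mean update and the final variance formula are assumptions: they are the natural choices that reproduce the stated per-round cost of 2 divisions, 4 multiplications, 2 additions, and 4 subtractions. The data values are likewise assumed; they were chosen so that SS is 236 in round 1 and the removed and added elements match the text.

```python
import math

def slide_mean_ss(mean, ss, x_removed, x_added, n):
    """Update the mean and the sum of squares (SS) of a fixed-size window."""
    new_mean = mean + (x_added - x_removed) / n              # 1 sub, 1 div, 1 add
    new_ss = ss + x_added * x_added - x_removed * x_removed  # 2 mul, 1 add, 1 sub
    return new_mean, new_ss

def sample_variance_from_ss(mean, ss, n):
    """Sample variance from the mean and SS components."""
    return (ss - n * mean * mean) / (n - 1)                  # 2 mul, 2 sub, 1 div

# Assumed data stream; windows of size 8 stand in for subsets 503-505.
data = [8, 3, 6, 1, 9, 2, 5, 4, -7, 11]
n = 8

# Round 1: initialize the components from the full first window.
mean = sum(data[:n]) / n
ss = sum(x * x for x in data[:n])

ss_values = [ss]
variances = [sample_variance_from_ss(mean, ss, n)]
for k in range(len(data) - n):
    mean, ss = slide_mean_ss(mean, ss, data[k], data[k + n], n)
    ss_values.append(ss)
    variances.append(sample_variance_from_ss(mean, ss, n))

std_devs = [math.sqrt(v) for v in variances]
```

With these assumed values, `ss_values` reproduces the three SS figures given in the text (236, 221, 333). Note that the SS-based formula subtracts two potentially large numbers, so in floating point it can lose precision when the mean is large relative to the spread.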
In the three examples above, the mean is used in the iterative calculation of the sample variance and sample standard deviation. A sum may be used instead of the mean for the iterative calculation; only the number of operations differs.
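As a rough illustration of the sum-based variant, the sketch below carries a running sum and a sum of squares instead of the mean. The function names, the variance formula in terms of the sum, and the data values are assumptions made for this example; the passage does not give the sum-based equations.

```python
import math

def slide_sum_ss(total, ss, x_removed, x_added):
    """Update the running sum and sum of squares of a fixed-size window."""
    new_total = total + x_added - x_removed
    new_ss = ss + x_added * x_added - x_removed * x_removed
    return new_total, new_ss

def sample_variance_from_sum(total, ss, n):
    """Sample variance from the sum and sum-of-squares components."""
    return (ss - total * total / n) / (n - 1)

# Assumed data stream with a sliding window of size 8.
data = [8, 3, 6, 1, 9, 2, 5, 4, -7, 11]
n = 8

total = sum(data[:n])
ss = sum(x * x for x in data[:n])
totals = [total]
sum_based_variances = [sample_variance_from_sum(total, ss, n)]
for k in range(len(data) - n):
    total, ss = slide_sum_ss(total, ss, data[k], data[k + n])
    totals.append(total)
    sum_based_variances.append(sample_variance_from_sum(total, ss, n))

sum_based_std_devs = [math.sqrt(v) for v in sum_based_variances]
```

The results agree with the mean-based calculation; what changes is where the division by n happens and therefore how many operations each round needs.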
Fig. 6-1 compares the computation amounts of the conventional variance algorithm and the iterative variance algorithms when n is 8. As shown, each iterative algorithm requires fewer additions and subtractions than the conventional algorithm.
Fig. 6-2 compares the computation amounts of the conventional variance algorithm and the iterative variance algorithms when n is 1,000,000. As shown, each iterative algorithm requires far fewer multiplications, additions, and subtractions than the conventional algorithm.
Fig. 6-3 compares the computation amounts of the conventional standard deviation algorithm and the iterative standard deviation algorithms when n is 8. As shown, each iterative algorithm requires far fewer additions and subtractions than the conventional algorithm.
Fig. 6-4 compares the computation amounts of the conventional standard deviation algorithm and the iterative standard deviation algorithms when n is 1,000,000. As shown, each iterative algorithm requires far fewer multiplications, additions, and subtractions than the conventional algorithm.
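The comparison in these figures can be reproduced numerically from the operation counts stated in the description. The sketch below only tabulates those counts; the function and dictionary names are assumptions introduced for the example.

```python
def conventional_ops(n):
    """Operation counts for the conventional sample variance of n elements,
    using the general formula stated in the description."""
    return {"div": 2, "mul": n, "add": 2 * (n - 1), "sub": n + 1}

# Per-round counts stated in the description for the iterative algorithms
# (rounds after initialization); they do not depend on n.
ITERATIVE_OPS = {
    "iterative algorithm 1": {"div": 2, "mul": 1, "add": 4, "sub": 4},
    "iterative algorithm 2": {"div": 2, "mul": 1, "add": 4, "sub": 4},
    "iterative algorithm 3": {"div": 2, "mul": 4, "add": 2, "sub": 4},
}

def total_ops(counts):
    """Total number of arithmetic operations in a count dictionary."""
    return sum(counts.values())

for n in (8, 1_000_000):
    print(f"n = {n}: conventional {conventional_ops(n)}")
    for name, counts in ITERATIVE_OPS.items():
        print(f"    {name}: {counts}")

# The standard deviation adds 1 square root to every row above.
```

For n = 1,000,000 the conventional count grows to roughly four million operations per window, while each iterative round stays at 11 or 12 operations.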
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described implementations are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (10)

1. A method, implemented by a computing system based on one or more computing devices, for iteratively calculating a variance or a standard deviation for an adjusted computation subset of a data set stored on one or more storage devices, characterized by:
initializing, by the computing system based on a computing device, a sum or a mean or a sum and a mean, and one or more other components of a variance or a standard deviation other than the sum and the mean, for a pre-adjustment computation subset of a specified size n (n > 4) of the data set stored on one or more storage devices on the computing system;
accessing, by the computing system based on a computing device, a data element to be removed from the pre-adjustment computation subset and a data element to be added to the pre-adjustment computation subset;
adjusting, by the computing system based on a computing device, the pre-adjustment computation subset by:
removing data elements to be removed from the pre-adjustment computation subset; and
adding data elements to be added to the pre-adjustment computation subset;
iteratively calculating, by a computing system based on the computing device, a sum or an average or a sum and an average for the adjusted computation subset;
iteratively calculating, by the computing system based on a computing device, one or more components of a variance and/or a standard deviation, other than the sum and the mean, for the adjusted computation subset based at least on the one or more components of the variance and/or the standard deviation, other than the sum and the mean, of the pre-adjustment computation subset, wherein iteratively calculating the one or more components comprises:
accessing the one or more components of the variance and/or the standard deviation, other than the sum and the mean, of the pre-adjustment computation subset, so as to avoid accessing all data elements in the adjusted computation subset, thereby reducing data access latency, saving computing resources, and reducing energy consumption; and
based on the removed data element and the added data element, mathematically removing any contribution of the removed data element from each accessed component and mathematically adding any contribution of the added data element to each accessed component, while avoiding use of all data elements in the adjusted computation subset in iteratively calculating the one or more components of the variance or the standard deviation, thereby improving computational efficiency; and
generating, by the computing system based on a computing device, a variance or a standard deviation for the adjusted computation subset based on the one or more components iteratively calculated for the adjusted computation subset.
2. The computing system implemented method of claim 1, wherein: the method further comprises, for each of a plurality of data elements to be removed and each of a plurality of data elements to be added, performing the adjusting of the pre-adjustment computation subset, the iterative calculating of a sum or a mean or a sum and a mean, the direct iterative calculating of one or more components of a variance or a standard deviation other than the sum and the mean, and the generating of a variance or a standard deviation for the adjusted computation subset.
3. The computing system implemented method of claim 2, wherein: the generating a variance or a standard deviation for the adjusted computation subset is performed if and only if the variance or the standard deviation is accessed.
4. The computing system implemented method of claim 3, wherein: generating the variance and/or the standard deviation for the adjusted computation subset further includes indirectly iteratively calculating, by the computing system based on the computing device, one or more components of the variance and/or the standard deviation for the adjusted computation subset, the indirectly iteratively calculating of the one or more components including individually calculating the one or more components based on one or more components other than the component to be calculated.
5. A computing system, characterized by:
one or more computing devices;
each computing device contains one or more processors;
one or more storage media, wherein at least one storage media stores a data set; and
one or more computation modules that, when executed by at least one of the one or more computing devices, determine a variance or a standard deviation for an adjusted computation subset of a specified size of the data set, wherein determining the variance or the standard deviation comprises:
a. initializing a sum or a mean or a sum and a mean, and one or more other components of a variance or a standard deviation other than the sum and the mean, for a pre-adjustment computation subset of the data set of a specified size n (n > 4);
b. accessing a data element to be removed from the pre-adjusted computation subset and a data element to be added to the pre-adjusted computation subset;
c. adjusting the pre-adjustment computation subset, comprising:
removing data elements to be removed from the pre-adjustment computation subset; and
adding data elements to be added to the pre-adjustment computation subset;
d. iteratively calculating a sum or an average or a sum and an average for the adjusted computation subset;
e. iteratively calculating one or more components other than sum and mean of variance or standard deviation for the adjusted computation subset, comprising:
accessing the one or more components other than the sum and the mean of the variance or the standard deviation of the pre-adjustment computation subset to avoid accessing all data elements in the post-adjustment computation subset to reduce data access latency, save computation resources, and reduce energy consumption; and
based on the removed data elements and the added data elements, mathematically removing any contribution of the removed data elements from each accessed component and mathematically adding any contribution of the added data elements to each accessed component to obtain the one or more components of the variance and/or the standard deviation of the adjusted computation subset, while avoiding use of all data elements in the adjusted computation subset in iteratively calculating the one or more components, thereby improving computational efficiency; and
f. generating a variance or a standard deviation for the adjusted computation subset based on the one or more components iteratively calculated for the adjusted computation subset.
6. The computing system of claim 5, wherein: the one or more computing modules, when executed by at least one of the one or more computing devices, perform b, c, d, e, and f a plurality of times.
7. The computing system of claim 6, wherein: f is performed if and only if the variance or the standard deviation is accessed.
8. The computing system of claim 7, wherein: f further includes indirectly iteratively calculating, by the computing system, one or more components of the variance and/or the standard deviation for the adjusted computation subset, the indirectly iteratively calculating of the one or more components including individually calculating the one or more components based on one or more components other than the component to be calculated.
9. A computing system program product for execution on a computing system comprising one or more computing devices, the computing system including one or more processors and one or more storage media, the computing system program product comprising computing device-executable instructions that, when executed by at least one of the computing devices in the computing system, perform a method of generating a variance and/or a standard deviation for an adjusted computation subset of a data set, the method comprising:
initializing a sum or a mean or a sum and a mean, and one or more other components of a variance or a standard deviation other than the sum and the mean, for a pre-adjustment computation subset of a specified size n (n > 4) of the data set stored on at least one storage medium of the computing system;
accessing a data element to be removed from the pre-adjusted computation subset and a data element to be added to the pre-adjusted computation subset;
adjusting the pre-adjustment computation subset by:
removing data elements to be removed from the pre-adjustment computation subset; and
adding data elements to be added to the pre-adjustment computation subset;
iteratively calculating a sum or an average or a sum and an average for the adjusted computation subset;
iteratively calculating one or more components of a variance and/or a standard deviation, other than the sum and the mean, for the adjusted computation subset based at least on the one or more components of the variance and/or the standard deviation, other than the sum and the mean, of the pre-adjustment computation subset, the iteratively calculating of the one or more components comprising:
accessing the one or more components, other than the sum and the mean, of the variance or the standard deviation of the pre-adjustment computation subset, so as to avoid accessing all data elements in the adjusted computation subset, thereby reducing data access latency, saving computing resources, and reducing energy consumption; and
based on the removed data elements and the added data elements, mathematically removing any contribution of the removed data elements from each accessed component and mathematically adding any contribution of the added data elements to each accessed component, while avoiding use of all data elements in the adjusted computation subset in iteratively calculating the one or more components of the variance or the standard deviation, thereby improving computational efficiency; and
generating a variance or a standard deviation for the adjusted computation subset based on the one or more components iteratively calculated for the adjusted computation subset.
10. The computing system program product of claim 9, wherein: generating the variance and/or the standard deviation for the adjusted computation subset further includes indirectly iteratively calculating, by the computing system based on the computing device, one or more components of the variance and/or the standard deviation for the adjusted computation subset, the indirectly iteratively calculating of the one or more components including individually calculating the one or more components based on one or more components other than the component to be calculated.
CN201910204265.6A 2019-03-18 2019-03-18 Method for judging big data discrete degree in real time Pending CN111708979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910204265.6A CN111708979A (en) 2019-03-18 2019-03-18 Method for judging big data discrete degree in real time

Publications (1)

Publication Number Publication Date
CN111708979A true CN111708979A (en) 2020-09-25

Family

ID=72536079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910204265.6A Pending CN111708979A (en) 2019-03-18 2019-03-18 Method for judging big data discrete degree in real time

Country Status (1)

Country Link
CN (1) CN111708979A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination