CN111967616B - Automatic time series regression method and device

Info

Publication number
CN111967616B
Authority
CN
China
Prior art keywords
time
time sequence
window
data set
sequence data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010832356.7A
Other languages
Chinese (zh)
Other versions
CN111967616A (en)
Inventor
陈海波
罗志鹏
王锦
姚灿美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyan Technology Beijing Co ltd
Original Assignee
Shenyan Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyan Technology Beijing Co ltd
Priority to CN202010832356.7A
Publication of CN111967616A
Application granted
Publication of CN111967616B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention provides an automatic time series regression method and device. The method comprises the following steps: acquiring a time series data set and preprocessing it; performing automatic time series feature engineering and data sampling on the preprocessed time series data set; building different types of machine learning models; and, using the time series data set obtained after preprocessing, automatic time series feature engineering and data sampling, calculating dynamic weights based on a time sliding window so as to fuse the different types of machine learning models. In machine learning applications involving time series data, the invention makes it possible to obtain an application model conveniently, without relying on the experience and accumulated knowledge of a data scientist, and to obtain more accurate output from that model.

Description

Automatic time series regression method and device
Technical Field
The present invention relates to the field of machine learning, and in particular to an automatic time series regression method, an automatic time series regression apparatus, a computer device, a non-transitory computer readable storage medium and a computer program product.
Background
With the advent of the data age, the volume of data is growing exponentially, and this rate of growth confronts the scientific community with a variety of challenges. One important form of data is the time series, that is, a function whose independent variable is time. Many data sequences in everyday life are time series, such as stock indices, electrocardiograms, speech signals and wind speeds on grasslands, and all of them vary according to their own inherent characteristics. Time series regression seeks the deterministic, orderly regularity hidden in large amounts of seemingly random and disordered data. This regularity is not fixed; it is dynamic and limited, because it differs across time periods and locations and can change when even one small factor changes.
Time series relational data are very common in application scenarios such as economics and finance, insurance, online advertising, recommendation systems and medical treatment, and such data are often used to build machine learning models that improve the corresponding business. Time series data are among the most important data for applying machine learning to industry scenarios, especially big data scenarios. Because of the particular way the data arrive, time series storage and database design differ greatly from those of an ordinary relational database. Current methods for time series pattern recognition follow two main directions: one is called complex systems, and the other is machine learning. The complex-systems approach fits the data to known models, such as the classical AR (Auto Regressive) model, MA (Moving Average) model, ARMA (Auto Regressive Moving Average) model and ARIMA (Auto Regressive Integrated Moving Average) model. Machine learning, by contrast, performs a "brute-force" fit using a generic class of models, such as neural networks.
In conventional machine learning applications, an experienced expert is needed to extract effective feature information from time series data and to use it to improve the machine learning model. Even with deep knowledge, experts must build valuable time series features through continual trial and error, and make use of multiple related tables, to improve model performance. In addition, selecting an appropriate machine learning model and hyperparameters also requires strong machine learning expertise.
Disclosure of Invention
To solve the above technical problems, the invention provides an automatic time series regression method and device which, in machine learning applications involving time series data, make it possible to obtain an application model conveniently without relying on the experience and accumulated knowledge of a data scientist, and to obtain more accurate output from that model.
The technical scheme adopted by the invention is as follows:
An automatic time series regression method comprises the following steps: acquiring a time series data set and preprocessing it; performing automatic time series feature engineering and data sampling on the preprocessed time series data set; building different types of machine learning models; and, using the time series data set obtained after preprocessing, automatic time series feature engineering and data sampling, calculating dynamic weights based on a time sliding window so as to fuse the different types of machine learning models.
Preprocessing the time series data set includes smoothing the outliers in the time series data set.
The features obtained by performing automatic time series feature engineering on the preprocessed time series data set comprise target features based on a time sliding window, target statistical features based on a time sliding window, target trend features based on a time sliding window, important raw features based on a time sliding window, and statistical features based on a time sliding window.
Sampling the preprocessed time series data set includes randomly sampling the IDs in the time series data set, with different sampling ratios used for different data volumes.
The different types of machine learning models include a linear regression model and a LightGBM model.
Calculating dynamic weights based on a time sliding window, using the time series data set obtained after preprocessing, automatic time series feature engineering and data sampling, so as to fuse the different types of machine learning models, includes: determining an initial fusion weight on the validation set; setting a time window for the test set and testing the first time window with the initial fusion weight; after each time window ends, obtaining the optimal fusion weight for that window from its test results; and updating that optimal fusion weight according to a set rule and testing the next time window with the updated fusion weight.
An automatic time series regression device comprises: a preprocessing module for acquiring a time series data set and preprocessing it; a feature engineering and sampling module for performing automatic time series feature engineering and data sampling on the preprocessed time series data set; a model building module for building different types of machine learning models; and a fusion module for calculating dynamic weights based on a time sliding window, using the time series data set obtained after preprocessing, automatic time series feature engineering and data sampling, so as to fuse the different types of machine learning models.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described automatic time series regression method when executing the program.
A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described automatic time series regression method.
A computer program product which, when executed by a processor, performs the above-described automatic time series regression method.
The invention has the beneficial effects that:
According to the invention, the time series data set is preprocessed, automatic time series feature engineering and data sampling are performed, and different types of machine learning models are fused using dynamic weights calculated over a time sliding window. As a result, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the experience and accumulated knowledge of a data scientist, and more accurate output can be obtained from that model.
Drawings
FIG. 1 is a flow chart of an automatic time series regression method according to an embodiment of the present invention;
Fig. 2 is a block diagram of an automatic time series regression apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the automatic time series regression method according to the embodiment of the present invention includes the following steps:
S1, acquiring a time series data set, and preprocessing the time series data set.
Time series data is a sequence of data recorded in chronological order under a single unified index. In one embodiment of the invention, outliers in the time series data set can be smoothed, which reduces their influence on model accuracy. Outliers occur often in time series tasks, and how they are handled has a considerable effect; handling them is more challenging for time series data than for non-time-series data. Time series data is usually strongly related to time, and the target value may drift into a different range as time passes, so applying the global mean and standard deviation directly would treat some non-outliers as outliers. For this reason, the embodiment of the invention smooths outliers using a combined global and local approach.
This approach considers the global mean and standard deviation together with the mean and standard deviation of a time window around the current point and the values of the neighboring points. The multiple of the global standard deviation used as the deviation threshold is set relatively large, so that normal values are not modified. Note also that the training set and the test set are handled somewhat differently: test data arrives step by step over time, so time steps after the current point are not visible, and the test set is therefore processed using only the adjacent time window before the current point. Once an outlier has been detected in this global and local manner, a value in a relatively normal range is computed from the local mean and standard deviation and from the values adjacent to the current point on the left and right, and is assigned as the new value of the current point.
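The global and local smoothing described above can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration and are not stated in the text: the function name `smooth_outliers`, the threshold multiples `global_k` and `local_k`, the rule that a point must exceed both thresholds to count as an outlier, and the replacement rule (mean of the adjacent values, clamped to the local band). The `causal` flag mimics test-set handling, where only the window before the current point is visible.

```python
from statistics import mean, pstdev


def smooth_outliers(values, window=5, global_k=4.0, local_k=3.0, causal=False):
    """Return a copy of `values` with global-and-local outliers replaced."""
    # Global statistics are taken from the series itself here (an assumption);
    # for a test set they would more plausibly come from the training data.
    g_mean, g_std = mean(values), pstdev(values)
    out = list(values)
    for i, v in enumerate(values):
        lo = max(0, i - window)
        # Causal mode only looks at points before i; otherwise the window is centred.
        hi = i if causal else min(len(values), i + window + 1)
        neighbours = [values[j] for j in range(lo, hi) if j != i]
        if len(neighbours) < 2:
            continue  # not enough local context to judge this point
        l_mean, l_std = mean(neighbours), pstdev(neighbours)
        is_global = abs(v - g_mean) > global_k * g_std
        is_local = abs(v - l_mean) > local_k * l_std
        if not (is_global and is_local):
            continue
        # Replacement: mean of the adjacent values, clamped to the local band.
        adjacent = [values[i - 1]] if i > 0 else []
        if not causal and i + 1 < len(values):
            adjacent.append(values[i + 1])
        candidate = mean(adjacent) if adjacent else l_mean
        band = local_k * l_std
        out[i] = min(l_mean + band, max(l_mean - band, candidate))
    return out


series = [10, 11, 12, 11] * 10
series[20] = 100  # inject a single spike
smoothed = smooth_outliers(series)
```

In this sketch `global_k` is deliberately larger than `local_k`, following the remark that the global multiple is set larger so that normal values are left untouched.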
S2, performing automatic time series feature engineering and data sampling on the preprocessed time series data set.
In time series tasks, what happened within a certain past time window strongly influences predictions of future results. The window that matters generally differs with the time granularity of the data. The embodiment of the invention therefore mainly builds time-sliding-window features on the time series itself. The features obtained by automatic time series feature engineering comprise target features based on a time sliding window, target statistical features based on a time sliding window, target trend features based on a time sliding window, important raw features based on a time sliding window, statistical features based on a time sliding window, and other features.
For the target features based on a time sliding window: in time series data the target is generally close to its values at adjacent time steps and strongly correlated with them, so the most recent past targets can be used first as features. In addition, the time step interval of the data set is identified, that is, whether the data set is stepped in minutes, hours, days, weeks or months, and, depending on this interval, the size of the feature window is determined by a search validated against the model.
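As a rough illustration of the two ideas in this paragraph, the following Python sketch infers the time step as the most common gap between consecutive timestamps and builds lagged target features. The helper names `infer_time_step` and `lag_features` and the use of `None` for missing history are assumptions for illustration; the validated search for the window size is not shown.

```python
from collections import Counter
from datetime import datetime, timedelta


def infer_time_step(timestamps):
    """Return the most common gap between consecutive (sorted) timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return Counter(gaps).most_common(1)[0][0]


def lag_features(targets, n_lags):
    """Row i holds targets[i-1], ..., targets[i-n_lags]; None where history is missing."""
    return [
        [targets[i - k] if i - k >= 0 else None for k in range(1, n_lags + 1)]
        for i in range(len(targets))
    ]


# Hourly data with one missing reading at hour 4.
stamps = [datetime(2020, 1, 1) + timedelta(hours=h) for h in (0, 1, 2, 3, 5, 6)]
step = infer_time_step(stamps)
lags = lag_features([1, 2, 3, 4], n_lags=2)
```

Only past targets are used for each row, so the features remain available at prediction time.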
For the target statistical features based on a time sliding window: once the target sliding window has been built, statistics are computed over the targets. There are two methods. The first computes statistics over the last N days, with some variation depending on the time step; for a time interval of days, statistics are generally taken over the last 2, 3, 5 and 7 days, with memory limits also taken into account. The second divides a large time window into N segments and computes statistics for each segment separately. The statistics include the maximum, minimum, mean, standard deviation and so on.
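The two statistical methods can be sketched in Python as follows. The function names, the use of the population standard deviation, and the choice to drop leftover points when the window does not divide evenly into segments are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def window_stats(history):
    """Maximum, minimum, mean and standard deviation of a non-empty list."""
    return {
        "max": max(history),
        "min": min(history),
        "mean": mean(history),
        "std": pstdev(history),
    }


def last_n_stats(targets, i, sizes=(2, 3, 5, 7)):
    """Method 1: statistics over the last n targets before index i, per window size."""
    return {n: window_stats(targets[i - n:i]) for n in sizes if i - n >= 0}


def segmented_stats(targets, i, window, n_segments):
    """Method 2: split the `window` targets before index i into equal segments."""
    past = targets[max(0, i - window):i]
    seg = len(past) // n_segments
    if seg == 0:
        return []  # not enough history for even one point per segment
    return [window_stats(past[s * seg:(s + 1) * seg]) for s in range(n_segments)]


daily = [1, 2, 3, 4, 5, 6, 7, 8]
recent = last_n_stats(daily, i=5)
segments = segmented_stats(daily, i=8, window=6, n_segments=2)
```

Both helpers read only the targets before index `i`, so the statistics never include the value being predicted.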
For the target trend features based on a time sliding window, the rate of change of the target is calculated, which reflects its trend.
Here $r_i$ denotes the rate of change of the target at the current time, $t_{i-1}$ denotes the target at the previous time node, and $t_{i-2}$ denotes the target at the time node before that.
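The exact expression for the rate of change is not reproduced in this text. Purely as an assumption for illustration, the sketch below uses the relative change between the two most recent past targets, $r_i = (t_{i-1} - t_{i-2}) / t_{i-2}$; the function name `change_rate` and the handling of missing history and zero denominators are likewise hypothetical.

```python
def change_rate(targets, i):
    """Assumed trend feature: relative change between the two targets before index i."""
    if i < 2:
        return None  # fewer than two past targets available
    prev, prev2 = targets[i - 1], targets[i - 2]
    if prev2 == 0:
        return None  # relative change is undefined for a zero base
    return (prev - prev2) / prev2


rate = change_rate([100, 110, 121], i=2)  # uses targets 110 and 100
```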
For the important raw features based on a time sliding window: a model can first be trained on the raw features to obtain their importance, and the features are then ranked by importance. Raw features other than the target are less important than the historical target, so a smaller time window than that used for the target can be chosen; the number of features used is then determined from the time window and the limited system resources.
For the statistical features based on a time sliding window: statistical features are computed separately for categorical features and numerical features. For categorical features, the frequency and ratio with which each feature value occurs within the time window are counted. For numerical features, the calculation is the same as for the target statistics (maximum, minimum, mean and standard deviation), but the time window is kept smaller.
For other features: in addition to the features above, statistics computed on the training set are also tried directly as features of the whole data. Examples include the global frequency and ratio of categorical features in the training set; the frequency and ratio of combinations of two categorical features with high feature importance; the combination of one categorical feature and one numerical feature that both have high feature importance, where statistics of the numerical feature are computed per category; and cross combinations of the historical target with other features, such as multiplying or dividing the target by other, more important numerical features.
The automatic feature engineering and automatic feature selection phases are typically time-consuming and memory-intensive, so the data may be sampled to speed them up. Sampling a time series requires care: if rows are sampled directly at random, records of the same ID at different timestamps are lost, the data becomes incomplete, and the results are poor and not comparable with those obtained on the full data. The embodiment of the invention therefore randomly samples the IDs in the time series data set and uses different sampling ratios for different data volumes: the larger the data, the smaller the sampling ratio. When the data volume is very large, the data is also truncated by time step, keeping the later time steps. This sampling gives results essentially consistent with using the full data, and the final feature selection is relatively stable.
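A minimal Python sketch of ID-level sampling follows. The row layout `(series_id, time_step, value)`, the function name `sample_by_id`, and the parameters `ratio`, `keep_last_steps` and `seed` are all assumptions for illustration; the text does not give concrete sampling ratios or truncation lengths, so none are hard-coded here.

```python
import random


def sample_by_id(rows, ratio, keep_last_steps=None, seed=0):
    """Keep every row of a random subset of IDs, optionally only the latest time steps.

    rows: list of (series_id, time_step, value) tuples.
    """
    ids = sorted({row[0] for row in rows})
    rng = random.Random(seed)
    n_keep = max(1, round(len(ids) * ratio))
    chosen = set(rng.sample(ids, n_keep))
    # Sampling whole IDs keeps each selected series complete over time.
    kept = [row for row in rows if row[0] in chosen]
    if keep_last_steps is not None:
        steps = sorted({row[1] for row in kept})
        latest = set(steps[-keep_last_steps:])
        kept = [row for row in kept if row[1] in latest]
    return kept


rows = [(sid, t, float(sid + t)) for sid in range(10) for t in range(5)]
subset = sample_by_id(rows, ratio=0.3)
recent_subset = sample_by_id(rows, ratio=0.3, keep_last_steps=2)
```

Because whole IDs are selected rather than individual rows, every sampled series retains all of its timestamps, which is the property the paragraph above relies on.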
S3, building machine learning models of different types.
In an embodiment of the invention, two models that differ substantially from each other, a linear model and a tree model, can be built. Specifically, a linear regression model and a LightGBM model may be built.
S4, calculating dynamic weights based on a time sliding window, using the time series data set obtained after preprocessing, automatic time series feature engineering and data sampling, so as to fuse the different types of machine learning models.
The linear regression and LightGBM models differ considerably in how well they perform on time series data sets: on some data sets the two perform similarly, on some linear regression performs better, and on others LightGBM performs better. Analysis shows that these data sets change markedly over time; in some tasks the target keeps increasing over time, and such data is often a poor fit for tree models. Moreover, even on the same data set, the performance gap between the models can change greatly from one time period to another.
Time series data depends strongly on time, so to reduce the influence of time factors on the model, the models can be fused using dynamic weights calculated over a time sliding window.
Specifically, an initial fusion weight $w_0$ is first determined on the validation set. A time window is then set for the test set, and the first time window is tested with the initial fusion weight $w_0$. After each time window ends, the optimal fusion weight for that window is obtained from its test results and updated according to a set rule, and the next time window is tested with the updated fusion weight. That is, the first time window is tested with $w_0$; when it ends, the optimal fusion weight $w_1$ for that window is obtained from its test results, and $w_1$ is then updated using the following formula:
$$w'_1 = r \times w_0 + (1 - r) \times w_1$$
where $r$ is a memory factor, namely the proportion that the previous time window's weight contributes when the current time window's weight is updated.
The second window is then tested using $w'_1$ as the fusion weight, and the weight is updated iteratively in the same way for each subsequent window. Thus, as time passes, results from further in the past have less and less influence on the fusion.
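The windowed update can be sketched in Python as follows. The sketch assumes a single weight $w$ applied to the first model and $1 - w$ to the second, a squared-error criterion with a simple grid search for the per-window optimal weight, windows measured in number of time steps, and the recursion $w'_k = r \times w'_{k-1} + (1 - r) \times w_k$ for later windows. None of these choices, nor the names `best_weight` and `fuse_with_dynamic_weights`, are specified in the text.

```python
def best_weight(pred_a, pred_b, actual, grid=101):
    """Grid-search the weight on model A that minimises squared error in one window."""
    def sse(w):
        return sum((w * a + (1 - w) * b - y) ** 2
                   for a, b, y in zip(pred_a, pred_b, actual))
    return min((k / (grid - 1) for k in range(grid)), key=sse)


def fuse_with_dynamic_weights(pred_a, pred_b, actual, window, w0, r):
    """Fuse two prediction streams window by window, updating the weight after each."""
    fused, w = [], w0
    for start in range(0, len(actual), window):
        end = start + window
        a, b, y = pred_a[start:end], pred_b[start:end], actual[start:end]
        # Predict the current window with the weight carried over from the past.
        fused.extend(w * ai + (1 - w) * bi for ai, bi in zip(a, b))
        # Once the window's true values are known, blend in its optimal weight.
        w = r * w + (1 - r) * best_weight(a, b, y)
    return fused, w


linear_pred = [1, 1, 1, 1]
tree_pred = [3, 3, 3, 3]
truth = [1, 1, 3, 3]
fused, final_w = fuse_with_dynamic_weights(linear_pred, tree_pred, truth,
                                           window=2, w0=0.5, r=0.5)
```

With a memory factor `r` close to 1 the weight changes slowly; with `r` close to 0 it follows the most recent window almost entirely.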
According to the automatic time series regression method of the embodiment of the invention, the time series data set is preprocessed, automatic time series feature engineering and data sampling are performed, and different types of machine learning models are fused using dynamic weights calculated over a time sliding window. As a result, in machine learning applications involving time series data, an application model can be obtained conveniently without relying on the experience and accumulated knowledge of a data scientist, and more accurate output can be obtained from that model.
Corresponding to the automatic time series regression method of the embodiment, the invention also provides an automatic time series regression device.
As shown in Fig. 2, the automatic time series regression device according to the embodiment of the present invention includes a preprocessing module 10, a feature engineering and sampling module 20, a model building module 30 and a fusion module 40. The preprocessing module 10 is used to acquire a time series data set and preprocess it; the feature engineering and sampling module 20 is used to perform automatic time series feature engineering and data sampling on the preprocessed time series data set; the model building module 30 is used to build different types of machine learning models; and the fusion module 40 is used to calculate dynamic weights based on a time sliding window, using the time series data set obtained after preprocessing, automatic time series feature engineering and data sampling, so as to fuse the different types of machine learning models.
Time series data is a sequence of data recorded in chronological order under a single unified index. In one embodiment of the present invention, the preprocessing module 10 may smooth the outliers in the time series data set, which reduces their influence on model accuracy. Outliers occur often in time series tasks, and how they are handled has a considerable effect; handling them is more challenging for time series data than for non-time-series data. Time series data is usually strongly related to time, and the target value may drift into a different range as time passes, so applying the global mean and standard deviation directly would treat some non-outliers as outliers. For this reason, the embodiment of the invention smooths outliers using a combined global and local approach.
This approach considers the global mean and standard deviation together with the mean and standard deviation of a time window around the current point and the values of the neighboring points. The multiple of the global standard deviation used as the deviation threshold is set relatively large, so that normal values are not modified. Note also that the training set and the test set are handled somewhat differently: test data arrives step by step over time, so time steps after the current point are not visible, and the test set is therefore processed using only the adjacent time window before the current point. Once an outlier has been detected in this global and local manner, a value in a relatively normal range is computed from the local mean and standard deviation and from the values adjacent to the current point on the left and right, and is assigned as the new value of the current point.
In time series tasks, what happened within a certain past time window strongly influences predictions of future results. The window that matters generally differs with the time granularity of the data. The feature engineering and sampling module 20 of the embodiment of the present invention therefore mainly builds time-sliding-window features on the time series itself. The features obtained by automatic time series feature engineering comprise target features based on a time sliding window, target statistical features based on a time sliding window, target trend features based on a time sliding window, important raw features based on a time sliding window, statistical features based on a time sliding window, and other features.
For the target features based on a time sliding window: in time series data the target is generally close to its values at adjacent time steps and strongly correlated with them, so the most recent past targets can be used first as features. In addition, the time step interval of the data set is identified, that is, whether the data set is stepped in minutes, hours, days, weeks or months, and, depending on this interval, the size of the feature window is determined by a search validated against the model.
For the target statistical features based on a time sliding window: once the target sliding window has been built, statistics are computed over the targets. There are two methods. The first computes statistics over the last N days, with some variation depending on the time step; for a time interval of days, statistics are generally taken over the last 2, 3, 5 and 7 days, with memory limits also taken into account. The second divides a large time window into N segments and computes statistics for each segment separately. The statistics include the maximum, minimum, mean, standard deviation and so on.
For the target trend features based on a time sliding window, the rate of change of the target is calculated, which reflects its trend.
Here $r_i$ denotes the rate of change of the target at the current time, $t_{i-1}$ denotes the target at the previous time node, and $t_{i-2}$ denotes the target at the time node before that.
For the important raw features based on a time sliding window: a model can first be trained on the raw features to obtain their importance, and the features are then ranked by importance. Raw features other than the target are less important than the historical target, so a smaller time window than that used for the target can be chosen; the number of features used is then determined from the time window and the limited system resources.
For the statistical features based on a time sliding window: statistical features are computed separately for categorical features and numerical features. For categorical features, the frequency and ratio with which each feature value occurs within the time window are counted. For numerical features, the calculation is the same as for the target statistics (maximum, minimum, mean and standard deviation), but the time window is kept smaller.
For other features: in addition to the features above, statistics computed on the training set are also tried directly as features of the whole data. Examples include the global frequency and ratio of categorical features in the training set; the frequency and ratio of combinations of two categorical features with high feature importance; the combination of one categorical feature and one numerical feature that both have high feature importance, where statistics of the numerical feature are computed per category; and cross combinations of the historical target with other features, such as multiplying or dividing the target by other, more important numerical features.
The automatic feature engineering and automatic feature selection phases are typically time-consuming and memory-intensive, so the feature engineering and sampling module 20 may sample the data to speed them up. Sampling a time series requires care: if rows are sampled directly at random, records of the same ID at different timestamps are lost, the data becomes incomplete, and the results are poor and not comparable with those obtained on the full data. The feature engineering and sampling module 20 of the embodiment of the present invention therefore randomly samples the IDs in the time series data set and uses different sampling ratios for different data volumes: the larger the data, the smaller the sampling ratio. When the data volume is very large, the data is also truncated by time step, keeping the later time steps. This sampling gives results essentially consistent with using the full data, and the final feature selection is relatively stable.
In an embodiment of the present invention, the model building module 30 may build two models that differ substantially in type, namely a linear model and a tree model. Specifically, a linear regression model and a LightGBM model may be established.
The linear regression and LightGBM models can perform quite differently on time series data sets: on some data sets the two perform similarly, on some linear regression performs better, and on others LightGBM performs better. Analysis shows that these data sets vary significantly over time; in some tasks the target increases over time, and tree models often fit such data poorly. Moreover, even for the same data set, the relative performance of the models can change considerably between time periods.
Time series data is strongly time-dependent, so in order to reduce the influence of time on the models, the fusion module 40 may fuse the models by calculating dynamic weights based on a time sliding window.
Specifically, the fusion module 40 may first determine an initial fusion weight w0 from the validation set, then set time windows for the test set, and test with the initial fusion weight w0 in the first time window. After each time window ends, the fusion module 40 obtains the corresponding optimal fusion weight from the test result of that time window, updates the optimal fusion weight according to a set rule, and tests the next time window with the updated fusion weight. That is, the fusion module 40 tests with the initial fusion weight w0 in the first time window; after the first time window ends, the optimal fusion weight w1 of that window is obtained from its test result, and w1 is then updated using the following formula:
w′1 = r × w0 + (1 - r) × w1
where r is a memory factor, namely the proportion that the weight of the previous time window contributes when the weight of the current time window is updated.
The second time window is then tested using w′1 as the fusion weight, and the weight is updated iteratively in the same way for each subsequent window. Thus, the further in the past a test result lies, the smaller its influence on the fusion.
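For illustration only, the sliding-window weight update may be sketched as follows for two models A and B. The grid search in `best_weight`, the squared-error criterion, and the names `fuse_with_sliding_windows`, `p_a`, `p_b`, and `window` are assumptions of this sketch; the embodiment does not state how the optimal weight of a window is found. The sketch also assumes that, from the second window onward, the "previous" weight in the update formula is the already-updated weight, which is one reading of "and so on".

```python
def best_weight(y, p_a, p_b, grid=101):
    """Weight w in [0, 1] on model A that minimises the squared error of
    w * p_a + (1 - w) * p_b, found by a simple grid search (an assumption)."""
    best_w, best_err = 0.0, float("inf")
    for i in range(grid):
        w = i / (grid - 1)
        err = sum((w * a + (1 - w) * b - t) ** 2
                  for a, b, t in zip(p_a, p_b, y))
        if err < best_err:
            best_w, best_err = w, err
    return best_w


def fuse_with_sliding_windows(y, p_a, p_b, w0, window, r=0.5):
    """Fuse two prediction series window by window.

    Each window is predicted with the current weight; once its true values
    are known, the window's optimal weight w_opt is computed and the weight
    is updated as w' = r * w + (1 - r) * w_opt."""
    fused, weights = [], []
    w = w0
    for start in range(0, len(y), window):
        end = start + window
        weights.append(w)
        fused.extend(w * a + (1 - w) * b
                     for a, b in zip(p_a[start:end], p_b[start:end]))
        w_opt = best_weight(y[start:end], p_a[start:end], p_b[start:end])
        w = r * w + (1 - r) * w_opt  # memory factor r keeps part of the old weight
    return fused, weights
```

The initial weight `w0` could be obtained by calling `best_weight` on validation-set predictions. With a memory factor r, a window that lies k updates in the past contributes in proportion to r to the power k, so older results fade geometrically.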
According to the automatic time series regression device of the embodiment of the present invention, the time series data set is preprocessed, automatic time series feature engineering and data sampling are performed, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. As a result, in machine learning applications involving time series data, a model can be obtained conveniently without relying on the experience and accumulated knowledge of data scientists, and more accurate outputs can be obtained with that model.
Corresponding to the above embodiments, the present invention also provides a computer device.
The computer device of the embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the automatic time series regression method according to the above embodiment of the present invention can be implemented.
According to the computer device of the embodiment of the present invention, when the processor executes the computer program stored in the memory, the time series data set is first preprocessed, automatic time series feature engineering and data sampling are performed, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. As a result, in machine learning applications involving time series data, a model can be obtained conveniently without relying on the experience and accumulated knowledge of data scientists, and more accurate outputs can be obtained with that model.
Corresponding to the above embodiments, the present invention also provides a non-transitory computer-readable storage medium.
The non-transitory computer-readable storage medium of the embodiment of the present invention has stored thereon a computer program which, when executed by a processor, can implement the automatic time series regression method according to the above embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the present invention, when a processor executes the computer program stored thereon, the time series data set is first preprocessed, automatic time series feature engineering and data sampling are performed, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. As a result, in machine learning applications involving time series data, a model can be obtained conveniently without relying on the experience and accumulated knowledge of data scientists, and more accurate outputs can be obtained with that model.
Corresponding to the above embodiments, the present invention also provides a computer program product.
When instructions in the computer program product of the embodiment of the present invention are executed by a processor, the automatic time series regression method according to the above embodiment of the present invention can be performed.
According to the computer program product of the embodiment of the present invention, when the processor executes the instructions in the computer program product, the time series data set is first preprocessed, automatic time series feature engineering and data sampling are performed, and different types of machine learning models are fused by calculating dynamic weights based on a time sliding window. As a result, in machine learning applications involving time series data, a model can be obtained conveniently without relying on the experience and accumulated knowledge of data scientists, and more accurate outputs can be obtained with that model.
In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. "A plurality of" means two or more, unless specifically defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be internal communication between two elements or an interaction between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the present invention, unless expressly stated or limited otherwise, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over," or "on" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict one another.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (7)

1. An automatic time series regression method, comprising the steps of:
acquiring a time series data set and preprocessing the time series data set, wherein the time series data in the time series data set is a stock index, an electrocardiogram, a voice signal, or wind speed;
performing automatic time series feature engineering and data sampling on the preprocessed time series data set;
establishing different types of machine learning models; and
calculating dynamic weights based on a time sliding window according to the time series data set after preprocessing, automatic time series feature engineering, and data sampling, so as to fuse the different types of machine learning models,
wherein the features obtained by performing automatic time series feature engineering on the preprocessed time series data set comprise a target feature based on a time sliding window, a target statistical feature based on the time sliding window, a target trend feature based on the time sliding window, an important original feature based on the time sliding window, and a statistical feature based on the time sliding window,
and wherein calculating dynamic weights based on a time sliding window according to the time series data set after preprocessing, automatic time series feature engineering, and data sampling, so as to fuse the different types of machine learning models, comprises: determining an initial fusion weight from a validation set; setting time windows for a test set, and testing with the initial fusion weight in the first time window; after each time window ends, obtaining a corresponding optimal fusion weight according to the test result of that time window; and updating the optimal fusion weight of the time window according to a set rule, and testing the next time window with the updated fusion weight.
2. The automatic time series regression method of claim 1, wherein preprocessing the time series data set comprises:
smoothing abnormal points in the time series data set.
3. The automatic time series regression method of claim 2, wherein performing data sampling on the preprocessed time series data set comprises:
randomly sampling IDs in the time series data set, wherein different sampling ratios are used for different data volumes.
4. The automatic time series regression method of claim 3, wherein the different types of machine learning models include a linear regression model and a LightGBM model.
5. An automatic time series regression apparatus, comprising:
a preprocessing module, configured to acquire a time series data set and preprocess the time series data set, wherein the time series data in the time series data set is a stock index, an electrocardiogram, a voice signal, or wind speed;
a feature engineering and sampling module, configured to perform automatic time series feature engineering and data sampling on the preprocessed time series data set;
a model building module, configured to build different types of machine learning models; and
a fusion module, configured to calculate dynamic weights based on a time sliding window according to the time series data set after preprocessing, automatic time series feature engineering, and data sampling, so as to fuse the different types of machine learning models,
wherein the features obtained by performing automatic time series feature engineering on the preprocessed time series data set comprise a target feature based on a time sliding window, a target statistical feature based on the time sliding window, a target trend feature based on the time sliding window, an important original feature based on the time sliding window, and a statistical feature based on the time sliding window,
and wherein the fusion module is specifically configured to: determine an initial fusion weight from a validation set; set time windows for a test set, and test with the initial fusion weight in the first time window; after each time window ends, obtain a corresponding optimal fusion weight according to the test result of that time window; and update the optimal fusion weight of the time window according to a set rule, and test the next time window with the updated fusion weight.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the automatic time series regression method according to any of claims 1-4 when executing the program.
7. A non-transitory computer readable storage medium, having stored thereon a computer program, which when executed by a processor, implements an automatic time series regression method according to any of claims 1-4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010832356.7A CN111967616B (en) 2020-08-18 2020-08-18 Automatic time series regression method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010832356.7A CN111967616B (en) 2020-08-18 2020-08-18 Automatic time series regression method and device

Publications (2)

Publication Number Publication Date
CN111967616A CN111967616A (en) 2020-11-20
CN111967616B true CN111967616B (en) 2024-04-23

Family

ID=73388391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010832356.7A Active CN111967616B (en) 2020-08-18 2020-08-18 Automatic time series regression method and device

Country Status (1)

Country Link
CN (1) CN111967616B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297481A (en) * 2021-05-08 2021-08-24 武汉卓尔数字传媒科技有限公司 Information pushing method, information pushing device and server based on streaming data processing
CN114860802A (en) * 2022-04-26 2022-08-05 上海分泽时代软件技术有限公司 Fusion method and system of time sequence pedestrian volume data and scalar label number

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6115406A (en) * 1999-09-10 2000-09-05 Interdigital Technology Corporation Transmission using an antenna array in a CDMA communication system
CN108302329A (en) * 2018-01-25 2018-07-20 福建双环能源科技股份有限公司 A kind of dew point data exception detection method
CN108777873A (en) * 2018-06-04 2018-11-09 江南大学 The wireless sensor network abnormal deviation data examination method of forest is isolated based on weighted blend
CN109255506A (en) * 2018-11-22 2019-01-22 重庆邮电大学 A kind of internet finance user's overdue loan prediction technique based on big data
CN109299185A (en) * 2018-10-18 2019-02-01 上海船舶工艺研究所(中国船舶工业集团公司第十研究所) A kind of convolutional neural networks for timing flow data extract the analysis method of feature
CN110348622A (en) * 2019-07-02 2019-10-18 创新奇智(成都)科技有限公司 A kind of Time Series Forecasting Methods based on machine learning, system and electronic equipment
CN110443373A (en) * 2019-07-12 2019-11-12 清华大学 Linear model stablizes learning method and device
CN110705692A (en) * 2019-09-25 2020-01-17 中南大学 Method for predicting product quality of industrial nonlinear dynamic process by long-short term memory network based on space and time attention

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7620204B2 (en) * 2006-02-09 2009-11-17 Mitsubishi Electric Research Laboratories, Inc. Method for tracking objects in videos using covariance matrices

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An Isolation-Based Distributed Outlier Detection Framework Using Nearest Neighbor Ensembles for Wireless Sensor Networks;Zhong-Min Wang等;《IEEE Access》;96319-96333 *
Combining LSTM Network Ensemble via Adaptive Weighting for Improved Time Series Forecasting;Jae Young Choi等;《Mathematical Problems in Engineering》;1-8 *
基于深度表征的视觉理解关键技术研究;王景文;《中国博士学位论文全文数据库 (信息科技辑)》(第12期);I138-114 *
基于非线性动力学的金融时间序列预测技术研究;卢山;《中国博士学位论文全文数据库 (经济与管理科学辑)》(第04期);J160-3 *

Also Published As

Publication number Publication date
CN111967616A (en) 2020-11-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant