CN112926613A - Method and device for positioning time sequence training start node - Google Patents

Method and device for positioning time sequence training start node

Info

Publication number
CN112926613A
CN112926613A (application CN201911243435.8A)
Authority
CN
China
Prior art keywords
sequence
subsequences
time
recent
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911243435.8A
Other languages
Chinese (zh)
Inventor
张奔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201911243435.8A priority Critical patent/CN112926613A/en
Publication of CN112926613A publication Critical patent/CN112926613A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Fuzzy Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for positioning a time series training start node, relating to the field of computer technology. One embodiment of the method comprises: dividing the time series into a plurality of subsequences according to a preset number of segments; taking the subsequence closest to the prediction time node as a reference sequence and calculating the similarity between the reference sequence and each of the remaining subsequences; and positioning the training start node of the time series according to those similarities. This embodiment addresses the technical problem of an inappropriate choice of training start node.

Description

Method and device for positioning time sequence training start node
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for positioning a time sequence training start node.
Background
When predicting a time series, the choice of the training start node is very important, especially when the series is long. If the training start node lies too far in the past and the historical data is distributed very differently from the recent series, the long-term history interferes with prediction: the distribution trend of the predicted result agrees poorly with that of the recent series, and the extra data wastes storage and computing resources. Conversely, if the start node is chosen too recently while the historical distribution is similar to the recent series, discarding the history and training only on recent data reduces prediction accuracy, because the recent window may not contain complete trend or seasonal information and thus provides little help for prediction. Choosing an appropriate training start node is therefore essential for improving prediction accuracy.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
because the training start node is chosen inappropriately, prediction accuracy is low; in particular, when predicting time series at high magnitude (millions or even tens of millions of series), accuracy cannot be guaranteed.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for positioning a time series training start node, so as to solve the technical problem of an inappropriate choice of training start node.
To achieve the above object, according to an aspect of the embodiments of the present invention, there is provided a method for positioning a time series training start node, including:
dividing the time sequence into a plurality of subsequences according to the preset number of segments;
taking the sub-sequence closest to the prediction time node in the plurality of sub-sequences as a reference sequence, and respectively calculating the similarity between the reference sequence and the rest of the plurality of sub-sequences;
and positioning a training starting node of the time sequence according to the similarity between the reference sequence and the rest subsequences in the plurality of subsequences.
Optionally, calculating the similarity between the reference sequence and each of the remaining subsequences comprises:
calculating the similarity between the reference sequence and each of the remaining subsequences by using a dynamic time warping algorithm.
Optionally, the locating a training start node of the time sequence according to the similarity between the reference sequence and the remaining subsequences in the plurality of subsequences comprises:
ranking the remaining subsequences in descending order of their similarity to the reference sequence;
and locating a target subsequence in the ranking according to a preset minimum retention ratio and the number of remaining subsequences, and taking the start node of the target subsequence as the training start node of the time series.
Optionally, locating the target subsequence in the ranking according to a preset minimum retention ratio and the number of remaining subsequences includes:
locating the ⌈(N-1)·q⌉-th subsequence in the ranking as the target subsequence;
wherein N is the preset number of segments and q is the minimum retention ratio.
Optionally, dividing the time sequence into a plurality of subsequences according to the preset number of segments includes:
obtaining a recent sequence from the time sequence, and judging whether the recent sequence is stable and has difference with the time sequence;
if so, dividing the recent sequence into a plurality of subsequences according to a preset segment number;
and if not, dividing the time sequence into a plurality of subsequences according to the preset segment number.
Optionally, determining whether the recent sequence is stable and has a difference from the time series comprises:
calculating the mean and standard deviation of the recent sequence and the time sequence respectively;
judging whether the recent sequence is stable or not according to the mean value and the standard deviation of the recent sequence;
and judging whether the recent sequence and the time sequence have difference according to the average value and the standard deviation of the recent sequence and the time sequence.
Optionally, determining whether the recent sequence is stable according to the mean and the standard deviation of the recent sequence includes:
determining a stable value range according to the mean value and the standard deviation of the recent sequence;
and judging whether the ratio of the number of elements in the stable value range to the total number of elements in the recent sequence is smaller than a preset ratio, thereby judging whether the recent sequence is stable.
In addition, according to another aspect of the embodiments of the present invention, there is provided a positioning apparatus for a time series training start node, including:
the segmentation module is used for dividing the time sequence into a plurality of subsequences according to the preset segment number;
a calculating module, configured to use a subsequence closest to a predicted time node in the multiple subsequences as a reference sequence, and calculate similarities between the reference sequence and remaining subsequences in the multiple subsequences, respectively;
and the positioning module is used for positioning the training starting node of the time sequence according to the similarity between the reference sequence and the rest subsequences in the subsequences.
Optionally, the computing module is further configured to:
and respectively calculating the similarity between the reference sequence and the rest subsequences in the plurality of subsequences by adopting a dynamic time warping algorithm.
Optionally, the positioning module is further configured to:
ranking the remaining subsequences in descending order of their similarity to the reference sequence;
and locating a target subsequence in the ranking according to a preset minimum retention ratio and the number of remaining subsequences, and taking the start node of the target subsequence as the training start node of the time series.
Optionally, the positioning module is further configured to:
locate the ⌈(N-1)·q⌉-th subsequence in the ranking as the target subsequence;
wherein N is the preset number of segments and q is the minimum retention ratio.
Optionally, the segmentation module is further configured to:
obtaining a recent sequence from the time sequence, and judging whether the recent sequence is stable and has difference with the time sequence;
if so, dividing the recent sequence into a plurality of subsequences according to a preset segment number;
and if not, dividing the time sequence into a plurality of subsequences according to the preset segment number.
Optionally, the segmentation module is further configured to:
calculating the mean and standard deviation of the recent sequence and the time sequence respectively;
judging whether the recent sequence is stable or not according to the mean value and the standard deviation of the recent sequence;
and judging whether the recent sequence and the time sequence have difference according to the average value and the standard deviation of the recent sequence and the time sequence.
Optionally, the segmentation module is further configured to:
determining a stable value range according to the mean value and the standard deviation of the recent sequence;
and judging whether the ratio of the number of elements in the stable value range to the total number of elements in the recent sequence is smaller than a preset ratio, thereby judging whether the recent sequence is stable.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.
According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.
One embodiment of the above invention has the following advantage: by dividing the time series into a plurality of subsequences and calculating the similarity between a reference sequence and each remaining subsequence in order to position the training start node, it solves the prior-art problem of an inappropriate choice of training start node. The embodiment segments the time series, takes the segment nearest the prediction start node as the reference sequence, compares each remaining subsequence with the reference, and uses the similarity as the index for locating the optimal training start node. In particular, when predicting time series at high magnitude (millions or even tens of millions of series), the method can automatically determine the training start truncation point of each series efficiently and accurately, thereby improving prediction accuracy.
Further effects of the above non-routine alternatives are described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a schematic diagram of a main flow of a positioning method of a time series training start node according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of dividing a time series into a plurality of sub-sequences according to an embodiment of the present invention;
FIG. 3 is a schematic illustration of the similarity of various subsequences according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a main flow of a method for locating a time-series training start node according to a reference embodiment of the present invention;
FIG. 5 is a schematic diagram of the main modules of a positioning apparatus for time series training start nodes according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 7 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the prior art, there is no fixed or definite effective method for intercepting and selecting the training set starting node, and the following three methods are generally adopted:
(1) Use all the data, selecting no training-set start node at all.
(2) Select the start node heuristically, for example by taking the node a fixed time length before the present, or by manually setting a time node directly.
(3) Select the start node statistically, truncating the time series according to aggregate indicators such as the mean and standard deviation.
However, the above three methods all have respective drawbacks:
the method of (1): a) the data volume is large, so that unnecessary memory consumption and low calculation execution efficiency are caused; b) when the distribution of time series on a time axis is obviously different, prediction can be interfered by historical 'dirty' data, so that the accuracy is reduced, for example, when a certain time series generates data from 2014, the magnitude is very low and irregular from 2014 to 2016, but the magnitude is higher from 2017 to 2019 in 9 months, and a stable trend exists, so that the prediction accuracy is inevitably reduced by using all historical data without removing the data before 2017.
Method (2): a) the characteristics of individual series are not analysed, so the selected position is imprecise and a suitable start node may not be found; b) it does not scale: when training and predicting millions or even tens of millions of series, a uniform truncation length or fixed time node cannot be guaranteed suitable for every series; c) it depends strongly on data-analysis experience, so the choice of truncation length or node varies with the analyst's skill and is highly subjective.
Method (3): a) accuracy is low: statistical indicators are generally suitable only for series with large distribution differences, judge at a coarse granularity, and fail for series whose distributions are similar but cover different periods; b) the indicators are hard to unify: common choices include the mean and standard deviation, but different indicators yield different truncation nodes and there is no fixed standard; c) the indicators can be unreliable, e.g. a series may differ clearly in distribution between two periods while its statistics remain similar.
In order to solve the above technical problem, in the embodiment of the present invention, the training start node of each time sequence is accurately located according to the similarity between the reference sequence and each other sequence, so as to improve the accuracy of prediction.
Fig. 1 is a schematic diagram of a main flow of a method for positioning a time series training start node according to an embodiment of the present invention. As an embodiment of the present invention, as shown in fig. 1, the method for positioning a time series training start node may include:
step 101, dividing the time sequence into a plurality of subsequences according to a preset number of segments.
As shown in fig. 2, a time series of length L is divided according to a preset number of segments N into N subsequences of equal length M, where M = int(L/N). In practice, the value of N is typically set in the range of 5 to 10.
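A minimal sketch of this segmentation step (the function name is illustrative, and the handling of leftover points when L is not divisible by N is an assumption — the text only specifies M = int(L/N)):

```python
def split_into_subsequences(series, n_segments):
    """Divide a time series of length L into N subsequences of length M = int(L / N).

    Assumption: the L - N*M leftover points at the *start* of the series are
    dropped, so that the most recent subsequence ends at the prediction node.
    """
    m = len(series) // n_segments                     # M = int(L / N)
    trimmed = series[len(series) - n_segments * m:]   # keep the most recent N*M points
    return [trimmed[i * m:(i + 1) * m] for i in range(n_segments)]

series = list(range(53))                   # L = 53
subs = split_into_subsequences(series, 5)  # N = 5  ->  M = 10, first 3 points dropped
```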
To improve computational efficiency, for a given time series it may first be determined whether the series exhibits a large distribution difference, so that a recent portion with a more consistent distribution can be used for segmentation. Optionally, step 101 comprises: obtaining a recent sequence from the time series and judging whether the recent sequence is stable yet differs from the whole series; if so, dividing the recent sequence into a plurality of subsequences according to the preset number of segments; if not, dividing the whole time series instead. The judgment proceeds in two steps. First, a recent sequence is truncated and its stability is assessed. Second, the same statistics, such as the mean and standard deviation, are computed over the whole series; the far history is not compared with the recent sequence directly, because a distribution difference between them necessarily shows up in the statistics of the whole series. These statistics measure whether the recent and far distributions differ substantially. If the recent sequence is stable and its statistics differ from those of the whole time series, the recent sequence is taken as S and divided into subsequences according to the preset number of segments; otherwise the whole time series is taken as S and divided likewise.
Note that the above procedure involves choosing the length of the recent sequence, which depends on human experience and on the prediction scenario. For example, suppose the inventory of a warehouse is to be predicted one month ahead, with daily inventory history known from 2016-01-01 to 2019-11-08. In this scenario, after some data analysis (such as the average history length over thousands of warehouses, periodicity, and so on), the most recent 6 months of data, considered sufficient to support the prediction, are taken as the recent sequence.
Optionally, determining whether the recent sequence is stable and has a difference from the time series comprises: calculating the mean and standard deviation of the recent sequence and the time sequence respectively; judging whether the recent sequence is stable or not according to the mean value and the standard deviation of the recent sequence; and judging whether the recent sequence and the time sequence have difference according to the average value and the standard deviation of the recent sequence and the time sequence. Optionally, determining whether the recent sequence is stable according to the mean and the standard deviation of the recent sequence includes: determining a stable value range according to the mean value and the standard deviation of the recent sequence; and judging whether the ratio of the number of elements in the stable value range to the total number of elements in the recent sequence is smaller than a preset ratio, thereby judging whether the recent sequence is stable. For example:
First, compute the mean and standard deviation of the recent sequence, mean_1 and std_1, and judge whether the proportion of non-outlier values (values within [mean_1 - 3 × std_1, mean_1 + 3 × std_1]) in the recent sequence is less than 0.8. If it is, the recent sequence contains many outliers and is not stable enough, so the whole time series is divided into subsequences directly. Otherwise the recent sequence is considered stable and the second step is carried out.
Second, compute the mean and standard deviation of the whole time series, mean_2 and std_2. A rule may then be set, for example: if (mean_1 ≤ 0.2 × mean_2, or mean_2 ≤ 0.2 × mean_1) while std_1 ≤ 0.1 × std_2, the recent sequence differs from the whole time series. A marked difference in means indicates that the recent and long-term sequences differ clearly in amplitude; a difference in standard deviations indicates that the stability of the recent sequence differs significantly from that of the whole series, which reflects the distribution gap between the recent and far portions.
In short, these two steps judge whether the recent and far sequences differ clearly in distribution. This is only a preliminary screening of the time series, but when a recent sequence can be identified and truncated, the computational efficiency of the subsequent steps improves greatly.
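The two-step screening above can be sketched as follows (the thresholds 0.8, 3σ, 0.2 and 0.1 are the example values from the text; the function names are illustrative, and the use of the population standard deviation is an assumption):

```python
import statistics

def recent_is_stable(recent, min_inlier_ratio=0.8):
    """Step 1: the recent sequence is stable when at least min_inlier_ratio
    of its points lie inside [mean_1 - 3*std_1, mean_1 + 3*std_1]."""
    mean_1, std_1 = statistics.mean(recent), statistics.pstdev(recent)
    lo, hi = mean_1 - 3 * std_1, mean_1 + 3 * std_1
    inliers = sum(lo <= x <= hi for x in recent)
    return inliers / len(recent) >= min_inlier_ratio

def recent_differs(recent, whole):
    """Step 2: flag a distribution difference when the means differ by a
    factor of 5 and std_1 <= 0.1 * std_2 (the example rule from the text)."""
    mean_1, std_1 = statistics.mean(recent), statistics.pstdev(recent)
    mean_2, std_2 = statistics.mean(whole), statistics.pstdev(whole)
    mean_diff = mean_1 <= 0.2 * mean_2 or mean_2 <= 0.2 * mean_1
    return mean_diff and std_1 <= 0.1 * std_2
```

When both checks pass, the recent sequence is segmented on its own; otherwise the whole series is segmented.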
Step 102: take the subsequence closest to the prediction time node among the plurality of subsequences as the reference sequence, and calculate the similarity between the reference sequence and each of the remaining subsequences.
As shown in FIG. 3, the subsequence of length M immediately before the prediction time node, i.e. the subsequence closest to it, is taken as the reference sequence Q. Each of the remaining (N-1) subsequences C_i (1 ≤ i ≤ N-1) is compared with Q, yielding N-1 similarities [a_1, a_2, a_3, ..., a_(N-1)], arranged from the subsequence C_i nearest in time to the farthest.
Optionally, calculating the similarities comprises using a dynamic time warping algorithm. Dynamic Time Warping (DTW) is a method for computing sequence similarity that finds, based on dynamic programming, an optimal alignment path between the nodes of two time series. DTW can compute the similarity between two subsequences accurately, so that a suitable start node of the time series can be located.
Optionally, UCR-DTW may also be used to calculate the similarity between the reference sequence and the remaining subsequences. UCR-DTW optimizes the conventional DTW algorithm, computing the similarity between two subsequences more accurately so that a suitable start node can be located. The embodiment of the present invention adopts UCR-DTW as the similarity calculation because it aligns along an optimal path rather than requiring strict time-node alignment; compared with conventional DTW it greatly reduces time complexity through normalization and early termination, effectively guaranteeing the execution efficiency of the similarity calculation.
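A minimal sketch of classic DTW (the embodiment uses the optimized UCR-DTW, which adds normalization, lower bounding and early termination on top of this recurrence; the plain version below only illustrates the dynamic-programming alignment):

```python
import math

def dtw_distance(q, c):
    """Cost of the optimal alignment path between sequences q and c
    (lower cost = higher similarity)."""
    n, m = len(q), len(c)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

A subsequence's similarity to the reference Q can then be taken as, for example, the negated or inverted DTW cost, so that "higher similarity" ranks first.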
Step 103: position the training start node of the time series according to the similarities between the reference sequence and the remaining subsequences.
When predicting a time series, the predicted trend is influenced most by the recent sequence, so the trend of the recent segment is a strong predictor of the future trend. However, the recent sequence is short and cannot contain all the historical information, so more historical data is needed for the training set. The selection criterion is that the overall trend of the training set should be as close to the future as possible. The recent sequence is therefore compared for similarity with successively earlier periods of history, so as to obtain more historical data whose initial trend is most similar to the recent history. Under the same feature processing, this optimizes the prediction result.
Optionally, step 103 comprises: ranking the remaining subsequences in descending order of their similarity to the reference sequence; locating the target subsequence in the ranking according to a preset minimum retention ratio and the number of remaining subsequences; and taking the start node of the target subsequence as the training start node of the time series.
Optionally, locating the target subsequence comprises: locating the ⌈(N-1)·q⌉-th subsequence in the ranking as the target subsequence, where N is the preset number of segments and q is the minimum retention ratio. By rounding the result up, the embodiment ensures that ⌈(N-1)·q⌉ always falls in the integer range [1, N-1]. For example, if (N-1) × q = 0.5, the result after rounding up is 1. The start time node T of the subsequence ranked ⌈(N-1)·q⌉-th by similarity is then obtained; T is the optimal training start truncation point of the whole time series.
Note that the minimum retention ratio q is preset as required, with q in the range [0, 1]. After obtaining the similarities between the reference sequence and the remaining subsequences, the truncated training start node, and thereby the length of the retained sequence, can be controlled through q. Note also that a larger q does not necessarily mean a longer retained sequence; the outcome is jointly determined by the similarities.
The common intuition is that subsequences closer to the reference sequence Q tend to have higher similarity. However, taking distance as the only measure is not comprehensive. The embodiment of the present invention therefore locates the target subsequence through ⌈(N-1)·q⌉. The advantages of this formula are that it both satisfies the common intuition and excludes some special cases: if the subsequences closer in time do have higher similarity, the resulting truncation point agrees with the conventional principle; if subsequences farther in time have higher similarity, setting q allows overly long sequences to be avoided. Moreover, the core of the formula is data-driven guidance, positioning by combining similarity with time, and it is simple and fast to compute.
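The ranking-and-truncation rule can be sketched as follows (the similarities are assumed to be given in time order, nearest first, with higher values meaning more similar; the function name is illustrative):

```python
import math

def locate_target_subsequence(similarities, q):
    """Given the N-1 similarities a_1..a_(N-1) in time order (nearest first),
    rank them from high to low and pick the ceil((N-1)*q)-th entry.
    Returns the 1-based time-order index of the target subsequence,
    whose start node T becomes the training start node."""
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    k = math.ceil(len(similarities) * q)
    k = max(1, min(k, len(similarities)))  # keep the rank index in [1, N-1]
    return ranked[k - 1] + 1
```

For instance, with N = 5 and q = 0.5, the subsequence ranked 2nd (⌈4 × 0.5⌉ = 2) by similarity supplies the training start node.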
As can be seen from the embodiments described above, the technical means of dividing the time series into a plurality of subsequences and calculating the similarity between the reference sequence and the remaining subsequences in order to locate the training start node of the time series solves the prior-art problem of an inappropriately selected training start node. The embodiment of the present invention segments the time series into a plurality of subsequences, takes the segment nearest to the prediction start node as the reference sequence, compares the remaining subsequences with the reference sequence, and uses the similarity as the index to locate the optimal node for the training start time. In particular, when predicting time series at large scale (millions or even tens of millions of series), the method of the embodiment can automatically acquire the training start interception point of each time series more efficiently and accurately, thereby improving prediction accuracy.
Fig. 4 is a schematic diagram of a main flow of a method for positioning a time-series training start node according to a reference embodiment of the present invention.
Step 401, judging whether the length of the time sequence is smaller than a preset length threshold value; if yes, ending; if not, go to step 402.
If the length of the whole time series is smaller than the length threshold, there is no need to search for a training start node, because a sequence that is too short would degrade the effect of training and prediction. The embodiment of the present invention searches for a training start node only for time series whose length is greater than or equal to the length threshold.
The length threshold is set manually based on experience, usually with reference to the meaning of the input time series and the prediction scenario. For example, consider predicting the precipitation of a certain area over the coming week. For this scenario, human experience indicates that the probability of precipitation is strongly tied to the seasonality of the year, that is, referring to the same period in past years is meaningful. For prediction accuracy, the length threshold generally should not be less than 3 years, i.e., 3 × 365 = 1095. The threshold is not fixed; it could also be set to 2 years, i.e., 2 × 365 = 730.
Step 402, obtaining a recent sequence from the time sequence.
Step 403, determining whether the recent sequence is stable and has difference from the time sequence; if yes, go to step 404; if not, go to step 405.
Optionally, calculate the mean and standard deviation of the recent sequence and of the whole time series respectively; judge whether the recent sequence is stable according to its mean and standard deviation; and judge whether the recent sequence differs from the time series by comparing the means and standard deviations of the two. Optionally, determine a stable value range from the mean and standard deviation of the recent sequence, and judge whether the recent sequence is stable by checking whether the ratio of the number of elements falling within the stable value range to the total number of elements in the recent sequence is smaller than a preset ratio.
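The stability and difference judgments described above can be sketched as follows. The width of the stable value range (mean ± 2 standard deviations), the in-range ratio of 0.95, and the relative tolerance of 0.2 are illustrative assumptions; the embodiment leaves the stable value range and the preset ratio to configuration.

```python
import statistics

def is_recent_sequence_stable(recent, k=2.0, min_ratio=0.95):
    """Stability test: the share of elements inside mean +/- k*std must
    reach min_ratio. k and min_ratio are illustrative defaults."""
    mean = statistics.mean(recent)
    std = statistics.pstdev(recent)
    low, high = mean - k * std, mean + k * std
    inside = sum(1 for x in recent if low <= x <= high)
    # Unstable when the in-range ratio is smaller than the preset ratio.
    return inside / len(recent) >= min_ratio

def recent_differs_from_whole(recent, whole, rel_tol=0.2):
    """Difference test: compare the means and standard deviations of the
    recent sequence and the whole series; rel_tol is an illustrative
    relative tolerance."""
    m_r, m_w = statistics.mean(recent), statistics.mean(whole)
    s_r, s_w = statistics.pstdev(recent), statistics.pstdev(whole)
    mean_gap = abs(m_r - m_w) / max(abs(m_w), 1e-9)
    std_gap = abs(s_r - s_w) / max(s_w, 1e-9)
    return mean_gap > rel_tol or std_gap > rel_tol
```

If the recent sequence is both stable and different from the whole series, only the recent sequence is segmented (step 404); otherwise the whole series is segmented (step 405).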
Step 404, dividing the recent sequence into a plurality of subsequences according to a preset number of segments.
Step 405, dividing the time sequence into a plurality of subsequences according to a preset number of segments.
And step 406, taking the subsequence closest to the predicted time node in the plurality of subsequences as a reference sequence, and respectively calculating the similarity between the reference sequence and the remaining subsequences in the plurality of subsequences by using a dynamic time warping algorithm.
Optionally, UCR-DTW may be used to calculate the similarity between the reference sequence and the remaining subsequences in the plurality of subsequences, so that the similarity between two subsequences can be calculated more accurately, and a suitable time sequence start node can be located.
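For illustration, the core dynamic time warping recurrence can be written as below. This is a plain O(n·m) DTW with absolute-difference cost; the UCR-DTW used by the embodiment adds lower bounding and early abandoning on top of this recurrence, which the sketch omits.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences.
    A smaller distance means a higher similarity."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

Because a smaller distance means a higher similarity, the remaining subsequences can be ranked for step 407 by ascending DTW distance to the reference sequence.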
Step 407, sorting the remaining subsequences in the plurality of subsequences according to the descending order of similarity between the reference sequence and the remaining subsequences in the plurality of subsequences.
And step 408, locating a target subsequence in the sorted order according to a preset lowest retention ratio and the number of the remaining subsequences in the plurality of subsequences, and taking the starting node of the target subsequence as the training start node of the time series.
Optionally, locating the target subsequence according to a preset lowest retention ratio and the number of the remaining subsequences in the plurality of subsequences comprises: locating the ⌈(N-1)×q⌉-th subsequence in the sorted order as the target subsequence; wherein N is the preset number of segments and q is the lowest retention ratio.
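The flow of steps 401-408 can be sketched end to end as follows. This is a simplified illustration, not the embodiment's exact procedure: the whole series is always split into N equal segments (the recent-sequence branch of steps 402-405 is omitted), a plain DTW stands in for UCR-DTW, and a return value of 0 signals that the whole series should be used for training.

```python
import math

def locate_training_start(series, n_segments, q, length_threshold):
    """Sketch of steps 401-408: return the training start node index."""
    # Step 401: a series shorter than the threshold is not searched.
    if len(series) < length_threshold:
        return 0
    # Step 405 (simplified): split the whole series into N equal segments.
    seg_len = len(series) // n_segments
    segments = [series[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
    starts = [i * seg_len for i in range(n_segments)]
    # Step 406: the segment nearest the prediction node is the reference.
    reference = segments[-1]

    def dtw(a, b):
        inf = float("inf")
        d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
        d[0][0] = 0.0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = abs(a[i - 1] - b[j - 1])
                d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
        return d[-1][-1]

    # Step 407: rank the remaining N - 1 segments, most similar first.
    ranked = sorted(range(n_segments - 1), key=lambda i: dtw(segments[i], reference))
    # Step 408: the ceil((N - 1) * q)-th ranked segment is the target;
    # its starting node is the training start node.
    rank = max(1, math.ceil((n_segments - 1) * q))
    return starts[ranked[rank - 1]]
```

For a series split into 4 segments, a small q selects the start of the most similar remaining segment, while a larger q moves the interception point down the similarity ranking.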
In addition, in a reference embodiment of the present invention, the detailed implementation of the positioning method for the time series training start node is already described in detail in the above-mentioned positioning method for the time series training start node, so that the repeated content is not described again.
Fig. 5 is a schematic diagram of main modules of a positioning apparatus for a time series training start node according to an embodiment of the present invention, and as shown in fig. 5, the positioning apparatus 500 for a time series training start node includes a segmentation module 501, a calculation module 502, and a positioning module 503. The segmenting module 501 is configured to divide the time sequence into a plurality of subsequences according to a preset number of segments; the calculating module 502 is configured to use a subsequence closest to the predicted time node in the plurality of subsequences as a reference sequence, and calculate similarities between the reference sequence and remaining subsequences in the plurality of subsequences respectively; the positioning module 503 is configured to position a training start node of the time sequence according to the similarity between the reference sequence and the remaining subsequences in the plurality of subsequences.
Optionally, the calculating module 502 is further configured to:
and respectively calculating the similarity between the reference sequence and the rest subsequences in the plurality of subsequences by adopting a dynamic time warping algorithm.
Optionally, the positioning module 503 is further configured to:
sorting the remaining subsequences in the plurality of subsequences in descending order of their similarity to the reference sequence;
and locating a target subsequence in the sorted order according to a preset lowest retention ratio and the number of the remaining subsequences in the plurality of subsequences, and taking the starting node of the target subsequence as the training start node of the time series.
Optionally, the positioning module 503 is further configured to:
locate the ⌈(N-1)×q⌉-th subsequence in the sorted order as the target subsequence;
wherein N is the preset number of segments, and q is the lowest retention ratio.
Optionally, the segmentation module 501 is further configured to:
obtaining a recent sequence from the time sequence, and judging whether the recent sequence is stable and has difference with the time sequence;
if so, dividing the recent sequence into a plurality of subsequences according to a preset segment number;
and if not, dividing the time sequence into a plurality of subsequences according to the preset segment number.
Optionally, the segmentation module 501 is further configured to:
calculating the mean and standard deviation of the recent sequence and the time sequence respectively;
judging whether the recent sequence is stable or not according to the mean value and the standard deviation of the recent sequence;
and judging whether the recent sequence and the time sequence have difference according to the average value and the standard deviation of the recent sequence and the time sequence.
Optionally, the segmentation module 501 is further configured to:
determining a stable value range according to the mean value and the standard deviation of the recent sequence;
and judging whether the ratio of the number of elements in the stable value range to the total number of elements in the recent sequence is smaller than a preset ratio, thereby judging whether the recent sequence is stable.
As can be seen from the embodiments described above, the technical means of dividing the time series into a plurality of subsequences and calculating the similarity between the reference sequence and the remaining subsequences in order to locate the training start node of the time series solves the prior-art problem of an inappropriately selected training start node. The embodiment of the present invention segments the time series into a plurality of subsequences, takes the segment nearest to the prediction start node as the reference sequence, compares the remaining subsequences with the reference sequence, and uses the similarity as the index to locate the optimal node for the training start time. In particular, when predicting time series at large scale (millions or even tens of millions of series), the method of the embodiment can automatically acquire the training start interception point of each time series more efficiently and accurately, thereby improving prediction accuracy.
It should be noted that, in the implementation of the positioning apparatus for a time series training start node according to the present invention, the details have been described in the above-mentioned positioning method for a time series training start node, and therefore, the repeated contents are not described again here.
Fig. 6 shows an exemplary system architecture 600 of a positioning method of a time series training start node or a positioning apparatus of a time series training start node to which an embodiment of the present invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. The terminal devices 601, 602, 603 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 605 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and otherwise process the received data such as the item information query request, and feed back a processing result (for example, target push information, item information — just an example) to the terminal device.
It should be noted that the method for positioning a time series training start node provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the positioning device of the time series training start node is generally disposed in the server 605. The method for positioning the time series training start node provided by the embodiment of the present invention may also be executed by the terminal devices 601, 602, and 603, and accordingly, the positioning apparatus for the time series training start node may be disposed in the terminal devices 601, 602, and 603.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a segmentation module, a computation module, and a location module, where the names of the modules do not in some cases constitute a limitation on the modules themselves.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: dividing the time sequence into a plurality of subsequences according to the preset number of segments; taking the sub-sequence closest to the prediction time node in the plurality of sub-sequences as a reference sequence, and respectively calculating the similarity between the reference sequence and the rest of the plurality of sub-sequences; and positioning a training starting node of the time sequence according to the similarity between the reference sequence and the rest subsequences in the plurality of subsequences.
According to the technical scheme of the embodiment of the present invention, the technical means of dividing the time series into a plurality of subsequences and calculating the similarity between the reference sequence and the remaining subsequences in order to locate the training start node of the time series solves the prior-art problem of an inappropriately selected training start node. The embodiment segments the time series into a plurality of subsequences, takes the segment nearest to the prediction start node as the reference sequence, compares the remaining subsequences with the reference sequence, and uses the similarity as the index to locate the optimal node for the training start time. In particular, when predicting time series at large scale (millions or even tens of millions of series), the method of the embodiment can automatically acquire the training start interception point of each time series more efficiently and accurately, thereby improving prediction accuracy.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for positioning a time series training start node is characterized by comprising the following steps:
dividing the time sequence into a plurality of subsequences according to the preset number of segments;
taking the sub-sequence closest to the prediction time node in the plurality of sub-sequences as a reference sequence, and respectively calculating the similarity between the reference sequence and the rest of the plurality of sub-sequences;
and positioning a training starting node of the time sequence according to the similarity between the reference sequence and the rest subsequences in the plurality of subsequences.
2. The method of claim 1, wherein calculating the similarity between the reference sequence and the remaining subsequences in the plurality of subsequences comprises:
and respectively calculating the similarity between the reference sequence and the rest subsequences in the plurality of subsequences by adopting a dynamic time warping algorithm.
3. The method of claim 1, wherein locating a training start node of the time series based on similarity between the reference sequence and remaining subsequences of the plurality of subsequences comprises:
sequencing the remaining subsequences in the plurality of subsequences according to the sequence of similarity between the reference sequence and the remaining subsequences in the plurality of subsequences from high to low;
and locating a target subsequence in the sorted order according to a preset lowest retention ratio and the number of the remaining subsequences in the plurality of subsequences, and taking the starting node of the target subsequence as the training start node of the time series.
4. The method of claim 3, wherein locating the target subsequence in the sequence according to a predetermined minimum retention ratio and a number of remaining subsequences in the plurality of subsequences comprises:
locating the ⌈(N-1)×q⌉-th subsequence in the sorted order as the target subsequence;
wherein N is the preset number of segments, and q is the lowest retention ratio.
5. The method of claim 1, wherein dividing the time sequence into a plurality of subsequences according to a preset number of segments comprises:
obtaining a recent sequence from the time sequence, and judging whether the recent sequence is stable and has difference with the time sequence;
if so, dividing the recent sequence into a plurality of subsequences according to a preset segment number;
and if not, dividing the time sequence into a plurality of subsequences according to the preset segment number.
6. The method of claim 5, wherein determining whether the recent sequence is stable and distinct from the time series comprises:
calculating the mean and standard deviation of the recent sequence and the time sequence respectively;
judging whether the recent sequence is stable or not according to the mean value and the standard deviation of the recent sequence;
and judging whether the recent sequence and the time sequence have difference according to the average value and the standard deviation of the recent sequence and the time sequence.
7. The method of claim 6, wherein determining whether the recent sequence is stable based on the mean and standard deviation of the recent sequence comprises:
determining a stable value range according to the mean value and the standard deviation of the recent sequence;
and judging whether the ratio of the number of elements in the stable value range to the total number of elements in the recent sequence is smaller than a preset ratio, thereby judging whether the recent sequence is stable.
8. A positioning apparatus for a time series training start node, comprising:
the segmentation module is used for dividing the time sequence into a plurality of subsequences according to the preset segment number;
a calculating module, configured to use a subsequence closest to a predicted time node in the multiple subsequences as a reference sequence, and calculate similarities between the reference sequence and remaining subsequences in the multiple subsequences, respectively;
and the positioning module is used for positioning the training starting node of the time sequence according to the similarity between the reference sequence and the rest subsequences in the subsequences.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN201911243435.8A 2019-12-06 2019-12-06 Method and device for positioning time sequence training start node Pending CN112926613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911243435.8A CN112926613A (en) 2019-12-06 2019-12-06 Method and device for positioning time sequence training start node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911243435.8A CN112926613A (en) 2019-12-06 2019-12-06 Method and device for positioning time sequence training start node

Publications (1)

Publication Number Publication Date
CN112926613A true CN112926613A (en) 2021-06-08

Family

ID=76161685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911243435.8A Pending CN112926613A (en) 2019-12-06 2019-12-06 Method and device for positioning time sequence training start node

Country Status (1)

Country Link
CN (1) CN112926613A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581693A (en) * 2022-03-07 2022-06-03 支付宝(杭州)信息技术有限公司 Method and device for distinguishing user behavior patterns

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11102351A (en) * 1997-06-05 1999-04-13 Northern Telecom Ltd Data sequential value predicting method, data sequential input determining method, and computer system
CN103294911A (en) * 2013-05-23 2013-09-11 中国人民解放军国防科学技术大学 Time sequence similarity value acquisition method and system
CN103294729A (en) * 2012-03-05 2013-09-11 富士通株式会社 Method and equipment for processing and predicting time sequence containing sample points
CN104811991A (en) * 2015-04-17 2015-07-29 合肥工业大学 Wireless link quality predicting method based on dynamic time warping algorithm
CN107528722A (en) * 2017-07-06 2017-12-29 阿里巴巴集团控股有限公司 Abnormal point detecting method and device in a kind of time series
CN107590143A (en) * 2016-07-06 2018-01-16 北京金山云网络技术有限公司 A kind of search method of time series, apparatus and system
CN108491559A (en) * 2018-01-19 2018-09-04 北京理工大学 A kind of time series method for detecting abnormality based on normalized mutual information estimation
CN108710623A (en) * 2018-03-13 2018-10-26 南京航空航天大学 Airport departure from port delay time at stop prediction technique based on Time Series Similarity measurement
CN109214948A (en) * 2018-09-25 2019-01-15 新智数字科技有限公司 A kind of method and apparatus of electric system heat load prediction
CN109783877A (en) * 2018-12-19 2019-05-21 平安科技(深圳)有限公司 Time series models method for building up, device, computer equipment and storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11102351A (en) * 1997-06-05 1999-04-13 Northern Telecom Ltd Data sequential value predicting method, data sequential input determining method, and computer system
CN103294729A (en) * 2012-03-05 2013-09-11 富士通株式会社 Method and equipment for processing and predicting time sequence containing sample points
CN103294911A (en) * 2013-05-23 2013-09-11 中国人民解放军国防科学技术大学 Time sequence similarity value acquisition method and system
CN104811991A (en) * 2015-04-17 2015-07-29 合肥工业大学 Wireless link quality predicting method based on dynamic time warping algorithm
CN107590143A (en) * 2016-07-06 2018-01-16 北京金山云网络技术有限公司 A kind of search method of time series, apparatus and system
CN107528722A (en) * 2017-07-06 2017-12-29 阿里巴巴集团控股有限公司 Abnormal point detecting method and device in a kind of time series
CN108491559A (en) * 2018-01-19 2018-09-04 北京理工大学 A kind of time series method for detecting abnormality based on normalized mutual information estimation
CN108710623A (en) * 2018-03-13 2018-10-26 南京航空航天大学 Airport departure from port delay time at stop prediction technique based on Time Series Similarity measurement
CN109214948A (en) * 2018-09-25 2019-01-15 新智数字科技有限公司 A kind of method and apparatus of electric system heat load prediction
CN109783877A (en) * 2018-12-19 2019-05-21 平安科技(深圳)有限公司 Time series models method for building up, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MANABU OKAWA: "Template Matching Using Time-Series Averaging and DTW With Dependent Warping for Online Signature Verification", IEEE Access, 1 January 2019 (2019-01-01), pages 81010 - 81019 *
JIANG Yifan, YE Qing: "Time series similarity measurement based on Siamese neural network", Journal of Computer Applications (《计算机应用》), 16 November 2018 (2018-11-16), pages 1041 - 1045 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581693A (en) * 2022-03-07 2022-06-03 支付宝(杭州)信息技术有限公司 Method and device for distinguishing user behavior patterns
CN114581693B (en) * 2022-03-07 2023-11-03 支付宝(杭州)信息技术有限公司 User behavior mode distinguishing method and device

Similar Documents

Publication Publication Date Title
CN110069698B (en) Information pushing method and device
CN113342905B (en) Method and device for determining stop point
CN114205690B (en) Flow prediction method, flow prediction device, model training device, electronic equipment and storage medium
CN107609192A (en) The supplement searching method and device of a kind of search engine
CN111209347A (en) Method and device for clustering mixed attribute data
CN111435406A (en) Method and device for correcting database statement spelling errors
CN114817651B (en) Data storage method, data query method, device and equipment
CN115659411A (en) Method and device for data analysis
CN110443264A (en) A kind of method and apparatus of cluster
CN111708942A (en) Multimedia resource pushing method, device, server and storage medium
CN113220705B (en) Method and device for recognizing slow query
CN110019802B (en) Text clustering method and device
CN112926613A (en) Method and device for positioning time sequence training start node
CN110737691B (en) Method and apparatus for processing access behavior data
CN110837907A (en) Method and device for predicting wave order quantity
CN113066479B (en) Method and device for evaluating model
CN114662607A (en) Data annotation method, device and equipment based on artificial intelligence and storage medium
CN113468354A (en) Method and device for recommending chart, electronic equipment and computer readable medium
CN113590322A (en) Data processing method and device
CN113779370A (en) Address retrieval method and device
CN112800315A (en) Data processing method, device, equipment and storage medium
CN112395510A (en) Method and device for determining target user based on activity
CN112862554A (en) Order data processing method and device
CN113538026B (en) Service amount calculation method and device
CN111984839A (en) Method and apparatus for rendering a user representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination