CN116383273A

CN116383273A - Time sequence dimension reduction representation method and system

Info

Publication number: CN116383273A
Application number: CN202310352540.5A
Authority: CN
Inventors: 史晓贤; 周同明; 王秦; 马振武; 魏媛媛; 赵春海; 赵春阁; 成怡; 马坤; 汪卫
Original assignee: Jinan Yongxin New Material Technology Co ltd
Current assignee: Jinan Yongxin New Material Technology Co ltd
Priority date: 2023-04-04
Filing date: 2023-04-04
Publication date: 2023-07-04

Abstract

The application discloses a time series dimension reduction representation method and system. The method comprises the following steps: traversing a time series and identifying its top and bottom features; dividing the time series into a plurality of simplified line segments based on the top and bottom features, and recording the starting point and ending point of each simplified line segment; calculating characteristic values of the simplified line segments based on the starting points and ending points; and calculating the difference between the characteristic values of each simplified line segment and its adjacent segment, and merging the two segments if the difference is smaller than a preset threshold, thereby completing the dimension reduction representation. The method can reduce the dimensionality of variable-length, variable-amplitude time series that contain substantial drift, distortion, fluctuation, outliers, stretching and compression, while effectively preserving the features of the time series.

Description

Time sequence dimension reduction representation method and system

Technical Field

The application relates to the technical field of databases, in particular to a time sequence dimension reduction representation method and a system.

Background

Time series data is common in almost all human activities, including clinical medical vital sign recording equipment, real-time transaction data for financial stock futures, sales data for electronic commerce retail markets, astronomical observations, and real-time weather temperatures.

In recent years, with the popularization of emerging applications such as data centers and the internet of things, the scale of time-series data is also expanding. Many real-time applications produce tens or even hundreds of millions of time series data, with storage scales up to TB or PB.

Time series similarity query is an important research direction in the field of time series mining. A time series similarity query finds, within a collection of time series data, the set of target sequences most similar to a given time series according to some similarity metric function. It is the basic preparatory step for time series clustering, classification, anomaly detection and frequent pattern mining. Similarity queries over time series fall into two major categories: full-sequence matching and sub-sequence matching. In full-sequence matching, the searched time series has the same length as the target sequence. In sub-sequence matching, all sub-sequences similar to the target sequence are found within a longer sequence.
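As an illustration of sub-sequence matching, the following minimal sketch slides a query over a longer series and returns the best-matching window. The Euclidean distance is an assumption chosen for the example; the text only refers to "some similarity metric function", and the names `euclidean` and `best_subsequence_match` are hypothetical.

```python
from math import sqrt
from typing import List, Tuple


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_subsequence_match(series: List[float], query: List[float]) -> Tuple[int, float]:
    """Return (start index, distance) of the window of `series` closest to `query`."""
    m = len(query)
    best_start, best_dist = -1, float("inf")
    for start in range(len(series) - m + 1):
        dist = euclidean(series[start:start + m], query)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist
```

Full-sequence matching is the special case in which the query and the searched series have the same length, so only one window is compared.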

Because time series are high-dimensional, processing the original data directly is very costly. It is therefore common practice to reduce and transform time series data: the data are mapped into a transformed space, and a small set of the "strongest" transform coefficients is retained as features or representations. Because the new space has relatively low dimensionality, such methods are known as time series dimension reduction representation techniques.

Disclosure of Invention

The application provides a time series dimension reduction representation method and system to solve the following problem: when time series data contain substantial drift, distortion, fluctuation, outliers, stretching and compression, similarity queries over the time series are difficult, dimension reduction often loses information or introduces a large amount of noise, and variable-length, variable-amplitude time series cannot be reduced well, so that subsequent time series similarity measurement produces many errors.

To achieve the above object, the present application provides the following solutions:

a time-series dimension-reduction representation method, comprising the steps of:

traversing a time series, identifying top and bottom features of the time series;

dividing the time sequence into a plurality of simplified line segments based on the top-bottom characteristics, and recording starting points and ending points of the plurality of simplified line segments;

calculating characteristic values of the plurality of simplified line segments based on the starting points and the ending points;

and respectively calculating the difference between the characteristic values of the simplified line segment and the adjacent segment, and if the difference is smaller than a preset threshold value, merging the line segment and the adjacent segment to finish the dimension reduction representation.

Preferably, the method for identifying the top-bottom feature comprises the following steps:

determining the number of inflection points of the time series, and recording the time series as a first curve if the number of inflection points is not less than 5;

if the first curve contains a first inflection point that is higher than the first adjacent point and the second adjacent point on either side of it, the first adjacent point is higher than the third adjacent point on its other side, and the second adjacent point is higher than the fourth adjacent point on its other side, the first curve is marked as a top feature and the first inflection point is marked as the vertex;

if the first curve contains a second inflection point that is lower than the fifth adjacent point and the sixth adjacent point on either side of it, the fifth adjacent point is lower than the seventh adjacent point on its other side, and the sixth adjacent point is lower than the eighth adjacent point on its other side, the first curve is marked as a bottom feature and the second inflection point is marked as the bottom point.

Preferably, the recording method of the start point and the end point comprises the following steps:

when the simplified line segment is a descending line segment, taking the vertex as a starting point, recording the vertex coordinates of the vertex, taking the bottom point as an end point, recording the bottom point coordinates of the bottom point, integrating the vertex coordinates and the bottom point coordinates into a data pair, and storing the data pair into a linked list;

when the simplified line segment is an ascending line segment, taking the bottom point as a starting point, recording the bottom point coordinate of the bottom point, taking the top point as an end point, recording the top point coordinate of the top point, integrating the top point coordinate and the bottom point coordinate into a data pair, and storing the data pair into a linked list.

Preferably, the characteristic values include: the slope k of the simplified line segment, the standard deviation σ² of the simplified line segment, the mean μ of the simplified line segment, the height ym1 of the starting point, and the height ym2 of the ending point.

Preferably, the calculating and comparing method of the gap comprises:

wherein k_m is the slope of simplified line segment m, σ²_m is the standard deviation of simplified line segment m, μ_m is the mean of simplified line segment m, k_n is the slope of simplified line segment n, σ²_n is the standard deviation of simplified line segment n, μ_n is the mean of simplified line segment n, ε_k is the preset threshold for the slope, ε_σ is the preset threshold for the standard deviation, and ε_μ is the preset threshold for the mean.
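The comparison formula itself is not reproduced in this text. One reading that is consistent with the symbol definitions and with the stated rule (merge when the difference is smaller than the preset threshold) is sketched below. It assumes absolute differences and assumes that all three conditions must hold together; neither assumption is confirmed by the text.

$$
\lvert k_m - k_n \rvert < \epsilon_k, \qquad
\lvert \sigma^2_m - \sigma^2_n \rvert < \epsilon_\sigma, \qquad
\lvert \mu_m - \mu_n \rvert < \epsilon_\mu
$$

Under this reading, adjacent segments m and n are merged only when their slopes, standard deviations and means are each within the corresponding threshold.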

The application also provides a time sequence dimension reduction representation system, which comprises: the device comprises an identification module, a simplification module, a calculation module and a comparison module;

the identification module is used for traversing the time sequence and identifying the top and bottom characteristics of the time sequence;

the simplification module is used for dividing the time sequence into a plurality of simplified line segments based on the top-bottom characteristics and recording starting points and ending points of the plurality of simplified line segments;

the calculation module is used for calculating the characteristic values of the plurality of simplified line segments based on the starting points and the ending points;

and the comparison module is used for respectively calculating the difference between the characteristic values of the simplified line segment and the adjacent segment, and if the difference is smaller than a preset threshold value, merging the line segment and the adjacent segment to finish the dimension reduction representation.

Preferably, the workflow of the identification module includes:

Preferably, the method for recording the start point and the end point by the simplifying module includes:

Preferably, the calculation and comparison method of the difference by the comparison module comprises the following steps:

Compared with the prior art, the beneficial effects of this application are:

the dimension reduction method can reduce the dimensionality of variable-length, variable-amplitude time series that contain substantial drift, distortion, fluctuation, outliers, stretching and compression, while effectively preserving the features of the time series.

Drawings

For a clearer description of the technical solutions of the present application, the drawings that are required to be used in the embodiments are briefly described below, it being evident that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method according to a first embodiment of the present application;

FIG. 2 is a schematic top feature diagram of a first embodiment of the present application;

FIG. 3 is a schematic view of the bottom feature of the first embodiment of the present application;

FIG. 4 is a schematic view of a descending segment according to the first embodiment of the present application;

FIG. 5 is a diagram illustrating a rising line segment according to a first embodiment of the present disclosure;

FIG. 6 is a schematic diagram of the system structure of a second embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings.

Example 1

In a first embodiment, as shown in fig. 1, a time-series dimension-reduction representation method includes the following steps:

S1, traversing the time series and identifying the top and bottom features of the time series.

The time sequence is a sequence formed by arranging various numerical values of indexes at different times in time sequence, and the time sequence analysis is a theory and a method for establishing a mathematical model through curve fitting and parameter estimation according to time sequence data obtained by system observation, and is generally used in the fields of finance, weather prediction, market analysis and the like.

The method for identifying the top and bottom features comprises the following steps: determining the number of inflection points of the time series, and recording the time series as a first curve if the number of inflection points is not less than 5; if the first curve contains a first inflection point that is higher than the first adjacent point and the second adjacent point on either side of it, the first adjacent point is higher than the third adjacent point on its other side, and the second adjacent point is higher than the fourth adjacent point on its other side, the first curve is marked as a top feature and the first inflection point is marked as the vertex; if the first curve contains a second inflection point that is lower than the fifth adjacent point and the sixth adjacent point on either side of it, the fifth adjacent point is lower than the seventh adjacent point on its other side, and the sixth adjacent point is lower than the eighth adjacent point on its other side, the first curve is marked as a bottom feature and the second inflection point is marked as the bottom point.

In this embodiment, the original time series is traversed starting from its head, and the time-axis coordinates of all tops and bottoms are identified (this is a first step toward solving the variable-length, variable-amplitude problem).

As shown in FIG. 2, a top is defined as follows: the shortest top sub-sequence consists of five points, in which the middle point (the vertex) is higher than its two neighbors, and each neighbor is in turn higher than the outer point on its side. This basic structure is called a top, and its highest point is marked as the vertex. As shown in FIG. 3, a bottom is defined as follows: the shortest bottom sub-sequence consists of five points, in which the middle point (the bottom point) is lower than its two neighbors, and each neighbor is in turn lower than the outer point on its side. This basic structure is called a bottom, and its lowest point is marked as the bottom point, or nadir. Five points satisfying either pattern form the basic top or bottom structure, and all vertices and bottom points are marked. Step S1 partially handles abnormal points in the time series. Combining multiple vertices and bottom points partially removes the interference caused by drift, distortion, fluctuation, stretching and compression.
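The five-point definitions of top and bottom can be sketched as a single pass over the series. This is a minimal illustration, not the patented implementation: it assumes the five points are consecutive samples, uses strict inequalities, and the function name `find_tops_and_bottoms` is hypothetical.

```python
from typing import List, Tuple


def find_tops_and_bottoms(values: List[float]) -> Tuple[List[int], List[int]]:
    """Return (vertex indices, bottom-point indices) found with the five-point rule."""
    vertices: List[int] = []
    nadirs: List[int] = []
    # Each candidate needs two points on either side, so skip the first and last two.
    for i in range(2, len(values) - 2):
        a, b, c, d, e = values[i - 2:i + 3]
        if c > b > a and c > d > e:
            # Middle point above its neighbors, neighbors above the outer points: a top.
            vertices.append(i)
        elif c < b < a and c < d < e:
            # Middle point below its neighbors, neighbors below the outer points: a bottom.
            nadirs.append(i)
    return vertices, nadirs
```

For the series `[1, 2, 3, 2, 1, 0, 1, 2, 1]` this marks index 2 as a vertex and index 5 as a bottom point; a series shorter than five points yields no features.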

S2, dividing the time sequence into a plurality of simplified line segments based on the top-bottom characteristics, and recording starting points and ending points of the plurality of simplified line segments.

The recording method of the start point and the end point comprises the following steps: when the simplified line segment is a descending line segment, taking a vertex as a starting point, recording the vertex coordinates of the vertex, taking a bottom point as an end point, recording the bottom point coordinates of the bottom point, integrating the vertex coordinates and the bottom point coordinates into a data pair, and storing the data pair into a linked list; when the simplified line segment is an ascending line segment, the bottom point is used as a starting point, the bottom point coordinates of the bottom point are recorded, the top point is used as an end point, the top point coordinates of the top point are recorded, the top point coordinates and the bottom point coordinates are integrated into a data pair, and the data pair is stored in a linked list.

The start and end times of each simplified line segment are recorded, and the start and end coordinates are stored in a new linked list. In this embodiment, as shown in FIG. 4, a descending line segment is formed by the points from a vertex down to a bottom point; the vertex at the far left is recorded as start1 and the bottom point at the far right as end1. Alternatively, as shown in FIG. 5, a rising line segment is formed by the points from a bottom point up to a vertex; the bottom point at the far left is recorded as start1 and the vertex at the far right as end1. The coordinates of the vertex and the bottom point are recorded as a data pair, and the pairs are stored sequentially in the linked list: (start1, end1), (start2, end2), ..., (startn, endn).
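The segmentation step can be sketched as follows, given vertex and bottom-point indices that have already been identified. A Python list stands in for the linked list. How to treat two consecutive turning points of the same kind is not specified in the text; keeping the more extreme one is an assumption of this sketch, and `build_segments` is a hypothetical name.

```python
from typing import List, Tuple

Point = Tuple[int, float]          # (time index, value)
Segment = Tuple[Point, Point]      # (start, end)


def build_segments(values: List[float], vertices: List[int], nadirs: List[int]) -> List[Segment]:
    """Pair alternating vertices and bottom points into (start, end) data pairs."""
    points = sorted([(i, "top") for i in vertices] + [(i, "bottom") for i in nadirs])
    alternating: List[Tuple[int, str]] = []
    for idx, kind in points:
        if alternating and alternating[-1][1] == kind:
            # Assumption: of two same-kind neighbors, keep the more extreme one.
            prev = alternating[-1][0]
            more_extreme = values[idx] > values[prev] if kind == "top" else values[idx] < values[prev]
            if more_extreme:
                alternating[-1] = (idx, kind)
        else:
            alternating.append((idx, kind))
    # Vertex -> bottom point is a descending segment; bottom point -> vertex is a rising one.
    return [((i, values[i]), (j, values[j]))
            for (i, _), (j, _) in zip(alternating, alternating[1:])]
```

Each returned pair corresponds to one (start, end) entry of the list described above, in time order.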

S3, calculating characteristic values of the plurality of simplified line segments based on the starting points and the ending points.

The characteristic values include: the slope k of the simplified line segment, the standard deviation σ² of the simplified line segment, the mean μ of the simplified line segment, the height ym1 of the starting point, and the height ym2 of the ending point. The starting point height ym1 is the value of the time series at time point (start m), and the ending point height ym2 is the value of the time series at time point (end m).

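The characteristic values are calculated as follows.

The simplified line segment between the starting point (height ym1) and the ending point (height ym2) is described by the line equation: y = kx + c

The straight line parameters to be solved are the slope k and the intercept c.

So, for the n points (x_i, y_i) of the segment, there are:

$$y_i = k x_i + c, \qquad i = 1, \dots, n$$

Written in matrix form:

$$y = X\theta$$

where

$$X = \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} k \\ c \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$

Substituting this into the least-squares objective function J_1 yields:

$$J_1 = \sum_{i=1}^{n} (y_i - k x_i - c)^2 = (y - X\theta)^T (y - X\theta)$$

Taking the derivative of the objective function with respect to θ and setting it equal to zero yields:

$$\frac{\partial J_1}{\partial \theta} = -2 X^T (y - X\theta) = 0$$

Solving gives:

$$\theta = (X^T X)^{-1} X^T y$$

Namely:

$$k = \frac{n \sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}{n \sum_{i} x_i^2 - \left( \sum_{i} x_i \right)^2}, \qquad c = \frac{\sum_{i} y_i - k \sum_{i} x_i}{n}$$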
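The least-squares solution and the remaining characteristic values can be computed per segment as in the sketch below. It assumes that x is the sample index, uses the population form of the variance, and the function name `segment_features` is hypothetical. The text denotes the dispersion measure σ² while calling it a standard deviation, so the sketch returns both the variance and its square root.

```python
from math import sqrt
from typing import Dict, List


def segment_features(values: List[float], start: int, end: int) -> Dict[str, float]:
    """Fit y = k*x + c to values[start..end] and return the characteristic values."""
    if end <= start:
        raise ValueError("a segment needs at least two points")
    xs = list(range(start, end + 1))      # assumption: x is the sample index
    ys = values[start:end + 1]
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Closed-form solution of theta = (X^T X)^-1 X^T y for a straight line.
    k = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    c = (sum_y - k * sum_x) / n
    mean = sum_y / n
    variance = sum((y - mean) ** 2 for y in ys) / n   # population variance
    return {
        "k": k,
        "c": c,
        "mean": mean,
        "variance": variance,
        "std": sqrt(variance),
        "y_start": ys[0],    # height at the starting point
        "y_end": ys[-1],     # height at the ending point
    }
```

For example, `segment_features([0.0, 2.0, 4.0, 6.0], 0, 3)` gives a slope of 2, an intercept of 0 and a mean of 3.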

S4, calculating the difference between the characteristic values of each simplified line segment and its adjacent segment, and if the difference is smaller than a preset threshold, merging the line segment and the adjacent segment to complete the dimension reduction representation.

The calculation and comparison method of the gap comprises the following steps:

In this embodiment, suppose the start and end points obtained in S2 are recorded as follows:

..., (start665, end665), (start666, end666), ...

After the calculation in S4, the two segments are merged to obtain:

..., (start665, end666), ...

The preset thresholds ε_k, ε_σ and ε_μ must be specified by an expert or adjusted manually according to actual requirements.
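The merging step can be sketched as a greedy left-to-right pass over the (start, end) pairs. The sketch assumes absolute differences, assumes that slope, standard deviation and mean must all fall within their thresholds, and re-evaluates the merged segment before comparing it with the next one; none of these choices is confirmed by the text. The names `merge_similar` and `feature_fn` are hypothetical, with `feature_fn` standing for any routine that returns (slope, standard deviation, mean) for a segment.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]                    # (start index, end index)
Features = Tuple[float, float, float]     # (slope, standard deviation, mean)


def merge_similar(
    segments: List[Span],
    feature_fn: Callable[[int, int], Features],
    eps_k: float,
    eps_sigma: float,
    eps_mu: float,
) -> List[Span]:
    """Merge each segment into its left neighbor when all feature gaps are small."""
    merged: List[Span] = list(segments[:1])
    for start, end in segments[1:]:
        prev_start, prev_end = merged[-1]
        k_m, s_m, mu_m = feature_fn(prev_start, prev_end)
        k_n, s_n, mu_n = feature_fn(start, end)
        if (abs(k_m - k_n) < eps_k
                and abs(s_m - s_n) < eps_sigma
                and abs(mu_m - mu_n) < eps_mu):
            # (start_m, end_m), (start_n, end_n) becomes (start_m, end_n).
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    return merged
```

This mirrors the example above, in which (start665, end665) and (start666, end666) collapse into (start665, end666) once their characteristic values are close enough.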

Example two

In the second embodiment, as shown in fig. 6, a time-series dimension-reduction representation system includes: the device comprises an identification module, a simplification module, a calculation module and a comparison module;

the identification module is used for traversing the time series and identifying the top and bottom features of the time series. In this embodiment, the method for identifying the top and bottom features includes: determining the number of inflection points of the time series, and recording the time series as a first curve if the number of inflection points is not less than 5; if the first curve contains a first inflection point that is higher than the first adjacent point and the second adjacent point on either side of it, the first adjacent point is higher than the third adjacent point on its other side, and the second adjacent point is higher than the fourth adjacent point on its other side, the first curve is marked as a top feature and the first inflection point is marked as the vertex; if the first curve contains a second inflection point that is lower than the fifth adjacent point and the sixth adjacent point on either side of it, the fifth adjacent point is lower than the seventh adjacent point on its other side, and the sixth adjacent point is lower than the eighth adjacent point on its other side, the first curve is marked as a bottom feature and the second inflection point is marked as the bottom point.

In this embodiment, the original time series is traversed starting from its head, and the time-axis coordinates of all tops and bottoms are identified (this is a first step toward solving the variable-length, variable-amplitude problem). A top is defined as follows: the shortest top sub-sequence consists of five points, in which the middle point (the vertex) is higher than its two neighbors, and each neighbor is in turn higher than the outer point on its side. This basic structure is called a top, and its highest point is marked as the vertex. A bottom is defined as follows: the shortest bottom sub-sequence consists of five points, in which the middle point (the bottom point) is lower than its two neighbors, and each neighbor is in turn lower than the outer point on its side. This basic structure is called a bottom, and its lowest point is marked as the bottom point, or nadir. Five points satisfying either pattern form the basic top or bottom structure, and all vertices and bottom points are marked. The S1 process partially handles abnormal points in the time series. Combining multiple vertices and bottom points partially removes the interference caused by drift, distortion, fluctuation, stretching and compression.

The simplification module is used for dividing the time sequence into a plurality of simplified line segments based on the top-bottom characteristics and recording starting points and ending points of the plurality of simplified line segments. In this embodiment, the recording method of the start point and the end point includes: when the simplified line segment is a descending line segment, taking a vertex as a starting point, recording the vertex coordinates of the vertex, taking a bottom point as an end point, recording the bottom point coordinates of the bottom point, integrating the vertex coordinates and the bottom point coordinates into a data pair, and storing the data pair into a linked list; when the simplified line segment is an ascending line segment, the bottom point is used as a starting point, the bottom point coordinates of the bottom point are recorded, the top point is used as an end point, the top point coordinates of the top point are recorded, the top point coordinates and the bottom point coordinates are integrated into a data pair, and the data pair is stored in a linked list.

The start and end times of each simplified line segment are recorded, and the start and end coordinates are stored in a new linked list. In this embodiment, a descending line segment is formed by the points from a vertex down to a bottom point; the vertex at the far left is recorded as start1 and the bottom point at the far right as end1. Alternatively, a rising line segment is formed by the points from a bottom point up to a vertex; the bottom point at the far left is recorded as start1 and the vertex at the far right as end1. The coordinates of the vertex and the bottom point are recorded as a data pair, and the pairs are stored sequentially in the linked list: (start1, end1), (start2, end2), ..., (startn, endn).

The calculation module is used for calculating characteristic values of the plurality of simplified line segments based on the starting points and the ending points. In this embodiment, the characteristic values include: the slope k of the simplified line segment, the standard deviation σ² of the simplified line segment, the mean μ of the simplified line segment, the height ym1 of the starting point, and the height ym2 of the ending point. The starting point height ym1 is the value of the time series at time point (start m), and the ending point height ym2 is the value of the time series at time point (end m).

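The characteristic values are calculated as follows.

The simplified line segment is described by the line equation y = kx + c. The straight line parameters to be solved are the slope k and the intercept c.

So, for the n points (x_i, y_i) of the segment, there are:

$$y_i = k x_i + c, \qquad i = 1, \dots, n$$

Written in matrix form:

$$y = X\theta$$

where

$$X = \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} k \\ c \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$

Substituting this into the least-squares objective function J_1 yields:

$$J_1 = \sum_{i=1}^{n} (y_i - k x_i - c)^2 = (y - X\theta)^T (y - X\theta)$$

Taking the derivative of the objective function with respect to θ and setting it equal to zero yields:

$$\frac{\partial J_1}{\partial \theta} = -2 X^T (y - X\theta) = 0$$

Solving gives:

$$\theta = (X^T X)^{-1} X^T y$$

Namely:

$$k = \frac{n \sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}{n \sum_{i} x_i^2 - \left( \sum_{i} x_i \right)^2}, \qquad c = \frac{\sum_{i} y_i - k \sum_{i} x_i}{n}$$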

the comparison module is used for respectively calculating the difference between the characteristic values of the simplified line segment and the adjacent segment, and if the difference is smaller than a preset threshold value, merging the line segment and the adjacent segment to finish the dimension reduction representation. In this embodiment, the method for calculating and comparing the gap includes:

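For example, if the start and end points are recorded as:

..., (start665, end665), (start666, end666), ...

then after the calculation in S4, the two segments are merged to obtain:

..., (start665, end666), ...

The preset thresholds ε_k, ε_σ and ε_μ must be specified by an expert or adjusted manually according to actual requirements.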

The foregoing embodiments are merely illustrative of the preferred embodiments of the present application and are not intended to limit the scope of the present application, and various modifications and improvements made by those skilled in the art to the technical solutions of the present application should fall within the protection scope defined by the claims of the present application.

Claims

1. A time-series dimension-reduction representation method, characterized by comprising the steps of:

2. A time series dimension reduction representation method according to claim 1, wherein said method of identifying said top and bottom features comprises:

3. The method for time-series dimension-reduction representation according to claim 2, wherein the recording method of the start and end points comprises:

4. A time-series reduced dimension representation method according to claim 3, wherein said characteristic values comprise: the slope k of the simplified line segment, the standard deviation σ² of the simplified line segment, the mean μ of the simplified line segment, the height ym1 of the starting point, and the height ym2 of the ending point.

5. The method for time-series dimension-reduction representation according to claim 4, wherein said method for calculating and comparing said difference comprises:

6. A time-series dimension-reduction representation system, comprising: the device comprises an identification module, a simplification module, a calculation module and a comparison module;

7. The time series reduced dimension representation system of claim 6, wherein the workflow of the recognition module comprises:

8. The time series dimension reduction representation system of claim 7, wherein the method for recording the start and end points by the simplifying module comprises:

9. The time series reduced dimension representation system of claim 8, wherein the characteristic values comprise: the slope k of the simplified line segment, the standard deviation σ² of the simplified line segment, the mean μ of the simplified line segment, the height ym1 of the starting point, and the height ym2 of the ending point.

10. The time series dimension reduction representation system of claim 9, wherein the comparison module calculates and compares the gap by a method comprising: