WO2022139615A1 - Method and apparatus for clustering time series data - Google Patents

Method and apparatus for clustering time series data

Info

Publication number
WO2022139615A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
time
processor
events
event type
Prior art date
Application number
PCT/RU2020/000739
Other languages
French (fr)
Inventor
Erica Alexandrovna SHEFER
Adam Viktorovich BARDASHEVICH
Zhavlon ISOMURODOV
Bogdan TRUBETSKOY
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/RU2020/000739
Publication of WO2022139615A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063: Operations research, analysis or management
    • G06Q10/0633: Workflow analysis

Definitions

  • the present disclosure relates generally to the field of data analytics; and more specifically, to methods and apparatus for clustering time series data.
  • each computing device is enabled to record behavioral data in the form of a time-series.
  • the user behavior is analyzed based on various parameters, such as, but not limited to, a user activity performed at various instances over a period of time to infer any temporal patterns therein.
  • time series clustering is employed for monitoring the current activities of an agent, such as, but not limited to, a mobile phone user, a cloud virtual machine (VM) server, or a data compression technique; clustering the data makes it possible to obtain information about the character and behavior of the agent.
  • VM virtual machine
  • different variants and techniques of clustering and segmentation of data for mining based on agent behavior are employed.
  • such techniques are unable to map the patterns of individual agent activity and subsequent behavior due to many reasons,
  • the present disclosure seeks to provide a computer-implemented method for clustering time series data, which mainly enables predicting upcoming events to be performed by a computing device.
  • the present disclosure seeks to provide a solution to the existing problem of processing a large variety and amount of operational data of computing devices for predicting upcoming events to be performed by such computing devices.
  • An aim of the present disclosure is to provide a solution that at least partially overcomes the problems encountered in the prior art, and provides an efficient way to predict upcoming events to be performed by computing devices by processing a small amount of operational data (in the form of clustered time series data) of the computing device, thereby optimizing performance of the computing devices.
  • the present disclosure provides a computer implemented method for clustering time series data.
  • the method comprises receiving, by a processor, an input data series including a plurality of events associated with a corresponding event time and an event type; initiating, by the processor at each event time, an active segment for the corresponding event type; setting, by the processor, an expiry time for each active segment based on a predetermined threshold time period for each event type; expiring, by the processor, each active segment at the corresponding expiry time; and dividing, by the processor, the input data series into a series of activity clusters; where a boundary between adjacent clusters is set when either: (a) predetermined number of active segments are initiated within a predefined time window; or (b) number of active segments falls to a predefined threshold.
  • the method of the present disclosure enables time series clustering for various events to be performed by a computing device.
  • the method does not require vast amounts and variety of information relating to the events for performing the time series clustering.
  • the method requires only an event type, an event time and an expiry tentatively assigned corresponding to the events time for clustering the time series data.
  • the time series data is divided into different clusters or segments, wherein all clusters or segments are united by a single user activity.
  • the user activity may be associated with a processing power of the computing device.
  • the method of the present disclosure typically enables in identifying the operational aspects of the computing device i.e.
  • the clustering of the time series data is also a preprocessing step in mining another time series data to identify homogeneous groups for further generating supervised models
  • setting the expiry time for the active segment of a certain event type includes resetting the expiry time if a further event corresponding to the same event type occurs during the active segment.
  • resetting the expiry time of a certain event effectively alters the threshold time period applied to the expiry of such events, and thereby alters the active segment associated with such an event.
  • resetting the expiry time of a certain event allows the number of active segments in the predefined time window to change, which in turn changes when the number of active segments falls to the predefined threshold. Therefore, new boundaries can be identified and selected from the time series data for defining new clusters based on the active segments of the events. This accommodates real-time or actual changes in the activity clusters based on the events that occurred, which in turn enables efficiently predicting upcoming events for optimizing operations of computing devices.
  • the plurality of events includes one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events.
  • when the plurality of events is one or more of the storage events, the server access events, or the user activity events, various operational aspects associated with a computing device can be analyzed. This enables considering the operational aspects that most strongly influence usage of the computing devices, and by predicting such operational aspects the performance of such computing devices can be substantially optimized.
  • a user activity event includes an application launch, where the associated event time includes an application launch time and the event type includes an application name.
  • treating an application launch as the user activity event allows identifying and predicting usage of various applications associated with a computing device. Since usage of applications (i.e. launching or closing of applications) constitutes a major portion of operational data, this enables optimizing operations of the computing devices for efficient performance thereof.
  • the predetermined threshold time period for each event type is determined based on an average usage associated with each event type.
  • considering the average usage associated with each event type enables determining the predetermined threshold time period more precisely. Typically, this also takes into account uncommon or unlikely usages associated with each event type, thereby making the predetermined threshold time period more precise.
  • the predetermined number of active segments is 5, the predefined time window is 2 minutes, and the predefined threshold for the number of active segments is 2.
  • the method further comprises training, by the processor, a context-based event prediction model for each activity cluster based on the plurality of events within the cluster, where the context-based prediction model is configured to output a probability for an upcoming event type based on a currently active segment.
  • the method further comprises training, by the processor, a general event prediction model based on the plurality of events within the input data series; and generating, by the processor, a modified event prediction based on a probability output by the general event prediction model and the probability output by the context-based event prediction model.
  • the context-based event prediction model for each activity cluster and the general event prediction model co-operate to allow generation of a modified event prediction that more accurately determines the probability of an event occurring in the future by providing relevant context associated with each event.
  • the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method for clustering time series data, mentioned herein above.
  • the computer-readable medium (specifically, a non-transitory computer-readable medium) carrying computer instructions achieves all the advantages and effects of the method.
  • the present disclosure provides a data-processing apparatus comprising a processor configured to perform the method for clustering time series data, mentioned herein above.
  • the data-processing apparatus is operable to achieve all the advantages and effects of the method described herein above.
  • the data-processing apparatus is operable to be communicatively coupled with a plurality of computing devices for receiving operational data, in the form of input data series of events, performed by the computing devices.
  • the apparatus is operable to cluster the input data series into activity clusters for detecting upcoming event to be performed by the computing devices and thereby optimizing performance of the computing devices.
  • FIG. 1 illustrates a flowchart of a computer implemented method for clustering time series data, in accordance with an embodiment of the present disclosure
  • FIG. 2 illustrates a graphical representation of a series of activity clusters and associated plurality of events, in accordance with an embodiment of the present disclosure
  • FIG. 3 is a block diagram of a data-processing apparatus for clustering time series data, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the non-underlined number to the item.
  • the non-underlined number is used to identify a general item at which the arrow is pointing.
  • the method 100 includes steps 102, 104, 106, 108 and 110.
  • the term “time series data” refers to a dataset or a series of data points indexed (or listed or graphed) in temporal order.
  • the time series data is a sequence of events taken at successively spaced points in time; in other words, a sequence of discrete-time data.
  • the points in time are spaced at equal intervals.
  • the points in time are spaced at varying intervals.
  • the time series data includes large volumes of data having a high dimensionality, wherein the data in the time series is added and analyzed dynamically as time progresses.
  • the time series is updated in real time, specifically at the successively spaced points in time.
  • the time series data comprises the information relating to events being performed by the user using any computing device, such as a mobile phone or tablet computer.
  • other exemplary scenarios may also be implemented using the method 100, such as using a virtual machine server or for performing data compression and so forth.
  • various users employ computing devices, such as a mobile phone, tablets, laptop, computers and perform various operations on a variety of applications depending upon usage.
  • the usage of users is recorded by their respective computing devices and further analysed and processed by the processor to infer information from the recorded usage data of the user and predict future events to optimize performance of the computing device and the processor or apparatus performing the method 100.
  • the present disclosure provides an effective time-series analysis technique that extracts optimal time segments of a user’s similar behavioral characteristics while utilizing their mobile phone data.
  • the term “processor” refers to a computational element that is operable to respond to and process instructions to perform the clustering operations.
  • the processor may be a microprocessor, a microcontroller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, or any other type of processing circuit.
  • the processor may be operated individually or as a part of a computation system.
  • the processor is configured to perform a plurality of operations according to the method 100 for clustering time series data.
  • the method 100 comprises receiving, by a processor, an input data series including a plurality of events associated with a corresponding event time and an event type.
  • the “input data series” refers to the time series data received by the processor from various sources for performing several operations on the input data series to achieve a desired result, such as to forecast an event in future.
  • the input data series includes a plurality of events.
  • the term “event” refers to structured data relating to a task, or the outcome of a task, that has been or will be performed, for example, opening or closing of applications, data transfer or storage operations, changes to a cluster of servers, or trends and patterns related to stock market trades.
  • each event is associated with the corresponding event time, wherein the event time refers to the initiation time of the associated event, or optionally, the event time refers to the time period in which the associated event gets completed.
  • each event is associated with an event type.
  • the “event type” refers to a data structure that defines the data contained in an event.
  • the event type includes the name of the application, for example, an application being launched by a user on any computing device.
  • the processor associates the input data series to an event of a particular event type. Further, to categorize such events, the event type is employed to represent the event data and enable the processor to access, process and manipulate the input data series effectively.
  • the plurality of events include one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events.
  • the plurality of events includes different types of events being employed at different platforms and devices, such as, but not limited to, a data storage event such as a data transfer for data compression, a server access event such as employing a virtual machine (VM) server for processing data at a remote location or a user activity event, such as launching and using an application or closing an application.
  • the plurality of events comprises one or more of each type of event.
  • the plurality of events may comprise one or more data storage events, one or more server access events and one or more user activity events either individually or any combination of these.
  • when the plurality of events is one or more of the storage events, the server access events, or the user activity events, various operational aspects associated with a computing device can be analyzed. This enables considering the operational aspects that most strongly influence usage of the computing devices, and by predicting such operational aspects the performance of such computing devices can be substantially optimized.
  • a user activity event includes an application launch
  • the associated event time includes an application launch time
  • the event type includes an application name.
  • the term “user activity event” refers to an event associated with a user behaviour or activity, such as opening or closing of an application.
  • the user activity includes an application launch.
  • the user activity event further includes an associated event time, wherein the associated event time includes the application launch time.
  • the associated event type includes the application name, for example, a user launches a social media application on a computing device daily, wherein the event type is the name of the social media application.
  • treating an application launch as the user activity event allows identifying and predicting usage of various applications associated with a computing device. Since usage of applications (i.e. launching or closing of applications) constitutes a major portion of operational data regarding usage of computing devices, this enables optimizing operations of the computing devices for efficient performance thereof.
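
The event description above (an event time plus an event type, with an application launch mapping to a launch time and an application name) can be illustrated with a small record type. This is only a sketch: the class name `Event`, the helper `app_launches_to_events` and the use of seconds as the time unit are assumptions made for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Event:
    """One entry of the input data series (hypothetical record type)."""
    time: float       # event time, assumed here to be seconds since some origin
    event_type: str   # event type, e.g. the application name for a launch


def app_launches_to_events(launches):
    """Turn (application_name, launch_time) pairs into a time-ordered series.

    The event type is the application name and the event time is the
    launch time, as described for user activity events.
    """
    return sorted(Event(float(launch_time), name) for name, launch_time in launches)
```

For instance, `app_launches_to_events([("maps", 20), ("mail", 5)])` yields the `mail` launch first, because the series is ordered by event time.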
  • the method 100 comprises initiating, by the processor at each event time, an active segment for the corresponding event type.
  • active segment refers to a time period for the currently active event type.
  • each respective segment may be referred to as an active segment.
  • the processor is configured to initiate the active segment at each event time. For example, the processor initiates the active segment whenever a user launches an application. The plurality of events occur within the active segments, which enables the processor to effectively track and manipulate the input data series accordingly.
  • the method 100 comprises setting, by the processor, an expiry time for each active segment based on a predetermined threshold time period for each event type.
  • the processor is configured to set the expiry time for each active segment.
  • the “expiry time” refers to an end time for an event type, for example, the closing time of an application initiated by either the user or the processor.
  • the processor is configured to analyze the user behaviour regarding an event type, for example, the times of usage or duration of usage, frequency of usage for a particular application, to infer the behavioural patterns and/or tendencies of the user to further determine an end time or expiry time.
  • the expiry time enables the processor to calculate the threshold time period i.e. the time period between the launch and closing of an application.
  • the processor sets the expiry time for each active segment based on the predetermined threshold time period.
  • the predetermined threshold time period for each event type is determined based on an average usage associated with each event type.
  • the processor is configured to determine the predetermined threshold time period for each event type, in terms of minutes or seconds, based on an average usage associated with each event type.
  • the user performs certain activities or events regularly, like launching a specific application for a specific time period daily on the computing device.
  • the processor determines an average usage time for the user for each event type based on the history of usage and correspondingly determines the predetermined threshold time period based on the average usage time of the associated event type.
  • considering the average usage associated with each event type enables determining the predetermined threshold time period more precisely. Typically, this also enables the method 100 or the processor to consider different events associated with each event type, thereby making the predetermined threshold time period more precise.
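
As a minimal sketch of deriving a per-event-type threshold from average usage, the following assumes that "average usage" means the arithmetic mean of recorded usage durations in seconds. The disclosure does not fix the exact statistic, so the function name `thresholds_from_usage` and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def thresholds_from_usage(usage_log):
    """Compute a threshold time period per event type.

    usage_log: iterable of (event_type, usage_seconds) pairs taken from the
    usage history. Assumption: the threshold is simply the mean duration.
    """
    durations = defaultdict(list)
    for event_type, seconds in usage_log:
        durations[event_type].append(seconds)
    # One threshold per event type, in the same unit as the input durations.
    return {event_type: mean(values) for event_type, values in durations.items()}
```

With a history of `[("mail", 100), ("mail", 200), ("maps", 30)]`, the sketch assigns `mail` a threshold of 150 seconds and `maps` a threshold of 30 seconds.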
  • setting the expiry time for the active segment of a certain event type includes resetting the expiry time if a further event corresponding to the same event type occurs during the active segment.
  • the expiry time of the active segment is reset by the processor.
  • when the active segment entails more than one event, the expiry time of the active segment is reset, since the earlier set expiry time might otherwise occur before the further event takes place.
  • resetting the expiry time of the active segment effectively alters the threshold time period applied to the expiry of such segments and their associated events, and thereby alters the active segment associated with such events.
  • resetting the expiry time allows the number of active segments in the predefined time window to change, which in turn changes when the number of active segments falls to the predefined threshold. Therefore, new boundaries between adjacent clusters can be determined based on resetting the expiry time. This accommodates real-time or actual changes in the activity clusters based on the probable upcoming events, which in turn enables efficiently predicting the events for optimizing operations of the apparatus and other apparatuses associated with the apparatus.
  • the method 100 comprises expiring, by the processor, each active segment at the corresponding expiry time.
  • the processor is configured to expire or close each active segment associated to a corresponding event type at the corresponding expiry time as set by the processor earlier at step 106 based on the predetermined threshold time period.
  • upon setting the expiry time, the processor is configured to expire the active segment when the set expiry time is reached.
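
The initiate, set-expiry, reset and expire steps described above can be sketched as a single pass over time-ordered events. This is an illustrative reading rather than the patented implementation: the function name `build_segments`, the tuple layouts, and the choice to keep at most one open segment per event type are assumptions.

```python
def build_segments(events, thresholds):
    """Build active segments from events.

    events: list of (event_time, event_type), sorted by event_time.
    thresholds: dict mapping event_type to its threshold time period.
    Returns a list of (event_type, start_time, expiry_time) tuples.
    """
    open_segments = {}  # event_type -> [start_time, expiry_time]
    finished = []
    for time, event_type in events:
        segment = open_segments.get(event_type)
        if segment is not None and time <= segment[1]:
            # A further event of the same type during the active segment:
            # reset the expiry time instead of starting a new segment.
            segment[1] = time + thresholds[event_type]
        else:
            if segment is not None:
                # The earlier segment expired before this event arrived.
                finished.append((event_type, segment[0], segment[1]))
            # Initiate a new active segment and set its expiry time.
            open_segments[event_type] = [time, time + thresholds[event_type]]
    # Segments still open at the end expire at their set expiry time.
    finished.extend((t, s[0], s[1]) for t, s in open_segments.items())
    return sorted(finished, key=lambda seg: (seg[1], seg[2], seg[0]))
```

In this sketch, a second `mail` event 50 seconds into a 100-second segment moves the expiry from 100 to 150 seconds, while a `mail` event arriving long after expiry opens a fresh segment.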
  • the method 100 comprises dividing, by the processor, the input data series into a series of activity clusters.
  • the processor is configured to divide the input data series into the series of activity clusters.
  • activity cluster refers to an active group of events or active segments having a similar behaviour.
  • the processor is configured to divide the input data series via the boundary into a series of activity clusters.
  • boundary refers to a logical division of the input data series based on a certain parameter or behaviour into multiple continuous clusters segmented by the boundary.
  • each cluster is a list of consecutive events, united by a single user activity.
  • the processor is configured to set the boundary between adjacent clusters when either: (a) a predetermined number of active segments are initiated within a predefined time window; or (b) the number of active segments falls to a predefined threshold.
  • two different scenarios may be tracked by the processor, wherein at least a predetermined number of active segments are launched or closed either simultaneously or within a short period of time.
  • the processor is configured to set the boundary if any one of the above conditions holds true.
  • if both conditions hold true within a short period of time, only a single boundary is set by the processor instead of two closely spaced boundaries, to simplify the clustering method.
  • the times of application launches by a user are provided.
  • the input data series is divided into one or more clusters, wherein each cluster is associated with a single user activity.
  • the processor is configured to search for a beginning (initiation time) of a cluster of the one or more clusters, wherein another cluster is initiated if the user changes behavior, i.e. opens or launches a plurality of event types or applications, and to search for an end (end time) of the cluster, wherein the cluster ends if the user closes a plurality of applications or event types together.
  • the processor is configured to determine a lifetime of the application or event type, wherein during the lifetime the user may constantly use or perform events on the application.
  • the processor is configured to calculate the lifetime (T[A]) in terms of minutes or seconds for an application A and an average time between the current launch of application A and any previous application for a plurality of users.
  • T[A] the lifetime
  • the predictive model predicts that the user will not launch the application A in the near future, since based on an average time the user should have launched the application again, however the application A is currently “closed” by the user.
  • a single boundary is set within the short window.
  • the term “predetermined number of active segments” refers to the maximum count of active segments set by either the processor or the user, wherein when the predetermined number of active segments are initiated by the user, a single boundary is set within the short window of time. Otherwise, when the number of active segments falls to the predefined threshold, the single boundary is set.
  • predefined threshold refers to the minimum count of active segments, after which a boundary is set by the processor.
  • the boundary divides the input data series into a series of activity clusters based on similar behaviours or upon meeting a predefined criterion.
  • the predetermined number of active segments is 5
  • the predefined time window is 2 minutes
  • the predefined threshold for the number of active segments is 2.
  • the processor is configured to set the boundary when 5 active segments are initiated within the predefined time window of 2 minutes.
  • the processor is configured to set the boundary when the number of active segments falls below the threshold of 2 active segments.
  • the input data series can be efficiently segregated into the series of activity clusters, i.e. the boundary between adjacent clusters can be easily demarcated. This allows easily identifying the activity clusters from the input data series.
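
The two boundary conditions, with the example values of 5 active segments, a 2-minute window and a threshold of 2, can be sketched as a sweep over segment start and expiry times. Several choices here are assumptions not stated in the text: the boundary is placed at the time of the triggering start or expiry, "falls to" is read as the count reaching exactly the threshold after an expiry (the text also uses "falls below"), and boundaries closer together than one window are merged into one. The names `find_boundaries` and `split_into_clusters` are hypothetical.

```python
from bisect import bisect_right
from collections import deque


def find_boundaries(segments, n_starts=5, window=120.0, min_active=2):
    """Return boundary times for segments given as (type, start, expiry)."""
    points = []
    for _event_type, start, expiry in segments:
        points.append((start, 1))    # segment initiated
        points.append((expiry, -1))  # segment expired
    points.sort()  # at equal times, expiries (-1) are processed before starts

    boundaries = []
    recent_starts = deque()
    active = 0
    for time, delta in points:
        if delta == 1:
            active += 1
            recent_starts.append(time)
            while time - recent_starts[0] > window:
                recent_starts.popleft()
            # Condition (a): enough segments initiated within the window.
            hit = len(recent_starts) >= n_starts
            if hit:
                recent_starts.clear()
        else:
            active -= 1
            # Condition (b): the number of active segments falls to the threshold.
            hit = active == min_active
        # Keep a single boundary when triggers are closely spaced.
        if hit and (not boundaries or time - boundaries[-1] > window):
            boundaries.append(time)
    return boundaries


def split_into_clusters(events, boundaries):
    """Divide (event_time, event_type) pairs into consecutive clusters."""
    clusters = [[] for _ in range(len(boundaries) + 1)]
    for event in events:
        clusters[bisect_right(boundaries, event[0])].append(event)
    return clusters
```

For five segments starting at 0, 10, 20, 30 and 40 seconds and expiring from 1000 seconds onward at 10-second intervals, the sketch sets one boundary at the fifth start and another when the active count drops to 2.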
  • the method 100 comprises training, by the processor, a context-based event prediction model for each activity cluster based on the plurality of events within the cluster.
  • the term “context-based event prediction model” refers to a predictive machine learning model configured for employing known results to create, process or validate another existing machine learning model to predict or forecast future events.
  • the processor is configured to train the context-based event prediction (CP) model for each activity cluster based on the plurality of events within the cluster.
  • the plurality of events includes data storage events, server access events and user activity events.
  • the context-based event prediction model is trained and analyzed based on each event of the plurality of events with the activity cluster, where the context-based prediction model is configured to output a probability for an upcoming event type based on a currently active segment.
  • the “currently active segment” refers to the currently active cluster of the one or more clusters based on which the context-based event prediction model is configured to output the probability for the upcoming event type.
  • the context-based event prediction model analyses each cluster and the plurality of events therein, to predict which event type or application will be initiated next based on the currently active segment.
  • the context-based event prediction model may improve the predictions of a main model (such as a different machine learning model other than the context-based event prediction model).
  • the method 100 further comprises training, by the processor, a general event prediction model based on the plurality of events within the input data series.
  • the processor is configured to train the general event prediction (GP) model based on the plurality of events including, one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events.
  • the events such as the user activity events within the input data series are employed to train the general event prediction model.
  • the “general event prediction model” refers to a machine learning model on which another machine learning model, such as the context based prediction model may be used to enhance or improve the accuracy of the predictions.
  • the general event prediction model analyses the input data series and based on the plurality of events, predicts a future event based on a currently active segment.
  • the general event prediction model predicts the probability of any general event happening in the future, for example, the general prediction model predicts the probability of initiating a segment of a particular event type based on the currently active segment and the plurality of events.
  • the processor is further configured for generating a modified event prediction based on a probability output by the general event prediction model and the probability output by the context-based event prediction model.
  • each of the general event prediction model and the context-based prediction model is configured to generate a probability output relating to an event or active segment.
  • the probability outputs of the general event prediction model and the context-based prediction model are employed to improve the accuracy of prediction and generate the modified event prediction model.
  • the context based prediction model is trained on the series of activity clusters and the corresponding prediction information or probability output relating to the activity cluster is used by the general event prediction model to further generate a more accurate prediction by virtue of the modified event prediction model.
  • the probability of an event occurring next is determined as a real number between zero and one.
  • GP refers to the prediction of the general event prediction model
  • CP refers to the prediction of context based prediction model.
  • the modified event prediction model is configured to determine the probability or prediction based on the formula, given by:
  • P_CP: probability of the event from the context-based prediction model (CP); a: parameter used for regularization to linearly combine the predictions from both the GP and CP models.
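
The formula itself did not survive in this text, so the following is only a sketch of one common way to linearly combine two probabilities with a single parameter. It assumes a convex combination, P = a * P_GP + (1 - a) * P_CP, with the parameter a in the range 0 to 1. Which model the parameter weights, and the function name `combine_predictions`, are assumptions rather than the disclosed formula.

```python
def combine_predictions(p_gp, p_cp, a=0.5):
    """Linearly combine GP and CP probabilities (assumed convex combination).

    p_gp: probability from the general event prediction model (GP).
    p_cp: probability from the context-based event prediction model (CP).
    a:    combination parameter; assumed to weight GP, with (1 - a) on CP.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie between 0 and 1")
    return a * p_gp + (1.0 - a) * p_cp
```

Under this assumption the result always lies between the two input probabilities, so it remains a real number between zero and one.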
  • the method 100 of complementing the already existing general event predictive model with a newly trained context based prediction model working specifically on the activity clusters for generating the modified event prediction model is 0.633, however, upon employing the context based prediction model along with the general event prediction model improves the probability output from 0,633 to 0,718.
  • the probability output improves from 0.7482 to 0.7632.
  • the term “Markov chain” refers to a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Alternatively stated, a countably infinite sequence in which the chain changes its state at discrete time steps or intervals is a discrete-time Markov chain.
  • the present disclosure also provides a computer program product comprising a non-transitory computer-readable storage medium configured to store instructions or computer program code thereon, the instructions being executable by a processor to execute the method 100.
  • the method 100 is for a computing device for carrying out clustering of time series data.
  • Examples of implementation of the non-transitory computer-readable storage medium include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory.
  • a computer readable storage medium for providing a non-transient memory may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Referring to FIG. 2, illustrated is a graphical representation 200 of an exemplary scenario of a series of activity clusters 210 and associated plurality of events, in accordance with another embodiment of the present disclosure.
  • the graphical representation 200 should be read in line in conjunction with the method 100 explained in conjunction with FIG. 1.
  • the graphical representation 200 represents the time period in seconds on X-axis and event type in numbers on Y-axis.
  • FIG. 2 depicts the series of activity clusters such as, a first activity cluster 202, a second activity cluster 204, a third activity cluster 206, a fourth activity cluster 208 and a fifth activity cluster 210.
  • Each activity cluster includes a plurality of active segments (represented as a solid line parallel to the Y-axis) and the associated plurality of events (represented as hollow circles or points on the active segment).
  • each of the activity clusters is a list of consecutive events, united by a single user activity.
  • the fourth activity cluster 208 represents or depicts a scenario, when a predetermined number of active segments are initiated within a predefined time window, for example, more than or equal to five.
  • the second activity cluster 204 represents or depicts a scenario when the number of active segments falls to a predefined threshold, for example, less than or equal to two.
  • FIG. 3 illustrates a block diagram of an apparatus 300 for clustering time series data, in accordance with an embodiment of the present disclosure.
  • the apparatus 300 comprises a processor 302 for clustering time series data.
  • the apparatus 300 of FIG. 3 should be read in line with FIGs. 1-2.
  • the apparatus 300 is operable to perform the method 100 for clustering time series data.
  • the apparatus 300 includes computational elements such as a memory, a processor, a data communication interface, a network adapter and the like, to store, process and/or share files or information with other apparatuses, such as another computation device, or server and the like.
  • Examples of the apparatus 300 may include, but are not limited to, a computation system, a virtual machine (VM) server, a server arrangement, and a data compression scheme.

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a computer-implemented method and apparatus for clustering time series data. The method comprises receiving, by a processor, an input data series including a plurality of events associated with a corresponding event time and an event type; initiating, by the processor at each event time, an active segment for the corresponding event type; setting, by the processor, an expiry time for each active segment based on a predetermined threshold time period for each event type; expiring, by the processor, each active segment at the corresponding expiry time; and dividing, by the processor, the input data series into a series of activity clusters; where a boundary between adjacent clusters is set when either: (a) a predetermined number of active segments are initiated within a predefined time window; or (b) a number of active segments falls to a predefined threshold.

Description

METHOD AND APPARATUS FOR CLUSTERING TIME SERIES DATA
TECHNICAL FIELD
The present disclosure relates generally to the field of data analytics; and more specifically, to methods and apparatus for clustering time series data.
BACKGROUND
In recent times, with the proliferation of technology in every aspect of our daily lives, computing devices such as mobile phones, smart phones, tablets, smartwatches and so forth have become omnipresent, and their number continues to grow exponentially. Generally, each computing device is enabled to record behavioral data in the form of a time series. Typically, the user behavior is analyzed based on various parameters, such as, but not limited to, a user activity performed at various instances over a period of time, to infer any temporal patterns therein.
Conventionally, time series clustering is employed for monitoring current activities of an agent, such as, but not limited to, a mobile phone user, a cloud virtual machine (VM) server or a data compression technique, any of which can be improved by clustering-based data processing that yields information about the character and behavior of the agent. Moreover, different variants and techniques of clustering and segmentation of data for mining based on agent behavior are employed. However, such techniques are unable to map the patterns of individual agent activity and subsequent behavior for many reasons; for example, the diverse behaviors of one or more agents are not accounted for over a period of time. Moreover, existing techniques face the challenge of clustering time series data. Typically, a lot of different data is required for clustering time series data, such as sensor readings, location and so forth. However, such data is often difficult to access, and the clustering task itself becomes difficult and/or costly. Various techniques are employed to achieve various benefits, such as improved battery consumption of the computing device, better prediction results, efficient operation and so forth; however, existing techniques are unable to incorporate all the different elements together and are generally application specific, i.e. existing techniques are not applicable to various scenarios and applications.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with existing time series clustering and prediction techniques.
SUMMARY
The present disclosure seeks to provide a computer implemented method for clustering time series data, which mainly enables prediction of upcoming events to be performed by a computing device. The present disclosure seeks to provide a solution to the existing problem of processing a large variety and amount of operational data of computing devices for predicting upcoming events to be performed by such computing devices. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provides an efficient way to predict upcoming events to be performed by computing devices by processing a small amount of operational data (in the form of clustered time series data) of the computing device, thereby optimizing performance of the computing devices.
The object of the present disclosure is achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In one aspect, the present disclosure provides a computer implemented method for clustering time series data. The method comprises receiving, by a processor, an input data series including a plurality of events associated with a corresponding event time and an event type; initiating, by the processor at each event time, an active segment for the corresponding event type; setting, by the processor, an expiry time for each active segment based on a predetermined threshold time period for each event type; expiring, by the processor, each active segment at the corresponding expiry time; and dividing, by the processor, the input data series into a series of activity clusters; where a boundary between adjacent clusters is set when either: (a) a predetermined number of active segments are initiated within a predefined time window; or (b) a number of active segments falls to a predefined threshold. The method of the present disclosure enables time series clustering for various events to be performed by a computing device. Typically, the method does not require vast amounts and variety of information relating to the events for performing the time series clustering. For example, the method requires only an event type, an event time and an expiry time tentatively assigned relative to the event time for clustering the time series data. In an example, the time series data is divided into different clusters or segments, wherein all events within a cluster or segment are united by a single user activity. Further, the user activity may be associated with a processing power of the computing device. For example, the method of the present disclosure typically enables identification of the operational aspects of the computing device, i.e. when the computing device requires substantial processing power for a corresponding set of events, and when the computing device requires minimal processing power for processing one or two events.
Beneficially, the clustering of the time series data is also a preprocessing step in mining other time series data to identify homogeneous groups for further generating supervised models.
In an implementation form, setting the expiry time for the active segment of a certain event type includes resetting the expiry time if a further event corresponding to the same event type occurs during the active segment.
Resetting the expiry time of the certain event allows altering the predetermined threshold time period associated with the expiry of such events. This allows altering the active segment associated with such an event. Typically, resetting the expiry time of the certain event allows the number of active segments in the predefined time window to change, which in turn changes when the number of active segments falls. Therefore, new boundaries can be identified and selected from the time series data for defining new clusters based on the active segments of the events. This typically allows accommodating real-time or actual changes in the activity clusters based on the occurred events, which in turn enables efficient prediction of upcoming events for optimizing operations of computing devices.
In an implementation form, when any number of (a) and/or (b) occur within a predefined short window of time, only a single boundary is set within the short window.
The happening of (a) and/or (b) within the predefined short window of time would be difficult to segregate as two different activity clusters, instead it is considered as a single activity cluster for ambiguity resolution during time series clustering.
In an implementation form, the plurality of events includes one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events.
By virtue of the plurality of events being one or more of the storage events, the server access events, or the user activity events, various operational aspects associated with a computing device can be analyzed. This enables consideration of the operational aspects that can majorly influence usage of the computing devices; by predicting such operational aspects, the performance of such computing devices can be majorly optimized.
In an implementation form, a user activity event includes an application launch, where the associated event time includes an application launch time and the event type includes an application name.
The user activity event being an application launch typically allows identifying and predicting usage of various applications associated with a computing device. Since usage of applications (i.e. launching or closing of applications) constitutes a major portion of operational data, this enables optimizing operations of the computing devices for efficient performance thereof.
In an implementation form, the predetermined threshold time period for each event type is determined based on an average usage associated with each event type.
Considering the average usage associated with each event type enables the predetermined threshold time period to be determined more precisely. Typically, this also takes into account uncommon or unlikely usages associated with each event type, thereby making the predetermined threshold time period more precise.
In an implementation form, the predetermined number of active segments is 5, the predefined time window is 2 minutes, and the predefined threshold for the number of active segments is 2.
By virtue of considering the predetermined number of active segments to be 5, the predefined time window to be 2 minutes, and the predefined threshold for the number of active segments to be 2, the input data series can be efficiently segregated into the series of activity clusters, i.e. the boundary between adjacent clusters can be easily demarcated. This allows the activity clusters to be easily identified from the input data series.
In an implementation form, the method further comprises training, by the processor, a context-based event prediction model for each activity cluster based on the plurality of events within the cluster, where the context-based prediction model is configured to output a probability for an upcoming event type based on a currently active segment.
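As a minimal sketch of what a context-based model for one activity cluster might look like, the snippet below estimates how often each event type follows the currently active one. The first-order transition-frequency estimate and the name `train_transition_model` are assumptions made for illustration; the disclosure does not fix the model type.

```python
from collections import Counter, defaultdict


def train_transition_model(cluster_event_types):
    """Estimate P(upcoming type | current type) from one cluster's ordered event types."""
    counts = defaultdict(Counter)
    # Count every (current, upcoming) pair of consecutive events in the cluster.
    for current, upcoming in zip(cluster_event_types, cluster_event_types[1:]):
        counts[current][upcoming] += 1
    # Normalise the counts of each row into probabilities.
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }


# Hypothetical application names standing in for event types.
model = train_transition_model(["chat", "mail", "chat", "browser", "chat", "mail"])
```

Here `model["chat"]` gives a probability of 2/3 for `"mail"` and 1/3 for `"browser"`, since `"chat"` was followed by `"mail"` twice and by `"browser"` once.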
In an implementation form, the method further comprises training, by the processor, a general event prediction model based on the plurality of events within the input data series; and generating, by the processor, a modified event prediction based on a probability output by the general event prediction model and the probability output by the context-based event prediction model.
The context based event prediction model and the general prediction model, for each activity cluster, co-operate to allow generation of a modified prediction model for more accurately determining the probability of an event occurring in future by providing relevant context associated with each event.
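The disclosure states only that a regularization parameter linearly combines the predictions of the general (GP) and context-based (CP) models. The sketch below assumes a convex combination in which the parameter, here called `alpha`, weights the context-based probability; both the exact form and the function name `blend_probabilities` are assumptions.

```python
def blend_probabilities(p_general: float, p_context: float, alpha: float) -> float:
    """Linearly combine two probabilities; alpha weights the context-based model (assumed form)."""
    for name, value in (("p_general", p_general), ("p_context", p_context), ("alpha", alpha)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # A convex combination of two probabilities is itself a probability in [0, 1].
    return alpha * p_context + (1.0 - alpha) * p_general


modified = blend_probabilities(p_general=0.5, p_context=1.0, alpha=0.5)  # 0.75
```

With `alpha = 0` the result reduces to the general model's output, and with `alpha = 1` to the context-based model's output, so the parameter controls how strongly the cluster context influences the modified prediction.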
In another aspect, the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method for clustering time series data, mentioned herein above.
The computer-readable medium (specifically, a non-transitory computer-readable medium) carrying computer instructions achieves all the advantages and effects of the method.
In yet another aspect, the present disclosure provides a data-processing apparatus comprising a processor configured to perform the method for clustering time series data, mentioned herein above.
The data-processing apparatus is operable to achieve all the advantages and effects of the method described herein above. In other words, the data-processing apparatus is operable to be communicatively coupled with a plurality of computing devices for receiving operational data, in the form of input data series of events, performed by the computing devices. The apparatus is operable to cluster the input data series into activity clusters for detecting upcoming event to be performed by the computing devices and thereby optimizing performance of the computing devices.
It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 illustrates a flowchart of a computer implemented method for clustering time series data, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a graphical representation of a series of activity clusters and associated plurality of events, in accordance with an embodiment of the present disclosure; and
FIG. 3 is a block diagram of a data-processing apparatus for clustering time series data, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
Referring to FIG. 1, illustrated is a flowchart of a computer implemented method 100 for clustering time series data, in accordance with an embodiment of the present disclosure. As shown, the method 100 includes steps 102, 104, 106, 108 and 110.
Throughout the present disclosure, the term “time series data” refers to a dataset or a series of data points indexed (or listed or graphed) in a temporal order. The time series data is a sequence of events taken at successively spaced points in time; in other words, it is a sequence of discrete-time data. Optionally, the points in time are spaced at equal intervals. Optionally, the points in time are spaced at varying intervals. Generally, the time series data includes large volumes of data having a high dimensionality, wherein the data in the time series is added and analyzed dynamically as time progresses. Moreover, the time series is updated in real time, specifically at the successively spaced points in time. Typically, the time series data comprises the information relating to events being performed by the user using any computing device, such as a mobile phone or tablet computer. Optionally, instead of a user activity by virtue of the computing device, other exemplary scenarios may also be implemented using the method 100, such as using a virtual machine server or performing data compression and so forth. Generally, various users employ computing devices, such as mobile phones, tablets, laptops and computers, and perform various operations on a variety of applications depending upon usage. Typically, the usage of users is recorded by their respective computing devices and further analysed and processed by the processor to infer information from the recorded usage data of the user and predict future events to optimize performance of the computing device and the processor or apparatus performing the method 100. The present disclosure provides an effective time-series analysis technique that extracts optimal time segments of a user’s similar behavioral characteristics while utilizing their mobile phone data.
Throughout the present disclosure, the term “processor” refers to a computational element that is operable to respond to and process instructions to perform the clustering operations. In an example, the processor may be a microprocessor, a microcontroller, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, or any other type of processing circuit. Notably, the processor may be operated individually or as a part of a computation system. Herein, the processor is configured to perform a plurality of operations according to the method 100 for clustering time series data.
At step 102, the method 100 comprises receiving, by a processor, an input data series including a plurality of events associated with a corresponding event time and an event type. The “input data series” refers to the time series data received by the processor from various sources for performing several operations on the input data series to achieve a desired result, such as to forecast an event in future. Typically, the input data series includes a plurality of events. The term “event” refers to structured data relating to a task or outcome of a task that has been or will be performed, for example, opening or closing of applications, data transfer or storage operations, changes to a cluster of servers, or trends and patterns related to stock market trades. Typically, each event is associated with the corresponding event time, wherein the event time refers to the initiation time of the associated event, or optionally, the event time refers to the time period in which the associated event gets completed. Further, each event is associated with an event type. The “event type” refers to a data structure that defines the data contained in an event. Typically, the event type includes the name of the application, for example, an application being launched by a user on any computing device. Generally, when the input data series is received by the processor, the processor associates the input data series with an event of a particular event type. Further, to categorize such events, the event type is employed to represent the event data and enable the processor to access, process and manipulate the input data series effectively.
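A minimal sketch of how the received input data series could be represented is given below. The `Event` record, the `make_series` helper, the use of seconds as the time unit and the application names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Event:
    time: float       # event time in seconds (assumed unit)
    event_type: str   # e.g. an application name


def make_series(raw):
    """Build a time-ordered list of events from (time, event_type) pairs."""
    # order=True compares events field by field, so sorting orders by time first.
    return sorted(Event(float(t), str(kind)) for t, kind in raw)


series = make_series([(30, "mail"), (0, "chat"), (12.5, "browser")])
```

Sorting once at receipt means the later steps can assume that events arrive in temporal order.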
In an embodiment, the plurality of events includes one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events. The plurality of events includes different types of events being employed at different platforms and devices, such as, but not limited to, a data storage event such as a data transfer for data compression, a server access event such as employing a virtual machine (VM) server for processing data at a remote location, or a user activity event such as launching and using an application or closing an application. Typically, the plurality of events comprises one or more of each type of event. For example, the plurality of events may comprise one or more data storage events, one or more server access events and one or more user activity events, either individually or in any combination. By virtue of the plurality of events being one or more of the storage events, the server access events, or the user activity events, various operational aspects associated with a computing device can be analyzed. This enables consideration of the operational aspects that can majorly influence usage of the computing devices; by predicting such operational aspects, the performance of such computing devices can be majorly optimized.
In another embodiment, a user activity event includes an application launch, where the associated event time includes an application launch time and the event type includes an application name. The term “user activity event” refers to an event associated with a user behaviour or activity, such as opening or closing of an application. Specifically, the user activity includes an application launch. For example, a user may initiate or launch any application on the computing device. The user activity event further includes an associated event time, wherein the associated event time includes the application launch time. Moreover, the associated event type includes the application name; for example, a user launches a social media application on a computing device daily, wherein the event type is the name of the social media application. The user activity event being an application launch typically allows identifying and predicting usage of various applications associated with a computing device. Since usage of applications (i.e. launching or closing of applications) constitutes a major portion of the operational data regarding usage of computing devices, this enables optimizing operations of the computing devices for efficient performance thereof.
At step 104, the method 100 comprises initiating, by the processor at each event time, an active segment for the corresponding event type. The term “active segment” refers to a time period for the currently active event type. Thus, for each event type there may be one or more segments having different event times, wherein at each respective event time, each respective segment may be referred to as an active segment. Typically, the processor is configured to initiate the active segment at each event time. For example, the processor initiates the active segment whenever a user launches an application. The plurality of events are performed within the active segments, which enables the input data series to be tracked and manipulated effectively.
At step 106, the method 100 comprises setting, by the processor, an expiry time for each active segment based on a predetermined threshold time period for each event type. Typically, the processor is configured to set the expiry time for each active segment. The “expiry time” refers to an end time for an event type, for example, the closing time of an application initiated by either the user or the processor. Generally, the processor is configured to analyze the user behaviour regarding an event type, for example, the times of usage or duration of usage, frequency of usage for a particular application, to infer the behavioural patterns and/or tendencies of the user to further determine an end time or expiry time. The expiry time enables the processor to calculate the threshold time period i.e. the time period between the launch and closing of an application. Typically, the processor sets the expiry time for each active segment based on the predetermined threshold time period.
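Steps 104 and 106 can be sketched together as follows: a segment is initiated at the event time and its expiry is set by adding the threshold time period of its event type. The `ActiveSegment` record, the `open_segment` helper and the fallback `default_threshold` of 60 seconds are hypothetical names and values, not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ActiveSegment:
    event_type: str
    start: float    # event time at which the segment was initiated
    expiry: float   # time at which the segment is to be expired


def open_segment(event_time, event_type, thresholds, default_threshold=60.0):
    """Initiate a segment at the event time and set its expiry from the type's threshold."""
    # Fall back to an assumed default when no threshold is known for this event type.
    threshold = thresholds.get(event_type, default_threshold)
    return ActiveSegment(event_type, event_time, event_time + threshold)


segment = open_segment(100.0, "chat", {"chat": 120.0})  # expiry at 220.0 s
```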
In an embodiment, the predetermined threshold time period for each event type is determined based on an average usage associated with each event type. The processor is configured to determine the predetermined threshold time period for each event type, in terms of minutes or seconds, based on an average usage associated with each event type. Generally, the user performs certain activities or events regularly, like launching a specific application for a specific time period daily on the computing device. In such cases, the processor determines an average usage time for the user for each event type based on the history of usage and correspondingly determines the predetermined threshold time period based on the average usage time of the associated event type. Beneficially, considering the average usage associated with each event type enables the predetermined threshold time period to be determined more precisely. Typically, this also enables the method 100 or the processor to consider different events associated with each event type, thereby making the predetermined threshold time period more precise.
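One simple way to derive per-type thresholds from a usage history is to take the arithmetic mean of the recorded usage durations. The sketch below assumes the history is available as (event type, duration in seconds) pairs; this input format and the name `thresholds_from_usage` are assumptions.

```python
from collections import defaultdict
from statistics import mean


def thresholds_from_usage(usage_log):
    """Map each event type to the mean of its observed usage durations (seconds)."""
    durations = defaultdict(list)
    for event_type, seconds in usage_log:
        durations[event_type].append(seconds)
    return {event_type: mean(values) for event_type, values in durations.items()}


# Hypothetical usage history: two chat sessions and one camera session.
usage_log = [("chat", 100.0), ("chat", 140.0), ("camera", 30.0)]
thresholds = thresholds_from_usage(usage_log)  # {"chat": 120.0, "camera": 30.0}
```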
In another embodiment, setting the expiry time for the active segment of a certain event type includes resetting the expiry time if a further event corresponding to the same event type occurs during the active segment. Typically, when setting the expiry time for the active segment of a certain event type, if a further event corresponding to the event type occurs during the same active segment, the expiry time of the active segment is reset by the processor. Alternatively stated, if the active segment entails more than one event, the expiry time of the active segment is reset, since the earlier set expiry time might otherwise occur before a further event takes place. Thus, resetting the expiry time of the active segment allows altering the predetermined threshold time period associated with the expiry of such segments and their associated events. This allows altering the active segment associated with such events. Typically, resetting the expiry time allows the number of active segments in the predefined time window to change, which in turn changes when the number of active segments falls. Therefore, new boundaries between adjacent clusters can be determined based on the resetting of the expiry time. This allows accommodating real-time or actual changes in the activity clusters based on the probable upcoming events, which in turn enables efficient prediction of the events for optimizing operations of the apparatus and other apparatuses associated with the apparatus.
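The resetting behaviour can be sketched as a single pass over time-ordered events. The sketch assumes that an event occurring exactly at the expiry time still counts as occurring during the segment, and that the reset expiry is the new event time plus the same threshold; the function name `build_segments` is hypothetical.

```python
def build_segments(events, thresholds):
    """events: time-ordered (time, event_type) pairs.

    Returns a list of [event_type, start, expiry] segments, resetting the expiry
    when a further event of the same type occurs during the active segment.
    """
    segments = []
    latest = {}  # event type -> index of its most recent segment
    for time, event_type in events:
        new_expiry = time + thresholds[event_type]
        idx = latest.get(event_type)
        if idx is not None and time <= segments[idx][2]:
            segments[idx][2] = new_expiry          # reset the expiry of the active segment
        else:
            latest[event_type] = len(segments)     # initiate a new active segment
            segments.append([event_type, time, new_expiry])
    return segments


events = [(0.0, "chat"), (50.0, "chat"), (200.0, "chat")]
segments = build_segments(events, {"chat": 60.0})
```

In this example the second event falls inside the first segment and moves its expiry from 60 s to 110 s, while the third event arrives after the expiry and initiates a new segment from 200 s to 260 s.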
At step 108, the method 100 comprises expiring, by the processor, each active segment at the corresponding expiry time. Typically, the processor is configured to expire or close each active segment associated to a corresponding event type at the corresponding expiry time as set by the processor earlier at step 106 based on the predetermined threshold time period. Thus, upon setting the expiry time, the processor is configured to expire the active segment upon reaching the set expiry time.
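To illustrate the effect of expiring segments, the sketch below tracks how the number of active segments changes over time. It assumes segments are given as (start, expiry) pairs and that a segment expiring at the same instant another one starts is expired first; `active_count_timeline` is a hypothetical helper name.

```python
def active_count_timeline(segments):
    """segments: (start, expiry) pairs. Returns [(time, active_count)] after each change."""
    changes = []
    for start, expiry in segments:
        changes.append((start, +1))   # a segment is initiated
        changes.append((expiry, -1))  # a segment is expired
    # Tuples sort by time, then by delta, so expiries (-1) precede starts (+1) at equal times.
    changes.sort()
    timeline, active = [], 0
    for time, delta in changes:
        active += delta
        timeline.append((time, active))
    return timeline


timeline = active_count_timeline([(0.0, 60.0), (10.0, 40.0), (60.0, 90.0)])
```

The resulting timeline rises to two active segments at 10 s, drops back to one at 40 s, and reaches zero at 60 s before the third segment starts at that same instant.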
And at step 110, the method 100 comprises dividing, by the processor, the input data series into a series of activity clusters. Typically, the processor is configured to divide the input data series into the series of activity clusters. The term “activity cluster” refers to an active group of events or active segments having a similar behaviour. Specifically, the processor is configured to divide the input data series via the boundary into a series of activity clusters. The term “boundary” refers to a logical division of the input data series based on a certain parameter or behaviour into multiple continuous clusters segmented by the boundary. Generally, each cluster is a list of consecutive events, united by a single user activity. Typically, the processor is configured to set the boundary between adjacent clusters when either:
(a) a predetermined number of active segments are initiated within a predefined time window; or
(b) a number of active segments falls to a predefined threshold.
Thus, two different scenarios may be tracked by the processor, wherein at least a predetermined number of active segments are launched or closed either simultaneously or within a short period of time. Thus, the processor is configured to set the boundary, if any one of the above conditions hold true. Optionally, if both the conditions hold true within a short period of time, only a single boundary is set by the processor instead of two closely spaced boundaries to simplify the clustering method.
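As a rough illustration of conditions (a) and (b), the following Python sketch scans a time-sorted event list and records a boundary time whenever either condition becomes true. The function name `find_boundaries`, the placement of the boundary at the instant the condition is met, and the clearing of the start history after condition (a) fires are all assumptions of this sketch, not details taken from the disclosure. Segments still open after the last event are not expired here.

```python
def find_boundaries(events, threshold, n_start=5, window=120.0, n_low=2):
    """Return boundary times for a time-sorted list of (time, event_type).

    threshold: dict mapping event type -> threshold time period (seconds)
    n_start:   number of segment initiations that triggers condition (a)
    window:    time window in seconds for condition (a)
    n_low:     active-segment count that triggers condition (b)
    """
    active = {}         # event type -> expiry time of its active segment
    recent_starts = []  # start times of recently initiated segments
    boundaries = []
    for t, etype in events:
        # Expire segments whose expiry time has passed, earliest first.
        for typ, exp in sorted(active.items(), key=lambda kv: kv[1]):
            if exp <= t:
                del active[typ]
                if len(active) == n_low:       # condition (b)
                    boundaries.append(exp)
        if etype not in active:                # a new segment is initiated
            recent_starts = [s for s in recent_starts if t - s <= window]
            recent_starts.append(t)
            if len(recent_starts) >= n_start:  # condition (a)
                boundaries.append(t)
                recent_starts.clear()          # assumed: avoid re-firing
        active[etype] = t + threshold[etype]   # set or reset the expiry time
    return boundaries
```

For example, five different event types launched 10 s apart, each with a 100 s threshold, yield one boundary when the fifth segment starts and a second one when the active count later falls to two.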
In an exemplary scenario, the times of application launches by a user are provided as the input data series. Herein, the input data series is divided into one or more clusters, wherein each cluster is associated with a single user activity. Typically, in such a scenario, the processor is configured to search for the beginning (initiation time) of a cluster, wherein a new cluster is initiated if the user changes behavior, i.e. opens or launches a plurality of event types or applications, and to search for the end (end time) of a cluster, wherein the cluster ends if the user closes a plurality of applications or event types together. Typically, the processor is configured to determine a lifetime of the application or event type, wherein during the lifetime the user may continually use or perform events on the application. Specifically, the processor is configured to calculate the lifetime (T[A]), in minutes or seconds, for an application A based on an average time between the current launch of application A and any previous application for a plurality of users. Generally, if the user does not launch the application A again within the lifetime T[A], the predictive model predicts that the user will not launch the application A in the near future: based on the average time the user should already have launched the application again, so the application A is treated as currently “closed” by the user.
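The lifetime T[A] can be estimated in several ways, and the text leaves the exact averaging open. The sketch below takes one possible reading: T[A] is the mean gap between consecutive launches of the same application by the same user, pooled over all users. The function name `estimate_lifetimes` and the record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_lifetimes(launches):
    """Estimate a lifetime per application from launch records.

    launches: iterable of (user_id, time_seconds, app_name)
    Returns a dict app_name -> mean gap in seconds between consecutive
    launches of that app by the same user, pooled over all users.
    Apps launched only once per user have no gap and are omitted.
    """
    last_seen = {}            # (user, app) -> time of the previous launch
    gaps = defaultdict(list)  # app -> observed gaps between launches
    for user, t, app in sorted(launches, key=lambda r: (r[0], r[1])):
        key = (user, app)
        if key in last_seen:
            gaps[app].append(t - last_seen[key])
        last_seen[key] = t
    return {app: mean(g) for app, g in gaps.items()}
```

The resulting values could serve as the per-event-type threshold time periods used when setting expiry times, under the same assumption.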
In an embodiment, when any number of (a) and/or (b) occur within a predefined short window of time, only a single boundary is set within the short window. Typically, the single boundary is set within the short window either when the predetermined number of active segments are initiated within the predefined time window, or when the number of active segments falls to the predefined threshold. The term “predetermined number of active segments” refers to the maximum count of active segments set by either the processor or the user, wherein when the predetermined number of active segments are initiated by the user, a single boundary is set within the short window of time. Otherwise, when the number of active segments falls to the predefined threshold, the single boundary is set. The term “predefined threshold” refers to the minimum count of active segments, after which a boundary is set by the processor. Typically, the boundary divides the input data series into a series of activity clusters based on similar behaviours or upon meeting a predefined criterion. In an embodiment, the predetermined number of active segments is 5, the predefined time window is 2 minutes, and the predefined threshold for the number of active segments is 2. Specifically, the processor is configured to set the boundary when 5 active segments are initiated within the predefined time window of 2 minutes. Moreover, the processor is configured to set the boundary when the number of active segments falls to the predefined threshold of 2 active segments. Alternatively stated, when two, one or no active segments remain within the activity cluster, the boundary is set by the processor. Beneficially, the input data series can be efficiently segregated into the series of activity clusters, i.e. the boundary between adjacent clusters can be easily demarcated. This allows the activity clusters to be easily identified from the input data series.
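The rule that only a single boundary is set within a short window can be sketched as a post-processing step over candidate boundary times. Keeping the first boundary of each closely spaced run, and the name `merge_close_boundaries`, are assumptions of this sketch; the text does not say which of the candidate boundaries survives.

```python
def merge_close_boundaries(times, short_window):
    """Collapse boundaries that fall within short_window of the last kept one.

    times:        candidate boundary times in seconds (any order)
    short_window: length in seconds of the predefined short window
    Returns the kept boundary times in ascending order.
    """
    merged = []
    for t in sorted(times):
        # Keep t only if it is far enough from the previously kept boundary.
        if not merged or t - merged[-1] > short_window:
            merged.append(t)
    return merged
```

With a 30 s short window, candidate boundaries at 40, 45, 50, 300 and 310 s reduce to boundaries at 40 s and 300 s.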
In an embodiment, the method 100 comprises training, by the processor, a context-based event prediction model for each activity cluster based on the plurality of events within the cluster. The term “context-based event prediction model” refers to a predictive machine learning model configured for employing known results to create, process or validate another existing machine learning model to predict or forecast future events. Herein, the processor is configured to train the context-based event prediction (CP) model for each activity cluster based on the plurality of events within the cluster. Typically, the plurality of events includes data storage events, server access events and user activity events. Thus, the context-based event prediction model is trained and analyzed based on each event of the plurality of events within the activity cluster, where the context-based prediction model is configured to output a probability for an upcoming event type based on a currently active segment. The “currently active segment” refers to the currently active cluster of the one or more clusters based on which the context-based event prediction model is configured to output the probability for the upcoming event type. In an example, the context-based event prediction model analyses each cluster and the plurality of events therein, to predict which event type or application will be initiated next based on the currently active segment. Generally, such context-based prediction models are employed within one cluster. For example, in scenarios where a user performs the same events in a periodic manner, such as daily, the context-based event prediction model may improve the predictions of a main model (such as a different machine learning model other than the context-based event prediction model).
In an embodiment, the method 100 comprises training, by the processor, a general event prediction model based on the plurality of events within the input data series. The processor is configured to train the general event prediction (GP) model based on the plurality of events including one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events. In an exemplary scenario, events such as the user activity events within the input data series are employed to train the general event prediction model. The “general event prediction model” refers to a machine learning model whose predictions another machine learning model, such as the context-based prediction model, may be used to enhance or improve in accuracy. Typically, the general event prediction model analyses the input data series and, based on the plurality of events, predicts a future event based on a currently active segment. Typically, the general event prediction model predicts the probability of any general event happening in the future; for example, the general event prediction model predicts the probability of initiating a segment of a particular event type based on the currently active segment and the plurality of events. The processor is further configured for generating a modified event prediction based on a probability output by the general event prediction model and the probability output by the context-based event prediction model. Typically, each of the general event prediction model and the context-based prediction model is configured to generate a probability output relating to an event or active segment. The probability outputs of the general event prediction model and the context-based prediction model are employed together to improve the accuracy of prediction and generate the modified event prediction model.
Specifically, the context based prediction model is trained on the series of activity clusters, and the corresponding prediction information or probability output relating to the activity cluster is used by the general event prediction model to generate a more accurate prediction by virtue of the modified event prediction model. Typically, the probability of an event occurring next is a real number between zero and one. Notably, GP refers to the prediction of the general event prediction model and CP refers to the prediction of the context based prediction model. The modified event prediction model is configured to determine the probability or prediction based on the formula given by:
Prediction = [formula shown in Figure imgf000016_0001]
In other words,
Prediction = [formula shown in Figure imgf000016_0002]
wherein:
$P_a^{GP}$ is the probability of the event from the general event prediction model (GP);
$P_a^{CP}$ is the probability of the event from the context based prediction model (CP); and
$\alpha$ is the parameter used for regularization to linearly combine the predictions from both the GP and CP models.
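Because the formula itself survives only as an image reference, its exact form cannot be read from this text. The following Python sketch therefore assumes the common convex combination alpha * P_GP + (1 - alpha) * P_CP, followed by choosing the event type with the highest blended probability. The assumed form, the weight name `alpha` (standing for the regularization parameter the text lists as "a"), and the function names are illustrative only.

```python
def combine_predictions(p_gp, p_cp, alpha):
    """Blend per-event-type probabilities from the GP and CP models.

    p_gp, p_cp: dicts mapping event type -> probability in [0, 1]
    alpha:      weight in [0, 1] given to the GP model
    Assumed form (not confirmed by the source):
        alpha * P_GP + (1 - alpha) * P_CP
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    event_types = set(p_gp) | set(p_cp)
    return {
        e: alpha * p_gp.get(e, 0.0) + (1 - alpha) * p_cp.get(e, 0.0)
        for e in event_types
    }


def predict(p_gp, p_cp, alpha):
    """Return the event type with the highest blended probability."""
    blended = combine_predictions(p_gp, p_cp, alpha)
    return max(blended, key=blended.get)
```

Setting `alpha` to 1 reduces the blend to the general model alone, and setting it to 0 reduces it to the context-based model alone, which makes the parameter easy to tune on held-out data.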
Typically, the method 100 complements the already existing general event prediction model with a newly trained context based prediction model working specifically on the activity clusters, so as to generate the modified event prediction model. In an exemplary scenario, the prediction output using the general event prediction model alone is 0.633; employing the context based prediction model along with the general event prediction model improves the probability output from 0.633 to 0.718. In another exemplary scenario, wherein a Markov chain is implemented with the modified event prediction model, the probability output improves from 0.7482 to 0.7632. The term “Markov chain” refers to a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Alternatively stated, a discrete-time Markov chain is a countably infinite sequence in which the chain changes its state at discrete time steps or intervals.
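The Markov chain definition above can be made concrete with a short Python sketch that estimates first-order transition probabilities from a sequence of event types by counting consecutive pairs. How the disclosure combines the chain with the modified event prediction model is not specified, so this sketch covers only the chain itself; the name `fit_markov_chain` is hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequence):
    """Estimate first-order transition probabilities from event types.

    sequence: list of event types in the order they occurred
    Returns a dict prev_type -> {next_type: probability}, where each
    probability is the relative frequency of next_type following prev_type.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }
```

Each row of the returned mapping sums to one, and the next-event distribution depends only on the most recent event type, which is the defining property stated in the text.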
The present disclosure also provides a computer program product comprising a non-transitory computer-readable storage medium configured to store instructions or computer program code thereon, the instructions being executable by a processor to execute the method 100. Typically, the method 100 is for a computing device for carrying out clustering of time series data. Examples of implementation of the non-transitory computer-readable storage medium include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory. A computer readable storage medium for providing a non-transient memory may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Referring to FIG. 2, illustrated is a graphical representation 200 of an exemplary scenario of a series of activity clusters 210 and associated plurality of events, in accordance with another embodiment of the present disclosure. The graphical representation 200 should be read in conjunction with the method 100 explained with reference to FIG. 1. The graphical representation 200 represents the time period in seconds on the X-axis and the event type in numbers on the Y-axis. FIG. 2 depicts the series of activity clusters, namely a first activity cluster 202, a second activity cluster 204, a third activity cluster 206, a fourth activity cluster 208 and a fifth activity cluster 210. Each activity cluster includes a plurality of active segments (represented as a solid line parallel to the Y-axis) and the associated plurality of events (represented as hollow circles or points on the active segment). Generally, each activity cluster is a list of consecutive events, united by a single user activity. As shown, the fourth activity cluster 208 depicts a scenario in which a predetermined number of active segments are initiated within a predefined time window, for example, more than or equal to five. Further, the second activity cluster 204 depicts a scenario in which the number of active segments falls to a predefined threshold, for example, less than or equal to two.
FIG. 3 illustrates a block diagram of an apparatus 300 for clustering time series data, in accordance with an embodiment of the present disclosure. As shown, the apparatus 300 comprises a processor 302 for clustering time series data. The apparatus 300 of FIG. 3 should be read in conjunction with FIGs. 1-2. Typically, the apparatus 300 is operable to perform the method 100 for clustering time series data. Generally, the apparatus 300 includes computational elements such as a memory, a processor, a data communication interface, a network adapter and the like, to store, process and/or share files or information with other apparatuses, such as another computation device, a server and the like. Examples of the apparatus 300 may include, but are not limited to, a computation system, a virtual machine (VM) server, a server arrangement, a data compression scheme.
Various embodiments, operations, and variants disclosed above, with respect to the method 100, apply mutatis mutandis to the apparatus 300.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A computer implemented method (100) for clustering time series data, comprising: receiving, by a processor (302), an input data series including a plurality of events associated with a corresponding event time and an event type; initiating, by the processor (302) at each event time, an active segment for the corresponding event type; setting, by the processor (302), an expiry time for each active segment based on a predetermined threshold time period for each event type; expiring, by the processor (302), each active segment at the corresponding expiry time; and dividing, by the processor (302), the input data series into a series of activity clusters; where a boundary between adjacent clusters (202, 204, 206, 208, 210) is set when either:
(a) a predetermined number of active segments are initiated within a predefined time window; or
(b) a number of active segments falls to a predefined threshold.
2. The method (100) of claim 1, wherein setting the expiry time for the active segment of a certain event type includes resetting the expiry time if a further event corresponding to the same event type occurs during the active segment.
3. The method (100) of claim 1 or claim 2, wherein when any number of (a) and/or (b) occur within a predefined short window of time, only a single boundary is set within the short window.
4. The method (100) of any preceding claim, wherein the plurality of events include one or more of a plurality of data storage events, a plurality of server access events or a plurality of user activity events.
5. The method (100) of claim 4, wherein a user activity event includes an application launch, where the associated event time includes an application launch time and the event type includes an application name.
6. The method (100) of any preceding claim, wherein the predetermined threshold time period for each event type is determined based on an average usage associated with each event type.
7. The method (100) of any preceding claim, wherein the predetermined number of active segments is 5, the predefined time window is 2 minutes, and the predefined threshold for the number of active segments is 2.
8. The method (100) of any preceding claim, further comprising: training, by the processor (302), a context-based event prediction model for each activity cluster based on the plurality of events within the cluster, where the context-based prediction model is configured to output a probability for an upcoming event type based on a currently active segment.
9. The method (100) of claim 8, further comprising: training, by the processor (302), a general event prediction model based on the plurality of events within the input data series; and generating, by the processor (302), a modified event prediction based on a probability output by the general event prediction model and the probability output by the context-based event prediction model.
10. A computer-readable medium configured to store instructions which, when executed by a processor (302), cause the processor to perform the method (100) of any preceding claim.
11. A data-processing apparatus (300) comprising a processor (302) configured to perform the method (100) of any one of claims 1 to 9.
PCT/RU2020/000739 2020-12-22 2020-12-22 Method and apparatus for clustering time series data WO2022139615A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2020/000739 WO2022139615A1 (en) 2020-12-22 2020-12-22 Method and apparatus for clustering time series data

Publications (1)

Publication Number Publication Date
WO2022139615A1 true WO2022139615A1 (en) 2022-06-30

Family

ID=75108759

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2020/000739 WO2022139615A1 (en) 2020-12-22 2020-12-22 Method and apparatus for clustering time series data

Country Status (1)

Country Link
WO (1) WO2022139615A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170228661A1 (en) * 2014-04-17 2017-08-10 Sas Institute Inc. Systems and methods for machine learning using classifying, clustering, and grouping time series data
EP3675008A1 (en) * 2018-12-31 2020-07-01 Kofax, Inc. Systems and methods for identifying processes for robotic automation and building models therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20866921; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20866921; Country of ref document: EP; Kind code of ref document: A1)