Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a method of discovering data transactions, comprising: receiving a data sequence originating from a business scenario; applying a sliding window to the data sequence; determining a median of the data within the sliding window; comparing each data point of the data within the sliding window to the median to determine whether the data point is within the range defined by an up-float threshold and a down-float threshold of the median, thereby determining whether there is a data transaction; and sliding the sliding window over the data sequence with a particular step size to take a next set of data and repeating the above steps of determining a median and determining whether there is a data transaction, wherein at least one of the sliding window size, the step size, the up-float threshold, and the down-float threshold is specific to data originating from different business scenarios.
According to an embodiment, the method comprises determining that the data point is a transaction data point when the ratio of the difference obtained by subtracting the median from the data point to the median is greater than the up-float threshold of the median, or the ratio of the difference obtained by subtracting the data point from the median to the median is greater than the down-float threshold of the median.
According to a further embodiment, the method further comprises, upon determining that a data transaction is present, recording and/or informing the user of such data transaction.
According to a further embodiment, the method further comprises receiving feedback from the user on data transactions and adjusting at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold based on the feedback.
According to a further embodiment, the method further comprises, upon receiving feedback from the user regarding data transaction recall, adjusting at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold such that the data point is no longer determined to be a transaction data point.
According to a further embodiment, the size of the sliding window, the step size, the float-up threshold and the float-down threshold are predefined or obtained by a training process based on historical data.
According to a further embodiment, the historical data is historical data tagged with data transactions, and the training process comprises: training on the historical data by using grid search over various combinations of different values of the sliding window size, step size, up-float threshold, and down-float threshold so as to mark data transactions in the historical data; comparing the marked data transactions with the data transaction tags of the historical data to obtain the data transaction recall rate and precision rate under the currently used sliding window size, step size, up-float threshold, and down-float threshold; and using the set of sliding window size, step size, up-float threshold, and down-float threshold that best matches the data transaction tags of the historical data, or that has the highest data transaction precision or the lowest data transaction recall.
According to a further embodiment, the data sequence is a real-time data stream and the method is performed in real-time.
According to a second aspect of the present disclosure, there is provided a system for discovering data transactions, comprising: a receiving component configured to receive a data sequence originating from a business scenario; a sliding window component configured to slide the sliding window over the data sequence in a particular step size to apply a sliding window to the data sequence; a median component configured to determine a median of data within the sliding window; and a comparison component configured to compare each data point of data within the sliding window to the median to determine whether the data point is within the range defined by an up-float threshold and a down-float threshold of the median, thereby determining whether there is a data transaction, wherein at least one of the sliding window size, the step size, the up-float threshold, and the down-float threshold differs for data originating from different business scenarios.
According to an embodiment, the comparison component is further configured to determine that the data point is a transaction data point when a ratio of the difference of the data point minus the median to the median is greater than an up-float threshold of the median or a ratio of the difference of the median minus the data point to the median is greater than a down-float threshold of the median.
According to a further embodiment, the system further comprises a notification component configured to notify the user of a data transaction after the comparison component determines that such a data transaction exists.
According to a further embodiment, the system further comprises an adjustment component configured to receive feedback on data transactions from the user and adjust at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold based on the feedback.
According to a further embodiment, the adjustment component is further configured to adjust at least one of the size of the sliding window, the step size, the float-up threshold, and the float-down threshold upon receiving feedback from the user regarding data transaction recall such that the data point is no longer determined to be a transaction data point.
According to a further embodiment, the size of the sliding window, the step size, the float-up threshold and the float-down threshold are predefined or obtained by a training process based on historical data.
According to a further embodiment, the historical data is historical data tagged with data transactions, and the training process comprises: training on the historical data by using grid search over various combinations of different values of the sliding window size, step size, up-float threshold, and down-float threshold so as to mark data transactions in the historical data; comparing the marked data transactions with the data transaction tags of the historical data to obtain the data transaction recall rate and precision rate under the currently used sliding window size, step size, up-float threshold, and down-float threshold; and using the set of sliding window size, step size, up-float threshold, and down-float threshold that best matches the data transaction tags of the historical data, or that has the highest data transaction precision or the lowest data transaction recall.
According to a further embodiment, the data sequence is a real-time data stream.
According to a third aspect of the present disclosure, there is provided a system for discovering data transactions, comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method according to the first aspect of the present disclosure.
Aspects generally include a method, apparatus, system, computer program product, and processing system substantially as described herein with reference to and as illustrated by the accompanying drawings.
The foregoing has outlined rather broadly the features and technical advantages of examples in accordance with the present disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter. The disclosed concepts and specific examples may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Such equivalent constructions do not depart from the scope of the appended claims. The features of the concepts disclosed herein, both as to their organization and method of operation, together with associated advantages, will be better understood from the following description when considered in connection with the accompanying drawings. Each of the figures is provided for the purpose of illustration and description and is not intended to limit the claims.
Detailed Description
The detailed description set forth below in connection with FIGS. 1-3 is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of the various concepts. It will be apparent, however, to one skilled in the art that these concepts may be practiced without these specific details.
Fig. 1 illustrates an example method 100 of discovering data transactions in accordance with aspects of the present disclosure. The method 100 may include receiving a data sequence originating from a business scenario at block 110. For example, there may be various business scenarios such as credit card repayment, financial product revenue consultation, telecommunication/telephone packages, etc., and an intelligent question-answering system applied to these business scenarios may generate corresponding data sequences (i.e., data streams). The method 100 may receive such a data sequence. In an example, the data sequence is real-time data or historical data of a business scenario.
At block 120, the method 100 may include applying a sliding window to the data sequence. In this example, applying the sliding window to the data sequence includes applying the sliding window to the accumulated data sequence of the real-time data stream to determine the data that falls within the sliding window. For example, where the size of the sliding window is N (where N is any natural number), applying the sliding window to the data sequence yields the N data points of the data sequence that fall within the sliding window.
At block 130, the method 100 may include determining a median of the data within the sliding window. It will be appreciated by those skilled in the art that the median is the middle value of a set of data arranged in sorted order, and if the set contains an even number of values, the average of the two middle values may be returned.
In an embodiment, the determination of the median is made based on a certain indicator of the data. For example, in a financial product revenue consultation scenario, the data indicator may be the consultation volume per unit time within the sliding window, in which case the median may be the median of the consultation volumes per unit time within the sliding window.
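The median computation of block 130 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `window_median` is an assumption, and the standard-library `statistics.median` is used because it already returns the average of the two middle values for an even number of data points.

```python
from statistics import median


def window_median(window):
    """Return the median of the data points in one sliding window.

    For an even number of points, statistics.median returns the
    average of the two middle values, as described for block 130.
    """
    return median(window)


odd_window = [25, 30, 25, 40, 21]   # sorted: 21, 25, 25, 30, 40
even_window = [25, 30, 40, 21]      # sorted: 21, 25, 30, 40

print(window_median(odd_window))    # middle value of the sorted window
print(window_median(even_window))   # mean of the two middle values
```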
At block 140, the method 100 may include comparing each data point of the data within the sliding window to the median to determine whether the data point is within the range defined by an up-float threshold and a down-float threshold of the median, thereby determining whether a data transaction exists.
In one embodiment, a data point is determined to be a transaction data point when the data point exceeds the up-float threshold or the down-float threshold of the median. For example, the up-float threshold and/or the down-float threshold may each be set as a threshold on the ratio of the difference between the data point and the median to the median (e.g., 20% and 25%, respectively). In this example, the data point may be determined to be a transaction data point when the ratio of the difference of the data point minus the median to the median is greater than the up-float threshold of the median (i.e., 20% in the above example), or the ratio of the difference of the median minus the data point to the median is greater than the down-float threshold of the median (i.e., 25% in the above example).
As an example, suppose that in a financial product revenue consultation scenario the consultation volume per unit time within the sliding window is {25, 30, 25, 40, 21}, so that the median is 25. With the up-float threshold set to 20% and the down-float threshold set to 30%, the data point {40} may be determined to be a transaction data point because (40-25)/25 = 60%, which is greater than the up-float threshold of 20%. The data point {21}, by contrast, may be determined not to be a transaction data point because (25-21)/25 = 16%, which is less than the down-float threshold of 30%. Similarly, the data points {25} and {30} are not transaction data points: {25} equals the median, and (30-25)/25 = 20% does not exceed the up-float threshold of 20%.
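The comparison of block 140 can be sketched as follows, using the numbers of the example above. The function name `find_transaction_points` and its parameter names are assumptions made for illustration, and the sketch further assumes a positive median so that the ratio is well defined.

```python
from statistics import median


def find_transaction_points(window, up_threshold, down_threshold):
    """Return the indices of the data points flagged in one window.

    A point is flagged when (point - median) / median exceeds the
    up-float threshold, or (median - point) / median exceeds the
    down-float threshold. Assumes the median is positive.
    """
    m = median(window)
    flagged = []
    for index, point in enumerate(window):
        above = (point - m) / m > up_threshold
        below = (m - point) / m > down_threshold
        if above or below:
            flagged.append(index)
    return flagged


window = [25, 30, 25, 40, 21]
# Up-float threshold 20%, down-float threshold 30%, as in the example.
print(find_transaction_points(window, 0.20, 0.30))
```

Only the index of the data point 40 is returned; the comparison is strict, so the data point 30, whose ratio equals the up-float threshold, is not flagged.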
In an embodiment, if it is determined that a data transaction exists, the method 100 may further include notifying the user of the data transaction. Those skilled in the art will appreciate that the user may be notified in various ways, such as sending the user an email or a short message, placing a phone call to the user, popping up a prompt box on the user's computing device, and so forth. In this embodiment, the user may make a determination about the data transaction based on this notification. For example, the user may determine whether the data transaction does exist by investigating the corresponding data, and the user may feed back this determination.
For example, if the user determines that the data transaction is not a true data transaction, feedback regarding recall of the data transaction may be sent by the user.
In this embodiment, the method 100 may further include, after receiving feedback on the data transaction from the user, adjusting at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold based on such feedback. For example, upon receiving feedback from a user regarding a data transaction recall, the method 100 may include adjusting at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold such that the data point is no longer determined to be a transaction data point, i.e., such that the data transaction is recalled.
In one example, the adjustment may be performed using a training process as described below. In this example, the adjustment may be made each time user feedback is received, or may be made after a predetermined number of items of user feedback have been received.
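As a minimal illustration of feedback-driven adjustment, the sketch below relaxes whichever threshold caused a recalled data point to be flagged, so that the same point is no longer flagged. This rule is a hypothetical simplification chosen for illustration; it is not the training-based adjustment mentioned above, and the function name `relax_thresholds` and its parameters are assumptions.

```python
def relax_thresholds(point, window_median, up_threshold, down_threshold):
    """Return (up, down) thresholds under which `point` is not flagged.

    Hypothetical rule: raise the threshold that was exceeded to the
    point's own relative deviation. Because flagging uses a strict
    "greater than" comparison, the point then falls within range.
    Assumes a positive median.
    """
    deviation = (point - window_median) / window_median
    if deviation > up_threshold:
        up_threshold = deviation
    elif -deviation > down_threshold:
        down_threshold = -deviation
    return up_threshold, down_threshold


# The user recalls the data transaction reported for point 40 (median 25).
up, down = relax_thresholds(40, 25, 0.20, 0.30)
still_flagged = (40 - 25) / 25 > up
print(up, down, still_flagged)
```

In practice such a rule would likely be applied conservatively, for instance only after a predetermined number of recalls, since a single relaxed threshold also suppresses later data points with similar deviations.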
At block 150, the method 100 may include sliding the sliding window over the data sequence by a particular step size to take a next set of data and repeating the steps in blocks 130 and 140.
In an example, where the data sequence is a real-time data stream originating from a business scenario, the method 100 may wait for this real-time data stream to fill the sliding window and then begin repeating the steps in blocks 130 and 140.
In the case where the data sequence is historical data of a business scenario, if the tail of the historical data does not exactly fill a sliding window (e.g., after sliding the sliding window over the historical data in a particular step size, the last data point of the historical data does not fall on the last position of the sliding window), the sliding window may be rolled back a particular number of data points such that the last data point of the historical data falls exactly on the last position of the sliding window. Alternatively, in another example, the trailing data points of the historical data that are insufficient to fill a sliding window may simply be discarded.
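For the real-time case, one way to wait for the window to fill and then re-evaluate every step is sketched below. The class name `StreamingDetector`, its attributes, and the choice to return the flagged values of each evaluated window are assumptions for illustration; the sketch also assumes a positive median.

```python
from collections import deque
from statistics import median


class StreamingDetector:
    """Evaluate a sliding window over a stream, one data point at a time."""

    def __init__(self, window_size, step, up_threshold, down_threshold):
        # A bounded deque drops the oldest point automatically.
        self.window = deque(maxlen=window_size)
        self.window_size = window_size
        self.step = step
        self.up = up_threshold
        self.down = down_threshold
        self.seen = 0  # total number of points received so far

    def push(self, value):
        """Add one point; return flagged values, or None if no evaluation ran."""
        self.window.append(value)
        self.seen += 1
        if self.seen < self.window_size:
            return None  # still waiting for the window to fill
        if (self.seen - self.window_size) % self.step != 0:
            return None  # the window has not yet advanced by a full step
        m = median(self.window)
        return [x for x in self.window
                if (x - m) / m > self.up or (m - x) / m > self.down]


detector = StreamingDetector(window_size=5, step=2,
                             up_threshold=0.20, down_threshold=0.30)
results = [detector.push(v) for v in [25, 30, 25, 40, 21, 26, 24]]
print(results)
```

With a window of 5 and a step of 2, evaluations occur on the 5th and 7th data points; every other call returns `None`.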
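The two tail-handling options for historical data can be sketched as a function that computes the window start positions. The function name `window_starts` and the `tail` parameter with its values `"rollback"` and `"discard"` are hypothetical names introduced for this sketch.

```python
def window_starts(n_points, window_size, step, tail="rollback"):
    """Return the start index of every window over n_points data points.

    tail="rollback": if the regular stride leaves trailing points
    uncovered, add one final window that ends on the last data point.
    tail="discard": trailing points that cannot fill a window are ignored.
    """
    if n_points < window_size:
        return []  # not enough data to fill even one window
    starts = list(range(0, n_points - window_size + 1, step))
    covered_up_to = starts[-1] + window_size
    if tail == "rollback" and covered_up_to < n_points:
        starts.append(n_points - window_size)
    return starts


print(window_starts(11, 4, 3, tail="rollback"))
print(window_starts(11, 4, 3, tail="discard"))
```

With 11 data points, a window of 4, and a step of 3, the regular windows start at 0, 3, and 6 and cover points 0 through 9; the rollback option adds a window starting at 7 so that the last data point falls on the last position of a window.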
In an embodiment, at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold differs for data originating from different scenarios. In general, the magnitude of the data may differ between scenarios, as may the periodicity of the data, and thus the applicable sliding window size, step size, up-float threshold, and down-float threshold may also differ. For example, for products of Ant Financial, since payouts are generally concentrated on the 10th of each month, the size of the sliding window may be set to around one month, and the step size of the sliding window may accordingly be set to one month or less (e.g., one day, one week, 10 days, one month, etc.); for Yu'e Bao revenue consultation, which is typically concentrated on Mondays, the size of the sliding window may be set to around one week, and the step size of the sliding window may accordingly be set to one week or less (e.g., one day, two days, one week, etc.).
In an embodiment, the size of the sliding window, the step size, the up-float threshold, and the down-float threshold are predefined. For example, the user may predefine the size of the sliding window, the step size, and the thresholds when no historical data is available.
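Scenario-specific parameters could be held in a simple lookup table, as in the sketch below. The scenario keys, the `DetectionParams` type, and the threshold values are all hypothetical; only the window sizes (about one month and about one week, with daily data assumed) and the constraint that the step does not exceed the window follow the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionParams:
    window_size: int        # data points per window (daily data assumed)
    step: int               # data points by which the window slides
    up_threshold: float     # up-float threshold as a fraction of the median
    down_threshold: float   # down-float threshold as a fraction of the median


# Hypothetical per-scenario settings; the threshold values are placeholders.
SCENARIO_PARAMS = {
    "monthly_cycle_scenario": DetectionParams(30, 1, 0.20, 0.30),
    "weekly_cycle_scenario": DetectionParams(7, 1, 0.20, 0.30),
}


def params_for(scenario):
    """Look up the parameters registered for a business scenario."""
    return SCENARIO_PARAMS[scenario]


print(params_for("weekly_cycle_scenario"))
```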
In another embodiment, the size of the sliding window, the step size, the up-float threshold, and the down-float threshold are obtained through a training process based on historical data. In this embodiment, the historical data is historical data tagged with data transactions, and the training process includes: training on the historical data by using grid search over various combinations of different values of the sliding window size, step size, up-float threshold, and down-float threshold so as to mark data transactions in the historical data; comparing the marked data transactions with the data transaction tags of the historical data to obtain the data transaction recall rate and precision rate under the currently used sliding window size, step size, up-float threshold, and down-float threshold; and using the set of sliding window size, step size, up-float threshold, and down-float threshold that best matches the data transaction tags of the historical data, or that has the highest data transaction precision or the lowest data transaction recall.
By way of illustration, one specific example of this training process is given below in connection with a consultation scenario. Those skilled in the art will appreciate that this example is provided for illustrative purposes only. It will be appreciated that for simplicity, in some examples, the step size of the sliding window is set to 2. However, as mentioned above, the step size of the sliding window may also be set to any size and may also be trained by a training process.
For a given time series (e.g., a time series of accumulated consultation volume covering 30 days or more and ending at 10:05 am),
a sliding window size range of [7, 21], an up-float threshold range of [0.2%, 1.5%], and a down-float threshold range of [0.2%, 1.5%] are set;
and the sliding window size range, the up-float threshold range, and the down-float threshold range are traversed by grid search in increments of 1, 0.05%, and 0.05%, respectively. For each combination, the median is calculated on the given time series, the data in each sliding window is compared with the median of that window to determine whether it exceeds the up-float threshold or the down-float threshold, and the corresponding data points are marked in sequence as outlier points or non-outlier points. The window is then slid forward with a step size of 2, and the data points are again marked as abnormal or not, until all the data points are marked. Then, the marking result is compared with the outlier tags of the historical time series to obtain the anomaly recall rate and precision rate for the current sliding window size, up-float threshold, and down-float threshold; the search then proceeds to the next combination of sliding window size, up-float threshold, and down-float threshold (i.e., the sliding window size increases from 7 to 21 in increments of 1, and the up-float threshold and down-float threshold each increase from 0.2% to 1.5% in increments of 0.05%) until the search is completed.
Finally, the set of parameter values (i.e., the combination of sliding window size, up-float threshold, and down-float threshold) that has the highest anomaly annotation precision or the lowest anomaly annotation recall, or that is most consistent with the outlier tags of the historical time series, is used as the optimal parameters and is stored along with the time, data indicator, and scenario to which the time series corresponds, for use in the methods of the present disclosure.
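The training procedure above can be sketched as a small grid search. Several choices in this sketch are assumptions rather than part of the description: a data point counts as marked if any window containing it flags it, trailing points that the stride does not cover stay unmarked, the combination with the highest precision is selected (one of the criteria named above), ties keep the first combination found, and all function names are hypothetical. A positive median is assumed throughout.

```python
from itertools import product
from statistics import median


def mark_points(series, window_size, step, up, down):
    """Mark each point True if any window containing it flags it."""
    marked = [False] * len(series)
    for start in range(0, len(series) - window_size + 1, step):
        window = series[start:start + window_size]
        m = median(window)
        for offset, x in enumerate(window):
            if (x - m) / m > up or (m - x) / m > down:
                marked[start + offset] = True
    return marked


def precision_recall(marked, labels):
    """Compare marked points with the outlier tags of the history."""
    tp = sum(1 for m, l in zip(marked, labels) if m and l)
    fp = sum(1 for m, l in zip(marked, labels) if m and not l)
    fn = sum(1 for m, l in zip(marked, labels) if not m and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def grid_search(series, labels, window_sizes, up_values, down_values, step=2):
    """Return the combination with the highest precision (first on ties)."""
    best = None
    for w, up, down in product(window_sizes, up_values, down_values):
        p, r = precision_recall(mark_points(series, w, step, up, down), labels)
        if best is None or p > best["precision"]:
            best = {"window_size": w, "up": up, "down": down,
                    "precision": p, "recall": r}
    return best


# Grid from the example: window sizes 7..21 in steps of 1, thresholds
# 0.2%..1.5% in steps of 0.05%, built from integers to avoid float drift.
WINDOW_SIZES = range(7, 22)
THRESHOLDS = [(20 + 5 * i) / 10000 for i in range(27)]

# Tiny synthetic history with one tagged outlier at index 4.
history = [100, 100, 100, 100, 110, 100, 100, 100, 100, 100]
tags = [i == 4 for i in range(len(history))]
best = grid_search(history, tags, [5], [0.05], [0.05], step=2)
print(best)
```

Passing `WINDOW_SIZES` and `THRESHOLDS` to `grid_search` with a suitably long tagged series reproduces the full traversal described above; the small call shown here merely keeps the example quick to run.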
Fig. 2 is a block diagram illustrating an example system 200 for discovering data transactions in accordance with aspects of the present disclosure.
As shown in Fig. 2, a system 200 for discovering data transactions may include a receiving component 202 configured to receive a data sequence originating from a business scenario; a sliding window component 204 configured to slide the sliding window over the data sequence in a particular step size to apply a sliding window to the data sequence; a median component 206 configured to determine a median of data within the sliding window; and a comparison component 208 configured to compare each data point of the data within the sliding window to the median to determine whether the data point is within the range defined by the up-float threshold and the down-float threshold of the median, thereby determining whether a data transaction exists.
In an embodiment, at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold differs for data originating from different business scenarios.
In another embodiment, the comparison component 208 is further configured to determine that the data point is a transaction data point when the ratio of the difference of the data point minus the median to the median is greater than an up-float threshold of the median, or the ratio of the difference of the median minus the data point to the median is greater than a down-float threshold of the median.
In yet another embodiment, the system 200 can optionally further include a notification component 210 configured to notify a user of a data transaction after the comparison component determines that such a data transaction exists.
In yet another embodiment, system 200 can also optionally include an adjustment component 212 configured to receive feedback from the user on data transactions and adjust at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold based on the feedback. In this embodiment, adjustment component 212 may be further configured to adjust at least one of the size of the sliding window, the step size, the up-float threshold, and the down-float threshold upon receiving feedback from the user regarding data transaction recall such that the data point is no longer determined to be a transaction data point.
In a further embodiment, the size of the sliding window, the step size, the up-float threshold, and the down-float threshold are predefined or obtained through a training process based on historical data. In this embodiment, the historical data is historical data tagged with data transactions, and the training process includes: training on the historical data by using grid search over various combinations of different values of the sliding window size, step size, up-float threshold, and down-float threshold so as to mark data transactions in the historical data; comparing the marked data transactions with the data transaction tags of the historical data to obtain the data transaction recall rate and precision rate under the currently used sliding window size, step size, up-float threshold, and down-float threshold; and using the set of sliding window size, step size, up-float threshold, and down-float threshold that best matches the data transaction tags of the historical data, or that has the highest data transaction precision or the lowest data transaction recall.
Fig. 3 is a schematic diagram illustrating an example system 300 for discovering data transactions in accordance with aspects of the present disclosure. As shown, system 300 includes a processor 305 and a memory 310. Memory 310 stores computer executable instructions that are executable by processor 305 to implement the method described above in connection with Fig. 1.
As described above, in the methods and systems of the present disclosure, the sliding window size, sliding step size, float-up threshold, float-down threshold, etc. may not be fixed, but may be dynamically adjustable over time, thereby being well suited for multi-scenario data transaction detection.
Meanwhile, the methods and systems of the present disclosure use a sliding median method and train iteratively on historical data or real-time data to obtain the most suitable sliding window size, sliding step size, up-float threshold, and down-float threshold, thereby overcoming the problem that fluctuation amplitudes differ between periods owing to the variance of long time series, and greatly improving the accuracy and recall rate of local short-term (for example, the most recent month or the most recent two weeks) anomaly detection. Moreover, the methods and systems of the present disclosure can also learn iteratively from user feedback to optimize detection performance. Thus, the methods and systems of the present disclosure are applicable to various kinds of data and are not limited to data that follows a normal distribution. Also, the median is more stable than the mean and is not affected by individual outlying values.
The above detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings illustrate, by way of illustration, specific embodiments that can be practiced. These embodiments are also referred to herein as "examples". Such examples may include elements other than those shown or described. However, examples including the elements shown or described are also contemplated. Moreover, examples using any combination or permutation of those elements shown or described (or one or more aspects thereof), either with respect to a particular example (or one or more aspects thereof) or with respect to other examples (or one or more aspects thereof) shown or described herein, are also contemplated.
In the appended claims, the terms "including" and "comprising" are open-ended; that is, a system, apparatus, article, or process that includes elements in addition to those recited after such a term in a claim is still deemed to fall within the scope of that claim. Furthermore, in the appended claims, the terms "first," "second," and "third," etc. are used merely as labels, and are not intended to indicate the numerical order of their objects.
In addition, the order of the operations illustrated in the present specification is exemplary. In alternative embodiments, the operations may be performed in a different order than shown in the figures, and the operations may be combined into a single operation or split into more operations.
The above description is intended to be illustrative, and not restrictive. For example, the examples described above (or one or more aspects thereof) may be used in connection with other embodiments. Other embodiments may be used, such as by one of ordinary skill in the art after reviewing the above description. The abstract allows the reader to quickly ascertain the nature of the technical disclosure. The abstract is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. Furthermore, in the above detailed description, various features may be grouped together to streamline the disclosure. However, the claims may not state every feature disclosed herein, as embodiments may characterize a subset of the features. Further, embodiments may include fewer features than are disclosed in the specific examples. Thus the following claims are hereby incorporated into the detailed description, with each claim standing on its own as a separate embodiment. The scope of the embodiments disclosed herein should be determined with reference to the appended claims, along with the full range of equivalents to which such claims are entitled.