WO2021043140A1 - Label determination method, device and system - Google Patents

Label determination method, device and system

Info

Publication number
WO2021043140A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
label
time series
similarity
reference feature
Prior art date
Application number
PCT/CN2020/112878
Other languages
English (en)
French (fr)
Inventor
张彦芳
薛莉
孙旭东
常庆龙
罗磊
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP20861495.8A (EP4020315A4)
Publication of WO2021043140A1
Priority to US17/683,973 (US20220179884A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284: Relational databases
    • G06F 16/285: Clustering or classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/06: Generation of reports
    • H04L 43/067: Generation of reports using time frame reporting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/088: Non-supervised learning, e.g. competitive learning

Definitions

  • This application relates to the field of Artificial Intelligence (AI), and in particular to a method, device and system for label determination.
  • Machine learning refers to allowing the machine to train a machine learning model based on training samples, so that the machine learning model has predictive capabilities (such as category prediction capabilities) for data outside of the training samples.
  • machine learning has been widely used in many fields. From the perspective of learning methods, machine learning algorithms can be divided into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Among them, supervised learning is a type of basic algorithm in machine learning algorithms.
  • For example, the labeling process can be to mark a face picture as "wearing glasses".
  • the machine learning model can be used to perform corresponding functions, such as image recognition or language translation.
  • The labeling process of sample data is called the label determination process; the content marked on the data is the label, and the label is used to identify the data, for example the category of the data.
  • In view of this, the embodiments of the present application provide a method, device, and system for determining a label, which can solve the current problem of high label determination cost.
  • the technical solution is as follows:
  • In a first aspect, a method for determining a label is provided, including:
  • obtaining a target feature vector of a first time series; obtaining the similarity between the target feature vector and a reference feature vector in a reference feature vector set; and when the similarity between the target feature vector and a first reference feature vector is greater than a similarity threshold, determining the label corresponding to the first reference feature vector as the label of the first time series, where the first reference feature vector is a reference feature vector in the reference feature vector set.
  • The label determination method provided by the embodiments of the present application performs label migration based on the similarity of the feature vectors of time series, which can realize automatic labeling of sample data and reduce the cost of label determination. Moreover, because the similarity calculation is based on the feature vectors of the time series, the influence of interference information in the time series itself is avoided; for example, the influence of interference such as sampling period, amplitude change, quadrant drift, and noise can be reduced, improving the accuracy of label determination. Even for high-dimensional time series, label migration can still be performed accurately.
  • the first time sequence is a time sequence of a network KPI.
  • Optionally, the reference feature vector includes data of one or more features, and the target feature vector also includes data of one or more features.
  • The similarity between the target feature vector and the first reference feature vector is the similarity between a first sub-feature vector and a second sub-feature vector, where the first sub-feature vector and the second sub-feature vector are respectively composed of the data corresponding to the same features in the target feature vector and in the first reference feature vector.
  • The reference feature vectors included in the reference feature vector set and the target feature vector may be obtained using the same extraction algorithm or different extraction algorithms.
  • Correspondingly, the categories and numbers of features involved in each reference feature vector and in the target feature vector may differ, and different situations therefore need to be handled accordingly.
  • Optionally, the similarity determination process includes: screening out the same first feature among the features corresponding to the target feature vector and the features corresponding to the first reference feature vector; obtaining the data corresponding to the first feature in the target feature vector to form the first sub-feature vector; obtaining the data corresponding to the first feature in the first reference feature vector to form the second sub-feature vector; and determining the similarity between the first sub-feature vector and the second sub-feature vector.
  • The similarity between the first sub-feature vector and the second sub-feature vector is taken as the similarity between the first reference feature vector and the target feature vector.
  • In this way, the similarity calculation process can be simplified while ensuring the accuracy of the finally calculated similarity.
  • Optionally, when the reference feature vector and the target feature vector have the same types and numbers of features, the target feature vector and the reference feature vector can be directly used as the first sub-feature vector and the second sub-feature vector, respectively; the similarity between the first sub-feature vector and the second sub-feature vector is determined, and this similarity is taken as the similarity between the reference feature vector and the target feature vector.
  • In this way, the feature screening process can be omitted and the similarity calculation process can be further simplified.
  • Optionally, the first sub-feature vector and the second sub-feature vector are both represented in the form of a sequence, the data at the same position in the two sub-feature vectors correspond to features of the same category, and the similarity between the first sub-feature vector and the second sub-feature vector is negatively related to the distance between them.
  • the distance between the first sub-feature vector and the second sub-feature vector can be acquired first; then, based on the acquired distance, the similarity between the first sub-feature vector and the second sub-feature vector is determined.
  • The distance may be calculated using the Euclidean distance formula, the Chebyshev distance formula, the cosine distance formula, the Mahalanobis distance formula, or other distance formulas.
  • Because the distance between the first sub-feature vector and the second sub-feature vector can effectively reflect the similarity between the two, the similarity can be determined quickly by calculating the distance, improving the efficiency of similarity determination.
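  • As a minimal sketch of this distance-based similarity (illustrative only, not the claimed implementation): the shared first features are screened out, the corresponding data form the first and second sub-feature vectors, and the distance is converted into a similarity. The feature names, the choice of the Euclidean distance, and the conversion sim = 1 / (1 + distance) are assumptions for illustration.

```python
import math

def sub_feature_similarity(target: dict, reference: dict) -> float:
    """Sketch: similarity between a target feature vector and one reference feature
    vector, using only the features they share (the "first feature")."""
    shared = sorted(set(target) & set(reference))   # screen out the same first feature
    if not shared:
        return 0.0
    first_sub = [target[f] for f in shared]         # first sub-feature vector
    second_sub = [reference[f] for f in shared]     # second sub-feature vector
    # Euclidean distance here; the text also allows Chebyshev, cosine or Mahalanobis.
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(first_sub, second_sub)))
    return 1.0 / (1.0 + distance)                   # similarity negatively related to distance

# Hypothetical feature vectors keyed by feature name.
target = {"moving_average": 0.42, "weighted_mv": 0.40, "variance": 0.01}
reference = {"moving_average": 0.45, "weighted_mv": 0.38, "yoy": 1.02}
print(sub_feature_similarity(target, reference))    # higher value = more similar
```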
  • The manual labeling process can be divided into an individual labeling process (in this scenario, the analysis device usually sends one time series to be labeled to the management device at a time) and a cluster labeling process (in this scenario, the analysis device usually sends a set of time series to be labeled to the management device at a time).
  • The embodiments of the present application explain the manual labeling process through the following two optional methods:
  • the manual labeling process includes the following individual labeling process:
  • sending the first time series to the management device for the management device to present the first time series; and receiving the label of the first time series sent by the management device.
  • In this way, the label of the first time series can be marked by professionals, so that the label can still be determined when label migration cannot be performed for the first time series.
  • the manual labeling process includes the following cluster labeling process:
  • the similarity between any feature vector in a first feature vector set and each reference feature vector in the reference feature vector set is not greater than the similarity threshold, and the label of the time series corresponding to that feature vector has not been determined; sending the time series corresponding to the first feature vector set to the management device for the management device to present them; and receiving the labels of the time series corresponding to the first feature vector set sent by the management device.
  • In this way, the analysis device sends the time series corresponding to the first feature vector set to the management device; after receiving them, the management device presents the time series corresponding to the first feature vector set, and professionals label these time series.
  • Optionally, the analysis device may also perform clustering processing on the feature vectors in the first feature vector set to obtain the category relationship of the feature vectors in the first feature vector set; then, when sending the time series corresponding to the first feature vector set to the management device, the category relationship is also sent, so that the management device presents the time series corresponding to the first feature vector set according to the category relationship.
  • For example, the management device can display multiple time series belonging to the same category on the same user page and display time series belonging to different categories on different user pages; as another example, the management device can display time series belonging to different categories at different positions on the same user page; as yet another example, the management device can display each time series together with its category.
  • In this way, the management device presents the time series corresponding to the first feature vector set according to the category relationship, so that professionals can refer to the category relationship when labeling, which assists them in the labeling work. On this basis, professionals can assign the same label to time series belonging to the same category, improving both labeling efficiency and label accuracy.
  • the performing clustering processing on the feature vectors in the first feature vector set includes:
  • the distance threshold used to determine the neighbor vectors of each feature vector is a distance specified among a plurality of distances determined based on the first feature vector set;
  • every two feature vectors whose numbers of neighbor vectors are greater than a number threshold are classified into the same category of feature vectors,
  • where the number threshold is a specified number among the numbers of neighbor vectors of the feature vectors in the first feature vector set.
  • Because both thresholds are determined based on the first feature vector set itself, the final classification relationship obtained based on them is more accurate, better reflects the correlation between the feature vectors, and improves the adaptability of the clustering algorithm (see the sketch after this list).
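  • Under one possible reading of the two thresholds above (a sketch, with the quantile choices and the DBSCAN-like linking rule as assumptions), the clustering could look as follows:

```python
import numpy as np

def cluster_feature_vectors(vectors, dist_quantile=0.4, count_quantile=0.0):
    """Sketch: pick a distance threshold from the pairwise distances of the first
    feature vector set and a number threshold from the neighbor counts, then put
    neighboring vectors whose neighbor counts exceed the number threshold into the
    same category. Quantile choices are illustrative."""
    vectors = np.asarray(vectors, dtype=float)
    n = len(vectors)
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    dist_threshold = np.quantile(dists[np.triu_indices(n, k=1)], dist_quantile)
    neighbors = (dists <= dist_threshold) & ~np.eye(n, dtype=bool)
    counts = neighbors.sum(axis=1)
    count_threshold = np.quantile(counts, count_quantile)

    parent = list(range(n))                 # union-find over linked feature vectors
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if neighbors[i, j] and counts[i] > count_threshold and counts[j] > count_threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]      # category id per feature vector

# Hypothetical 2-D feature vectors: the first three should fall into one category.
vecs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
print(cluster_feature_vectors(vecs))
```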
  • Optionally, when the label of the time series corresponding to a feature vector has been determined, the feature vector can be added to the reference feature vector set as a reference basis for label migration.
  • In some scenarios, the labels corresponding to some feature vectors may be wrong due to human error or machine algorithm error. If these feature vectors are added to the reference feature vector set, label conflicts can easily arise during label migration; for example, there may be multiple reference feature vectors whose similarity with the target feature vector of a certain time series is greater than the similarity threshold but whose labels differ, making it impossible to perform label migration for that time series. Therefore, conflict detection processing needs to be performed on feature vectors to be added to the reference feature vector set, to prevent feature vectors with wrong labels from being added.
  • the conflict detection process may include the following steps:
  • when the similarity between a first feature vector and each reference feature vector in the reference feature vector set is not greater than a stock-in similarity threshold, the first feature vector is added to the reference feature vector set as a reference feature vector.
  • Optionally, the method further includes:
  • when the similarity between the first feature vector and a second feature vector in the reference feature vector set is greater than the stock-in similarity threshold, and the label corresponding to the first feature vector is the same as the label corresponding to the second feature vector, the first feature vector is added to the reference feature vector set as a reference feature vector.
  • Optionally, the method further includes:
  • when the similarity between the first feature vector and a second feature vector in the reference feature vector set is greater than the stock-in similarity threshold, and the label corresponding to the first feature vector is different from the label corresponding to the second feature vector,
  • the time series corresponding to the first feature vector and the time series corresponding to the second feature vector are sent to the management device, so that the management device can present the time series corresponding to the first feature vector and the time series corresponding to the second feature vector;
  • the first feature vector is then added to the reference feature vector set as a reference feature vector.
  • both the target feature vector and the reference feature vector include data of one or more of statistical features, fitting features, or frequency domain features.
  • the label determination method provided in the embodiment of the present application is applied in an anomaly detection scenario and can perform automatic label determination.
  • the aforementioned label determination method is executed by a network analyzer, and the label corresponding to the reference feature vector is an anomaly detection label.
  • For example, the time series data includes network key performance indicators (KPIs), and the network KPIs include network device KPIs, network service KPIs, and so on.
  • The network device KPI may be central processing unit (CPU) utilization, optical power, etc.
  • the network service KPI may be network traffic, packet loss rate, delay, number of user accesses, and so on.
  • the network traffic KPI is periodic time series data.
  • When the label determination method provided in the embodiments of this application is applied to anomaly detection scenarios, labels can be migrated automatically within a certain range, which improves label utilization and reduces labeling costs; compared with traditional label migration methods, the accuracy of the determined labels is higher.
  • In a second aspect, a label determination device is provided, including multiple functional modules that interact to implement the method in the first aspect and its various embodiments.
  • The multiple functional modules can be implemented based on software, hardware, or a combination of software and hardware, and can be combined or divided arbitrarily based on the specific implementation.
  • In a third aspect, a label determination device is provided, including a processor and a memory;
  • the memory is used to store a computer program, and the computer program includes program instructions;
  • the processor is configured to call the computer program to implement the label determination method according to any one of the first aspect.
  • In a fourth aspect, a computer storage medium is provided, on which instructions are stored; when the instructions are executed by a processor, the label determination method according to any one of the first aspect is implemented.
  • In a fifth aspect, a chip is provided, including a programmable logic circuit and/or program instructions; when the chip runs, the label determination method according to any one of the first aspect is implemented.
  • In a sixth aspect, a computer program product is provided, in which instructions are stored; when the instructions are run on a computer, the computer executes the label determination method according to any one of the first aspect.
  • The label determination method provided by the embodiments of the present application performs label migration based on the similarity of the feature vectors of time series, which can realize automatic labeling of sample data and reduce the cost of label determination. Moreover, because the similarity calculation is based on the feature vectors of the time series, the influence of interference information in the time series itself is avoided; for example, the influence of interference such as sampling period, amplitude change, quadrant drift, and noise can be reduced, improving the accuracy of label determination. Even for high-dimensional time series, label migration can still be performed accurately. Applying the label determination method provided by the embodiments of the present application to scenarios requiring a large amount of labeled sample data, such as supervised learning algorithms or semi-supervised learning algorithms, can effectively reduce labeling costs and improve the modeling efficiency of machine learning models.
  • the label determination method provided by the embodiment of the present application adopts the similarity of feature vectors for label migration, and is not limited to the label migration of time series with similar waveforms.
  • Label migration can be performed as long as similarity in certain feature dimensions is ensured. It can be seen that the embodiments of the present application can be applied to label migration between time series with different waveforms. Therefore, the scenarios of label generalization can be expanded, the flexibility and utilization rate of label migration can be improved, and the modeling cost of machine learning models can be reduced. Especially in anomaly detection scenarios, label migration between KPIs with similar characteristics can be realized.
  • In addition, the analysis device determines the category relationship by clustering the first feature vector set, and the management device presents the time series corresponding to the first feature vector set according to the category relationship, so that professionals can refer to the category relationship when labeling, which assists them in the labeling work. On this basis, professionals can assign the same label to time series belonging to the same category, improving both labeling efficiency and label accuracy.
  • FIG. 1 is a schematic diagram of an application scenario involved in a label determination method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of another application scenario involved in a label determination method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for determining a label according to an embodiment of the present application
  • FIG. 4 is a schematic diagram of a process for obtaining the similarity between a target feature vector and a reference feature vector in a reference feature vector set according to an embodiment of the present application;
  • FIG. 5 is a flowchart of a conflict detection method provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of another label determination method provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a clustering process for feature vectors in a first feature vector set provided by an embodiment of the present application
  • FIG. 8 is a block diagram of a label determining device provided by an exemplary embodiment of the present application.
  • FIG. 9 is a block diagram of another label determining device provided by an exemplary embodiment of the present application.
  • FIG. 10 is a block diagram of still another label determining device provided by an exemplary embodiment of the present application.
  • FIG. 11 is a block diagram of still another label determining device provided by an exemplary embodiment of the present application.
  • FIG. 12 is a block diagram of a label determining device provided by another exemplary embodiment of the present application.
  • FIG. 13 is a block diagram of another device for determining a label according to another exemplary embodiment of the present application.
  • FIG. 14 is a block diagram of still another device for determining a label according to another exemplary embodiment of the present application.
  • FIG. 15 is a block diagram of a label determining apparatus provided by another exemplary embodiment of the present application.
  • machine learning algorithms have been widely used in many fields. From the perspective of learning methods, machine learning algorithms can be divided into supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms, and reinforcement learning algorithms.
  • A supervised learning algorithm refers to learning an algorithm or establishing a model based on training data, and using this algorithm or model to make inferences on new instances.
  • Training data, also called sample data, is composed of input data and expected outputs.
  • The expected output of a machine learning model is called a label, which can be a continuous value (called a regression value or regression label) or a predicted classification result (called a classification label).
  • The difference between an unsupervised learning algorithm and a supervised learning algorithm is that the sample data of an unsupervised learning algorithm has no given labels, and the machine learning model obtains results by analyzing the characteristics of the data.
  • In a semi-supervised learning algorithm, part of the sample data is labeled and the rest is unlabeled, and there is far more unlabeled data than labeled data.
  • Reinforcement learning algorithms try to maximize the expected benefit through continuous attempts in the environment, generating, based on the rewards or punishments given by the environment, the choices that obtain the greatest benefit.
  • supervised learning algorithms are a relatively basic type of machine learning algorithms, which can achieve good results with sufficient data, such as image recognition and language translation.
  • However, the label acquisition cost of supervised learning algorithms is high, and a lot of manpower is required for sample labeling; many application scenarios do not have a large amount of labeled data (that is, labeled sample data).
  • a method for label determination is proposed.
  • The method adopts label migration (also called label generalization) to determine the label, that is, the label of a time series whose label has been determined is migrated to another similar time series.
  • A time series is a collection of data arranged in time order, which is usually the order in which the data were generated. A time series is one data form of sample data, and each item of data in a time series is also called a data point.
  • For example, if a time series has n data points, namely x1 to xn, the length of the time series is n.
  • In a related label determination method, the label determination process includes: acquiring the waveform similarity between the first time series and a plurality of reference time series; and when the waveform similarity between the first time series and one of the plurality of reference time series is greater than a waveform similarity threshold, determining the label corresponding to that reference time series as the label of the first time series.
  • However, this method of label migration by comparing the waveform similarity of time series is susceptible to various interference information in the time series itself (such as sampling period, amplitude change, quadrant drift, noise, etc.), so the accuracy of the determined label is low.
  • a label determination method based on Dynamic Time Warping is also proposed.
  • In this method, the time axis is adjusted to establish a correspondence between the two time series, and the similarity of the two waveforms is then calculated, which reduces the impact of sampling period, amplitude change, and quadrant drift to a certain extent.
  • However, the time-axis adjustment algorithm in this label determination method is complicated, the influence of noise in the time series still cannot be avoided, and the practicability is low, especially for high-dimensional time series.
  • In view of this, the embodiments of the present application provide a label determination method, which performs label migration based on the similarity of the feature vectors of time series. Because the similarity calculation is based on the feature vectors of the time series, the influence of interference information in the time series itself is avoided, and the accuracy of label determination is improved. Even for high-dimensional time series, label migration can still be performed accurately.
  • FIG. 1 is a schematic diagram of an application scenario involved in the label determination method provided by an embodiment of the present application.
  • the application scenario includes an analysis device 101, a management device 102, and network devices 103a to 103c (collectively referred to as network devices 103).
  • the numbers of analysis devices, management devices, and network devices in FIG. 1 are only for illustration, and not as a restriction on the application scenarios involved in the label determination method provided in the embodiment of the present application.
  • The network involved in this application scenario can be a second-generation (2G) communication network, a third-generation (3G) communication network, a Long Term Evolution (LTE) communication network, a fifth-generation (5G) communication network, etc.
  • the analysis device 101, the management device 102, and the network device 103 may be deployed on the same device, or may be deployed on different devices.
  • the analysis device 101 may be a server, or a server cluster composed of several servers, or a cloud computing service center.
  • The management device 102 may be a computer, a server, a server cluster composed of several servers, or a cloud computing service center; the management device 102 may be an operations support system (OSS) or other network equipment connected to the analysis device.
  • the network device 103 may be a router, a switch, a base station, etc., which may be a network device of a core network or a network device of an edge network.
  • the analysis device 101 is connected to the network device 103 and the management device 102 through a wired network or a wireless network, respectively.
  • the network device 103 is used to upload collected data to the analysis device 101, such as various time series data.
  • The analysis device 101 is used to acquire and use data from the network device 103, for example, to determine the labels of the acquired time series, and the management device 102 is used to manage the analysis device 101.
  • the data uploaded by the network device 103 to the analysis device 101 may also include various types of log data and device status data.
  • the analysis device 101 is also used to train one or more machine learning models. Different machine learning models use data uploaded by the network device 103 to implement functions such as anomaly detection, prediction, network security protection, and application recognition.
  • the analysis device can also implement feature selection and automatic update of each machine learning model, and feed back the selected features and model update results to the management device 102, and the management device 102 decides whether to retrain the model.
  • the analysis device 101 can determine different labels using the label determination method provided in this application.
  • Optionally, the foregoing application scenario may not include the network device 103, and the analysis device 101 may instead receive the time series data input by the management device 102.
  • The embodiments of the present application only schematically illustrate the source of the time series data and do not limit it.
  • the label determination method provided in the embodiment of the present application can be used in an anomaly detection scenario.
  • Anomaly detection refers to detecting patterns, data, or events that do not meet expectations.
  • Traditionally, professionals (also called experts) learn from historical data and then find anomalies, that is, the anomalous data is labeled as "abnormal".
  • Data sources for anomaly detection include applications, processes, operating systems, devices, or networks. With the increasing complexity of computing systems, humans are no longer able to cope with the current difficulty of anomaly detection.
  • the label determination method provided in the embodiments of the present application is applied in an abnormality detection scenario, and can perform automatic label determination.
  • FIG. 2 is a schematic diagram of an application scenario of anomaly detection involved in the label determination method provided in an embodiment of the present application.
  • the analysis device 101 can be a network analyzer
  • the management device 102 can be a controller
  • the machine learning model maintained by the analysis device 101 is an anomaly detection model
  • the determined label is an anomaly detection label.
  • For example, the anomaly detection label includes two types of classification labels: "normal" and "abnormal".
  • the application scenario may also include a storage device 104, which is used to store data provided by the network device 103.
  • The storage device 104 may be a distributed storage device, and the analysis device 101 can read and write the data that the network device 103 provides and that is stored in the storage device 104. In this way, when the network device 103 has a large amount of data, the storage device 104 performs the data storage, which can reduce the load on the analysis device 101 and improve its data analysis efficiency. It should be noted that when the amount of data provided by the network device 103 is small, the storage device 104 may not be provided; in this case, the application scenario of anomaly detection can refer to the application scenario shown in FIG. 1.
  • Time series anomaly detection usually means finding data points that deviate far from a relatively established pattern or distribution.
  • the anomalies of the time series include: sudden rise, sudden fall, and mean change.
  • Time series anomaly detection algorithms include algorithms based on statistics and data distribution (such as the N-Sigma algorithm), distance/density-based algorithms (such as the local outlier factor algorithm), the isolation forest algorithm, prediction-based algorithms (such as the Autoregressive Integrated Moving Average (ARIMA) algorithm), and so on.
  • Correspondingly, the machine learning model may be a model based on statistics and data distribution (such as an N-Sigma model), a distance/density-based model (such as a local outlier factor model), an isolation forest model, or a prediction-based model (such as an ARIMA model). One concrete example is sketched below.
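  • As one concrete illustration of the statistics-and-distribution family mentioned above, a minimal N-Sigma style check might look as follows (a sketch only; the threshold n = 3 and the sample KPI values are assumptions, not the patent's model):

```python
import numpy as np

def n_sigma_anomalies(series, n=3.0):
    """Sketch of an N-Sigma style detector: flag points that deviate from the series
    mean by more than n standard deviations. All parameters are illustrative."""
    x = np.asarray(series, dtype=float)
    mean, std = x.mean(), x.std()
    if std == 0:
        return np.zeros(len(x), dtype=bool)
    return np.abs(x - mean) > n * std       # True marks a candidate "abnormal" point

# Hypothetical network KPI samples with one obvious spike.
kpi = [10, 11, 9, 10, 12, 10, 11, 9, 10, 12, 10, 11, 9, 10, 95, 11, 10, 9, 12, 10]
print(n_sigma_anomalies(kpi))
```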
  • The data uploaded by the network device 103 includes various time series data, which is characterized by a large data scale and complex patterns and rules. A large number of machine learning models that use these data are applied to anomaly detection, prediction, classification, network security protection, application identification, user experience evaluation (for example, evaluating user experience based on these data), and so on. Professionals need to label these data; the workload is very large and the labeling cost is extremely high.
  • In view of this, the embodiments of the present application provide a label determination method that can perform label migration, thereby reducing labeling costs; and because the label migration is performed based on the similarity of the feature vectors of the time series, the influence of interference information in the time series itself is avoided and the accuracy of label determination is improved.
  • For example, the time series data includes network key performance indicators (KPIs), and the network KPIs include network device KPIs, network service KPIs, and so on.
  • The network device KPI may be central processing unit (CPU) utilization, optical power, etc.
  • the network service KPI may be network traffic, packet loss rate, delay, number of user accesses, and so on.
  • the network traffic KPI is periodic time series data.
  • the machine learning model is used to perform anomaly detection on the network traffic KPI.
  • When the label determination method provided in the embodiments of this application is applied to anomaly detection scenarios, labels can be migrated automatically within a certain range, which improves label utilization and reduces labeling costs; compared with traditional label migration methods, the accuracy of the determined labels is higher.
  • The embodiments of the present application provide a label determination method, which can be executed by the aforementioned analysis device. It is assumed that the first time series is a series whose label needs to be determined. As shown in FIG. 3, the method includes:
  • Step 301 Obtain the target feature vector of the first time series.
  • A time series is a collection of data arranged in time order.
  • The order is usually the order in which the data were generated.
  • Each item of data in the time series is also called a data point.
  • Generally, the time interval between data points in a time series is a constant value, so the time series can be analyzed and processed as discrete-time data.
  • the first time series may be a time series of a network KPI.
  • In an optional example, the analysis device may receive the time series sent by the network device or the management device; in another optional example, the analysis device has an input/output (I/O) interface through which it receives the time series; in yet another optional example, the analysis device can read the time series from the storage device.
  • The target feature vector is a vector that characterizes the features of the first time series and includes data of one or more features; that is, the target feature vector corresponds to one-dimensional or multi-dimensional features, and the dimension of the features corresponding to the target feature vector equals the number of data items in the target feature vector (that is, the features correspond to the data one-to-one).
  • the feature refers to the feature of the first time series, which may include data features and/or extracted features.
  • the data characteristics are the characteristics of the data in the time series.
  • the data feature includes data arrangement period, data change trend or data fluctuation, etc.
  • the data of the data feature includes: data arrangement period data, data change trend data or data fluctuation data, etc.
  • the data arrangement period refers to the period involved in the data arrangement in the time series if the data in the time series is arranged periodically.
  • the data arrangement period includes the period duration (that is, the time interval between the initiation of two periods) and/or the period.
  • data change trend data is used to reflect the change trend of the data arrangement in the time series (that is, the data change trend), for example, the data includes: continuous growth, continuous decline, first rise and then fall, first fall and then rise, or meet the positive
  • Data fluctuation data is used to reflect the fluctuation state of the data in the time series (that is, the data fluctuation); for example, the data includes a function that characterizes the fluctuation curve of the time series, or a specified value of the time series, such as the maximum, minimum, or average value.
  • An extracted feature is a feature obtained in the process of extracting features from the data in the time series.
  • For example, the extracted features include statistical features, fitting features, or frequency domain features; correspondingly, the extracted feature data includes statistical feature data, fitting feature data, or frequency domain feature data.
  • Statistical features refer to statistical characteristics of the time series.
  • Statistical features are divided into quantitative features and attribute features.
  • Quantitative features are further divided into measurement features and counting features, and can be directly represented by numerical values; for example, the consumption values of resources such as CPU, memory, and IO resources are measurement features, while the number of anomalies and the number of devices working normally are counting features. Attribute features cannot be directly represented by numerical values, such as whether a device is abnormal or whether a device is down.
  • Statistical features are the indicators that need to be examined in statistics.
  • the statistical feature data includes moving average (Moving_average), weighted average (Weighted_mv), etc.;
  • A fitting feature is a feature of fitting the time series, and the fitting feature data is used to reflect how the time series is fitted;
  • for example, the fitting feature data includes the algorithm used for fitting, such as ARIMA. Frequency domain features are features of the time series in the frequency domain and are used to reflect the characteristics of the time series in the frequency domain.
  • For example, the frequency domain feature data includes data on the law followed by the distribution of the time series in the frequency domain, such as the proportion of high-frequency components in the time series.
  • the frequency domain feature data can be obtained by performing wavelet decomposition on the time series.
  • the process of obtaining the target feature vector of the first time series may include: first determining the target feature to be extracted, and then extracting the determined target feature data in the first time series to obtain the target feature vector.
  • the target feature that needs to be extracted is determined based on the application scenario involved in the label determination method.
  • the target feature is a pre-configured feature, for example, a feature configured by a user.
  • the target feature is one or more of the specified features, for example, the specified feature is the aforementioned statistical feature.
  • The user can pre-set the designated features, but the first time series may not have all of the designated features; the analysis device can then filter the features of the first time series that belong to the designated features and use them as target features.
  • For example, the target feature includes one or more of the following statistical features: time series decomposition seasonal component (Tsd_seasonal), moving average, weighted average, time series classification, maximum, minimum, quantile, variance, standard deviation, period year-on-year (yoy, referring to comparison with the same historical period), daily fluctuation rate, bucket entropy, sample entropy, moving average, exponential moving average, Gaussian distribution feature, or T distribution feature, etc.
  • Correspondingly, the target feature data includes the data of the one or more statistical features;
  • for example, the target feature includes one or more of the following fitting features: autoregressive fitting error, Gaussian process regression fitting error, or neural network fitting error; correspondingly, the target feature data includes the data of the one or more fitting features;
  • for example, the target feature includes the following frequency domain feature: the proportion of high-frequency components in the time series; correspondingly, the target feature data includes data on the proportion of high-frequency components in the time series, and this data can be obtained by performing wavelet decomposition on the time series (a feature extraction sketch follows below).
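  • A minimal feature extraction sketch in the spirit of step 301 is shown below; the specific feature choices, the window size, and the FFT-based high-frequency ratio (used here in place of the wavelet decomposition mentioned above) are assumptions for illustration:

```python
import numpy as np

def extract_target_features(series, window=5):
    """Sketch: build a target feature vector (feature name -> value) from a time
    series, using a few of the statistical and frequency-domain features listed
    above. Feature choices and parameters are illustrative."""
    x = np.asarray(series, dtype=float)
    moving_avg = float(np.convolve(x, np.ones(window) / window, mode="valid").mean())
    weights = np.arange(1, len(x) + 1, dtype=float)
    weighted_mv = float(np.average(x, weights=weights))          # weighted average
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    high_freq_ratio = float(spectrum[len(spectrum) // 2:].sum() / (spectrum.sum() + 1e-12))
    return {
        "moving_average": moving_avg,
        "weighted_mv": weighted_mv,
        "maximum": float(x.max()),
        "minimum": float(x.min()),
        "variance": float(x.var()),
        "high_freq_ratio": high_freq_ratio,                      # frequency-domain feature
    }

kpi_series = [10, 12, 11, 13, 12, 60, 12, 11, 13, 12]            # hypothetical KPI time series
print(extract_target_features(kpi_series))
```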
  • Step 302 Obtain the similarity between the target feature vector and the reference feature vector in the reference feature vector set. Perform step 303 or 304.
  • A reference feature vector set is pre-established in the analysis device. The reference feature vector set includes one or more reference feature vectors, and each reference feature vector is a feature vector of a second time series whose label has been determined.
  • the label may be manually annotated, may also be determined by the label determination method provided in the embodiment of the present application, or may be determined by other algorithms, which is not limited in the embodiment of the present application.
  • The label and the second time series corresponding to each reference feature vector in the reference feature vector set may be stored in the reference feature vector set, or may be stored in other storage spaces, as long as the corresponding label and second time series can be queried through the reference feature vector.
  • the reference feature vector is a vector that characterizes the features of the second time series, and includes data of one or more features. That is, the reference feature vector corresponds to one-dimensional or multi-dimensional features.
  • the features involved in the reference feature vector may include data features and/or extracted features.
  • the process of obtaining the reference feature vector of each second time series can refer to the aforementioned process of obtaining the target feature vector of the first time series. This is not repeated in the embodiment of the application.
  • Table 1 is a schematic description of the data stored in the reference feature vector set.
  • In Table 1, the reference feature vector whose sample data identifier (ID) is KPI_1 includes the data of 4 features,
  • namely: moving average (Moving_average), weighted average (Weighted_mv), time series decomposition seasonal component (time series decompose_seasonal, Tsd_seasonal), and period year-on-year (yoy).
  • The time series corresponding to this reference feature vector is (x1, x2, ..., xn), and the corresponding label is "abnormal".
  • Table 1 assumes that the reference feature vector set stores data in a fixed format; the stored features of the reference feature vectors can also be preset features, and the data of the reference feature vector set can all be stored in the format of Table 1.
  • Of course, the reference feature vector set may also take other forms, which is not limited in the embodiments of the present application; one possible in-memory representation is sketched below.
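  • For illustration, the record structure described around Table 1 could be represented in memory as follows (a sketch only; the field names, feature names, and numeric values are hypothetical, not the patent's storage format):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReferenceEntry:
    """Sketch of one record in the reference feature vector set, mirroring the
    fields described for Table 1."""
    sample_id: str                 # e.g. "KPI_1"
    features: Dict[str, float]     # reference feature vector, keyed by feature name
    time_series: List[float]       # the second time series (x1, x2, ..., xn)
    label: str                     # e.g. "abnormal" or "normal"

reference_set: List[ReferenceEntry] = [
    ReferenceEntry(
        sample_id="KPI_1",
        features={"Moving_average": 0.42, "Weighted_mv": 0.40,
                  "Tsd_seasonal": 0.05, "yoy": 1.02},
        time_series=[10.0, 12.0, 11.0, 13.0],
        label="abnormal",
    ),
]
print(reference_set[0].label)
```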
  • The reference feature vectors included in the reference feature vector set and the target feature vector may be obtained using the same extraction algorithm or different extraction algorithms.
  • Correspondingly, the categories and numbers of features involved in each reference feature vector and in the target feature vector may differ, and different situations therefore need to be handled accordingly.
  • the first reference feature vector is a reference feature vector in the reference feature vector set
  • The first feature is a feature that is the same in the features corresponding to the target feature vector and the features corresponding to the first reference feature vector; that is, the first feature is a feature shared by the target feature vector and the first reference feature vector.
  • The first sub-feature vector is a vector composed of the data corresponding to the first feature in the target feature vector,
  • and the second sub-feature vector is a vector composed of the data corresponding to the first feature in the first reference feature vector.
  • the similarity between the target feature vector and the first reference feature vector is the similarity between the first sub feature vector and the second sub feature vector.
  • the process of obtaining the similarity between the target feature vector and the reference feature vector in the reference feature vector set may include the following steps:
  • Step 3021 among the features corresponding to the target feature vector and the features corresponding to the first reference feature vector, select the same first feature.
  • the first feature includes one or more features.
  • the process of obtaining the first feature in the foregoing step 3021 can be implemented by obtaining the intersection of the feature corresponding to the target feature vector and the feature corresponding to the first reference feature vector.
  • For example, the first reference feature vector Q2 includes data corresponding to three features: y1, y4, and y5.
  • the first feature may also be obtained in other ways, for example, comparing the feature corresponding to the target feature vector with the feature corresponding to the first reference feature vector in turn, which is not limited in the embodiment of the present application.
  • Step 3022 Obtain data corresponding to the first feature in the target feature vector, and obtain a first sub-feature vector composed of the acquired data.
  • Step 3023 Obtain data corresponding to the first feature in the first reference feature vector, and obtain a second sub-feature vector composed of the acquired data.
  • Step 3024 Determine the similarity between the first sub-feature vector and the second sub-feature vector.
  • the first sub-feature vector and the second sub-feature vector are both represented in the form of a sequence.
  • The data at the same position in the first sub-feature vector and the second sub-feature vector correspond to features of the same category, and the similarity between the first sub-feature vector and the second sub-feature vector can be measured by the distance between the two; the similarity is negatively related to the distance, that is, the greater the similarity of the two sub-feature vectors, the smaller the distance, and the smaller the similarity, the greater the distance.
  • the distance between the first sub-feature vector and the second sub-feature vector can be acquired first; then, based on the acquired distance, the similarity between the first sub-feature vector and the second sub-feature vector is determined.
  • The distance between the first sub-feature vector and the second sub-feature vector is used to characterize the distance between the target feature vector and the first reference feature vector, and this distance can be obtained in multiple ways, for example, calculated using the Euclidean distance formula, the Chebyshev distance formula, the cosine distance formula, the Mahalanobis distance formula, or other distance formulas.
  • For example, the Mahalanobis distance formula is: $D(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^{T} \Sigma^{-1} (\vec{x} - \vec{y})}$, where $\vec{x}$ and $\vec{y}$ denote the first sub-feature vector and the second sub-feature vector, $\Sigma^{-1}$ is the inverse of the covariance matrix, and $\Sigma = E[(X - E[X])(X - E[X])^{T}]$.
  • The covariance matrix is a predetermined matrix, which can be calculated from the first sub-feature vector and the data of features with the same dimensions as the first sub-feature vector in the reference feature vector set.
  • Because the similarity between the first sub-feature vector and the second sub-feature vector is negatively related to the distance between the two, the similarity between the first sub-feature vector and the second sub-feature vector can then be determined based on the obtained distance D and a similarity calculation formula.
  • Optionally, when the reference feature vector and the target feature vector have the same types and numbers of features,
  • the first sub-feature vector is the same as the target feature vector,
  • and the second sub-feature vector is the same as the first reference feature vector.
  • In this case, the similarity determination process can be: directly determining the similarity between the target feature vector and the first reference feature vector, that is, first obtaining the distance between the target feature vector and the first reference feature vector, and then determining the similarity between the target feature vector and the first reference feature vector based on the obtained distance.
  • The aforementioned reference feature vector and target feature vector involve data of multiple features; the greater the number of data items corresponding to the same features, the better the finally calculated similarity can reflect the correlation between the reference feature vector and the target feature vector from multiple angles, and the higher the accuracy of the label determined on this basis. A distance-based similarity sketch follows below.
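  • A minimal sketch of the Mahalanobis-based variant described above (the covariance estimated from reference vectors with the same feature dimensions, and the 1 / (1 + D) mapping from distance D to similarity, are assumptions for illustration):

```python
import numpy as np

def mahalanobis_similarity(first_sub, second_sub, reference_matrix):
    """Sketch: Mahalanobis distance D between the first and second sub-feature
    vectors, with the covariance matrix estimated from reference vectors of the
    same feature dimensions; the similarity mapping is an assumption."""
    x = np.asarray(first_sub, dtype=float)
    y = np.asarray(second_sub, dtype=float)
    cov = np.cov(np.asarray(reference_matrix, dtype=float), rowvar=False)
    cov_inv = np.linalg.pinv(cov)              # pseudo-inverse for numerical robustness
    diff = x - y
    d = float(np.sqrt(diff @ cov_inv @ diff))  # D = sqrt((x - y)^T * Sigma^-1 * (x - y))
    return 1.0 / (1.0 + d)

# Hypothetical sub-feature vectors and reference data sharing the same 3 features.
refs = [[0.40, 0.38, 1.00], [0.45, 0.41, 1.05], [0.50, 0.39, 0.98], [0.43, 0.42, 1.01]]
print(mahalanobis_similarity([0.42, 0.40, 1.02], [0.44, 0.39, 1.03], refs))
```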
  • Step 303 When the similarity between the target feature vector and the first reference feature vector is greater than the similarity threshold, the label corresponding to the first reference feature vector is determined as the label of the first time series.
  • The similarity threshold may be preset by the user or determined by the analysis device based on the current application scenario. When the similarity between the target feature vector and the first reference feature vector is greater than the similarity threshold, it means that the first time series and the second time series corresponding to the first reference feature vector have high feature similarity and that the first time series satisfies the label migration condition; the label corresponding to the first reference feature vector can then be determined as the label of the first time series.
  • the label corresponding to the first reference feature vector is "abnormal”
  • the label of the first time series is also "abnormal”.
  • Step 304 When the similarity between the target feature vector and each reference feature vector in the reference feature vector set is not greater than the similarity threshold, send the first time sequence to the management device for the management device to present the first time sequence.
  • At this time, the analysis device may send the first time series to the management device, which may be the management device 102 in the aforementioned application environment. After receiving the first time series, the management device presents it, and professionals mark the label of the first time series.
  • Step 305 Receive the label of the first time sequence sent by the management device.
  • the management device receives the label and sends the label to the analysis device.
  • The analysis device receives the label and saves it in correspondence with the first time series.
  • Optionally, the analysis device may also choose not to label the first time series, that is, skip steps 304 and 305, delete the first time series, obtain a new time series,
  • and execute the above steps 301 to 303 again, so as to determine labels only for time series that satisfy the label migration condition. In this way, manual participation is not required, and labels can be determined for all time series that satisfy the label migration condition. An end-to-end sketch of this migration decision follows below.
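  • Putting steps 301 to 305 together, the migration decision might be sketched as follows. It reuses the extract_target_features, sub_feature_similarity, and reference_set sketches from earlier in this description; the threshold value and the None return for the manual-labeling path are assumptions:

```python
def migrate_label(first_series, reference_set, similarity_threshold=0.8):
    """Sketch of steps 301-304: extract the target feature vector, compare it with
    every reference feature vector, and migrate the best-matching label when the
    similarity exceeds the threshold; otherwise return None so the time series can
    be sent to the management device for manual labeling (steps 304 and 305)."""
    target = extract_target_features(first_series)           # step 301
    best_label, best_sim = None, 0.0
    for entry in reference_set:                               # step 302
        sim = sub_feature_similarity(target, entry.features)
        if sim > best_sim:
            best_label, best_sim = entry.label, sim
    if best_sim > similarity_threshold:                       # step 303: label migration
        return best_label
    return None                                               # step 304: manual labeling path

label = migrate_label([10, 12, 11, 13, 12, 60, 12, 11, 13, 12], reference_set)
print(label or "send to management device for manual labeling")
```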
  • Step 306 Perform conflict detection processing on the feature vectors that need to be added to the reference feature vector set.
  • When the label of the time series corresponding to a feature vector has been determined, the feature vector can be added to the reference feature vector set as a reference basis for label migration.
  • In some scenarios, the labels corresponding to some feature vectors may be wrong due to human error or machine algorithm error. If these feature vectors are added to the reference feature vector set, label conflicts can easily arise during label migration; for example, there may be multiple reference feature vectors whose similarity with the target feature vector of a certain time series is greater than the similarity threshold but whose labels differ, making it impossible to perform label migration for that time series. Therefore, conflict detection processing needs to be performed on the feature vectors to be added to the reference feature vector set, to prevent feature vectors with wrong labels from being added to the reference feature vector set. For example, as shown in FIG. 5, the conflict detection process may include the following steps:
  • Step 3061 Obtain the first feature vector of the third time series of the determined label.
  • the label of the third time series may be manually annotated, may also be determined by the label determination method provided in the embodiment of the present application, or may be determined by other algorithms.
  • the label of the third time sequence may be the label determined in the foregoing step 303 or step 305, and correspondingly, the third time sequence is the passage of the foregoing first time sequence.
  • For the process of obtaining the first feature vector of the third time series, reference may be made to the process of obtaining the target feature vector of the first time series in the aforementioned step 301, which is not repeated in this embodiment of the application. It is worth noting that when the third time series is the aforementioned first time series, the aforementioned target feature vector can be used directly as the first feature vector, avoiding re-extraction of the feature vector and reducing the computational cost.
  • Step 3062 Obtain the similarity between the first feature vector and the reference feature vectors in the reference feature vector set.
  • For step 3062, reference may be made to the aforementioned step 302, which is not repeated in this embodiment of the present application.
  • In the reference feature set, one or more reference feature vectors can correspond to the same label. However, the following error scenario is prone to occur: multiple reference feature vectors that are substantially related correspond to different labels due to manual errors or machine algorithm errors; that is, reference feature vectors that should correspond to the same label end up with different labels. To reduce such errors, after step 3062, step 3063, step 3064, or step 3065 can be performed.
  • Step 3063 When the similarity between the first feature vector and each reference feature vector in the reference feature vector set is not greater than the inbound similarity threshold, the first feature vector is added to the reference feature set as the reference feature vector.
  • When the similarity between the first feature vector and each reference feature vector in the reference feature vector set is not greater than the inbound similarity threshold, it means that the third time series has a low similarity with the second time series corresponding to every reference feature vector in the set; it is a brand-new time series, so the first feature vector can be added to the reference feature vector set as a reference feature vector.
  • It is worth noting that the inbound similarity threshold may be preset by the user or determined by the analysis device based on the current application scenario, and it may be the same as or different from the similarity threshold in step 303.
  • Step 3064 When the similarity between the first feature vector and a second feature vector in the reference feature vector set is greater than the inbound similarity threshold, and the label corresponding to the first feature vector is the same as the label corresponding to the second feature vector, add the first feature vector to the reference feature set as a reference feature vector.
  • When the similarity between the first feature vector and the second feature vector in the reference feature vector set is greater than the inbound similarity threshold, it means that the first feature vector is similar to the second feature vector and the two are related. When the label corresponding to the first feature vector is the same as the label corresponding to the second feature vector, the two related feature vectors correspond to the same label, so the first feature vector meets the condition for joining the reference feature vector set and is added to the reference feature set as a reference feature vector.
  • Step 3065 When the similarity between the first feature vector and the second feature vector in the reference feature vector set is greater than the inbound similarity threshold, and the label corresponding to the first feature vector is different from the label corresponding to the second feature vector, send the time series corresponding to the first feature vector and the time series corresponding to the second feature vector to the management device, for the management device to present both time series. Go to step 3066.
  • When the similarity between the first feature vector and the second feature vector is greater than the inbound similarity threshold but their labels differ, one of the two labels is wrong, and the labels can be re-annotated manually to ensure accuracy. Therefore, the analysis device may send the time series corresponding to the first feature vector and the time series corresponding to the second feature vector to the management device, which may be the management device 102 in the aforementioned application environment.
  • After the management device receives the time series corresponding to the first feature vector and the time series corresponding to the second feature vector, it presents the received time series, and a professional annotates their labels. Because the feature vectors corresponding to the two time series are related, the manually annotated labels of the two time series are the same label.
  • It should be noted that the analysis device can also send the label corresponding to the first feature vector and the label corresponding to the second feature vector to the management device. The management device can present the received labels together with the received time series for reference by professionals, which can improve the accuracy of the final labeling to a certain extent.
  • Step 3066 Receive the same label of the time sequence corresponding to the first feature vector and the time sequence corresponding to the second feature vector sent by the management device. Go to step 3067.
  • After the professional annotates the presented time series, the management device receives the label and sends it to the analysis device, and the analysis device receives the label.
  • Step 3067 Based on the received label, update the pre-stored label of the time series corresponding to the first feature vector and the label of the time series corresponding to the second feature vector. Go to step 3068.
  • Referring to step 3065, because the pre-stored label of the time series corresponding to the first feature vector differs from that of the time series corresponding to the second feature vector, the analysis device can update both labels based on the received label, ensuring that after the update the label of the time series corresponding to the first feature vector and the label of the time series corresponding to the second feature vector are the same. This avoids label conflicts.
  • Step 3068 Add the first feature vector as a reference feature vector to the reference feature set.
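  • The inbound checks of steps 3063 to 3065 can be sketched as a small decision routine. This is an illustrative, assumption-based sketch rather than the patent's implementation: the threshold value and the `report_conflict` callback (standing in for the interaction with the management device) are invented for this example, and the `similarity` helper from the earlier sketch is reused.

```python
INBOUND_THRESHOLD = 0.95  # inbound similarity threshold; may equal or differ from the migration threshold


def try_add_reference(candidate, reference_set, report_conflict):
    """Conflict-aware addition of a labeled feature vector to the reference set.

    candidate: dict with keys "vector", "label", "series".
    report_conflict: callback standing in for steps 3065-3067 (present both time
    series to the management device and obtain one common label).
    Returns True if the candidate was added, False if a conflict must be resolved first.
    Reuses the similarity() helper from the earlier sketch."""
    for ref in reference_set:
        if similarity(candidate["vector"], ref["vector"]) > INBOUND_THRESHOLD:
            if candidate["label"] == ref["label"]:
                reference_set.append(candidate)   # step 3064: related vectors, same label
                return True
            report_conflict(candidate, ref)       # step 3065: related vectors, different labels
            return False
    reference_set.append(candidate)               # step 3063: a brand-new pattern
    return True
```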
  • It is worth noting that steps 3061 to 3068 are only an illustrative implementation of conflict detection; in practice, other methods may also be used to perform conflict detection.
  • For example, when the similarity between the first feature vector and the second feature vector in the reference feature vector set is greater than the inbound similarity threshold and their labels differ, a professional may perform the conflict detection manually. Steps 3065 to 3068 can then be replaced by: presenting, through the analysis device itself or the management device, the first feature vector together with its corresponding time series and label, and presenting the second feature vector together with its corresponding time series and label; receiving a delete instruction, where the delete instruction indicates deletion of either the first feature vector with its corresponding time series and label, or the second feature vector with its corresponding time series and label; and deleting the feature vector indicated by the delete instruction together with its corresponding time series and label. If the analysis device receives a delete instruction, it indicates that the feature vector indicated by the delete instruction cannot serve as an effective reference in the label migration process; deleting that feature vector avoids label conflicts during label migration.
  • The foregoing step 306 takes the conflict detection processing performed when the first feature vector is added to the reference feature vector set as an example. In practice, conflict detection processing can also be performed periodically, or after a detection trigger instruction is received. In that case, the conflict detection process includes steps A1 to A7:
  • Step A1 Obtain any feature vector of the reference feature vector set as the third feature vector.
  • Step A2 Obtain the similarity between the third feature vector and other reference feature vectors in the reference feature vector set.
  • For step A2, reference may be made to the aforementioned step 302, which is not repeated in this embodiment of the present application.
  • Step A3 When the similarity between the third feature vector and each of the other reference feature vectors in the reference feature vector set is not greater than the inbound similarity threshold, take any feature vector in the reference feature vector set other than the third feature vector as the new third feature vector, and repeat steps A1 to A7 until all feature vectors in the reference feature vector set have been traversed, then stop.
  • Step A4 When the similarity between the third feature vector and a fourth feature vector in the reference feature vector set is greater than the inbound similarity threshold, and the label corresponding to the third feature vector is the same as the label corresponding to the fourth feature vector, take any feature vector in the reference feature vector set other than the third feature vector as the new third feature vector, and repeat steps A1 to A7 until all feature vectors in the reference feature vector set have been traversed, then stop.
  • Step A5 When the similarity between the third feature vector and the fourth feature vector in the reference feature vector set is greater than the inbound similarity threshold, and the label corresponding to the third feature vector is different from the label corresponding to the fourth feature vector, send the time series corresponding to the third feature vector and the time series corresponding to the fourth feature vector to the management device, for the management device to present both time series. Go to step A6.
  • For step A5, refer to the aforementioned step 3065, which is not described in detail in this embodiment of the present application.
  • Step A6 Receive the same label of the time sequence corresponding to the third feature vector and the time sequence corresponding to the fourth feature vector sent by the management device. Go to step A7.
  • For step A6, refer to the aforementioned step 3066, which is not described in detail in this embodiment of the present application.
  • Step A7 Based on the received label, update the pre-stored label of the time series corresponding to the third feature vector and the label of the time series corresponding to the fourth feature vector. Then take any feature vector in the reference feature vector set other than the third feature vector as the new third feature vector, and repeat steps A1 to A7 until all feature vectors in the reference feature vector set have been traversed, then stop.
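  • The periodic conflict scan of steps A1 to A7 amounts to a pairwise sweep over the stored reference feature vectors. The following sketch is illustrative only; the `resolve_labels` callback (standing in for the manual re-labeling in steps A5 and A6) and the reuse of the `similarity` helper and `INBOUND_THRESHOLD` from the earlier sketches are assumptions.

```python
def scan_reference_set(reference_set, resolve_labels, threshold=INBOUND_THRESHOLD):
    """Pairwise sweep over the stored reference feature vectors (steps A1-A7):
    surface pairs that are similar but carry different labels, and update both
    with the common label returned by resolve_labels."""
    for i in range(len(reference_set)):
        for j in range(i + 1, len(reference_set)):
            a, b = reference_set[i], reference_set[j]
            if similarity(a["vector"], b["vector"]) > threshold and a["label"] != b["label"]:
                common = resolve_labels(a, b)     # steps A5-A6: manual re-labeling
                a["label"] = b["label"] = common  # step A7: update both stored labels
```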
  • In the foregoing steps 304 and 305, the first time series is sent to the management device; that is, each time the analysis device obtains a time series whose similarity with every reference feature vector in the reference feature vector set is not greater than the similarity threshold, it sends that time series to the management device for manual labeling. This labeling method is the individual labeling method: one label is annotated per interaction with the management device.
  • In practice, the manual labeling process can also be implemented in other ways, such as the cluster labeling method, in which multiple labels are annotated in one interaction with the management device. As shown in Figure 6, when the cluster labeling method is adopted, the foregoing steps 304 and 305 can be replaced with steps 307 to 309:
  • Step 307 Obtain a first feature vector set.
  • The similarity between any feature vector in the first feature vector set and each reference feature vector in the reference feature vector set is not greater than the similarity threshold, and the label of the time series corresponding to any such feature vector is undetermined.
  • the number of feature vectors in the first feature vector set is a specified number.
  • For example, the analysis device obtains a specified number of fifth feature vectors after repeatedly performing the foregoing steps 301 to 303, and determines these fifth feature vectors as the first feature vector set. The similarity between a fifth feature vector and each reference feature vector in the reference feature vector set is not greater than the similarity threshold, and the label of the time series corresponding to the fifth feature vector is undetermined.
  • the fifth feature vector may include the aforementioned target feature vector.
  • the first feature vector set is a periodically acquired set.
  • For example, the analysis device obtains fifth feature vectors every specified time period to form the first feature vector set. Here, a fifth feature vector is a feature vector obtained within the most recent specified time period whose similarity with each reference feature vector in the reference feature vector set is not greater than the similarity threshold, and the label of the time series corresponding to the fifth feature vector is undetermined.
  • the fifth feature vector may include the aforementioned target feature vector.
  • the first feature vector set is a set obtained by the analysis device after receiving the collection instruction. For example, in the process of repeating the foregoing steps 301 to 303 for multiple times, if the analysis device receives a collection instruction instructing to collect the fifth feature vector, it acquires the fifth feature vector based on the collection instruction to obtain the first feature vector set.
  • In this case, a fifth feature vector is a feature vector obtained within a historical duration (the historical duration can be a specified duration, the duration between the last collection instruction and the current collection instruction, or a duration specified in another way) whose similarity with each reference feature vector in the reference feature vector set is not greater than the similarity threshold, and the label of the time series corresponding to the fifth feature vector is undetermined.
  • the fifth feature vector may include the aforementioned target feature vector.
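  • A minimal sketch of the fixed-count variant of step 307 is shown below: feature vectors whose labels could not be migrated are accumulated until a batch (the first feature vector set) is full. The batch size, the dictionary layout, and the reuse of `migrate_label` from the earlier sketch are illustrative assumptions, not part of the patent text.

```python
BATCH_SIZE = 20       # the "specified number" of unlabeled feature vectors; illustrative only
pending_batch = []    # the first feature vector set being assembled


def collect_for_manual_labeling(target_vector, series, reference_set):
    """Accumulate feature vectors whose labels could not be migrated in steps 301-303
    (the fifth feature vectors) and hand back a full batch for a single interaction
    with the management device (steps 307-308)."""
    if migrate_label(target_vector, reference_set) is None:
        pending_batch.append({"vector": target_vector, "series": series})
    if len(pending_batch) >= BATCH_SIZE:
        batch = pending_batch[:]      # the first feature vector set
        pending_batch.clear()
        return batch                  # to be sent to the management device for labeling
    return None
```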
  • Step 308 Send the time sequence corresponding to the first feature vector set to the management device for the management device to present the time sequence corresponding to the first feature vector set.
  • the analysis device sends the time sequence corresponding to the first feature vector set to the management device.
  • After receiving the time series, the management device presents the time series corresponding to the first feature vector set, and a professional annotates the labels of those time series.
  • the management device can display the time series corresponding to multiple feature vectors in the first feature vector set at the same time on the same user interface, and can also display the time series corresponding to multiple feature vectors in the first feature vector set in a scrolling manner.
  • This is not limited in this embodiment of the present application.
  • the analysis device can also send the first feature vector set to the management device.
  • When the management device presents each time series, it can also present the corresponding feature vector for reference by professionals, which assists them in labeling and improves labeling accuracy.
  • Optionally, before the time series corresponding to the first feature vector set is sent to the management device, the feature vectors in the first feature vector set may also be clustered to obtain the category relationship of the feature vectors in the set. Then, in step 308, the category relationship and the time series corresponding to the first feature vector set are sent to the management device, so that the management device presents the time series corresponding to the first feature vector set according to the category relationship.
  • the process of clustering the feature vectors in the first feature vector set includes:
  • Step 3081 Based on the distance between every two feature vectors in the first feature vector set, count the neighbor vectors of each feature vector. The neighbor vectors of any feature vector in the first feature vector set are the other feature vectors in the set whose distance to that feature vector is smaller than a distance threshold, where the distance threshold is a distance specified among a plurality of distances determined based on the first feature vector set.
  • this step 3081 may include the following steps:
  • Step B1 The analysis device obtains the distance between every two feature vectors in the first feature vector set.
  • Here, the second reference feature vector and the third reference feature vector are any two reference feature vectors in the reference feature vector set. The second feature is the feature that is common to the feature corresponding to the second reference feature vector and the feature corresponding to the third reference feature vector, that is, the intersection of the two sets of features. The third sub-feature vector is a vector composed of the data corresponding to the second feature in the second reference feature vector, and the fourth sub-feature vector is a vector composed of the data corresponding to the second feature in the third reference feature vector. The similarity between the second reference feature vector and the third reference feature vector is determined by the distance between the third sub-feature vector and the fourth sub-feature vector. For the method of obtaining the distance between the third sub-feature vector and the fourth sub-feature vector, refer to the aforementioned steps 3021 to 3024; when the second reference feature vector and the third reference feature vector involve features of the same type and number, refer to the second case of the aforementioned step 302 to directly obtain the distance between the second reference feature vector and the third reference feature vector.
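  • Computing the distance between two feature vectors that may involve different features, as described above, can be sketched as follows. The dictionary representation, the illustrative feature values, the invented feature name `high_freq_ratio`, and the choice of Euclidean distance are assumptions made only for this example; the text also allows other distance formulas such as the Chebyshev, cosine, or Mahalanobis distance.

```python
import numpy as np


def sub_vector_distance(vec_a, vec_b):
    """vec_a, vec_b: dicts mapping feature name -> value.

    Keeps only the features present in both vectors, builds the two sub-feature
    vectors in the same feature order, and returns their Euclidean distance.
    A similarity negatively related to this distance, e.g. 1 / (1 + d), can then
    be compared against the similarity threshold."""
    shared = sorted(set(vec_a) & set(vec_b))          # intersection of the two feature sets
    if not shared:
        raise ValueError("no common features; the distance is undefined")
    sub_a = np.array([vec_a[f] for f in shared], dtype=float)
    sub_b = np.array([vec_b[f] for f in shared], dtype=float)
    return float(np.linalg.norm(sub_a - sub_b))


a = {"Moving_average": 0.42, "Weighted_mv": 0.40, "Tsd_seasonal": 0.10, "period_yoy": 1.2}
b = {"Moving_average": 0.45, "Weighted_mv": 0.39, "period_yoy": 1.1, "high_freq_ratio": 0.30}
d = sub_vector_distance(a, b)
print(d, 1.0 / (1.0 + d))   # the distance and a similarity derived from it
```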
  • Step B2 The analysis device determines a distance threshold among the multiple distances determined based on the first feature vector set.
  • the analysis device sorts the acquired distances, for example, in ascending order or descending order.
  • the distance threshold can be a distance in a specified quantile or a specified order among the sorted distances.
  • The specified quantile or specified order is an empirical value. For example, if the specified quantile is the top 50% or the top 90%, the distance threshold is the distance at the top-50% or top-90% position of the sorted distances, where "top" refers to the position counted from the front according to the sorting order. For another example, if the specified order is the 5th, the distance threshold is the 5th distance among the sorted distances.
  • For example, suppose the distances between feature vector z1 and z2, z3, and z4 are 10, 9, and 8, respectively; the distances between z2 and z3 and between z2 and z4 are 11 and 6, respectively; the distance between z3 and z4 is 5; and the specified quantile is the top 50%. The analysis device sorts the acquired distances in descending order to obtain the distance sequence 11, 10, 9, 8, 6, 5, so the distance threshold is 9.
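  • A small sketch of step B2 using the numbers from the example above. Reading "top 50%" as a position in the descending sort is one reasonable interpretation made for this example, not a definitive implementation.

```python
def distance_threshold(distances, quantile=0.5):
    """Pick the distance at the specified quantile of the descending-sorted distances (step B2)."""
    ordered = sorted(distances, reverse=True)
    index = max(int(len(ordered) * quantile) - 1, 0)   # "top 50%" read as a position in the sort
    return ordered[index]


# Pairwise distances from the example: z1-z2, z1-z3, z1-z4, z2-z3, z2-z4, z3-z4
pairwise = [10, 9, 8, 11, 6, 5]
print(distance_threshold(pairwise))  # -> 9, matching the example above
```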
  • Step B3 Based on the distance between every two feature vectors in the first feature vector set, count the neighbor vectors of each feature vector. The neighbor vectors of any feature vector in the first feature vector set are the other feature vectors in the set whose distance to that feature vector is less than the distance threshold.
  • Continuing the example: the neighbor vector of feature vector z1 is z4, so the number of neighbor vectors of z1 is 1; the neighbor vector of z2 is z4, so the number of neighbor vectors of z2 is 1; the neighbor vector of z3 is z4, so the number of neighbor vectors of z3 is 1; and the neighbor vectors of z4 are z1, z2, and z3, so the number of neighbor vectors of z4 is 3.
  • Step 3082 Based on the statistical result, divide every two feature vectors whose number of shared neighbor vectors is greater than a number threshold into the same category of feature vectors, where the number threshold is a number specified among the numbers of neighbor vectors of the feature vectors in the first feature vector set.
  • this step 3082 may include the following steps:
  • Step C1 The analysis device obtains the number of neighbor vectors of each feature vector in the first feature vector set.
  • Continuing the example, the numbers of neighbor vectors of feature vectors z1, z2, z3, and z4 are 1, 1, 1, and 3, respectively.
  • Step C2 The analysis device determines the number threshold among the number of neighbor vectors of each feature vector in the first feature vector set.
  • the analysis device sorts the acquired quantities, for example, ascending order or descending order.
  • the quantity threshold may be a quantity in a specified quantile or a specified order among the sorted quantities.
  • the designated quantile or designated order is an empirical value, for example, the designated quantile is the top 50% or the top 60%.
  • For example, the analysis device sorts the acquired numbers in descending order to obtain the number sequence 3, 1, 1, 1; the number threshold is then 1.
  • Step C3 Based on the statistical result, classify every two feature vectors whose number of shared neighbor vectors is greater than the number threshold into the same category of feature vectors.
  • Continuing the example, the number threshold is 1, and the number of shared neighbor vectors between any two of the feature vectors z1, z2, z3, and z4 is not greater than 1; therefore, z1, z2, z3, and z4 are each placed in their own category. In another example, if the shared neighbor vectors of z1 and z4 are z2 and z3, the shared neighbor vectors of z2 and z3 are z1 and z4, and the shared neighbor vectors of z1 and z2 are empty, then z1 and z4 are divided into the same category of feature vectors, and z2 and z3 are divided into the same category of feature vectors.
  • In this clustering process, the distance threshold is a distance specified among the distances determined based on the first feature vector set; it reflects the distribution of those distances and changes as the first feature vector set changes. Likewise, the number threshold is a number specified among the numbers of neighbor vectors of the feature vectors in the first feature vector set; it reflects the distribution of those neighbor counts and also changes as the first feature vector set changes. Therefore, the distance threshold and the number threshold are relative, data-dependent values; the categories finally obtained based on at least one of these two thresholds are more accurate, better reflect the correlation between feature vectors, and improve the adaptability of the clustering algorithm.
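  • The neighbor counting and shared-neighbor grouping of steps B3 and C1 to C3 can be sketched as follows, reusing the example distances above. This is an un-optimized, illustrative sketch; the quantile-based threshold choices and the union-find grouping are assumptions about one way to realize the described behaviour, not the patent's implementation.

```python
from itertools import combinations


def cluster_by_shared_neighbors(distances, dist_quantile=0.5, count_quantile=0.5):
    """distances: dict mapping frozenset({i, j}) -> distance between feature vectors i and j."""
    points = sorted({p for pair in distances for p in pair})

    # Step B2: distance threshold = distance at the specified quantile of the descending sort
    ordered = sorted(distances.values(), reverse=True)
    dist_threshold = ordered[max(int(len(ordered) * dist_quantile) - 1, 0)]

    # Step B3: the neighbors of a vector are the others at distance strictly below the threshold
    neighbors = {p: {q for q in points if q != p
                     and distances[frozenset({p, q})] < dist_threshold}
                 for p in points}

    # Steps C1-C2: number threshold = neighbor count at the specified quantile of the descending sort
    counts = sorted((len(v) for v in neighbors.values()), reverse=True)
    count_threshold = counts[max(int(len(counts) * count_quantile) - 1, 0)]

    # Step C3 / 3082: two vectors belong to the same category when their number of
    # shared neighbor vectors exceeds the number threshold (a small union-find keeps track).
    parent = {p: p for p in points}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for a, b in combinations(points, 2):
        if len(neighbors[a] & neighbors[b]) > count_threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())


example = {frozenset({"z1", "z2"}): 10, frozenset({"z1", "z3"}): 9, frozenset({"z1", "z4"}): 8,
           frozenset({"z2", "z3"}): 11, frozenset({"z2", "z4"}): 6, frozenset({"z3", "z4"}): 5}
print(cluster_by_shared_neighbors(example))  # here every vector ends up in its own category
```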
  • the analysis device may also send the category relationship to the management device, and the management device may present the time sequence corresponding to the first feature vector set according to the category relationship.
  • For example, the management device can display multiple time series belonging to the same category on the same user page and display time series belonging to different categories on different user pages; as another example, the management device can display time series belonging to different categories at different positions of the same user page; as yet another example, the management device can display each time series together with its category.
  • the management device can present the time series corresponding to the first feature vector set according to the category relationship, which can be used by professionals to refer to the category relationship when labeling, and play a role of assisting professionals in labeling. Based on this, professionals can label time series belonging to the same category with the same label, improve labeling efficiency, and increase label labeling accuracy.
  • a Shared Nearest Neighbor (SNN) algorithm may be used for clustering processing.
  • When the SNN algorithm is used for clustering, the distance threshold and the number threshold are preset.
  • In addition, other clustering algorithms can also be used for the clustering processing; for example, a clustering algorithm based on a neural network model may be used.
  • the embodiment of the present application does not limit the algorithm used in the clustering process.
  • Table 2 assumes that the feature vector of the sample data with ID KPI_2 has been clustered. The time series corresponding to the feature vector is (z1, z2, ..., zn). The feature vector includes the data of 4 features: Moving_average, Weighted_mv, Tsd_seasonal, and period yoy. The corresponding category identifier is "1".
  • Step 309 Receive the time series label corresponding to the first feature vector set sent by the management device.
  • After the professional annotates the labels, the management device receives them and sends them to the analysis device. The analysis device receives the labels and saves each label in correspondence with its time series.
  • In summary, the label determination method provided by the embodiments of the present application performs label migration based on the similarity of the feature vectors of time series, which can realize automatic labeling of sample data and reduce the cost of label determination. Because the similarity calculation is related to the feature vectors of the time series, the influence of interference information inherent in the time series is avoided; for example, the influence of interference such as the sampling period, amplitude changes, quadrant drift, and noise can be reduced, which improves the accuracy of label determination. Even for high-dimensional time series, label migration can still be performed accurately. Applying the label determination method provided by the embodiments of the present application to scenarios requiring a large amount of labeled sample data, such as supervised or semi-supervised learning algorithms, can effectively reduce labeling costs and improve the modeling efficiency of machine learning models.
  • Because the traditional label determination method performs label migration based on the waveform similarity of time series, label migration cannot be performed when the waveforms of the time series themselves are not similar. In the embodiments of the present application, because the similarity of feature vectors is used for label migration, label migration is not limited to time series with similar waveforms; labels can be migrated as long as the time series are similar in certain feature dimensions. It can be seen that the embodiments of the present application can be applied to label migration between time series with different waveforms. Therefore, the scope of label generalization can be expanded, the flexibility and utilization rate of label migration can be improved, and the modeling cost of machine learning models can be reduced. In particular, in anomaly detection scenarios, label migration between KPIs with similar characteristics becomes possible.
  • Optionally, when the analysis device has an input/output interface (such as a user interface), it can present the first time series through the input/output interface and receive the label of the first time series, without performing the interaction with the management device in steps 304 and 305. Alternatively, the analysis device presents the time series corresponding to the first feature vector and the time series corresponding to the second feature vector through the input/output interface and receives the common label of the two time series, without performing the interaction with the management device in steps 3065 and 3066. Alternatively, the analysis device presents the time series corresponding to the third feature vector and the time series corresponding to the fourth feature vector through the input/output interface and receives the common label of the two time series, without performing the interaction with the management device in steps A5 and A6.
  • An embodiment of the present application provides a label determining device 80. As shown in FIG. 8, the device includes:
  • the first obtaining module 801 is configured to obtain a target feature vector of a first time series, where the time series is a set of data arranged in a time series;
  • the second obtaining module 802 is configured to obtain the similarity between the target feature vector and a reference feature vector in a reference feature vector set, where the reference feature vector is a feature vector of a second time series with a determined label;
  • the determining module 803 is configured to determine the label corresponding to the first reference feature vector as the label of the first time series when the similarity between the target feature vector and the first reference feature vector is greater than the similarity threshold, where the first reference feature vector is a reference feature vector in the reference feature vector set.
  • The label determination device provided by the embodiments of the present application performs label migration based on the similarity of the feature vectors of time series (obtained by the second obtaining module), which can realize automatic labeling of sample data and reduce the cost of label determination. Because the similarity calculation is related to the feature vectors of the time series, the influence of interference information inherent in the time series is avoided; for example, the influence of interference such as the sampling period, amplitude changes, quadrant drift, and noise can be reduced, which improves the accuracy of label determination. Even for high-dimensional time series, label migration can still be performed accurately. Applying the label determination device provided by the embodiments of the present application to scenarios requiring a large amount of labeled sample data, such as supervised or semi-supervised learning algorithms, can effectively reduce labeling costs and improve the modeling efficiency of machine learning models.
  • the first time series is a time series of a network key performance indicator KPI.
  • the reference feature vector includes data of one or more features
  • the target feature vector includes data of one or more features
  • the similarity between the target feature vector and the first reference feature vector is the similarity between the first sub feature vector and the second sub feature vector, and the first sub feature vector and the second sub feature vector are respectively determined by The target feature vector and the first reference feature vector are composed of data corresponding to the same feature.
  • the first sub-feature vector and the second sub-feature vector are both characterized in the form of a sequence, and data at the same position in the first sub-feature vector and the second sub-feature vector correspond to features of the same category ,
  • the similarity between the first sub-feature vector and the second sub-feature vector is negatively related to the distance between the first sub-feature vector and the second sub-feature vector.
  • the apparatus 80 further includes:
  • the first sending module 804 is configured to send the first time sequence to the management device when the similarity between the target feature vector and each reference feature vector in the reference feature vector set is not greater than the similarity threshold. , For the management device to present the first time sequence;
  • the first receiving module 805 is configured to receive the label of the first time sequence sent by the management device.
  • the apparatus 80 further includes:
  • the third obtaining module 806 is configured to obtain a first feature vector set, and the similarity between any feature vector in the first feature vector set and each reference feature vector in the reference feature vector set is not greater than the similarity Degree threshold, and the label of the time series corresponding to any one of the feature vectors is undetermined;
  • the second sending module 807 is configured to send the time sequence corresponding to the first feature vector set to the management device, so that the management device can present the time sequence corresponding to the first feature vector set;
  • the second receiving module 808 is configured to receive the time series label corresponding to the first feature vector set sent by the management device.
  • the apparatus 80 further includes:
  • the clustering module 809 is configured to perform clustering processing on the feature vectors in the first feature vector set before the time series corresponding to the first feature vector set is sent to the management device, to obtain the category relationship of the feature vectors in the first feature vector set;
  • the second sending module 807 is used to:
  • the clustering module 809 is configured to:
  • the distance threshold is a distance specified among a plurality of distances determined based on the first feature vector set
  • the apparatus 80 further includes:
  • the fourth obtaining module 810 is configured to obtain the first feature vector of the third time series with a determined label;
  • the fifth obtaining module 811 is configured to obtain the similarity between the first feature vector and the reference feature vector in the reference feature vector set;
  • the first adding module 812 is configured to add the first feature vector as a reference feature vector to the reference feature set when the similarity between the first feature vector and each reference feature vector in the reference feature vector set is not greater than the inbound similarity threshold.
  • the apparatus 80 further includes:
  • the second adding module 813 is configured to: when the similarity between the first feature vector and the second feature vector in the reference feature vector set is greater than the inbound similarity threshold, and the label corresponding to the first feature vector When the label corresponding to the second feature vector is the same, the first feature vector is added as a reference feature vector to the reference feature set.
  • the apparatus 80 further includes:
  • the third sending module 814 is configured to: when the similarity between the first feature vector and the second feature vector in the reference feature vector set is greater than the inbound similarity threshold, and the label corresponding to the first feature vector is different from the label corresponding to the second feature vector, send the time series corresponding to the first feature vector and the time series corresponding to the second feature vector to the management device, for the management device to present the time series corresponding to the first feature vector and the time series corresponding to the second feature vector;
  • the third receiving module 815 is configured to receive the same label of the time sequence corresponding to the first feature vector and the time sequence corresponding to the second feature vector sent by the management device;
  • the update module 816 is configured to update the pre-stored label of the time series corresponding to the first feature vector and the label of the time series corresponding to the second feature vector based on the received label;
  • the third adding module 817 is configured to add the first feature vector as a reference feature vector to the reference feature set.
  • both the target feature vector and the reference feature vector include data of one or more of statistical features, fitting features, or frequency domain features.
  • the device is applied to a network analyzer, and the label corresponding to the reference feature vector is an anomaly detection label.
  • In summary, the label determination device provided by the embodiments of the present application performs label migration based on the similarity of the feature vectors of time series (obtained by the second obtaining module), which can realize automatic labeling of sample data and reduce the cost of label determination. Because the similarity calculation is related to the feature vectors of the time series, the influence of interference information inherent in the time series is avoided; for example, the influence of interference such as the sampling period, amplitude changes, quadrant drift, and noise can be reduced, which improves the accuracy of label determination. Even for high-dimensional time series, label migration can still be performed accurately. Applying the label determination device provided by the embodiments of the present application to scenarios requiring a large amount of labeled sample data, such as supervised or semi-supervised learning algorithms, can effectively reduce labeling costs and improve the modeling efficiency of machine learning models.
  • Fig. 15 is a block diagram of a label determination device provided by an embodiment of the present application.
  • the label determination device may be an analysis device.
  • the analysis device 150 includes: a processor 1501 and a memory 1502.
  • the memory 1502 is used to store a computer program, and the computer program includes program instructions;
  • the processor 1501 is configured to call the computer program to implement the label determination method provided in the embodiments of the present application.
  • Optionally, the analysis device 150 further includes a communication bus 1503 and a communication interface 1504.
  • the processor 1501 includes one or more processing cores, and the processor 1501 executes various functional applications and data processing by running a computer program.
  • the memory 1502 may be used to store computer programs.
  • the memory may store an operating system and at least one application program unit required by the function.
  • the operating system can be a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS, or OS X.
  • the communication interface 1504 may be used to communicate with other storage devices or network devices.
  • the communication interface 1504 may be used to receive sample data sent by a network device in a communication network.
  • the memory 1502 and the communication interface 1504 are respectively connected to the processor 1501 through a communication bus 1503.
  • An embodiment of the present application provides a computer storage medium. The computer storage medium stores instructions. When the instructions are executed by a processor, the label determination method provided in the embodiments of the present application is implemented.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.
  • The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product, which includes one or more computer instructions.
  • the computer may be a general-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center through wired means (such as coaxial cable, optical fiber, or digital subscriber line) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium, or a semiconductor medium (for example, a solid-state drive).

Abstract

This application discloses a label determination method, apparatus, and system, belonging to the field of AI. The method includes: obtaining a target feature vector of a first time series, where a time series is a set of data arranged in time order; obtaining the similarity between the target feature vector and a reference feature vector in a reference feature vector set, where the reference feature vector is a feature vector of a second time series whose label has been determined; and when the similarity between the target feature vector and a first reference feature vector is greater than a similarity threshold, determining the label corresponding to the first reference feature vector as the label of the first time series, where the first reference feature vector is one reference feature vector in the reference feature vector set. This application improves the accuracy of label determination and is used for determining labels for machine learning models.

Description

标签确定方法、装置和系统 技术领域
本申请涉及人工智能(Artificial Intelligence,AI)领域,特别涉及一种标签确定方法、装置和系统。
背景技术
机器学习,是指让机器基于训练样本训练出机器学习模型,使机器学习模型对训练样本之外的数据具有预测能力(如类别预测能力)。机器学习作为AI领域的一个重要分支,在众多领域得到了广泛的应用。从学习方法的角度,机器学习算法可以分为监督式学习、非监督式学习、半监督式学习、强化学习等几大类算法。其中,监督式学习是机器学习算法中的一类基础算法。
在采用监督学习算法对机器学习模型进行训练的过程中,需要先人工对大量的样本数据进行标注(如样本为人脸图片,标注过程可以为将某一人脸图片标注为:“戴眼镜”),然后利用这些已经进行标注的样本数据来对机器学习模型进行训练,以调整机器学习模型所采用的参数。训练完成后的机器学习模型即可用来执行相应的功能,例如图像识别或语言翻译等。其中,样本数据的标注过程称为确定标签的过程,标注的内容即为标签,标签用于标识数据,如标识数据的类别。
但是采用监督式学习算法或半监督式学习算法等进行模型训练时,需要大量人力进行样本数据的标注,标签的确定成本较高。
发明内容
本申请实施例提供了一种标签确定方法、装置及系统。可以解决目前的标签确定成本较高的问题。所述技术方案如下:
第一方面,提供了一种标签确定方法,所述方法包括:
获取第一时间序列的目标特征向量,时间序列为按照时序排列的一组数据的集合;
获取所述目标特征向量与参考特征向量集合中参考特征向量的相似度,所述参考特征向量为已确定标签的第二时间序列的特征向量;
当所述目标特征向量与第一参考特征向量的相似度大于相似度阈值时,将所述第一参考特征向量所对应的标签确定为所述第一时间序列的标签,所述第一参考特征向量为所述参考特征向量集合中的一个参考特征向量。
本申请实施例提供的标签确定方法,基于时间序列的特征向量的相似度进行标签的迁移,能够实现样本数据的自动标注,降低标签的确定成本。并且由于相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,例如能够降低采样时段、幅度变化、象限漂移和噪声等干扰信息的影响。提高了标签确定的准确性。尤其在高维时间序列中仍然能够准确地进行标签迁移。
并且,将本申请实施例提供的标签确定方法应用于监督式学习算法或半监督式学习算法 等需要大量标注的样本数据的场景中,能够有效降低标注成本,提高机器学习模型的建模效率。
可选地,所述第一时间序列为网络KPI的时间序列。
可选地,所述参考特征向量包括一个或多个特征的数据,所述目标特征向量包括一个或多个特征的数据,所述目标特征向量与所述第一参考特征向量的相似度为第一特征向量和第二子特征向量的相似度,所述第一子特征向量和所述第二子特征向量分别由所述目标特征向量和所述第一参考特征向量中对应相同特征的数据组成。
在本申请实施例中,参考特征向量集合中包括的参考特征向量与目标特征向量可能采用相同的提取算法也可能采用不同的提取算法获取。相应的,每个参考特征向量和目标特征向量所涉及的特征的类别以及特征的个数可能不同。因此,需要针对不同的情况进行相应的处理。
第一种情况,参考特征向量和目标特征向量所涉及的特征的类别和特征的个数不同。则相似度确定过程包括:在目标特征向量对应的特征与第一参考特征向量对应的特征中,筛选相同的第一特征;获取目标特征向量中第一特征对应的数据,得到由获取的数据所组成的第一子特征向量;获取第一参考特征向量中第一特征对应的数据,得到由获取的数据组成的第二子特征向量;确定第一子特征向量和第二子特征向量的相似度。该第一子特征向量和第二子特征向量的相似度即为参考特征向量和目标特征向量的相似度。
在第一种情况中,通过筛选第一子特征向量和第二子特征向量,并计算两者的相似度来作为参考特征向量和目标特征向量的相似度,可以简化相似度计算流程,保证最终计算得到的相似度的准确性。
第二种情况,参考特征向量和目标特征向量所涉及的特征的类别和特征的个数相同。可以直接将参考特征向量和目标特征向量分别获取为第一子特征向量和第二子特征向量;确定第一子特征向量和第二子特征向量的相似度,该第一子特征向量和第二子特征向量的相似度即为参考特征向量和目标特征向量的相似度。
在第二种情况中,通过设置参考特征向量和目标特征向量所涉及的特征的类别和特征的个数相同,可以减少特征筛选过程,进一步简化相似度计算流程。
在前述两种情况中,第一子特征向量和第二子特征向量均以序列形式表征,所述第一子特征向量和所述第二子特征向量中相同位置的数据对应同一类别的特征,所述第一子特征向量和所述第二子特征向量的相似度,与所述第一子特征向量和所述第二子特征向量的距离负相关。
相应的,可以先获取第一子特征向量和第二子特征向量的距离;然后,基于获取的距离,确定第一子特征向量和第二子特征向量的相似度。示例的,该距离可以采用欧式距离公式、切比雪夫距离公式、余弦距离公式、马氏距离公式或者其他距离公式等计算得到。
由于第一子特征向量和第二子特征向量的距离能够有效反应两者的相似度,通过计算距离可以实现相似度快速确定,提高相似度确定的效率。
在本申请实施例中,当存在与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值的特征向量时,还需要通过人工标注的方式确定标签,以保证需要确定标签的特征向量最终能够标注相应的标签。本申请实施例中,基于分析设备向管理设备发送的时间序列的形式不同,可以将人工标注的过程划分为个体标注的过程(在这种场景下,分 析设备向管理设备通常一次发一个待标注的时间序列)和集群标注的过程(在这种场景下,分析设备向管理设备通常一次发一个集合的待标注的时间序列),本申请实施例以以下两种可选方式对人工标注的过程进行说明:
在第一种可选方式中,人工标注的过程包括以下个体标注过程:
当所述目标特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值时,向管理设备发送所述第一时间序列,以供所述管理设备呈现所述第一时间序列;接收所述管理设备发送的所述第一时间序列的标签。
通过专业人员对第一时间序列的标签进行标注,可以在保证第一时间序列无法进行标签迁移时,仍能确定其标签。
在第二种可选方式中,人工标注的过程包括以下集群标注过程:
获取第一特征向量集合,所述第一特征向量集合中的任一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值,且所述任一特征向量对应的时间序列的标签未确定;向管理设备发送所述第一特征向量集合对应的时间序列,以供所述管理设备呈现所述第一特征向量集合对应的时间序列;接收所述管理设备发送的所述第一特征向量集合对应的时间序列的标签。
通过专业人员对第一特征向量集合对应的时间序列的标签进行标注,可以在保证第一特征向量集合对应的时间序列无法进行标签迁移时,仍能确定其标签。并且与管理设备的一次交互,可以实现多个时间序列的标签标注,节省网络开销。
分析设备向管理设备发送第一特征向量集合对应的时间序列可以有多种实现方式,本申请实施例以以下两种实现方式为例进行说明:
第一种实现方式,分析设备向管理设备发送第一特征向量集合对应的时间序列,管理设备接收该时间序列后,呈现第一特征向量集合对应的时间序列,由专业人员对该第一特征向量集合对应的时间序列的标签进行标注。
在第二实现方式,在向管理设备发送第一特征向量集合对应的时间序列之前,分析设备还可以先对第一特征向量集合中的特征向量进行聚类处理,得到第一特征向量集合中特征向量的类别关系;然后在向管理设备发送所述第一特征向量集合对应的时间序列时,同时向管理设备发送类别关系,以供管理设备按照类别关系,呈现第一特征向量集合对应的时间序列。
例如,管理设备可以将属于同一类别的多个时间序列在同一用户页面显示,将属于不同类别的多个时间序列在不同用户页面显示;又例如,管理设备可以将属于不同类别的多个时间序列在同一用户页面的不同位置显示;再例如,管理设备可以将每个时间序列与其所属类别对应显示。管理设备按照类别关系,呈现第一特征向量集合对应的时间序列,可以供专业人员在标注时参考该类别关系,起到辅助专业人员进行标签标注的作用。基于此,专业人员可以对属于同一类别的时间序列标注同一标签,提高标注效率,增加标签标注的准确性。
可选地,所述对所述第一特征向量集合中的特征向量进行聚类处理,包括:
基于所述第一特征向量集合中每两个特征向量的距离,统计每个所述特征向量的近邻向量,所述第一特征向量集合中任一特征向量的近邻向量为所述第一特征向量集合中与所述任一特征向量的距离小于距离阈值的其他特征向量,所述距离阈值为在基于所述第一特征向量集合确定的多个距离中指定的距离;
基于统计结果,将相同的近邻向量的数量大于数量阈值的每两个特征向量划分为同一类 特征向量。示例的,所述数量阈值为在所述第一特征向量集合中各个特征向量的近邻向量的数量中指定的数量。
由于距离阈值和数量阈值是相对变化的值,基于这两个阈值最终划分得到的类别关系更准确,更能体现各个特征向量之间的关联性,提升聚类算法的适应性。
在本申请实施例中,当一个特征向量对应的时间序列的标签确定,可以将该特征向量添加到参考特征向量集合中,以作为标签迁移的参考基础。但是一些特征向量对应的标签可能由于人工误差或者机器算法失误而出现错误,如果将这些特征向量添加到参考特征向量集合中,容易引起标签迁移过程的标签冲突,例如与某一时间序列的目标特征向量的相似度大于相似度阈值的参考特征向量有多个,且标签不同,导致无法对该某一时间序列进行标签迁移。因此,需要对添加至参考特征向量集合的特征向量进行冲突检测处理,以避免出现错误标签的特征向量添加到参考特征向量集合中。示例的,该冲突检测过程可以包括以下步骤:
获取已确定标签的第三时间序列的第一特征向量;
获取所述第一特征向量与所述参考特征向量集合中参考特征向量的相似度;
当所述第一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于入库相似度阈值,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
可选地,所述方法还包括:
当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签相同时,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
可选地,所述方法还包括:
当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签不同时,向管理设备发送所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列,以供所述管理设备呈现所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列;
接收所述管理设备发送的所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列的相同的标签;
基于接收的标签,更新预先存储的所述第一特征向量对应的时间序列的标签以及所述第二特征向量对应的时间序列的标签;
将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
可选地,所述目标特征向量和所述参考特征向量均包括统计特征、拟合特征或频域特征中的一种或多种特征的数据。
可选地,本申请实施例提供的标签确定方法,应用在异常检测场景中,能够进行自动的标签确定。在该应用场景中,前述标签确定方法由网络分析器执行,所述参考特征向量对应的标签为异常检测标签。在异常检测场景中,时间序列数据包括网络关键绩效指标(key performance indicator,KPI),网络KPI包括网络流量KP、网络业务KPI等。其中,网络设备KPI可以是中央处理器(CPU,central processing unit)利用率、光功率等,网络业务KPI可以是网络流量、丢包率、时延、用户接入数等。其中,网络流量KPI为具有周期性的时间序列数据。由于大量的KPI异常的特征相似,本申请实施例提供的标签确定方法应用于异常 检测场景中,可在一定范围内进行标签自动迁移,提升标签的利用率,降低标注成本,并且相对于传统的标签迁移方法,确定的标签的准确性较高。
第二方面,提供了一种标签确定装置,所述装置包括:多个功能模块:所述多个功能模块相互作用,实现上述第一方面及其各实施方式中的方法。所述多个功能模块可以基于软件、硬件或软件和硬件的结合实现,且所述多个功能模块可以基于具体实现进行任意组合或分割。
第三方面,提供了一种标签确定装置,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现如第一方面任一所述的标签确定方法。
第四方面,提供了一种计算机存储介质,所述计算机存储介质上存储有指令,当所述指令被处理器执行时,实现如第一方面任一所述的标签确定方法。
第五方面,提供了一种芯片,芯片包括可编程逻辑电路和/或程序指令,当芯片运行时,实现如第一方面任一所述的标签确定方法。
第六方面,提供了一种计算机程序产品,所述计算机程序产品中存储有指令,当所述指令在计算机上运行时,使得所述计算机执行如第一方面任一所述的标签确定方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
本申请实施例提供的标签确定方法,基于时间序列的特征向量的相似度进行标签的迁移,能够实现样本数据的自动标注,降低标签的确定成本。并且由于相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,例如能够降低采样时段、幅度变化、象限漂移和噪声等干扰信息的影响。提高了标签确定的准确性。尤其在高维时间序列中仍然能够准确地进行标签迁移。将本申请实施例提供的标签确定方法应用于监督式学习算法或半监督式学习算法等需要大量标注的样本数据的场景中,能够有效降低标注成本,提高机器学习模型的建模效率。
并且本申请实施例提供的标签确定方法,由于采用特征向量的相似度进行标签迁移,不局限于波形相似的时间序列的标签迁移,只要保证在某些特征维度上相似即可进行标签迁移,由此可知,本申请实施例可以适用于波形不同的时间序列的标签迁移。因此可以扩大标签泛化的场景,提升标签迁移的灵活性和利用率,降低机器学习模型的建模成本。尤其在异常检测场景中,可以实现某些相似特征的KPI间的标签迁移。
进一步的,分析设备通过对第一特征向量集合进行聚类确定类别关系,并由管理设备按照类别关系,呈现第一特征向量集合对应的时间序列,可以供专业人员在标注时参考该类别关系,起到辅助专业人员进行标签标注的作用。基于此,专业人员可以对属于同一类别的时间序列标注同一标签,提高标注效率,增加标签标注的准确性。
附图说明
图1是本申请实施例提供的一种标签确定方法所涉及的一种应用场景示意图;
图2是本申请实施例提供的一种标签确定方法所涉及的另一种应用场景示意图;
图3是本申请实施例提供的一种标签确定方法的流程示意图;
图4是本申请实施例提供的一种获取目标特征向量与参考特征向量集合中参考特征向量的相似度的流程示意图;
图5是本申请实施例提供的一种冲突检测方法的流程图;
图6是本申请实施例提供的另一种标签确定方法的流程示意图;
图7是本申请实施例提供的一种对第一特征向量集合中的特征向量进行聚类处理的流程示意图;
图8是本申请一示意性实施例提供的一种标签确定装置的框图;
图9是本申请一示意性实施例提供的另一种标签确定装置的框图;
图10是本申请一示意性实施例提供的又一种标签确定装置的框图;
图11是本申请一示意性实施例提供的再一种标签确定装置的框图;
图12是本申请另一示意性实施例提供的一种标签确定装置的框图;
图13是本申请另一示意性实施例提供的另一种标签确定装置的框图;
图14是本申请另一示意性实施例提供的又一种标签确定装置的框图;
图15是本申请又一示意性实施例提供的一种标签确定装置的框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
为了便于读者理解,本申请实施例对提供的标签确定方法所涉及的机器学习算法进行简单介绍。
机器学习算法作为AI领域的一个重要分支,在众多领域得到了广泛的应用。从学习方法的角度,机器学习算法可以分为监督式学习算法、非监督式学习算法、半监督式学习算法、强化学习算法几大类。监督式学习算法,是指可以基于训练数据学习一个算法或建立一个模式,并以此算法或模式推测新的实例。训练数据,也称样本数据,是由输入数据和预期输出组成。机器学习算法模型的预期输出,称为标签,其可以是一个连续的值(称为回归或者回归标签),或是一个预测的分类结果(称作分类标签)。非监督式学习算法与监督式学习算法的区别在于,非监督式学习算法的样本数据没有给定标签,机器学习算法模型通过分析数据的特征,从而得到一定的成果。半监督学习算法,其样本数据一部分带有标签,另一部分没有标签,而无标签的数据远远多于有标签的数据。强化学习算法通过不断在环境中尝试,以取得最大化的预期利益,通过环境给予的奖励或惩罚,产生能获得最大利益的选择。其中,监督式学习算法是机器学习算法中较为基础的一类算法,在足够数据量的情况下能够取得很好的效果,例如图像识别、语言翻译等。但是监督式学习算法中的标签获取成本高,需要大量人力进行样本标注,很多应用场景并不具备大量的标注数据(即标注了标签的样本数据)。
如前所述,采用监督式学习算法或者半监督式学习算法等进行模型训练时,需要大量人力进行样本数据的标注,标签的确定成本较高。
目前提出一种标签确定方法,该方法采用标签迁移(也称标签泛化)的方式进行标签的确定,也即是将已确定标签的一个时间序列的标签迁移至与该时间序列类似的另一时间序列上,作为该另一时间序列的标签。其中,时间序列为按照时序排列的一组数据的集合,该时序通常为数据产生的先后顺序,时间序列是样本数据的一种数据形式,时间序列中的数据也称为数据点。例如,时间序列X为X=(x 1,x 2,…,x n),则该时间序列有n个数据点,分别为x 1至x n,该时间序列的长度为n。
假设第一时间序列为待确定标签的时间序列,该标签确定过程包括:获取第一时间序列与多个参考时间序列的波形相似度,当第一时间序列与该多个参考时间序列中的一个参考时间序列的波形相似度大于波形相似度阈值时,将该参考时间序列所对应的标签确定为第一时间序列的标签。但是,这种通过对比时间序列的波形相似度进行标签迁移的方式,容易受到时间序列自身所具有的各种干扰信息(如采样时段、幅度变化、象限漂移和噪声等)的影响,标签确定的准确性较低。
进一步的,目前还提出一种基于动态时间规整(Dynamic Time Warping,DTW)的标签确定方法,当第一时间序列和参考时间序列的长度(即序列中数据点的个数)不同,通过规整时间轴,来建立两个时间序列的对应关系,之后再计算两者的波形相似度,从而在一定程度上减少采样时段、幅度变化和象限漂移的影响。但是该标签的确定方法中规整时间轴的算法复杂,且仍然无法避免时间序列的噪声影响。尤其在高维时间序列中的实用性较低。
本申请实施例提供一种标签确定方法,基于时间序列的特征向量的相似度进行标签的迁移,相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,提高了标签确定的准确性。尤其在高维时间序列中仍然能够准确地进行标签迁移。
请参考图1,图1是本申请实施例提供的标签确定方法所涉及的一种应用场景示意图。如图1所示,该应用场景中包括分析设备101、管理设备102和网络设备103a至103c(统称为网络设备103)。图1中分析设备、管理设备和网络设备的数量仅用作示意,不作为对本申请实施例提供的标签确定方法所涉及的应用场景的限制。该应用场景所涉及的网络可以是第二代(2-Generation,2G)通信网络、第三代(3rd Generation,3G)通信网络、长期演进(Long Term Evolution,LTE)通信网络或第五代(5rd Generation,5G)通信网络等。
其中,分析设备101、管理设备102和网络设备103可以部署在同一台设备上,也可以分别部署于不同设备上。例如,分析设备101、管理设备102和网络设备103部署在不同设备上时,分析设备101可以是一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务中心。管理设备102可以是一台计算机,或者一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务中心,并且管理设备102可以是运维支撑系统(operations support system,OSS)或其它与分析设备连接的网络设备。网络设备103可以是路由器、交换机、基站等,其可以为核心网的网络设备,也可以为边缘网络的网络设备。分析设备101分别与网络设备103以及管理设备102之间通过有线网络或无线网络连接。
网络设备103用于向分析设备101上传采集到的数据,例如各类时间序列的数据,分析设备101用于从网络设备103提取和使用数据,例如确定获取的时间序列的标签,管理设备103用于对分析设备101进行管理。可选地,网络设备103向分析设备101上传的数据还可以包括各类日志数据和设备状态数据等。分析设备101还用于训练有一个或多个机器学习模型,不同的机器学习模型利用网络设备103上传的数据,可以分别实现异常检测、预测、网络安全防护和应用识别等功能。分析设备还可以实现各个机器学习模型的特征选择和自动更新,并将选择的特征以及模型的更新结果反馈给管理设备102,由管理设备102来决策是否进行模型的重新训练。对应不同的机器学习模型,该分析设备101采用本申请提供的标签确定方法可以确定不同的标签。
可选地,上述应用场景还可以不包括网络设备103,分析设备101还可以接收管理设备 103输入的时间序列的数据,本申请实施例只是对时间序列的数据的来源进行示意性说明,并不对此进行限定。
进一步的,本申请实施例提供的标签确定方法可以用于异常检测场景中。异常检测是指对不符合预测的模式、数据或时间进行检测。传统的异常检测是由专业人员(也称专家)对历史数据进行学习,然后找出异常,即为异常数据标注“异常”标签。异常检测的数据来源包括应用、进程、操作系统、设备或者网络,随着计算系统复杂度的提升,人工已经不能胜任现在的异常检测难度。
本申请实施例提供的标签确定方法,应用在异常检测场景中,能够进行自动的标签确定。请参考图2,图2是本申请实施例提供的标签确定方法所涉及的一种异常检测的应用场景示意图。在该应用场景中,分析设备101可以为网络分析器,管理设备102可以为控制器,分析设备101维护的机器学习模型为异常检测模型,确定的标签为异常检测标签,该异常检测标签包括两种分类标签,分别为:“正常”和“异常”。在图1所示的场景的基础上,该应用场景还可以包括存储设备104,其用于存储网络设备103提供的数据,该存储设备104可以为分布式存储设备,分析设备101可以对该存储设备104所存储的数据进行读写。这样在网络设备103的数据较多的情况下,由存储设备104进行数据存储,可以减轻分析设备101的负载,提高分析设备101的数据分析效率。需要说明的是,当网络设备103提供的数据量较少时,也可以不设置该存储设备104,此时异常检测的应用场景可以参考图1所示的应用场景。
在异常检测场景中,时间序列的异常检测通常是找出远离相对既定模式或分布的数据点。时间序列的异常包括:突升、突降、均值变化等。时间序列的异常检测算法包括基于统计与数据分布的算法(例如N-Sigma算法)、基于距离/密度的算法(例如局部异常因子算法)、孤立森林算法或基于预测的算法(例如差分整合移动平均自回归模型(Autoregressive Integrated Moving Average model,ARIMA)算法)等。相应的机器学习模型可以为基于统计与数据分布的模型(例如N-Sigma模型)、基于距离/密度的模型(例如局部异常因子模型)、孤立森林模型或基于预测的模型(例如ARIMA)。
如图1和图2所述,网络设备103上传的数据包括各类时间序列数据,具有数据规模庞大、模式和规律复杂的特点。因此,在利用这些数据进行异常检测、预测、分类、网络安全防护、应用识别或用户体验评估(例如基于这些数据评估用户的体验)等应用时,使用了大量的机器学习模型。专业人员需要对这些数据进行标注,工作量非常大,标注成本极高。
本申请实施例提供一种标签确定方法,能够进行标签迁移,从而降低标注成本,且由于基于时间序列的特征向量的相似度进行标签的迁移,相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,提高了标签确定的准确性。
在异常检测场景中,时间序列数据包括网络关键绩效指标(key performance indicator,KPI),网络KPI包括网络设备KPI、网络业务KPI等。其中,网络设备KPI可以是中央处理器(CPU,central processing unit)利用率、光功率等,网络业务KPI可以是网络流量、丢包率、时延、用户接入数等。其中,网络流量KPI为具有周期性的时间序列数据。示例地,图2中所示的异常检测场景中,机器学习模型以用于对网络流量KPI进行异常检测。由于大量的KPI异常的特征相似,本申请实施例提供的标签确定方法应用于异常检测场景中,可在一定范围内进行标签自动迁移,提升标签的利用率,降低标注成本,并且相对于传统的标签 迁移方法,确定的标签的准确性较高。
本申请实施例提供一种标签确定方法,该方法可以由前述分析设备执行,假设第一时间序列为需要进行标签确定的序列,如图3所示,方法包括:
步骤301、获取第一时间序列的目标特征向量。
时间序列为按照时序排列的一组数据的集合,该时序通常为数据产生的先后顺序,时间序列中的数据也称为数据点。通常一个时间序列中各个数据点的时间间隔为一恒定值,因此时间序列可以作为离散时间数据进行分析处理。示例的,该第一时间序列可以为网络KPI的时间序列。
在一种可选示例中,分析设备可以接收网络设备或者管理设备发送的时间序列;在另一种可选示例中,分析设备具有输入输出(I/O)接口,通过该I/O接口接收时间序列;在又一种可选示例中,分析设备可以从存储设备中读取时间序列。
目标特征向量是表征第一时间序列的特征的向量,其包括一个或多个特征的数据,也即是,目标特征向量对应一维或多维特征,目标特征向量对应的特征的维度与目标特征向量中的数据的个数相同(即特征与数据一一对应)。其中,特征指的是第一时间序列所具有的特征,其可以包括数据特征和/或提取特征。
其中,数据特征是时间序列中的数据的自身特征。例如,数据特征包括数据排列周期、数据变化趋势或数据波动等,相应的,数据特征的数据包括:数据排列周期的数据、数据变化趋势数据或数据波动数据等。数据排列周期是指若时间序列中数据周期性排列,该时间序列中数据排列所涉及的周期,例如,数据排列周期的数据包括周期时长(也即两个周期发起的时间间隔)和/或周期个数;数据变化趋势数据用于反映时间序列中数据排列的变化趋势(即数据变化趋势),例如,该数据包括:持续增长、持续下降、先升后降,先降后升,或者满足正态分布等等;数据波动数据用于反映时间序列中数据的波动状态(即数据波动),例如该数据包括表征该时间序列的波动曲线的函数,或者,该时间序列的指定值,如最大值、最小值或平均值。
提取特征是提取该时间序列中的数据的过程中的特征。例如,提取特征包括统计特征、拟合特征或频域特征等,相应的,提取特征的数据包括统计特征数据、拟合特征数据或频域特征数据等。统计特征是指时间序列所具有的统计学特征,统计特征有数量特征和属性特征之分,其中数量特征又有计量特征和计数特征之分,数量特征可以直接用数值来表示,例如,CPU、内存、IO资源等多种资源的消耗值为计量特征;而出现异常的次数、正常工作的设备个数是计数特征;属性特征不能直接用数值来表示,如设备是否出现异常、设备是否产生宕机等,统计特征中的特征就是统计时需要考察的指标。例如,该统计特征数据包括移动平均值(Moving_average)、加权平均值(Weighted_mv)等;拟合特征是时间序列拟合时的特征,则拟合特征数据用于反映时间序列用于拟合的特征,例如拟合特征数据包括进行拟合时所采用的算法,如ARIMA;频域特征是时间序列在频域上的特征,则频域特征用于反映时间序列在频域上的特征。例如,频域特征数据包括:时间序列在频域上分布所遵循的规律的数据,如该时间序列中高频分量的占比。可选地,频域特征数据可以通过对时间序列进行小波分解得到。
该获取第一时间序列的目标特征向量的过程可以包括:先确定需要提取的目标特征,然后在第一时间序列中提取确定的目标特征的数据,得到目标特征向量。示例的,该需要提取 的目标特征是基于标签确定方法所涉及的应用场景确定的。在一种可选示例中,该目标特征为预先配置的特征,例如是由用户配置的特征。
在另一种可选示例中,该目标特征为指定特征中的一个或多个,例如该指定特征为前述统计特征。
值得说明的是,用户可以预先设置指定特征,但是对于第一时间序列,其可能无法具有全部指定特征,分析设备可以在第一时间序列中筛选属于该指定特征的特征作为目标特征。例如,该目标特征包括统计特征:时间序列分解_周期分量(time series decompose_seasonal,Tsd_seasonal)、移动平均值、加权平均值、时间序列分类、最大值、最小值、分位数、方差、标准差、周期同比(year on year,yoy,指的是与历史同时期比较)、每天波动率、分桶熵、样本熵、滑动平均、指数滑动平均、高斯分布特征或T分布特征等中的一个或多个,相应的,目标特征数据包括该一个或多个统计特征的数据;
和/或,该目标特征包括拟合特征:自回归拟合误差、高斯过程回归拟合误差或神经网络拟合误差中的一个或多个,相应的,目标特征数据包括该一个或多个拟合特征的数据;
和/或,该目标特征包括频域特征:时间序列中高频分量的占比;相应的,目标特征数据包括时间序列中高频分量的占比的数据,该数据可以对时间序列进行小波分解得到。
步骤302、获取目标特征向量与参考特征向量集合中参考特征向量的相似度。执行步骤303或304。
分析设备中预先建立有参考特征向量集合,该参考特征向量集合包括一个或多个参考特征向量,该参考特征向量为已确定标签的第二时间序列的特征向量。该标签可以是人工标注的,也可以是通过本申请实施例提供的标签确定方法确定的,还可以是通过其他算法确定的,本申请实施例对此不做限定。
该参考特征向量集合中每个参考特征向量对应的标签以及第二时间序列可以存储在参考特征向量集合中,也可以存储在其他存储空间。只要通过该参考特征向量可以查询得到对应的标签以及第二时间序列即可。
参考特征向量是表征第二时间序列的特征的向量,其包括一个或多个特征的数据。也即是,参考特征向量对应一维或多维特征。参考特征向量所涉及的特征可以包括数据特征和/或提取特征。数据的维度与特征的个数以及相应的特征向量的解释可以参考前述目标特征向量的解释。获取每个第二时间序列的参考特征向量的过程可以参考前述获取第一时间序列的目标特征向量的过程。本申请实施例对此不做赘述。
表1为参考特征向量集合中存储的数据的示意性说明,表1中,该参考特征向量集合中每个参考特征向量对应的时间序列以及标签可以存储在参考特征向量集合中。表1中样本数据身份标识(identification,ID)为KPI_1的参考特征向量,其包括4个特征的数据,该4个特征的数据分别为:移动平均值(Moving_average)、加权平均值(Weighted_mv)、时间序列分解_周期分量(time series decompose_seasonal,Tsd_seasonal)和周期yoy。该参考特征向量对应的时间序列为(x1,x2,……,xn),对应的标签为“异常”。表1假设参考特征向量集合按照固定格式存储数据,其存储的参考特征向量的特征也可以为预先设定的特征,则参考特征向量集合的数据均可以按照表1的格式存储。本申请实施例在实际实现时,参考特征向量集合还可以有其他形式,本申请实施例对此不做限定。
表1
Figure PCTCN2020112878-appb-000001
在本申请实施例中,参考特征向量集合中包括的参考特征向量与目标特征向量可能采用相同的提取算法也可能采用不同的提取算法获取。相应的,每个参考特征向量和目标特征向量所涉及的特征的类别以及特征的个数可能不同。因此,需要针对不同的情况进行相应的处理。
假设第一参考特征向量为参考特征向量集合中的一个参考特征向量,第一特征为目标特征向量对应的特征与第一参考特征向量对应的特征中相同的特征,也即是第一特征为目标特征向量对应的特征与第一参考特征向量对应的特征的交集,第一子特征向量为目标特征向量中第一特征对应的数据组成的向量,第二子特征向量为第一参考特征向量中第一特征对应的数据组成的向量,则目标特征向量与第一参考特征向量的相似度为第一子特征向量和第二子特征向量的相似度。本申请实施例以以下两种情况为例进行说明。
第一种情况,参考特征向量和目标特征向量所涉及的特征的类别和特征的个数不同。则,如图4所示,获取目标特征向量与参考特征向量集合中参考特征向量的相似度的过程可以包括以下步骤:
步骤3021、在目标特征向量对应的特征与第一参考特征向量对应的特征中,筛选相同的第一特征。
第一特征包括一个或多个特征。前述步骤3021获取第一特征的过程,可以通过获取目标特征向量对应的特征与第一参考特征向量对应的特征的交集实现。例如,假设目标特征向量Q1包括对应特征y 1至y 4共4个特征的数据,该4个数据分别为q 1至q 4,即Q1=(q 1,q 2,q 3,q 4),对应的特征的集合Y1满足:Y1=(y 1,y 2,y 3,y 4);第一参考特征向量Q2包括对应特征y 1、y 4和y 5共3个特征的数据,3个数据分别为p 1、p 4和p 5,即Q2=(p 1,p 4,p 5),对应的特征的集合Y2满足:Y2=(y 1,y 4,y 5)。则第一特征Y满足:Y=Y1∩Y2,则Y=(y 1,y 4)。
值得说明的是,第一特征还可以采用其他方式得到,例如依次比较目标特征向量对应的特征与第一参考特征向量对应的特征,本申请实施例对此不做限定。
步骤3022、获取目标特征向量中第一特征对应的数据,得到由获取的数据组成的第一子特征向量。
仍然以步骤3021的例子为例,则第一子特征向量为目标特征向量Q1=(q 1,q 2,q 3,q 4)中的Q11=(q 1,q 4)。
步骤3023、获取第一参考特征向量中第一特征对应的数据,得到由获取的数据组成的第二子特征向量。
仍然以步骤3021的例子为例,则第二子特征向量为第一参考特征向量Q2=(p 1,p 4,p 5)中的Q21=(p 1,p 4)。值得说明的是,第一子特征向量和第二子特征向量中的数据的个数以及排列方式一致,以保证后续相似度计算的准确性。
步骤3024、确定第一子特征向量和第二子特征向量的相似度。
在本申请实施例中,第一子特征向量和第二子特征向量均以序列形式表征,第一子特征向量和第二子特征向量中相同位置的数据对应同一类别的特征,第一子特征向量和第二子特征向量的相似度可以采用两者的距离来衡量,该相似度与第一子特征向量和第二子特征向量的距离负相关。也即是两个子特征向量的相似度越大,距离越小;相似度越小,距离越大。
则,可以先获取第一子特征向量和第二子特征向量的距离;然后,基于获取的距离,确定第一子特征向量和第二子特征向量的相似度。
可选地,第一子特征向量和第二子特征向量的距离用于表征目标特征向量与第一参考特征向量的距离,第一子特征向量和第二子特征向量的距离可以有多种获取方式,例如,采用欧式距离公式、切比雪夫距离公式、余弦距离公式、马氏距离公式或者其他距离公式等计算得到。
示例的,假设第一子特征向量为x=(f x1,f x2,…,f xn),第二子特征向量为y=(f y1,f y2,…,f yn),采用马式距离公式计算第一子特征向量和第二子特征向量的距离D M(x,y)D M(x,y),则马氏距离公式如下:
Figure PCTCN2020112878-appb-000002
其中,Σ -1为协方差矩阵,Σ -1=E[(X-E[X])(X-E(X)) T]。协方差矩阵Σ -1为预先确定的矩阵,其可以由第一子特征向量与参考特征向量集合中与第一子特征向量相同维度的特征的数据计算得到。
在本申请实施例中,第一子特征向量和第二子特征向量的相似度与两者的距离负相关,则可以基于获取的距离D以及相似度计算公式,确定第一子特征向量和第二子特征向量的相似度S。在一种可选方式中,相似度计算公式为:S=a/D。其中,a为预先设置的数值。例如a=1。在另一种可选方式中,相似度计算公式为:S=1-f(D),其中,f(D)表示对距离D进行归一化处理。
第二种情况,参考特征向量和目标特征向量所涉及的特征的类别和特征的个数相同。则,第一子特征向量与目标特征向量相同,第二子特征向量与第一参考特征向量相同,无需执行前述步骤3021的筛选动作,获取目标特征向量与参考特征向量集合中参考特征向量的相似度的过程可以为:直接确定目标特征向量与第一参考特征向量的相似度,也即是先获取目标特征向量与第一参考特征向量的距离;然后,基于获取的距离,确定目标特征向量与第一参考特征向量的相似度。该确定目标特征向量与第一参考特征向量的相似度的过程可以参考前述步骤3024。本申请实施例对此不再赘述。
值得说明的是,前述参考特征向量和目标特征向量涉及多个特征的数据,且相同特征的数据的个数越多,最终计算得到的相似度越能从多个角度反映参考特征向量和目标特征向量的相关性,基于此确定的标签准确性更高。
步骤303、当目标特征向量与第一参考特征向量的相似度大于相似度阈值时,将第一参考特征向量所对应的标签确定为第一时间序列的标签。
该相似度阈值可以由用户预先设置,也可以是分析设备基于当前的应用场景确定的。当目标特征向量与第一参考特征向量的相似度大于相似度阈值时,说明第一时间序列与第一参考特征向量所对应的第二时间序列在特征上的相似度较高,第一时间序列满足标签迁移条件,可以将第一参考特征向量所对应的标签确定为第一时间序列的标签。
例如,第一参考特征向量所对应的标签为“异常”,则第一时间序列的标签也为“异常”。
步骤304、当目标特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值时,向管理设备发送第一时间序列,以供管理设备呈现第一时间序列。
当目标特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值时,说明第一时间序列与参考特征向量集合中每个参考特征向量所对应的第二时间序列在特征上均相似度较低,第一时间序列不满足标签迁移条件,则该目标特征向量对应的第一时间序列的标签可以由人工标注。因此,分析设备可以向管理设备发送第一时间序列,该管理设备可以为前述应用环境中的管理设备102。管理设备在接收了第一时间序列后,呈现该第一时间序列,由专业人员对该第一时间序列的标签进行标注。
步骤305、接收管理设备发送的第一时间序列的标签。
参考步骤304,专业人员对该第一时间序列的标签进行标注后,管理设备接收标注的标签,并将该标签发送至分析设备,分析设备接收该标签,并将该标签与第一时间序列对应保存。
值得说明的是,当目标特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值,且该第一时间序列的重要性低于预设阈值(例如该第一时间序列为随机获取的时间序列)时,由于无法进行自动的标签确定,分析设备也可以不对该第一时间序列进行标签标注,即不执行步骤304和305,而是删除该第一时间序列,获取新的时间序列作为第一时间序列,再次执行上述步骤301至303,以实现符合标签迁移条件的时间序列的标签确定。这样无需人工参与,即可为符合标签迁移条件的时间序列确定标签。
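作为步骤301至305整体流程的一个示意,下面的Python草图展示"相似度大于阈值则迁移标签,否则交由人工标注或直接丢弃"的判断逻辑(其中similarity、manual_label等均为假设的函数与参数,余弦相似度也仅作演示):

```python
import numpy as np

def determine_label(target_vec, reference_set, threshold, similarity, manual_label=None):
    """示意:reference_set为(参考特征向量, 标签)的列表,返回第一时间序列的标签。"""
    best_sim, best_label = 0.0, None
    for ref_vec, label in reference_set:
        sim = similarity(target_vec, ref_vec)
        if sim > best_sim:
            best_sim, best_label = sim, label
    if best_sim > threshold:
        return best_label            # 满足标签迁移条件,迁移最相似参考特征向量的标签
    return manual_label              # 不满足迁移条件:交由人工标注,或重要性较低时返回None

# 用法示例:以余弦相似度为例(仅作演示)
def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

refs = [([0.3, 2.0], "异常"), ([5.0, 0.1], "正常")]
label = determine_label([0.32, 1.9], refs, threshold=0.9, similarity=cosine_sim)
```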
步骤306、对需要添加至参考特征向量集合的特征向量进行冲突检测处理。
在本申请实施例中,当一个特征向量对应的时间序列的标签确定,可以将该特征向量添加到参考特征向量集合中,以作为标签迁移的参考基础。但是一些特征向量对应的标签可能由于人工误差或者机器算法失误而出现错误,如果将这些特征向量添加到参考特征向量集合中,容易引起标签迁移过程的标签冲突,例如与某一时间序列的目标特征向量的相似度大于相似度阈值的参考特征向量有多个,且标签不同,导致无法对该某一时间序列进行标签迁移。因此,需要对添加至参考特征向量集合的特征向量进行冲突检测处理,以避免出现错误标签的特征向量添加到参考特征向量集合中。示例的,如图5所示,该冲突检测过程可以包括以下步骤:
步骤3061、获取已确定标签的第三时间序列的第一特征向量。
该第三时间序列的标签可以是人工标注的,也可以是通过本申请实施例提供的标签确定方法确定的,还可以是通过其他算法确定的。示例的,该第三时间序列的标签可以为前述步骤303或步骤305所确定的标签,相应的,该第三时间序列即为前述第一时间序列。
该第三时间序列的第一特征向量的获取过程可以参考前述步骤301中第一时间序列的目标特征向量的获取过程,本申请实施例不再赘述。值得说明的是,当第三时间序列为前述第一时间序列时,则可以直接将前述目标特征向量作为第一特征向量,以减少再次提取特征向量的过程,降低运算代价。
步骤3062、获取第一特征向量与参考特征向量集合中参考特征向量的相似度。
步骤3062可以参考前述步骤302,本申请实施例对此不再赘述。
在参考特征集合中,一个或多个参考特征向量可以对应同一标签。但容易出现以下错误场景:实质相关的多个参考特征向量由于人工误差或者机器算法失误而对应不同标签,也即是,本应该对应同一标签的参考特征向量对应了不同的标签。为了减少这种错误场景的出现,在步骤3062之后,可以执行步骤3063、步骤3064或步骤3065。
步骤3063、当第一特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于入库相似度阈值,将第一特征向量作为参考特征向量添加至参考特征集合中。
当第一特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于入库相似度阈值,说明第一特征向量与参考特征向量集合中每个参考特征向量均不相似,相应的,第三时间序列与参考特征向量集合中每个参考特征向量对应的第二时间序列的相似度较低,其为一个全新的时间序列,可以将第一特征向量作为参考特征向量添加至参考特征集合中。
值得说明的是,该入库相似度阈值可以由用户预先设置,也可以由分析设备基于当前的应用场景确定,其与前述步骤303中的相似度阈值可以相同也可以不同。
步骤3064、当第一特征向量与参考特征向量集合中的第二特征向量的相似度大于入库相似度阈值,且第一特征向量对应的标签与第二特征向量对应的标签相同时,将第一特征向量作为参考特征向量添加至参考特征集合中。
当第一特征向量与参考特征向量集合中的第二特征向量的相似度大于入库相似度阈值,说明第一特征向量与第二特征向量类似,两者相关;当第一特征向量对应的标签与第二特征向量对应的标签相同时,则说明相关的两个特征向量对应同一标签,则第一特征向量符合加入参考特征向量集合的条件,将该第一特征向量作为参考特征向量添加至参考特征集合中。
步骤3065、当第一特征向量与参考特征向量集合中的第二特征向量的相似度大于入库相似度阈值,且第一特征向量对应的标签与第二特征向量对应的标签不同时,向管理设备发送第一特征向量对应的时间序列以及第二特征向量对应的时间序列,以供管理设备呈现第一特征向量对应的时间序列以及第二特征向量对应的时间序列。执行步骤3066。
当第一特征向量与参考特征向量集合中的第二特征向量的相似度大于入库相似度阈值,说明第一特征向量与第二特征向量类似,两者相关;当第一特征向量对应的标签与第二特征向量对应的标签不同时,说明相关的两个特征向量对应不同标签,则第一特征向量或第二特征向量的标签有误。该第一特征向量或第二特征向量的标签可以再次由人工标注,以保证标签的准确性。因此,分析设备可以向管理设备发送第一特征向量对应的时间序列以及第二特征向量对应的时间序列,该管理设备可以为前述应用环境中的管理设备102。管理设备在接收了第一特征向量对应的时间序列以及第二特征向量对应的时间序列后,呈现接收的时间序列,由专业人员对呈现的时间序列的标签进行标注,由于两个时间序列对应的特征向量相关,人工标注的两个时间序列的标签为同一标签。
需要说明的是,分析设备还可以向管理设备发送第一特征向量对应的标签与第二特征向量对应的标签,管理设备可以在呈现接收的时间序列时,同步呈现接收到的标签,以供专业人员进行参考,在一定程度上能够提高最终标签标注的准确率。
步骤3066、接收管理设备发送的第一特征向量对应的时间序列以及第二特征向量对应的时间序列的相同的标签。执行步骤3067。
专业人员对该呈现的时间序列的标签进行标注后,管理设备接收标注的标签,并将该标签发送至分析设备,分析设备接收该标签。
步骤3067、基于接收的标签,更新预先存储的第一特征向量对应的时间序列的标签以及第二特征向量对应的时间序列的标签。执行步骤3068。
参考步骤3065,由于预先存储的第一特征向量对应的时间序列的标签以及第二特征向量对应的时间序列的标签不同,分析设备可以基于接收的标签,更新预先存储的第一特征向量对应的时间序列的标签以及第二特征向量对应的时间序列的标签,保证更新后的第一特征向量对应的时间序列的标签以及第二特征向量对应的时间序列的标签相同。从而避免出现标签冲突。
步骤3068、将第一特征向量作为参考特征向量添加至参考特征集合中。
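下面的Python草图示意步骤3063至3068的入库冲突检测逻辑:与所有参考特征向量都不相似则直接入库;与某个参考特征向量相似且标签相同也入库;相似但标签不同则先经人工复核统一标签再入库(similarity、resolve_conflict为假设的函数,数据结构仅作说明):

```python
def add_with_conflict_check(new_vec, new_label, reference_set, gate_threshold,
                            similarity, resolve_conflict):
    """示意:reference_set为[(特征向量, 标签), ...];resolve_conflict模拟人工复核,
    返回两条相关时间序列应统一使用的标签。"""
    for i, (ref_vec, ref_label) in enumerate(reference_set):
        sim = similarity(new_vec, ref_vec)
        if sim <= gate_threshold:
            continue                          # 与该参考特征向量不相似,继续比较下一个
        if new_label == ref_label:
            break                             # 相似且标签相同,可直接入库
        # 相似但标签不同:人工复核后统一更新两者的标签,避免标签冲突
        unified = resolve_conflict(new_vec, ref_vec, new_label, ref_label)
        new_label = unified
        reference_set[i] = (ref_vec, unified)
        break
    reference_set.append((new_vec, new_label))
    return reference_set
```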
值得说明的是,前述步骤3061至3068只是进行冲突检测的一种示意性实现方式说明,本申请实施例在实际实现时,还可以采用其他方式进行冲突检测。例如,当第一特征向量与参考特征向量集合中的第二特征向量的相似度大于入库相似度阈值,且第一特征向量对应的标签与第二特征向量对应的标签不同时,还可以由专业人员人工进行冲突检测,则步骤3065至3068还可以替换为:通过分析设备自身或管理设备呈现第一特征向量,以及对应的时间序列和标签;并呈现第二特征向量,以及对应的时间序列和标签;接收删除指令,该删除指令指示删除第一特征向量,以及对应的时间序列和标签,或者,该删除指令指示删除第二特征向量,以及对应的时间序列和标签;删除该删除指令所指示的特征向量,以及对应的时间序列和标签。若分析设备接收到删除指令,说明该删除指令指示删除的特征向量无法在标签迁移过程中起到有效的参考作用,通过将该特征向量删除,可以避免标签迁移过程中的标签冲突。
前述步骤306是以将第一特征向量添加至参考特征向量集合时,进行冲突检测处理为例进行说明的,本申请实施例在实际实现时,也可以周期性进行冲突检测处理,或者在接收到检测触发指令后进行冲突检测处理,该冲突检测处理过程包括:步骤A1至A6。
步骤A1、获取参考特征向量集合的任一特征向量作为第三特征向量。
步骤A2、获取该第三特征向量与参考特征向量集合中其他参考特征向量的相似度。
步骤A2可以参考前述步骤302,本申请实施例对此不再赘述。
步骤A3、当该第三特征向量与参考特征向量集合中每个其他参考特征向量的相似度均不大于入库相似度阈值,将参考特征向量集合中除第三特征向量之外的其他任一特征向量作为第三特征向量,重复执行步骤A1至A7,直至遍历参考特征向量集合中所有特征向量,停止动作。
步骤A4、当第三特征向量与参考特征向量集合中的第四特征向量的相似度大于入库相似度阈值,且第三特征向量对应的标签与第四特征向量对应的标签相同时,将参考特征向量集合中除第三特征向量之外的其他任一特征向量作为第三特征向量,重复执行步骤A1至A7,直至遍历参考特征向量集合中所有特征向量,停止动作。
步骤A5、当第三特征向量与参考特征向量集合中的第四特征向量的相似度大于入库相似度阈值,且第三特征向量对应的标签与第四特征向量对应的标签不同时,向管理设备发送第三特征向量对应的时间序列以及第四特征向量对应的时间序列,以供管理设备呈现第三特征向量对应的时间序列以及第四特征向量对应的时间序列。执行步骤A6。
步骤A5参考前述步骤3065,本申请实施例对此不做赘述。
步骤A6、接收管理设备发送的第三特征向量对应的时间序列以及第四特征向量对应的时间序列的相同的标签。执行步骤A7。
步骤A6参考前述步骤3066,本申请实施例对此不做赘述。
步骤A7、基于接收的标签,更新预先存储的第三特征向量对应的时间序列的标签以及第四特征向量对应的时间序列的标签。将参考特征向量集合中除第三特征向量之外的其他任一特征向量作为第三特征向量,重复执行步骤A1至A7,直至遍历参考特征向量集合中所有特征向量,停止动作。
通过在参考特征向量集合内部进行冲突检测,可以避免标签冲突,保证参考特征向量集合中的参考特征向量起到有效的参考作用,通过将没有参考价值的特征向量删除,提高标签确定准确性。
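对于周期性或触发式的集合内部冲突检测(步骤A1至A7),可以参考下面的Python草图:遍历参考特征向量集合,两两计算相似度,筛选出"相似但标签不同"、需要人工复核并统一标签的向量对(similarity为假设的相似度函数):

```python
def detect_conflicts(reference_set, gate_threshold, similarity):
    """示意:返回相似度大于入库相似度阈值、但标签不同的参考特征向量对的下标。"""
    conflicts = []
    n = len(reference_set)
    for i in range(n):
        vec_i, label_i = reference_set[i]
        for j in range(i + 1, n):
            vec_j, label_j = reference_set[j]
            if similarity(vec_i, vec_j) > gate_threshold and label_i != label_j:
                conflicts.append((i, j))      # 需人工复核并统一标签的向量对
    return conflicts
```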
前述步骤304中,当目标特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值时,向管理设备发送第一时间序列,也即是分析设备每次获取一个与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值的时间序列,便将该时间序列发送至管理设备,以进行人工标注,这样的标注方式为个体标注方式,即与管理设备的一次交互过程中标注一个标签。本申请实施例在实际实现时,人工标注过程还可以有其他实现方式,例如集群标注方式,即与管理设备的一次交互过程标注多个标签,则如图6所示,采用集群标注方式时,前述步骤304和步骤305可以替换为步骤307至309:
步骤307、获取第一特征向量集合,该第一特征向量集合中的任一特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值,且任一特征向量对应的时间序列的标签未确定。
在一种可选示例中,第一特征向量集合中的特征向量的个数为指定个数。例如,分析设备在重复执行多次前述步骤301至303后,获取指定个数个第五特征向量,并将该指定个数的第五特征向量确定为第一特征向量集合,该第五特征向量与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值,且第五特征向量对应的时间序列的标签未确定。该第五特征向量可以包括前述目标特征向量。
在另一种可选示例中,第一特征向量集合为周期性获取的集合。例如,分析设备在重复执行多次前述步骤301至303的过程中,每隔指定时长获取第五特征向量,得到第一特征向量集合,该第五特征向量为最近的指定时长内,与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值的特征向量,且第五特征向量对应的时间序列的标签未确定。该第五特征向量可以包括前述目标特征向量。
在又一种可选示例中,第一特征向量集合为分析设备在接收到收集指令后获取的集合。例如,分析设备在重复执行多次前述步骤301至303的过程中,若接收到指示收集第五特征向量的收集指令,则基于该收集指令,获取第五特征向量得到第一特征向量集合,该第五特征向量为历史时长(该历史时长可以为指定时长,也可以是上次收集指令与本次收集指令之间的时长,还可以是其他方式规定的时长)内,与参考特征向量集合中每个参考特征向量的相似度均不大于相似度阈值的特征向量,且第五特征向量对应的时间序列的标签未确定。该第五特征向量可以包括前述目标特征向量。
步骤308、向管理设备发送第一特征向量集合对应的时间序列,以供管理设备呈现第一特征向量集合对应的时间序列。
在第一种可选方式中,分析设备向管理设备发送第一特征向量集合对应的时间序列,管理设备接收该时间序列后,呈现第一特征向量集合对应的时间序列,由专业人员对该第一特征向量集合对应的时间序列的标签进行标注。
示例的,管理设备可以在同一用户界面同时显示第一特征向量集合中多个特征向量对应的时间序列,也可以采用滚动方式分别显示第一特征向量集合中多个特征向量对应的时间序列,本申请实施例对此不做限定。
进一步的,分析设备还可以向管理设备发送第一特征向量集合,管理设备在呈现每个时间序列时,可以呈现对应的特征向量,以供专业人员进行参考,起到辅助专业人员进行标签标注的作用,提高标签标注的准确性。
在第二种可选方式中,在步骤308向管理设备发送第一特征向量集合对应的时间序列之前,还可以先对第一特征向量集合中的特征向量进行聚类处理,得到第一特征向量集合中特征向量的类别关系;然后在步骤308中,向管理设备发送类别关系以及第一特征向量集合对应的时间序列,以供管理设备按照类别关系,呈现第一特征向量集合对应的时间序列。
其中,聚类处理的方式可以有多种。在一种可选的实现方式中,如图7所示,对第一特征向量集合中的特征向量进行聚类处理的过程,包括:
步骤3081、基于第一特征向量集合中每两个特征向量的距离,统计每个特征向量的近邻向量,第一特征向量集合中任一特征向量的近邻向量为第一特征向量集合中与任一特征向量的距离小于距离阈值的其他特征向量,该距离阈值为在基于第一特征向量集合确定的多个距离中指定的距离。
示例的,该步骤3081可以包括以下步骤:
步骤B1、分析设备获取第一特征向量集合中每两个特征向量的距离。
假设第二参考特征向量和第三参考特征向量为参考特征向量集合中的任意两个参考特征向量,第二特征为第二参考特征向量对应的特征和第三参考特征向量对应的特征中相同的特征,也即是第二特征为第二参考特征向量对应的特征和第三参考特征向量对应的特征的交集,第三子特征向量为第二参考特征向量中第二特征对应的数据所组成的向量,第四子特征向量为第三参考特征向量中第二特征对应的数据所组成的向量,则第二参考特征向量和第三参考特征向量的距离为第三子特征向量和第四子特征向量的距离。其中,第二参考特征向量和第三参考特征向量所涉及的特征的类别和特征的个数不同时,参考前述步骤302的第一种情况,第三子特征向量和第四子特征向量的距离的获取方法可以参考前述步骤3021至3024;第二参考特征向量和第三参考特征向量所涉及的特征的类别和特征的个数相同时,参考前述步骤302的第二种情况,直接获取第二参考特征向量和第三参考特征向量的距离。
步骤B2、分析设备在基于第一特征向量集合确定的多个距离中,确定距离阈值。
可选地,分析设备对获取的各个距离进行排序,例如升序排序或降序排序。该距离阈值可以为排序后的距离中位于指定分位数或者指定顺序的距离。该指定分位数或指定顺序为经验值,例如指定分位数为前50%或前90%,则该距离阈值为排序后的距离中位于前50%或前90%处的距离,其中,"前"指的是按照排列顺序由前到后的顺序;例如指定顺序为第5个,则该距离阈值为排序后的距离中位于第5个的距离。例如,假设第一特征向量集合Z=(z1,z2,z3,z4),特征向量z1和z2、z3、z4的距离分别为10、9、8,特征向量z2和z3、z4的距离分别为11、6,z3和z4的距离为5,分位数为前50%。分析设备对获取的各个距离进行降序排列后得到的距离序列为:11、10、9、8、6、5,则距离阈值为9。
步骤B3、基于第一特征向量集合中每两个特征向量的距离,统计每个特征向量的近邻向量,第一特征向量集合中任一特征向量的近邻向量为第一特征向量集合中与该任一特征向量的距离小于距离阈值的其他特征向量。
仍然以前述步骤B2中的例子为例,特征向量z1的近邻向量为z4,特征向量z1的近邻向量的个数为1;特征向量z2的近邻向量为z4,特征向量z2的近邻向量的个数为1;特征向量z3的近邻向量为z4,特征向量z3的近邻向量的个数为1;特征向量z4的近邻向量为z1、z2和z3,特征向量z4的近邻向量的个数为3。
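步骤B1至B3的距离阈值选取与近邻向量统计可以参考下面的Python草图:对所有两两距离降序排序,取位于指定分位数处的距离作为距离阈值,再把距离小于阈值的其他特征向量计为近邻(此处以欧式距离与前50%分位数为例,均为示意性选择):

```python
import numpy as np

def build_neighbors(vectors, quantile=0.5):
    """示意:vectors为特征向量列表;返回每个向量的近邻下标集合以及所用的距离阈值。"""
    n = len(vectors)
    dists, all_d = {}, []
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(np.asarray(vectors[i], float) - np.asarray(vectors[j], float)))
            dists[(i, j)] = d
            all_d.append(d)
    # 将两两距离降序排序,取位于前quantile处的距离作为距离阈值
    all_d.sort(reverse=True)
    threshold = all_d[max(int(len(all_d) * quantile) - 1, 0)]
    neighbors = {i: set() for i in range(n)}
    for (i, j), d in dists.items():
        if d < threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors, threshold
```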
步骤3082、基于统计结果,将相同的近邻向量的数量大于数量阈值的每两个特征向量划分为同一类特征向量,该数量阈值为在第一特征向量集合中各个特征向量的近邻向量的数量中指定的数量。
示例的,该步骤3082可以包括以下步骤:
步骤C1、分析设备获取第一特征向量集合中各个特征向量的近邻向量的数量。
仍然以前述步骤B2中的例子为例,特征向量z1、z2、z3和z4的近邻向量的数量分别为1、1、1、3。
步骤C2、分析设备在第一特征向量集合中各个特征向量的近邻向量的数量中,确定数量阈值。
可选地,分析设备对获取的各个数量进行排序,例如升序排序或降序排序。该数量阈值可以为排序后的数量中位于指定分位数或者指定顺序的数量。该指定分位数或指定顺序为经验值,例如指定分位数为前50%或前60%。
例如,假设指定分位数为前50%,分析设备对获取的各个数量进行降序排列后得到的数量序列为:3、1、1、1。则数量阈值为1。
步骤C3、基于统计结果,将相同的近邻向量的数量大于数量阈值的每两个特征向量划分为同一类特征向量。
假设数量阈值为1,则在前述步骤B2的例子中,特征向量z1、z2、z3两两之间相同的近邻向量均为z4,数量为1;z4与z1、z2、z3之间相同的近邻向量的数量均为0;各数量均不大于数量阈值1,因此,特征向量z1、z2、z3和z4分别归为一个类别。
另假设数量阈值为1,特征向量z1和z4相同的近邻向量为z2、z3,特征向量z2和z3相同的近邻向量为z1和z4,特征向量z1与z2、z3相同的近邻向量均为空,特征向量z4与z2、z3相同的近邻向量为空,则特征向量z1和z4划分为同一类特征向量,特征向量z2和z3划分为同一类特征向量。
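步骤C1至C3按"相同近邻向量的数量"进行类别划分,可以参考下面的Python草图:数量阈值取各特征向量近邻数量降序排序后位于指定分位数处的数量,相同近邻数量大于该阈值的两个特征向量通过并查集归为同一类(同样仅为示意性实现):

```python
def cluster_by_shared_neighbors(neighbors, quantile=0.5):
    """示意:neighbors为{下标: 近邻下标集合};返回每个下标所属类别的编号。"""
    n = len(neighbors)
    counts = sorted((len(v) for v in neighbors.values()), reverse=True)
    count_threshold = counts[max(int(len(counts) * quantile) - 1, 0)]

    parent = list(range(n))                   # 并查集,用于把同一类的特征向量合并
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbors[i] & neighbors[j])
            if shared > count_threshold:      # 相同近邻向量的数量大于数量阈值,划为同一类
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```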
本申请实施例中,由于前述距离阈值和数量阈值是动态确定的,其中,距离阈值为在基于第一特征向量集合确定的多个距离中指定的距离,其反映了第一特征向量集合对应的多个距离的分布关系,是一个随第一特征向量集合的变化而变化的值;数量阈值为在第一特征向量集合中各个特征向量的近邻向量的数量中指定的数量,其反映了第一特征向量集合中各个特征向量的近邻向量的数量的分布关系,是一个随第一特征向量集合的变化而变化的值。因此,距离阈值和数量阈值是相对变化的值,基于这两个阈值中至少一个阈值最终划分得到的类别关系更准确,更能体现各个特征向量之间的关联性,提升聚类算法的适应性。
值得说明的是,在该第二种可选方式中,分析设备还可以向管理设备发送类别关系,管理设备可以按照类别关系,呈现第一特征向量集合对应的时间序列。例如,管理设备可以将属于同一类别的多个时间序列在同一用户页面显示,将属于不同类别的多个时间序列在不同用户页面显示;又例如,管理设备可以将属于不同类别的多个时间序列在同一用户页面的不同位置显示;再例如,管理设备将每个时间序列与其所属类别对应显示。管理设备按照类别关系呈现第一特征向量集合对应的时间序列,可以供专业人员在标注时参考该类别关系,起到辅助专业人员进行标签标注的作用。基于此,专业人员可以对属于同一类别的时间序列标注同一标签,提高标注效率,增加标签标注的准确性。
在另一种可选的实现方式中,可以采用共享最近邻(Shared Nearest Neighbor,SNN)算法进行聚类处理。相对于前述可选的实现方式所提供的聚类处理过程,采用SNN算法进行聚类处理时的距离阈值和数量阈值是预先设定的。
在再一种可选的实现方式中,还可以采用其他聚类算法进行聚类处理,例如采用基于神经网络模型的聚类算法进行聚类处理。本申请实施例对聚类处理所采用的算法不进行限定。
示例的,表2假设对样本数据ID为KPI_2的特征向量进行聚类处理,该特征向量对应的时间序列为(z1,z2,……,zn),该特征向量包括4个特征的数据,该4个特征的数据分别为:Moving_average、Weighted_mv、Tsd_seasonal和周期yoy,聚类处理后该特征向量对应的类别标识为"1"。
表2
Figure PCTCN2020112878-appb-000003
步骤309、接收管理设备发送的第一特征向量集合对应的时间序列的标签。
专业人员对该时间序列的标签进行标注后,管理设备接收标注的标签,并将该标签发送至分析设备,分析设备接收该标签,并将该标签与相应的时间序列对应保存。
综上所述,本申请实施例提供的标签确定方法,基于时间序列的特征向量的相似度进行标签的迁移,能够实现样本数据的自动标注,降低标签的确定成本。并且由于相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,例如能够降低采样时段、幅度变化、象限漂移和噪声等干扰信息的影响。提高了标签确定的准确性。尤其在高维时间序列中仍然能够准确地进行标签迁移。将本申请实施例提供的标签确定方法应用于监督式学习算法或半监督式学习算法等需要大量标注的样本数据的场景中,能够有效降低标注成本,提高机器学习模型的建模效率。
传统的标签确定方法基于时间序列的波形相似度进行标签迁移,对于一些时间序列本身波形不相似的情况,无法进行标签迁移。
而本申请实施例提供的标签确定方法,由于采用特征向量的相似度进行标签迁移,不局限于波形相似的时间序列的标签迁移,只要保证在某些特征维度上相似即可进行标签迁移,由此可知,本申请实施例可以适用于波形不同的时间序列的标签迁移。因此可以扩大标签泛化的场景,提升标签迁移的灵活性和利用率,降低机器学习模型的建模成本。尤其在异常检测场景中,可以实现某些相似特征的KPI间的标签迁移。
本申请实施例提供的标签确定方法的步骤先后顺序可以进行适当调整,步骤也可以根据情况进行相应增减,例如,前述步骤306可以与其他步骤并行执行。又例如,分析设备具有输入输出接口(如用户界面),其通过输入输出接口呈现第一时间序列,并接收第一时间序列的标签,无需执行步骤304和305中与管理设备的交互过程;或者,分析设备通过输入输出接口呈现第一特征向量对应的时间序列以及第二特征向量对应的时间序列,并接收第一特征向量对应的时间序列以及第二特征向量对应的时间序列的相同的标签,无需执行步骤3065和3066中与管理设备的交互过程;或者,分析设备通过输入输出接口呈现第三特征向量对应的时间序列以及第四特征向量对应的时间序列,并接收第三特征向量对应的时间序列以及第四特征向量对应的时间序列的相同的标签,无需执行步骤A5和A6中与管理设备的交互过程;或者,分析设备通过输入输出接口呈现第一特征向量集合对应的时间序列,并接收第一特征向量集合对应的时间序列的标签,无需执行上述步骤308和309中与管理设备的交互过程。任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化的方法,都应涵盖在本申请的保护范围之内,因此不再赘述。
本申请实施例提供一种标签确定装置80,如图8所示,所述装置包括:
第一获取模块801,用于获取第一时间序列的目标特征向量,时间序列为按照时序排列的一组数据的集合;
第二获取模块802,用于获取所述目标特征向量与参考特征向量集合中参考特征向量的相似度,所述参考特征向量为已确定标签的第二时间序列的特征向量;
确定模块803,用于当所述目标特征向量与第一参考特征向量的相似度大于相似度阈值时,将所述第一参考特征向量所对应的标签确定为所述第一时间序列的标签,所述第一参考特征向量为所述参考特征向量集合中的一个参考特征向量。
综上所述,本申请实施例提供的标签确定装置,第二获取模块基于时间序列的特征向量的相似度进行标签的迁移,能够实现样本数据的自动标注,降低标签的确定成本。并且由于相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,例如能够降低采样时段、幅度变化、象限漂移和噪声等干扰信息的影响。提高了标签确定的准确性。尤其在高维时间序列中仍然能够准确地进行标签迁移。将本申请实施例提供的标签确定装置应用于监督式学习算法或半监督式学习算法等需要大量标注的样本数据的场景中,能够有效降低标注成本,提高机器学习模型的建模效率。
可选地,所述第一时间序列为网络关键绩效指标KPI的时间序列。
可选地,所述参考特征向量包括一个或多个特征的数据,所述目标特征向量包括一个或多个特征的数据,
所述目标特征向量与所述第一参考特征向量的相似度为第一子特征向量和第二子特征向量的相似度,所述第一子特征向量和所述第二子特征向量分别由所述目标特征向量和所述第一参考特征向量中对应相同特征的数据组成。
可选地,所述第一子特征向量和所述第二子特征向量均以序列形式表征,所述第一子特征向量和所述第二子特征向量中相同位置的数据对应同一类别的特征,所述第一子特征向量和所述第二子特征向量的相似度,与所述第一子特征向量和所述第二子特征向量的距离负相关。
在一种可选方式中,如图9所示,所述装置80还包括:
第一发送模块804,用于当所述目标特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值时,向管理设备发送所述第一时间序列,以供所述管理设备呈现所述第一时间序列;
第一接收模块805,用于接收所述管理设备发送的所述第一时间序列的标签。
在另一种可选方式中,如图10所示,所述装置80还包括:
第三获取模块806,用于获取第一特征向量集合,所述第一特征向量集合中的任一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值,且所述任一特征向量对应的时间序列的标签未确定;
第二发送模块807,用于向管理设备发送所述第一特征向量集合对应的时间序列,以供所述管理设备呈现所述第一特征向量集合对应的时间序列;
第二接收模块808,用于接收所述管理设备发送的所述第一特征向量集合对应的时间序列的标签。
可选地,如图11所示,在图10所示的基础上,所述装置80还包括:
聚类模块809,用于在所述向管理设备发送所述第一特征向量集合对应的时间序列之前,对所述第一特征向量集合中的特征向量进行聚类处理,得到所述第一特征向量集合中特征向量的类别关系;
所述第二发送模块807,用于:
向所述管理设备发送所述类别关系以及所述第一特征向量集合对应的时间序列,以供所述管理设备按照所述类别关系,呈现所述第一特征向量集合对应的时间序列。
可选地,所述聚类模块809,用于:
基于所述第一特征向量集合中每两个特征向量的距离,统计每个所述特征向量的近邻向量,所述第一特征向量集合中任一特征向量的近邻向量为所述第一特征向量集合中与所述任一特征向量的距离小于距离阈值的其他特征向量,所述距离阈值为在基于所述第一特征向量集合确定的多个距离中指定的距离;
基于统计结果,将相同的近邻向量的数量大于数量阈值的每两个特征向量划分为同一类特征向量,所述数量阈值为在所述第一特征向量集合中各个特征向量的近邻向量的数量中指定的数量。
可选地,如图12所示,所述装置80还包括:
第四获取模块810,用于获取已确定标签的第三时间序列的第一特征向量;
第五获取模块811,用于获取所述第一特征向量与所述参考特征向量集合中参考特征向量的相似度;
第一添加模块812,用于当所述第一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于入库相似度阈值,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
在一种可选实现方式中,如图13所示,在图12所示的基础上,所述装置80还包括:
第二添加模块813,用于当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签相同时,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
在另一种可选实现方式中,如图14所示,在图12所示的基础上,所述装置80还包括:
第三发送模块814,用于当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签不同时,向管理设备发送所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列,以供所述管理设备呈现所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列;
第三接收模块815,用于接收所述管理设备发送的所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列的相同的标签;
更新模块816,用于基于接收的标签,更新预先存储的所述第一特征向量对应的时间序列的标签以及所述第二特征向量对应的时间序列的标签;
第三添加模块817,用于将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
可选地,所述目标特征向量和所述参考特征向量均包括统计特征、拟合特征或频域特征中的一种或多种特征的数据。
可选地,所述装置应用于网络分析器,所述参考特征向量对应的标签为异常检测标签。
综上所述,本申请实施例提供的标签确定装置,第二获取模块基于时间序列的特征向量的相似度进行标签的迁移,能够实现样本数据的自动标注,降低标签的确定成本。并且由于相似度计算与时间序列的特征向量相关,避免了时间序列自身所具有的干扰信息的影响,例如能够降低采样时段、幅度变化、象限漂移和噪声等干扰信息的影响。提高了标签确定的准确性。尤其在高维时间序列中仍然能够准确地进行标签迁移。将本申请实施例提供的标签确定装置应用于监督式学习算法或半监督式学习算法等需要大量标注的样本数据的场景中,能够有效降低标注成本,提高机器学习模型的建模效率。
图15是本申请实施例提供的一种标签确定装置的框图。该标签确定装置可以是分析设备。如图15所示,分析设备150包括:处理器1501和存储器1502。
存储器1502,用于存储计算机程序,计算机程序包括程序指令;
处理器1501,用于调用计算机程序,实现本申请实施例提供的标签确定方法。
可选地,该分析设备150还包括通信总线1503和通信接口1504。
其中,处理器1501包括一个或者一个以上处理核心,处理器1501通过运行计算机程序,从而执行各种功能应用以及数据处理。
存储器1502可用于存储计算机程序。可选地,存储器可存储操作系统和至少一个功能所需的应用程序单元。操作系统可以是实时操作系统(Real Time eXecutive,RTX)、LINUX、UNIX、WINDOWS或OS X之类的操作系统。
通信接口1504可以为多个,通信接口1504用于与其它存储设备或网络设备进行通信。例如在本申请实施例中,通信接口1504可以用于接收通信网络中的网络设备发送的样本数据。
存储器1502与通信接口1504分别通过通信总线1503与处理器1501连接。
本申请实施例提供了一种计算机存储介质,计算机存储介质上存储有指令,当指令被处理器执行时,实现本申请实施例提供的标签确定方法。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现,所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机的可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质,或者半导体介质(例如固态硬盘)等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (28)

  1. 一种标签确定方法,其特征在于,所述方法包括:
    获取第一时间序列的目标特征向量;
    获取所述目标特征向量与参考特征向量集合中参考特征向量的相似度,所述参考特征向量为已确定标签的第二时间序列的特征向量;
    当所述目标特征向量与第一参考特征向量的相似度大于相似度阈值时,将所述第一参考特征向量所对应的标签确定为所述第一时间序列的标签,所述第一参考特征向量为所述参考特征向量集合中的一个参考特征向量。
  2. 根据权利要求1所述的方法,其特征在于,所述第一时间序列为网络关键绩效指标KPI的时间序列。
  3. 根据权利要求1或2所述的方法,其特征在于,所述参考特征向量包括一个或多个特征的数据,所述目标特征向量包括一个或多个特征的数据,
    所述目标特征向量与所述第一参考特征向量的相似度为第一子特征向量和第二子特征向量的相似度,所述第一子特征向量和所述第二子特征向量分别由所述目标特征向量和所述第一参考特征向量中对应相同特征的数据组成。
  4. 根据权利要求3所述的方法,其特征在于,所述第一子特征向量和所述第二子特征向量的相似度,与所述第一子特征向量和所述第二子特征向量的距离负相关。
  5. 根据权利要求1至4任一所述的方法,其特征在于,所述方法还包括:
    当所述目标特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值时,向管理设备发送所述第一时间序列,以供所述管理设备呈现所述第一时间序列;
    接收所述管理设备发送的所述第一时间序列的标签。
  6. 根据权利要求1至4任一所述的方法,其特征在于,所述方法还包括:
    获取第一特征向量集合,所述第一特征向量集合中的任一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值,且所述任一特征向量对应的时间序列的标签未确定;
    向管理设备发送所述第一特征向量集合对应的时间序列,以供所述管理设备呈现所述第一特征向量集合对应的时间序列;
    接收所述管理设备发送的所述第一特征向量集合对应的时间序列的标签。
  7. 根据权利要求6所述的方法,其特征在于,在所述向管理设备发送所述第一特征向量集合对应的时间序列之前,所述方法还包括:
    对所述第一特征向量集合中的特征向量进行聚类处理,得到所述第一特征向量集合中特征向量的类别关系;
    所述向管理设备发送所述第一特征向量集合对应的时间序列,以供所述管理设备呈现所述第一特征向量集合对应的时间序列,包括:
    向所述管理设备发送所述类别关系以及所述第一特征向量集合对应的时间序列,以供所述管理设备按照所述类别关系,呈现所述第一特征向量集合对应的时间序列。
  8. 根据权利要求7所述的方法,其特征在于,所述对所述第一特征向量集合中的特征向量进行聚类处理,包括:
    基于所述第一特征向量集合中每两个特征向量的距离,统计每个所述特征向量的近邻向量,所述第一特征向量集合中任一特征向量的近邻向量为所述第一特征向量集合中与所述任一特征向量的距离小于距离阈值的其他特征向量,所述距离阈值为在基于所述第一特征向量集合确定的多个距离中指定的距离;
    基于统计结果,将相同的近邻向量的数量大于数量阈值的每两个特征向量划分为同一类特征向量。
  9. 根据权利要求1至8任一所述的方法,其特征在于,所述方法还包括:
    获取已确定标签的第三时间序列的第一特征向量;
    获取所述第一特征向量与所述参考特征向量集合中参考特征向量的相似度;
    当所述第一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于入库相似度阈值,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
  10. 根据权利要求9所述的方法,其特征在于,所述方法还包括:
    当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签相同时,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
  11. 根据权利要求9所述的方法,其特征在于,所述方法还包括:
    当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签不同时,向管理设备发送所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列,以供所述管理设备呈现所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列;
    接收所述管理设备发送的所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列的相同的标签;
    基于接收的标签,更新预先存储的所述第一特征向量对应的时间序列的标签以及所述第二特征向量对应的时间序列的标签;
    将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
  12. 根据权利要求1至11任一所述的方法,其特征在于,所述目标特征向量和所述参考特征向量均包括统计特征、拟合特征或频域特征中的一种或多种特征的数据。
  13. 根据权利要求1至12任一所述的方法,其特征在于,所述方法应用于网络分析器,所述参考特征向量对应的标签为异常检测标签。
  14. 一种标签确定装置,其特征在于,所述装置包括:
    第一获取模块,用于获取第一时间序列的目标特征向量,时间序列为按照时序排列的一组数据的集合;
    第二获取模块,用于获取所述目标特征向量与参考特征向量集合中参考特征向量的相似度,所述参考特征向量为已确定标签的第二时间序列的特征向量;
    确定模块,用于当所述目标特征向量与第一参考特征向量的相似度大于相似度阈值时,将所述第一参考特征向量所对应的标签确定为所述第一时间序列的标签,所述第一参考特征向量为所述参考特征向量集合中的一个参考特征向量。
  15. 根据权利要求14所述的装置,其特征在于,所述第一时间序列为网络关键绩效指标KPI的时间序列。
  16. 根据权利要求14或15所述的装置,其特征在于,所述参考特征向量包括一个或多个特征的数据,所述目标特征向量包括一个或多个特征的数据,
    所述目标特征向量与所述第一参考特征向量的相似度为第一子特征向量和第二子特征向量的相似度,所述第一子特征向量和所述第二子特征向量分别由所述目标特征向量和所述第一参考特征向量中对应相同特征的数据组成。
  17. 根据权利要求16所述的装置,其特征在于,所述第一子特征向量和所述第二子特征向量均以序列形式表征,所述第一子特征向量和所述第二子特征向量中相同位置的数据对应同一类别的特征,所述第一子特征向量和所述第二子特征向量的相似度,与所述第一子特征向量和所述第二子特征向量的距离负相关。
  18. 根据权利要求14至17任一所述的装置,其特征在于,所述装置还包括:
    第一发送模块,用于当所述目标特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值时,向管理设备发送所述第一时间序列,以供所述管理设备呈现所述第一时间序列;
    第一接收模块,用于接收所述管理设备发送的所述第一时间序列的标签。
  19. 根据权利要求14至17任一所述的装置,其特征在于,所述装置还包括:
    第三获取模块,用于获取第一特征向量集合,所述第一特征向量集合中的任一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于所述相似度阈值,且所述任一特征向量对应的时间序列的标签未确定;
    第二发送模块,用于向管理设备发送所述第一特征向量集合对应的时间序列,以供所述管理设备呈现所述第一特征向量集合对应的时间序列;
    第二接收模块,用于接收所述管理设备发送的所述第一特征向量集合对应的时间序列的标签。
  20. 根据权利要求19所述的装置,其特征在于,所述装置还包括:
    聚类模块,用于在所述向管理设备发送所述第一特征向量集合对应的时间序列之前,对所述第一特征向量集合中的特征向量进行聚类处理,得到所述第一特征向量集合中特征向量的类别关系;
    所述第二发送模块,用于:
    向所述管理设备发送所述类别关系以及所述第一特征向量集合对应的时间序列,以供所述管理设备按照所述类别关系,呈现所述第一特征向量集合对应的时间序列。
  21. 根据权利要求20所述的装置,其特征在于,所述聚类模块,用于:
    基于所述第一特征向量集合中每两个特征向量的距离,统计每个所述特征向量的近邻向量,所述第一特征向量集合中任一特征向量的近邻向量为所述第一特征向量集合中与所述任一特征向量的距离小于距离阈值的其他特征向量,所述距离阈值为在基于所述第一特征向量集合确定的多个距离中指定的距离;
    基于统计结果,将相同的近邻向量的数量大于数量阈值的每两个特征向量划分为同一类特征向量,所述数量阈值为在所述第一特征向量集合中各个特征向量的近邻向量的数量中指定的数量。
  22. 根据权利要求14至21任一所述的装置,其特征在于,所述装置还包括:
    第四获取模块,用于获取已确定标签的第三时间序列的第一特征向量;
    第五获取模块,用于获取所述第一特征向量与所述参考特征向量集合中参考特征向量的相似度;
    第一添加模块,用于当所述第一特征向量与所述参考特征向量集合中每个参考特征向量的相似度均不大于入库相似度阈值,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
  23. 根据权利要求22所述的装置,其特征在于,所述装置还包括:
    第二添加模块,用于当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签相同时,将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
  24. 根据权利要求22所述的装置,其特征在于,所述装置还包括:
    第三发送模块,用于当所述第一特征向量与所述参考特征向量集合中的第二特征向量的相似度大于所述入库相似度阈值,且所述第一特征向量对应的标签与所述第二特征向量对应的标签不同时,向管理设备发送所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列,以供所述管理设备呈现所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列;
    第三接收模块,用于接收所述管理设备发送的所述第一特征向量对应的时间序列以及所述第二特征向量对应的时间序列的相同的标签;
    更新模块,用于基于接收的标签,更新预先存储的所述第一特征向量对应的时间序列的标签以及所述第二特征向量对应的时间序列的标签;
    第三添加模块,用于将所述第一特征向量作为参考特征向量添加至所述参考特征集合中。
  25. 根据权利要求14至24任一所述的装置,其特征在于,所述目标特征向量和所述参考特征向量均包括统计特征、拟合特征或频域特征中的一种或多种特征的数据。
  26. 根据权利要求14至25任一所述的装置,其特征在于,所述装置应用于网络分析器,所述参考特征向量对应的标签为异常检测标签。
  27. 一种标签确定装置,其特征在于,包括:处理器和存储器;
    所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
    所述处理器,用于调用所述计算机程序,实现如权利要求1至13任一所述的标签确定方法。
  28. 一种计算机存储介质,其特征在于,所述计算机存储介质上存储有指令,当所述指令被处理器执行时,实现如权利要求1至13任一所述的标签确定方法。
PCT/CN2020/112878 2019-09-02 2020-09-01 标签确定方法、装置和系统 WO2021043140A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20861495.8A EP4020315A4 (en) 2019-09-02 2020-09-01 METHOD, DEVICE AND SYSTEM FOR DETERMINING LABELS
US17/683,973 US20220179884A1 (en) 2019-09-02 2022-03-01 Label Determining Method, Apparatus, and System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910824755.6A CN112446399A (zh) 2019-09-02 2019-09-02 标签确定方法、装置和系统
CN201910824755.6 2019-09-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/683,973 Continuation US20220179884A1 (en) 2019-09-02 2022-03-01 Label Determining Method, Apparatus, and System

Publications (1)

Publication Number Publication Date
WO2021043140A1 true WO2021043140A1 (zh) 2021-03-11

Family

ID=74734198

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112878 WO2021043140A1 (zh) 2019-09-02 2020-09-01 标签确定方法、装置和系统

Country Status (4)

Country Link
US (1) US20220179884A1 (zh)
EP (1) EP4020315A4 (zh)
CN (1) CN112446399A (zh)
WO (1) WO2021043140A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435499A (zh) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 标签分类方法、装置、电子设备和存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159583A (zh) * 2021-04-25 2021-07-23 上海淇玥信息技术有限公司 一种业务策略管理方法、装置和电子设备
CN113239990A (zh) * 2021-04-27 2021-08-10 中国银联股份有限公司 对序列数据进行特征处理的方法、装置及存储介质
CN115169709B (zh) * 2022-07-18 2023-04-18 华能汕头海门发电有限责任公司 一种基于数据驱动的电站辅机故障诊断方法及系统
CN116051558B (zh) * 2023-03-31 2023-06-16 菲特(天津)检测技术有限公司 一种缺陷图像标注方法、装置、设备及介质
CN116820056B (zh) * 2023-08-29 2023-11-14 青岛义龙包装机械有限公司 用于袋式包装机的生产工艺参数处理方法
CN117332303B (zh) * 2023-12-01 2024-03-26 太极计算机股份有限公司 一种用于集群的标签纠正方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013192087A (ja) * 2012-03-14 2013-09-26 Fujitsu Ltd 雑音抑制装置、マイクロホンアレイ装置、雑音抑制方法、及びプログラム
CN103337248A (zh) * 2013-05-17 2013-10-02 南京航空航天大学 一种基于时间序列核聚类的机场噪声事件识别方法
CN104794484A (zh) * 2015-04-07 2015-07-22 浙江大学 基于分段正交多项式分解的时序数据最近邻分类方法
CN107766426A (zh) * 2017-09-14 2018-03-06 北京百分点信息科技有限公司 一种文本分类方法、装置及电子设备

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462217B (zh) * 2014-11-09 2017-09-29 浙江大学 一种基于分段统计近似表示的时间序列相似性度量方法
US10235633B2 (en) * 2014-12-19 2019-03-19 Medidata Solutions, Inc. Method and system for linking heterogeneous data sources

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013192087A (ja) * 2012-03-14 2013-09-26 Fujitsu Ltd 雑音抑制装置、マイクロホンアレイ装置、雑音抑制方法、及びプログラム
CN103337248A (zh) * 2013-05-17 2013-10-02 南京航空航天大学 一种基于时间序列核聚类的机场噪声事件识别方法
CN104794484A (zh) * 2015-04-07 2015-07-22 浙江大学 基于分段正交多项式分解的时序数据最近邻分类方法
CN107766426A (zh) * 2017-09-14 2018-03-06 北京百分点信息科技有限公司 一种文本分类方法、装置及电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4020315A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435499A (zh) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 标签分类方法、装置、电子设备和存储介质
CN113435499B (zh) * 2021-06-25 2023-06-20 平安科技(深圳)有限公司 标签分类方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
EP4020315A4 (en) 2022-10-12
US20220179884A1 (en) 2022-06-09
CN112446399A (zh) 2021-03-05
EP4020315A1 (en) 2022-06-29

Similar Documents

Publication Publication Date Title
WO2021043140A1 (zh) 标签确定方法、装置和系统
US11017220B2 (en) Classification model training method, server, and storage medium
US20210124983A1 (en) Device and method for anomaly detection on an input stream of events
CN107766929B (zh) 模型分析方法及装置
WO2021000958A1 (zh) 用于实现模型训练的方法及装置、计算机存储介质
CN110196908A (zh) 数据分类方法、装置、计算机装置及存储介质
WO2021103823A1 (zh) 模型更新系统、模型更新方法及相关设备
EP4177792A1 (en) Ai model updating method and apparatus, computing device and storage medium
CN108830417B (zh) 一种基于arma和回归分析的生活能源消费预测方法及系统
CN111796957B (zh) 基于应用日志的交易异常根因分析方法及系统
WO2017071369A1 (zh) 一种预测用户离网的方法和设备
CN110458096A (zh) 一种基于深度学习的大规模商品识别方法
CN111949429A (zh) 基于密度聚类算法的服务器故障监测方法及系统
CN110458022A (zh) 一种基于域适应的可自主学习目标检测方法
CN114495498B (zh) 一种交通数据分布有效性判别方法及装置
CN113703506B (zh) 一种建筑材料生产车间环境控制调节方法及系统
CN117041017B (zh) 数据中心的智能运维管理方法及系统
CN110287256B (zh) 一种基于云计算的电网数据并行处理系统及其处理方法
CN109801394B (zh) 一种工作人员考勤方法及装置、电子设备和可读存储介质
US11782923B2 (en) Optimizing breakeven points for enhancing system performance
CN111090585A (zh) 一种基于众测过程的众测任务关闭时间自动预测方法
CN115935285A (zh) 基于掩码图神经网络模型的多元时间序列异常检测方法和系统
CN113656452A (zh) 调用链指标异常的检测方法、装置、电子设备及存储介质
CN110569277A (zh) 一种配置数据信息自动识别与归类方法及系统
TWI831462B (zh) 客戶消費行為預測系統及客戶消費行為預測方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20861495

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020861495

Country of ref document: EP

Effective date: 20220324