WO2022111284A1 - Data annotation processing method and apparatus, storage medium, and electronic apparatus - Google Patents

Data annotation processing method and apparatus, storage medium, and electronic apparatus Download PDF

Info

Publication number
WO2022111284A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
label
sample set
feature vector
newly added
Prior art date
Application number
PCT/CN2021/129871
Other languages
English (en)
French (fr)
Inventor
严心月
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2022111284A1 publication Critical patent/WO2022111284A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Definitions

  • the embodiments of the present application relate to the field of data, and in particular, to a method, device, storage medium, and electronic device for processing data annotation.
  • the timely detection and accurate positioning of network faults play a pivotal role in ensuring the stable operation of the wireless network environment and system, so as to meet the communication needs of daily life, business, and public services.
  • the wireless network operation process often requires substantial manpower and relies on experienced industry experts to participate in the diagnosis process.
  • business personnel can detect indicator anomalies through real-time monitoring, and further correlate and drill down to find the root causes of faults, so as to realize quick location and solution support for multiple fault types, including transmission faults and network hardware equipment abnormalities.
  • unsupervised clustering methods or supervised classification methods are mainly used.
  • the former requires business experts to label and confirm the clustering results during application, and the entire model must be updated for streaming input data, so its stability is poor and it cannot well meet the requirements of incremental abnormal-data classification and labeling; although the latter can make full use of existing category information, it places high demands on the label completeness and sufficiency of the training data, and it also has the model-update problem, so it cannot adapt well to streaming data.
  • the embodiments of the present application provide a data labeling processing method, device, storage medium, and electronic device, so as to at least partially solve the problem in the related art that performance indicator data labeled through a supervised classification method cannot adapt well to streaming data.
  • a data labeling processing method includes: performing anomaly detection on performance indicator data to obtain a sample set composed of abnormal points and a label set corresponding to the sample set; performing feature expansion on the sample set to obtain the feature vector of the sample set and the corresponding label values; performing feature selection on the feature vector to obtain the target feature vector of the sample set; and labeling newly added samples according to the target feature vector of the sample set.
  • a data labeling processing device includes: an abnormality detection module, configured to perform abnormality detection on performance index data, and obtain a sample set composed of abnormal points and the sample set a corresponding label set; a feature expansion module, configured to perform feature expansion on the sample set, to obtain a feature vector of the sample set and a corresponding label value; a feature selection module, configured to perform feature selection on the feature vector , to obtain the target feature vector of the sample set; and a first labeling module, configured to incrementally label the newly added samples according to the target feature vector of the sample set.
  • a computer-readable storage medium is also provided, where a computer program is stored in the storage medium, wherein the computer program is configured to execute the steps in any one of the above method embodiments when running.
  • an electronic device is provided, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • Fig. 1 is the hardware structure block diagram of the mobile terminal of the data labeling processing method of the embodiment of the present application;
  • FIG. 2 is a flowchart of a data labeling processing method according to an embodiment of the present application.
  • FIG. 8 is a flowchart of tag propagation according to an embodiment of the present application.
  • FIG. 9 is a block diagram of a data annotation processing apparatus according to an embodiment of the present application.
  • FIG. 1 is a block diagram of the hardware structure of a mobile terminal of the data labeling processing method according to an embodiment of the present application.
  • the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1).
  • a processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, etc.
  • a memory 104 for storing data
  • the above-mentioned mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108.
  • It can be understood that the structure shown in FIG. 1 is only a schematic diagram, which does not limit the structure of the above-mentioned mobile terminal.
  • the mobile terminal may also include more or fewer components than those shown in FIG. 1 , or have a different configuration than that shown in FIG. 1 .
  • the memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the data annotation processing method in the embodiments of the present application. The processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above method.
  • Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, and these remote memories may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • Transmission means 106 are used to receive or transmit data via a network.
  • the specific example of the above-mentioned network may include a wireless network provided by a communication provider of the mobile terminal.
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet in a wireless manner.
  • FIG. 2 is a flowchart of the data annotation processing method according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
  • Step S202 perform abnormality detection on the performance index data, and obtain a sample set composed of abnormal points and a label set corresponding to the sample set;
  • Step S204 performing feature expansion on the sample set to obtain a feature vector and a corresponding label value of the sample set
  • step S204 may specifically include:
  • performing feature expansion on the sample set using one or more of the following: the difference model, Holt-Winters time series model, moving average model, moving median model, time series decomposition model, time series decomposition median model, and wavelet transform model.
  • Step S206 performing feature selection on the feature vector to obtain the target feature vector of the sample set
  • step S206 may specifically include:
  • Feature items capable of distinguishing different abnormal types are selected from the feature vector to obtain the target feature vector of the sample set.
  • Step S208 label the newly added samples according to the target feature vector of the sample set.
  • abnormality detection is performed on the performance indicator data to obtain a sample set composed of abnormal points and a label set corresponding to the sample set; feature expansion is performed on the sample set to obtain the feature vector of the sample set and the corresponding label values; feature selection is performed on the feature vector to obtain the target feature vector of the sample set; and newly added samples are labeled according to the target feature vector of the sample set. This can solve the problem in the related art that labeling performance indicator data through a supervised classification method cannot adapt well to streaming data, realize effective discrimination of the failure causes of wireless-network key performance indicator data, and adapt better to streaming data.
  • the above step S208 may specifically include: in the case that the newly added sample is not marked, marking the label of the newly added sample according to the target feature vector of the sample set and the corresponding label set.
  • step S208 may specifically include:
  • step S2082 may specifically include:
  • the distance between all sample points in the sample point set and the target feature vector of the sample set is determined as the distance between the newly added sample and the target feature vector of the sample set.
  • step S2083 may specifically include:
  • the sums of the elements in each row of the weight matrix are determined respectively, and the diagonal matrix is obtained by combining the row sums.
  • the foregoing S2085 may specifically include:
  • the newly added sample is labeled according to the estimated label value.
  • unmarked samples adjacent to the newly added samples are determined, and the unmarked samples are added to the candidate set; for an unlabeled sample in the candidate set, the label estimate of the unlabeled sample is determined according to the sub-matrix of the transition probability matrix and the sub-matrix of the label matrix; if the L1 norm of the difference between the label estimate and the initial label value of the unlabeled sample in the candidate set is greater than the preset threshold, the label value of the unlabeled sample is updated according to the label estimate, wherein the initial label value is a 0 vector.
  • the missing values in the performance indicator data are determined; one piece of data or multiple pieces of data at the same historical moment corresponding to the sampling time of each missing value are acquired; and the missing value is filled according to the mean value of the one or more pieces of data.
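The filling rule just described can be sketched as follows. This is a minimal illustration assuming evenly sampled data in which `period` samples separate the same moment on consecutive days; the function name and layout are not from the patent:

```python
import numpy as np

def fill_missing(series, period):
    """Fill NaN entries with the mean of values at the same position
    in earlier periods (e.g. the same time-of-day on previous days)."""
    filled = series.copy().astype(float)
    for i in range(len(filled)):
        if np.isnan(filled[i]):
            # values at the same historical moment: i - period, i - 2*period, ...
            history = [filled[j] for j in range(i - period, -1, -period)
                       if not np.isnan(filled[j])]
            if history:
                filled[i] = float(np.mean(history))
    return filled
```

For a 15-minute sampling granularity, `period` would be 96; the mean is taken only over the historical moments that are themselves present.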
  • data mining and machine learning methods are used to perform feature extraction, feature selection, and semi-supervised label propagation on the streaming input of wireless-network key performance indicator data, so as to label sample data of unknown types and thereby automatically expand the labeled samples, which assists the closed-loop optimization of the root-cause localization operator. Further, the results can also be used directly in follow-up analysis to clarify the categories of faults.
  • the input data objects are the time-series data sets of the core performance indicators of the wireless network service at the abnormal time points obtained by the anomaly detection algorithm, and of the counters of service concern.
  • the first step is to preprocess the input data, and fill in the missing values by the following methods:
  • the granularity of sample collection time is unified, and the model is initialized based on the initial sample set after processing.
  • the second step is to implement feature engineering for the processed core indicator data, which mainly includes:
  • Feature extraction including:
  • this module mainly adopts the following general models for time-series data processing: the difference model, Holt-Winters time series model, moving average model, moving median model, time series decomposition model, time series decomposition median model, and wavelet transform model.
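Three of the listed operators can be sketched as simple transforms of a KPI series; an illustrative sketch (the function names are assumptions, and the real module is configurable):

```python
import numpy as np

def diff_feature(x):
    """First-order difference; captures sudden level changes."""
    return np.diff(x, prepend=x[0])

def moving_average(x, w=3):
    """Trailing moving average over a window of w points."""
    out = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        out[i] = x[max(0, i - w + 1): i + 1].mean()
    return out

def moving_median(x, w=3):
    """Trailing moving median over a window of w points."""
    out = np.empty_like(x, dtype=float)
    for i in range(len(x)):
        out[i] = np.median(x[max(0, i - w + 1): i + 1])
    return out
```

Each operator produces one feature series per indicator, matching the description of independent perceptrons per indicator object in S302.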
  • Feature selection including:
  • the feature selection is carried out according to the feature data obtained by the feature enhancement module.
  • For the feature vector x = [x_1, x_2, …, x_L]^T and the corresponding sample label value y, solve for the minimum feature subset x_θ* such that the feature-selection probability equation p(y | x) ≈ p(y | x_θ*) holds, so as to realize the dimensionality-reduction process from the original dimension L to the selected dimension M, where p is the true mapping of the functional relationship between the given feature set x and the occurrence probability of the label value y.
  • an approximate prediction model q of p can be constructed, and the target feature set x_θ* can be obtained through maximum likelihood estimation, where the subscript θ denotes the selected features and the parameter of q is the parameter used to predict the category label.
  • the purpose of the above solution is to find a minimum target feature set x_θ* so that the prediction model q is infinitely close to the real model p.
  • the result of x_θ* can be obtained by normalizing the above expression, taking the logarithm, calculating mutual information, etc.
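The mutual-information route mentioned above can be sketched for discretized features: score every feature column by I(x_j; y) and keep the M best. A minimal sketch under the assumption of discrete feature values; the greedy conditional-mutual-information refinement described later (FIG. 6) is omitted:

```python
import numpy as np
from collections import Counter

def mutual_information(f, y):
    """I(f; y) for two discrete sequences, in nats."""
    n = len(f)
    pf, py = Counter(f), Counter(y)
    pfy = Counter(zip(f, y))
    mi = 0.0
    for (a, b), c in pfy.items():
        p_ab = c / n
        mi += p_ab * np.log(p_ab / ((pf[a] / n) * (py[b] / n)))
    return mi

def select_features(X, y, m):
    """Keep the m feature columns with highest MI against the labels.
    X: n x L matrix of discretized feature values; y: n labels."""
    scores = [mutual_information(tuple(X[:, j]), tuple(y))
              for j in range(X.shape[1])]
    return sorted(np.argsort(scores)[-m:].tolist())
```

A feature column identical to the labels scores I = H(y) = ln 2 for balanced binary labels, while a constant column scores 0 and is dropped.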
  • the third step is to perform incremental marking based on the screening feature data, which mainly includes:
  • the relationship pair of each sample's feature values and class label can be expressed as (x, y), where x lies in the M-dimensional feature space obtained by feature selection and y is the labeled category label.
  • sample data k-nn relationship structure including:
  • the distance calculation method is: d(i, j) = sqrt((x_i − x_j)^T M (x_i − x_j)), where M is the identity matrix (a square matrix whose main-diagonal elements are 1 and whose remaining elements are 0), i and j are two different sample objects, and N(i) is the neighbor set of sample point i (a variety of calculation methods can be used; this patent adopts the k-nn calculation based on Euclidean distance), that is, j belongs to the set of the k nearest sample points of i.
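The distance and neighbor-set computation can be sketched directly; with M the identity matrix the quadratic form reduces to the ordinary Euclidean distance used by the k-nn construction (the names are illustrative):

```python
import numpy as np

def knn_distance(xi, xj, M=None):
    # d(i, j) = sqrt((xi - xj)^T M (xi - xj)); M = I gives Euclidean distance
    d = xi - xj
    if M is None:
        M = np.eye(len(d))
    return float(np.sqrt(d @ M @ d))

def knn_neighbors(X, i, k):
    # N(i): indices of the k nearest sample points of sample i
    dists = sorted((knn_distance(X[i], X[j]), j)
                   for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]
```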
  • the obtained distance values are the weights of the edges between the node corresponding to the new sample and the known nodes, forming a weight matrix W (weight matrix: the element values represent the weight of the edge between any two sample points, that is, the degree of similarity of the two sample objects in the current feature dimensions).
  • a diagonal matrix D is constructed (diagonal matrix: a matrix whose elements outside the main diagonal are all 0), where the value of each diagonal element is the sum of the elements in the corresponding row.
  • the transition probability matrix P (transition matrix: the elements are all non-negative and the sum of the elements in each row is 1; an element expresses the probability of one state transitioning to another state) is: P = D^{-1} W.
  • P_LL, P_LU, P_UL, and P_UU are the sub-matrices of the transition probability matrix corresponding to labeled-to-labeled, labeled-to-unlabeled, unlabeled-to-labeled, and unlabeled-to-unlabeled sample objects, respectively.
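The construction of P and its partition into the four blocks can be sketched as below; the row normalization P = D⁻¹W and the convention that labeled samples occupy the first rows/columns are assumptions consistent with, but not spelled out by, the text:

```python
import numpy as np

def transition_matrix(W):
    # D: diagonal matrix whose i-th entry is the sum of row i of W
    D = np.diag(W.sum(axis=1))
    # P = D^{-1} W is row-stochastic: each row sums to 1
    return np.linalg.inv(D) @ W

def partition(P, n_labeled):
    # blocks P_LL, P_LU, P_UL, P_UU, with labeled samples first
    l = n_labeled
    return P[:l, :l], P[:l, l:], P[l:, :l], P[l:, l:]
```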
  • Update the label matrix including:
  • the label matrix is updated as f_{n+1} = P_{n+1,1:n} · F, where P_{n+1,1:n} is the (n+1)-th row of the above transition probability matrix, F is the above label matrix, and f_{n+1}, expressed in 1 × (n+1) vector form, is the label estimate of the newly added unlabeled sample object n+1; for labeled samples, their original label values are kept unchanged.
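The update rule f_{n+1} = P_{n+1,1:n} · F can be sketched in one line; here F is assumed to hold one row per known sample and one column per class (a one-hot encoding, which the text does not fix):

```python
import numpy as np

def estimate_new_label(P_row, F):
    """f_{n+1} = P_{n+1,1:n} @ F: transition probabilities from the new
    node to the known nodes, applied to the label matrix F."""
    return P_row @ F
```

The predicted class of the new sample is then `np.argmax` over the estimate; labeled samples keep their original label values, as the text states.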
  • Limited label dissemination mainly including:
  • the label propagation algorithm is the core algorithm of the incremental labeling sub-module.
  • the class label diffusion and propagation can be achieved with minimal resource consumption.
  • for the newly added unlabeled sample n+1, on the basis of estimating its own label (if it carries no label category information), all unlabeled samples belonging to the neighbor nodes in the k-nn relationship of the sample are included in the candidate set, and the label estimate for any sample object k in the set is updated as follows: f_k = P_UL(k) · F_L + P_UU(k) · F_U, where:
  • P_UL(k) is the k-th row of the transition probability sub-matrix P_UL;
  • F_L is the sub-label matrix corresponding to the labeled sample objects;
  • P_UU(k) is the k-th row of the transition probability sub-matrix P_UU;
  • F_U is the sub-label matrix corresponding to the unlabeled sample objects (initially a 0 matrix).
  • L1 norm: the sum of the absolute values of the elements in a vector.
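The limited propagation loop can be sketched as below. The stopping rules mirror the text (candidate set non-empty, iteration count below T_max, and an update accepted only when the L1 norm of the change exceeds a threshold); the requeue policy for affected neighbors is an assumption:

```python
import numpy as np

def limited_propagation(P_UL, P_UU, F_L, F_U, candidates, eps=1e-3, t_max=100):
    """Locally diffuse labels: for each candidate unlabeled sample k,
    recompute f_k = P_UL[k] @ F_L + P_UU[k] @ F_U and accept the update
    only when the L1 norm of the change exceeds eps; stop when the
    candidate set is empty or t_max iterations are reached."""
    F_U = F_U.copy()
    queue = list(candidates)
    it = 0
    while queue and it < t_max:
        k = queue.pop(0)
        f_k = P_UL[k] @ F_L + P_UU[k] @ F_U
        if np.abs(f_k - F_U[k]).sum() > eps:
            F_U[k] = f_k
            # a changed label may affect other unlabeled samples too
            queue.extend(j for j in range(len(F_U))
                         if j != k and j not in queue)
        it += 1
    return F_U
```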
  • this embodiment is based on a multi-classification framework for streaming data built on an incremental labeling algorithm; based on configurable feature engineering including feature extraction and selection, the deviation degree of the label estimate is used as the constraint condition for local diffusion updates, which allows unlabeled sample objects to dynamically update their label types according to the input and achieves an adaptive multi-classification target at a small computational cost.
  • Fig. 3 is the main flow chart of the algorithm according to the present embodiment, as shown in Fig. 3, including:
  • S302 perform feature extraction on the data, use the get_feature function to perform extraction function configuration, and form several independent perceptrons for each indicator object;
  • S303 perform feature selection on the extracted feature objects, and select features highly correlated with the labels from the scaled feature data;
  • FIG. 4 is a flowchart of feature extraction according to the present embodiment, as shown in FIG. 4 , including:
  • S402 feature item selection, modify feature_list to configure feature items, define feature operators through feature_mapping, and the algorithm dynamically sets the number of parallel processes according to the feature item configuration.
  • the current default configuration items are:
  • Diff Difference Model
  • Historical average (windows of 1, 2, 3, and 4 weeks), using the average value of historical data over a specific window length as the feature value.
  • TSD Time Series Decomposition
  • the seasonal component, trend component, and residual component can be obtained; the feature item is the product of the mean values of the components (using the multiplicative decomposition method).
  • ARMA Autoregressive Moving Average Model
  • S404 feature enhancement, calculate the error between the extracted feature value and the original data, and perform feature enhancement on the error to improve the ability to characterize abnormal data fluctuations;
  • FIG. 5 is a flowchart of feature enhancement according to this embodiment, as shown in FIG. 5 , including:
  • KPI: Key Performance Indicator
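Step S404's feature enhancement computes the error between an extracted feature value and the original KPI series and amplifies it; a minimal sketch, in which z-score normalization stands in for the unspecified "enhancement":

```python
import numpy as np

def enhance(x, feature):
    # error between the extracted feature value and the original data
    err = x - feature
    std = err.std()
    # z-score style amplification (an assumed choice); makes abnormal
    # fluctuations stand out relative to normal noise
    return err / std if std > 0 else err
```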
  • FIG. 6 is a flowchart of feature selection according to the present embodiment, as shown in FIG. 6 , including:
  • S601 feature data input, obtain feature data for feature engineering as algorithm input;
  • S604 initialize the selected feature set, according to the feature correlation result in S303, initialize the intermediate result pool and create the initial level of the feature;
  • step S605: judge whether the selected features satisfy the termination condition; if the judgment result is no, execute step S606: calculate the mutual information and conditional mutual information values and iteratively update the intermediate result pool until the termination condition is met; if the judgment result is yes, execute step S607 and jump to S608;
  • FIG. 7 is a flowchart of incremental marking according to the present embodiment, as shown in FIG. 7 , including:
  • S701 feature data input: use the feature set screened by the feature selection module as the algorithm input, divide out the initial marked sample set as training data for model initialization, and stream in the remaining data;
  • step S705: judge whether the loop condition is satisfied. Take the unmarked objects among the neighbor nodes of the current node as the candidate diffusion set, and for each element k of the candidate set check the loop conditions: 1) the candidate diffusion set is not empty, and 2) the number of iterations is less than the threshold T_max; if the conditions are satisfied, go to step S706 and stay in the loop, otherwise end the iteration and jump to step S708;
  • FIG. 8 is a flowchart of tag propagation according to the present embodiment, as shown in FIG. 8 , including:
  • the candidate set data is input, and the candidate label diffusion sample set obtained in S505 is used as the algorithm input;
  • step S802: loop condition: judge whether the candidate set is not empty and the number of iterations is less than the threshold T_max; if the judgment result is yes, go to step S803 and stay in the loop, otherwise go to step S806 and output the result;
  • ⁇ f i P UL(i) F L +P UU(i) F U -F U(i) , where the absolute value of the deviation is the influence factor.
  • FIG. 9 is a block diagram of the data annotation processing apparatus according to this embodiment. As shown in FIG. 9 , the apparatus includes:
  • the abnormality detection module 92 is configured to perform abnormality detection on the performance index data, and obtain a sample set composed of abnormal points and a label set corresponding to the sample set;
  • the feature expansion module 94 is used to perform feature expansion on the sample set to obtain the feature vector of the sample set and the corresponding label value;
  • Feature selection module 96 for performing feature selection on the feature vector to obtain the target feature vector of the sample set
  • the first labeling module 98 is configured to incrementally label the newly added samples according to the target feature vector of the sample set.
  • the first labeling module 98 is further configured to, in the case that the newly added sample is not marked, label the newly added sample according to the target feature vector of the sample set and the corresponding label set.
  • the first labeling module 98 includes:
  • a first determination submodule configured to determine the distance between the newly added sample and the target feature vector of the sample set if some or all of the sample points in the newly added sample are adjacent to the sample set;
  • a second determination submodule configured to determine that the distance is the weight of the edge between the node in the newly added sample and each node in the sample set, to obtain a weight matrix
  • a third determination submodule configured to determine a transition probability matrix according to the diagonal matrix and the weight matrix
  • the labeling sub-module is configured to label the labels of the newly added samples according to the transition probability matrix.
  • the first determination submodule includes:
  • an acquisition unit configured to acquire a set of adjacent sample points belonging to the sample set in the newly added sample
  • a first determining unit used for determining the distance between all sample points in the sample point set and the target feature vector of the sample set
  • the second determining unit is configured to determine the distance between all sample points in the sample point set and the target feature vector of the sample set as the distance between the newly added sample and the target feature vector of the sample set.
  • the construction module includes:
  • the third determining unit is used to respectively determine the sum of all elements in each row of the weight matrix;
  • a combining unit configured to combine the row sums to obtain the diagonal matrix.
  • the labeling submodule includes:
  • an obtaining unit configured to obtain, from the transition probability matrix, the values of the row corresponding to the newly added sample and the columns corresponding to the target feature vector of the sample set, to obtain the target transition probability matrix;
  • a fourth determination unit configured to determine the product of the target transition probability matrix and the label set corresponding to the sample set as the label estimate value of the newly added sample
  • a labeling unit configured to label the newly added samples according to the label estimation value.
  • the apparatus further includes:
  • the adding module is used to determine the unlabeled samples adjacent to the newly added samples, and add the unlabeled samples to the candidate set;
  • the second labeling module is configured to label the unlabeled samples in the candidate set according to the newly added samples.
  • the second labeling module includes:
  • a fourth determination sub-module configured to determine, for the unlabeled samples in the candidate set, the estimated label value of the unlabeled sample according to the sub-matrix of the transition probability matrix and the sub-matrix of the sample set label matrix;
  • the update sub-module is used to update the label value of the unlabeled sample according to the label estimate if the L1 norm of the difference between the label estimate and the initial label value of the unlabeled sample in the candidate set is greater than a preset threshold, wherein the initial label value is a 0 vector.
  • the apparatus further includes:
  • a determination module for determining missing values in the performance indicator data
  • an acquisition module for acquiring one or more pieces of data at the same historical moment corresponding to the sampling time of the missing value;
  • a filling module configured to fill in the missing value according to the mean value of the one or more pieces of data.
  • the feature expansion module is also used to perform feature expansion on the sample set using one or more of the following: the difference model, Holt-Winters time series model, moving average model, moving median model, time series decomposition model, time series decomposition median model, and wavelet transform model.
  • the feature selection module is also used to:
  • Feature items capable of distinguishing different abnormal types are selected from the feature vector to obtain the target feature vector of the sample set.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, wherein the computer program is configured to execute the steps in any of the above method embodiments when running.
  • the above-mentioned computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, a CD-ROM, and other media that can store computer programs.
  • Embodiments of the present application further provide an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • the above-mentioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the above-mentioned processor, and the input-output device is connected to the above-mentioned processor.
  • anomaly detection is performed on performance indicator data to obtain a sample set composed of abnormal points and a label set corresponding to the sample set; feature expansion is performed on the sample set to obtain the feature vector of the sample set and the corresponding label values; feature selection is performed on the feature vector to obtain the target feature vector of the sample set; and newly added samples are labeled according to the target feature vector of the sample set. This can solve the problem in the related art that performance indicator data labeled by a supervised classification method cannot adapt well to streaming data; it enables effective discrimination of the failure causes of wireless-network key performance indicator data and adapts better to streaming data.
  • the modules or steps of the present application can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network composed of multiple computing devices; they can be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases the described steps can be performed in an order different from that shown here; alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module.
  • the present application is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data annotation processing method and apparatus, a storage medium, and an electronic apparatus. The method includes: performing anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set; performing feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values; performing feature selection on the feature vectors to obtain target feature vectors of the sample set; and annotating newly added samples according to the target feature vectors of the sample set.

Description

Data Annotation Processing Method and Apparatus, Storage Medium, and Electronic Apparatus
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202011349875.4 filed on November 26, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of this application relate to the field of data processing, and in particular to a data annotation processing method and apparatus, a storage medium, and an electronic apparatus.
Background
As a key scenario and an important link in an intelligent operation and maintenance system, the timely discovery and accurate localization of network faults plays a decisive role in keeping the wireless network environment and systems running stably, and thereby in meeting the communication needs of daily life, business, and public services. To meet this demand, wireless network operation usually requires substantial manpower and relies on experienced domain experts to participate in the diagnosis process. Based on a key performance indicator system that characterizes the operating status of each network component and the overall health of the network, operations staff detect indicator anomalies through real-time monitoring, and then correlate and drill down to mine the root causes of faults, so as to support rapid localization and resolution of fault types including transmission faults and network hardware anomalies. At present, this process has been implemented step by step and verified through anomaly detection operators, cell operators, and root cause localization operators, but operations staff still need to intervene in the root cause localization module to label samples manually, in order to assist fault demarcation and close the loop of algorithm optimization; the distillation and consolidation of expert experience has therefore not yet been fully achieved.
In view of the above, unsupervised clustering methods or supervised classification methods are currently the main approaches. The former requires domain experts to confirm and label the clustering results, and the entire model must be updated for streaming input data, so its stability is poor and it cannot satisfactorily meet the requirements of classifying and labeling incremental anomaly data. The latter, although able to make full use of existing class information, places high demands on the completeness and sufficiency of the training data labels, suffers from the same model update problem, and thus cannot adapt well to streaming data.
For the problem in the related art that labeling performance indicator data with supervised classification methods cannot adapt well to streaming data, no solution has yet been proposed.
Summary
Embodiments of this application provide a data annotation processing method and apparatus, a storage medium, and an electronic apparatus, to at least partially solve the problem in the related art that labeling performance indicator data with supervised classification methods cannot adapt well to streaming data.
According to an embodiment of this application, a data annotation processing method is provided. The method includes: performing anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set; performing feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values; performing feature selection on the feature vectors to obtain target feature vectors of the sample set; and annotating newly added samples according to the target feature vectors of the sample set.
According to another embodiment of this application, a data annotation processing apparatus is provided. The apparatus includes: an anomaly detection module, configured to perform anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set; a feature expansion module, configured to perform feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values; a feature selection module, configured to perform feature selection on the feature vectors to obtain target feature vectors of the sample set; and a first annotation module, configured to incrementally annotate newly added samples according to the target feature vectors of the sample set.
According to yet another embodiment of this application, a computer-readable storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to perform, when run, the steps in any one of the above method embodiments.
According to yet another embodiment of this application, an electronic apparatus is further provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.
Brief Description of the Drawings
Fig. 1 is a block diagram of the hardware structure of a mobile terminal running the data annotation processing method according to an embodiment of this application;
Fig. 2 is a flowchart of the data annotation processing method according to an embodiment of this application;
Fig. 3 is the main flowchart of the algorithm according to an embodiment of this application;
Fig. 4 is a flowchart of feature extraction according to an embodiment of this application;
Fig. 5 is a flowchart of feature enhancement according to an embodiment of this application;
Fig. 6 is a flowchart of feature selection according to an embodiment of this application;
Fig. 7 is a flowchart of incremental labeling according to an embodiment of this application;
Fig. 8 is a flowchart of label propagation according to an embodiment of this application; and
Fig. 9 is a block diagram of the data annotation processing apparatus according to an embodiment of this application.
Detailed Description
Embodiments of this application are described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
It should be noted that the terms "first", "second", and the like in the description and claims of this application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence.
The method embodiments provided in the embodiments of this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, Fig. 1 is a block diagram of the hardware structure of a mobile terminal running the data annotation processing method according to an embodiment of this application. As shown in Fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and may further include a transmission device 106 for communication functions and an input-output device 108. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the data annotation processing method in the embodiments of this application. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, and such remote memory may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the mobile terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can connect to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
This embodiment provides a data annotation processing method running on the above mobile terminal or network architecture. Fig. 2 is a flowchart of the data annotation processing method according to an embodiment of this application. As shown in Fig. 2, the flow includes the following steps.
Step S202: perform anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set.
Step S204: perform feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values.
In this embodiment, step S204 may specifically include:
performing feature expansion on the preprocessed performance indicator data in one of the following ways to obtain a predicted value corresponding to each original value in the feature vector:
a difference model, a Holt-Winters time series model, a sliding mean model, a sliding median model, a time series decomposition model, a time series decomposition median model, or a wavelet transform model.
Step S206: perform feature selection on the feature vectors to obtain target feature vectors of the sample set.
In this embodiment, step S206 may specifically include:
selecting, from the feature vectors, feature items capable of distinguishing different anomaly types, to obtain the target feature vectors of the sample set.
Step S208: annotate newly added samples according to the target feature vectors of the sample set.
Through the above steps S202 to S208, anomaly detection is performed on performance indicator data to obtain a sample set composed of anomalous points and a corresponding label set; feature expansion is performed on the sample set to obtain feature vectors and corresponding label values; feature selection is performed on the feature vectors to obtain target feature vectors of the sample set; and newly added samples are annotated according to the target feature vectors. This solves the problem in the related art that labeling performance indicator data with supervised classification methods cannot adapt well to streaming data, achieves effective discrimination of the root causes of faults in wireless network key performance indicator data, and adapts well to streaming data.
In this embodiment, step S208 may specifically include: when the newly added sample is unlabeled, annotating the label of the newly added sample according to the target feature vectors of the sample set and the corresponding label set.
In an optional embodiment, step S208 may specifically include:
S2081: if some or all sample points in the newly added sample are adjacent to the sample set, determining the distances between the newly added sample and the target feature vectors of the sample set;
S2082: taking the distances as the weights of the edges between the node of the newly added sample and each node in the sample set, to obtain a weight matrix;
S2083: constructing a diagonal matrix from the weight matrix;
S2084: determining a transition probability matrix from the diagonal matrix and the weight matrix; and
S2085: annotating the label of the newly added sample according to the transition probability matrix.
In an optional embodiment, step S2081 may specifically include:
obtaining the set of sample points in the newly added sample that are adjacent to and belong to the sample set;
determining the distances between all sample points in this sample point set and the target feature vectors of the sample set; and
determining these distances as the distances between the newly added sample and the target feature vectors of the sample set.
In an optional embodiment, step S2083 may specifically include:
determining the sum of all feature vector entries in each row of the weight matrix; and
combining the row sums to obtain the diagonal matrix.
In an optional embodiment, step S2085 may specifically include:
obtaining, from the transition probability matrix, the values in the row corresponding to the newly added sample and in the columns corresponding to the target feature vectors of the sample set, to obtain a target transition probability matrix;
determining the product of the target transition probability matrix and the label set corresponding to the sample set as the label estimate of the newly added sample; and
annotating the newly added sample according to the label estimate.
In an optional embodiment, after the newly added sample is annotated according to the target feature vectors of the sample set, unlabeled samples adjacent to the newly added sample are determined and added to a candidate set, and the unlabeled samples in the candidate set are annotated according to the newly added sample. Further, for an unlabeled sample in the candidate set, a label estimate of the unlabeled sample is determined from submatrices of the transition probability matrix and submatrices of the label matrix of the sample set; if the L1 norm of the difference between the label estimate and the initial label value of the unlabeled sample in the candidate set is greater than a preset threshold, the label value of the unlabeled sample is updated with the label estimate, where the initial label value is a zero vector.
In an optional embodiment, before performing anomaly detection on the performance indicator data to obtain the sample set composed of anomalous points and the corresponding label set, missing values in the performance indicator data are determined; one or more historical data points at the same time of day as the sampling time of the missing value are obtained; and the missing value is filled with that data point or with the mean of those data points.
This embodiment uses data mining and machine learning methods to annotate sample data of unknown type by performing feature extraction, feature selection, and semi-supervised label propagation on streaming wireless network key performance indicator data, thereby automatically expanding the labeled samples and assisting the optimization of the root cause localization operator. Further, it can also be used directly in the root cause analysis stage to identify the broad category to which a fault belongs.
The input data objects in this embodiment are the core wireless network performance indicators at the anomalous time points produced by the anomaly detection algorithm, together with the time series data sets of the counters of interest to the business.
In the first step, the input data is preprocessed, and missing values are filled as follows:
fill with the mean of the historical data points at the same time of day;
if there are no data points at the corresponding time, fill with the overall mean.
At the same time, the sampling time granularity of the samples is unified; the flow then starts from the preprocessed initial sample set, and the model is initialized.
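The imputation rule above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function name and the fixed `period` parameter (points per day at the unified sampling granularity) are assumptions introduced here.

```python
import numpy as np

def fill_missing(values: np.ndarray, period: int = 24) -> np.ndarray:
    """Fill NaNs with the mean of historical values at the same
    time-of-period slot; fall back to the overall mean when that
    slot has no observed values."""
    out = values.astype(float).copy()
    overall = np.nanmean(out)  # overall mean, ignoring NaNs
    for i in np.where(np.isnan(out))[0]:
        # all points sharing the same position within the period
        slot = out[np.arange(len(out)) % period == i % period]
        slot = slot[~np.isnan(slot)]
        out[i] = slot.mean() if slot.size else overall
    return out
```

With hourly data and `period=24`, a missing 3 a.m. reading is filled with the mean of all other 3 a.m. readings.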
In the second step, feature engineering is applied to the preprocessed core indicator data, mainly including the following.
1. Feature extraction, specifically including:
To fully characterize the time series features of different performance indicator data, feature expansion is performed on the K-dimensional indicator data corresponding to each input anomaly object, forming key-value pairs (x, y), where x is the L-dimensional feature vector x = [x_1, x_2, ..., x_L]^T obtained by feature extraction for a given sample point, and y is the anomaly type label value of the corresponding sample object. This module currently uses the following general-purpose models for time series data processing: a difference model, a Holt-Winters time series model, a sliding mean model, a sliding median model, a time series decomposition model, a time series decomposition median model, and a wavelet transform model.
2. Feature selection, specifically including:
Feature selection is performed on the feature data obtained by the feature enhancement module; for the relevant flow and description, refer to the patent "A feature selection method and apparatus for supervised anomaly detection". Its main content is as follows: for the L-dimensional feature vector x = [x_1, x_2, ..., x_L]^T obtained by feature extraction for each anomalous sample object, and the corresponding sample label value y, solve for the minimal feature subset x_θ* such that the feature selection probability equation p(y|x) = p(y|x_θ*) holds, thereby reducing the dimensionality from the original dimension L to the selected dimension M, where p is the true mapping describing the probability of label value y appearing given the feature set x. Further, under the above assumption, an approximate prediction model q of p can be constructed, and the target feature set x_θ* can be obtained by maximum likelihood estimation, e.g. in the form

(θ̂, τ̂) = argmax_{θ,τ} ∏_{i=1}^{N} q(y_i | x_{θ,i}; τ),

where θ denotes the selected features and τ denotes the parameters used to predict the class label. The purpose of this optimization is to find a minimal target feature set x_θ* that makes the prediction model q arbitrarily close to the true model p. By normalizing this expression, taking logarithms, and computing mutual information, the result x_θ* can be obtained.
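Since the optimization above reduces to a mutual-information computation, a minimal sketch of mutual-information-based feature ranking looks like the following. It is not the referenced patent's algorithm (which also uses conditional mutual information to handle redundancy); the plug-in estimator below only works for discretized features and is an assumption introduced for illustration.

```python
import numpy as np

def mutual_info(x, y) -> float:
    """Plug-in estimate of I(X; Y) in nats for discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def select_features(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Rank discretized feature columns by I(feature; label), keep top-k."""
    scores = [mutual_info(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]
```

A feature identical to the label scores I = H(Y), while a constant feature scores 0 and is dropped first.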
In the third step, incremental labeling is performed based on the selected feature data, mainly including the following.
Through the data preprocessing and feature engineering submodules, for each labeled sample object the relation between the sample's feature values and its class label can be expressed as a pair (x, y),
where x is the feature vector x = [x_1, x_2, ..., x_M]^T, the subscripts ranging over the M-dimensional feature space obtained by feature selection, and y is the annotated class label. Based on the N anomalous sample objects input by real-time performance indicator monitoring, each with M feature values obtained by feature extraction and selection, there is a feature vector set X = {X_1, X_2, ..., X_N} and a label set L = {L_1, L_2, ..., L_N}.
For each new sample n+1 that enters the system in streaming fashion, the model training process begins: if the sample is unlabeled, its label is estimated by the model, thereby achieving root cause annotation and sample expansion; if the sample already has a label, the model itself is updated. This is implemented mainly in two steps.
1. Constructing the k-nn relation over the sample data, specifically including:
Taking the original sample points as independent nodes, for the new sample n+1, compute the distance between this sample and each known sample as

d(i, j) = sqrt((x_i - x_j)^T M (x_i - x_j)), j ∈ N(i),

where M is the identity matrix (a square matrix whose main diagonal elements are 1 and all other elements are 0), i and j are two distinct sample objects, and N(i) is the set of neighbors of sample point i (several computation methods can be used; this patent adopts a k-nn method based on Euclidean distance), i.e., j belongs to the set of the k sample points nearest to i.
The resulting distance values are the weights of the edges between the node of the new sample and each known node, forming the weight matrix

W = [w_ij], with w_ij = d(i, j) if j ∈ N(i), and w_ij = 0 otherwise.

Based on the weight matrix W (a matrix whose elements represent the weights of the edges between any two sample points, i.e., the degree of similarity of the two sample objects under the current feature dimensions), a diagonal matrix (a matrix whose elements outside the main diagonal are all 0) can be constructed as

D = diag(d_11, ..., d_nn), with d_ii = Σ_j w_ij,

where each diagonal element is the sum of the elements of the corresponding row of W. The transition probability matrix P (a matrix whose elements are all non-negative and whose rows each sum to 1, representing the probability of an element transitioning from one state to another under certain conditions) is then computed as

P = D^{-1} W = [[P_LL, P_LU], [P_UL, P_UU]],

where P_LL, P_LU, P_UL, and P_UU are the submatrices of transition probabilities corresponding to labeled sample objects, mixtures of labeled and unlabeled sample objects, and unlabeled sample objects, respectively.
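The k-nn graph construction and the normalization P = D⁻¹W can be sketched as follows. This is an illustrative sketch under stated assumptions: the text weights edges by the raw distance value, which is unusual (similarity weights are more common), and the symmetrization step is an assumption added here so that every row sum is positive before inverting D.

```python
import numpy as np

def knn_transition_matrix(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Row-stochastic transition matrix P = D^{-1} W from a k-nn graph
    whose edge weights are Euclidean distances, as in the text."""
    n = len(X)
    # pairwise Euclidean distances (M = identity matrix in the text)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]  # k nearest, skipping self
        W[i, nbrs] = d[i, nbrs]
    W = np.maximum(W, W.T)                # symmetrize the graph
    D = np.diag(W.sum(axis=1))            # row-sum diagonal matrix
    return np.linalg.inv(D) @ W           # each row sums to 1
```

The resulting P can be partitioned into the blocks P_LL, P_LU, P_UL, P_UU once the samples are ordered with labeled ones first.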
2. Updating the label matrix, specifically including:
The original sample labels F can be expressed as an n × c matrix

F = [f_ij], i = 1, ..., n, j = 1, ..., c,

where n is the number of initial sample objects and c is the number of labels, which can be understood as c broad fault categories. If a labeled sample i is known to belong to class c_1, the corresponding entries of the label matrix are assigned as

F(i, c_1) = 1 and F(i, j) = 0 for j ≠ c_1.

By constructing the k-nn relation, the transition matrix P representing state transitions is obtained, and the label of the unlabeled new sample n+1 can then be estimated as f_{n+1} = P_{n+1,1:n} · F,
where P_{n+1,1:n} is the first n entries of the (n+1)-th row of the above transition probability matrix and F is the above label matrix, so that f_{n+1} takes the form of a 1 × c vector, i.e., the label estimate of the newly added unlabeled sample object n+1. For samples that are already labeled, their original label values are retained unchanged.
3. Limited label propagation, mainly including:
The label propagation algorithm is the core of the incremental labeling submodule. By locally updating the labels of unlabeled samples that satisfy a significance-of-influence condition, class labels are diffused and propagated at minimal resource cost. For the newly added unlabeled sample n+1, on the basis of estimating its own label (if it has no label class information), all unlabeled samples that are neighbor nodes of this sample in the k-nn relation are added to a candidate set, and the label estimate of any sample object k in the set is updated as

f_k = P_UL(k) · F_L + P_UU(k) · F_U,

where P_UL(k) is the k-th row of the transition probability submatrix P_UL, F_L is the sub-label-matrix corresponding to the labeled sample objects, P_UU(k) is the k-th row of the transition probability submatrix P_UU, and F_U is the sub-label-matrix corresponding to the unlabeled sample objects (initially the zero matrix).
If the L1 norm (the sum of the absolute values of the elements of a vector) of the difference between this estimate and the current label value is greater than a threshold ε, that is, if

||f_k - F_U(k)||_1 > ε,

the label value of sample object k is updated, and object k is taken as a new diffusion center: the unlabeled sample points around it that satisfy the condition are further added to the candidate set, and the above steps are repeated until the iteration limit is reached or the candidate set is empty, yielding a converged result of the label value updates.
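The update rule F_U ← P_UL F_L + P_UU F_U can be sketched as follows. For brevity this sketch iterates the update globally over all unlabeled rows rather than through the patent's local candidate-set diffusion; the tolerance plays the role of the threshold ε, and all names are assumptions introduced here.

```python
import numpy as np

def propagate_labels(P: np.ndarray, F_L: np.ndarray, n_labeled: int,
                     tol: float = 0.05, max_iter: int = 50) -> np.ndarray:
    """Estimate labels of unlabeled nodes from a row-stochastic matrix P
    whose first n_labeled rows/columns are the labeled samples.
    Iterates F_U <- P_UL @ F_L + P_UU @ F_U until the largest per-node
    L1 change drops below tol, then returns the argmax class per node."""
    P_UL = P[n_labeled:, :n_labeled]
    P_UU = P[n_labeled:, n_labeled:]
    F_U = np.zeros((P.shape[0] - n_labeled, F_L.shape[1]))  # zero init
    for _ in range(max_iter):
        F_new = P_UL @ F_L + P_UU @ F_U
        done = np.abs(F_new - F_U).sum(axis=1).max() < tol
        F_U = F_new
        if done:
            break
    return F_U.argmax(axis=1)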
To classify fault root causes based on streaming performance indicator anomaly data, this embodiment provides a streaming-data multi-classification framework based on an incremental labeling algorithm. Built on configurable feature engineering including feature extraction and selection, it performs local diffusion updates constrained by the deviation of the label estimates, allows unlabeled sample objects to dynamically update their label types according to the input, and achieves an adaptable multi-classification objective at a small computational cost.
Fig. 3 is the main flowchart of the algorithm according to this embodiment. As shown in Fig. 3, it includes:
S301: data input; taking the anomalous time point of the core wireless network performance indicator data as the starting point, read the previous 30 days of historical performance data, preprocess it, and use it as the algorithm input;
S302: perform feature extraction on the data, configuring the extraction functions through the get_feature function and forming a number of independent perceptrons for each indicator object;
S303: perform feature selection on the extracted feature objects, selecting from the large-scale feature data the features highly correlated with the labels;
S304: data labeling; based on the sample object feature item set X = {X_1, X_2, ..., X_n} and label set L formed in S303, first separate the initialization data from the incremental data, and within the incremental data separate labeled objects from unlabeled objects; then initialize the model, and feed the incremental data into the model in streaming fashion for updating;
S305: obtain the labels of the unlabeled sample data, i.e., the broad fault category corresponding to a particular performance indicator anomaly.
Fig. 4 is a flowchart of feature extraction according to this embodiment. As shown in Fig. 4, it includes:
S401: data input; unify the sampling granularity of the performance indicator time series data as the algorithm input;
S402: feature item selection; configure the feature items by modifying feature_list and define the feature operators through feature_mapping; the algorithm dynamically sets the number of parallel processes according to the feature item configuration. The current default configuration items are as follows.
The difference model (Diff) (last-day, last-week), specifically:

Δf(x_k) = f(x_k) - f(x_{k-h}).

Holt-Winters (α, β, γ = {0.2, 0.4, 0.6, 0.8}), the multiplicative model using the smoothing equations for the level component l_t, the trend component b_t, and the seasonal component s_t:

ŷ_{t+h|t} = (l_t + h·b_t) · s_{t+h-m},
l_t = α·(y_t / s_{t-m}) + (1-α)·(l_{t-1} + b_{t-1}),
b_t = β*·(l_t - l_{t-1}) + (1-β*)·b_{t-1},
s_t = γ·(y_t / (l_{t-1} + b_{t-1})) + (1-γ)·s_{t-m}.

Historical average (window = 1, 2, 3, 4 weeks), using the mean of the historical data over a window of a given length as the feature value.
Historical median (window = 1, 2, 3, 4 weeks), using the median of the historical data over a window of a given length as the feature value.
Time series decomposition (TSD) (window = 1, 2, 3, 4 weeks), as follows:

y_t = S_t · T_t · R_t,

where decomposing the time series yields a seasonal component, a trend component, and a residual component; the feature item is the product of the means of the components (using multiplicative decomposition).
TSD median (window = 1, 2, 3, 4 weeks): the time series decomposition is the same as above; the feature item is the product of the medians of the components (using multiplicative decomposition).
Wavelet (window = 1, 3, 5, 7 days): apply wavelet decomposition to the time series to obtain the high-frequency signal parts, build an autoregressive moving average (ARMA) model for each level of high-frequency signal to predict the corresponding wavelet coefficients, and finally reconstruct the data from the wavelet coefficients to obtain the feature values. In total there are 7 general-purpose prediction models and 86 prediction value types.
S403: construct feature perceptrons; use the selected feature operators and corresponding parameters to construct feature perceptrons, forming the feature extraction for the anomalous moments of the performance data;
S404: feature enhancement; compute the error between the extracted feature values and the original data, and enhance the error features to improve the ability to characterize anomalous fluctuations of the data;
S405: obtain the feature data.
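Three of the simpler expansion models above (difference vs. a lag, historical mean, historical median over a trailing window) can be sketched as follows; the more involved models (Holt-Winters, TSD, wavelets) would be added the same way, one predicted value per model. Function and key names are illustrative, not from the patent.

```python
import numpy as np

def extract_features(series: np.ndarray, window: int = 3) -> dict:
    """Per-point predicted values from three simple expansion models."""
    feats = {}
    # difference against the previous point (lag-1 stand-in for last-day)
    feats["diff_lag1"] = np.r_[np.nan, np.diff(series)]
    means, medians = [], []
    for t in range(len(series)):
        hist = series[max(0, t - window):t]  # trailing window, excl. t
        means.append(hist.mean() if len(hist) else np.nan)
        medians.append(np.median(hist) if len(hist) else np.nan)
    feats["roll_mean"] = np.array(means)
    feats["roll_median"] = np.array(medians)
    return feats
```

Stacking the per-model predictions for one time point yields the L-dimensional feature vector x described above.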
Fig. 5 is a flowchart of feature enhancement according to this embodiment. As shown in Fig. 5, it includes:
S501: feature data input; the feature data obtained by the feature extraction algorithm is used as the algorithm input, the aim being to strengthen the ability of the features themselves to characterize anomalous fluctuations of the data;
S502: compute the prediction residual term, i.e., the error between the feature item and the original data; in most cases the error fluctuates around 0;
S503: compute the standard score of the error; the closer this value is to 0, the smaller the fluctuation of the key performance indicator (KPI) data; the formula is

z_t = (e_t - μ_e) / σ_e,

where e_t is the residual at time t, and μ_e and σ_e are the mean and standard deviation of the residuals;
S504: feature enhancement; enhance the standardized data to amplify significant fluctuations while weakening the influence of noise values, that is, values far from 0 are enlarged while the influence of values close to 0 is limited;
S505: obtain the enhanced feature data.
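Steps S502 to S504 can be sketched as follows. The patent does not spell out its enhancement function; a signed square of the z-score, which magnifies |z| > 1 and shrinks |z| < 1, is one simple choice and is an assumption introduced here.

```python
import numpy as np

def enhance(errors: np.ndarray) -> np.ndarray:
    """Standardize prediction residuals, then amplify large deviations
    and damp near-zero noise via a signed square of the z-score."""
    z = (errors - errors.mean()) / errors.std()  # S503: standard score
    return np.sign(z) * z ** 2                   # S504: enhancement
```

For standardized residuals [-1.5, -0.5, 0, 0.5, 1.5], the outer values grow to ±2.25 while the inner ones shrink to ±0.25.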
Fig. 6 is a flowchart of feature selection according to this embodiment. As shown in Fig. 6, it includes:
S601: feature data input; the feature data obtained by feature engineering is used as the algorithm input;
S602: transpose the feature data; the purpose is to transpose the feature data while preserving the original partitioning of the data, thereby suppressing data scattering and reducing computational cost through data pooling;
S603: compute the correlations of all features;
S604: initialize the selected feature set; based on the feature correlation results from S603, initialize the intermediate result pool and create an initial ranking of the features;
S605: determine whether the selected features satisfy the termination condition; if not, proceed to S606 to compute the mutual information and conditional mutual information values and iteratively update the intermediate result pool until the termination condition is satisfied; if so, proceed to S607;
S606: compute the redundancy between features;
S607: update the selected feature set;
S608: obtain the selected feature set.
Fig. 7 is a flowchart of incremental labeling according to this embodiment. As shown in Fig. 7, it includes:
S701: feature data input; the feature set selected by the feature selection module is used as the algorithm input; the initially labeled sample set is split off as training data for model initialization, and the other data is fed in streaming fashion;
S702: update the weight matrix; for the new sample n+1, use the feature vectors and the identity matrix M to compute the weight w between it and each object in the original sample set, generating the weight matrix W;
S703: update the transition matrix; compute and update the diagonal matrix D and the transition matrix P to size (n+1) × (n+1), using a binary tree structure to order the points in the sample space so that labeled samples precede unlabeled samples;
S704: update the label matrix, estimating the label value of the new sample n+1;
S705: determine whether the loop conditions hold; the unlabeled objects belonging to the neighbor node set of the current node form the candidate diffusion set, and for each element k of the candidate set the loop conditions are checked: 1) the candidate diffusion set is not empty, and 2) the number of iterations is less than the threshold T_max; if the conditions hold, proceed to S706, otherwise exit the loop and proceed to S708;
S706: label propagation; apply the local label propagation algorithm to each element k of the candidate diffusion set;
S707: compute and update the estimates in the label matrix;
S708: obtain the label estimates of the data.
Fig. 8 is a flowchart of label propagation according to this embodiment. As shown in Fig. 8, it includes:
S801: candidate set data input; the candidate label diffusion sample set obtained in S705 is used as the algorithm input;
S802: check the loop conditions, i.e., whether the candidate set is non-empty and the number of iterations is less than the threshold T_max; if so, execute the loop body and proceed to S803, otherwise proceed to S806 and output the result;
S803: compute the label update influence factor; for each sample object i in the candidate set, the deviation between its label estimate and its original value is

δf_i = P_UL(i)·F_L + P_UU(i)·F_U - F_U(i),

where the absolute value of the deviation is the influence factor;
S804: evaluate the label update influence factor; if the absolute deviation obtained in S803 is greater than the given threshold ε, update the label value of sample object i, and store the object together with its influence factor value in a set A;
S805: update the candidate set; for each object j in the set A obtained in S804, add to the candidate diffusion set the unlabeled objects belonging to the neighbors of the current node, and update the corresponding label values as

F_U(j) ← P_UL(j)·F_L + P_UU(j)·F_U;

S806: obtain the updated label values.
According to another embodiment of this application, a data annotation processing apparatus is provided. Fig. 9 is a block diagram of the data annotation processing apparatus according to this embodiment. As shown in Fig. 9, the apparatus includes:
an anomaly detection module 92, configured to perform anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set;
a feature expansion module 94, configured to perform feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values;
a feature selection module 96, configured to perform feature selection on the feature vectors to obtain target feature vectors of the sample set; and
a first annotation module 98, configured to incrementally annotate newly added samples according to the target feature vectors of the sample set.
In an optional embodiment, the first annotation module 98 is further configured to:
when the newly added sample is unlabeled, annotate the label of the newly added sample according to the target feature vectors of the sample set and the corresponding label set.
In an optional embodiment, the first annotation module 98 includes:
a first determination submodule, configured to determine, if some or all sample points in the newly added sample are adjacent to the sample set, the distances between the newly added sample and the target feature vectors of the sample set;
a second determination submodule, configured to take the distances as the weights of the edges between the node of the newly added sample and each node in the sample set, to obtain a weight matrix;
a construction submodule, configured to construct a diagonal matrix from the weight matrix;
a third determination submodule, configured to determine a transition probability matrix from the diagonal matrix and the weight matrix; and
an annotation submodule, configured to annotate the label of the newly added sample according to the transition probability matrix.
In an optional embodiment, the first determination submodule includes:
an obtaining unit, configured to obtain the set of sample points in the newly added sample that are adjacent to and belong to the sample set;
a first determination unit, configured to determine the distances between all sample points in the sample point set and the target feature vectors of the sample set; and
a second determination unit, configured to determine these distances as the distances between the newly added sample and the target feature vectors of the sample set.
In an optional embodiment, the construction submodule includes:
a third determination unit, configured to determine the sum of all feature vector entries in each row of the weight matrix; and
a combination unit, configured to combine the row sums to obtain the diagonal matrix.
In an optional embodiment, the annotation submodule includes:
an obtaining unit, configured to obtain, from the transition probability matrix, the values in the row corresponding to the newly added sample and in the columns corresponding to the target feature vectors of the sample set, to obtain a target transition probability matrix;
a fourth determination unit, configured to determine the product of the target transition probability matrix and the label set corresponding to the sample set as the label estimate of the newly added sample; and
an annotation unit, configured to annotate the newly added sample according to the label estimate.
In an optional embodiment, for use after the newly added sample is annotated according to the target feature vectors of the sample set, the apparatus further includes:
an adding module, configured to determine unlabeled samples adjacent to the newly added sample and add the unlabeled samples to a candidate set; and
a second annotation module, configured to annotate the unlabeled samples in the candidate set according to the newly added sample.
In an optional embodiment, the second annotation module includes:
a fourth determination submodule, configured to determine, for an unlabeled sample in the candidate set, a label estimate of the unlabeled sample from submatrices of the transition probability matrix and submatrices of the label matrix of the sample set; and
an update submodule, configured to update, if the L1 norm of the difference between the label estimate and the initial label value of the unlabeled sample in the candidate set is greater than a preset threshold, the label value of the unlabeled sample with the label estimate, where the initial label value is a zero vector.
In an optional embodiment, the apparatus further includes:
a determination module, configured to determine missing values in the performance indicator data;
an obtaining module, configured to obtain one or more historical data points at the same time of day as the sampling time of the missing value; and
a filling module, configured to fill the missing value with that data point or with the mean of those data points.
In an optional embodiment, the feature expansion module is further configured to:
perform feature expansion on the preprocessed performance indicator data in one of the following ways to obtain a predicted value corresponding to each original value in the feature vector:
a difference model, a Holt-Winters time series model, a sliding mean model, a sliding median model, a time series decomposition model, a time series decomposition median model, or a wavelet transform model.
In an optional embodiment, the feature selection module is further configured to:
select, from the feature vectors, feature items capable of distinguishing different anomaly types, to obtain the target feature vectors of the sample set.
An embodiment of this application further provides a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to perform, when run, the steps in any one of the above method embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
An embodiment of this application further provides an electronic apparatus, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input-output device, where the transmission device is connected to the processor and the input-output device is connected to the processor.
For specific examples in this embodiment, refer to the examples described in the above embodiments and exemplary implementations; they are not repeated here.
In the embodiments of this application, anomaly detection is performed on performance indicator data to obtain a sample set composed of anomalous points and a corresponding label set; feature expansion is performed on the sample set to obtain feature vectors and corresponding label values; feature selection is performed on the feature vectors to obtain target feature vectors of the sample set; and newly added samples are annotated according to the target feature vectors. This solves the problem in the related art that labeling performance indicator data with supervised classification methods cannot adapt well to streaming data, achieves effective discrimination of the root causes of faults in wireless network key performance indicator data, and adapts well to streaming data.
Obviously, a person skilled in the art should understand that the above modules or steps of this application can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices; they can be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases the steps shown or described can be performed in an order different from that given here; or they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, this application is not limited to any particular combination of hardware and software.
The above are only preferred embodiments of this application and are not intended to limit this application; for a person skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the principles of this application shall be included in the protection scope of this application.

Claims (14)

  1. A data annotation processing method, comprising:
    performing anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set;
    performing feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values;
    performing feature selection on the feature vectors to obtain target feature vectors of the sample set; and
    annotating newly added samples according to the target feature vectors of the sample set.
  2. The method according to claim 1, wherein annotating newly added samples according to the target feature vectors of the sample set comprises:
    when the newly added sample is unlabeled, annotating the label of the newly added sample according to the target feature vectors of the sample set and the corresponding label set.
  3. The method according to claim 2, wherein annotating the label of the newly added sample according to the target feature vectors of the sample set and the corresponding label set comprises:
    if some or all sample points in the newly added sample are adjacent to the sample set, determining the distances between the newly added sample and the target feature vectors of the sample set;
    taking the distances as the weights of the edges between the node of the newly added sample and each node in the sample set, to obtain a weight matrix;
    constructing a diagonal matrix from the weight matrix;
    determining a transition probability matrix from the diagonal matrix and the weight matrix; and
    annotating the label of the newly added sample according to the transition probability matrix.
  4. The method according to claim 3, wherein determining the distances between the newly added sample and the target feature vectors of the sample set comprises:
    obtaining the set of sample points in the newly added sample that are adjacent to and belong to the sample set;
    determining the distances between all sample points in the sample point set and the target feature vectors of the sample set; and
    determining these distances as the distances between the newly added sample and the target feature vectors of the sample set.
  5. The method according to claim 3, wherein constructing a diagonal matrix from the weight matrix comprises:
    determining the sum of all feature vector entries in each row of the weight matrix; and
    combining the row sums to obtain the diagonal matrix.
  6. The method according to claim 3, wherein annotating the label of the newly added sample according to the transition probability matrix comprises:
    obtaining, from the transition probability matrix, the values in the row corresponding to the newly added sample and in the columns corresponding to the target feature vectors of the sample set, to obtain a target transition probability matrix;
    determining the product of the target transition probability matrix and the label set corresponding to the sample set as the label estimate of the newly added sample; and
    annotating the newly added sample according to the label estimate.
  7. The method according to claim 3, wherein, after annotating the newly added sample according to the target feature vectors of the sample set, the method further comprises:
    determining unlabeled samples adjacent to the newly added sample, and adding the unlabeled samples to a candidate set; and
    annotating the unlabeled samples in the candidate set according to the newly added sample.
  8. The method according to claim 7, wherein annotating the unlabeled samples in the candidate set according to the newly added sample comprises:
    determining, for an unlabeled sample in the candidate set, a label estimate of the unlabeled sample from submatrices of the transition probability matrix and submatrices of the label matrix of the sample set; and
    if the L1 norm of the difference between the label estimate and the initial label value of the unlabeled sample in the candidate set is greater than a preset threshold, updating the label value of the unlabeled sample with the label estimate, wherein the initial label value is a zero vector.
  9. The method according to claim 1, wherein, before performing anomaly detection on the performance indicator data to obtain the sample set composed of anomalous points and the label set corresponding to the sample set, the method further comprises:
    determining missing values in the performance indicator data;
    obtaining one or more historical data points at the same time of day as the sampling time of the missing value; and
    filling the missing value with that data point or with the mean of those data points.
  10. The method according to any one of claims 1 to 9, wherein performing feature expansion on the performance indicator data to obtain feature vectors and corresponding label values comprises:
    performing feature expansion on the preprocessed performance indicator data in one of the following ways to obtain a predicted value corresponding to each original value in the feature vector:
    a difference model, a Holt-Winters time series model, a sliding mean model, a sliding median model, a time series decomposition model, a time series decomposition median model, or a wavelet transform model.
  11. The method according to any one of claims 1 to 9, wherein performing feature selection on the feature vectors to obtain target feature vectors of the sample set comprises:
    selecting, from the feature vectors, feature items capable of distinguishing different anomaly types, to obtain the target feature vectors of the sample set.
  12. A data annotation processing apparatus, comprising:
    an anomaly detection module, configured to perform anomaly detection on performance indicator data to obtain a sample set composed of anomalous points and a label set corresponding to the sample set;
    a feature expansion module, configured to perform feature expansion on the sample set to obtain feature vectors of the sample set and corresponding label values;
    a feature selection module, configured to perform feature selection on the feature vectors to obtain target feature vectors of the sample set; and
    a first annotation module, configured to incrementally annotate newly added samples according to the target feature vectors of the sample set.
  13. A computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to perform, when run, the method according to any one of claims 1 to 11.
  14. An electronic apparatus, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the method according to any one of claims 1 to 11.
PCT/CN2021/129871 2020-11-26 2021-11-10 Data annotation processing method and apparatus, storage medium, and electronic apparatus WO2022111284A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011349875.4A CN114548195A (zh) 2020-11-26 2020-11-26 Data annotation processing method and apparatus, storage medium, and electronic apparatus
CN202011349875.4 2020-11-26

Publications (1)

Publication Number Publication Date
WO2022111284A1 true WO2022111284A1 (zh) 2022-06-02

Family

ID=81668365

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/129871 WO2022111284A1 (zh) 2021-11-10 2020-11-26 Data annotation processing method and apparatus, storage medium, and electronic apparatus

Country Status (2)

Country Link
CN (1) CN114548195A (zh)
WO (1) WO2022111284A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257800A (zh) * 2023-05-12 2023-06-13 Training sample labeling method and system
CN117563144A (zh) * 2023-12-04 2024-02-20 Infrared therapy device condition assessment and remaining life prediction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509969A (zh) * 2017-09-06 2018-09-07 Tencent Technology (Shenzhen) Co., Ltd. Data annotation method and terminal
US20190238396A1 (en) * 2018-01-29 2019-08-01 Cisco Technology, Inc. Using random forests to generate rules for causation analysis of network anomalies
CN111224805A (zh) * 2018-11-26 2020-06-02 ZTE Corporation Network fault root cause detection method, system, and storage medium
CN111368890A (zh) * 2020-02-26 2020-07-03 Gree Electric Appliances, Inc. of Zhuhai Fault detection method and apparatus, and cyber-physical system
CN111586728A (zh) * 2020-04-30 2020-08-25 Nanjing University of Posts and Telecommunications Heterogeneous wireless network fault detection and diagnosis method oriented to small-sample features

Also Published As

Publication number Publication date
CN114548195A (zh) 2022-05-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896785

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.11.2023)