CN114385463A

CN114385463A - Data acquisition method and device and electronic equipment

Info

Publication number: CN114385463A
Application number: CN202111498617.7A
Authority: CN
Inventors: 郑南成
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2022-04-22

Abstract

The embodiments of the present application disclose a data collection method, an apparatus, and an electronic device. The method includes: in response to a collection instruction, acquiring the data collected in the current collection period as data to be processed; calculating the similarity between the data to be processed and historical collection data collected in historical collection periods; and, if the similarity satisfies a specified threshold condition, storing the data to be processed. In this way, the calculated similarity is compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies that condition. The acquired data is thus screened against the specified threshold condition so that only data meeting the requirement (the specified threshold condition) is stored; not every acquired sample has to be stored directly, which saves storage space.

Description

Data acquisition method, device and electronic device

Technical field

The present application relates to the field of computer technology, and more particularly, to a data collection method, apparatus, and electronic device.

Background

To understand the running state of a device or program, data generated while the device or program is running can be collected, and whether the device or program has failed can then be determined from the collected data. However, related data collection approaches still waste storage space.

Summary of the invention

In view of the above problem, the present application proposes a data collection method, apparatus, and electronic device to mitigate it.

In a first aspect, the present application provides a data collection method. The method includes: in response to a collection instruction, acquiring the data collected in the current collection period as data to be processed; calculating the similarity between the data to be processed and historical collection data collected in historical collection periods; and, if the similarity satisfies a specified threshold condition, storing the data to be processed.
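The three steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function names, the in-memory `store` list, and the score `1 / (1 + |current - mean(history)|)` are hypothetical, chosen only so that a larger score means "more similar". The condition used (score below a threshold) is one of the optional threshold conditions the application lists.

```python
from statistics import mean

def similarity(current, history):
    # Hypothetical score in (0, 1]: 1.0 when the current sample equals the
    # historical mean, approaching 0 as it moves away from it.
    return 1.0 / (1.0 + abs(current - mean(history)))

def process_sample(current, history, store, threshold):
    """Store `current` only if its similarity satisfies the threshold condition."""
    score = similarity(current, history)
    if score < threshold:  # assumed condition: similarity below a threshold
        store.append(current)
    return score

store = []
history = [100, 101, 99]
process_sample(100, history, store, threshold=0.5)  # close to history: not stored
process_sample(150, history, store, threshold=0.5)  # far from history: stored
print(store)
```

Only the sample that departs from the history is kept, which is how the screening saves storage compared with storing every sample.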

Optionally, calculating the similarity between the collected data and the historical collection data collected in historical collection periods includes: calculating, based on the type of the data to be processed, the similarity between the data to be processed and the historical collection data collected in historical collection periods.

Optionally, calculating the similarity based on the type of the data to be processed includes: if the data to be processed is cumulative data, calculating the similarity based on the rate of change of the data to be processed and the historical collection data collected in historical collection periods; if the data to be processed is non-cumulative data, calculating the similarity based on the data to be processed and the historical collection data collected in historical collection periods.

Optionally, calculating the similarity based on the rate of change of the data to be processed and the historical collection data includes: if the data to be processed is single-dimensional data, acquiring the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data; obtaining multiple pairs of data from the data to be processed and the plurality of historical collection data, where the two data in each pair correspond to adjacent collection periods; for each pair, taking the difference between the data from the later collection period and the data from the earlier collection period as a reference difference; dividing the reference difference of each pair by the data from the earlier collection period in that pair to obtain the rate of change of the pair; and calculating the similarity based on a first similarity algorithm and the rate of change of each pair, where the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.
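The single-dimensional rate-of-change steps above can be sketched as follows. The function names are hypothetical, standard deviation is picked from the listed options purely as an example, and raising an error for a zero earlier value is an assumption, since the text does not say how that case is handled.

```python
from statistics import pstdev

def change_rates(series):
    """Rates of change between samples from adjacent collection periods.

    `series` holds the historical samples oldest first, with the data to be
    processed as the last element.
    """
    rates = []
    for earlier, later in zip(series, series[1:]):
        if earlier == 0:
            # Assumption: the source does not cover a zero earlier value.
            raise ValueError("earlier sample is zero; rate of change is undefined")
        # reference difference divided by the earlier sample
        rates.append((later - earlier) / earlier)
    return rates

def rate_std(series):
    # Standard deviation is one of the listed "first similarity algorithm" options.
    return pstdev(change_rates(series))
```

A counter that grows by a steady 10% per period yields identical rates and a spread near zero, while a sudden jump in the newest sample widens the spread.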

Optionally, calculating the similarity based on the rate of change of the data to be processed and the historical collection data includes: if the data to be processed is multi-dimensional data, acquiring the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data; obtaining the data at each dimension position in the data to be processed and in the plurality of historical collection data; dividing these data into multiple groups by dimension position, where the data in one group share the same dimension position; obtaining multiple pairs of data within each group, where the two data in each pair correspond to adjacent collection periods; for each pair, taking the difference between the data from the later collection period and the data from the earlier collection period as a reference difference, so as to obtain the reference difference of each pair in each group; dividing the reference difference of each pair by the data from the earlier collection period in that pair to obtain the rate of change of the pair; dividing the pairs into multiple sets based on the collection periods of the data they contain, where all pairs in one set correspond to the same collection periods; generating one multi-dimensional datum from the rates of change of the pairs in each set, so as to obtain a plurality of multi-dimensional data, where the dimension position of each pair's rate of change in the generated multi-dimensional datum is the same as the dimension position of that pair's data in the data to be processed or the historical collection data; and calculating the similarity based on a second similarity algorithm and the plurality of multi-dimensional data, where the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.
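The multi-dimensional procedure above amounts to building one rate-of-change vector per pair of adjacent collection periods while keeping each dimension in its original position. A minimal sketch follows; it iterates over adjacent periods directly, which should give the same vectors as grouping by dimension first and then regrouping by period. The function names are hypothetical, Euclidean distance is just one of the listed options, and zero earlier values are assumed not to occur.

```python
import math

def rate_vectors(samples):
    """One rate-of-change vector per pair of adjacent collection periods.

    `samples` is a list of equal-length vectors, oldest first, with the data
    to be processed last. Position i of each output vector holds the rate of
    change of dimension i, so dimension positions are preserved.
    """
    vectors = []
    for earlier, later in zip(samples, samples[1:]):
        # Assumes no earlier value is zero; the source does not cover that case.
        vectors.append([(b - a) / a for a, b in zip(earlier, later)])
    return vectors

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

samples = [[10, 200], [20, 300], [30, 450]]  # two dimensions, three periods
older, newest = rate_vectors(samples)
distance = euclidean(older, newest)
```

Here the first dimension's growth slows from 100% to 50% while the second stays at 50%, so the two rate vectors differ only in their first component.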

Optionally, calculating the similarity based on the data to be processed and the historical collection data collected in historical collection periods includes: if the data to be processed is single-dimensional data, acquiring the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data; and calculating the similarity based on a first similarity algorithm, the data to be processed, and the plurality of historical collection data, where the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.

Optionally, calculating the similarity based on the data to be processed and the historical collection data collected in historical collection periods includes: if the data to be processed is multi-dimensional data, acquiring the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data; and calculating the similarity based on a second similarity algorithm, the data to be processed, and the plurality of historical collection data, where the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.
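For non-cumulative data the raw samples are compared directly, without rates of change. The sketch below shows one possible reading for each dimensionality. The function names are hypothetical; averaging over all historical vectors is an assumption, because the text does not say how the current sample is combined with several historical samples; and cosine similarity (1 minus cosine distance) is used so that a larger value means "more similar".

```python
import math
from statistics import pstdev

def single_dim_score(current, history):
    # Standard deviation over the raw samples (no rate of change involved).
    return pstdev(list(history) + [current])

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm

def multi_dim_score(current, history):
    # Assumption: average the cosine similarity against each historical vector.
    return sum(cosine_similarity(current, h) for h in history) / len(history)
```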

Thus, when the data to be processed is cumulative data, the similarity can be calculated based on the rate of change of the data to be processed and the plurality of historical collection data; when it is non-cumulative data, the similarity can be calculated based on the data to be processed and the plurality of historical collection data. Because different data types have different data characteristics, using different inputs for the similarity calculation depending on the type of the data to be processed makes it easier to identify whether the data to be processed has high-value data characteristics. Furthermore, cumulative and non-cumulative data are each divided into single-dimensional and multi-dimensional data, and different similarity algorithms are applied depending on the data dimension, which improves the applicability and extensibility of the data collection method proposed in the present application.

Optionally, calculating the similarity between the data to be processed and the historical collection data collected in historical collection periods includes: calculating the similarity based on the dimension of the data to be processed.

Optionally, if the similarity does not satisfy the specified threshold condition and the collection time of the data to be processed matches the persistent storage period, the data to be processed is stored.

Data is collected and stored based on a persistent storage period, where the persistent storage period is longer than the sampling period.

Optionally, the persistent storage period is determined based on the type of the data to be processed.

In this way, the data to be processed is stored either when its collection time matches the persistent storage period or when the similarity satisfies the specified threshold condition. Data is always stored when the persistent storage period arrives, whereas whether the similarity satisfies the specified threshold condition is random, so the data to be processed is persisted at a variable frequency. Both high-value data and routine data are therefore persisted, which saves storage space while improving the security and stability of the device.
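The resulting variable-frequency storage rule can be sketched as below. The function name, the use of seconds since the start of collection, and the reading of "matches the persistent storage period" as "is a whole multiple of it" are all assumptions made for illustration.

```python
def should_store(meets_threshold, collected_at, persist_period):
    """Decide whether a sample is persisted.

    meets_threshold: result of the similarity threshold check for this sample.
    collected_at:    collection time in seconds since collection started (assumed).
    persist_period:  persistent storage period in seconds (assumed unit).
    """
    on_persist_tick = collected_at % persist_period == 0
    return meets_threshold or on_persist_tick

# The persistent storage period is longer than the sampling period.
sampling_period, persist_period = 30, 300
stored_times = [t for t in range(0, 601, sampling_period)
                if should_store(False, t, persist_period)]
```

With no sample passing the threshold check, only the samples that fall on a persistence tick are kept; any sample that does pass the check is stored in addition to those.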

Optionally, the specified threshold condition includes: the similarity is less than a first similarity threshold, or the absolute value of the difference between the similarity and the similarity corresponding to the previous collection period is less than a second similarity threshold.

In this way, when a device fault indicated by the data to be processed would cause a serious problem (for example, it would directly stop the device from working), the specified threshold condition for that data can be set to: the absolute value of the difference between the similarity of the data to be processed and the similarity corresponding to the previous collection period is less than the second similarity threshold, so that device faults can be detected promptly. When the fault would cause only a minor problem (for example, one function of the device fails but the device can still perform other tasks), the specified threshold condition can be set to: the similarity of the data to be processed is less than the first similarity threshold, so that the operating state of the device can be analyzed quickly. The specified threshold condition can thus be chosen according to actual needs, which makes the similarity judgment more flexible.
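The two threshold conditions, and the choice between them by fault severity, can be written down directly. The function and parameter names below are hypothetical; the comparisons follow the text as stated.

```python
def below_first_threshold(similarity, first_threshold):
    return similarity < first_threshold

def close_to_previous(similarity, previous_similarity, second_threshold):
    return abs(similarity - previous_similarity) < second_threshold

def make_condition(fault_is_severe, first_threshold, second_threshold):
    """Pick the threshold condition by fault severity, as described above."""
    if fault_is_severe:
        return lambda sim, prev: close_to_previous(sim, prev, second_threshold)
    return lambda sim, prev: below_first_threshold(sim, first_threshold)
```

Both returned callables take the current similarity and the previous period's similarity, so the caller does not need to know which condition is in force.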

In a second aspect, the present application provides a data collection apparatus running on an electronic device. The apparatus includes: a to-be-processed data acquisition unit, configured to acquire, in response to a collection instruction, the data collected in the current collection period as data to be processed; a similarity calculation unit, configured to calculate the similarity between the data to be processed and historical collection data collected in historical collection periods; and a storage unit, configured to store the data to be processed if the similarity satisfies a specified threshold condition.

In a third aspect, the present application provides an electronic device including one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method described above.

In a fourth aspect, the present application provides a computer-readable storage medium storing program code, where the method described above is performed when the program code is run.

With the data collection method, apparatus, electronic device, and storage medium provided by the present application, after the data collected in the current collection period is acquired in response to a collection instruction, that data is taken as the data to be processed, the similarity between the data to be processed and historical collection data collected in historical collection periods is calculated, and the data to be processed is stored if the similarity satisfies a specified threshold condition. In this way, the calculated similarity is compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies it. The acquired data is thus screened against the specified threshold condition so that only data meeting the requirement (the specified threshold condition) is stored; not every acquired sample has to be stored directly, which saves storage space.

Description of the drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 shows a schematic diagram of an application scenario of a data collection method proposed by the present application;

FIG. 2 shows a flowchart of a data collection method proposed by an embodiment of the present application;

FIG. 3 shows a flowchart of a data collection method proposed by another embodiment of the present application;

FIG. 4 shows a flowchart of one implementation of S220 in FIG. 2 of the present application;

FIG. 5 shows a flowchart of another implementation of S220 in FIG. 2 of the present application;

FIG. 6 shows a schematic diagram of a method for calculating the rate of change of multi-dimensional data proposed by the present application;

FIG. 7 shows a flowchart of one implementation of S230 in FIG. 2 of the present application;

FIG. 8 shows a flowchart of another implementation of S230 in FIG. 2 of the present application;

FIG. 9 shows a flowchart of a data collection method proposed by yet another embodiment of the present application;

FIG. 10 shows a flowchart of a data collection method proposed by a further embodiment of the present application;

FIG. 11 shows a schematic diagram of the flow of a data collection method proposed by the present application;

FIG. 12 shows a structural block diagram of a data collection apparatus proposed by an embodiment of the present application;

FIG. 13 shows a structural block diagram of an electronic device proposed by the present application;

FIG. 14 shows a storage unit, according to an embodiment of the present application, for storing or carrying program code that implements the parameter acquisition method according to an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application, without creative effort, fall within the protection scope of the present application.

To understand the running state of a device or program, data generated while it is running can be collected, and whether the device or program has failed can then be determined from the collected data. For example, data related to the read/write rate of a disk I/O port can be collected to determine whether the disk has a data read/write fault.

In studying related approaches, the inventor found that related data collection methods still waste storage space or fail to capture high-value data. For example, collecting device data in full requires collecting a large amount of data, which occupies a large amount of local storage space, and most of the stored data is repetitive, low-value data.

Therefore, the inventor proposes the data collection method, apparatus, and electronic device of the present application. After the data collected in the current collection period is acquired in response to a collection instruction, that data is taken as the data to be processed, the similarity between the data to be processed and historical collection data collected in historical collection periods is calculated, and the data to be processed is stored if the similarity satisfies a specified threshold condition. In this way, the calculated similarity is compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies it. The acquired data is thus screened against the specified threshold condition so that only data meeting the requirement (the specified threshold condition) is stored; not every acquired sample has to be stored directly, which saves storage space.

Scenario: multiple devices (examples may be given), a network, and a cloud server. Each device can collect its own operating data and upload it to the cloud server (either in response to a request or on its own initiative), and the cloud server executes the data collection method provided by the embodiments of the present application.

To make the solutions of the embodiments of the present application easier to understand, an application scenario involved in the embodiments is introduced first.

Referring to FIG. 1, the scenario shown in FIG. 1 includes multiple devices, a gateway, a cloud platform, and a user terminal. A device can receive control commands issued by the cloud or the gateway and, after responding to a control command, report the corresponding information through the gateway (for example, after responding to a data collection instruction issued by the cloud or the gateway, the device can report its operating data through the gateway). A device can also upload its own operating data through the gateway so that the gateway and the cloud platform can confirm whether the device has failed. The gateway is responsible for handling uplink information from IoT devices and downlink commands from the cloud. The cloud platform can execute the data collection method provided by the embodiments of the present application, and can also expose interfaces for applications on the user terminal to call, so as to carry information upstream and downstream; when the data collected by the cloud platform indicates that a device may have failed, the platform can send a prompt to the application on the user terminal. The application on the user terminal can call the interfaces provided by the cloud platform to send control commands to devices and read the operating data they report, so that IoT devices can be troubleshot and maintained based on that data.

The embodiments of the present application are described below with reference to the accompanying drawings.

Referring to FIG. 2, the present application provides a data collection method, which includes:

S110: In response to a collection instruction, acquire the data collected in the current collection period as data to be processed.

The data to be processed may be data that characterizes the state of a device (for example, a disk). For example, when the device is a disk, the data to be processed may be the number of reads completed per second (r/s), the number of writes completed per second (w/s), the wear leveling count (Wear Leveling Count), and so on, where r/s and w/s indicate whether the read and write functions of the disk I/O port are normal, and the wear leveling count indicates whether the disk's storage function is normal.

In one approach, when the moment corresponding to the collection period arrives, the control unit of the electronic device issues a collection instruction to the data collection unit. The collection instruction may include an identifier of the data to be collected; in response to the instruction, the data collection unit performs the data collection operation according to that identifier, and the data collected in the current collection period is taken as the data to be processed. For example, the identifier may be the number of reads completed per second (r/s) and the sampling period may be t: every interval t, the control unit issues a collection instruction to the data collection unit, which responds by performing the collection operation according to the identifier, and the r/s value collected in the current collection period is taken as the data to be processed.

Note that one collection instruction may include one or more identifiers of data to be collected; the number of identifiers included may be determined by the task requirements of the electronic device.
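A collection instruction that carries one or more identifiers, and a data collection unit that responds to it, might look like the sketch below. The `CollectionInstruction` class, the `handle_instruction` function, and the fixed-value readers are all hypothetical stand-ins; in practice the control unit would issue such an instruction once per sampling period t and the readers would query the device.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CollectionInstruction:
    identifiers: tuple  # one or more identifiers of data to be collected

def handle_instruction(instruction, readers):
    """Data collection unit: read every metric named in the instruction.

    `readers` maps an identifier to a zero-argument callable returning the
    current value; a real collector would query the device here.
    """
    return {ident: readers[ident]() for ident in instruction.identifiers}

# Stand-in readers with fixed values, for illustration only.
readers = {"r/s": lambda: 120.0, "w/s": lambda: 45.0}
pending = handle_instruction(CollectionInstruction(("r/s", "w/s")), readers)
```

The returned mapping holds the data to be processed for the current collection period, keyed by identifier.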

Optionally, as shown in Table 1, the data identifiers may be encoded, with each code corresponding to one unique identifier.

Table 1

Code    Data identifier
001     Reads completed per second
002     Writes completed per second
...     ...
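The encoding in Table 1 is a one-to-one lookup, which can be sketched as a pair of dictionaries. Only the two rows given in the table are included; the variable and function names are hypothetical.

```python
IDENTIFIER_BY_CODE = {
    "001": "reads completed per second",
    "002": "writes completed per second",
}

# Each code must map to a distinct identifier.
assert len(set(IDENTIFIER_BY_CODE.values())) == len(IDENTIFIER_BY_CODE)

# Reverse lookup, valid because the mapping is one-to-one.
CODE_BY_IDENTIFIER = {ident: code for code, ident in IDENTIFIER_BY_CODE.items()}

def decode(codes):
    """Translate the codes carried by a collection instruction into identifiers."""
    return [IDENTIFIER_BY_CODE[code] for code in codes]
```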

S120: Calculate the similarity between the data to be processed and the historical collection data collected in historical collection periods.

The similarity characterizes how similar the data to be processed is to the historical collection data: the greater the similarity, the more alike they are. In one approach, the similarity between the data to be processed and the historical collection data collected in historical collection periods can be calculated based on the type of the data to be processed.

In the embodiments of the present application, the data to be processed can be divided into cumulative data and non-cumulative data. Cumulative data is data that accumulates on top of historical data; that is, cumulative data relates to both the historical state and the current state of the device. For example, when the device is a disk, cumulative data may be the reallocated sector count (Reallocated_Sector_Ct), the end-to-end error count (End-to-End error), the reallocation event count (Reallocated_Event_Count), the remaining life percentage (Remain Life Percentage), the available reserved space (Available Reserved Space), the percentage of available or remaining reserved blocks (Reserved Block Count), the total number of program failures (Program Fail Count), the remaining percentage of allowed erase failures (Erase Fail Count), the wear leveling count (Wear Leveling Count), the total number of LBAs written (Total LBAs Written), the uncorrectable sector count (Uncorrectable Sector Count On Line), and so on.

Non-cumulative data may be data that is random or bursty; that is, non-cumulative data relates only to the current state of the device. For example, when the device is a disk, non-cumulative data may be the number of read requests to the device merged per second (rrqm/s), the number of write requests to the device merged per second (wrqm/s), the number of reads completed per second (r/s), the number of writes completed per second (w/s), the amount of data read per second (rkB/s, in kB), the amount of data written per second (wkB/s, in kB), the average amount of data per I/O operation (avgrq-sz, in sectors), the average length of the queue of pending I/O requests (avgqu-sz), the average wait time per I/O request (await, including both waiting time and processing time, in milliseconds), the proportion of time during which the I/O queue is non-empty (%util), and so on.

It should be noted that cumulative data and non-cumulative data may have different collection periods. Because non-cumulative data is more random and bursty than cumulative data, its collection period may be shorter than that of cumulative data. For example, the number of read requests to the device merged per second (rrqm/s) is non-cumulative data, and its collection period may be set to 30 s; the remaining life percentage (Remain Life Percentage) is cumulative data, and its collection period may be set to 24 h. Setting different collection periods for different types of data allows the data to be analyzed promptly, so that a change in the state of the hard disk can be detected in time.
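The per-type collection periods described above could be held in a small lookup table. The following is a minimal sketch only: the table `COLLECTION_PERIOD_S` and the helper `is_due` are hypothetical names introduced for illustration, and the two entries simply restate the 30 s and 24 h examples from the text.

```python
# Hypothetical per-metric collection periods, in seconds (values from the examples above).
COLLECTION_PERIOD_S = {
    "rrqm/s": 30,                         # non-cumulative: sampled every 30 s
    "Remain Life Percentage": 24 * 3600,  # cumulative: sampled every 24 h
}


def is_due(metric: str, last_collected_s: float, now_s: float) -> bool:
    """Return True when the metric's collection period has elapsed."""
    return now_s - last_collected_s >= COLLECTION_PERIOD_S[metric]
```

A scheduler could call `is_due` for each metric and issue a collection instruction only for the metrics whose period has elapsed.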

Furthermore, it should be noted that the data type to which the operating data to be collected belongs is preset, and the device can determine the type of the collected data based on the task currently being executed.

As another approach, the similarity between the data to be processed and the historical collection data collected in historical collection periods may be calculated based on the dimensions of the data to be processed.

In this embodiment of the present application, the data to be processed may also be classified into single-dimensional data and multidimensional data according to the number of dimensions. Single-dimensional data is data containing one kind of identifier; for example, single-dimensional data may be the number of reads completed per second (r/s) or the number of writes completed per second (w/s). Multidimensional data is data containing two or more kinds of identifiers; for example, multidimensional data may include the remaining life percentage (Remain Life Percentage), the wear leveling count (Wear Leveling Count), and the total number of LBAs written (Total LBAs Written), and such multidimensional data can be used to calculate disk life.

It should be noted that the device can determine in advance, based on the task, whether the data to be collected is single-dimensional or multidimensional. Single-dimensional data may be transmitted within the device as a single value, whereas multidimensional data may be transmitted as an array, so the device can determine from the data format and the currently executed task whether the acquired data to be processed (the collected data) is single-dimensional or multidimensional.

Furthermore, it should be noted that the classifications by type and by dimension may overlap. That is, when the data to be processed is cumulative or non-cumulative data, it may be further divided into single-dimensional cumulative/non-cumulative data and multidimensional cumulative/non-cumulative data; when the data to be processed is single-dimensional or multidimensional data, it may be further divided into single-/multidimensional cumulative data and single-/multidimensional non-cumulative data.
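The two overlapping classifications can be sketched as a single lookup plus a format check. This is an illustrative sketch, not the application's implementation: the task names in `TASK_KIND` and the function `classify` are hypothetical, and the code assumes, as the text suggests, that the type is preset per task while a single value versus an array signals the dimensionality.

```python
# Hypothetical preset mapping from the executing task to the data type.
TASK_KIND = {"disk_life": "cumulative", "io_monitor": "non-cumulative"}


def classify(task: str, sample) -> tuple:
    """Return (type, dimensionality) for one collected sample.

    A list or tuple is treated as multidimensional data; any other value
    is treated as a single-dimensional value.
    """
    dimensionality = "multi" if isinstance(sample, (list, tuple)) else "single"
    return TASK_KIND[task], dimensionality
```

The resulting pair selects one of the four similarity procedures described for steps S220 and S230.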

S130: If the similarity satisfies a specified threshold condition, store the data to be processed.

In this embodiment of the present application, the specified threshold condition may be a condition that characterizes high-value data, so if the similarity satisfies the specified threshold condition, the data to be processed can be determined to be high-value data. High-value data may be data indicating that the device may be faulty. For example, when the data to be processed is the number of reads completed per second, if the historically collected values for the disk are all relatively large while the value of the data to be processed is 0, the disk may have a read fault.

As one approach, the specified threshold condition may include: the similarity is less than a first similarity threshold. In this case, when the similarity corresponding to the data to be processed is less than the first similarity threshold, the data to be processed can be regarded as singular (anomalous) data that may contain information about the device state and therefore has high value. In other words, the data to be processed differs considerably from the historical collection data, and the device state may have changed abruptly (for example, the I/O read/write rate of the disk drops from fast to slow, or the disk suddenly stalls or slows down). The data to be processed can then be stored persistently, that is, stored in a database.

Furthermore, in this embodiment of the present application, as another approach, the specified threshold condition includes: the absolute value of the difference between the similarity and the similarity corresponding to the previous collection period is less than a second similarity threshold. For example, suppose the similarity corresponding to the data to be processed is A, and the similarity corresponding to the previous collection period for the same data identifier is B. If |A-B| < the second similarity threshold, the data to be processed may be singular data and can be stored persistently, that is, stored in a database.

Either of the two specified threshold conditions can serve as the basis for deciding whether the data to be processed is stored. The first condition (the similarity is less than the first similarity threshold) is simple to compute. The second condition (the absolute value of the difference between the similarity and the similarity corresponding to the previous collection period is less than the second similarity threshold) is more sensitive to changes in similarity between two adjacent collection periods and can detect a sudden change in the device state more quickly. Accordingly, when the device fault indicated by the data to be processed would cause a serious problem (for example, directly causing the device to stop working), the specified threshold condition for that data may be set to the second condition so that the fault can be discovered in time. When the fault would cause only a minor problem (for example, one function of the device fails but other tasks can still be performed), the specified threshold condition may be set to the first condition so that the operating state of the device can be analyzed quickly. The specified threshold condition can therefore be chosen based on actual requirements, which improves the flexibility of the similarity judgment.
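The two specified threshold conditions, exactly as stated in the text, reduce to two comparisons. The sketch below is illustrative; the function names and parameter names are hypothetical, and the threshold values used in any call are placeholders rather than values given by the application.

```python
def meets_first_condition(similarity: float, first_threshold: float) -> bool:
    """First condition: the similarity is less than the first similarity threshold."""
    return similarity < first_threshold


def meets_second_condition(similarity: float, prev_similarity: float,
                           second_threshold: float) -> bool:
    """Second condition as stated in the text: |A - B| < second similarity threshold,

    where A is the current similarity and B is the similarity of the previous
    collection period for the same data identifier.
    """
    return abs(similarity - prev_similarity) < second_threshold
```

Which of the two functions is consulted for a given metric would be a configuration choice driven by how severe the indicated fault is, as described above.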

It should be noted that the first similarity threshold and the second similarity threshold may be determined according to factors such as the type of the data to be processed and experience. For example, because non-cumulative data is random and prone to abrupt change while cumulative data is more stable (that is, its values change slowly), the similarity threshold for cumulative data may be smaller than the similarity threshold for non-cumulative data. As another example, because the second similarity threshold is compared with the absolute value of the difference in similarity between adjacent periods, and the similarity changes little between adjacent periods, the second similarity threshold may be smaller than the first similarity threshold.

In the data collection method provided by this embodiment, after the collection data collected in the current collection period is acquired in response to a collection instruction, that data is taken as the data to be processed, the similarity between the data to be processed and the historical collection data collected in historical collection periods is calculated, and the data to be processed is stored if the similarity satisfies the specified threshold condition. In this way, the calculated similarity is compared with the specified threshold condition, and the data to be processed is stored only when the condition is satisfied. The acquired collection data is thus screened according to the specified threshold condition, and only collection data that meets the requirement (the specified threshold condition) is stored. Not every acquired piece of collection data has to be stored, which saves storage space.
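The overall flow of this embodiment (acquire, compute similarity, store only when the condition holds) can be sketched as follows. This is a minimal sketch under stated assumptions: `process_sample`, `similarity_fn`, `first_threshold`, and `store` are hypothetical names, the first threshold condition is used, and a plain Python list stands in for the database.

```python
def process_sample(sample, history, similarity_fn, first_threshold, store):
    """Store `sample` only if its similarity to `history` is below the threshold.

    All parameter names are illustrative; `store` is any object with an
    `append` method and stands in for persistent storage.
    """
    similarity = similarity_fn(sample, history)
    if similarity < first_threshold:   # specified threshold condition (first form)
        store.append(sample)           # persist the high-value sample
        return True
    return False                       # sample is not stored, saving storage space
```

Passing the similarity computation in as a function keeps the screening step independent of which similarity algorithm is chosen for a given data type.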

Referring to FIG. 3, a data collection method provided by the present application includes:

S210: In response to a collection instruction, acquire the collection data collected in the current collection period as the data to be processed.

S220: If the data to be processed is cumulative data, calculate the similarity based on the rates of change of the data to be processed and the historical collection data collected in historical collection periods.

As one approach, as shown in FIG. 4, if the data to be processed is cumulative data, calculating the similarity based on the rates of change of the data to be processed and the historical collection data collected in historical collection periods includes:

S2201: If the data to be processed is single-dimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data.

The number of historical collection data to acquire may be determined according to factors such as the task of analyzing the device state and the collection period. Depending on the actual situation, it may generally be set to 7, 14, 20, or 30. For example, if the task is to check the remaining life percentage of the disk and the collection period is 24 h, 30 historical collection data may be acquired; if the task is to check the fraction of time during which the disk I/O queue is non-empty and the collection period is 30 s, 20 historical collection data may be acquired.
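One way to keep only the most recent N historical collection data is a bounded queue. The sketch below is an assumption about how such a window might be kept, not something the application specifies; `HISTORY_SIZE` and `make_history` are hypothetical names, and the sizes restate the two examples above.

```python
from collections import deque

# Hypothetical window sizes taken from the examples in the text.
HISTORY_SIZE = {"Remain Life Percentage": 30, "%util": 20}


def make_history(metric: str) -> deque:
    """Create a rolling window that silently drops the oldest sample when full."""
    return deque(maxlen=HISTORY_SIZE[metric])
```

Appending each newly collected value to the window keeps the history in chronological order, which is the order the later pairing steps rely on.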

S2202: Obtain multiple pairs of data from the data to be processed and the plurality of historical collection data, where the two data in each pair correspond to adjacent collection periods.

For example, suppose the acquired historical collection data, in chronological order from earliest to latest, are A, B, C, D, E, F, G, and the data to be processed is H. The following pairs can then be formed: AB, BC, CD, DE, EF, FG, GH, where the collection periods of the two data in each pair are adjacent.

S2203: For each pair of data, take the difference between the data from the later collection period and the data from the earlier collection period as a reference difference, to obtain the reference difference corresponding to each pair.

For example, for the pairs AB, BC, CD, DE, EF, FG, GH, the reference differences can be obtained as B-A, C-B, D-C, E-D, F-E, G-F, H-G: X₁ (corresponding to B-A), X₂ (corresponding to C-B), X₃ (corresponding to D-C), X₄ (corresponding to E-D), X₅ (corresponding to F-E), X₆ (corresponding to G-F), X₇ (corresponding to H-G).

S2204: Divide the reference difference corresponding to each pair of data by the data from the earlier collection period in that pair, to obtain the rate of change corresponding to each pair.

For example, for the pairs AB, BC, CD, DE, EF, FG, GH with reference differences X₁, X₂, X₃, X₄, X₅, X₆, X₇, the rates of change can be obtained as X₁/A, X₂/B, X₃/C, X₄/D, X₅/E, X₆/F, X₇/G: Y₁ (corresponding to X₁/A), Y₂ (corresponding to X₂/B), Y₃ (corresponding to X₃/C), Y₄ (corresponding to X₄/D), Y₅ (corresponding to X₅/E), Y₆ (corresponding to X₆/F), Y₇ (corresponding to X₇/G).
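Steps S2202 to S2204 can be sketched in a few lines. This is an illustrative sketch; the function name `change_rates` is hypothetical, the input is assumed to be the historical values in chronological order followed by the data to be processed, and the code assumes no earlier value is zero (the text does not say how a zero denominator is handled).

```python
def change_rates(values):
    """Return the rate of change for each pair of adjacent-period values.

    `values` holds the historical data in chronological order with the data
    to be processed last, e.g. [A, B, ..., G, H].
    """
    pairs = list(zip(values, values[1:]))                  # S2202: AB, BC, ..., GH
    diffs = [later - earlier for earlier, later in pairs]  # S2203: X = later - earlier
    # S2204: Y = X / earlier (raises ZeroDivisionError if an earlier value is 0)
    return [diff / earlier for diff, (earlier, _) in zip(diffs, pairs)]
```

For eight input values the function returns seven rates, matching Y₁ through Y₇ in the example.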

S2205: Calculate the similarity based on a first similarity algorithm and the rate of change corresponding to each pair of data, where the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.

For example, when the standard deviation is used as the first similarity algorithm and the rates of change of the pairs are Y₁, Y₂, Y₃, Y₄, Y₅, Y₆, Y₇, the mean M of the rates of change is first obtained as (Y₁+Y₂+...+Y₇)/7, and the standard deviation N corresponding to the data to be processed is then obtained from the formula sqrt(((Y₁-M)^2+(Y₂-M)^2+...+(Y₇-M)^2)/7). A smaller N indicates that the data to be processed is more similar to the historical data, and a larger N indicates that it is less similar. To avoid contradicting the specified threshold condition (under which a smaller similarity means the data to be processed is less similar to the historical data), the similarity of the data to be processed may be taken as 1/(N+1). Likewise, when another first similarity algorithm (for example, Euclidean distance) would contradict the specified threshold condition, similar processing may be applied.
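The standard-deviation variant can be sketched directly from the formula in the text: a population standard deviation N followed by the mapping 1/(N+1). The function name `std_similarity` is a hypothetical name chosen for illustration.

```python
import math


def std_similarity(values):
    """Map the population standard deviation N of `values` to 1 / (N + 1).

    `values` would be the rates of change Y1..Yn for cumulative data.
    The result is 1.0 when all values are equal and approaches 0 as the
    spread grows, so a smaller result means less similar.
    """
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return 1.0 / (std + 1.0)
```

Because the result always lies in (0, 1], a first similarity threshold for this variant would also be chosen within that interval.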

For example, when the cosine distance is used as the first similarity algorithm and the rates of change of the pairs are Y₁, Y₂, Y₃, Y₄, Y₅, Y₆, Y₇, Y₆ and Y₇ may be substituted into the cosine distance formula to calculate the similarity corresponding to the data to be processed. Alternatively, Y₁ and Y₂, Y₂ and Y₃, Y₃ and Y₄, Y₄ and Y₅, Y₅ and Y₆, and Y₆ and Y₇ may each be substituted into the cosine distance formula to obtain a plurality of cosine distances, and the standard deviation of those cosine distances may then be computed to obtain the similarity corresponding to the data to be processed.

As another approach, as shown in FIG. 5, if the data to be processed is cumulative data, calculating the similarity based on the rates of change of the data to be processed and the historical collection data collected in historical collection periods includes:

S2211: If the data to be processed is multidimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data.

S2212: Acquire the data at each dimension position in the data to be processed and in the plurality of historical collection data.

For example, as shown in FIG. 6, the acquired historical collection data, in chronological order from earliest to latest, are {A1, B1, C1, D1}, {A2, B2, C2, D2}, {A3, B3, C3, D3}, and the data to be processed is {A4, B4, C4, D4}, where A, B, C, and D denote the four dimension positions of the multidimensional data. For instance, in the data to be processed, the data at dimension position A is A4, at B is B4, at C is C4, and at D is D4.

S2213: Divide the data at each dimension position into groups based on the corresponding dimension position to obtain multiple groups of data, where all data in the same group correspond to the same dimension position.

For example, as shown in FIG. 6, with historical collection data {A1, B1, C1, D1}, {A2, B2, C2, D2}, {A3, B3, C3, D3} in chronological order and data to be processed {A4, B4, C4, D4}, the resulting groups are {A1, A2, A3, A4}, {B1, B2, B3, B4}, {C1, C2, C3, C4}, {D1, D2, D3, D4}.
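The grouping in S2213 amounts to transposing the chronological list of samples. A minimal sketch, with `group_by_dimension` as a hypothetical name and assuming every sample has the same number of dimensions:

```python
def group_by_dimension(samples):
    """Transpose chronological samples into one chronological group per dimension.

    `samples` is a list such as [[A1, B1], [A2, B2], [A3, B3]]; the result is
    [[A1, A2, A3], [B1, B2, B3]].
    """
    return [list(column) for column in zip(*samples)]
```

Each returned group keeps the chronological order of the input, which the pairing in the next step depends on.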

S2214: Obtain multiple pairs of data in each group of data, where the two data in each pair correspond to adjacent collection periods.

For example, as shown in FIG. 6, the group {A1, A2, A3, A4} yields the following three pairs: A1A2, A2A3, A3A4.

S2215: For each pair of data, take the difference between the data from the later collection period and the data from the earlier collection period as a reference difference, to obtain the reference difference corresponding to each pair of data in each group.

For example, as shown in FIG. 6, the group {A1, A2, A3, A4} contains the pairs A1A2, A2A3, A3A4, and the reference differences obtained from A2-A1, A3-A2, A4-A3 are X1 (corresponding to A2-A1), X2 (corresponding to A3-A2), X3 (corresponding to A4-A3).

S2216: Divide the reference difference corresponding to each pair of data by the data from the earlier collection period in that pair, to obtain the rate of change corresponding to each pair.

For example, as shown in FIG. 6, in the group {A1, A2, A3, A4}, the pairs A1A2, A2A3, A3A4 have reference differences X1, X2, X3, and the rates of change obtained from X1/A1, X2/A2, X3/A3 are Y_A1 (corresponding to X1/A1), Y_A2 (corresponding to X2/A2), Y_A3 (corresponding to X3/A3).

S2217: Divide the pairs of data into a plurality of sets based on the collection periods of the data they include, where every pair of data in the same set corresponds to the same collection periods.

For example, as shown in FIG. 6, the pairs in group A are A1A2, A2A3, A3A4; the pairs in group B are B1B2, B2B3, B3B4; the pairs in group C are C1C2, C2C3, C3C4; and the pairs in group D are D1D2, D2D3, D3D4. These can be divided into the following sets: {A1A2, B1B2, C1C2, D1D2}, {A2A3, B2B3, C2C3, D2D3}, {A3A4, B3B4, C3C4, D3D4}, where every pair in a given set covers the same two collection periods.

S2218: Generate corresponding multidimensional data based on the rate of change corresponding to each pair of data in each set, to obtain a plurality of multidimensional data, where the dimension position of each pair's rate of change in the generated multidimensional data is the same as the dimension position of that pair's data in the data to be processed or in the historical collection data.

For example, as shown in FIG. 6, the rates of change of the pairs in the set {A2A3, B2B3, C2C3, D2D3} are, in order, Y_A2, Y_B2, Y_C2, Y_D2, and these rates of change correspond to dimension positions A, B, C, and D, respectively.
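Steps S2213 to S2218 can be sketched together: compute rates of change within each dimension, then regroup them so that each output vector covers one pair of adjacent collection periods. The function name `rate_vectors` is hypothetical, and the sketch assumes equal-length samples with no zero values in the earlier position of any pair.

```python
def rate_vectors(samples):
    """Return one rate-of-change vector per pair of adjacent collection periods.

    `samples` is chronological, with the data to be processed last, e.g.
    [[A1, B1], [A2, B2], [A3, B3]] -> [[Y_A1, Y_B1], [Y_A2, Y_B2]].
    """
    columns = zip(*samples)  # S2213: one chronological group per dimension
    per_dimension = [
        # S2214-S2216: (later - earlier) / earlier for each adjacent pair
        [(later - earlier) / earlier for earlier, later in zip(col, col[1:])]
        for col in columns
    ]
    # S2217-S2218: regroup by period pair, preserving the dimension order
    return [list(vector) for vector in zip(*per_dimension)]
```

With four samples of four dimensions, as in FIG. 6, this yields three four-dimensional rate-of-change vectors.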

S2219: Calculate the similarity based on a second similarity algorithm and the plurality of multidimensional data, where the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.

For example, as shown in FIG. 5, the multidimensional data generated from the rates of change of the pairs in the set {A2A3, B2B3, C2C3, D2D3} is, in order, Y_A2, Y_B2, Y_C2, Y_D2, and the multidimensional data generated from the rates of change of the pairs in the set {A3A4, B3B4, C3C4, D3D4} is, in order, Y_A3, Y_B3, Y_C3, Y_D3. These two multidimensional data can be treated as two vectors, and the similarity corresponding to the data to be processed can then be obtained through the cosine distance formula.

Optionally, if multiple multidimensional rate-of-change data can be generated from the data to be processed and the acquired multidimensional historical collection data, for example Y_A1, Y_B1, Y_C1, Y_D1; Y_A2, Y_B2, Y_C2, Y_D2; Y_A3, Y_B3, Y_C3, Y_D3; and so on, a plurality of cosine distances can be obtained from each two adjacent multidimensional rate-of-change data, the standard deviation of those distances can be computed, and 1/(standard deviation + 1) can then be used as the similarity corresponding to the data to be processed.
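The optional variant above can be sketched as follows. The application does not define the cosine distance formula, so this sketch assumes the common definition of one minus the cosine similarity; the function names are hypothetical, and zero-length vectors are not handled.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cos(angle between u and v).

    Raises ZeroDivisionError if either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def similarity_from_vectors(vectors):
    """Std of cosine distances between adjacent vectors, mapped to 1 / (std + 1)."""
    distances = [cosine_distance(u, v) for u, v in zip(vectors, vectors[1:])]
    mean = sum(distances) / len(distances)
    std = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    return 1.0 / (std + 1.0)
```

Here `vectors` would be the chronological rate-of-change vectors produced in S2218, with the vector involving the data to be processed last.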

S230: If the data to be processed is non-cumulative data, calculate the similarity based on the data to be processed and the historical collection data collected in historical collection periods.

As one approach, as shown in FIG. 7, if the data to be processed is non-cumulative data, calculating the similarity based on the data to be processed and the historical collection data collected in historical collection periods includes:

S231: If the data to be processed is single-dimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of historical collection data.

S232: Calculate the similarity based on a first similarity algorithm, the data to be processed, and the plurality of historical collection data, where the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, adjusted cosine distance, Hamming distance, and Manhattan distance.

For example, when the standard deviation is used as the first similarity algorithm, suppose the acquired historical collection data, in chronological order from earliest to latest, are A, B, C, D, E, F, G, and the data to be processed is H. The mean M is first obtained as (A+B+...+H)/8, and the standard deviation N corresponding to the data to be processed is then obtained from the formula sqrt(((A-M)^2+(B-M)^2+...+(H-M)^2)/8). A smaller N indicates that the data to be processed is more similar to the historical data, and a larger N indicates that it is less similar, so to avoid contradicting the specified threshold condition, the similarity of the data to be processed may be taken as 1/(N+1). Likewise, when another first similarity algorithm (for example, Euclidean distance) would contradict the specified threshold condition, similar processing may be applied.
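As a worked illustration of the formula above, consider hypothetical numbers in the spirit of the read-fault example: three historical values of 100 reads per second and a pending value of 0 (the text uses seven historical values; three are used here only to keep the arithmetic short).

```latex
\begin{aligned}
M &= \frac{100 + 100 + 100 + 0}{4} = 75,\\
N &= \sqrt{\frac{3\,(100-75)^2 + (0-75)^2}{4}} = \sqrt{1875} \approx 43.30,\\
\text{similarity} &= \frac{1}{N+1} \approx 0.0226.
\end{aligned}
```

Had the pending value also been 100, N would be 0 and the similarity would be 1, so a first similarity threshold anywhere between these two results would store the anomalous sample and skip the ordinary one.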

作为另一种方式，如图8所示，若所述待处理数据为非累积型数据，基于所述待处理数据以及在历史采集周期所采集的历史采集数据计算相似性，包括：As another way, as shown in FIG. 8 , if the data to be processed is non-cumulative data, calculating similarity based on the data to be processed and the historical collection data collected in the historical collection period, including:

S236：若所述待处理数据为多维数据，获取多个历史采集周期各自采集的历史采集数据，得到多个历史采集数据。S236: If the data to be processed is multi-dimensional data, acquire the historical collection data collected by each of a plurality of historical collection periods to obtain a plurality of historical collection data.

S237：基于第二相似性算法、所述待处理数据以及所述多个历史采集数据计算相似性，所述第二相似性算法包括欧式距离、余弦距离、皮尔逊相关系数、修正余弦距离、汉明距离、曼哈顿距离中的任意一项。S237: Calculate similarity based on a second similarity algorithm, the data to be processed, and the plurality of historically collected data, where the second similarity algorithm includes Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Chinese Any of the Ming distance and Manhattan distance.

其中，示例性的，获取到的历史采集数据按时间顺序由先到后依次为：{A1，B1，C1，D1}、{A2，B2，C2，D2}、{A3，B3，C3，D3}，待处理数据为{A4，B4，C4，D4}，当第二相似性算法为余弦距离时，可以通过计算{A4，B4，C4，D4}与{A3，B3，C3，D3}的余弦距离得到待处理数据的相似性。Wherein, for example, the acquired historical collection data in chronological order are: {A1, B1, C1, D1}, {A2, B2, C2, D2}, {A3, B3, C3, D3 }, the data to be processed is {A4, B4, C4, D4}, when the second similarity algorithm is cosine distance, you can calculate the difference between {A4, B4, C4, D4} and {A3, B3, C3, D3} The cosine distance gives the similarity of the data to be processed.

Optionally, if multiple cosine distances can be obtained from the data to be processed and the acquired multi-dimensional historical collection data, the standard deviation of those cosine distances may be calculated, and the similarity corresponding to the data to be processed may then be obtained as 1/(standard deviation + 1).

It should be noted that when a larger value from the second similarity algorithm indicates that the data to be processed is less similar to the historical collection data, the similarity corresponding to the data to be processed may be obtained as 1/(value of the second similarity algorithm + 1), so that the similarity can be compared with the specified threshold condition.
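
As a hedged sketch of the multi-dimensional case, the following Python code compares the data to be processed with the most recent historical sample, as in the example above. It assumes that "cosine distance" means 1 minus the cosine of the angle between the two vectors, which the application does not define explicitly. The function names and sample vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_to_similarity(distance):
    """Map a distance (larger means less similar) to a similarity via 1/(d+1)."""
    return 1.0 / (distance + 1.0)


def multidim_similarity(history, current):
    """Compare the current sample with the latest historical sample only."""
    return distance_to_similarity(cosine_distance(current, history[-1]))


# Hypothetical two-dimensional samples.
same_direction = multidim_similarity([[1.0, 2.0], [2.0, 4.0]], [3.0, 6.0])
orthogonal = multidim_similarity([[1.0, 0.0]], [0.0, 1.0])
```

Under this assumption, vectors pointing in the same direction give a similarity close to 1, and orthogonal vectors give a similarity of 0.5.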

S240: If the similarity satisfies the specified threshold condition, store the data to be processed.

In the data collection method provided by this embodiment, the similarity between the data to be processed and the historical collection data collected in historical collection periods is compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies that condition. The acquired collection data is thus screened against the specified threshold condition, and only collection data that meets the requirement (the specified threshold condition) is stored. Not every piece of acquired collection data has to be stored, which saves storage space. In addition, in this embodiment, when the data to be processed is cumulative data, the similarity can be calculated based on the rates of change of the data to be processed and the plurality of pieces of historical collection data; when the data to be processed is non-cumulative data, the similarity can be calculated based on the data to be processed and the plurality of pieces of historical collection data. Because different data types have different data characteristics, using different data for the similarity calculation depending on the type of the data to be processed makes it easier to identify whether the data to be processed has high-value data characteristics. Furthermore, cumulative and non-cumulative data are each further divided into single-dimensional data and multi-dimensional data, and different similarity algorithms are used according to the data dimension, which improves the applicability and scalability of the data collection method proposed in this application.

Referring to FIG. 9, the present application provides a data collection method, the method including:

S310: In response to a collection instruction, acquire the collection data collected in the current collection period as the data to be processed.

S320: Calculate the similarity between the data to be processed and the historical collection data collected in historical collection periods.

S330: If the similarity satisfies the specified threshold condition, store the data to be processed.

S340: If the similarity does not satisfy the specified threshold condition and the collection time of the data to be processed matches the persistent storage period, store the data to be processed.

In this embodiment of the present application, for the sake of subsequent device maintenance and device security, a persistent storage period may be set so that the data to be processed is stored periodically. More routine data (data from normal operation of the device) can thereby be stored.

In one manner, the persistent storage period may be N times the collection period (N > 1, where N is an integer). When the similarity corresponding to the data to be processed does not satisfy the specified threshold condition, it may be determined whether the collection time of the data to be processed matches the persistent storage period. If it matches, that is, the collection time of the data to be processed falls exactly on a time corresponding to the persistent storage period, the data to be processed may be persistently stored.

It should be noted that, in this embodiment of the present application, the order of the checks may also be reversed. It may first be determined whether the collection time of the data to be processed matches the persistent storage period. If it matches, the data to be processed may be persistently stored. If it does not match, it may then be determined whether the similarity corresponding to the data to be processed satisfies the specified threshold condition, and if so, the data to be processed may be persistently stored.
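
The storage decision described above can be sketched as follows. This is a minimal illustration that assumes collection periods are numbered 1, 2, 3, ... and that a collection time "matches" the persistent storage period when its index is a multiple of N. The names `matches_persistence_period` and `should_store` are hypothetical.

```python
def matches_persistence_period(cycle_index, n):
    """True when the collection period index falls on the persistent storage
    period, assumed here to be every n-th collection period."""
    return cycle_index % n == 0


def should_store(meets_threshold, cycle_index, n):
    """Store when the threshold condition holds or the persistent storage
    period is reached. The result is the same whichever check runs first."""
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    return meets_threshold or matches_persistence_period(cycle_index, n)


# Hypothetical run: only collection period 3 satisfies the threshold condition, N = 4.
stored_cycles = [
    cycle for cycle in range(1, 9)
    if should_store(meets_threshold=(cycle == 3), cycle_index=cycle, n=4)
]
```

In this run the stored collection periods are 3 (threshold condition) and 4 and 8 (persistent storage period).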

Optionally, for ease of management, the same persistent storage period may be set for all data to be processed. For example, the persistent storage period of all data to be processed may be set to 1 day.

Optionally, to save storage space, the persistent storage period may be determined based on the type of the data to be processed. For example, when the data to be processed is cumulative data, the persistent storage period may be set to 7 days; when the data to be processed is non-cumulative data, the persistent storage period may be set to 1 day.
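
A type-dependent persistent storage period could be configured as in the sketch below. The 7-day and 1-day values come from the example above; the dictionary keys, the function name, and the collection periods used in the usage lines are hypothetical.

```python
from datetime import timedelta

# Example periods from the text: 7 days for cumulative data, 1 day otherwise.
PERSISTENCE_PERIODS = {
    "cumulative": timedelta(days=7),
    "non_cumulative": timedelta(days=1),
}


def persistence_multiple(data_type, collection_period):
    """Return N such that the persistent storage period is N collection periods.

    Raises ValueError unless N is an integer greater than 1."""
    period = PERSISTENCE_PERIODS[data_type]
    multiple = period // collection_period
    if period % collection_period != timedelta(0) or multiple <= 1:
        raise ValueError("persistent storage period must be N collection periods, N > 1")
    return multiple


# Hypothetical collection periods of one hour and five minutes.
n_cumulative = persistence_multiple("cumulative", timedelta(hours=1))
n_non_cumulative = persistence_multiple("non_cumulative", timedelta(minutes=5))
```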

In the data collection method provided by this embodiment, the similarity between the data to be processed and the historical collection data collected in historical collection periods is compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies that condition. The acquired collection data is thus screened against the specified threshold condition, and only collection data that meets the requirement (the specified threshold condition) is stored. Not every piece of acquired collection data has to be stored, which saves storage space. In addition, in this embodiment, the data to be processed can be stored either when its collection time matches the persistent storage period or when the similarity satisfies the specified threshold condition. The data to be processed is always stored when the persistent storage period arrives, whereas whether the similarity satisfies the specified threshold condition is random. The data to be processed is therefore persistently stored at a variable frequency, so that both high-value data to be processed and routine data to be processed can be persistently stored, which saves storage space while improving the security and stability of the device.

Referring to FIG. 10, the present application provides a data collection method, the method including:

S410: In response to a collection instruction, acquire the collection data collected in the current collection period as the data to be processed.

S420: Calculate the similarity between the data to be processed and the historical collection data collected in historical collection periods.

S430: If the similarity satisfies the specified threshold condition, store the data to be processed.

S440: Collect and store data based on a persistent storage period, where the persistent storage period is greater than the sampling period.

In one manner, as shown in FIG. 11, when the time corresponding to the persistent storage period arrives, the control unit of the electronic device may issue a collection instruction to the data collection unit. The collection instruction may include an identifier of the data to be collected. In response to the collection instruction, the data collection unit may perform a data collection operation according to that identifier and persistently store the collection data collected in the current persistent storage period.

In another manner, as shown in FIG. 11, when the time corresponding to the sampling period arrives, the control unit of the electronic device may issue a collection instruction to the data collection unit. The collection instruction may include an identifier of the data to be collected. In response to the collection instruction, the data collection unit may perform a data collection operation according to that identifier, and the collection data collected in the current collection period is taken as the data to be processed. The similarity corresponding to the data to be processed is then compared with the specified threshold condition, and if the similarity satisfies the specified threshold condition, the data to be processed may be persistently stored.
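
The two triggers described above (the persistent storage period and the sampling period) can be simulated with the following sketch. It assumes integer time ticks, and it assumes that when both periods fall on the same tick the sample is stored once as a periodic sample; the application does not specify this case. The function `run_schedule` and the callbacks passed to it are hypothetical.

```python
def run_schedule(duration, sampling_period, persistence_period, collect, meets_threshold):
    """Simulate collection over integer ticks 1..duration and return what is stored.

    collect(t) returns the sample taken at tick t.
    meets_threshold(sample) tells whether the similarity computed for the sample
    satisfies the specified threshold condition.
    """
    if persistence_period <= sampling_period:
        raise ValueError("persistent storage period must exceed the sampling period")
    stored = []
    for t in range(1, duration + 1):
        if t % persistence_period == 0:
            # Persistent storage period reached: collect and store unconditionally.
            stored.append((t, collect(t), "periodic"))
        elif t % sampling_period == 0:
            # Regular sampling: store only if the threshold condition is satisfied.
            sample = collect(t)
            if meets_threshold(sample):
                stored.append((t, sample, "threshold"))
    return stored


# Hypothetical run: sample every 2 ticks, persist every 6 ticks, for 12 ticks.
result = run_schedule(
    duration=12,
    sampling_period=2,
    persistence_period=6,
    collect=lambda t: t * 10,
    meets_threshold=lambda sample: sample == 40,
)
```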

In the data collection method provided by this embodiment, the similarity between the data to be processed and the historical collection data collected in historical collection periods is compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies that condition. The acquired collection data is thus screened against the specified threshold condition, and only collection data that meets the requirement (the specified threshold condition) is stored. Not every piece of acquired collection data has to be stored, which saves storage space. In addition, in this embodiment, the data to be processed can be stored either when the time corresponding to the persistent storage period arrives or when the similarity satisfies the specified threshold condition. The data to be processed is always stored when the persistent storage period arrives, whereas whether the similarity satisfies the specified threshold condition is random. The data to be processed is therefore persistently stored at a variable frequency, so that both high-value data to be processed and routine data to be processed can be persistently stored, which saves storage space while improving the security and stability of the device.

Referring to FIG. 12, the present application provides a data collection apparatus 600, the apparatus 600 including:

a to-be-processed data acquisition unit 610, configured to acquire, in response to a collection instruction, the collection data collected in the current collection period as the data to be processed;

a similarity calculation unit 620, configured to calculate the similarity between the data to be processed and the historical collection data collected in historical collection periods; and

a storage unit 630, configured to store the data to be processed if the similarity satisfies the specified threshold condition.

In one manner, the similarity calculation unit 620 is specifically configured to calculate, based on the type of the data to be processed, the similarity between the data to be processed and the historical collection data collected in historical collection periods.

In another manner, the similarity calculation unit 620 is specifically configured to: if the data to be processed is cumulative data, calculate the similarity based on the rates of change of the data to be processed and the historical collection data collected in historical collection periods; and if the data to be processed is non-cumulative data, calculate the similarity based on the data to be processed and the historical collection data collected in historical collection periods.

Optionally, the similarity calculation unit 620 is specifically configured to: if the data to be processed is single-dimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of pieces of historical collection data; obtain multiple pairs of data from the data to be processed and the plurality of pieces of historical collection data, where the two pieces of data in each pair correspond to adjacent collection periods; take, for each pair, the difference between the data of the later collection period and the data of the earlier collection period as a reference difference, to obtain the reference difference corresponding to each pair of data; divide the reference difference corresponding to each pair by the data of the earlier collection period in that pair, to obtain the rate of change corresponding to each pair of data; and calculate the similarity based on a first similarity algorithm and the rate of change corresponding to each pair of data, where the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.
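
For single-dimensional cumulative data, the rate-of-change procedure can be sketched as below. The sketch divides each reference difference by the earlier value of the pair, as claim 4 recites, and then applies the standard deviation with the 1/(N+1) mapping as one possible first similarity algorithm. The function names and sample values are hypothetical.

```python
import math


def change_rates(series):
    """Rate of change for each pair of adjacent samples: (later - earlier) / earlier."""
    rates = []
    for earlier, later in zip(series, series[1:]):
        if earlier == 0:
            raise ValueError("rate of change is undefined when the earlier value is 0")
        rates.append((later - earlier) / earlier)
    return rates


def cumulative_similarity(history, current):
    """Similarity of a cumulative counter, based on the spread of its change rates."""
    rates = change_rates(list(history) + [current])
    mean = sum(rates) / len(rates)
    std = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
    return 1.0 / (std + 1.0)


# Hypothetical counter that doubles each period, then either keeps doubling or jumps.
regular = cumulative_similarity([100, 200, 400], 800)
jump = cumulative_similarity([100, 200, 400], 4000)
```

A counter that keeps growing at the same rate yields a similarity of 1, while a sudden jump in the growth rate lowers the similarity.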

Optionally, the similarity calculation unit 620 is specifically configured to: if the data to be processed is multi-dimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of pieces of historical collection data; obtain the data at each dimension position in the data to be processed and in the plurality of pieces of historical collection data; divide the data into multiple groups based on the corresponding dimension position, where the data in the same group correspond to the same dimension position; obtain multiple pairs of data in each group, where the two pieces of data in each pair correspond to adjacent collection periods; take, for each pair, the difference between the data of the later collection period and the data of the earlier collection period as a reference difference, to obtain the reference difference corresponding to each pair of data in each group; divide the reference difference corresponding to each pair by the data of the earlier collection period in that pair, to obtain the rate of change corresponding to each pair of data; divide the pairs of data into multiple sets based on the sampling periods of the data they include, where the pairs of data in the same set correspond to the same collection periods; generate corresponding multi-dimensional data based on the rate of change corresponding to each pair of data in each set, to obtain a plurality of pieces of multi-dimensional data, where the dimension position of the rate of change corresponding to each pair of data in the generated multi-dimensional data is the same as the dimension position of the data of that pair in the data to be processed or the historical collection data; and calculate the similarity based on a second similarity algorithm and the plurality of pieces of multi-dimensional data, where the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.
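
The construction of the multi-dimensional rate-of-change data can be sketched as follows. The sketch covers only the grouping and pairing steps: each pair of adjacent samples produces one vector whose i-th entry is the rate of change at dimension position i. How the resulting vectors are fed to the second similarity algorithm is left open here. The function name and sample values are hypothetical.

```python
def rate_vectors(samples):
    """Build one rate-of-change vector per pair of adjacent multi-dimensional samples.

    samples is ordered by collection period, oldest first, with the data to be
    processed as the last element."""
    vectors = []
    for earlier, later in zip(samples, samples[1:]):
        if len(earlier) != len(later):
            raise ValueError("all samples must have the same number of dimensions")
        if any(a == 0 for a in earlier):
            raise ValueError("rate of change is undefined when an earlier value is 0")
        # Entry i keeps dimension position i of the original samples.
        vectors.append([(b - a) / a for a, b in zip(earlier, later)])
    return vectors


# Hypothetical two-dimensional cumulative samples over three collection periods.
vectors = rate_vectors([[100, 10], [200, 20], [400, 30]])
```

Here the first vector holds the rates between the first and second collection periods, and the second vector holds the rates between the second and third.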

Optionally, the similarity calculation unit 620 is specifically configured to: if the data to be processed is single-dimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of pieces of historical collection data; and calculate the similarity based on a first similarity algorithm, the data to be processed, and the plurality of pieces of historical collection data, where the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.

Optionally, the similarity calculation unit 620 is specifically configured to: if the data to be processed is multi-dimensional data, acquire the historical collection data collected in each of a plurality of historical collection periods to obtain a plurality of pieces of historical collection data; and calculate the similarity based on a second similarity algorithm, the data to be processed, and the plurality of pieces of historical collection data, where the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.

In yet another manner, the similarity calculation unit 620 is specifically configured to calculate, based on the dimension of the data to be processed, the similarity between the data to be processed and the historical collection data collected in historical collection periods.

In one manner, the storage unit 630 is specifically configured to store the data to be processed if the similarity does not satisfy the specified threshold condition and the collection time of the data to be processed matches the persistent storage period.

In another manner, the storage unit 630 is specifically configured to collect and store data based on a persistent storage period, where the persistent storage period is greater than the sampling period.

Optionally, the persistent storage period is determined based on the type of the data to be processed.

Optionally, the specified threshold condition includes the similarity being less than a first similarity threshold, or the difference between the similarity and the similarity corresponding to the previous collection period being less than a second similarity threshold.
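
The two forms of the specified threshold condition can be sketched as follows. The sketch assumes the difference is the signed value (current similarity minus the previous period's similarity), so that a sharp drop is detected with a negative second threshold. It also assumes the two checks may be combined with a logical "or". Neither point is fixed by the text, and the function names and threshold values are hypothetical.

```python
def below_first_threshold(similarity, first_threshold):
    """First form: the similarity itself is below the first similarity threshold."""
    return similarity < first_threshold


def change_below_second_threshold(similarity, previous_similarity, second_threshold):
    """Second form: the signed change in similarity since the previous collection
    period is below the second similarity threshold."""
    return (similarity - previous_similarity) < second_threshold


def meets_threshold_condition(similarity, previous_similarity,
                              first_threshold, second_threshold):
    """Combine both forms; either one is enough to trigger storage."""
    return (below_first_threshold(similarity, first_threshold)
            or change_below_second_threshold(similarity, previous_similarity,
                                             second_threshold))


# Hypothetical thresholds: store when similarity < 0.5 or it dropped by more than 0.3.
low_similarity = meets_threshold_condition(0.2, 0.9, 0.5, -0.3)
sharp_drop = meets_threshold_condition(0.6, 0.95, 0.5, -0.3)
unchanged = meets_threshold_condition(0.8, 0.85, 0.5, -0.3)
```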

An electronic device provided by the present application is described below with reference to FIG. 13.

Referring to FIG. 13, based on the foregoing data collection method and apparatus, an embodiment of the present application further provides an electronic device 1000 capable of executing the foregoing data collection method. The electronic device 1000 includes one or more processors 102 (only one is shown in the figure) and a memory 104, which are coupled to each other. The memory 104 stores a program that can carry out the content of the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104. The processor 102 may include one or more processing cores. The processor 102 connects the various parts of the electronic device 1000 through various interfaces and lines, and performs the various functions of the electronic device 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 104 and by calling the data stored in the memory 104.

Optionally, the processor 102 may be implemented in at least one hardware form among digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 102 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and so on; the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 102 and may instead be implemented separately by a communication chip.

The memory 104 may include random access memory (RAM) or read-only memory (ROM). The memory 104 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 104 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the method embodiments, and the like. The data storage area may store data created by the electronic device 1000 during use (such as a phone book, audio and video data, and chat record data).

Referring to FIG. 14, a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application is shown. The computer-readable storage medium 800 stores program code, and the program code can be invoked by a processor to execute the methods described in the foregoing method embodiments.

The computer-readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. Optionally, the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 that performs any of the method steps of the foregoing methods. The program code can be read from or written to one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.

In summary, with the data collection method, apparatus, electronic device, and storage medium provided by the present application, the collection data collected in the current collection period is acquired in response to a collection instruction and taken as the data to be processed, the similarity between the data to be processed and the historical collection data collected in historical collection periods is calculated, and the data to be processed is stored if the similarity satisfies the specified threshold condition. The similarity is thus compared with the specified threshold condition, and the data to be processed is stored only when the similarity satisfies that condition. The acquired collection data is screened against the specified threshold condition, and only collection data that meets the requirement (the specified threshold condition) is stored. Not every piece of acquired collection data has to be stored, which saves storage space.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A data collection method, characterized in that the method comprises:

acquiring, in response to a collection instruction, the collection data collected in the current collection period as data to be processed;

calculating the similarity between the data to be processed and historical collection data collected in a historical collection period; and

storing the data to be processed if the similarity satisfies a specified threshold condition.

2. The method according to claim 1, wherein calculating the similarity between the collection data and the historical collection data collected in the historical collection period comprises:

calculating, based on the type of the data to be processed, the similarity between the data to be processed and the historical collection data collected in the historical collection period.

3. The method according to claim 2, wherein calculating, based on the type of the data to be processed, the similarity between the data to be processed and the historical collection data collected in the historical collection period comprises:

if the data to be processed is cumulative data, calculating the similarity based on the rates of change of the data to be processed and the historical collection data collected in the historical collection period; and

if the data to be processed is non-cumulative data, calculating the similarity based on the data to be processed and the historical collection data collected in the historical collection period.

4. The method according to claim 3, wherein calculating the similarity based on the rate of change of the data to be processed and the historical collection data collected in the historical collection period comprises:

If the data to be processed is single-dimensional data, obtaining the historical collection data collected in each of a plurality of historical collection periods, to obtain a plurality of historical collection data;

Obtaining multiple pairs of data from the data to be processed and the plurality of historical collection data, wherein the collection periods corresponding to the two data items in each pair are adjacent;

Obtaining, for each pair of data, the difference between the data corresponding to the later collection period and the data corresponding to the earlier collection period as a reference difference, so as to obtain the reference difference corresponding to each pair of data;

Comparing the reference difference corresponding to each pair of data with the data corresponding to the earlier collection period in that pair, to obtain the rate of change corresponding to each pair of data;

Calculating the similarity based on a first similarity algorithm and the rate of change corresponding to each pair of data, wherein the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.
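The steps of claim 4 can be illustrated with a minimal Python sketch. It is not the claimed implementation: it assumes that "comparing" the reference difference with the earlier value means dividing by it, and it picks standard deviation from the listed first similarity algorithms. The function names `change_rates` and `similarity_single_dim` are hypothetical.

```python
from statistics import pstdev


def change_rates(series):
    """Rate of change for each pair of values from adjacent collection periods.

    `series` is ordered oldest to newest; its last item is the data to be
    processed. Assumption: rate = (later - earlier) / earlier.
    """
    rates = []
    for earlier, later in zip(series, series[1:]):
        if earlier == 0:
            raise ValueError("earlier value is zero; rate of change is undefined")
        reference_difference = later - earlier
        rates.append(reference_difference / earlier)
    return rates


def similarity_single_dim(history, current):
    """Population standard deviation of the change rates.

    A smaller value means the newest sample changes at about the same
    rate as the historical samples did.
    """
    return pstdev(change_rates(list(history) + [current]))
```

For a counter that grows by about 10% per period, for example `similarity_single_dim([100, 110], 121)`, both change rates are close to 0.1, so the result is close to zero.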

5. The method according to claim 3, wherein calculating the similarity based on the rate of change of the data to be processed and the historical collection data collected in the historical collection period comprises:

If the data to be processed is multi-dimensional data, obtaining the historical collection data collected in each of a plurality of historical collection periods, to obtain a plurality of historical collection data;

Acquiring the data at each dimension position in the data to be processed and in the plurality of historical collection data;

Dividing the data at each dimension position into multiple groups based on the corresponding dimension position, to obtain multiple groups of data, wherein the data in the same group correspond to the same dimension position;

Acquiring multiple pairs of data in each group of data, wherein the two data items in each pair have adjacent collection periods;

Obtaining, for each pair of data, the difference between the data corresponding to the later collection period and the data corresponding to the earlier collection period as a reference difference, so as to obtain the reference difference corresponding to each pair of data in each group of data;

Dividing the multiple pairs of data into multiple sets based on the sampling periods of the data they contain, wherein all pairs of data in the same set have the same sampling periods;

Generating corresponding multi-dimensional data based on the rate of change corresponding to each pair of data in each set, to obtain a plurality of multi-dimensional data, wherein the dimension position of the rate of change corresponding to each pair of data in the generated multi-dimensional data is the same as the dimension position of the data of that pair in the data to be processed or the historical collection data;

Calculating the similarity based on a second similarity algorithm and the plurality of multi-dimensional data, wherein the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.
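A rough Python sketch of claim 5 follows. It rests on several assumptions that the claim does not state: the rate of change is taken as (later - earlier) / earlier, Euclidean distance is chosen as the second similarity algorithm, and the newest rate vector is compared against the mean of the earlier rate vectors. The names `rate_vectors` and `similarity_multi_dim` are hypothetical.

```python
import math


def rate_vectors(samples):
    """Build one change-rate vector per pair of adjacent collection periods.

    `samples` is a list of equal-length tuples ordered oldest to newest;
    the last tuple is the data to be processed. Each rate keeps the
    dimension position of the values it was derived from.
    Assumption: rate = (later - earlier) / earlier, so an earlier value
    of zero raises ZeroDivisionError.
    """
    vectors = []
    for earlier, later in zip(samples, samples[1:]):
        if len(earlier) != len(later):
            raise ValueError("all samples must have the same number of dimensions")
        vectors.append(tuple((b - a) / a for a, b in zip(earlier, later)))
    return vectors


def similarity_multi_dim(history, current):
    """Euclidean distance between the newest rate vector and the mean of
    the earlier rate vectors (an assumed way to combine the vectors)."""
    vectors = rate_vectors(list(history) + [current])
    if len(vectors) < 2:
        raise ValueError("need at least two historical samples")
    *previous, newest = vectors
    mean = tuple(sum(column) / len(previous) for column in zip(*previous))
    return math.dist(newest, mean)
```

With `history = [(100, 10), (110, 20), (121, 40)]`, a new sample of `(133.1, 80)` continues both trends and gives a distance near zero, while `(242, 80)` doubles the first dimension and gives a distance of about 0.9.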

6. The method according to claim 3, wherein calculating the similarity based on the data to be processed and the historical collection data collected in the historical collection period comprises:

The similarity is calculated based on a first similarity algorithm, the data to be processed, and the plurality of historical collection data, wherein the first similarity algorithm includes any one of standard deviation, Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.

7. The method according to claim 3, wherein calculating the similarity based on the data to be processed and the historical collection data collected in the historical collection period comprises:

The similarity is calculated based on a second similarity algorithm, the data to be processed, and the plurality of historical collection data, wherein the second similarity algorithm includes any one of Euclidean distance, cosine distance, Pearson correlation coefficient, modified cosine distance, Hamming distance, and Manhattan distance.

8. The method according to claim 1, wherein calculating the similarity between the data to be processed and the historical collection data collected in the historical collection period comprises:

The similarity between the data to be processed and the historical collection data collected in the historical collection period is calculated based on the dimension of the data to be processed.

9. The method according to any one of claims 1-8, wherein the method further comprises:

If the similarity does not satisfy the specified threshold condition and the collection time of the data to be processed matches the persistent storage period, the data to be processed is stored.
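The storage decision of claims 1 and 9 can be sketched as below. This is an illustration only. It assumes that collection times are integer timestamps in seconds and that a collection time "matches" the persistent storage period when it falls on a multiple of that period counted from a hypothetical `epoch`. The function name `should_store` is also hypothetical.

```python
def should_store(meets_condition, collected_at, persistent_period, epoch=0):
    """Decide whether the data to be processed is stored.

    meets_condition   -- whether the similarity satisfies the threshold condition
    collected_at      -- collection time in seconds (assumed integer timestamp)
    persistent_period -- persistent storage period in seconds
    epoch             -- assumed reference time from which periods are counted
    """
    if meets_condition:
        # Claim 1: store when the threshold condition is satisfied.
        return True
    # Claim 9: otherwise store only on a persistent storage boundary.
    return (collected_at - epoch) % persistent_period == 0
```

With a 60-second persistent storage period, a sample that fails the threshold condition is still stored at `collected_at=120` but skipped at `collected_at=125`.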

10. The method according to claim 9, wherein the method further comprises:

Based on the type of the data to be processed, a persistent storage period is determined.

11. The method according to claim 1, wherein the specified threshold condition comprises: the similarity being less than a first similarity threshold; or

the absolute value of the difference between the similarity and the similarity corresponding to the previous collection period being less than a second similarity threshold.
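The two alternative threshold conditions of claim 11 can be written as two small predicates. The sketch below is illustrative; the function names and the example threshold values are hypothetical, and which of the two conditions applies is left to the caller.

```python
def below_first_threshold(similarity, first_threshold):
    """First alternative: the similarity is less than the first similarity threshold."""
    return similarity < first_threshold


def stable_versus_previous(similarity, previous_similarity, second_threshold):
    """Second alternative: the similarity differs from the one computed in the
    previous collection period by less than the second similarity threshold."""
    return abs(similarity - previous_similarity) < second_threshold
```

For example, with a second threshold of 0.05, similarities of 0.50 and 0.52 in consecutive periods satisfy the second condition, while 0.50 and 0.60 do not.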

12. The method according to any one of claims 1-8, wherein the method further comprises:

Data is collected and stored based on a persistent storage period, wherein the persistent storage period is greater than the sampling period.

13. A data acquisition device, characterized in that the device comprises:

a to-be-processed data acquisition unit, configured to, in response to a collection instruction, acquire the collection data collected in the current collection period as the data to be processed;

a similarity calculation unit, configured to calculate the similarity between the data to be processed and the historical collection data collected in the historical collection period;

a storage unit, configured to store the data to be processed if the similarity satisfies a specified threshold condition.

14. An electronic device, comprising one or more processors and a memory;

One or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any of claims 1-12.

15. A computer-readable storage medium, wherein program code is stored in the computer-readable storage medium, and the method according to any one of claims 1-12 is performed when the program code is executed.