CN112819176A - Data management method and data management device suitable for machine learning - Google Patents


Info

Publication number: CN112819176A
Application number: CN202110088033.6A
Authority: CN (China)
Prior art keywords: data, time sequence, time, discrete, sampling interval
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112819176B
Inventor: 赵家志
Current Assignee: Fiberhome Telecommunication Technologies Co Ltd
Original Assignee: Fiberhome Telecommunication Technologies Co Ltd
Application filed by Fiberhome Telecommunication Technologies Co Ltd
Priority to: CN202110088033.6A
Publication of CN112819176A; application granted; publication of CN112819176B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 - Network management architectures or arrangements
    • H04L41/044 - Network management architectures or arrangements comprising hierarchical management structures
    • H04L41/06 - Management of faults, events, alarms or notifications
    • H04L41/0631 - Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/064 - Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving time analysis

Abstract

The invention discloses a data management method and a data management device suitable for machine learning. The data management method comprises the following steps: the data management device receives a registration task of an intelligent application; discrete values of the target data corresponding to each data tag are acquired and filled into the corresponding time series of the global data storage area; and when the time series corresponding to the registration task satisfy the loopback condition, B×S values are extracted from the corresponding time series in the registration order of the data tags to form F vectors, the F vectors are jointly reshaped into a B×S×F tensor, and the B×S×F tensor is distributed to the corresponding intelligent application task. In the invention, the various heterogeneous sample data in a communication network are reasonably expressed as a tensor without losing the correlation between samples or the time characteristics of the samples, providing the basis for applying a wider range of machine learning models.

Description

Data management method and data management device suitable for machine learning
Technical Field
The present invention relates to the field of machine learning, and more particularly, to a data management method and a data management apparatus suitable for machine learning.
Background
Machine learning, and deep learning technology in particular, has developed rapidly in recent years, and communication networks increasingly apply it to aspects such as network planning and tuning, intelligent energy saving, fault tracing and performance monitoring. Machine learning operates on multidimensional tensors, so the various heterogeneous sample data in communication network equipment must be given a reasonable tensor representation before they can be used for machine learning; this process is also called feature engineering. Although deep learning can perform feature extraction and thereby simplify feature engineering, the data it processes are still multidimensional tensors. The tensor representation of the sample data must meet the input requirements of the model while preserving the completeness of the data information, such as the correlation between data and the time characteristics of the data.
In the field of communication technology, as machine learning is applied more widely, models built for a single class of data can no longer meet users' needs; more and more models need to learn the correlation between data of different structures in order to mine latent information. For example, the performance, alarms, device state and user control of certain services of a device are all correlated.
Therefore, a method is needed to represent heterogeneous data reasonably in a tensor. In addition, the time information of data changes is also a key object mined by machine learning algorithms. In the prior art, the network manager collects and manages the required device data separately, so time alignment across multiple data indexes cannot be guaranteed, and because of the complexity of the interaction between the network manager and the devices and because of network transmission delay, even the time uniformity of sampling a single piece of data cannot be guaranteed. Moreover, the application of artificial intelligence to communication networks is currently still at a stage where model, data and application are tightly coupled: each intelligent application processes data only for its specific model, and the intelligent applications process data independently, which causes a large amount of redundant and repeated data processing work in data acquisition and storage.
Disclosure of Invention
In view of the above drawbacks and needs of the prior art, the present invention provides a data management method and a data management apparatus for machine learning, which can reasonably represent the various heterogeneous sample data in a communication network as a tensor without losing the correlation between samples or the time characteristics of the samples, so as to provide the foundation for applying a wider range of machine learning models.
To achieve the above object, according to one aspect of the present invention, there is provided a data management method suitable for machine learning, the data management method being applied to a data management apparatus disposed at a specified level of a communication device, the data management method including:
the data management device receives a registration task of intelligent application, wherein the registration task comprises F data tags of data required by the intelligent application, a collection step length S and a return batch B, and F is more than or equal to 1;
acquiring a discrete value of target data corresponding to the data tag, and filling the discrete value into a corresponding time sequence of a global data storage area;
and when the time sequence corresponding to the registration task meets the loopback condition, B multiplied by S numerical values are extracted from the corresponding time sequence according to the registration sequence of the data labels to form F vectors, the F vectors are jointly deformed to obtain a B multiplied by S multiplied by F dimension tensor, and the B multiplied by S multiplied by F dimension tensor is distributed to the corresponding intelligent application task.
Preferably, the registration task further includes a sampling interval, the target data includes continuous data and discrete data, each time series has a series identifier, the series identifier of the time series corresponding to the continuous data is formed by a data tag corresponding to the continuous data and the sampling interval, and the series identifier of the time series corresponding to the discrete data is formed by a data tag corresponding to the discrete data;
the management method comprises the following steps:
when the target data is continuous data, judging whether an existing time sequence of a sequence identifier containing a data tag of the continuous data exists in the global data storage area, wherein the sampling interval of the continuous data is an integral multiple or a submultiple of the sampling interval of the existing time sequence;
if yes, sharing the existing time sequence, wherein the sharing principle is as follows: if the sampling interval of the continuous data is integral multiple of the sampling interval of the existing time sequence, the existing time sequence is directly shared; if the sampling interval of the continuous data is the submultiple of the sampling interval of the existing time sequence, the existing time sequence is shared after the sampling interval of the existing time sequence is adjusted;
if not, adding a time sequence, and using the data label and the sampling interval of the continuous data as the sequence identification of the time sequence.
Preferably, the management method includes:
when the target data is discrete data, judging whether an existing time sequence of a data tag containing the discrete data exists in the global data storage area or not;
if so, sharing the existing time sequence;
if not, adding a time sequence, and using the data label of the discrete data as the sequence identification of the time sequence.
Preferably, the obtaining a discrete value of the target data corresponding to the data tag, and filling the discrete value into a corresponding time sequence of the global data storage area includes:
determining whether the target data is discrete data or continuous data according to a data tag carried in a return result, wherein the return result also carries a corresponding discrete value;
when the target data is discrete data, determining a corresponding time sequence according to the data label, and filling the discrete value and the occurrence time of the discrete value into the time sequence;
when the target data is continuous data, determining a corresponding time sequence according to the data label and the sampling interval, and filling the time sequence with the discrete value and the time stamp of the discrete value.
Preferably, when the target data is continuous data, determining a corresponding time sequence according to the data tag and the sampling interval, and filling the discrete value and the time stamp of the discrete value into the time sequence includes:
when the target data is continuous data, determining a corresponding time sequence according to the data label and the sampling interval;
determining whether a missing value exists in the middle according to the sampling interval and the timestamp of the last discrete value, if the missing value exists, calculating the average value of the last discrete value and the current discrete value, and filling the average value serving as the missing value into a time sequence;
and filling the discrete value and the time stamp of the discrete value into the time sequence.
Preferably, the registration task further includes a sampling interval T; the loopback condition is as follows: for a time series of continuous data, the loopback condition is satisfied when the number of valid data newly added to the time series since the registration task last looped back data is greater than or equal to B×S; for a time series of discrete data, the loopback condition is satisfied when the time elapsed since the registration task last looped back data is greater than or equal to B×S×T.
Preferably, the registration task further includes a sampling interval T, and the data management method further includes:
for each time series, acquiring the B×S×T value of each registration task of the time series, and destroying the data that have already been looped back on the time series after data are returned to the registration task whose B×S×T value is the maximum;
when the target data is continuous data, setting a time span M equal to the maximum B×S×T value over all registration tasks of the time series, taking the loopback time point of the registration task whose B×S×T value is the maximum as the initial time point, subtracting the time span M from the initial time point to obtain a time point A, and destroying all data before the time point A;
when the target data is discrete data, setting the time span M equal to the maximum B×S×T value over all registration tasks of the time series, taking the loopback time point of the registration task whose B×S×T value is the maximum as the initial time point, subtracting the time span M from the initial time point to obtain a time point B, retaining the change node closest to the time point B, and deleting all data before that change node.
Preferably, the communication device includes a network manager, a network element located under the network manager, a service single disk located under the network element, and a service line located under the service single disk, and the network manager, the network element, the service single disk, the service line, and the target data all have unique numbers;
when the data management device is deployed in a network manager, the data label is formed by combining the number of the network element, the number of the service single disk, the number of the service line and the number of the target data;
when the data management device is deployed in a network element, the data label is formed by combining the serial number of the service single disc, the serial number of the service line and the serial number of the target data;
when the data management device is deployed on a service single disk, the data label is formed by combining the serial number of the service line and the serial number of the target data.
Preferably, the management method further comprises:
and when all the registration tasks sharing the same time sequence exit, deleting the time sequence and destroying the cache data of the time sequence.
According to another aspect of the present invention, there is provided a data management apparatus adapted for machine learning, comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions programmed to perform the data management method of the present invention.
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention provides a data management method and a data management device suitable for machine learning, wherein the data management method is applied to the data management device, the data management device is deployed at a specified level of communication equipment, and the data management method comprises the following steps: the data management device receives a registration task of the intelligent application, wherein the registration task comprises F data tags of data required by the intelligent application, a collection step length S and a return batch B, and F is more than or equal to 1; acquiring a discrete value of target data corresponding to the data tag, and filling the discrete value into a corresponding time sequence of the global data storage area; and when the time sequence corresponding to the registration task meets the loopback condition, B multiplied by S numerical values are extracted from the corresponding time sequence according to the registration sequence of the data labels to form F vectors, the F vectors are jointly deformed to obtain a B multiplied by S multiplied by F dimension tensor, and the B multiplied by S multiplied by F dimension tensor is distributed to the corresponding intelligent application task.
In the invention, tensor expression can be uniformly carried out on various types of heterogeneous data, the application range of machine learning applied to communication equipment is expanded, the data is uniformly managed through the global data storage area, a plurality of registration tasks can share the same time sequence, redundant storage and transmission of the data are reduced, and repeated development of a uniform data acquisition interface is reduced. On the other hand, data required by a plurality of registration tasks can be strictly aligned in time, so that the correlation among samples and the time characteristic of the samples are not lost, and the application of an algorithm with strict requirements on the time correlation is ensured.
Drawings
Fig. 1 is a schematic diagram illustrating an interaction flow between an intelligent application and a data management device according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a data management method suitable for machine learning according to an embodiment of the present invention;
fig. 3 is a network hierarchy diagram of a communication device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a time sequence in a global data storage area according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a time sequence in another global data storage area according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of data destruction over a time series of continuous data provided by an embodiment of the present invention;
fig. 7 is a schematic diagram of data destruction on a time series of discrete data provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a data acquisition process provided by an embodiment of the present invention;
FIG. 9 is a schematic flow chart of setting a time sequence according to an embodiment of the present invention;
FIG. 10 is a flow chart of a fill time sequence provided by an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a data management apparatus suitable for machine learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1:
referring to fig. 1 and fig. 2, the present embodiment provides a data management method suitable for machine learning, where the data management method is applied to a data management apparatus, the data management apparatus is deployed at a specified level of a communication device, and the data management method includes the following steps:
step 11: the data management device receives a registration task of the intelligent application, wherein the registration task comprises F data labels, a collection step S and a return batch B of data required by the intelligent application, and F is larger than or equal to 1.
The data required by the intelligent application comprises continuous data and discrete data, wherein the continuous data comprises performance, and the discrete data comprises alarms, states and control instructions.
The data tags are used for locating the generation source of the data, and when the hierarchy of the data management device in the communication equipment is different, the representation mode of the data tags is different.
Specifically, the communication device comprises a network manager, a network element located below the network manager, a service single disc located below the network element, and a service line located below the service single disc, wherein the network manager, the network element, the service single disc, the service line, and the target data all have unique numbers; when the data management device is deployed in a network manager, the data label is formed by combining the number of the network element, the number of the service single disk, the number of the service line and the number of the target data; when the data management device is deployed in a network element, the data label is formed by combining the serial number of the service single disc, the serial number of the service line and the serial number of the target data; when the data management device is deployed on a service single disk, the data label is formed by combining the serial number of the service line and the serial number of the target data.
For example, in fig. 3, if the data management apparatus is deployed on service disk 1, the ID of alarm 1 is 0xAA and the ID of service line 1 on which it occurs is 0xBB, then 0xBBAA serves as the data tag of alarm 1 and locates it accurately within the service disk; if the data management apparatus is deployed in network element 1, the ID 0x02 of service single disk 1 is prepended to the data tag of alarm 1, and the complete data tag 0x02BBAA locates alarm 1 within network element 1. For simplicity, the IDs of the network element, the service single disk, the service line and the alarm are each represented here by one byte; in practice they may be represented by an array or a character string.
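For illustration only, a minimal Python sketch of how such hierarchical data tags could be composed from one-byte IDs is given below; the helper make_tag and the hex-string encoding are assumptions of this sketch, not requirements of the patent.

```python
# Hypothetical sketch of hierarchical data-tag composition. The patent only
# requires that the numbers of the levels above the deployment point are
# prepended to the number of the target data; names and encoding are assumed.

def make_tag(*ids: int) -> str:
    """Concatenate one-byte IDs, highest level first, into a hex tag."""
    return "0x" + "".join(f"{i:02X}" for i in ids)

ALARM_1 = 0xAA   # number of the target data (alarm 1)
LINE_1 = 0xBB    # number of the service line
DISK_1 = 0x02    # number of the service single disk
NE_1 = 0x01      # number of the network element (assumed value)

print(make_tag(LINE_1, ALARM_1))                 # 0xBBAA   (deployed on the single disk)
print(make_tag(DISK_1, LINE_1, ALARM_1))         # 0x02BBAA (deployed on the network element)
print(make_tag(NE_1, DISK_1, LINE_1, ALARM_1))   # 0x0102BBAA (deployed on the network manager)
```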
The acquisition step S indicates how many consecutive acquisitions form one sample: if only the current state of the data is of interest, S is 1; if the temporal characteristics of data changes are of interest, S > 1. The return batch B indicates how many samples are returned at a time.
Step 12: if the time sequence which accords with the target data corresponding to the data label exists in the global data storage area, sharing the existing time sequence; and if the time sequence which accords with the target data corresponding to the data label does not exist in the global data storage area, adding a time sequence.
The data management apparatus comprises a global data storage area. All data required by the intelligent applications registered with the data management apparatus are stored uniformly in the global data storage area, in which a number of time series are established; data with the same data tag can share a time series, which on the one hand guarantees temporal continuity and on the other hand achieves data sharing, avoiding repeated data acquisition and redundant storage and reducing development workload.
In this embodiment, the registration task further includes a sampling interval T, the target data includes continuous data and discrete data, each of the time series has a sequence identifier, the sequence identifier of the time series corresponding to the continuous data is formed by a data tag corresponding to the continuous data and the sampling interval, and the sequence identifier of the time series corresponding to the discrete data is formed by a data tag corresponding to the discrete data.
Step 12 specifically includes: when the target data is continuous data, judging whether the global data storage area contains an existing time series whose sequence identifier contains the data tag of the continuous data and whose sampling interval is such that the sampling interval of the continuous data is an integral multiple or a divisor of it. If so, the existing time series is shared, according to the following principle: if the sampling interval of the continuous data is an integral multiple of the sampling interval of the existing time series, the existing time series is shared directly; if the sampling interval of the continuous data is a divisor of the sampling interval of the existing time series, the sampling interval of the existing time series is first adjusted, after which the existing time series and the corresponding sequence identifier are shared. If not, a new time series is added, and the data tag and the sampling interval of the continuous data are used as its sequence identifier.
When the target data is discrete data, judging whether an existing time sequence of a data tag containing the discrete data exists in the global data storage area or not; if so, sharing the existing time sequence; if not, adding a time sequence, and using the data label of the discrete data as the sequence identification of the time sequence.
With reference to fig. 4, each time series is identified by a sequence identifier. The sequence identifier of a time series obtained by discrete sampling of continuous data is composed of "data tag + sampling interval"; for example, the sequence identifier of the time series of a certain performance item is "line number + performance code + sampling interval" (as shown in fig. 4: line 1 performance 1 sampling 1, line 1 performance 1 sampling 2, line 2 performance 1 sampling 1), which identifies the performance item globally, and if the sampling intervals differ, the same performance item is represented by two sequence identifiers (as shown in fig. 4: line 1 performance 1 sampling 1 and line 1 performance 1 sampling 2). The sequence identifier of a time series corresponding to discrete data is simply the "data tag"; for example, the sequence identifier of the time series of a certain alarm is "line number + alarm code" (as shown in fig. 4: line 1 alarm 1, line 1 alarm 2).
There are two kinds of data in the network device: continuous data-Performance; discrete data-alarms, status, control commands. Two kinds of data storage will be described separately below.
Storage of continuous data: continuous data are discretely sampled to obtain a discrete time series. To reduce redundant storage and acquisition, a newly added data tag and sampling interval are handled in one of the following three cases. For time series with the same data tag: if the newly added sampling interval is a multiple of some existing sampling interval, the existing time series can be shared; for example, in fig. 4, assuming line 1 performance 1 sampling 1 is the existing time series, line 1 performance 1 sampling 2 can share that time series directly. If the newly added sampling interval is a divisor of some existing sampling interval, the sampling interval of that time series is adjusted first and the series is then shared; for example, in fig. 4, assuming line 1 performance 1 sampling 2 is an existing time series, after its sampling interval is adjusted to 1, line 1 performance 1 sampling 1 shares that time series. If the newly added sampling interval is neither a multiple nor a divisor of any existing interval, a new time series is created for the new acquisition.
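A non-authoritative sketch of this sharing rule, assuming intervals expressed as integer multiples of the unit time and a helper named try_share, is shown below.

```python
# Sketch of the time-series sharing rule for continuous data (names assumed).
# A series with the same data tag can be reused when the new sampling interval
# is an integer multiple of the stored one; when it is a divisor, the stored
# interval is first tightened to the smaller value and then shared.

def try_share(existing_interval: int, new_interval: int):
    """Return (shared, resulting_interval) for a series with the same data tag."""
    if new_interval % existing_interval == 0:
        return True, existing_interval          # coarser request: reuse directly
    if existing_interval % new_interval == 0:
        return True, new_interval               # finer request: adjust, then share
    return False, new_interval                  # neither: create a new series

print(try_share(1, 2))   # (True, 1)  e.g. "sampling 2" reuses an interval-1 series
print(try_share(2, 1))   # (True, 1)  the stored interval is adjusted from 2 to 1
print(try_share(2, 3))   # (False, 3) a new time series is created
```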
Storage of discrete data: given the characteristics of communication equipment, discrete data change very infrequently, so all changes of the target data corresponding to one data tag can be fully represented by a single time series; each value change is actively reported and recorded as it occurs, there is no need to maintain multiple time series per sampling interval, and all tasks that need the same data tag share the same time series.
In an actual application scenario, when all the registration tasks shared in the same time sequence exit, the time sequence is deleted and the cache data of the time sequence is destroyed. Specifically, after the registered task exits, all time sequences in the global data storage area are searched according to the registered time sequences, if no task is used, the time sequences are deleted, and the time sequence data cache is destroyed.
After the time series are established in the foregoing manner, data are acquired as follows in step 13 and filled into the corresponding time series.
Step 13: and acquiring a discrete value of the target data corresponding to the data label, and filling the discrete value into a corresponding time sequence of the global data storage area.
In this embodiment, whether the target data is discrete data or continuous data is determined according to a data tag carried in a returned result, where the returned result carries a discrete value; when the target data is discrete data, determining a corresponding time sequence according to the data label, and filling the discrete value and the occurrence time of the discrete value into the time sequence; when the target data is continuous data, determining a corresponding time sequence according to the data label and the sampling interval, and filling the time sequence with the discrete value and the time stamp of the discrete value. In a preferred embodiment, in order to avoid data loss, when the target data is continuous data, determining a corresponding time sequence according to the data label and the sampling interval; determining whether a missing value exists in the middle according to the sampling interval and the timestamp of the last discrete value, if the missing value exists, calculating the average value of the last discrete value and the current discrete value, and filling the average value serving as the missing value into a time sequence; and filling the discrete value and the time stamp of the discrete value into the time sequence.
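A minimal sketch of this filling step for continuous data, assuming integer timestamps and the class name ContinuousSeries, is given below.

```python
# Sketch of filling a continuous-data time series, including the mean-filled
# missing values described above (class and field names are assumptions).

class ContinuousSeries:
    def __init__(self, interval: int):
        self.interval = interval      # sampling interval T
        self.values = []              # list of (timestamp, value)

    def fill(self, timestamp: int, value: float):
        if self.values:
            last_ts, last_val = self.values[-1]
            expected = last_ts + self.interval
            # One or more sampling points were skipped: fill each missing point
            # with the average of the last stored value and the current value.
            while expected < timestamp:
                self.values.append((expected, (last_val + value) / 2))
                expected += self.interval
        self.values.append((timestamp, value))

s = ContinuousSeries(interval=2)
s.fill(0, 4.0)
s.fill(6, 8.0)        # the samples at t=2 and t=4 were lost
print(s.values)       # [(0, 4.0), (2, 6.0), (4, 6.0), (6, 8.0)]
```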
Selecting a corresponding data acquisition protocol according to a deployment mode of the data management device, and acquiring data through a single-disk management application program if the data management device is deployed on a business single disk; if the data management device is deployed on the equipment network element and the data comes from the single-disk application, acquiring the data through a data communication protocol between the network element and the service single disk; if the data management device is deployed on the network manager and the data comes from each network element device, the data is acquired through a data communication protocol between the network manager and the network element device. After the data is acquired, the returned results are populated onto the time series in the global data store.
In an actual application scene, for continuous data, sending a data acquisition command at sampling interval points to obtain corresponding discrete values, and filling the discrete values into a time sequence; for discrete data, the time series is filled according to the actively reported discrete values.
Step 14: and when the time sequence corresponding to the registration task meets the loopback condition, B multiplied by S numerical values are extracted from the corresponding time sequence according to the registration sequence of the data labels to form F vectors, the F vectors are jointly deformed to obtain a B multiplied by S multiplied by F dimension tensor, and the B multiplied by S multiplied by F dimension tensor is distributed to the corresponding intelligent application task.
In this embodiment, when the time series corresponding to a certain registration task satisfies the loopback condition, B × S values are extracted from the corresponding time series according to the registration order of the data tags to form F vectors, the F vectors are jointly transformed to obtain a B × S × F tensor, and the B × S × F tensor is distributed to the corresponding intelligent application task.
Wherein the registration task further comprises a sampling interval T; the loopback condition is as follows: for a time series of continuous data, the loopback condition is satisfied when the number of valid data newly added to the time series since the registration task last looped back data is greater than or equal to B×S; for a time series of discrete data, the loopback condition is satisfied when the time elapsed since the registration task last looped back data is greater than or equal to B×S×T.
Judging whether a time series of continuous data satisfies the loopback condition: taking fig. 5 as an example, suppose an intelligent application registers a data requirement of B = 2, S = 2 and T = 2 on performance A. If the data were last looped back at the 1st time point, four further values must be collected before the loopback condition is satisfied; at this sampling interval these are the values at the 3rd, 5th, 7th and 9th time points (the values at the 2nd, 4th, 6th and 8th time points are not needed). The loopback condition is satisfied once the number of valid data newly added to the time series since the registration task last looped back data is greater than or equal to 4, that is, the time series of performance A satisfies the loopback condition once the 9th value has been collected.
Judging whether a time series of discrete data satisfies the loopback condition: the condition is that the time elapsed since the task last looped back data is greater than or equal to B×S×T. Taking fig. 5 as an example, suppose an intelligent application registers a data requirement of B = 2, S = 2 and T = 2 on alarm A, and the task records that data were last looped back at time point 1. When the elapsed time on the time series of alarm A is greater than or equal to 8, that is, when the difference between the current timestamp and the timestamp of time point 1 is greater than or equal to 8, the time series satisfies the condition for returning data to the task.
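The two conditions can be summarised in a short sketch (function names and parameters are assumptions made for illustration):

```python
# Sketch of the loopback conditions for continuous and discrete time series.

def continuous_ready(new_valid_count: int, B: int, S: int) -> bool:
    """Continuous data: enough new valid values since the task's last loopback."""
    return new_valid_count >= B * S

def discrete_ready(now: int, last_loopback_ts: int, B: int, S: int, T: int) -> bool:
    """Discrete data: enough time elapsed since the task's last loopback."""
    return now - last_loopback_ts >= B * S * T

# The examples above: B = 2, S = 2, T = 2.
print(continuous_ready(new_valid_count=4, B=2, S=2))              # True
print(discrete_ready(now=9, last_loopback_ts=1, B=2, S=2, T=2))   # True (9 - 1 >= 8)
```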
In an actual application scenario, when a certain registration task meets a loopback condition, on a time sequence corresponding to a registered data label, a data tensor is obtained by cutting according to a sampling interval, a collection step length and a return batch, and the registration task is informed.
In connection with fig. 5, intelligent application 1 registers 3 data tags: performance A, alarm A and state A, with sampling interval T = 1, acquisition step S = 3 and return batch B = 2. For the first loopback, the data of performance A, alarm A and state A at time points 1-6 are cut and combined to obtain a 2 × 3 × 3 tensor, as shown in the following table.
Performance A | Alarm A | State A
6.2           | 1       | 0
10            | 1       | 0
3.1           | 1       | 1
4             | 1       | -1
7             | 0       | -1
11            | 0       | -1
For another example, intelligent application 2 registers 2 data tags: performance A and state A, with sampling interval T = 3, acquisition step S = 3 and return batch B = 1. For the first loopback, the data at time points 1, 4 and 7 are cut and combined to obtain a 1 × 3 × 2 tensor, as shown in the following table.
Performance A | State A
6             | 0
4             | -1
12            | 0
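For illustration, the cut-and-combine step for the first example above can be reproduced with NumPy (the patent does not prescribe any particular library; the reshape below is simply one way to realise the joint deformation into a B×S×F tensor):

```python
# B*S values are taken for each of the F registered data tags, the F vectors
# are stacked, and the stack is reshaped into a B x S x F tensor.
import numpy as np

B, S = 2, 3
performance_a = [6.2, 10, 3.1, 4, 7, 11]   # values at time points 1-6
alarm_a       = [1, 1, 1, 1, 0, 0]
state_a       = [0, 0, 1, -1, -1, -1]

vectors = [performance_a, alarm_a, state_a]            # F = 3 vectors of length B*S
F = len(vectors)
tensor = np.stack(vectors, axis=-1).reshape(B, S, F)   # shape (2, 3, 3)
print(tensor.shape)   # (2, 3, 3)
print(tensor[0])      # first sample: the first three rows of the table above
```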
In a preferred embodiment, in order to save memory, cached data that are no longer used are deleted: for each time series, the B×S×T value of each registration task of the time series is obtained, and after data are returned to the registration task whose B×S×T is the maximum, the data that have already been looped back on the time series are destroyed. Each time series may serve several different registration tasks whose B×S×T values differ, so the data loopback time point of the registration task whose B×S×T is the maximum is used as the time base point for data destruction when deleting cached data that are no longer used. The destruction procedure differs between time series of continuous data and time series of discrete data, as follows:
referring to fig. 6, when the target data is continuous data, the time span M is set to be equal to the maximum value of B × S × T of all the registration tasks in the time series, the loop-back time point of the "registration task with B × S × T as the maximum value" is set as the start time point, the time span M is subtracted from the start time point to obtain a time point a, and all the data before the time point a is destroyed. Data within M may need to be cut and combined to be distributed to the data task to be returned, and data before M may determine that no task is needed and may be destroyed.
Referring to fig. 7, when the target data is discrete data, the time span M is set equal to the maximum B×S×T value over all registration tasks of the time series, the loopback time point of the registration task whose B×S×T is the maximum is taken as the start time point, the time span M is subtracted from the start time point to obtain time point B, the change node closest to time point B is retained, and all data before that change node are deleted. Data within M may still need to be cut, combined and distributed to tasks awaiting loopback; and because a discrete time series only records a value when it changes, and the values at all time points within M must be known, the latest data change before M cannot be destroyed.
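A rough sketch of both destruction rules, with assumed function names and integer timestamps, follows.

```python
# Sketch of cache destruction. M is the largest B*S*T over all registration
# tasks sharing the series, counted back from that task's loopback time point.

def destroy_continuous(values, loopback_ts: int, M: int):
    """Continuous data: keep only samples at or after time point A = loopback_ts - M."""
    point_a = loopback_ts - M
    return [(ts, v) for ts, v in values if ts >= point_a]

def destroy_discrete(changes, loopback_ts: int, M: int):
    """Discrete data: also keep the last change before point B, so every
    instant within M still has a known value."""
    point_b = loopback_ts - M
    older = [c for c in changes if c[0] < point_b]
    kept = [older[-1]] if older else []
    return kept + [c for c in changes if c[0] >= point_b]

values = [(t, float(t)) for t in range(0, 20, 2)]
print(destroy_continuous(values, loopback_ts=18, M=8))   # keeps t = 10 .. 18
changes = [(1, "up"), (5, "down"), (14, "up")]
print(destroy_discrete(changes, loopback_ts=18, M=8))    # [(5, 'down'), (14, 'up')]
```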
In the embodiment, tensor expression can be uniformly performed on various types of heterogeneous data, the application range of machine learning applied to communication equipment is expanded, the data are uniformly managed through the global data storage area, a plurality of registration tasks can share the same time sequence, data redundancy storage and transmission are reduced, and repeated development of a uniform data acquisition interface is reduced. On the other hand, data required by a plurality of registration tasks can be strictly aligned in time, so that the correlation among samples and the time characteristic of the samples are not lost, and the application of an algorithm with strict requirements on the time correlation is ensured.
Example 2:
With reference to fig. 1, the present embodiment provides a data management apparatus comprising a management unit, a data acquisition unit, a data distribution unit and a global data storage area, wherein the management unit is responsible for global data storage and schedules the running of the data acquisition unit and the data distribution unit. Further, an acquisition timing task, a data distribution task and a reception task are created in the data management apparatus.
Specifically, an acquisition timing task with an operation period of one time unit is created; the sampling interval of every time series is equal to, or an integral multiple of, this time unit. The acquisition timing task traverses all time series at each tick, initiates an acquisition command for every series whose sampling time point has arrived, and fills the result of the acquisition command into the corresponding time series.
The data distribution task cyclically checks all registration tasks and notifies the data distribution unit to distribute data to the registration tasks that satisfy the distribution (loopback) condition. It also checks the time series whose data have been distributed to determine whether the data destruction condition is met, and destroys the data if so.
The receiving task is used for receiving the registration task of the intelligent application and transmitting the registration task to the management unit, the management unit analyzes the registration task to obtain F data tags, sampling intervals T, acquisition step length S and returned batches B of data required by the intelligent application, a time sequence is established in the global data storage area according to the data tags and the sampling intervals, sequence identifications of the time sequences of the data required by all the registration tasks are recorded, and the corresponding time sequence of the global storage is destroyed when the task exits.
In an actual application scenario, the data management apparatus of this embodiment supports multiple deployment modes:
(1) Deployment on the network manager: the data management apparatus serves as a module of the model training or inference platform, and the platform can initiate data acquisition on all network elements in the network. The data management apparatus provides an application programming interface (API) to the model training and inference platform in the form of a dynamic library for it to call; the main API of the dynamic library is the data-acquisition registration interface. After the data management apparatus completes data acquisition it invokes a callback, and the data are returned to the intelligent application through the callback parameters; the data acquisition unit completes acquisition through the communication protocol between the network manager and the devices.
(2) Deployment on a network element: network-element deployment is usually used only for inference. An intelligent computing single disk is introduced on the network element; this computing disk is a single disk dedicated to model inference on the network element, and it collects data from all other service disks in the network element for analysis and inference. The data management apparatus provides an API for the platform to call in the form of a dynamic library. Compared with deployment on the network manager, a data tag can be located within the device without a network element ID; the main API of the dynamic library is the same as in the network-manager deployment, and the data acquisition unit completes acquisition through the single-disk data communication protocol.
(3) Deployment on a service single disk: the data management apparatus serves as a module of the model inference platform on the service disk and implements data acquisition and management within the disk. Compared with deployment on a network element, a data tag can be located within the single disk without a single-disk ID. In-disk data acquisition is completed through the data communication protocol of the single-disk management application; the main API of the dynamic library is the same as in the network-manager deployment.
(4) The intelligent application tasks run on the network manager, and the registration and data loopback protocols between the network manager and the data management apparatus are defined. This scheme differs from schemes (2) and (3) in the way data are registered and looped back: in schemes (2) and (3) the data management apparatus is a sub-module of the platform, the platform registers data requirements for each intelligent application directly through the API, and data are looped back through the callback function registered by the application; in scheme (4) the data requirements of the model training or inference platform on the network manager are registered with the daemon process of the data management apparatus through a network communication protocol, and data loopback is likewise reported through the network protocol.
The data requirement registration protocol comprises the data tags, the sampling interval, the acquisition step, the loopback batch and the unique number of the intelligent application to which the data requirement belongs on the network manager; the data loopback protocol comprises a B×S×F array together with the unique number of that intelligent application, so that the network manager can attribute the looped-back data to the correct intelligent application.
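The message contents just listed might be modelled as follows; every field name here is illustrative only, since the patent defines the protocol contents but not their concrete encoding.

```python
# Assumed message layouts for the registration and loopback protocols of
# deployment mode (4).
from dataclasses import dataclass
from typing import List

@dataclass
class RegistrationRequest:
    app_id: str             # unique number of the intelligent application on the network manager
    data_tags: List[str]    # F data tags, in registration order
    sampling_interval: int  # T
    step: int               # acquisition step S
    batch: int              # loopback batch B

@dataclass
class LoopbackMessage:
    app_id: str                       # lets the network manager attribute the data
    tensor: List[List[List[float]]]   # B x S x F array

req = RegistrationRequest("app-7", ["0xBB01", "0xBBAA"], sampling_interval=2, step=2, batch=1)
print(req)
```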
The timeliness of the data improves progressively from deployment mode (1) to mode (3). The three modes differ mainly in the level at which data are collected: in fig. 3, mode (1) corresponds to the network management level, mode (2) to the network element level, and mode (3) to the single-disk level, and the underlying data acquisition protocols differ between these levels. Since the computing power of ordinary network elements and single disks is limited, they can generally perform only inference, while training is done on the network management server. Deployment mode (4) exploits the real-time nature of the equipment to obtain data with good timeliness and then performs training on the network manager; it can be regarded as a variant that absorbs the advantages of modes (1), (2) and (3). The scheme can be adapted flexibly according to requirements and is not limited to the above four embodiments.
The implementation of the data management method is explained below, taking as an example a data management apparatus deployed in the network device according to deployment mode (3), i.e. on a service single disk:
(I) Establishing the data acquisition mechanism:
after the data management device receives the registration task of the intelligent application, the discrete value of the corresponding data is obtained according to the data tag and the sampling interval, referring to fig. 8, the specific implementation process is as follows:
S101, issuing a designated data tag through the data query protocol of the single-disk management process; the single-disk management process reads the discrete value of the data corresponding to the data tag and returns the acquired discrete value and sampling interval.
The sampling interval is also taken as a parameter to be issued together, and in a return result, the sampling interval is filled in a protocol so as to jointly position a time sequence of returned data with a data label; the return of the query result proceeds to S401 processing.
S201, issuing a data label needing to be actively reported through a data label subscription protocol of a single-disk management process, and actively reporting the value change of the subscribed data label.
The data tags that need to be actively reported are data tags of discrete data, and the processing of S401 is performed after the actively reported values are received.
In this embodiment, the data acquisition unit establishes communication with a single-disk management process of a service single disk, and realizes inter-process communication between the data acquisition unit and the single-disk management process in the same operating system, and realizes query and report of all data (including performance, alarm, state, and the like) through the single-disk management process, and also realizes active report of discrete data such as alarm, state, control commands, and the like.
(II) setting a time sequence:
after receiving the registration task of the intelligent application, a time sequence is further set for data corresponding to each data tag in the global data storage area according to the data tag and the sampling interval, so as to perform global data storage and maintenance, and with reference to fig. 9, the specific process is as follows (corresponding to step 12 of embodiment 1):
S301, judging whether the data corresponding to the data tag is discrete data or continuous data according to the data tag; when the data is discrete data, a time series is set according to S302-S304; when the data is continuous data, a time series is set according to S305-S307.
S302, if the data is discrete data, searching time sequences of all the discrete data, and determining whether the time sequences corresponding to the data labels exist.
And S303, if the time sequence exists, sharing the time sequence, and recording a registration task sharing the time sequence.
And S304, if the time sequence of the discrete data does not exist, adding a new time sequence of the discrete data for the data label, acquiring an initial value of the discrete data corresponding to the data label through the S101 process, after the initial value is acquired, subscribing the data corresponding to the time sequence through the S201 process, actively reporting the data when the data change, and filling the acquired value in the corresponding time sequence.
S305, if the data is continuous data, searching an existing time sequence matched with the data label of the continuous data, if the sampling interval of the continuous data is a multiple or a divisor of the sampling interval of the existing time sequence, sharing the time sequence, and entering S306, and if no matched data label exists or the sampling interval does not meet the sharing, entering S307.
S306, if the sampling interval of the continuous data is a multiple or a submultiple of the sampling interval of the existing time sequence, sharing the time sequence, modifying the time sequence needing to change the sampling interval, and recording the registration task sharing the time sequence.
S307, if no matched data label exists or the sampling interval does not meet the sharing requirement, a time sequence of the continuous data is newly established, the identification of the time sequence is represented by the new data label plus the sampling interval, data is acquired through S101, and the acquired value is filled in the corresponding time sequence.
And (III) filling the acquired data into a corresponding time sequence:
After the time series are established, the corresponding discrete values are obtained through the foregoing steps S101 and S201, and the discrete values are then filled into the corresponding time series according to the following procedure; referring to fig. 10, the specific implementation process is as follows (corresponding to step 13 of embodiment 1):
S401, acquiring the data tag in a return result of the data acquisition unit, and determining whether the data to which the return result belongs is discrete data or continuous data according to the data tag; the return result also carries the corresponding discrete value.
S402, if the data to which the returned result belongs is discrete data, filling the discrete value and the occurrence time of the discrete value into a corresponding time sequence according to the data label.
And S403, if the data to which the returned result belongs is continuous data, acquiring the sampling interval carried in the returned result, and determining a corresponding time sequence according to the data label and the sampling interval.
S404, determining whether a missing value exists in the middle according to the sampling interval and the timestamp of the last discrete value, if the missing value exists, calculating the average value of the last discrete value and the current discrete value, and filling the average value serving as the missing value into the time sequence. This ensures that the data remains aligned in time in the event of a loss of result.
And S405, filling the discrete value and the time stamp of the discrete value into the time sequence.
(IV) data distribution:
For each registration task, a B×S×F tensor is cut from the global data storage area according to its data tags, as follows. First, it is judged whether all the time series corresponding to the task's registration satisfy the loopback condition. If so, the data tensor is cut from the global data storage area in the order of the registered data tags. The cutting method differs between time series of continuous data and time series of discrete data: the former takes B×S values starting from the position of the last loopback, while the latter takes B×S values from the time series starting from the timestamp of the last loopback, stepping by the sampling interval. The F vectors of length B×S are then combined and reshaped to obtain the B×S×F tensor.
The foregoing mainly describes the data processing flow of the data management apparatus of this embodiment. How the units cooperate to carry out the data management method is briefly described here. First, the sampling timer task issues acquisition commands for the time series of all continuous data; the execution period of the timer is one time unit, and all sampling intervals are integral multiples of that unit. The specific process is: traverse all time series of continuous data and, whenever a series' sampling time point has been reached, issue a sampling command through the S101 procedure; for time series of discrete data, all values other than the initial value are actively reported through the S201 procedure.
After a return result based on a sampling command or an actively reported result is received, processing enters S401 for data storage.
Data distribution to all intelligent application tasks is handled by the distribution timer task, which can be merged with the sampling timer task; the execution period of the timer is one time unit, and its execution flow is as follows: traverse all registration tasks and distribute data to every task that satisfies the conditions.
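A condensed sketch of how the two timer tasks might share one unit-time tick is shown below; the data structures and the ready/distribute callbacks are assumptions of this sketch.

```python
# Simplified unit-time timer driving both sampling and distribution.

def timer_tick(tick, series_list, tasks):
    # Sampling timer task: issue an acquisition command for every continuous
    # series whose sampling time point has arrived (stands in for S101).
    for s in series_list:
        if tick % s["interval"] == 0:
            s["values"].append((tick, "collected"))
    # Distribution timer task: distribute data to every registration task
    # whose loopback condition is satisfied.
    for t in tasks:
        if t["ready"](tick):
            t["distribute"](tick)

series_list = [{"interval": 1, "values": []}, {"interval": 3, "values": []}]
tasks = [{"ready": lambda tick: tick % 4 == 0,
          "distribute": lambda tick: print("loopback at tick", tick)}]
for tick in range(1, 7):       # the real task runs once per time unit
    timer_tick(tick, series_list, tasks)
print(len(series_list[1]["values"]))   # 2 (ticks 3 and 6)
```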
Example 3:
referring to fig. 11, fig. 11 is a schematic structural diagram of a data management apparatus suitable for machine learning according to an embodiment of the present invention. The data management apparatus of the present embodiment includes one or more processors 41 and a memory 42. In fig. 11, one processor 41 is taken as an example.
The processor 41 and the memory 42 may be connected by a bus or other means, and fig. 11 illustrates the connection by a bus as an example.
The memory 42, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions corresponding to the methods of the above embodiments. The processor 41 executes the non-volatile software programs, instructions and modules stored in the memory 42, thereby performing the various functional applications and data processing that implement the methods of the foregoing embodiments.
The memory 42 may include, among other things, high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 42 may optionally include memory located remotely from processor 41, which may be connected to processor 41 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It should be noted that, for the information interaction, execution process and other contents between the modules and units in the apparatus and system, the specific contents may refer to the description in the embodiment of the method of the present invention because the same concept is used as the embodiment of the processing method of the present invention, and are not described herein again.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the embodiments may be implemented by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A data management method suitable for machine learning, the data management method being applied to a data management apparatus deployed at a specified level of a communication device, the data management method comprising:
the data management apparatus receives a registration task of an intelligent application, wherein the registration task comprises F data tags of the data required by the intelligent application, a collection step length S and a return batch B, F being greater than or equal to 1;
acquiring a discrete value of target data corresponding to the data tag, and filling the discrete value into a corresponding time sequence of a global data storage area;
and when the time sequence corresponding to the registration task meets the loopback condition, extracting B×S numerical values from the corresponding time sequence in the registration order of the data tags to form F vectors, combining and reshaping the F vectors to obtain a B×S×F tensor, and distributing the B×S×F tensor to the corresponding intelligent application task.
2. The data management method according to claim 1, wherein the registration task further includes a sampling interval, the target data includes continuous data and discrete data, and each time sequence has a sequence identifier, the sequence identifier of the time sequence corresponding to continuous data being formed by the data tag of the continuous data and the sampling interval, and the sequence identifier of the time sequence corresponding to discrete data being formed by the data tag of the discrete data;
the management method comprises the following steps:
when the target data is continuous data, judging whether an existing time sequence whose sequence identifier contains the data tag of the continuous data exists in the global data storage area, wherein the sampling interval of the continuous data is an integral multiple or a submultiple of the sampling interval of the existing time sequence;
if yes, sharing the existing time sequence, wherein the sharing principle is as follows: if the sampling interval of the continuous data is integral multiple of the sampling interval of the existing time sequence, the existing time sequence is directly shared; if the sampling interval of the continuous data is the submultiple of the sampling interval of the existing time sequence, the existing time sequence is shared after the sampling interval of the existing time sequence is adjusted;
if not, adding a time sequence, and using the data tag and the sampling interval of the continuous data as the sequence identifier of the time sequence.
3. The data management method according to claim 2, wherein the management method comprises:
when the target data is discrete data, judging whether an existing time sequence containing the data tag of the discrete data exists in the global data storage area;
if so, sharing the existing time sequence;
if not, adding a time sequence, and using the data tag of the discrete data as the sequence identifier of the time sequence.
4. The data management method according to claim 3, wherein the obtaining of the discrete value of the target data corresponding to the data tag and the filling of the discrete value into the corresponding time sequence of the global data storage area comprises:
determining whether the target data is discrete data or continuous data according to a data tag carried in a return result, wherein the return result also carries a corresponding discrete value;
when the target data is discrete data, determining a corresponding time sequence according to the data tag, and filling the discrete value and the occurrence time of the discrete value into the time sequence;
when the target data is continuous data, determining a corresponding time sequence according to the data tag and the sampling interval, and filling the discrete value and the timestamp of the discrete value into the time sequence.
5. The data management method of claim 4, wherein when the target data is continuous data, determining a corresponding time sequence according to the data tag and the sampling interval, and filling the discrete value and the timestamp of the discrete value into the time sequence comprises:
when the target data is continuous data, determining a corresponding time sequence according to the data tag and the sampling interval;
determining whether a missing value exists between the last discrete value and the current discrete value according to the sampling interval and the timestamp of the last discrete value; if a missing value exists, calculating the average of the last discrete value and the current discrete value, and filling the average into the time sequence as the missing value;
and filling the discrete value and the time stamp of the discrete value into the time sequence.
6. The data management method of claim 1, wherein the registration task further comprises a sampling interval T; the loopback condition is as follows: for a time sequence of continuous data, the loopback condition is satisfied when the number of valid data newly added to the time sequence after the registration task last looped back data is greater than or equal to B×S; for a time sequence of discrete data, the loopback condition is satisfied when the time elapsed after the registration task last looped back data is greater than or equal to B×S×T.
7. The data management method of claim 1, wherein the registration task further comprises a sampling interval T, the data management method further comprising:
for each time sequence, acquiring the B×S×T value of each registration task of the time sequence, and, after returning data to the registration task with the maximum B×S×T value, destroying the data in the time sequence that have already been looped back;
when the target data is continuous data, setting a time span M equal to the maximum B×S×T value among all the registration tasks of the time sequence, taking the loopback time point of the registration task with the maximum B×S×T value as an initial time point, subtracting the time span M from the initial time point to obtain a time point A, and destroying all data before the time point A;
when the target data is discrete data, setting the time span M equal to the maximum B×S×T value among all the registration tasks of the time sequence, taking the loopback time point of the registration task with the maximum B×S×T value as the initial time point, subtracting the time span M from the initial time point to obtain a time point B, retaining the change node closest to the time point B, and deleting all data before that change node.
8. The data management method according to claim 1, wherein the communication device comprises a network manager, a network element located under the network manager, a service single disk located under the network element, and a service line located under the service single disk, and the network manager, the network element, the service single disk, the service line, and the target data all have unique numbers;
when the data management apparatus is deployed in the network manager, the data tag is formed by combining the number of the network element, the number of the service single disk, the number of the service line and the number of the target data;
when the data management apparatus is deployed in a network element, the data tag is formed by combining the number of the service single disk, the number of the service line and the number of the target data;
when the data management apparatus is deployed on a service single disk, the data tag is formed by combining the number of the service line and the number of the target data.
9. The data management method of claim 1, wherein the management method further comprises:
and when all the registration tasks sharing the same time sequence exit, deleting the time sequence and destroying the cache data of the time sequence.
10. A data management apparatus suitable for machine learning, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being programmed to perform the data management method of any one of claims 1 to 9.
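As a non-limiting illustration of the sharing principle recited in claim 2 (integral multiple: share the existing time sequence directly; submultiple: adjust the sampling interval of the existing time sequence, then share it), the following sketch uses the hypothetical helper names find_series_by_tag, add_series and adjust_interval:

```python
def register_continuous_series(storage, tag, interval):
    """Illustrative sketch only: obtain the time sequence for a continuous data tag."""
    existing = storage.find_series_by_tag(tag)   # None if no sequence has this tag
    if existing is not None and interval % existing.sampling_interval == 0:
        # The new sampling interval is an integral multiple of the existing one:
        # the denser existing sequence already covers every needed sampling point.
        return existing
    if existing is not None and existing.sampling_interval % interval == 0:
        # The new sampling interval is a submultiple: tighten the sampling
        # interval of the existing sequence first, then share it.
        existing.adjust_interval(interval)
        return existing
    # No suitable sequence: create a new one identified by (tag, interval).
    return storage.add_series(tag, interval)
```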
CN202110088033.6A 2021-01-22 2021-01-22 Data management method and data management device suitable for machine learning Active CN112819176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110088033.6A CN112819176B (en) 2021-01-22 2021-01-22 Data management method and data management device suitable for machine learning

Publications (2)

Publication Number Publication Date
CN112819176A true CN112819176A (en) 2021-05-18
CN112819176B CN112819176B (en) 2022-11-08

Family

ID=75858842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110088033.6A Active CN112819176B (en) 2021-01-22 2021-01-22 Data management method and data management device suitable for machine learning

Country Status (1)

Country Link
CN (1) CN112819176B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101808007A (en) * 2010-02-22 2010-08-18 烽火通信科技股份有限公司 Method for acquiring real-time performance data by network manager
CN106910199A (en) * 2017-01-23 2017-06-30 北京理工大学 Towards the car networking mass-rent method of city space information gathering
CN108762768A (en) * 2018-05-17 2018-11-06 烽火通信科技股份有限公司 Network Intelligent Service dispositions method and system
US20190130257A1 (en) * 2017-10-27 2019-05-02 Sentient Technologies (Barbados) Limited Beyond Shared Hierarchies: Deep Multitask Learning Through Soft Layer Ordering
CN109816008A (en) * 2019-01-20 2019-05-28 北京工业大学 A kind of astronomical big data light curve predicting abnormality method based on shot and long term memory network
CN110058922A (en) * 2019-03-19 2019-07-26 华为技术有限公司 A kind of method, apparatus of the metadata of extraction machine learning tasks
US20190318726A1 (en) * 2018-04-13 2019-10-17 Adobe Inc. Real-time speaker-dependent neural vocoder
CN110851654A (en) * 2019-09-10 2020-02-28 南京邮电大学 Industrial equipment fault detection and classification method based on tensor data dimension reduction
CN110956543A (en) * 2019-11-06 2020-04-03 上海应用技术大学 Method for detecting abnormal transaction
US20200327380A1 (en) * 2019-04-09 2020-10-15 Hitachi, Ltd. Object recognition system and object recognition method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113419885A (en) * 2021-06-18 2021-09-21 杭州海康威视数字技术股份有限公司 Data integrity processing method and device and electronic equipment
CN113419885B (en) * 2021-06-18 2023-05-26 杭州海康威视数字技术股份有限公司 Data integrity processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN112819176B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN109120461B (en) A kind of service feature end-to-end monitoring method, system and device
CN104731690A (en) Adaptive metric collection, storage, and alert thresholds
CN112732227B (en) Workflow engine and configuration method and device thereof
CN102202087A (en) Method for identifying storage equipment and system thereof
CN110971439A (en) Policy decision method and device, system, storage medium, policy decision unit and cluster
CN110795264A (en) Monitoring management method and system and intelligent management terminal
CN112819176B (en) Data management method and data management device suitable for machine learning
CN113240139B (en) Alarm cause and effect evaluation method, fault root cause positioning method and electronic equipment
CN117376092A (en) Fault root cause positioning method, device, equipment and storage medium
JPH0535484A (en) Fault diagnostic method
CN110688275A (en) Buried point management method, buried point updating method and buried point management system
Zaki et al. Extracting accurate performance indicators from execution logs using process models
CN113010385B (en) Task state updating method, device, equipment and medium
CN112422349B (en) Network management system, method, equipment and medium for NFV
CN115396287A (en) Fault analysis method and device
CN111338609A (en) Information acquisition method and device, storage medium and terminal
KR102656541B1 (en) Device, method and program that analyzes large log data using a distributed method for each log type
CN109995617A (en) Automated testing method, device, equipment and the storage medium of Host Administration characteristic
CN117453493B (en) GPU computing power cluster monitoring method and system for large-scale multi-data center
CN118034879A (en) Task processing method, system, electronic device and computer readable storage medium
CN117389722A (en) Task scheduling method and system
CN113205310A (en) Early warning management method and device
CN112990323A (en) User portrait mining method based on big data online mode and machine learning system
CN116167556A (en) Job monitoring method, job monitoring device, job monitoring system, job monitoring equipment and computer readable storage medium
CN117687870A (en) Mobile terminal white screen monitoring method, system, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant