WO2023275012A1 - Computer-implemented method for the efficient generation of a large volume of configuration data

Info

Publication number: WO2023275012A1
Application number: PCT/EP2022/067667
Authority: WO
Inventors: Christoph Schneider
Original assignee: Christoph Schneider
Priority date: 2021-07-02
Filing date: 2022-06-28
Publication date: 2023-01-05
Also published as: EP4113303A1; US20240168662A1; EP4363975A1

Abstract

The invention relates to a method for processing large volumes of data, so-called big data, by means of different computing architectures and arithmetics. In said process, distributed and heterogeneous data sources can be used. The invention is also directed to a correspondingly set-up system arrangement. Also disclosed is a computer program product with control commands which implement the disclosed method and operate the disclosed device and arrangement.

Description

Computer-implemented method for efficiently generating extensive configuration data

The present invention relates to a method for processing large amounts of data, so-called big data, using different computing architectures and arithmetic. Distributed and heterogeneous data sources can be used here. The invention is also aimed at a correspondingly set up system arrangement. Furthermore, a computer program product is proposed with control commands that implement the proposed method or operate the proposed device and arrangement.

EP 3764618 A1 shows a method for the efficient optimization of memory allocations, which makes it possible to use resources present in a computer network efficiently and to optimize the bandwidth required for this. In addition, it is possible according to the invention for the data to be anonymized during the outsourcing or implicitly encrypted by segmentation.

Various processing techniques are known from the prior art, which make it possible to recognize structures in extensive data sets, so-called big data, and to process the data accordingly. One challenge here is the different data types in the raw data, and this creates problems in making the data compatible with one another or processing it.

Proposed methods are very computationally intensive, which is a problem especially with extensive data sets. Although processors typically have several computing cores, complex computing processes often result, and there are also processor architectures which, with the instruction sets provided, cannot process certain types of matrices efficiently enough for some application scenarios. The problem arises in particular if extensive matrices are provided which are also stored in a distributed manner in a network. This requires a large amount of buffer storage in order to be able to process the matrices efficiently in certain application scenarios. In summary, certain computer architectures are not optimized for matrices.

In addition, in some types of processors, instruction sets are fixed, i.e. hard-coded, and consequently it is not possible with such processors to generate dynamic calculation steps that can be used depending on the configuration data. It is therefore also not possible to provide optimized calculation steps; rather, an existing instruction set must be used, even if it is not optimized for specific input data.

In general, there is also the problem that in the case of extensive data sets, so-called big data, processing is often error-prone and, in addition, requires large hardware capacities. Errors can result from floating point arithmetic or from data types being incompatible with one another. It is thus possible, for example, for decimal places to be represented using fewer bits in a first data type than in a second data type. This inevitably leads to an error, which may lead to further incorrect calculations.
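The bit-width problem described above can be illustrated with a minimal Python sketch. It is not part of the disclosure; the helper name `to_float32` is a hypothetical choice, and the example only assumes IEEE 754 single and double precision as provided by the standard `struct` module.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value64 = 0.1                     # stored with 64 bits
value32 = to_float32(value64)     # same number stored with 32 bits
error = abs(value64 - value32)    # representation error between the two types

print(value64 == value32)         # the two representations differ
print(error)
```

Values such as 0.5 that are exactly representable in both widths survive the round trip unchanged, whereas 0.1 does not.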

It is therefore an object of the proposed invention to provide an improved method which is suitable for efficiently generating extensive configuration data and in doing so addresses the problems mentioned. Furthermore, it is an object of the present invention to propose a corresponding system arrangement which can be implemented or operated according to the method. Furthermore, a computer program product is to be provided which contains control commands which implement the method or operate the proposed device.

The object is solved by a method with the features according to patent claim 1. Further advantageous configurations are specified in the dependent claims. Accordingly, a computer-implemented method for efficiently generating extensive configuration data based on heterogeneous data sources is proposed, comprising reading out a stored matrix with configuration data and serializing the configuration data read out according to a first serialization metric; reading out at least one further stored matrix with configuration data and serializing the configuration data read out according to a second serialization metric; calculating a relation between the serialized configuration data; and creating new configuration data depending on the calculated relation.

According to the invention, a method for efficiently generating extensive configuration data is thus proposed, with the configuration data generally being stored according to any data type. There is therefore no general compatibility and it may be necessary to introduce further processing steps for this purpose. According to the present invention, it is provided that data types are either converted or a relation between the serialized configuration data is calculated in such a way that only certain configuration data are related. In this context, specific configuration data means that pairs of configuration data are compared which are of the same or a compatible data type. If there are more than two matrices, a comparison is not made in pairs, but configuration data from a number of matrices may also be related.

It is generally also possible here not to create a relation between all configuration data, but first to check whether a relation can be created because compatible data types are present. For example, if the algorithm encounters configuration data from a first matrix that has no correspondence in the configuration data from the second matrix, then no relation is generated. Thus, the configuration data is selected for generating a relation.
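The selection step described above might be sketched as follows. The compatibility rule (both entries numeric, or both entries text) and the names `is_compatible` and `select_pairs` are assumptions made for illustration; the description does not fix a concrete rule.

```python
from typing import Any, List, Sequence, Tuple


def is_compatible(a: Any, b: Any) -> bool:
    """Hypothetical rule: both entries numeric, or both entries text."""
    if isinstance(a, bool) or isinstance(b, bool):
        return False
    numeric = (int, float)
    if isinstance(a, numeric) and isinstance(b, numeric):
        return True
    return isinstance(a, str) and isinstance(b, str)


def select_pairs(v0: Sequence[Any], v1: Sequence[Any]) -> List[Tuple[int, Any, Any]]:
    """Return (index, a, b) only where a relation can be generated."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(v0, v1))
        if a is not None and b is not None and is_compatible(a, b)
    ]


pairs = select_pairs([1, "x", 2.5, None], [3, 4, 1.0, 7])
print(pairs)
```

Entries without a counterpart (here `None`) or with an incompatible counterpart are skipped, so no relation is generated for them.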

The stored matrices can each be converted into a vector, with the vector storing the configuration data of the matrix in serialized form. If at least two vectors are now compared or related to one another, an entry within one vector can be compared with an entry of a further vector, the corresponding index being taken into account in each case. If, for example, the configuration data of a first matrix is written in a first column and the configuration data of another matrix is written in a second column, the data can also be compared line by line. If an entry in the first column has no matching entry in the same line of the second column, no relation is calculated. How the vectors or columns are to be created is stored in the serialization metric.

The serialization metric can determine for each matrix how to serialize it. For example, a matrix can be read out line by line and thus the individual entries can be written into a vector. Thus it is generally possible to transform the matrix from a two-dimensional table into a one-dimensional column.
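As a rough illustration, a serialization metric can be modelled as a function that maps a two-dimensional table to a one-dimensional list. The two metrics below (`row_major`, `column_major`) are hypothetical examples; only line-by-line reading is mentioned explicitly in the description.

```python
from typing import Any, List, Sequence

Matrix = Sequence[Sequence[Any]]


def row_major(matrix: Matrix) -> List[Any]:
    """Read the matrix line by line and write the entries in series."""
    return [entry for row in matrix for entry in row]


def column_major(matrix: Matrix) -> List[Any]:
    """Alternative metric: read the matrix column by column."""
    if not matrix:
        return []
    return [row[c] for c in range(len(matrix[0])) for row in matrix]


m = [[1, 2], [3, 4]]
print(row_major(m))
print(column_major(m))
```

The same matrix yields different vectors under different metrics, which is why a metric is associated with each matrix.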

The present invention also accounts for multi-dimensional arrays and the serialization metric for this specifies how these data entries are written into a vector. In general, it is also possible to convert a matrix into not only one vector, but to create several vectors. If necessary, these can also have redundant data records.
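A possible treatment of multi-dimensional arrays and of the conversion into several vectors is sketched below. Both helpers (`flatten`, `split_columns`) are illustrative assumptions and represent only one of many conceivable serialization metrics.

```python
from typing import Any, List


def flatten(array: List[Any]) -> List[Any]:
    """Write a nested (multi-dimensional) list into one vector, depth first."""
    out: List[Any] = []
    for item in array:
        if isinstance(item, list):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out


def split_columns(matrix: List[List[Any]]) -> List[List[Any]]:
    """Convert one matrix into several vectors, one per column."""
    return [list(column) for column in zip(*matrix)]


cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(flatten(cube))
print(split_columns([[1, 2], [3, 4]]))
```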

The matrices are typically read out over a network together with the stored configuration data. For this purpose, the matrices are kept on a server, for example. In this respect, these are also typically heterogeneous data sources, with heterogeneous essentially referring to the data types and the underlying hardware. It is thus possible for the data sources to be provided according to different operating systems. The processing steps can also be distributed in the network in such a way that first the data is read out on a first processing unit and then the configuration data is transmitted to a second processing unit, where it is then serialized, i.e. stored in series. In addition, however, it is advantageous that the serialization takes place on the computing unit on which the matrix is saved. It is also possible to send a serial data stream over the network.

In certain application scenarios, it is advantageous to carry out a serial data transmission, since this can often be carried out more easily than the transmission of a matrix. All network components typically support serial data streams and corresponding protocols provide appropriate security mechanisms. A checksum can thus be calculated over the serial data stream. Moreover, the serial data stream corresponds to the actual real-world data transmission, in which an analog signal is typically transmitted and then digitized using threshold values which relate to an amplitude in the analog signal.
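The checksum over a serial data stream could, for example, look like the following sketch. The use of JSON as wire format, CRC-32 as checksum and the function names are assumptions; the description does not name a specific protocol.

```python
import json
import zlib
from typing import Any, List


def to_stream(vector: List[Any]) -> bytes:
    """Serialize a vector into a byte stream (JSON is an assumed format)."""
    return json.dumps(vector, separators=(",", ":")).encode("utf-8")


def with_checksum(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(frame: bytes) -> bytes:
    """Check the trailing CRC-32 and return the payload if it matches."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("checksum mismatch")
    return payload


frame = with_checksum(to_stream([1, 2, 3]))
print(json.loads(verify(frame)))
```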

In general, according to the proposed method, any number of matrices can be compared or related to one another. However, this requires at least two matrices. In this respect, a first matrix and at least one further matrix are proposed.

Calculating a relation between the serialized configuration data involves comparing them according to predetermined method steps. In general, it is possible to compare the configuration data in such a way that, for example, the configuration data with the largest value is determined and then output. New configuration data is thus created by copying configuration data that has already been read out and serialized. This can also be done using a reference to the stored configuration data. In general, it is possible according to the invention to adopt, for example, the larger value from two vectors as new configuration data. However, the first vector can also be compared as a whole to the second vector or to each further vector, and then the vector that satisfies a specific relation can be output. In summary, where the configuration data are numerical, a “greater than” relation can be used to determine which new configuration data are to be generated. For this purpose, data from both vectors can be mixed and, for example, the larger value of each pair can be written into a new vector. However, it is also possible, for example, to count which vector has the larger entry in each comparison and to use the vector with the most larger entries as the new vector.
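The two variants of the “greater than” relation can be sketched as follows. The function names and the tie-breaking rule (the first vector wins a tie) are assumptions not found in the description.

```python
from typing import List, Sequence


def elementwise_max(v0: Sequence[float], v1: Sequence[float]) -> List[float]:
    """Mix both vectors: write the larger value of each pair into a new vector."""
    return [a if a > b else b for a, b in zip(v0, v1)]


def majority_vector(v0: Sequence[float], v1: Sequence[float]) -> List[float]:
    """Output the whole vector that has the most larger entries."""
    wins0 = sum(a > b for a, b in zip(v0, v1))
    wins1 = sum(b > a for a, b in zip(v0, v1))
    # Assumption: on a tie the first vector is used.
    return list(v0) if wins0 >= wins1 else list(v1)


v0 = [1, 5, 3]
v1 = [4, 2, 6]
print(elementwise_max(v0, v1))
print(majority_vector(v0, v1))
```

In the first variant the new vector mixes entries from both sources; in the second, one existing vector is adopted unchanged.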

Correspondingly, if the configuration data is in the form of alphanumeric data, other rules can be created that indicate how a relation is to be created. For example, terms can be linked using a taxonomy and the relation then describes a relation of the configuration data within the underlying data structure. For example, the relation can describe a prioritization of many terms and then the relation can be created or calculated in such a way that configuration data with the highest priority is output and a vector with new configuration data is thus created, which has precisely this configuration data with the highest priority.
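For alphanumeric configuration data, a prioritization might be sketched as below. The priority table and its terms are purely hypothetical stand-ins for a taxonomy; unknown terms are assumed to have the lowest priority.

```python
from typing import Dict, List, Sequence

# Hypothetical taxonomy: a higher number means a higher priority.
PRIORITY: Dict[str, int] = {"critical": 3, "warning": 2, "info": 1}


def higher_priority(a: str, b: str) -> str:
    """Return the term with the higher priority (unknown terms rank lowest)."""
    return a if PRIORITY.get(a, 0) >= PRIORITY.get(b, 0) else b


def prioritized_vector(v0: Sequence[str], v1: Sequence[str]) -> List[str]:
    """Create a new vector holding the highest-priority term per index."""
    return [higher_priority(a, b) for a, b in zip(v0, v1)]


result = prioritized_vector(["info", "critical"], ["warning", "info"])
print(result)
```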

In another example, it is possible for the relation between two configuration data to be given by calculating a difference between the two data in the event that they are numeric. The relation is, for example, the absolute value of the difference. New configuration data can then be created in such a way that the difference itself forms the configuration data or that the configuration data are created as a function of the differences within a relation vector.
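A difference-based relation vector can be computed as in the following sketch, in which the difference itself forms the new configuration data. The function name is an illustrative assumption.

```python
from typing import List, Sequence


def difference_relation(v0: Sequence[float], v1: Sequence[float]) -> List[float]:
    """Relation vector: absolute difference of the entries at each index."""
    return [abs(a - b) for a, b in zip(v0, v1)]


relation = difference_relation([10, 2.5], [7, 4.0])
print(relation)
```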

According to one aspect of the present invention, the matrices are each stored on different storage devices and transmitted using network technology. This has the advantage that there is a high level of failsafety when the matrices are provided, since they can be distributed over a network and, if necessary, stored redundantly. In addition, the proposed method does not depend on the large and extensive data sets fitting on a single device. The proposed method scales due to the distribution in the network.

According to a further aspect of the present invention, the matrices have data records of different data types and a relation is always calculated between data records of the same data types. This has the advantage that heterogeneous data types can generally be used, with the proposed method checking whether a relation can be generated at all. Consequently, only those data sets or configuration data that are compatible with one another are used. If a vector is created that describes the relations, corresponding entries of incompatible configuration data can remain empty.

According to a further aspect of the present invention, a conversion of data types is performed. This has the advantage that further relations can be calculated even if the data type is not the same. For example, it is possible to save a numeric value as text or to save numeric values as data types with a different number of bits. For example, a floating point number can be stored as 32 bits or 64 bits. According to this aspect, it is ensured that as many relations as possible can be calculated, and in this context it is particularly advantageous for the data types to be converted for this purpose in such a way that they match as far as possible.
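A conversion of data types, for instance from numeric text to a number, might look like this sketch. The helper `to_number` and its policy of returning `None` for values that cannot be converted are assumptions.

```python
from typing import Any, Optional


def to_number(value: Any) -> Optional[float]:
    """Try to convert a value to a 64-bit float; return None if impossible."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return None
    return None


print(to_number("3.5"))   # numeric value saved as text
print(to_number(2))       # integer widened to float
print(to_number("abc"))   # no conversion possible
```

After such a conversion, a numeric relation can be calculated between an entry stored as text and an entry stored as a number.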

According to another aspect of the present invention, the configuration data is serialized into configuration vectors. This has the advantage that existing implementations can be reused and, in particular, storing within a vector is particularly efficient.

According to a further aspect of the present invention, a serialization metric is provided for each matrix, which provides an indication of how data sets of the matrix are transformed. This has the advantage that for each matrix it is fixed at all times how it is read out and, for example, in which order the configuration data is written in series.

In accordance with another aspect of the present invention, all serialization metrics provided produce comparable vectors. This has the advantage that as many relations as possible can be calculated. For example, if two vectors have a different dimension or number of entries, they can be made comparable by bringing them to the same length, the shorter vector being filled with filler data.
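Filling shorter vectors with filler data can be sketched as follows. The choice of `None` as the default filler value is an assumption; the description does not specify the filler.

```python
from typing import Any, List, Sequence


def pad_to_same_length(vectors: Sequence[Sequence[Any]], filler: Any = None) -> List[List[Any]]:
    """Extend every vector to the length of the longest one using filler data."""
    target = max((len(v) for v in vectors), default=0)
    return [list(v) + [filler] * (target - len(v)) for v in vectors]


padded = pad_to_same_length([[1, 2, 3], [4]])
print(padded)
```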

According to a further aspect of the present invention, all serialized configuration data, all relations and/or all new configuration data are stored in the same database. This has the advantage that there is no delay over a network for the calculation-intensive processes, but rather this data is held locally and a shared buffer memory can be used.

According to a further aspect of the present invention, the relation is generated iteratively for a respective selection of configuration data. This has the advantage that the respective vectors or configuration data written in series are checked and, if a relation can be generated, a value from the configuration data from the first matrix and the second matrix is compared in pairs. If there are several matrices, the configuration data are compared according to the indexing. For example, in the case of three vectors, the first entry is compared with the other first entries. The second entry is compared with the further second entries of the other vectors. Figuratively speaking, a table can be compared line by line.
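The index-wise iteration over several vectors can be sketched with the built-in `zip`. The generator name `iterate_lines` is hypothetical, and the vectors are assumed to have the same length already.

```python
from typing import Any, Iterator, Sequence, Tuple


def iterate_lines(*vectors: Sequence[Any]) -> Iterator[Tuple[int, Tuple[Any, ...]]]:
    """Yield, for each index, the entries of all vectors at that index."""
    for index, entries in enumerate(zip(*vectors)):
        yield index, entries


lines = list(iterate_lines([1, 2], [3, 4], [5, 6]))
for index, entries in lines:
    # The first entries are compared with each other, then the second ones.
    print(index, entries, max(entries))
```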

According to a further aspect of the present invention, a data memory is provided for providing calculation steps for calculating relations. This has the advantage that the relations in this data store can be predetermined and adjusted at any time. The calculation steps describe how a relation is to be generated. A distinction must be made between calculation steps that relate to numeric values or to alphanumeric values. In the case of numerical values, the calculation steps can describe an arithmetic, with alphanumeric configuration data being able to be prioritized. However, other calculation steps are also possible.

According to a further aspect of the present invention, the creation of new configuration data includes applying the relation to configuration data, adopting existing configuration data and/or reading out further configuration data. This has the advantage that the new configuration data can either be selected from the existing configuration data or can be calculated from this configuration data. For example, a relation can provide information about which configuration data is to be used in the future.
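The data memory with calculation steps mentioned above could be modelled, in a very reduced form, as a dictionary of named functions. The step names and the dictionary itself are assumptions; in practice the steps would be held in a database.

```python
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical store of calculation steps; entries can be adjusted at any time.
CALCULATION_STEPS: Dict[str, Callable[[Any, Any], Any]] = {
    "abs_diff": lambda a, b: abs(a - b),       # arithmetic for numeric data
    "first_alphabetical": lambda a, b: min(a, b),  # simple rule for text data
}


def apply_step(name: str, v0: Sequence[Any], v1: Sequence[Any]) -> List[Any]:
    """Look up a calculation step by name and apply it index by index."""
    step = CALCULATION_STEPS[name]
    return [step(a, b) for a, b in zip(v0, v1)]


print(apply_step("abs_diff", [1, 5], [4, 2]))
print(apply_step("first_alphabetical", ["b", "a"], ["a", "c"]))

# Adjusting the store: a new calculation step is registered at run time.
CALCULATION_STEPS["sum"] = lambda a, b: a + b
print(apply_step("sum", [1, 2], [3, 4]))
```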

According to a further aspect of the present invention, the configuration data are used to control a terminal device. This has the advantage that the results of the proposed method can be fed back into a terminal and thus operating parameters of the terminal can be influenced by the configuration data.

The object is also achieved by a system arrangement for efficiently generating extensive configuration data based on heterogeneous data sources, having a first interface unit set up for reading out a stored matrix with configuration data and a serialization unit set up for serializing the configuration data read out according to a first serialization metric; at least one second interface unit set up to read out at least one further stored matrix with configuration data and a further serialization unit set up to serialize the configuration data read out according to a second serialization metric; a computing unit set up to calculate a relation between the serialized configuration data; and an output unit set up to create new configuration data depending on the calculated relation.

The object is also achieved by a computer program product with control commands that implement the proposed method or operate the proposed device. According to the invention, it is particularly advantageous that the method can be used to operate the proposed devices and units. Furthermore, the proposed devices and facilities are suitable for carrying out the method according to the invention. Thus, in each case the device implements structural features which are suitable for carrying out the corresponding method. However, the structural features can also be designed as method steps. The proposed method also provides steps for implementing the function of the structural features. In addition, physical components can also be provided virtually or virtualized.

Further advantages, features and details of the invention result from the following description, in which aspects of the invention are described in detail with reference to the drawings. The features mentioned in the claims and in the description can each be essential to the invention individually or in any combination. Likewise, the features mentioned above and those further explained here can each be used individually or together in any combination. Parts or components that are functionally similar or identical are sometimes provided with the same reference symbols. The terms “left”, “right”, “above” and “below” used in the description of the exemplary embodiments refer to the drawings in an orientation with a normally legible figure designation or normally legible reference symbols. The embodiments shown and described are not to be understood as final, but have an exemplary character to explain the invention. The detailed description is provided for the convenience of those skilled in the art, and therefore well-known circuits, structures, and methods are not shown or discussed in detail in the description so as not to obscure the understanding of the present description. The figures show:

FIG. 1: a schematic block diagram of the system arrangement for efficiently generating extensive configuration data based on heterogeneous data sources according to an aspect of the present invention;

FIG. 2: a serialization of the matrices into vectors to generate a further vector according to an aspect of the present invention;

FIG. 3: a representation of the generated columns with serialized data records; and

FIG. 4: a schematic flowchart of the computer-implemented method for efficiently generating extensive configuration data based on heterogeneous data sources.

FIG. 1 shows a block diagram of the proposed system arrangement. The processing of the first matrix is shown at the top of FIG. 1. This matrix is provided by the first device, which has been drawn in here as a database DB0. The provided matrix is then serialized via the component connected on the right. In this component, the matrix is read out line by line, for example, and then converted into a vector, for example. The same applies to at least a second matrix, as shown below. This second matrix is also provided by a database DB1 and transmitted to the component connected on the right, where all of the configuration data is serialized. At the bottom of FIG. 1, it is shown that any number of matrices can be provided and serialized.

The configuration data written in series is then transmitted to a common component, which calculates a relation. This component is also connected to a database, as is shown here, since the database holds corresponding calculation steps. Based on the output of this device, a new set of configuration data is created, which is done in the rightmost component.

In the present case, a computer-implemented method is proposed, although this does not prevent individual steps from being carried out manually. The configuration data can also indicate how, for example, an output device such as a printer or a display is addressed.

FIG. 2 schematically shows the data used. The raw data, which are present as matrices, are shown on the left-hand side. As can be seen in the middle, these matrices are serialized and written into a vector. A relation is now created and this relation is in turn stored in a vector. This is drawn on the right. Typically, the vector on the right contains as many entries as the longest vector in the middle. However, it is also possible that only those relations that could actually be calculated are entered in the vector on the right-hand side. If, after a conversion, the respective data type is not compatible with a data type to be compared, no relation can be created and either an error code is entered at the appropriate point in the right vector or the data record is simply omitted for this impossible relation. This means that the vector on the right-hand side can also be shorter than the vectors in the middle.
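The two ways of handling an impossible relation in the right-hand vector of FIG. 2, entering an error code or omitting the entry, can be sketched as follows. The marker `"ERR"`, the absolute difference as relation and the numeric-only compatibility test are assumptions for illustration.

```python
from typing import Any, List, Sequence

ERROR_CODE = "ERR"  # hypothetical error marker


def relation_vector(v0: Sequence[Any], v1: Sequence[Any], omit_incompatible: bool = False) -> List[Any]:
    """Build the relation vector; mark or omit entries that cannot be related."""
    result: List[Any] = []
    for a, b in zip(v0, v1):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            result.append(abs(a - b))
        elif not omit_incompatible:
            result.append(ERROR_CODE)
    return result


left = [1, "a", 5]
right = [4, 2, "b"]
print(relation_vector(left, right))
print(relation_vector(left, right, omit_incompatible=True))
```

With omission, the relation vector is shorter than the serialized vectors, as noted in the description of FIG. 2.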

On the left-hand side, FIG. 3 shows a vector V0 which was generated from a first matrix and which is now plotted as a column. A column V1 is drawn in next to it, which has the configuration data of the further matrix. The third column V2 indicates a relation between the configuration data. In general, it is also possible to provide a fourth column V3, which has the new configuration data. Thus, in the present example, a column is provided for each vector generated.

FIG. 4 shows a computer-implemented method for efficiently generating extensive configuration data based on heterogeneous data sources, comprising reading out 100 a stored matrix with configuration data and serializing 101 the configuration data read out according to a first serialization metric; reading out 102 at least one further stored matrix with configuration data and serializing 103 the configuration data read out according to a second serialization metric; calculating 104 a relation between the serialized configuration data; and creating 105 new configuration data depending on the calculated relation. Not shown here is a data memory or a computer-readable medium with a computer program product having control commands that implement the proposed method or operate the proposed system arrangement.
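The sequence of steps 100 to 105 can be summarized in a compact sketch. Everything concrete here is an assumption: the in-memory dictionary stands in for the networked databases, the two metrics are row-wise and column-wise reading, and the relation is a “greater than” comparison.

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]

# Hypothetical stand-in for the stored matrices of two data sources.
SOURCES: Dict[str, Matrix] = {
    "DB0": [[1.0, 8.0], [3.0, 4.0]],
    "DB1": [[5.0, 2.0], [7.0, 0.0]],
}


def read_matrix(name: str) -> Matrix:                       # steps 100 and 102
    return SOURCES[name]


def row_major(m: Matrix) -> List[float]:                    # step 101, first metric
    return [x for row in m for x in row]


def column_major(m: Matrix) -> List[float]:                 # step 103, second metric
    return [row[c] for c in range(len(m[0])) for row in m]


def calculate_relation(v0: List[float], v1: List[float]) -> List[bool]:   # step 104
    return [a > b for a, b in zip(v0, v1)]


def create_configuration(v0: List[float], v1: List[float], relation: List[bool]) -> List[float]:  # step 105
    return [a if keep_first else b for a, b, keep_first in zip(v0, v1, relation)]


v0 = row_major(read_matrix("DB0"))
v1 = column_major(read_matrix("DB1"))
relation = calculate_relation(v0, v1)
new_configuration = create_configuration(v0, v1, relation)
print(v0, v1, relation, new_configuration)
```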

Claims

1. A computer-implemented method for efficiently generating large-scale configuration data based on heterogeneous data sources, comprising:

- reading out (100) a stored two-dimensional matrix with configuration data and serializing (101) the read out configuration data according to a first serialization metric;

- reading out (102) at least one further stored two-dimensional matrix with configuration data and serializing (103) the configuration data read out according to a second serialization metric;

- calculating (104) a relation between the serialized configuration data; and

- creating (105) new configuration data as a function of the calculated relation, which includes reading out further configuration data.

2. The method according to claim 1, characterized in that the matrices are each stored on different storage devices and are transmitted via network technology.

3. The method according to claim 1 or 2, characterized in that the matrices have configuration data of different data types and a calculation (104) of a relation always takes place between configuration data of the same data types.

4. The method according to any one of the preceding claims, characterized in that a conversion of data types is carried out.

5. The method according to any one of the preceding claims, characterized in that the configuration data are serialized in configuration vectors.

6. The method according to any one of the preceding claims, characterized in that a serialization metric is provided for each matrix, which provides an indication of how data records of the matrix are read out and in which order the configuration data are written in series.

7. The method according to any one of the preceding claims, characterized in that all serialization metrics provided generate comparable vectors of the same length, such that the shorter vector is filled with padding data.

8. The method according to any one of the preceding claims, characterized in that all serialized configuration data, all relations and/or all new configuration data are stored in the same database.

9. The method according to any one of the preceding claims, characterized in that the relation is generated iteratively for a selection of configuration data.

10. The method according to any one of the preceding claims, characterized in that a data memory with computing steps for calculating relations is provided.

11. The method according to any one of the preceding claims, characterized in that the creation (105) of new configuration data includes applying the relation to configuration data and/or adopting existing configuration data.

12. The method according to any one of the preceding claims, characterized in that the configuration data are used to control a terminal device.

13. System arrangement for the efficient generation of extensive configuration data based on heterogeneous data sources, comprising:

- a first interface unit set up for reading (100) a stored two-dimensional matrix with configuration data and a serialization unit set up for serializing (101) the read out configuration data according to a first serialization metric;

- at least one second interface unit set up for reading (102) at least one further stored two-dimensional matrix with configuration data and a further serialization unit set up for serializing (103) the configuration data read out according to a second serialization metric;

- a computing unit set up to calculate (104) a relation between the serialized configuration data; and

- an output unit set up for creating (105) new configuration data depending on the calculated relation, which includes reading out further configuration data.

14. A computer program product comprising instructions which, when the program is executed by a computer, cause the latter to carry out the steps of the method according to any one of claims 1 to 12.

A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the steps of the method of any one of claims 1 to 12.