CN113127510A - Data archiving method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113127510A
CN113127510A
Authority
CN
China
Prior art keywords
data
archived
archive
target
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911425845.4A
Other languages
Chinese (zh)
Other versions
CN113127510B (en)
Inventor
陈孟科
谢友平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Intellifusion Technologies Co Ltd
Priority to CN201911425845.4A
Publication of CN113127510A
Application granted
Publication of CN113127510B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data archiving method and device, an electronic device, and a storage medium, where the method includes: calling preset feature engineering through a distributed computing engine to read snapshot data to be archived and archive data in an archive library; determining, according to the data volumes of the data to be archived and the archive data, which of the two to partition; dynamically acquiring a target IP address, and acquiring target GPU card information according to the target IP address; and sending the target GPU card information to a heterogeneous platform, through which the target GPU card is called to archive the data to be archived in a target partition. Embodiments of the application help improve the speed of portrait archiving.

Description

Data archiving method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image archiving technologies, and in particular, to a method and an apparatus for data archiving, an electronic device, and a storage medium.
Background
As public awareness of security has grown, monitoring or snapshot systems have been deployed in most public places, together with archive libraries that store the face archives produced when the systems archive captured face images. Archiving means storing a large number of captured face images into the many face archives that already exist in the archive library. As the coverage and access of monitoring and snapshot systems increase, the number of face images to be archived each day also grows explosively, and the huge data volume easily slows archiving down.
Disclosure of Invention
In view of the above problems, the present application provides a data archiving method and apparatus, an electronic device, and a storage medium, which help increase the speed of portrait archiving.
In order to achieve the above object, a first aspect of the embodiments of the present application provides a method for data archiving, where the method includes:
calling preset feature engineering through a distributed computing engine to read snapshot data to be archived and archive data in an archive library;
determining, according to the data volumes of the data to be archived and the archive data, which of the two to partition;
dynamically acquiring a target IP address, and acquiring target GPU card information according to the target IP address;
and sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
In a possible embodiment, the determining, according to the data volumes of the data to be archived and the archive data, which of the two to partition includes:
if the data volume of the data to be archived is larger than that of the archive data, clustering the data to be archived into n clusters by a K-means clustering algorithm, and calling a first preset operator of the distributed computing engine to partition the n clusters;
if the data volume of the archive data is larger than that of the data to be archived, clustering the archive data into n clusters by a K-means clustering algorithm, and calling a first preset operator of the distributed computing engine to partition the n clusters.
In one possible embodiment, the method further comprises:
under the condition that the data to be archived are partitioned, broadcasting the archive data;
and under the condition of partitioning the archive data, broadcasting the data to be archived.
In a possible embodiment, the invoking, by the heterogeneous platform, the target GPU card to archive the data to be archived in a target partition includes:
under the condition of partitioning the data to be archived, acquiring the archive data from a broadcast variable;
calling the target GPU card through the heterogeneous platform, and matching the data to be archived with the archive data acquired from the broadcast variables in a target partition for archiving;
under the condition of partitioning the archive data, acquiring the data to be archived from a broadcast variable;
and calling the target GPU card through the heterogeneous platform, and matching the data to be archived acquired from the broadcast variable with the archive data in a target partition for archiving.
In one possible implementation, before the calling, by the heterogeneous platform, the target GPU card archives the data to be archived in a target partition, the method further includes:
detecting whether the target GPU card is opened;
if so, executing the operation of calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition;
if not, calling a CPU of the equipment to which the target IP address belongs through the heterogeneous platform to archive the data to be archived in the target partition.
A second aspect of the embodiments of the present application provides an apparatus for data archiving, where the apparatus includes:
the data reading module is used for calling preset feature engineering through a distributed computing engine to read snapshot data to be archived and archive data in an archive library;
the data partitioning module is used for determining, according to the data volumes of the data to be archived and the archive data, which of the two to partition;
the GPU obtaining module is used for dynamically obtaining a target IP address and obtaining target GPU card information according to the target IP address;
and the archiving execution module is used for sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
A third aspect of embodiments of the present application provides an electronic device, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above-mentioned data archiving method when executing the computer program.
A fourth aspect of the present embodiments provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the above-mentioned data archiving method.
The above scheme of the present application has at least the following beneficial effects: preset feature engineering is called through a distributed computing engine to read snapshot data to be archived and archive data in an archive library; which of the two to partition is determined according to their data volumes; a target IP address is dynamically acquired, and target GPU card information is acquired according to it; and the target GPU card information is sent to a heterogeneous platform, through which the target GPU card is called to archive the data to be archived in a target partition. After the snapshot data to be archived and the archive data in the archive library are read, whichever of the two has the larger data volume is partitioned; the target GPU card is then dynamically called through the target IP address to archive the data to be archived in the target partition corresponding to that address. GPU resources in the system are thus fully utilized, improving the speed of portrait archiving.
Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application architecture provided by an embodiment of the present application;
fig. 2 is a schematic flow chart of a data archiving method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating another method for archiving data according to an embodiment of the present disclosure;
FIG. 4 is an exemplary diagram of inserting data to be archived into an archive according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for data archiving according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another data archiving apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another data archiving apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another data archiving apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as appearing in the specification, claims and drawings of this application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Furthermore, the terms "first," "second," and "third," etc. are used to distinguish between different objects and are not used to describe a particular order.
First, a network system architecture to which the solution of the embodiments of the present application may be applied is described with reference to the drawings. Referring to fig. 1, fig. 1 is an application architecture diagram provided in an embodiment of the present application. The architecture mainly includes an image acquisition device, terminal devices, a heterogeneous platform, and an archive library. The heterogeneous platform server reads or loads the data to be archived captured by the image acquisition device and the archive data in the archive library through an archiving model built with the archive library's feature engineering; according to the data volumes, it decides which of the two data sets to partition and which to broadcast. Partitioning distributes the partitioned data to multiple terminal devices, and an executor dynamically calls the CPU (Central Processing Unit) or GPU (Graphics Processing Unit) of each terminal device to perform the archiving. The archive library has corresponding feature engineering and also contains various archive or event tables, such as a legacy table, an archiving table, and an archive aggregation table; the archive library can be updated after the data to be archived is archived, and a retention period can be set for the data in each table. It can be understood that the heterogeneous platform is the execution hub of the whole application architecture, and data that was not successfully archived can be reprocessed. Based on the application architecture shown in fig. 1, the data archiving method provided in the embodiments of the present application is described in detail below with reference to the other drawings.
Referring to fig. 2, fig. 2 is a schematic flow chart of a data archiving method according to an embodiment of the present application, and as shown in fig. 2, the method includes the steps of:
S21, calling preset feature engineering through the distributed computing engine to read the snapshot data to be archived and the archive data in the archive library.
In the embodiment of the present application, the data to be archived are the face images to be archived, and the archive data are the face images already in the portrait archives of the archive library; for example, Zhang San's portrait archive contains multiple identity photos of Zhang San. The preset feature engineering is the feature engineering of the archive library. Taking the archive data as an example, when the feature engineering reads the archive data, records whose identifiers share the same mantissa are read onto the same shard by a modulo operation. For example, suppose the archive data contains the duplicate identifiers 604715476525912098 and 604715476525912098 and the records are to be read onto 10 shards: both identifiers have mantissa 8, so the corresponding archive data are read onto the same shard, and later deduplication via the distinct operator of the distributed computing engine then requires no network traffic. The data to be archived are read in the same way.
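The mantissa-based shard assignment described above can be sketched in plain Python (an illustrative stand-in for the distributed read; function names and the shard count of 10 are assumptions, not from the patent):

```python
# Records whose identifiers share a mantissa (last digit) land on the same
# shard, so duplicate IDs can be deduplicated locally without a shuffle.

def shard_for_id(record_id: str, num_shards: int = 10) -> int:
    """Map an identifier to a shard by its mantissa modulo the shard count."""
    return int(record_id[-1]) % num_shards

def group_by_shard(record_ids, num_shards: int = 10):
    """Group identifiers into shards by mantissa."""
    shards = {i: [] for i in range(num_shards)}
    for rid in record_ids:
        shards[shard_for_id(rid, num_shards)].append(rid)
    return shards

ids = ["604715476525912098", "604715476525912098", "604715476525912091"]
shards = group_by_shard(ids)
# Both copies of ...098 (mantissa 8) fall into shard 8, so deduplication
# of that ID needs no cross-shard network traffic.
```

With this placement, a per-shard `distinct` pass removes the duplicates without any data movement between shards.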
S22, determining, according to the data volumes of the data to be archived and the archive data, which of the two to partition.
In the embodiment of the present application, after the data to be archived and the archive data are read, their data volumes are compared. If the data volume of the data to be archived is larger than that of the archive data, the data to be archived are partitioned and the partitions are distributed to different terminal devices, for example: partition 1 of the data to be archived to terminal device 1, partitions 2 and 4 to terminal device 2, and so on, while the archive data are broadcast. If the data volume of the archive data is larger than that of the data to be archived, the archive data are partitioned and distributed to different terminal devices, and the data to be archived are broadcast.
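The decision in S22 reduces to a comparison of the two data volumes; a minimal sketch (the labels are illustrative, not from the patent):

```python
# Partition whichever side has the larger data volume; broadcast the
# smaller side so that broadcasting stays cheap on the network.

def choose_partition_side(n_to_archive: int, n_archive: int):
    """Return (side_to_partition, side_to_broadcast)."""
    if n_to_archive > n_archive:
        return ("to_archive", "archive")
    return ("archive", "to_archive")

decision = choose_partition_side(1_000_000, 200_000)
# With far more snapshot records than archive records, the data to be
# archived are partitioned and the archive data are broadcast.
```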
S23, dynamically acquiring the target IP address and acquiring the target GPU card information according to the target IP address.
In the embodiment of the present application, the target IP (Internet Protocol) address is the IP address of a terminal device participating in this round of archiving, and the GPU card information can be understood as identification information of a GPU; the target GPU card is the GPU card associated with the target IP. The system broadcasts distributed-computing-engine tuple elements composed of the IP addresses and GPU card information in the configuration file, and these tuple elements express the association between IP addresses and GPU cards, for example: IP1 is associated with GPU-1 and GPU-2, and IP2 is associated with GPU-A and GPU-B. By acquiring the target IP, the GPU card information that this partition's archiving needs to call can therefore be determined from the tuple elements read out of the broadcast variable. In addition, when broadcasting the tuple elements and the data to be archived (or the archive data), to avoid memory overflow caused by an overly large data volume, the broadcast variable should be tuned to use off-heap memory or the maximum variable memory.
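The IP-to-GPU association can be modeled as a simple lookup; the sketch below uses a plain dict to stand in for the broadcast tuple elements (the IP addresses and card names are hypothetical examples following the IP1/IP2 illustration above):

```python
# Stand-in for the broadcast (IP address, GPU card info) tuple elements.
gpu_map = {
    "192.168.0.1": ["GPU-1", "GPU-2"],   # IP1 and its associated cards
    "192.168.0.2": ["GPU-A", "GPU-B"],   # IP2 and its associated cards
}

def target_gpu_cards(target_ip: str, broadcast_map: dict) -> list:
    """Resolve the GPU card information associated with the target IP."""
    return broadcast_map.get(target_ip, [])

cards = target_gpu_cards("192.168.0.1", gpu_map)
```

In Spark this map would live in a broadcast variable so every executor resolves its own cards locally without querying a central service.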
And S24, sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
In the embodiment of the application, the target partition resides on the terminal device to which the target IP address belongs. After the target GPU card information is acquired, if the partitioned side is the data to be archived, the data to be archived are added to the heterogeneous platform in batches; if the partitioned side is the archive data, the archive data are added in batches. Batching keeps the archiving performance of the target GPU card optimal; up to 800,000 records can be added per batch. When data are added to the heterogeneous platform, the target GPU card information is sent along with them, and the heterogeneous platform calls the target GPU card to archive in the target partition according to that information. Specifically, suppose data to be archived 1-5 and archive data 1-5 exist and were deduplicated when read in step S21, so that they are face images of different objects. If data to be archived 1 and 2 are in the target partition, the broadcast side is the archive data: the target GPU card is called to compute the similarity between data to be archived 1 and 2 and archive data 1-5 respectively, and if the record among archive data 1-5 with the greatest similarity to data to be archived 1 is archive data 3, data to be archived 1 is archived into the archive where archive data 3 resides. Conversely, if archive data 1 and 2 are in a first target partition and archive data 3, 4, and 5 are in a second target partition, the broadcast side is the data to be archived: the target GPU card of the first target partition computes the similarity between data to be archived 1-5 and archive data 1 and 2, the target GPU card of the second target partition computes the similarity between data to be archived 1-5 and archive data 3, 4, and 5, the archive data with the greatest similarity to each of data to be archived 1-5 is determined, and each record is archived into the archive where its most similar archive data resides.
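The matching step above can be sketched as a nearest-archive search. The patent does not fix a similarity metric, so cosine similarity is assumed here, and the archive names and feature vectors are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def best_archive(record, archives):
    """Return (archive_id, similarity) of the most similar archive."""
    return max(
        ((aid, cosine(record, vec)) for aid, vec in archives.items()),
        key=lambda t: t[1],
    )

# Toy archive feature vectors; a record to be archived goes to the archive
# whose representative vector it most resembles.
archives = {"archive-3": [1.0, 0.0], "archive-5": [0.0, 1.0]}
aid, sim = best_archive([0.9, 0.1], archives)
# aid -> "archive-3"
```

In the described system this loop runs on the target GPU card in batch; the per-record argmax over similarities is the same either way.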
Compared with the prior art, in the present application preset feature engineering is called through a distributed computing engine to read the snapshot data to be archived and the archive data in the archive library; which of the two to partition is determined according to their data volumes; a target IP address is dynamically acquired and target GPU card information is acquired according to it; and the target GPU card information is sent to a heterogeneous platform, through which the target GPU card is called to archive the data to be archived in a target partition. After the snapshot data to be archived and the archive data are read, whichever has the larger data volume is partitioned, and the target GPU card is then dynamically called through the target IP address to archive the data to be archived in the target partition corresponding to that address. GPU resources in the system are fully utilized, improving the speed of portrait archiving.
Referring to fig. 3, fig. 3 is a schematic flow chart of another data archiving method according to an embodiment of the present application, as shown in fig. 3, including the steps of:
S31, calling preset feature engineering through a distributed computing engine to read the snapshot data to be archived and the archive data in the archive library;
S32, if the data volume of the data to be archived is larger than that of the archive data, clustering the data to be archived into n clusters by a K-means clustering algorithm, and calling a first preset operator of the distributed computing engine to partition the n clusters;
S33, if the data volume of the archive data is larger than that of the data to be archived, clustering the archive data into n clusters by a K-means clustering algorithm, and calling a first preset operator of the distributed computing engine to partition the n clusters;
S34, dynamically acquiring a target IP address, and acquiring target GPU card information according to the target IP address;
S35, sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
In this specific embodiment of the application, the first preset operator is the coalesce operator. When the data volume of the data to be archived is larger than that of the archive data, the data to be archived are clustered into n clusters according to the similarity of their feature values by a K-means clustering algorithm, and the coalesce operator of the distributed computing engine is then called to partition the n clusters. Similarly, when the data volume of the archive data is larger than that of the data to be archived, the archive data are clustered into n clusters according to the similarity of their feature values, and the coalesce operator is called to partition them. Partitioning after clustering avoids the situation where the same face image ends up in two different clusters, which would create a wide dependency.
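The cluster-then-partition step can be sketched as a single K-means assignment pass, with each cluster mapping to one partition. In Spark this would be MLlib KMeans followed by `coalesce(n)`; the stand-in below is plain Python with fixed, illustrative centroids rather than fitted ones:

```python
# One K-means assignment pass: each vector goes to its nearest centroid's
# cluster, and each cluster then becomes one partition.

def assign_clusters(vectors, centroids):
    """Group feature vectors into partitions by nearest centroid."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    partitions = {i: [] for i in range(len(centroids))}
    for v in vectors:
        nearest = min(range(len(centroids)),
                      key=lambda i: sqdist(v, centroids[i]))
        partitions[nearest].append(v)
    return partitions

centroids = [[0.0, 0.0], [10.0, 10.0]]
parts = assign_clusters([[0.1, 0.2], [9.8, 10.1], [0.3, 0.1]], centroids)
```

Because every vector is assigned to exactly one centroid, a face image cannot straddle two partitions, which is the wide-dependency situation the text says the clustering avoids.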
For some steps in the embodiment shown in fig. 3, please refer to the related description in the embodiment shown in fig. 2, and details are not repeated here to avoid repetition.
In a possible embodiment, the method further comprises:
under the condition that the data to be archived are partitioned, broadcasting the archive data;
and under the condition of partitioning the archive data, broadcasting the data to be archived.
In this implementation, of the archive data and the data to be archived, the side with the larger data volume is partitioned and the side with the smaller data volume is broadcast, which avoids the network consumption and occupation of broadcasting a large data volume and thus does not affect system performance.
In a possible embodiment, the invoking, by the heterogeneous platform, the target GPU card to archive the data to be archived in a target partition includes:
under the condition of partitioning the data to be archived, acquiring the archive data from a broadcast variable;
calling the target GPU card through the heterogeneous platform, and matching the data to be archived with the archive data acquired from the broadcast variables in a target partition for archiving;
under the condition of partitioning the archive data, acquiring the data to be archived from a broadcast variable;
and calling the target GPU card through the heterogeneous platform, and matching the data to be archived acquired from the broadcast variable with the archive data in a target partition for archiving.
In the embodiment of the present application, if the partitioned side is the data to be archived, the archive data are obtained from the broadcast variable, the target GPU card is called, and the data to be archived are matched against the broadcast archive data in the target partition; that is, the target GPU card computes, in the target partition, the similarity between the data to be archived and the archive data obtained from the broadcast variable, and each record to be archived is archived into the archive where the archive data with the greatest similarity resides. If the partitioned side is the archive data, the data to be archived are obtained from the broadcast variable, the target GPU card is called, and the broadcast data to be archived are matched against the archive data in the target partition; the archive where the archive data with the greatest similarity resides is determined as the archive to which the record to be archived belongs.
In this embodiment, the heterogeneous platform calls the target GPU card to perform the archiving operation, which facilitates cooperative computing between the CPU and GPU of the terminal device, makes full use of CPU and GPU resources, and accelerates archiving.
In a possible implementation manner, the obtaining target GPU card information according to the target IP address includes:
acquiring distributed-computing-engine tuple elements from a broadcast variable, where the tuple elements contain the associations between all IP addresses in the configuration file and GPU card information;
and obtaining the target GPU card information from the target IP address and those associations.
In this embodiment, the target GPU card is determined from the target IP address and the broadcast associations between IP addresses and GPU card information, which fully embodies the flexibility of dynamic acquisition and dynamic calling.
In one possible implementation, after the target GPU card is called by the heterogeneous platform to archive the data to be archived in a target partition, the method further includes:
returning the identifier of the archive, the similarity between the archived record and the archive to which it now belongs, and the feature-value identifiers of the data in that archive; forming a tuple from the archive identifier, the similarity, and the feature-value identifier; and taking the feature-value identifier as the key;
the method further comprises the following steps:
aggregating the data in all the partitions according to the hash codes of the key values;
and calling a second preset operator of the distributed computing engine and passing it a custom partition function, where the custom partition function is used to balance the data across partitions.
In this embodiment of the application, the second preset operator is the GroupByKey operator. After the data to be archived are archived in the target partition, the system returns the archive identifier ID (identity document), the similarity between each archived record and its archive, and the feature-value IDs of all face images in the archive, and takes the feature-value ID as the key. Whether the data to be archived or the archive data were partitioned earlier, the data in all partitions can be aggregated according to the hash code of the key, and the GroupByKey operator of the distributed computing engine is then called with a custom partition function that specifies the number of partitions. For example, if the previous partitions were few and some of them held too much data, part of the data in an oversized partition can be moved into other partitions or into separate new partitions to achieve balance.
In this embodiment, the data of the previous partitions is aggregated again by key value, and a user-defined partition function is then passed in through the GroupByKey operator of the distributed computing engine to perform partition balancing, which to a certain extent avoids unbalanced resource occupation when the data in the partitions is inserted into the archives.
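As a concrete illustration, the key-hash aggregation and custom partitioning described above can be sketched without the distributed engine. The function names, the tuple layout, and the partition count of 4 are illustrative assumptions rather than details from the embodiment:

```python
from collections import defaultdict

def custom_partitioner(key, num_partitions):
    """User-defined partition function: map a feature-value ID (the key)
    to a partition index via its hash code."""
    return hash(key) % num_partitions

def group_by_key(records, num_partitions):
    """Aggregate (key, value) tuples into partitions by key hash,
    mimicking a GroupByKey call with a custom partitioner."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in records:
        partitions[custom_partitioner(key, num_partitions)][key].append(value)
    return partitions

# Tuples of (feature-value ID, (archive ID, similarity)) returned after archiving.
records = [("fv1", ("a1", 0.91)), ("fv2", ("a1", 0.88)),
           ("fv1", ("a2", 0.75)), ("fv3", ("a3", 0.97))]
parts = group_by_key(records, num_partitions=4)

# No record is lost, and all values for one key land in exactly one partition.
assert sum(len(v) for p in parts for v in p.values()) == len(records)
assert sum(1 for p in parts if "fv1" in p) == 1
```

Spark's `groupByKey` accepts a `Partitioner` argument in the same spirit; raising the partition count is what relieves the heavily loaded partitions mentioned above.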
In one possible embodiment, the method further comprises:
detecting whether the data to be archived are all successfully archived;
storing the successfully archived data to be archived in an archive table;
storing, in a legacy table, the data to be archived that was not successfully archived and whose similarity with the archived archives is greater than or equal to a similarity threshold;
and storing, in an archive aggregation table, the data to be archived that was not successfully archived and whose similarity with the archived archives is smaller than the similarity threshold.
In the embodiment of the present application, a classification strategy is adopted to insert the data to be archived, both successfully and unsuccessfully archived, into the relevant archive tables. Specifically, as shown in fig. 4, the successfully archived data is inserted into the archive table for storage. For the data that was not successfully archived, it is judged whether its similarity with the archives obtained after archiving is greater than or equal to the similarity threshold: the data at or above the threshold is filtered into the legacy table for storage, where a retention period can be set, while the data below the threshold is inserted into the archive aggregation table for storage, which facilitates the later archive aggregation process. In this implementation, the snapshot data to be archived is inserted into the relevant tables by type, which facilitates later use of the data; meanwhile, the unsuccessfully archived data with higher similarity to the archived archives is filtered out, reducing its interference with the archived archives.
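A minimal sketch of this routing logic, assuming a similarity threshold of 0.8 and a simple record layout (both are illustrative assumptions, not values from the embodiment):

```python
def route_archived_data(items, threshold=0.8):
    """Route each record to the archive, legacy, or archive aggregation
    table according to its archiving result and similarity (sketch)."""
    archive_table, legacy_table, aggregation_table = [], [], []
    for item in items:
        if item["archived"]:                   # successfully archived
            archive_table.append(item)
        elif item["similarity"] >= threshold:  # close to an archive, filtered out
            legacy_table.append(item)          # retained for a set time period
        else:                                  # far from every archive
            aggregation_table.append(item)     # kept for later archive aggregation
    return archive_table, legacy_table, aggregation_table

items = [{"id": 1, "archived": True,  "similarity": 0.95},
         {"id": 2, "archived": False, "similarity": 0.85},
         {"id": 3, "archived": False, "similarity": 0.40}]
archive, legacy, aggregation = route_archived_data(items)
assert [x["id"] for x in archive] == [1]
assert [x["id"] for x in legacy] == [2]
assert [x["id"] for x in aggregation] == [3]
```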
In one possible implementation, before the calling, by the heterogeneous platform, the target GPU card archives the data to be archived in a target partition, the method further includes:
detecting whether the target GPU card is enabled;
if so, executing the operation of calling the target GPU card through the heterogeneous platform to archive the data to be archived in the target partition;
if not, calling the CPU of the device to which the target IP address belongs through the heterogeneous platform to archive the data to be archived in the target partition.
In the embodiment of the application, when the target GPU card switch is not enabled, the CPU of the device to which the target IP address belongs is called through the heterogeneous platform for archiving, realizing cooperative computation between the CPU and the GPU and significantly improving the archiving speed.
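The switch check and CPU fallback reduce to a small dispatch; the two backend callables below are hypothetical stand-ins for the heterogeneous platform's GPU and CPU archiving calls:

```python
def archive_partition(partition, gpu_enabled, gpu_archive, cpu_archive):
    """Archive one partition on the GPU when its switch is enabled,
    otherwise fall back to the CPU of the target device (sketch)."""
    backend = gpu_archive if gpu_enabled else cpu_archive
    return backend(partition)

# Hypothetical backends that tag each record with the unit that processed it.
gpu_archive = lambda part: [("gpu", record) for record in part]
cpu_archive = lambda part: [("cpu", record) for record in part]

assert archive_partition([1, 2], True, gpu_archive, cpu_archive) == [("gpu", 1), ("gpu", 2)]
assert archive_partition([1, 2], False, gpu_archive, cpu_archive) == [("cpu", 1), ("cpu", 2)]
```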
Referring to fig. 5, fig. 5 is a schematic structural diagram of a data archiving device according to an embodiment of the present application, and as shown in fig. 5, the data archiving device includes:
the data reading module 51 is used for calling preset feature engineering through a distributed computing engine to read the snapshot data to be archived and the archive data in the archive library;
the data partitioning module 52 is configured to determine one of the data to be archived and the archival data to partition according to the data amount of the data to be archived and the archival data;
a GPU obtaining module 53, configured to dynamically obtain a target IP address, and obtain target GPU card information according to the target IP address;
and the archiving execution module 54 is configured to send the target GPU card information to a heterogeneous platform, and call the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
The data archiving device calls preset feature engineering through the distributed computing engine to read the snapshot data to be archived and the archive data in the archive library; determines which of the data to be archived and the archive data to partition according to their data amounts; dynamically acquires a target IP address and acquires target GPU card information according to the target IP address; and sends the target GPU card information to a heterogeneous platform, through which the target GPU card is called to archive the data to be archived in a target partition. After the snapshot data to be archived and the archive data in the archive library are read, whichever of the two has the larger data amount is partitioned, and the target GPU card is then dynamically called through the target IP address to archive the data to be archived in the target partition corresponding to that IP address. GPU resources in the system are thus fully utilized, improving the speed of portrait archiving.
In an example, in terms of determining, according to the data amount of the data to be archived and the archive data, one of the data to be archived and the archive data to perform partitioning, the data partitioning module 52 is specifically configured to:
if the data volume of the data to be archived is larger than that of the archival data, clustering the data to be archived into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters;
if the data volume of the archive data is larger than that of the data to be archived, clustering the archive data into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters.
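A self-contained sketch of the cluster-then-partition step: a plain K-means groups the records into n clusters, and each cluster can then be handed to the engine's partition operator. The 2-D points and cluster count are illustrative; the real system clusters high-dimensional face feature vectors:

```python
import random

def kmeans(points, n_clusters, iters=20, seed=0):
    """Minimal K-means: cluster feature vectors into n clusters so that
    each cluster can become one partition (sketch only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(iters):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean).
            i = min(range(n_clusters),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute each center as the mean of its members
                centers[i] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return clusters

# Two well-separated groups of 2-D feature vectors.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
clusters = kmeans(pts, n_clusters=2)
assert sorted(len(c) for c in clusters) == [3, 3]
```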
In an example, as shown in fig. 6, the apparatus further includes a broadcasting module 55, where the broadcasting module 55 is specifically configured to:
under the condition that the data to be archived are partitioned, broadcasting the archive data;
and under the condition of partitioning the archive data, broadcasting the data to be archived.
In an example, in the aspect of invoking, by the heterogeneous platform, the target GPU card to archive the data to be archived in the target partition, the archive execution module 54 is specifically configured to:
under the condition of partitioning the data to be archived, acquiring the archive data from a broadcast variable;
calling the target GPU card through the heterogeneous platform, and matching the data to be archived with the archive data acquired from the broadcast variables in a target partition for archiving;
under the condition of partitioning the archive data, acquiring the data to be archived from a broadcast variable;
and calling the target GPU card through the heterogeneous platform, and matching the data to be archived acquired from the broadcast variable with the archive data in a target partition for archiving.
In an example, in terms of acquiring the target GPU card information according to the target IP address, the GPU acquiring module 53 is specifically configured to:
acquiring distributed computing engine tuple elements from broadcast variables, wherein the tuple elements comprise the association relation between all the IP addresses in the configuration file and the GPU card information;
and obtaining the target GPU card information according to the target IP address and the association relation between the IP addresses and the GPU card information.
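This lookup amounts to resolving the target IP through the broadcast association; the IP addresses and card-info fields below are illustrative assumptions:

```python
# Broadcast tuple elements: (IP address, GPU card info) pairs read from the
# configuration file (addresses and fields are hypothetical).
broadcast_tuples = [
    ("192.168.1.10", {"card_id": 0, "model": "gpu-a"}),
    ("192.168.1.11", {"card_id": 1, "model": "gpu-b"}),
]

def lookup_gpu_card(target_ip, tuples):
    """Resolve the target GPU card info from the IP-to-GPU association."""
    return dict(tuples).get(target_ip)

info = lookup_gpu_card("192.168.1.11", broadcast_tuples)
assert info == {"card_id": 1, "model": "gpu-b"}
assert lookup_gpu_card("10.0.0.1", broadcast_tuples) is None
```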
In an example, as shown in fig. 7, the apparatus further includes a result returning module 56 and a partition balancing module 57, where the result returning module 56 is specifically configured to:
returning the identifier of the archive after archiving, the similarity between the archived data and the archive to which it belongs, and the identifiers of the feature values of the data in the archived archive; forming a tuple from the archive identifier, the similarity and the feature value identifier, and determining the feature value identifier as the key value;
the partition equalizing module 57 is specifically configured to:
aggregating the data in all the partitions according to the hash codes of the key values;
and calling a second preset operator of the distributed computing engine to pass in a user-defined partition function, wherein the user-defined partition function is used for balancing the data across the partitions.
In one example, as shown in fig. 8, the apparatus further includes a data insertion module 58, where the data insertion module 58 is specifically configured to:
detecting whether the data to be archived are all successfully archived;
storing the successfully archived data to be archived in an archive table;
storing, in a legacy table, the data to be archived that was not successfully archived and whose similarity with the archived archives is greater than or equal to a similarity threshold;
and storing, in an archive aggregation table, the data to be archived that was not successfully archived and whose similarity with the archived archives is smaller than the similarity threshold.
In an example, the archive execution module 54 is further specifically configured to:
detecting whether the target GPU card is enabled;
if so, executing the operation of calling the target GPU card through the heterogeneous platform to archive the data to be archived in the target partition;
if not, calling the CPU of the device to which the target IP address belongs through the heterogeneous platform to archive the data to be archived in the target partition.
It should be noted that each step in the data archiving method shown in fig. 2 and fig. 3 may be executed by a corresponding unit module in the data archiving device provided in the embodiment of the present application, achieving the same or similar beneficial effects; for example, steps S21 and S31 may be implemented by the data reading module 51, step S22 may be implemented by the data partitioning module 52, and so on. The data archiving device provided by the embodiment of the present application can be applied to various face image archiving scenarios; specifically, it can be applied to servers, computers, and other devices capable of processing face images.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application, as shown in fig. 9, including: a memory 901 for storing one or more computer programs; a processor 902, configured to invoke a computer program stored in the memory 901 to execute the steps in the above-described method embodiments of data archiving; a communication interface 903 for input and output, wherein the communication interface 903 may be one or more; it will be appreciated that the various parts of the electronic device communicate via respective bus connections. The processor 902 is specifically configured to invoke a computer program to execute the following steps:
calling preset feature engineering through a distributed computing engine to read the snapshot data to be archived and the archive data in an archive library;
determining one of the data to be archived and the archival data to be partitioned according to the data volume of the data to be archived and the archival data;
dynamically acquiring a target IP address, and acquiring target GPU card information according to the target IP address;
and sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
In a possible implementation manner, the determining, by the processor 902, one of the data to be archived and the archival data to be partitioned according to the data amount of the data to be archived and the archival data includes:
if the data volume of the data to be archived is larger than that of the archival data, clustering the data to be archived into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters;
if the data volume of the archive data is larger than that of the data to be archived, clustering the archive data into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters.
In a possible embodiment, the processor 902 is further configured to broadcast the archive data in case of partitioning the data to be archived; and under the condition of partitioning the archive data, broadcasting the data to be archived.
In one possible embodiment, the processor 902 executes the invoking, by the heterogeneous platform, the target GPU card to archive the data to be archived in the target partition, including:
under the condition of partitioning the data to be archived, acquiring the archive data from a broadcast variable;
calling the target GPU card through the heterogeneous platform, and matching the data to be archived with the archive data acquired from the broadcast variables in a target partition for archiving;
under the condition of partitioning the archive data, acquiring the data to be archived from a broadcast variable;
and calling the target GPU card through the heterogeneous platform, and matching the data to be archived acquired from the broadcast variable with the archive data in a target partition for archiving.
In a possible implementation manner, the processor 902 executes the acquiring of the target GPU card information according to the target IP address, including:
acquiring distributed computing engine tuple elements from broadcast variables, wherein the tuple elements comprise the association relation between all the IP addresses in the configuration file and the GPU card information;
and obtaining the target GPU card information according to the target IP address and the association relation between the IP addresses and the GPU card information.
In one possible implementation, after the target GPU card is called by the heterogeneous platform to archive the data to be archived in the target partition, the processor 902 is further configured to:
returning the identifier of the archive after archiving, the similarity between the archived data and the archive to which it belongs, and the identifiers of the feature values of the data in the archived archive; forming a tuple from the archive identifier, the similarity and the feature value identifier, and determining the feature value identifier as the key value;
the processor 902 is further configured to: aggregating the data in all the partitions according to the hash codes of the key values; and calling a second preset operator of the distributed computing engine to pass in a user-defined partition function, wherein the user-defined partition function is used for balancing the data across the partitions.
In one possible implementation, the processor 902 is further configured to: detecting whether the data to be archived has all been successfully archived; storing the successfully archived data in an archive table; storing, in a legacy table, the data that was not successfully archived and whose similarity with the archived archives is greater than or equal to a similarity threshold; and storing, in an archive aggregation table, the data that was not successfully archived and whose similarity with the archived archives is smaller than the similarity threshold.
In one possible implementation, the processor 902 is further configured to: detecting whether the target GPU card is enabled; if so, executing the operation of calling the target GPU card through the heterogeneous platform to archive the data to be archived in the target partition; if not, calling the CPU of the device to which the target IP address belongs through the heterogeneous platform to archive the data to be archived in the target partition.
For example, the electronic device may be a computer, a notebook computer, a tablet computer, a palm computer, a server, a cloud server, or the like. The electronic device may include, but is not limited to, the memory 901, the processor 902, and the communication interface 903. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of an electronic device and is not limiting; an electronic device may include more or fewer components than those shown, combine some components, or use different components.
It should be noted that, since the steps in the data archiving method described above are implemented when the processor 902 of the electronic device executes the computer program, the embodiments of the data archiving method described above are all applicable to the electronic device, and all can achieve the same or similar beneficial effects.
The embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps in the data archiving method.
In particular, the computer program when executed by the processor implements the steps of:
calling preset feature engineering through a distributed computing engine to read the snapshot data to be archived and the archive data in an archive library;
determining one of the data to be archived and the archival data to be partitioned according to the data volume of the data to be archived and the archival data;
dynamically acquiring a target IP address, and acquiring target GPU card information according to the target IP address;
and sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
Optionally, the computer program when executed by the processor further implements the steps of: if the data volume of the data to be archived is larger than that of the archival data, clustering the data to be archived into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters; if the data volume of the archive data is larger than that of the data to be archived, clustering the archive data into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters.
Optionally, the computer program when executed by the processor further implements the steps of: under the condition that the data to be archived are partitioned, broadcasting the archive data; and under the condition of partitioning the archive data, broadcasting the data to be archived.
Optionally, the computer program when executed by the processor further implements the steps of: under the condition of partitioning the data to be archived, acquiring the archive data from a broadcast variable; calling the target GPU card through the heterogeneous platform, and matching the data to be archived with the archive data acquired from the broadcast variables in a target partition for archiving; under the condition of partitioning the archive data, acquiring the data to be archived from a broadcast variable; and calling the target GPU card through the heterogeneous platform, and matching the data to be archived acquired from the broadcast variable with the archive data in a target partition for archiving.
Optionally, the computer program when executed by the processor further implements the steps of: acquiring distributed computing engine tuple elements from broadcast variables, wherein the tuple elements comprise the association relation between all the IP addresses in the configuration file and the GPU card information; and obtaining the target GPU card information according to the target IP address and the association relation between the IP addresses and the GPU card information.
Optionally, the computer program when executed by the processor further implements the steps of: returning the identifier of the archive after archiving, the similarity between the archived data and the archive to which it belongs, and the identifiers of the feature values of the data in the archived archive; forming a tuple from the archive identifier, the similarity and the feature value identifier, and determining the feature value identifier as the key value; aggregating the data in all the partitions according to the hash codes of the key values; and calling a second preset operator of the distributed computing engine to pass in a user-defined partition function, wherein the user-defined partition function is used for balancing the data across the partitions.
Optionally, the computer program when executed by the processor further implements the steps of: detecting whether the data to be archived has all been successfully archived; storing the successfully archived data in an archive table; storing, in a legacy table, the data that was not successfully archived and whose similarity with the archived archives is greater than or equal to a similarity threshold; and storing, in an archive aggregation table, the data that was not successfully archived and whose similarity with the archived archives is smaller than the similarity threshold.
Optionally, the computer program when executed by the processor further implements the steps of: detecting whether the target GPU card is enabled; if so, executing the operation of calling the target GPU card through the heterogeneous platform to archive the data to be archived in the target partition; if not, calling the CPU of the device to which the target IP address belongs through the heterogeneous platform to archive the data to be archived in the target partition.
Illustratively, the computer program of the computer-readable storage medium comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, and the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
It should be noted that, since the computer program of the computer readable storage medium is executed by the processor to implement the steps in the data archiving method, all the embodiments of the data archiving method are applicable to the computer readable storage medium, and the same or similar beneficial effects can be achieved.
The foregoing embodiments of the present application are described in detail to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of data archiving, the method comprising:
calling preset feature engineering through a distributed computing engine to read the snapshot data to be archived and the archive data in an archive library;
determining one of the data to be archived and the archival data to be partitioned according to the data volume of the data to be archived and the archival data;
dynamically acquiring a target IP address, and acquiring target GPU card information according to the target IP address;
and sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
2. The method according to claim 1, wherein the determining one of the data to be archived and the archival data for partitioning according to the data amount of the data to be archived and the archival data comprises:
if the data volume of the data to be archived is larger than that of the archival data, clustering the data to be archived into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters;
if the data volume of the archive data is larger than that of the data to be archived, clustering the archive data into n clusters by adopting a K-means clustering algorithm, and calling a first preset operator of a distributed computing engine to partition the n clusters.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
under the condition that the data to be archived are partitioned, broadcasting the archive data;
and under the condition of partitioning the archive data, broadcasting the data to be archived.
4. The method of claim 3, wherein the invoking, by the heterogeneous platform, the target GPU card to archive the data to be archived in a target partition comprises:
under the condition of partitioning the data to be archived, acquiring the archive data from a broadcast variable;
calling the target GPU card through the heterogeneous platform, and matching the data to be archived with the archive data acquired from the broadcast variables in a target partition for archiving;
under the condition of partitioning the archive data, acquiring the data to be archived from a broadcast variable;
and calling the target GPU card through the heterogeneous platform, and matching the data to be archived acquired from the broadcast variable with the archive data in a target partition for archiving.
5. The method of claim 1, wherein the obtaining target GPU card information according to the target IP address comprises:
acquiring distributed computing engine tuple elements from broadcast variables, wherein the tuple elements comprise the association relation between all the IP addresses in the configuration file and the GPU card information;
and obtaining the target GPU card information according to the target IP address and the association relation between the IP addresses and the GPU card information.
6. The method of claim 1, wherein after invoking, by the heterogeneous platform, the target GPU card to archive the data to be archived in a target partition, the method further comprises:
returning the identifier of the archive after archiving, the similarity between the archived data and the archive to which it belongs, and the identifiers of the feature values of the data in the archived archive; forming a tuple from the archive identifier, the similarity and the feature value identifier, and determining the feature value identifier as the key value;
the method further comprises the following steps:
aggregating the data in all the partitions according to the hash codes of the key values;
and calling a second preset operator of the distributed computing engine to pass in a user-defined partition function, wherein the user-defined partition function is used for balancing the data across the partitions.
7. The method of claim 6, further comprising:
detecting whether the data to be archived are all successfully archived;
storing the successfully archived data to be archived in an archive table;
storing, in a legacy table, the data to be archived that was not successfully archived and whose similarity with the archived archives is greater than or equal to a similarity threshold;
and storing, in an archive aggregation table, the data to be archived that was not successfully archived and whose similarity with the archived archives is smaller than the similarity threshold.
8. An apparatus for data archiving, the apparatus comprising:
the data reading module is used for calling preset feature engineering through a distributed computing engine to read the snapshot data to be archived and the archive data in an archive library;
the data partitioning module is used for determining one of the data to be archived and the archival data to partition according to the data volume of the data to be archived and the archival data;
the GPU obtaining module is used for dynamically obtaining a target IP address and obtaining target GPU card information according to the target IP address;
and the archiving execution module is used for sending the target GPU card information to a heterogeneous platform, and calling the target GPU card through the heterogeneous platform to archive the data to be archived in a target partition.
9. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method of data archiving according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps in the method of data archiving according to one of the claims 1 to 7.
CN201911425845.4A 2019-12-31 2019-12-31 Method and device for archiving data, electronic equipment and storage medium Active CN113127510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911425845.4A CN113127510B (en) 2019-12-31 2019-12-31 Method and device for archiving data, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113127510A true CN113127510A (en) 2021-07-16
CN113127510B CN113127510B (en) 2024-05-03

Family

ID=76771543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911425845.4A Active CN113127510B (en) 2019-12-31 2019-12-31 Method and device for archiving data, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113127510B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053109B1 (en) * 2011-09-15 2015-06-09 Symantec Corporation Systems and methods for efficient data storage for content management systems
CN109416928A (en) * 2016-06-07 2019-03-01 伊路米纳有限公司 For carrying out the bioinformatics system, apparatus and method of second level and/or tertiary treatment
US20190102791A1 (en) * 2017-09-29 2019-04-04 Oracle International Corporation Target user estimation for dynamic assets
CN109710780A (en) * 2018-12-28 2019-05-03 上海依图网络科技有限公司 Archiving method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张东;亓开元;吴楠;辛国茂;刘正伟;颜秉珩;郭锋;: "Architecture and key technologies of the Yunhai big data all-in-one machine", Journal of Computer Research and Development (计算机研究与发展), no. 02 *
李正;陈秀杰;吴伟民;时泰卿;吴玉龙;: "Design of the archive management subsystem in an office automation system", Land and Resources Informatization (国土资源信息化), no. 03 *

Also Published As

Publication number Publication date
CN113127510B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN109766696B (en) Method and device for setting software permission, storage medium and electronic device
US11671434B2 (en) Abnormal user identification
CN109408639B (en) Bullet screen classification method, bullet screen classification device, bullet screen classification equipment and storage medium
CN105763602A (en) Data request processing method, server and cloud interactive system
CN108985954B (en) Method for establishing association relation of each identifier and related equipment
CN108833450B (en) Method and device for preventing server from being attacked
CN108509615B (en) Consensus establishing method and device based on drawing mechanism and readable storage medium
CN109889451B (en) Network speed limiting system and method and server
CN111090268B (en) Data acquisition method and device based on thread division and data acquisition equipment
CN111163130A (en) Network service system and data transmission method thereof
CN111490890A (en) Hierarchical registration method, device, storage medium and equipment based on micro-service architecture
CN108363740B (en) IP address analysis method and device, storage medium and terminal
CN110609924A (en) Method, device and equipment for calculating total quantity relation based on graph data and storage medium
CN113127510B (en) Method and device for archiving data, electronic equipment and storage medium
CN110309328B (en) Data storage method and device, electronic equipment and storage medium
CN111539281A (en) Distributed face recognition method and system
CN110839077A (en) File request processing method, request feedback information processing method and related components
CN110020290B (en) Webpage resource caching method and device, storage medium and electronic device
CN111107079A (en) Method and device for detecting uploaded files
CN110858846A (en) Resource allocation method, device and storage medium
CN111294221B (en) Network isolation configuration method and device based on haproxy
CN112861188A (en) Data aggregation system and method for multiple clusters
CN112559469A (en) Data synchronization method and device
CN108762985B (en) Data recovery method and related product
CN112580030A (en) Network system and semi-isolation network terminal virus searching and killing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant