CN112529149A - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
CN112529149A
Authority
CN
China
Prior art keywords
data
network
output result
similarity
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011381498.2A
Other languages
Chinese (zh)
Other versions
CN112529149B (en)
Inventor
陈汉亭
王云鹤
许春景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011381498.2A priority Critical patent/CN112529149B/en
Publication of CN112529149A publication Critical patent/CN112529149A/en
Priority to PCT/CN2021/131686 priority patent/WO2022111387A1/en
Application granted granted Critical
Publication of CN112529149B publication Critical patent/CN112529149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The embodiment of the application discloses a data processing method, which is applied to the field of artificial intelligence and comprises the following steps: acquiring a network to be compressed and a plurality of data, wherein the network to be compressed is a classification network; inputting the plurality of data into the network to be compressed to obtain a plurality of first output results, wherein the plurality of first output results are in one-to-one correspondence with the plurality of data; determining a one-hot label corresponding to each first output result in the plurality of first output results; respectively determining a first similarity between each first output result in the plurality of first output results and the corresponding one-hot label; determining at least one target data in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results, wherein the at least one target data is used for compressing the network to be compressed. By the method, a large amount of data similar to the original training data of the network to be compressed can be obtained, so that the network can be effectively compressed.

Description

Data processing method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and a related apparatus.
Background
In recent years, deep neural networks have enjoyed great success in various applications in the field of computer vision, such as image classification, object detection, image segmentation, and the like. However, deep neural network models often include a large number of model parameters and are computationally expensive, making it difficult to apply them to devices with low computational power (such as terminal devices, embedded devices, integrated devices, etc.).
In the related art, compression algorithms for deep neural networks have been proposed: a teacher network model with a large storage space requirement and high computational complexity can be compressed into a student network model with a small storage space requirement and low computational complexity, so that the student network can be applied to devices with low power consumption and limited computing capability. In the related art, compressing a deep neural network with such a compression algorithm requires the training data of the original neural network.
However, in some cases, the training data of the network to be compressed cannot be acquired, which makes it difficult to effectively compress the neural network.
Disclosure of Invention
The application provides a data processing method, which comprises the steps of inputting the obtained label-free data into a network to be compressed, obtaining the one-hot label of the obtained output result, measuring the similarity between the output result and the one-hot label, and taking the label-free data corresponding to the output result with higher similarity as the data for compressing the network to be compressed. By the method, a large amount of data similar to the original training data of the network to be compressed can be obtained, so that the network can be effectively compressed.
A first aspect of the present application provides a data processing method, including: the data processing device acquires a network to be compressed and a plurality of data, wherein the network to be compressed is a classification network and is used for classifying input data to obtain an output classification result; the plurality of data may be image data, text data, video data, or voice data. The network to be compressed may, for example, be uploaded to the data processing apparatus by a user, and the plurality of data may be unlabeled data obtained by the data processing apparatus accessing a particular gallery. The data processing device sequentially inputs the plurality of data into the network to be compressed so as to obtain a first output result corresponding to each data in the plurality of data. The first output result may be an n-dimensional label, where n is the number of classification categories, and each label value in the n-dimensional label represents the probability that the data corresponding to the first output result belongs to the corresponding category.
The data processing device determines a one-hot label corresponding to each first output result in the plurality of first output results; the one-hot label is, for example, an n-dimensional label that includes one label value with a value of 1 and n-1 label values with a value of 0, where n is an integer greater than 1. The data processing device respectively determines a first similarity between each first output result in the plurality of first output results and the corresponding one-hot label; the first similarity may be used to measure the similarity between the data obtained by the data processing apparatus and the original training data, so that the data processing apparatus may determine the target data in the plurality of data according to the first similarity corresponding to each of the obtained plurality of first output results. In short, each piece of data obtained by the data processing device has a corresponding first output result, and each first output result has a corresponding first similarity, so each piece of data obtained by the data processing device has a corresponding first similarity. For data obtained by the data processing device, the higher the first similarity corresponding to the data is, the closer the data is to the original training data of the network to be compressed, so the data processing device may select the data with the higher first similarity as the target data to implement compression of the network to be compressed.
According to the scheme, the obtained label-free data is input into the network to be compressed, the one-hot label of the obtained output result is obtained, the similarity between the output result and the one-hot label is measured, and the label-free data corresponding to the output result with higher similarity is used as the data for compressing the network to be compressed. By the method, a large amount of data similar to the original training data of the network to be compressed can be obtained, so that the network can be effectively compressed.
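For concreteness, the following Python sketch shows one way this selection step could be implemented (PyTorch is assumed; the function name, the choice of relative entropy between the one-hot label and the output as the similarity measure, and the top-N selection rule are illustrative assumptions rather than details fixed by the application):

```python
import torch
import torch.nn.functional as F

def select_target_data(teacher, unlabeled_loader, num_selected):
    """Illustrative sketch: score each unlabeled sample by the similarity between
    the teacher's output and that output's own one-hot label, then keep the
    num_selected most similar samples as target data."""
    teacher.eval()
    all_samples, all_scores = [], []
    with torch.no_grad():
        for x in unlabeled_loader:                    # batches of unlabeled data
            probs = F.softmax(teacher(x), dim=1)      # first output results
            one_hot = F.one_hot(probs.argmax(dim=1), probs.size(1)).float()
            # relative entropy KL(one_hot || probs); smaller = more similar
            kl = (one_hot * (torch.log(one_hot + 1e-12)
                             - torch.log(probs + 1e-12))).sum(dim=1)
            all_scores.append(-kl)                    # first similarity
            all_samples.append(x)
    scores = torch.cat(all_scores)
    samples = torch.cat(all_samples)
    keep = torch.topk(scores, k=min(num_selected, scores.numel())).indices
    return samples[keep]                              # the selected target data
```

Because the one-hot label is zero everywhere except at the argmax, the relative entropy in this sketch reduces to the negative logarithm of the largest output probability, so the samples kept are those on which the network to be compressed is most confident.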
Optionally, in a possible implementation manner, the determining, by the data processing apparatus, at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results includes: determining N target data with the largest first similarity in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results, wherein N is a first preset threshold and N is an integer greater than 1.
In the scheme, by determining the plurality of data with the maximum similarity as the target data, the data close to the original training data can be selected from a large amount of label-free data for training, so that the compression of the network can be effectively realized.
Optionally, in a possible implementation manner, the determining, by the data processing apparatus, at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results includes: determining, by the data processing apparatus, M target data in the plurality of data whose first similarity is greater than a second preset threshold according to the first similarity corresponding to each first output result in the plurality of first output results.
In the scheme, by determining a plurality of data with similarity greater than the threshold as the target data, data close to the original training data can be selected from a large amount of label-free data for training, so that the compression of the network can be effectively realized.
Optionally, in a possible implementation manner, the determining, by the data processing apparatus, a first similarity between each of the plurality of first output results and the one-hot tag respectively includes: determining the first similarity by calculating a relative entropy or distance metric between each of the plurality of first output results and the one-hot tag.
In the scheme, the first similarity is determined by calculating the relative entropy or distance measurement between the first output result and the one-hot label, so that the calculation of the similarity can be realized, and the realizability of the scheme is ensured.
Optionally, in one possible implementation, the distance metric comprises a mean square error (MSE) distance or an L1 distance.
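Written out explicitly, with y denoting a first output result and ŷ the corresponding one-hot label (both n-dimensional), these measures can be expressed as follows (the direction of the relative entropy is left open here; a smaller relative entropy or distance corresponds to a larger first similarity):

$$D_{\mathrm{KL}}(\hat{y}\,\|\,y)=\sum_{i=1}^{n}\hat{y}_i\log\frac{\hat{y}_i}{y_i},\qquad d_{\mathrm{MSE}}(y,\hat{y})=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2,\qquad d_{L1}(y,\hat{y})=\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert.$$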
Optionally, in a possible implementation manner, the method further includes: and compressing the network to be compressed by a distillation method to obtain a target network.
Optionally, in a possible implementation manner, the compressing the network to be compressed by a distillation method to obtain a target network includes: the data processing device acquires a student network; the data processing device respectively inputs the at least one target data into the student network and the network to be compressed to obtain a second output result of the student network and a third output result of the network to be compressed; the data processing device determines a loss function according to the second output result and the third output result; and the data processing device trains the student network according to the loss function until the loss function is converged to obtain the target network.
Optionally, in a possible implementation manner, the determining, by the data processing apparatus, a loss function according to the second output result and the third output result includes: determining a second similarity between the second output result and the third output result; determining the loss function based at least on the second similarity.
Optionally, in a possible implementation manner, the determining a loss function according to the second output result and the third output result further includes: determining a fourth output result according to the second output result and a probability transition matrix; determining a one-hot label corresponding to the third output result; and determining a third similarity between the fourth output result and the one-hot label corresponding to the third output result; and the determining the loss function based at least on the second similarity includes: determining the loss function according to the second similarity and the third similarity.
According to the scheme, the probability transition matrix is introduced to correct the prediction labels of the teacher network, so that the effect of network compression can be improved under the condition that the training data is label-free data, and the prediction accuracy of the compressed network is guaranteed.
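A rough Python/PyTorch sketch of one training step of this distillation procedure is given below; the weighting factor alpha, the MSE form of the second similarity, the cross-entropy form of the third similarity, and the n×n shape of the probability transition matrix are assumptions made for illustration only:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, transition_matrix, optimizer, alpha=0.5):
    """Hypothetical sketch of one training step on a batch x of target data."""
    teacher.eval()
    with torch.no_grad():
        t_probs = F.softmax(teacher(x), dim=1)        # third output result
        t_labels = t_probs.argmax(dim=1)              # its one-hot label (as class indices)
    s_probs = F.softmax(student(x), dim=1)            # second output result
    # second similarity: closeness of the student output to the teacher output
    loss_kd = F.mse_loss(s_probs, t_probs)
    # fourth output result: student output corrected by the probability transition matrix
    corrected = s_probs @ transition_matrix           # shapes (batch, n) x (n, n)
    # third similarity: corrected output vs. the teacher's one-hot label
    loss_ce = F.nll_loss(torch.log(corrected + 1e-12), t_labels)
    loss = alpha * loss_kd + (1.0 - alpha) * loss_ce  # combined loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating such steps over the at least one target data until the loss function converges would yield the target network.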
Optionally, in one possible implementation, the plurality of data includes image data, text data, video data, or voice data.
A second aspect of the present application provides a data processing apparatus, including an acquisition unit and a processing unit; the acquisition unit is used for acquiring a network to be compressed and a plurality of data, wherein the network to be compressed is a classification network; the processing unit is configured to input the plurality of data into the network to be compressed to obtain a plurality of first output results, where the plurality of first output results are in one-to-one correspondence with the plurality of data; the processing unit is further configured to determine a one-hot label corresponding to each of the plurality of first output results; the processing unit is further configured to determine a first similarity between each of the plurality of first output results and the corresponding one-hot label; the processing unit is further configured to determine at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results, where the at least one target data is used to compress the network to be compressed.
Optionally, in a possible implementation manner, the one-hot label is an n-dimensional label, where the n-dimensional label includes one label value with a value of 1 and n-1 label values with a value of 0, and n is an integer greater than 1.
Optionally, in a possible implementation manner, the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, N target data with a largest first similarity among the plurality of data, where N is a first preset threshold and N is an integer greater than 1.
Optionally, in a possible implementation manner, the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, M target data, of the plurality of data, where the first similarity is greater than a second preset threshold.
Optionally, in a possible implementation manner, the processing unit is further configured to determine the first similarity by calculating a relative entropy or a distance metric between each of the plurality of first output results and the one-hot tag.
Optionally, in one possible implementation, the distance metric comprises a mean square error (MSE) distance or an L1 distance.
Optionally, in a possible implementation manner, the processing unit is further configured to compress the network to be compressed by a distillation method to obtain the target network.
Optionally, in a possible implementation manner, the obtaining unit is further configured to obtain a student network; the processing unit is further configured to input the at least one piece of target data into the student network and the network to be compressed respectively to obtain a second output result of the student network and a third output result of the network to be compressed; the processing unit is further configured to determine a loss function according to the second output result and the third output result; and the processing unit is further used for training the student network according to the loss function until the loss function is converged to obtain the target network.
Optionally, in a possible implementation manner, the processing unit is further configured to determine a second similarity between the second output result and the third output result; the processing unit is further configured to determine the loss function at least according to the second similarity.
Optionally, in a possible implementation manner, the processing unit is further configured to: determine a fourth output result according to the second output result and a probability transition matrix; determine a one-hot label corresponding to the third output result; determine a third similarity between the fourth output result and the one-hot label corresponding to the third output result; and determine the loss function according to the second similarity and the third similarity.
Optionally, in one possible implementation, the plurality of data includes image data, text data, video data, or voice data.
A third aspect of the present application provides a data processing apparatus, which may comprise a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the first aspect. For the steps performed by the processor in each possible implementation manner of the first aspect, reference may be made to the first aspect, and details are not described here.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A fifth aspect of the present application provides circuitry comprising processing circuitry configured to perform the method of the first aspect described above.
A sixth aspect of the present application provides a computer program which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A seventh aspect of the present application provides a chip system, which includes a processor, configured to enable a server or a threshold value obtaining apparatus to implement the functions referred to in the foregoing aspects, for example, to send or process data and/or information referred to in the foregoing methods. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the server or the communication device. The chip system may be formed by a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework provided by an embodiment of the present application;
fig. 2a is an image processing system according to an embodiment of the present application;
FIG. 2b is a schematic diagram of another exemplary image processing system according to an embodiment of the present disclosure;
FIG. 2c is a schematic diagram of an apparatus related to image processing provided in an embodiment of the present application;
fig. 3a is a schematic diagram of a system 100 architecture according to an embodiment of the present application;
FIG. 3b is a schematic diagram of semantic segmentation of an image according to an embodiment of the present disclosure;
FIG. 4a is a schematic diagram of application of neural network compression in an actual scene;
fig. 4b is a schematic flowchart of a network compression method according to an embodiment of the present application;
fig. 5 is a flowchart illustrating a data processing method 500 according to an embodiment of the present application;
fig. 6 is a schematic flowchart of compressing a network to be compressed according to an embodiment of the present application;
fig. 7 is a schematic flowchart of network compression according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an execution device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system will be described first. Please refer to fig. 1, which shows a schematic structural diagram of an artificial intelligence main framework; the framework is explained below from two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition onward, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (providing and processing technology) of artificial intelligence to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
The intelligent product and industry application refers to the product and application of an artificial intelligence system in various fields, and is the encapsulation of an artificial intelligence integral solution, the intelligent information decision is commercialized, and the landing application is realized, and the application field mainly comprises: intelligent terminal, intelligent transportation, intelligent medical treatment, autopilot, safe city etc..
Several application scenarios of the present application are presented next.
Fig. 2a is an image processing system provided in an embodiment of the present application, where the image processing system includes a user device and a data processing device. The user equipment comprises a mobile phone, a personal computer or an intelligent terminal such as an information processing center. The user equipment is an initiating end of the image processing, and as an initiator of the image enhancement request, the user usually initiates the request through the user equipment.
The data processing device may be a device or a server having a data processing function, such as a cloud server, a network server, an application server, or a management server. The data processing device receives an image enhancement request from the intelligent terminal through an interactive interface, and then performs image processing through machine learning, deep learning, searching, reasoning, decision making and the like, using a memory for storing data and a processor for data processing. The memory in the data processing device may be a general term that includes local storage and a database storing historical data, which may reside on the data processing device or on other network servers.
In the image processing system shown in fig. 2a, a user device may receive an instruction from a user, for example, the user device may acquire an image input/selected by the user, and then initiate a request to the data processing device, so that the data processing device executes an image enhancement processing application (for example, image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, and the like) on the image acquired by the user device, thereby obtaining a corresponding processing result for the image. For example, the user device may obtain an image input by the user, and then initiate an image denoising request to the data processing device, so that the data processing device performs image denoising on the image, thereby obtaining a denoised image.
In fig. 2a, a data processing apparatus may perform the image processing method of the embodiment of the present application.
Fig. 2b is another image processing system according to an embodiment of the present application, in fig. 2b, a user device directly serves as a data processing device, and the user device can directly obtain an input from a user and directly perform processing by hardware of the user device itself, and a specific process is similar to that in fig. 2a, and reference may be made to the above description, which is not repeated herein.
In the image processing system shown in fig. 2b, the user device may receive an instruction from the user, for example, the user device may acquire an image selected by the user in the user device, and then perform an image processing application (for example, image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, and the like) on the image by the user device itself, so as to obtain a corresponding processing result for the image.
In fig. 2b, the user equipment itself can execute the image processing method according to the embodiment of the present application.
Fig. 2c is a schematic diagram of a related apparatus for image processing provided in an embodiment of the present application.
The user device in fig. 2a and fig. 2b may specifically be the local device 301 or the local device 302 in fig. 2c, and the data processing device in fig. 2a may specifically be the execution device 210 in fig. 2c, where the data storage system 250 may store data to be processed of the execution device 210, and the data storage system 250 may be integrated on the execution device 210, or may be disposed on a cloud or other network server.
The processor in fig. 2a and 2b may perform data training/machine learning/deep learning through a neural network model or other models (e.g., models based on a support vector machine), and perform image processing application on the image using the model finally trained or learned by the data, so as to obtain a corresponding processing result.
Fig. 3a is a schematic diagram of an architecture of a system 100 according to an embodiment of the present application, in fig. 3a, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140, where the input data may include: each task to be scheduled, the resources that can be invoked, and other parameters.
During the process that the execution device 110 preprocesses the input data or during the process that the calculation module 111 of the execution device 110 performs the calculation (for example, performs the function implementation of the neural network in the present application), the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processing, and may store the data, the instruction, and the like obtained by corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
It should be noted that the training device 120 may generate corresponding target models/rules based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results. Wherein the training data may be stored in the database 130 and derived from training samples collected by the data collection device 160.
In the case shown in fig. 3a, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 3a is only a schematic diagram of a system architecture provided in this embodiment of the present application, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 3a, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110. As shown in fig. 3a, a neural network may be trained from the training device 120.
The embodiment of the application also provides a chip, which comprises the NPU. The chip may be provided in an execution device 110 as shown in fig. 3a to perform the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 3a to complete the training work of the training apparatus 120 and output the target model/rule.
The neural network processor (NPU) is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU distributes tasks to it. The core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract data from a memory (a weight memory or an input memory) and perform operations.
In some implementations, the arithmetic circuitry includes a plurality of processing units (PEs) therein. In some implementations, the operational circuit is a two-dimensional systolic array. The arithmetic circuit may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory and carries out matrix operation with the matrix B, and partial results or final results of the obtained matrix are stored in an accumulator (accumulator).
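As a plain numerical illustration of this accumulation (NumPy; the tile size and matrix shapes are arbitrary assumptions, and no particular NPU instruction set is implied):

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Accumulate C = A @ B tile by tile, mimicking how partial results
    are collected in the accumulator before the final result is produced."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))                         # plays the role of the accumulator
    for start in range(0, k, tile):
        end = min(start + tile, k)
        C += A[:, start:end] @ B[start:end, :]   # partial result accumulated
    return C
```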
The vector calculation unit may further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector computation unit may be used for network computation of the non-convolution/non-FC layer in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit can store the processed output vector to a unified buffer. For example, the vector calculation unit may apply a non-linear function to the output of the arithmetic circuit, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to arithmetic circuitry, e.g., for use in subsequent layers in a neural network.
The unified memory is used for storing input data and output data.
A direct memory access controller (DMAC) is used to carry input data in the external memory to the input memory and/or the unified memory, to store the weight data in the external memory into the weight memory, and to store data in the unified memory into the external memory.
The bus interface unit (BIU) is used for realizing interaction among the host CPU, the DMAC and the instruction fetch memory through a bus.
An instruction fetch buffer (instruction fetch memory) is connected to the controller and is used for storing instructions used by the controller.
The controller is used for calling the instructions cached in the instruction fetch memory to control the working process of the operation accelerator.
Generally, the unified memory, the input memory, the weight memory, and the instruction fetch memory are On-Chip (On-Chip) memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:

$$h_{W,b}(x)=f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e. the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
The operation of each layer in the neural network can be described mathematically by the expression

$$\vec{y}=a(W\cdot\vec{x}+b)$$

From a physical point of view, the work of each layer in the neural network can be understood as completing the transformation from the input space to the output space (i.e. from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. W is a weight vector, and each value in the vector represents the weight value of a neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, i.e. the weight W of each layer controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (a weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix.
Because it is desirable that the output of the neural network be as close as possible to the value that is actually desired to be predicted, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired value and adjusting the weight vector according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the neural network). Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the neural network becomes a process of reducing this loss as much as possible.
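For example, a commonly used loss function of this kind is the mean squared error between the predicted values ŷ_i and the desired values y_i over m samples (given only as an illustration; the application does not prescribe a particular loss):

$$L=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i-y_i)^2$$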
(2) Back propagation algorithm
In the training process, the neural network can use a back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is transmitted forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming at obtaining the optimal parameters of the neural network model, such as the weight matrix.
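Concretely, each weight is adjusted against the gradient of the loss; for a weight w and a learning rate η, one gradient-descent update in back propagation can be written as (a standard textbook form, not specific to this application):

$$w \leftarrow w - \eta\,\frac{\partial L}{\partial w}$$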
(3) Image enhancement
Image enhancement refers to processing the brightness, color, contrast, saturation, dynamic range, etc. of an image to meet certain specific criteria. In brief, in the process of image processing, by purposefully emphasizing the overall or local characteristics of an image, an original unclear image is made clear or certain interesting characteristics are emphasized, the difference between different object characteristics in the image is enlarged, and the uninteresting characteristics are inhibited, so that the effects of improving the image quality and enriching the image information quantity are achieved, the image interpretation and identification effects can be enhanced, and the requirements of certain special analysis are met. Exemplary image enhancements may include, but are not limited to, image super-resolution reconstruction, image denoising, image defogging, image deblurring, and image contrast enhancement.
(4) Image semantic segmentation
Image semantic segmentation refers to subdividing an image into different categories according to some rule (such as illumination, category). In brief, the semantic segmentation of the image aims to label each pixel point in the image with a label, that is, to label the object class to which each pixel in the image belongs, and the labels may include people, animals, cars, flowers, furniture, and the like. Referring to fig. 3b, fig. 3b is a schematic diagram of semantic segmentation of an image according to an embodiment of the present disclosure. As shown in fig. 3b, the image may be divided into different sub-regions, such as building, sky, plant, etc., according to categories at a pixel level by semantic segmentation.
The method provided by the present application is described below from the training side of the neural network and the application side of the neural network.
The training method of the neural network provided by the embodiment of the application relates to image processing, and particularly can be applied to data processing methods such as data training, machine learning and deep learning, and the training data (such as the image in the application) is subjected to symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like, and a trained image processing model is finally obtained; in addition, the image processing method provided in the embodiment of the present application may use the trained image processing model to input data (e.g., an image to be processed in the present application) into the trained image processing model, so as to obtain output data (e.g., a target image in the present application). It should be noted that the training method of the image processing model and the image processing method provided in the embodiment of the present application are inventions based on the same concept, and can also be understood as two parts in a system or two stages of an overall process: such as a model training phase and a model application phase.
With the development of deep learning techniques, neural networks have been successfully applied to many practical tasks (such as image classification, object detection, text classification, and speech recognition). Generally, neural networks require significant computational resources to be able to function properly. However, on some terminal devices (such as a mobile phone, a camera or a vehicle-mounted terminal), the computing resources of the terminal devices are usually insufficient to support the operation of a neural network with a complex structure.
Therefore, in order to apply the neural network to these computationally limited terminal devices, a compression algorithm of the neural network has been proposed. By adopting the compression algorithm to compress the neural network with higher computational complexity and larger storage space requirement, the compressed neural network with lower computational complexity and smaller storage space requirement can be obtained, so that the compressed neural network can be operated on the terminal equipment with limited computational power.
In the existing compression algorithm, a neural network to be compressed is generally used as a teacher network, a neural network with low computational complexity is used as a student network, original training data is respectively input into the teacher network and the student network, and the teacher network provides effective supervision information for the student network to realize the training of the student network, so that the compressed neural network is obtained. Therefore, the existing compression algorithms usually require the original training data of the network to be compressed for network compression.
However, in most scenarios, the original training data of the network to be compressed is often difficult to obtain.
Referring to fig. 4a, fig. 4a is a schematic diagram illustrating an application of neural network compression in an actual scenario. As shown in fig. 4a, taking an application scenario of a public cloud as an example, a user trains a neural network based on his or her own local data and transmits the trained neural network to the public cloud, where the neural network needs to be compressed so that it can be applied to mobile devices such as mobile phones. The raw training data used to train the neural network is not available due to the user's personal privacy protection, or is too large to transmit to the cloud. That is, the raw training data used to train the neural network is typically not available on the public cloud.
In another possible scenario, a user purchases a trained neural network from a particular organization or company and compresses this network on the public cloud for application to a mobile device such as a cell phone. Since the raw training data of the neural network is a commercial secret, the raw training data of the neural network is generally not available to the user.
In these scenarios, the compression algorithm of the existing neural network is usually difficult to implement due to the lack of original training data.
In view of this, an embodiment of the present application provides a data processing method, where acquired unlabeled data is input into a network to be compressed, a one-hot label of the obtained output result is determined, and the similarity between the output result and the one-hot label is measured, so that the unlabeled data corresponding to an output result with a higher similarity is used as data for compressing the network to be compressed. By the method, a large amount of data similar to the original training data of the network to be compressed can be obtained, so that the network can be effectively compressed.
Referring to fig. 4b, fig. 4b is a schematic flowchart of a network compression method according to an embodiment of the present disclosure. As shown in fig. 4b, the user uploads the network to be compressed to the public cloud, the public cloud compresses the network to be compressed, and the network obtained after compression is deployed on the mobile device. Specifically, the process of compressing the network by the public cloud includes: the public cloud inputs the unlabeled data into the network to be compressed, and based on the one-hot label of the obtained output result, measures the similarity between the output result and the one-hot label, so that the unlabeled data corresponding to output results with higher similarity are used as target data. Then, the public cloud compresses the network to be compressed by adopting a distillation algorithm based on the target data and the network to be compressed, to obtain a compressed network.
Referring to fig. 5, fig. 5 is a flowchart illustrating a data processing method 500 according to an embodiment of the present disclosure. As shown in fig. 5, the data processing method 500 includes the following steps.
Step 501, a data processing apparatus obtains a network to be compressed and a plurality of data, where the network to be compressed is a classification network.
In this embodiment, the data processing device may be a device for compressing a neural network, or may be a device dedicated to acquiring training data required for compressing a neural network. Illustratively, the data processing apparatus may be a server deployed on the cloud for acquiring training data required for compressing the neural network and compressing the neural network based on the acquired training data.
The network to be compressed is a classification network and is used for classifying input data to obtain an output classification result. Exemplarily, assuming that the network to be compressed is T, the input data is x, and the output result of the network to be compressed is y_T, then y_T is an n-dimensional label, where n is the number of classification categories. The dimension with the largest value in the output result y_T is the category of the data x as judged by the network. For example, suppose the output result y_T is a 3-dimensional label, where the first dimension of the 3-dimensional label represents that the classification category is cat, the second dimension represents that the classification category is dog, and the third dimension represents that the classification category is pig; then, when the input data x is an image and the output result y_T is {0, 1, 0}, the classification category of the image is dog, that is, the animal in the image is a dog.
The data processing apparatus may acquire the network to be compressed by acquiring data transmitted by other terminal devices. For example, when the data processing apparatus is a server deployed on a cloud, the data processing apparatus may obtain a network to be compressed uploaded by a user by receiving data sent by a terminal device such as a personal computer or a notebook computer.
The plurality of data are data of the same type, and for example, the plurality of data may be image data, text data, video data, or voice data. For example, when the network to be compressed is an image classification network, the plurality of data are image data, and the network to be compressed is used for classifying images, such as dog, cat, fish, and the like according to the animal displayed on the images. For another example, when the network to be compressed is a text classification network, the data are text data used for classifying texts, such as classifying texts into types of positive emotion texts or negative emotion texts.
The data processing apparatus may acquire the plurality of data in various ways. The manner in which the data processing apparatus acquires the plurality of data will be described below taking the plurality of data as image data as an example.
In one possible approach, in the case where the data processing device is a server, since a large amount of image data is usually stored on the server, the image data may be image data uploaded by a large number of users or image data entered by a dedicated image capturing person. That is, the data processing apparatus can acquire the plurality of data from the storage space in which the image data is stored.
In another possible approach, on a network, there is typically a gallery created and opened by a particular person, which gallery includes a large amount of image data for access and use by developers. That is, the data processing apparatus can access a specific web page to access a corresponding gallery on the web page, thereby acquiring a large amount of data in the gallery as the plurality of data. In addition, the data processing device can also capture image data on the network based on a web crawler to obtain the plurality of data.
It will be appreciated that the data acquired by the data processing apparatus is typically not labeled with a classification, i.e. the data is not specifically classified and labeled with a corresponding label. Since the data processing device can acquire a large amount of unlabeled data, and a part of the unlabeled data is similar to the original training data of the network to be compressed, the data processing device screens out the part of data similar to the original training data of the network to be compressed by the method of the embodiment, and then the part of data can be used for compressing the network to be compressed.
In some cases, the data acquired by the data processing device may be simply classified; for example, the data processing device may acquire animal images, household appliance images, plant images, and the like from different galleries. Then, in the case where the data processing apparatus can acquire simply classified data, if the data processing apparatus can acquire the classification categories of the network to be compressed, the data processing apparatus may preliminarily screen the acquired data to screen out data that cannot serve as training data of the network to be compressed. For example, when the data processing device learns that the classification categories of the network to be compressed are animals, that is, the network to be compressed classifies images of animals, the data processing device may filter out in advance images that are not of animals, for example images of household appliances or images of plants, so as to save the amount of calculation.
Step 502, a data processing apparatus inputs a plurality of data into a network to be compressed to obtain a plurality of first output results, and the plurality of first output results are in one-to-one correspondence with the plurality of data.
After acquiring the plurality of data, the data processing apparatus may sequentially input the plurality of data into the network to be compressed to obtain a first output result corresponding to each data in the plurality of data. The first output result may be an n-dimensional label, where n is the number of classification categories, and each label value in the n-dimensional label represents the probability of the category to which the data corresponding to the first output result belongs. For example, the first dimension of the first output result corresponding to data 1 represents that the classification category is cat, the second dimension represents that the classification category is dog, and the third dimension represents that the classification category is pig; then, in the case where the first output result is {0.3,0.6,0.1}, it means that the probability that data 1 belongs to the cat category is 0.3, the probability that data 1 belongs to the dog category is 0.6, and the probability that data 1 belongs to the pig category is 0.1.
In step 503, the data processing apparatus determines a one-hot (one-hot) label corresponding to each of the plurality of first output results.
In one possible embodiment, the one-hot tag is an n-dimensional tag, the n-dimensional tag including 1 tag value having a value of 1 and n-1 tag values having a value of 0, n being an integer greater than 1. Since the first output result is also an n-dimensional tag, the data processing apparatus may determine the one-hot tag corresponding to the first output result in the following manner: and determining the dimension with the maximum label value in the first output result, and generating a one-hot label corresponding to the first output result based on the dimension with the label value of 1 and the other dimensions with the label values of 0. For example, in the case where the first output result is {0.3,0.6,0.1}, the data processing apparatus may determine that the dimension with the largest tag value in the first output result is the second dimension (i.e., the dimension with the tag value of 0.6), and thus the data processing apparatus may generate a one-hot tag corresponding to the first output result, where the one-hot tag is {0, 1, 0 }.
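As an illustrative aid (not part of the original embodiment), the one-hot label construction described above can be sketched as follows; the function name, the tensor shapes, and the use of PyTorch are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def one_hot_of_output(first_output: torch.Tensor) -> torch.Tensor:
    """Build the one-hot label corresponding to a first output result.

    first_output is an n-dimensional label such as {0.3, 0.6, 0.1}; the
    dimension with the largest label value is set to 1 and all other
    dimensions are set to 0.
    """
    n = first_output.numel()
    top_dim = torch.argmax(first_output)            # dimension with the largest label value
    return F.one_hot(top_dim, num_classes=n).to(first_output.dtype)

# Example from the text: {0.3, 0.6, 0.1} -> {0, 1, 0}
print(one_hot_of_output(torch.tensor([0.3, 0.6, 0.1])))
```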
It will be appreciated that, in the above description, a one-hot tag is a tag that includes 1 tag value with a value of 1 and n-1 tag values with a value of 0. In practice, a one-hot tag may also refer to a tag that includes 1 tag value with a value close to 1 and n-1 tag values with values close to 0; for example, the one-hot tag may be {0.001, 0.997, 0.002}. This embodiment does not specifically limit the one-hot tag.
In step 504, the data processing apparatus determines a first similarity between each of the plurality of first output results and the one-hot tag.
After obtaining the one-hot tag corresponding to each first output result, the data processing apparatus may calculate a first similarity between each first output result and its corresponding one-hot tag.
It will be appreciated that in the training of the classification network, the goal is to make the output of the classification network and the true label of the training data as identical as possible. While the true labels of the training data may be generally represented by one-hot labels. Therefore, for a trained classification network, the output result of the original training data of the classification network in the classification network is very close to the one-hot label, i.e. the similarity between the output result and the one-hot label is very high. For other data other than the original training data, because the classification network does not necessarily accurately recognize the data, the output result of the data in the classification network is not very close to the one-hot label, that is, the similarity between the output result and the one-hot label is not high.
Illustratively, in the case where the network to be compressed is obtained based on image training related to dogs, cats and pigs, that is, the original training data of the network to be compressed are images related to dogs, cats and pigs, the image data obtained by the data processing device includes an image 1 and an image 2, where image 1 is an animal image related to a dog, and image 2 is a household appliance image related to a refrigerator. Image 1 and image 2 are respectively input into the network to be compressed. Because image 1 is similar to the original training data of the network to be compressed, the output result corresponding to image 1 may be {0.08, 0.91, 0.01}; because the difference between image 2 and the original training data of the network to be compressed is large, the network to be compressed can hardly identify image 2 effectively, and the output result corresponding to image 2 may be {0.3, 0.3, 0.4}. It follows that the closer the output result of an image is to a one-hot label, the closer the image is to the original training data.
Based on this, in this embodiment, whether the data corresponding to the first output result is close to the original training data may be determined by determining the first similarity between the first output result and the one-hot tag. If the first similarity is higher, the data corresponding to the first similarity can be considered to be closer to the original training data; if the first similarity is low, the data corresponding to the first similarity can be considered to be greatly different from the original training data.
In one possible embodiment, the determining, by the data processing apparatus, a first similarity between each of the plurality of first output results and the one-hot tag may specifically include: the data processing device determines a first similarity between the first output result and the one-hot label by calculating a relative entropy or a distance metric between each of the plurality of first output results and the one-hot label corresponding to the first output result.
The relative entropy is also called Kullback-Leibler divergence (KL divergence) or information divergence, and is an asymmetric measure of the difference between two probability distributions; specifically, it may be the difference between the information entropies (Shannon entropies) of the two probability distributions. Illustratively, assume that the first output result is y_T and the one-hot tag corresponding to the first output result is t; then the KL divergence between the first output result and the one-hot tag can be as shown in Equation 1:
D_KL(y_T, t) = -[y_T · log(t) + (1 - y_T) · log(1 - t)]   (Equation 1)
where D_KL(y_T, t) represents the KL divergence between the first output result and the one-hot label, and log() represents the logarithm. The smaller the KL divergence, the closer the first output result y_T is to its corresponding one-hot label, and the greater the similarity between the two.
The distance metric may also be referred to as metric similarity; by calculating a distance metric between two pieces of multidimensional data, the similarity between them can be determined. Generally, the smaller the distance metric between two pieces of multidimensional data, the higher the similarity between them; conversely, the greater the distance metric, the lower the similarity. Illustratively, the distance metric may include the mean squared error (MSE) distance or the L1 distance, among others. It should be understood that the data processing device may determine the first similarity based on other distance metrics besides the MSE distance and the L1 distance, and this embodiment does not specifically limit the manner of the distance metric.
The MSE distance is the expected value of the square of the difference between an estimated value of a parameter and the true value of the parameter, and can be used to evaluate the degree of variation of data; the smaller the value of the MSE distance, the smaller the difference between the estimated value and the true value of the parameter. Similarly, assume that the first output result is y_T and the one-hot tag corresponding to the first output result is t; then the MSE distance between the first output result and the one-hot tag can be as shown in Equation 2:
MSE(y_T, t) = (y_T - t)^2   (Equation 2)
where MSE(y_T, t) represents the MSE distance between the first output result and the one-hot label.
The L1 distance is also referred to as the Manhattan distance and represents the sum of the absolute differences of the coordinates of two points in a standard coordinate system. Similarly, assume that the first output result is y_T and the one-hot tag corresponding to the first output result is t; then the L1 distance between the first output result and the one-hot tag can be as shown in Equation 3:
L1(y_T, t) = |y_T - t|   (Equation 3)
where L1(y_T, t) represents the L1 distance between the first output result and the one-hot tag.
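The three similarity measures above (Equations 1 to 3) can be sketched as follows. This is only an illustrative sketch: the per-dimension summation, the eps clamp used to avoid log(0) on exact one-hot labels, and the function names are assumptions not stated in the embodiment.

```python
import torch

def kl_similarity(y_t: torch.Tensor, t: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Equation 1: D_KL(y_T, t) = -[y_T * log(t) + (1 - y_T) * log(1 - t)].
    The smaller the value, the closer y_T is to its one-hot label t."""
    t = t.clamp(eps, 1.0 - eps)                     # avoid log(0) for exact one-hot labels
    return -(y_t * torch.log(t) + (1.0 - y_t) * torch.log(1.0 - t)).sum()

def mse_distance(y_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Equation 2: MSE(y_T, t) = (y_T - t)^2, accumulated over the n dimensions."""
    return ((y_t - t) ** 2).sum()

def l1_distance(y_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Equation 3: L1(y_T, t) = |y_T - t|, accumulated over the n dimensions."""
    return (y_t - t).abs().sum()
```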
Step 505, the data processing apparatus determines at least one target data in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results, where the at least one target data is used for compressing the network to be compressed.
The first similarity can be used for measuring the similarity between the data obtained by the data processing device and the original training data, so that the data processing device can determine the target data in the multiple data according to the first similarities corresponding to the multiple obtained data. In short, the data obtained by the data processing device has corresponding first output results, and each first output result has corresponding first similarity, so the data obtained by the data processing device has corresponding first similarity. For data obtained by the data processing device, the higher the first similarity corresponding to the data is, the closer the data is to the original training data of the network to be compressed, so the data processing device may select the data with the higher first similarity as the target data to implement compression of the network to be compressed.
In a possible embodiment, the determining, by the data processing apparatus, at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results may specifically include: the data processing device determines N target data with the maximum first similarity in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results, wherein N is a first preset threshold and is an integer greater than 1. In short, the data processing apparatus may obtain a first preset threshold N in advance, where the first preset threshold N may be preset in the data processing apparatus, or the data processing apparatus receives the first preset threshold N from another network device in advance; then, the data processing apparatus selects N target data from the plurality of data in order of the first similarity from large to small according to the first preset threshold N, where the target data are the N data with the largest first similarity among the plurality of data. In the case where the first similarity is represented by a KL divergence or a distance metric, the data processing apparatus may actually select N target data from the plurality of data in order of the KL divergence or the distance metric from small to large according to the first preset threshold N, the target data being N data in which the KL divergence or the distance metric is smallest among the plurality of data.
The value of N may be determined according to the actual computing capability of the data processing apparatus and the required compression accuracy of the network to be compressed; for example, the value of N may range from several tens of thousands to several hundreds of thousands. For example, when the data processing apparatus acquires 100 million pieces of data and the value of N is 100,000, the data processing apparatus may determine, as the target data, the 100,000 pieces of data having the largest first similarity among the 100 million pieces of data.
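A minimal sketch of this top-N selection is shown below; the variable names and the convention of negating a divergence or distance so that a larger value means a higher similarity are assumptions for illustration only.

```python
import torch

def select_top_n(data, similarities: torch.Tensor, n: int):
    """Keep the N pieces of data with the largest first similarity.

    data: the acquired unlabeled samples (same order as similarities).
    similarities: shape (len(data),); when a KL divergence or a distance
    metric is used directly, pass its negative so that a larger value
    still means "more similar".
    """
    _, top_idx = torch.topk(similarities, k=n)
    return [data[i] for i in top_idx.tolist()]
```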
In another possible embodiment, the determining, by the data processing apparatus, at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results may specifically include: the data processing device determines M target data with the first similarity larger than a second preset threshold in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results. The second preset threshold may also be obtained by the data processing apparatus in advance, for example, the second preset threshold may be preset in the data processing apparatus, or the data processing apparatus receives the second preset threshold from another network device in advance. For each of the plurality of data, the data processing apparatus may compare whether the first similarity corresponding to the data is greater than a second preset threshold, and if the first similarity corresponding to the data is greater than the second preset threshold, may determine that the data is the target data. The value of the second preset threshold may also be determined according to the actual computing capability of the data processing apparatus and the compression precision of the network to be compressed, and this embodiment is not particularly limited.
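The threshold-based variant can be sketched in the same spirit (illustrative only); here M is simply however many samples exceed the second preset threshold.

```python
def select_above_threshold(data, similarities, threshold):
    """Keep the M pieces of data whose first similarity exceeds the
    second preset threshold; M is not fixed in advance."""
    return [d for d, s in zip(data, similarities) if s > threshold]
```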
In the case where the data processing apparatus employs KL divergence or a distance metric to represent the first similarity, the data processing apparatus may determine the corresponding first similarity by taking the inverse of the KL divergence or the inverse of the distance metric, and determine M pieces of target data among the plurality of data according to the first similarity of each data.
It should be noted that, in the previous possible embodiment, the value of N is fixed, that is, the value of N is the first preset threshold, whereas in this embodiment the value of M is not fixed but is determined based on the first similarity corresponding to each of the plurality of data. The more data among the plurality of data whose first similarity is greater than the second preset threshold, the larger M is; the fewer such data, the smaller M is.
In this embodiment, the acquired unlabeled data is input into the network to be compressed, the one-hot label of each resulting output result is determined, and the similarity between the output result and the one-hot label is measured, so that the unlabeled data corresponding to output results with higher similarity is used as the data for compressing the network to be compressed. By this method, a large amount of data similar to the original training data of the network to be compressed can be obtained, so that the network can be effectively compressed.
In a possible embodiment, after the data processing device obtains the at least one target data, the data processing device may further compress the network to be compressed by distillation to obtain the target network.
In distillation processes, two networks are typically involved: one is a teacher network which is pre-trained, has strong performance, but has high computational complexity and requires a large storage space; the other is a student network to be trained, but with low computational complexity and small memory requirements. The distillation method aims to extract useful information and knowledge from a teacher network to be used as guidance in the student network training process so as to realize the training of the student network. The training and learning are carried out under the guidance of the teacher network, and the student network can obtain better performance than the independent training. That is, a high performance, low computational complexity and low storage consumption student network can be achieved by distillation.
In this embodiment, the data processing apparatus may compress the network to be compressed by a distillation method. Specifically, a student network with low computational complexity is obtained in advance, the network to be compressed is used as the teacher network, the student network is trained based on the obtained target data, useful information is extracted from the teacher network to guide the training of the student network, and the target network is finally obtained through training.
For example, referring to fig. 6, fig. 6 is a schematic flowchart of compressing a network to be compressed according to an embodiment of the present application. As shown in fig. 6, the process of the data processing apparatus compressing the network to be compressed by the distillation method may include the following steps.
Step 601, the data processing device acquires a student network.
In this embodiment, the student network may be a constructed neural network, and can be used to implement data classification, such as a deep neural network. The data processing means may obtain the student network in a variety of ways.
In one possible implementation, one or more pre-constructed student networks may be pre-configured in the data processing apparatus, and the one or more student networks may be constructed in advance and configured in the data processing apparatus by a specific person. Different student networks may have different computational complexity and different requirements for storage space. The data processing device can determine a student network capable of meeting the compression requirement according to the compression requirement of the network to be compressed, such as the size of the storage space occupied by the compressed network, the computational complexity of the compressed network, and other indexes.
In another possible implementation manner, the user may upload the network to be compressed and the student network to the data processing device at the same time, and the data processing device may obtain the student network by acquiring the data uploaded by the user.
In another possible implementation manner, after the compression requirement of the user is obtained, the data processing device may automatically construct a student network capable of meeting the compression requirement according to the compression requirement of the user. For example, when the compression requirement of the user is less than 1 Gigabyte (GB) of storage space occupied by the compressed network, the data processing apparatus may construct a student network with a storage space requirement of less than 1GB based on the compression requirement.
Step 602, the data processing device inputs at least one target data into the student network and the network to be compressed respectively, and obtains a second output result of the student network and a third output result of the network to be compressed.
After obtaining the student network, the data processing device may train the student network based on the at least one target data. Specifically, the data processing apparatus may input one target data of the at least one target data into the student network and the network to be compressed, respectively, to obtain a second output result corresponding to the target data in the student network and a third output result corresponding to the target data in the network to be compressed.
Step 603, the data processing device determines a loss function according to the second output result and the third output result.
In the training process of the student network, the training of the student network needs to be guided based on knowledge extracted from the network to be compressed (i.e., the teacher network). Therefore, the loss function of the student network for training may be determined based on the second output result corresponding to the student network and the third output result corresponding to the network to be compressed.
In general, in the distillation method, the loss function of the student network may be composed of two expressions, one is the similarity between the output result of the student network and the real label of the input data, and the other is the similarity between the output result of the student network and the output result of the teacher network. Illustratively, in the case of similarity expressed in KL divergence, the loss function of the student network can be as shown in equation 4:
loss = D_KL(y_S, y) + D_KL(y_S, y_T)   (Equation 4)
where loss represents the loss function, y_S represents the output result of the student network, y represents the real label of the input data, D_KL(y_S, y) represents the KL divergence between the output result of the student network and the real label of the input data, y_T represents the output result of the teacher network, and D_KL(y_S, y_T) represents the KL divergence between the output result of the student network and the output result of the teacher network. Because the output result of the teacher network is not necessarily accurate, erroneous output results of the teacher network can be effectively corrected based on the similarity between the output result of the student network and the real label of the input data.
In the present embodiment, since the target data obtained by the data processing apparatus does not have a real tag, the data processing apparatus cannot obtain the similarity between the output result of the student network and the real tag of the input data. That is, the data processing apparatus may determine the loss function in such a manner that: the data processing device determines a second similarity between the second output result and the third output result and determines a loss function based on at least the second similarity. For example, a loss function provided in the present embodiment can be shown in equation 5:
loss = D_KL(y_S, y_T)   (Equation 5)
where loss represents the loss function, y_S represents the output result of the student network, y_T represents the output result of the teacher network, and D_KL(y_S, y_T) represents the KL divergence between the output result of the student network and the output result of the teacher network.
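The two loss functions (Equations 4 and 5) can be sketched as follows. The exact KL form, its argument order, and the batch-mean reduction are assumptions of this sketch; the embodiment only specifies which terms appear in each loss.

```python
import torch

def kl_div(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """KL divergence KL(p || q) = sum over classes of p * log(p / q)."""
    p = p.clamp(eps, 1.0)
    q = q.clamp(eps, 1.0)
    return (p * (p / q).log()).sum(dim=-1).mean()

def distillation_loss_with_labels(y_s, y, y_t):
    """Equation 4: loss = D_KL(y_S, y) + D_KL(y_S, y_T); usable only when
    the real label y of the input data is available."""
    return kl_div(y_s, y) + kl_div(y_s, y_t)

def distillation_loss_label_free(y_s, y_t):
    """Equation 5: loss = D_KL(y_S, y_T); the variant used here because the
    selected target data carries no real label."""
    return kl_div(y_s, y_t)
```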
And step 604, training the student network by the data processing device according to the loss function until the loss function is converged to obtain a target network.
After obtaining the loss function, the data processing apparatus may train the student network based on the loss function until the loss function converges to obtain a trained student network, that is, a compressed target network corresponding to the network to be compressed. Briefly, the process of the data processing apparatus training the student network based on the loss function may be: the data processing device inputs one of the plurality of target data into the student network and the network to be compressed, calculates a loss function based on output results of the two networks, then adjusts parameters of the student network according to the loss function, and repeatedly executes a process of inputting the next target data of the plurality of target data into the student network and the network to be compressed and adjusting the parameters of the student network based on the newly calculated loss function until the calculated loss function is smaller than a preset threshold value.
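A simplified training loop matching the description above might look as follows; the optimizer, learning rate, softmax placement, and the single pass over the target data are assumptions, and the convergence test is reduced to comparing the latest loss with a preset threshold.

```python
import torch

def distill(teacher, student, target_data, loss_fn, lr=0.01, loss_threshold=1e-3):
    """Train the student network under the guidance of the frozen teacher
    network until the loss function falls below a preset threshold."""
    teacher.eval()
    optimizer = torch.optim.SGD(student.parameters(), lr=lr)
    for x in target_data:                                   # one target datum at a time
        with torch.no_grad():
            y_t = torch.softmax(teacher(x), dim=-1)         # third output result
        y_s = torch.softmax(student(x), dim=-1)             # second output result
        loss = loss_fn(y_s, y_t)                            # e.g. the Equation 5 loss above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                    # adjust the student parameters
        if loss.item() < loss_threshold:                    # convergence check
            break
    return student
```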
In the embodiment of the present application, since the target data obtained by the data processing apparatus does not have the real tag, the data processing apparatus cannot obtain the similarity between the output result of the student network and the real tag of the input data, that is, the data processing apparatus cannot correct the erroneous output result in the teacher network based on the similarity between the output result of the student network and the real tag of the input data. Based on this, in the present embodiment, a probability transition matrix is introduced by adjusting the loss function, so as to correct an erroneous output result in the teacher network.
In a possible embodiment, the determining, by the data processing apparatus, the loss function according to the second output result and the third output result may further include: the data processing device determines a fourth output result according to the second output result and the probability transition matrix, that is, the data processing device multiplies the second output result by the probability transition matrix to correct the second output result and obtain the fourth output result; the data processing device determines a one-hot label corresponding to the third output result (namely the output result of the teacher network), where the one-hot label is the label predicted by the teacher network; the data processing device determines a third similarity between the fourth output result and the one-hot label corresponding to the third output result, and determines the loss function according to the second similarity and the third similarity.
Illustratively, the loss function may be as shown in equation 6:
loss = D_KL(Q(y_S), t) + D_KL(y_S, y_T)   (Equation 6)
where loss is the loss function, Q is the probability transition matrix, y_S is the second output result (i.e., the output result of the student network), Q(y_S) is the fourth output result (i.e., the result of multiplying the output result of the student network by the probability transition matrix), y_T is the third output result (i.e., the output result of the teacher network), t is the one-hot label corresponding to the third output result, i.e., the label predicted by the teacher network, and D_KL() represents the KL divergence.
It can be seen that, relative to the loss function shown in Equation 5, the loss function shown in Equation 6 introduces a new KL divergence between the result of multiplying the output result of the student network by the probability transition matrix and the label predicted by the teacher network. Since the label predicted by the teacher network may be erroneous, the probability transition matrix Q is introduced in this embodiment to model the errors in the labels predicted by the teacher network, so that the output result of the student network tends toward the correct label; that is, after a correct output result of the student network passes through the noise transition matrix, the possibly erroneous label t predicted by the teacher network is obtained. Moreover, in the training process of the student network, the probability transition matrix Q can be trained together with the student network.
That is, in the process of training the student network by the data processing device, the data processing device may train the probability transition matrix together based on the loss function. That is, the probability transition matrix is not fixed, and the data processing device may also adjust the probability transition matrix during the process of training the student network by the data processing device. By introducing the probability transition matrix to correct the prediction labels of the teacher network, the effect of network compression can be improved under the condition that the training data is label-free data, and the prediction accuracy of the compressed network is ensured.
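One way to keep the probability transition matrix row-stochastic while training it jointly with the student network is to store free parameters and apply a row-wise softmax, as in the sketch below. The parameterization, the kl_div helper (repeated from the earlier sketch), and the argument order in Equation 6 are all assumptions of this illustration.

```python
import torch
import torch.nn as nn

def kl_div(p, q, eps=1e-7):
    """KL divergence KL(p || q) = sum over classes of p * log(p / q)."""
    p, q = p.clamp(eps, 1.0), q.clamp(eps, 1.0)
    return (p * (p / q).log()).sum(dim=-1).mean()

class TransitionCorrectedLoss(nn.Module):
    """Equation 6: loss = D_KL(Q(y_S), t) + D_KL(y_S, y_T), with a trainable
    n x n probability transition matrix Q whose rows sum to 1."""

    def __init__(self, n_classes: int):
        super().__init__()
        self.q_logits = nn.Parameter(torch.zeros(n_classes, n_classes))

    def forward(self, y_s, y_t, t):
        q = torch.softmax(self.q_logits, dim=1)   # row-wise softmax keeps each row summing to 1
        y_corrected = y_s @ q                     # fourth output result Q(y_S)
        return kl_div(y_corrected, t) + kl_div(y_s, y_t)
```

In such a sketch, the parameters of the loss module (e.g., criterion = TransitionCorrectedLoss(n)) would be handed to the same optimizer as the student network, for example torch.optim.SGD(list(student.parameters()) + list(criterion.parameters()), lr=0.01), so that Q is adjusted together with the student network.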
For example, assuming that the second output result is an n-dimensional label, the probability transition matrix may be an n × n matrix, and the sum of the elements in each row of the probability transition matrix is 1. Specifically, one possible probability transition matrix may be as shown in Equation 7:
A = [a_11 ... a_1n; ... ; a_n1 ... a_nn]   (Equation 7)
where A represents the probability transition matrix, a_11, a_1n, a_n1, and a_nn are elements of the probability transition matrix, and a_11 + a_12 + ... + a_1n = 1.
For example, assume that the second output result y_S is (0.6, 0.4) and the probability transition matrix Q is
Q = [0.6 0.4; 0.3 0.7]
(each row summing to 1; this matrix is consistent with the numerical results below). Then the result Q(y_S) of multiplying the second output result y_S by the probability transition matrix Q is (0.48, 0.52). As another example, assume that the second output result y_S is (0.95, 0.05) and the probability transition matrix Q is unchanged; then the result Q(y_S) of multiplying the second output result y_S by the probability transition matrix Q is (0.585, 0.415).
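The two numerical results above are consistent with treating the second output result as a row vector multiplied on the right by the matrix Q shown above; a quick check (illustration only) is:

```python
def row_vector_times_matrix(y, q):
    """Multiply a row vector y by a matrix q (plain-Python sketch)."""
    return [sum(y[i] * q[i][j] for i in range(len(y))) for j in range(len(q[0]))]

q = [[0.6, 0.4],
     [0.3, 0.7]]                                    # each row sums to 1
print(row_vector_times_matrix([0.6, 0.4], q))       # ~[0.48, 0.52]
print(row_vector_times_matrix([0.95, 0.05], q))     # ~[0.585, 0.415]
```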
For the convenience of understanding, the network compression method provided by the embodiment of the present application will be described below with reference to specific examples.
The present embodiment takes a Deep Neural Network (DNN) for image classification as an example to describe the process of compressing the deep neural network. Suppose that a user trains a deep neural network with some pictures taken or created by the user, uploads the deep neural network to a server, and requests that the deep neural network be compressed. At this time, the server can learn the specific structure of the deep neural network; however, the training data of the deep neural network consists of pictures taken or created by the user, and the user is unwilling to upload the training data to the server because the data is too large or contains personal privacy, so the server cannot acquire the original training data of the deep neural network.
Specifically, the present embodiment takes a CIFAR data set as an example and shows the compression effect of the network compression method provided in the present embodiment on a Residual Network (ResNet). The CIFAR data set is a data set collected and organized by developers and comprising 60,000 tiny images.
For the CIFAR data set, a ResNet-34 network can be used as the network to be compressed uploaded by the user, an ImageNet data set is used as the unlabeled data set on the server, and a ResNet-18 network is used as the student network to be trained. The ImageNet project is a large visual database for visual object recognition software research, and the ImageNet data set may be part or all of the image data in that database.
Specifically, the process of network compression may include the steps of:
S1, the server trains the ResNet-34 network structure based on the CIFAR-10 data set as training data to obtain a trained network. Through step S1, the process of the user training a network based on the user's own training set can be simulated.
S2, after obtaining the trained ResNet-34 network, the server may use the ImageNet dataset on the cloud as the unlabeled dataset, and screen the ImageNet dataset by the method 500 described above to obtain the target dataset. Specifically, the server may input the ImageNet dataset into the trained ResNet-34 network, calculate the KL divergence between the output result of each image and the one-hot label of the output result thereof, and select 500,000 images with the smallest KL divergence as the training set.
S3, after obtaining the selected training set, the server may initialize the noise transfer matrix Q and the student network, and compress the ResNet-34 network based on the method 600 and the training set to obtain a compressed network.
For example, in the present embodiment, experiments are performed based on different algorithms and different training data, and specific experimental results may be shown in table 1:
TABLE 1
Compression method | Training data | Accuracy
Network to be compressed (uncompressed pre-trained model) | original training data | 94.85%
Conventional distillation algorithm | original training data | 94.34%
Conventional distillation algorithm | data selected by the present scheme | 93.55%
Improved distillation algorithm of the present scheme | data selected by the present scheme | 94.02%
As can be seen from Table 1, the network to be compressed is an uncompressed pre-trained model with an accuracy of 94.85%. After the network to be compressed was compressed using a conventional distillation algorithm and the original training data, the accuracy of the resulting network was 94.34%. Under the condition that original training data do not exist, after the data processing method provided by the scheme is adopted to select data, the accuracy rate of the network obtained by adopting the traditional distillation algorithm compression is 93.55%. In addition, after the data processing method provided by the scheme is adopted to select data, the accuracy of the network obtained by adopting the improved distillation algorithm compression in the scheme is 94.02%. Therefore, the method provided by the scheme can solve the problem that the network cannot be compressed under the condition of lacking original training data in the related technology, and can ensure that the accuracy of the compressed network keeps a higher level.
Specifically, referring to fig. 7, fig. 7 is a schematic flowchart of network compression according to an embodiment of the present disclosure. As shown in fig. 7, the user has acquired a network to be compressed, i.e., a teacher network, by training with raw training data, which is not available. The server selects and obtains the label-free data which can be used for compressing the teacher network according to the data processing method of the scheme. The server inputs the label-free data into the teacher network and the student network, and training of the student network is achieved based on the distillation algorithm. The error result output in the teacher network can be corrected through the distillation algorithm in the scheme, so that the student network can output a correct prediction result. Specifically, in the teacher network, "pandas" are classified as "spaceships" and "foxes" are classified as "dogs", and these erroneous results are corrected in the student network.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 8, a data processing apparatus provided in an embodiment of the present application includes: an acquisition unit 801 and a processing unit 802; the acquiring unit 801 is configured to acquire a network to be compressed and a plurality of data, where the network to be compressed is a classified network; the processing unit 802 is configured to input the multiple data into the network to be compressed to obtain multiple first output results, where the multiple first output results are in one-to-one correspondence with the multiple data; the processing unit 802 is further configured to determine a unique one-hot tag corresponding to each first output result in the plurality of first output results; the processing unit 802 is further configured to determine a first similarity between each of the plurality of first output results and the one-hot tag respectively; the processing unit 802 is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, at least one target data in the plurality of data, where the at least one target data is used to compress the network to be compressed.
Optionally, in a possible implementation manner, the one-hot tag is an n-dimensional tag, where the n-dimensional tag includes a tag value with a value of 1 and a tag value with a value of n-1 being 0, and n is an integer greater than 1.
Optionally, in a possible implementation manner, the processing unit 802 is further configured to determine, according to the first similarity corresponding to each of the multiple first output results, N target data with a largest first similarity among the multiple data, where N is a first preset threshold and N is an integer greater than 1.
Optionally, in a possible implementation manner, the processing unit 802 is further configured to determine, according to the first similarity corresponding to each of the multiple first output results, M target data, of the multiple data, where the first similarity is greater than a second preset threshold.
Optionally, in a possible implementation manner, the processing unit 802 is further configured to determine the first similarity by calculating a relative entropy or a distance metric between each of the plurality of first output results and the one-hot tag.
Optionally, in one possible implementation, the distance metric comprises a mean squared error (MSE) distance or an L1 distance.
Optionally, in a possible implementation manner, the processing unit 802 is further configured to compress the network to be compressed by a distillation method, so as to obtain a target network.
Optionally, in a possible implementation manner, the obtaining unit 801 is further configured to obtain a student network; the processing unit 802 is further configured to input the at least one target data into the student network and the network to be compressed, respectively, to obtain a second output result of the student network and a third output result of the network to be compressed; the processing unit 802 is further configured to determine a loss function according to the second output result and the third output result; the processing unit 802 is further configured to train the student network according to the loss function until the loss function converges to obtain the target network.
Optionally, in a possible implementation manner, the processing unit 802 is further configured to determine a second similarity between the second output result and the third output result; the processing unit 802 is further configured to determine the loss function according to at least the second similarity.
Optionally, in a possible implementation manner, the processing unit 802 is further configured to: determining a fourth output result according to the second output result and the probability transition matrix; determining a one-hot label corresponding to the third output result; determining a third similarity between the one-hot labels corresponding to the fourth output result and the third output result; and determining the loss function according to the second similarity and the third similarity.
Optionally, in one possible implementation, the plurality of data includes image data, text data, video data, or voice data.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an execution device provided in the embodiment of the present application, and the execution device 900 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, and the like, which is not limited herein. The execution device 900 may be provided with the data processing apparatus described in the embodiment corresponding to fig. 8, and is configured to implement the data processing function in the embodiment corresponding to fig. 5. Specifically, the execution apparatus 900 includes: a receiver 901, a transmitter 902, a processor 903 and a memory 904 (where the number of processors 903 in the execution device 900 may be one or more, for example, one processor in fig. 9), where the processor 903 may include an application processor 9031 and a communication processor 9032. In some embodiments of the present application, the receiver 901, the transmitter 902, the processor 903, and the memory 904 may be connected by a bus or other means.
The memory 904 may include both read-only memory and random-access memory, and provides instructions and data to the processor 903. A portion of the memory 904 may also include non-volatile random access memory (NVRAM). The memory 904 stores operating instructions, executable modules, or data structures, or a subset thereof, or an expanded set thereof, where the operating instructions may include various operating instructions for performing various operations.
The processor 903 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application may be applied to the processor 903, or implemented by the processor 903. The processor 903 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 903. The processor 903 may be a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, and may further include an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 903 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 904, and the processor 903 reads information in the memory 904 and performs the steps of the above method in combination with hardware thereof.
The receiver 901 may be used to receive input numeric or character information and generate signal inputs related to performing relevant settings and function control of the device. The transmitter 902 may be configured to output numeric or character information through the first interface; the transmitter 902 is also operable to send instructions to the disk group via the first interface to modify data in the disk group; the transmitter 902 may also include a display device such as a display screen.
In this embodiment of the present application, in one case, the processor 903 is configured to execute the data processing method executed by the execution device in the embodiment corresponding to fig. 5.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored by the storage unit to cause the chip in the execution device to execute the data processing method described in the above embodiment, or to cause the chip in the training device to execute the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 10, fig. 10 is a schematic structural diagram of a chip provided in the embodiment of the present application, where the chip may be represented as a neural network processor NPU 1000, and the NPU 1000 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 1003, and the controller 1004 controls the arithmetic circuit 1003 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1003 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1003 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1002 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 1001 and performs matrix operation with the matrix B, and a partial result or a final result of the obtained matrix is stored in an accumulator (accumulator) 1008.
The unified memory 1006 is used for storing input data and output data. The weight data is directly transferred to the weight memory 1002 through a Direct Memory Access Controller (DMAC) 1005. The input data is also carried into the unified memory 1006 by the DMAC.
The BIU (Bus Interface Unit) 1013 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1009.
The bus interface unit 1013 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1009 to obtain instructions from an external memory, and is further used for the storage unit access controller 1005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1006 or to transfer weight data into the weight memory 1002 or to transfer input data into the input memory 1001.
The vector calculation unit 1007 includes a plurality of operation processing units and, if necessary, further processes the output of the operation circuit 1003, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 1007 can store the processed output vector to the unified memory 1006. For example, the vector calculation unit 1007 may apply a linear function or a non-linear function to the output of the arithmetic circuit 1003, such as performing linear interpolation on the feature planes extracted by the convolutional layers, or accumulating vectors of values to generate activation values. In some implementations, the vector calculation unit 1007 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1003, for example for use in subsequent layers in the neural network.
An instruction fetch buffer 1009 connected to the controller 1004, for storing instructions used by the controller 1004;
the unified memory 1006, the input memory 1001, the weight memory 1002, and the instruction fetch memory 1009 are On-Chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is preferable in more cases. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a training device or a data center, integrating one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Claims (25)

1. A data processing method, comprising:
acquiring a network to be compressed and a plurality of data, wherein the network to be compressed is a classified network;
inputting the plurality of data into the network to be compressed to obtain a plurality of first output results, wherein the plurality of first output results are in one-to-one correspondence with the plurality of data;
determining a one-hot tag corresponding to each first output result in the plurality of first output results;
respectively determining a first similarity between each first output result in the plurality of first output results and the one-hot label;
determining at least one target data in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results, wherein the at least one target data is used for compressing the network to be compressed.
2. The data processing method according to claim 1, wherein the one-hot tag is an n-dimensional tag, the n-dimensional tag comprises 1 tag value with a value of 1 and n-1 tag values with a value of 0, and n is an integer greater than 1.
3. The data processing method according to claim 1 or 2, wherein the determining at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results comprises:
and determining N target data with the maximum first similarity in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results, wherein N is a first preset threshold and is an integer greater than 1.
4. The data processing method according to claim 1 or 2, wherein the determining at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results comprises:
and determining M target data with the first similarity larger than a second preset threshold in the plurality of data according to the first similarity corresponding to each first output result in the plurality of first output results.
5. The data processing method according to any one of claims 1 to 4, wherein the separately determining a first similarity between each of the plurality of first output results and the one-hot tag comprises:
determining the first similarity by calculating a relative entropy or distance metric between each of the plurality of first output results and the one-hot tag.
6. The data processing method of claim 5, wherein the distance metric comprises a mean squared error (MSE) distance or an L1 distance.
7. The data processing method according to any one of claims 1 to 6, characterized in that the method further comprises:
and compressing the network to be compressed by a distillation method to obtain a target network.
8. The data processing method of claim 7, wherein compressing the network to be compressed by distillation to obtain a target network comprises:
acquiring a student network;
inputting the at least one target data into the student network and the network to be compressed respectively to obtain a second output result of the student network and a third output result of the network to be compressed;
determining a loss function according to the second output result and the third output result;
and training the student network according to the loss function until the loss function is converged to obtain the target network.
9. The data processing method of claim 8, wherein determining a loss function from the second output result and the third output result comprises:
determining a second similarity between the second output result and the third output result;
determining the loss function based at least on the second similarity.
10. The data processing method of claim 9, wherein the determining a loss function according to the second output result and the third output result further comprises:
determining a fourth output result according to the second output result and a probability transition matrix;
determining a one-hot label corresponding to the third output result;
determining a third similarity between the fourth output result and the one-hot label corresponding to the third output result;
and the determining the loss function based at least on the second similarity comprises:
determining the loss function according to the second similarity and the third similarity.
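Illustrative sketch (not part of the claims) of the combined loss in claim 10: the second similarity compares the student and teacher outputs directly, while the third similarity compares the fourth output result (the student probabilities mapped through a probability transition matrix) with the one-hot label of the teacher output. The MSE and cross-entropy terms, and the weighting factor alpha, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def combined_loss(second_output: torch.Tensor, third_output: torch.Tensor,
                  transition_matrix: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # second similarity: student output vs. teacher output
    second_term = F.mse_loss(second_output, third_output)

    # fourth output result: student probabilities mapped through the transition matrix
    student_probs = F.softmax(second_output, dim=-1)
    fourth_output = student_probs @ transition_matrix        # (batch, n) x (n, n)

    # one-hot label of the teacher output, and the third similarity (cross-entropy)
    teacher_label = third_output.argmax(dim=-1)
    third_term = F.nll_loss(fourth_output.clamp_min(1e-12).log(), teacher_label)

    return second_term + alpha * third_term
```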
11. The data processing method according to any one of claims 1 to 10, wherein the plurality of data includes image data, text data, video data, or voice data.
12. A data processing apparatus, comprising an acquisition unit and a processing unit;
the acquisition unit is configured to acquire a network to be compressed and a plurality of data, wherein the network to be compressed is a classification network;
the processing unit is configured to input the multiple data into the network to be compressed to obtain multiple first output results, where the multiple first output results are in one-to-one correspondence with the multiple data;
the processing unit is further configured to determine a one-hot label corresponding to each of the plurality of first output results;
the processing unit is further configured to determine a first similarity between each of the plurality of first output results and the one-hot label;
the processing unit is further configured to determine at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results, where the at least one target data is used to compress the network to be compressed.
13. The data processing apparatus according to claim 12, wherein the one-hot label is an n-dimensional label, the n-dimensional label comprises one label value equal to 1 and n-1 label values equal to 0, and n is an integer greater than 1.
14. The data processing apparatus according to claim 12 or 13, wherein the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, N target data with the largest first similarity among the plurality of data, where N is a first preset threshold and N is an integer greater than 1.
15. The data processing apparatus according to claim 12 or 13, wherein the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, M target data with a first similarity greater than a second preset threshold from among the plurality of data.
16. The data processing apparatus according to any of claims 12 to 15, wherein the processing unit is further configured to determine the first similarity by calculating a relative entropy or a distance metric between each of the plurality of first output results and the one-hot label.
17. The data processing apparatus of claim 16, wherein the distance metric comprises a mean square error (MSE) distance or an L1 distance.
18. The data processing apparatus of any of claims 12 to 17, wherein the processing unit is further configured to compress the network to be compressed by a distillation method to obtain a target network.
19. The data processing apparatus of claim 18,
the acquisition unit is further configured to acquire a student network;
the processing unit is further configured to input the at least one piece of target data into the student network and the network to be compressed respectively to obtain a second output result of the student network and a third output result of the network to be compressed;
the processing unit is further configured to determine a loss function according to the second output result and the third output result;
the processing unit is further configured to train the student network according to the loss function until the loss function converges, so as to obtain the target network.
20. The data processing apparatus of claim 19, wherein the processing unit is further configured to determine a second similarity between the second output result and the third output result;
the processing unit is further configured to determine the loss function at least according to the second similarity.
21. The data processing apparatus of claim 20, wherein the processing unit is further configured to:
determining a fourth output result according to the second output result and a probability transition matrix;
determining a one-hot label corresponding to the third output result;
determining a third similarity between the fourth output result and the one-hot label corresponding to the third output result;
and determining the loss function according to the second similarity and the third similarity.
22. The data processing apparatus according to any one of claims 12 to 21, wherein the plurality of data comprises image data, text data, video data or voice data.
23. A data processing apparatus, comprising a memory and a processor, wherein the memory stores code and the processor is configured to execute the code; when the code is executed, the data processing apparatus performs the method according to any one of claims 1 to 11.
24. A computer storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 11.
25. A computer program product having stored thereon instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 11.
CN202011381498.2A 2020-11-30 2020-11-30 Data processing method and related device Active CN112529149B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011381498.2A CN112529149B (en) 2020-11-30 2020-11-30 Data processing method and related device
PCT/CN2021/131686 WO2022111387A1 (en) 2020-11-30 2021-11-19 Data processing method and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011381498.2A CN112529149B (en) 2020-11-30 2020-11-30 Data processing method and related device

Publications (2)

Publication Number Publication Date
CN112529149A true CN112529149A (en) 2021-03-19
CN112529149B CN112529149B (en) 2024-05-24

Family

ID=74995643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011381498.2A Active CN112529149B (en) 2020-11-30 2020-11-30 Data processing method and related device

Country Status (2)

Country Link
CN (1) CN112529149B (en)
WO (1) WO2022111387A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6608924B2 (en) * 2001-12-05 2003-08-19 New Mexico Technical Research Foundation Neural network model for compressing/decompressing image/acoustic data files
CN110880036B (en) * 2019-11-20 2023-10-13 腾讯科技(深圳)有限公司 Neural network compression method, device, computer equipment and storage medium
CN111291860A (en) * 2020-01-13 2020-06-16 哈尔滨工程大学 Anomaly detection method based on convolutional neural network feature compression
CN112529149B (en) * 2020-11-30 2024-05-24 华为技术有限公司 Data processing method and related device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180219895A1 (en) * 2017-01-27 2018-08-02 Vectra Networks, Inc. Method and system for learning representations of network flow traffic
CN108846445A (en) * 2018-06-26 2018-11-20 清华大学 A kind of convolutional neural networks filter technology of prunning branches based on similarity-based learning
CN111091177A (en) * 2019-11-12 2020-05-01 腾讯科技(深圳)有限公司 Model compression method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022111387A1 (en) * 2020-11-30 2022-06-02 华为技术有限公司 Data processing method and related apparatus
CN113746870A (en) * 2021-11-05 2021-12-03 山东万网智能科技有限公司 Intelligent data transmission method and system for Internet of things equipment
CN114115282A (en) * 2021-11-30 2022-03-01 中国矿业大学 Unmanned device of mine auxiliary transportation robot and using method thereof
CN114115282B (en) * 2021-11-30 2024-01-19 中国矿业大学 Unmanned device of mine auxiliary transportation robot and application method thereof

Also Published As

Publication number Publication date
CN112529149B (en) 2024-05-24
WO2022111387A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
US20210012198A1 (en) Method for training deep neural network and apparatus
CN111797893B (en) Neural network training method, image classification system and related equipment
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN112529149B (en) Data processing method and related device
CN112070207A (en) Model training method and device
CN112784778B (en) Method, apparatus, device and medium for generating model and identifying age and sex
CN110222718B (en) Image processing method and device
CN112598597A (en) Training method of noise reduction model and related device
CN113240079A (en) Model training method and device
CN112580720A (en) Model training method and device
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
CN111950700A (en) Neural network optimization method and related equipment
CN114359289A (en) Image processing method and related device
CN115512005A (en) Data processing method and device
CN113656563A (en) Neural network searching method and related equipment
CN113361549A (en) Model updating method and related device
CN113869496A (en) Acquisition method of neural network, data processing method and related equipment
CN113536970A (en) Training method of video classification model and related device
CN113627163A (en) Attention model, feature extraction method and related device
CN115879508A (en) Data processing method and related device
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN113627421A (en) Image processing method, model training method and related equipment
WO2023045949A1 (en) Model training method and related device
CN117056589A (en) Article recommendation method and related equipment thereof
CN111695419A (en) Image data processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant