WO2022111387A1 - Data processing method and related apparatus - Google Patents

Data processing method and related apparatus

Info

Publication number
WO2022111387A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
network
output result
similarity
data processing
Prior art date
Application number
PCT/CN2021/131686
Other languages
English (en)
French (fr)
Inventor
陈汉亭
王云鹤
许春景
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022111387A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of computer technology, and in particular, to a data processing method and related apparatus.
  • a compression algorithm for deep neural networks can compress a teacher network model with large storage requirements and high computational complexity into a student network model with small storage requirements and low computational complexity, so that the student network can be applied to devices with low power consumption and low computing power.
  • the training data of the original neural network needs to be used.
  • the training data of the network to be compressed cannot be obtained, which makes it difficult to effectively compress the neural network.
  • the present application provides a data processing method: the acquired unlabeled data is input into the network to be compressed, a one-hot label is obtained for each output result, the similarity between each output result and its one-hot label is measured, and the unlabeled data whose output results have higher similarity is used as the data for compressing the to-be-compressed network.
  • With this method, a large amount of data similar to the original training data of the network to be compressed can be obtained, thereby ensuring that the network can be compressed effectively.
  • a first aspect of the present application provides a data processing method, comprising: a data processing device acquiring a to-be-compressed network and a plurality of data, where the to-be-compressed network is a classification network for classifying input data to obtain an output classification result;
  • the plurality of data may be image data, text data, video data, or voice data.
  • the network to be compressed may be uploaded by the user to the data processing apparatus, and the plurality of data may be unlabeled data obtained by the data processing apparatus accessing a specific gallery.
  • the data processing device sequentially inputs the plurality of data into the to-be-compressed network to obtain a first output result corresponding to each of the plurality of data.
  • the first output result may be an n-dimensional label, where n is the number of classification categories, and each label value in the n-dimensional label represents the probability of the category to which the data corresponding to the first output result belongs.
  • the data processing device determines a one-hot label corresponding to each of the plurality of first output results; the one-hot label is, for example, an n-dimensional label that includes one label value equal to 1 and n-1 label values equal to 0, where n is an integer greater than 1.
  • the data processing device determines the first similarity between each first output result and its one-hot label. Because the first similarity can be used to measure how similar the data obtained by the data processing device is to the original training data, the data processing apparatus can determine the target data from the plurality of data according to the first similarity corresponding to each first output result.
  • because each piece of data obtained by the data processing device has a corresponding first output result, and each first output result has a corresponding first similarity, each piece of data has a corresponding first similarity.
  • the higher the first similarity corresponding to a piece of data, the closer that data is to the original training data of the network to be compressed. Therefore, the data processing device can select the data with higher first similarity as the target data to realize the compression of the to-be-compressed network.
  • the unlabeled data is input into the network to be compressed, the one-hot label of each output result is obtained, and the similarity between the output result and the one-hot label is measured, so that the unlabeled data corresponding to output results with higher similarity is used as the data for compressing the to-be-compressed network.
  • determining, by the data processing apparatus, at least one target data among the plurality of data according to the first similarity corresponding to each of the plurality of first output results includes: determining, among the plurality of data, the N pieces of target data with the largest first similarity, where N is a first preset threshold and N is an integer greater than 1.
  • determining, by the data processing apparatus, at least one target data among the plurality of data according to the first similarity corresponding to each of the plurality of first output results includes: determining, among the plurality of data, M pieces of target data whose first similarity is greater than a second preset threshold.
  • determining, by the data processing device, the first similarity between each of the plurality of first output results and the one-hot label includes: determining the first similarity by calculating a relative entropy or a distance metric between each first output result and the one-hot label.
  • determining the first similarity by calculating the relative entropy or a distance metric between the first output result and the one-hot label makes the similarity computable and ensures the feasibility of the solution.
  • the distance metric includes the mean square error (MSE) distance or the L1 distance.
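  • The selection steps described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the use of NumPy, the direction of the relative entropy (one-hot label relative to the output), and the convention of negating the divergence so that a larger value means "more similar" are all assumptions made for the example.

```python
import numpy as np

def one_hot(probs):
    """One-hot label per row: a single 1 at the argmax, 0 elsewhere."""
    labels = np.zeros_like(probs, dtype=float)
    labels[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1.0
    return labels

def first_similarity(probs, metric="kl", eps=1e-12):
    """Similarity between each output row and its one-hot label.

    The value is the negated divergence/distance (an assumed convention),
    so a higher value means the output is closer to its one-hot label.
    """
    target = one_hot(probs)
    if metric == "kl":
        # KL(one_hot || probs) reduces to -log(p_max): only one target entry is nonzero.
        div = -np.log(np.clip((target * probs).sum(axis=1), eps, None))
    elif metric == "mse":
        div = ((probs - target) ** 2).mean(axis=1)
    elif metric == "l1":
        div = np.abs(probs - target).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return -div

def select_top_n(similarity, n):
    """Indices of the n pieces of data with the largest similarity."""
    return np.argsort(-similarity, kind="stable")[:n]

def select_above(similarity, threshold):
    """Indices of all data whose similarity exceeds the threshold."""
    return np.flatnonzero(similarity > threshold)
```

  • In this sketch, `select_top_n` corresponds to the "N pieces of target data with the largest first similarity" variant and `select_above` to the "greater than a second preset threshold" variant; `probs` is assumed to hold one softmax output per row.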
  • the method further includes: compressing the to-be-compressed network by a distillation method to obtain a target network.
  • compressing the to-be-compressed network by distillation to obtain the target network includes: the data processing device obtains a student network; the data processing device inputs the at least one target data into the student network and the to-be-compressed network respectively, to obtain a second output result of the student network and a third output result of the to-be-compressed network; the data processing device determines a loss function according to the second output result and the third output result; and the data processing device trains the student network according to the loss function until the loss function converges, to obtain the target network.
  • determining, by the data processing apparatus, a loss function according to the second output result and the third output result includes: determining a second similarity between the second output result and the third output result; and determining the loss function at least according to the second similarity.
  • determining the loss function according to the second output result and the third output result further includes: determining a fourth output result according to the second output result and a probability transition matrix; determining the one-hot label corresponding to the third output result; and determining a third similarity between the fourth output result and the one-hot label corresponding to the third output result.
  • determining the loss function then includes: determining the loss function according to the second similarity and the third similarity.
  • the prediction label of the teacher network is corrected by introducing a probability transition matrix, which can improve the effect of network compression and ensure the prediction accuracy of the compressed network when the training data is unlabeled data.
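  • A loss of the kind described above might be sketched as follows. The text only states that the loss is determined from the second and third similarities; the concrete choices here (relative entropy for the second term, cross-entropy for the third, a row-stochastic matrix `Q`, the weighting factor `lam`, and all function names) are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, Q, lam=1.0, eps=1e-12):
    p_s = softmax(student_logits)  # second output result (student network)
    p_t = softmax(teacher_logits)  # third output result (to-be-compressed network)

    # Second-similarity term: relative entropy between teacher and student outputs.
    kd = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=1).mean()

    # Fourth output result: student output passed through the transition matrix Q
    # (assumed row-stochastic, Q[i, j] ~ P(teacher label j | true class i)).
    p_4 = p_s @ Q

    # Third-similarity term: cross-entropy against the teacher's one-hot label.
    hard = p_t.argmax(axis=1)
    ce = -np.log(p_4[np.arange(len(hard)), hard] + eps).mean()

    return kd + lam * ce
```

  • With `Q` set to the identity matrix the third term reduces to ordinary cross-entropy against the teacher's hard prediction; in practice `Q` would be learned or estimated, which this sketch does not cover.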
  • the plurality of data includes image data, text data, video data or voice data.
  • a second aspect of the present application provides a data processing device, including an acquisition unit and a processing unit; the acquisition unit is configured to acquire a network to be compressed and a plurality of data, where the network to be compressed is a classification network; the processing unit is configured to input the plurality of data into the network to be compressed to obtain a plurality of first output results, where the plurality of first output results and the plurality of data are in one-to-one correspondence; the processing unit is further configured to determine the one-hot label corresponding to each of the plurality of first output results; the processing unit is further configured to determine the first similarity between each of the plurality of first output results and the one-hot label; the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, at least one target data from the plurality of data, where the at least one target data is used to compress the to-be-compressed network.
  • the one-hot label is an n-dimensional label that includes one label value equal to 1 and n-1 label values equal to 0, where n is an integer greater than 1.
  • the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, the N pieces of target data with the largest first similarity among the plurality of data, where N is a first preset threshold and N is an integer greater than 1.
  • the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, M pieces of target data among the plurality of data whose first similarity is greater than the second preset threshold.
  • the processing unit is further configured to determine the first similarity by calculating the relative entropy or a distance metric between each of the plurality of first output results and the one-hot label.
  • the distance metric includes the mean square error (MSE) distance or the L1 distance.
  • the processing unit is further configured to compress the to-be-compressed network by a distillation method to obtain a target network.
  • the acquisition unit is further configured to obtain a student network; the processing unit is further configured to input the at least one target data into the student network and the to-be-compressed network respectively, to obtain the second output result of the student network and the third output result of the to-be-compressed network; the processing unit is further configured to determine a loss function according to the second output result and the third output result; the processing unit is further configured to train the student network according to the loss function until the loss function converges, to obtain the target network.
  • the processing unit is further configured to determine a second similarity between the second output result and the third output result; the processing unit is further configured to determine the loss function at least according to the second similarity.
  • the processing unit is further configured to: determine a fourth output result according to the second output result and the probability transition matrix; determine the one-hot label corresponding to the third output result; determine the third similarity between the fourth output result and the one-hot label corresponding to the third output result; and determine the loss function according to the second similarity and the third similarity.
  • the plurality of data includes image data, text data, video data or voice data.
  • a third aspect of the present application provides a data processing apparatus, which may include a processor, the processor is coupled with a memory, the memory stores program instructions, and the method described in the first aspect is implemented when the program instructions stored in the memory are executed by the processor .
  • a fourth aspect of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it runs on a computer, causes the computer to execute the method described in the first aspect.
  • a fifth aspect of the present application provides a circuit system, the circuit system comprising a processing circuit configured to perform the method of the above-mentioned first aspect.
  • a sixth aspect of the present application provides a computer program that, when run on a computer, causes the computer to execute the method described in the first aspect.
  • a seventh aspect of the present application provides a chip system, where the chip system includes a processor for supporting a server or a threshold value obtaining device to implement the functions involved in the above aspects, for example, sending or processing the data involved in the above methods and/or information.
  • the chip system further includes a memory for storing necessary program instructions and data of the server or the communication device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of an artificial intelligence main frame provided by an embodiment of the present application.
  • FIG. 2a is an image processing system provided by an embodiment of the present application.
  • FIG. 2b is another image processing system provided by an embodiment of the present application.
  • FIG. 2c is a schematic diagram of a related device for image processing provided by an embodiment of the present application.
  • FIG. 3a is a schematic diagram of the architecture of a system 100 provided by an embodiment of the present application.
  • FIG. 3b is a schematic diagram of an image semantic segmentation provided by an embodiment of the present application.
  • FIG. 4a is a schematic diagram of the application of neural network compression in an actual scene.
  • FIG. 4b is a schematic flowchart of a network compression method provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a data processing method 500 provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of compressing a network to be compressed according to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a network compression provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 1 shows a schematic structural diagram of the main frame of artificial intelligence.
  • the above-mentioned artificial intelligence main framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data goes through the refinement process of "data-information-knowledge-wisdom".
  • the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
  • the infrastructure provides computing power support for artificial intelligence systems, realizes communication with the outside world, and is supported by the basic platform. Communication with the outside world takes place through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA); the basic platform includes the distributed computing framework and network-related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside world to obtain data, and this data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution and productize intelligent information decision-making to achieve practical deployment. Their application areas mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.
  • FIG. 2a is an image processing system provided by an embodiment of the present application, where the image processing system includes a user equipment and a data processing device.
  • the user equipment includes smart terminals such as mobile phones, personal computers, or information processing centers.
  • the user equipment is the initiator of image processing. As the initiator of the image enhancement request, the user usually initiates the request through the user equipment.
  • the above-mentioned data processing device may be a device or server with data processing functions, such as a cloud server, a network server, an application server, and a management server.
  • the data processing device receives the image enhancement request from the intelligent terminal through the interactive interface, and then performs image processing in the form of machine learning, deep learning, search, reasoning, and decision-making through the memory for storing data and the processor for data processing.
  • the memory in the data processing device may be a general term, including local storage and a database for storing historical data.
  • the database may be on the data processing device or on other network servers.
  • the user equipment can receive instructions from the user; for example, the user equipment can acquire an image input or selected by the user and then initiate a request to the data processing device, so that the data processing device performs an image enhancement processing application (such as image super-resolution reconstruction, image denoising, image dehazing, image deblurring, or image contrast enhancement) on the image obtained by the user equipment, thereby obtaining the corresponding processing result for the image.
  • the user equipment may acquire an image input by the user, and then initiate an image denoising request to the data processing device, so that the data processing device performs image denoising on the image, thereby obtaining a denoised image.
  • the data processing device may execute the image processing method of the embodiment of the present application.
  • Fig. 2b is another image processing system provided by the embodiment of the application.
  • the user equipment is directly used as a data processing device; the user equipment can directly obtain the input from the user and process it with the hardware of the user equipment itself. The specific process is similar to that of FIG. 2a; the above description can be referred to, and details are not repeated here.
  • the user equipment can receive instructions from the user; for example, the user equipment can acquire an image selected by the user in the user equipment, and then the user equipment can itself execute an image processing application (for example, image super-resolution reconstruction, image denoising, image dehazing, image deblurring, or image contrast enhancement) on the image, so as to obtain the corresponding processing result for the image.
  • the user equipment itself can execute the image processing method of the embodiment of the present application.
  • FIG. 2c is a schematic diagram of a related device for image processing provided by an embodiment of the present application.
  • the user equipment in the above-mentioned FIGS. 2a and 2b may specifically be the local device 301 or the local device 302 in FIG. 2c, and the data processing device in FIG. 2a may specifically be the execution device 210 in FIG. 2c. The data storage system 250 may store the data to be processed by the execution device 210, and may be integrated on the execution device 210 or set on the cloud or other network servers.
  • the processors in FIGS. 2a and 2b may perform data training/machine learning/deep learning through a neural network model or another model (e.g., a support vector machine-based model), and use the model finally obtained by training or learning on the data to execute an image processing application on the image, so as to obtain the corresponding processing results.
  • FIG. 3a is a schematic diagram of the architecture of a system 100 provided by an embodiment of the present application.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for performing data interaction with external devices.
  • the user may input data to the I/O interface 112 through the client device 140, and the input data may include: various tasks to be scheduled, callable resources, and other parameters in this embodiment of the present application.
  • the execution device 110 may call the data, code, etc. in the data storage system 150 for corresponding processing, and the data, code, etc. obtained by the corresponding processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing results to the client device 140 for provision to the user.
  • the training device 120 can generate corresponding target models/rules based on different training data for different goals or tasks, and the corresponding target models/rules can be used to achieve the above-mentioned goals or complete the above-mentioned tasks, thereby providing the user with the desired result.
  • the training data may be stored in the database 130 and come from training samples collected by the data collection device 160 .
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal that collects the input data fed into the I/O interface 112 and the output results returned by the I/O interface 112, as shown in the figure, as new sample data and stores them in the database 130.
  • the I/O interface 112 may also directly store the input data fed into the I/O interface 112 and the output results it returns, as shown in the figure, in the database 130 as new sample data.
  • FIG. 3a is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • for example, in FIG. 3a the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the neural network can be obtained by training according to the training device 120.
  • An embodiment of the present application also provides a chip, where the chip includes a neural network processor NPU.
  • the chip can be set in the execution device 110 as shown in FIG. 3 a to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 3a to complete the training work of the training device 120 and output the target model/rule.
  • the neural network processor NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the tasks are allocated by the main CPU.
  • the core part of the NPU is the arithmetic circuit, and the controller controls the arithmetic circuit to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit includes multiple processing units (process engines, PEs).
  • the arithmetic circuit is a two-dimensional systolic array.
  • the arithmetic circuit may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • the arithmetic circuit is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory, and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A from the input memory and performs matrix operation on matrix B, and stores the partial result or final result of the matrix in an accumulator.
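  • The accumulate-partial-results behaviour described above can be illustrated in software. The following is only a functional sketch, assuming NumPy and a hypothetical tile size; it models the accumulator holding partial sums of the product of A and B, not the systolic-array hardware or its timing.

```python
import numpy as np

def matmul_accumulate(A, B, tile=2):
    """Compute A @ B by summing partial products into an accumulator.

    B plays the role of the weight matrix held by the processing units,
    A is streamed in slices along the shared dimension, and `acc` holds
    the partial result after each slice and the final result at the end.
    """
    m, k = A.shape
    k2, n = B.shape
    if k != k2:
        raise ValueError("inner dimensions of A and B must match")
    acc = np.zeros((m, n), dtype=np.result_type(A, B))
    for start in range(0, k, tile):
        # Partial product over one slice of the shared dimension.
        acc += A[:, start:start + tile] @ B[start:start + tile, :]
    return acc
```

  • The tile size only changes how many partial results are accumulated; the final value equals the ordinary matrix product for any positive `tile`.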
  • the vector calculation unit can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit can be used for network computation of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.
  • the vector computation unit can store the processed output vector to a unified buffer.
  • the vector computing unit may apply a nonlinear function to the output of the arithmetic circuit, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to an operational circuit, such as for use in subsequent layers in a neural network.
  • Unified memory is used to store input data as well as output data.
  • the direct memory access controller (DMAC) transfers the input data in the external memory to the input memory and/or the unified memory, stores the weight data in the external memory into the weight memory, and stores the data in the unified memory into the external memory.
  • the bus interface unit (BIU) is used to realize the interaction between the main CPU, the DMAC and the instruction fetch buffer through the bus.
  • the instruction fetch buffer connected to the controller is used to store the instructions used by the controller.
  • the controller is used to invoke the instructions cached in the instruction fetch buffer to control the working process of the operation accelerator.
  • the unified memory, input memory, weight memory and instruction fetch buffer are all on-chip memories.
  • the external memory is the memory outside the NPU.
  • the external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.
  • a neural network can be composed of neural units. A neural unit can refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit can be: $f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
  • where s = 1, 2, ..., n, and n is a natural number greater than 1
  • $W_s$ is the weight of $x_s$
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
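  • As a minimal sketch of the neural unit described above, the following Python computes f applied to the weighted sum of the inputs plus the bias, with a sigmoid as f. The function names `sigmoid` and `neural_unit` and the sample values are illustrative assumptions, not taken from the application.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs, ws, b, f=sigmoid):
    """Output of one neural unit: f(sum_s W_s * x_s + b)."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    weighted_sum = sum(w * x for w, x in zip(ws, xs)) + b
    return f(weighted_sum)


# Illustrative inputs: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
out = neural_unit(xs=[1.0, 2.0], ws=[0.5, -0.25], b=0.0)
```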
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • the work of each layer in a neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. At the physical level, the work of each layer in the neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space of the matrix to the column space) through five operations on the input space (the set of input vectors). The five operations are: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by $W \cdot \vec{x}$, operation 4 is completed by $+b$, and operation 5 is realized by $a()$.
  • W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • the vector W determines the space transformation from the input space to the output space above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training the neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.
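  • The layer-wise description above (a weight W applied to the input, a translation by +b, and a nonlinearity a()) can be sketched as follows. This is only an illustration: the ReLU used for a() and the concrete numbers are assumptions chosen to keep the arithmetic easy to follow.

```python
def relu(z: float) -> float:
    """One possible 'bending' function a(); chosen here only for illustration."""
    return max(0.0, z)


def layer_forward(W, x, b, a):
    """Compute y = a(W . x + b) for one layer.

    W is a list of rows (one row per output dimension), x is the input
    vector, b is the bias vector, and a is applied element-wise.
    """
    return [
        a(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# A 3x2 weight matrix raises the dimension from 2 to 3 (operation 1),
# scales the second coordinate (operation 2), and b translates the result.
W = [[1.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]
x = [1.0, -1.0]
b = [0.0, 0.5, 0.0]
y = layer_forward(W, x, b, relu)
```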
  • the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a backward pass driven by the error loss, which aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
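  • To make the forward pass and the error-driven update concrete, here is a small sketch for a single sigmoid unit trained on one sample. The squared-error loss, the learning rate, and the sample are assumptions for illustration; the application does not specify them.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, target, lr):
    """One forward pass followed by one back-propagation update."""
    # Forward pass: compute the output and the error loss.
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2

    # Backward pass: chain rule through the loss and the sigmoid.
    dloss_dz = (y - target) * y * (1.0 - y)
    new_w = [w_i - lr * dloss_dz * x_i for w_i, x_i in zip(w, x)]
    new_b = b - lr * dloss_dz
    return new_w, new_b, loss


w, b = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=[1.0, 2.0], target=1.0, lr=0.5)
    losses.append(loss)
# The recorded losses shrink as the parameters are corrected step by step.
```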
  • Image enhancement refers to processing the brightness, color, contrast, saturation, dynamic range, etc. of an image to meet certain specific indicators. Simply put, in the process of image processing, by purposefully emphasizing the overall or local characteristics of the image, the original unclear image becomes clear or some interesting features are emphasized, and the difference between the features of different objects in the image is enlarged. It can improve the image quality and enrich the amount of image information, and can strengthen the image interpretation and recognition effect to meet the needs of some special analysis.
  • image enhancement may include, but is not limited to, image super-resolution reconstruction, image denoising, image dehazing, image deblurring, and image contrast enhancement.
  • Image semantic segmentation refers to subdividing an image into different categories according to certain rules (such as illumination or category). Put simply, the goal of image semantic segmentation is to assign a label to each pixel in the image, that is, to mark the object category to which each pixel belongs. These labels can include people, animals, cars, flowers, furniture, etc.
  • FIG. 3b is a schematic diagram of an image semantic segmentation provided by an embodiment of the present application. As shown in Figure 3b, through image semantic segmentation, the image can be divided into different sub-regions according to categories at the pixel level, such as sub-regions such as buildings, sky, and plants.
  • the neural network training method provided in the embodiment of the present application involves the processing of images, and can be specifically applied to data processing methods such as data training, machine learning, deep learning, etc., to symbolize and form the training data (such as the images in the present application).
  • the image processing method provided in the embodiment of the present application can use the above-mentioned trained image processing model: input data (such as the image to be processed in this application) is input into the trained image processing model to obtain output data (such as the target image in this application).
  • the training method of the image processing model and the image processing method provided in the embodiments of this application are inventions based on the same concept, and can also be understood as two parts of a system, or two stages of an overall process, such as a model training phase and a model application phase.
  • neural networks have been successfully applied to many practical tasks (such as image classification, object detection, text classification, and speech recognition tasks).
  • Generally, neural networks require huge computing resources to function properly.
  • the computing resources of these terminal devices are usually insufficient to support the operation of neural networks with complex structures.
  • a compression algorithm of the neural network is proposed.
  • by using the compression algorithm to compress a neural network with high computational complexity and large storage space requirements, a compressed neural network with low computational complexity and small storage space requirements can be obtained, so that the compressed neural network can run on terminal devices with limited computing power.
  • Existing compression algorithms usually use the neural network to be compressed as the teacher network, use a neural network with less computational complexity as the student network, and input the original training data into the teacher network and the student network respectively.
  • the teacher network provides the student network with effective supervision information to realize the training of the student network, so as to obtain the compressed neural network. Therefore, existing compression algorithms usually require the original training data of the network to be compressed in order to compress the network.
  • FIG. 4a is a schematic diagram of the application of neural network compression in an actual scene.
  • the user trains a neural network based on his own local data, and transmits the trained neural network to the public cloud, requiring the neural network to be compressed for application to mobile devices such as mobile phones.
  • these original training data for training the neural network are not available due to the protection of user's personal privacy, or because these original training data are too large to be transmitted to the cloud. That is to say, the original training data for training the neural network is usually not available on the public cloud.
  • the user purchases a trained neural network from a specific institution or company, and compresses the network on the public cloud for application to mobile devices such as mobile phones. Since the original training data of the neural network is a commercial secret, users usually cannot obtain the original training data of the neural network.
  • the embodiment of the present application provides a data processing method: the obtained unlabeled data is input into the network to be compressed, the one-hot label of each obtained output result is determined, the similarity between the output result and the one-hot label is measured, and the unlabeled data corresponding to the output results with higher similarity is used as the data for compressing the network to be compressed.
  • FIG. 4b is a schematic flowchart of a network compression method provided by an embodiment of the present application.
  • the user uploads the network to be compressed to the public cloud
  • the public cloud compresses the network to be compressed, and deploys the compressed network to the mobile device.
  • the process of compressing the network in the public cloud includes: the public cloud inputs unlabeled data into the network to be compressed, obtains the one-hot label of each obtained output result, measures the similarity between the output result and the one-hot label, and uses the unlabeled data corresponding to the output results with higher similarity as the target data.
  • the public cloud uses a distillation algorithm to compress the network to be compressed to obtain a compressed network.
  • FIG. 5 is a schematic flowchart of a data processing method 500 provided by an embodiment of the present application. As shown in FIG. 5 , the data processing method 500 includes the following steps.
  • Step 501: the data processing apparatus acquires a network to be compressed and a plurality of data, and the network to be compressed is a classification network.
  • the data processing device may be a device for compressing a neural network, or may be a device dedicated to acquiring training data required for compressing a neural network.
  • the data processing apparatus may be a server deployed on the cloud for acquiring training data required for compressing the neural network and compressing the neural network based on the acquired training data.
  • the network to be compressed is a classification network, which is used to classify the input data to obtain an output classification result.
  • the network to be compressed is T
  • the input data is x
  • the output result of the network to be compressed is $y_T$
  • $y_T$ is an n-dimensional label, where n is the number of classification categories.
  • the dimension with the largest value in the output result $y_T$ is the category of the data x judged by the network.
  • the output result $y_T$ is a 3-dimensional label
  • the first dimension of the 3-dimensional label indicates that the classification category is cat
  • the second dimension indicates that the classification category is dog
  • the third dimension indicates that the classification category is pig
  • the data processing apparatus can acquire the network to be compressed by acquiring data sent by other terminal devices.
  • when the data processing apparatus is a server deployed on the cloud, it can obtain the network to be compressed uploaded by the user by receiving data that the user sends from terminal equipment such as a personal computer or a notebook computer.
  • the plurality of data are data of the same type, for example, the plurality of data may be image data, text data, video data or voice data.
  • the network to be compressed is an image classification network
  • the multiple pieces of data are image data
  • the network to be compressed is used to classify images, such as classifying images into types such as dogs, cats, and fish according to the animals displayed in the images.
  • the network to be compressed is a text classification network
  • the multiple pieces of data are text data, and the network to be compressed is used to classify text, such as classifying the text into positive sentiment text or negative sentiment text.
  • the data processing apparatus may acquire the plurality of data in various ways. The following will take the plurality of data as image data as an example to introduce the manner in which the data processing apparatus acquires the plurality of data.
  • when the data processing apparatus is a server, a large amount of image data is usually stored on the server, for example image data uploaded by a large number of users.
  • the data processing apparatus can access a corresponding gallery on the web page by accessing a specific web page, thereby acquiring a large amount of data in the gallery as the above-mentioned multiple data.
  • the data processing apparatus may also capture image data on the network based on a web crawler to obtain the above-mentioned multiple data.
  • the data obtained by the data processing apparatus usually do not have classification labels, that is, these data are not classified and marked with corresponding labels. Since the data processing apparatus can obtain a large amount of unlabeled data, and some of these unlabeled data are similar to the original training data of the network to be compressed, the data processing apparatus can use the method of this embodiment to filter out the part of the data that is similar to the original training data of the network to be compressed, and then use this part of the data to compress the network to be compressed.
  • the data acquired by the data processing apparatus may also be coarsely classified; for example, the data processing apparatus may acquire animal images, home appliance images, plant images, and the like from different galleries. In that case, if the data processing apparatus can also obtain the classification category of the network to be compressed, it can preliminarily screen the obtained data to filter out data that cannot be the training data of the network to be compressed.
  • For example, when the data processing apparatus learns that the classification category of the network to be compressed is animal, that is, the network to be compressed classifies images of animals, the data processing apparatus can filter out images that are not animals in advance, such as household appliance images or plant images, to save computation.
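  • The preliminary screening described above can be sketched as a simple filter over coarsely tagged candidates. The record layout (`id`, `gallery`) and the tag names are hypothetical; the application does not define a data format.

```python
def prefilter_by_gallery(items, network_domain):
    """Keep only candidates whose coarse gallery tag matches the network's domain.

    `items` is a list of dicts with hypothetical keys "id" and "gallery".
    """
    return [item for item in items if item["gallery"] == network_domain]


candidates = [
    {"id": "img_001", "gallery": "animal"},
    {"id": "img_002", "gallery": "home_appliance"},
    {"id": "img_003", "gallery": "plant"},
    {"id": "img_004", "gallery": "animal"},
]

# The network to be compressed classifies animals, so only animal images
# are passed on to the more expensive similarity computation.
kept = prefilter_by_gallery(candidates, "animal")
```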
  • Step 502: the data processing apparatus inputs the plurality of data into the network to be compressed and obtains a plurality of first output results, and there is a one-to-one correspondence between the plurality of first output results and the plurality of data.
  • the data processing apparatus may sequentially input the multiple pieces of data into the network to be compressed to obtain a first output result corresponding to each of the multiple pieces of data.
  • the first output result may be an n-dimensional label, where n is the number of classification categories, and each label value in the n-dimensional label represents the probability of the category to which the data corresponding to the first output result belongs.
  • the first dimension of the first output result corresponding to data 1 indicates that the classification category is cat
  • the second dimension indicates that the classification category is dog
  • the third dimension indicates that the classification category is pig
  • the first output result is {0.3, 0.6, 0.1}
  • the probability that data 1 belongs to the dog category is 0.6
  • the probability that data 1 belongs to the pig category is 0.1.
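  • A short sketch of how the first output result for data 1 can be read: each dimension is paired with its category, and the dimension with the largest value gives the predicted category. The helper name `predicted_category` is an assumption for illustration.

```python
def predicted_category(output, categories):
    """Return the category with the largest label value and that value."""
    if len(output) != len(categories):
        raise ValueError("output and categories must have the same length")
    best = max(range(len(output)), key=lambda i: output[i])
    return categories[best], output[best]


categories = ["cat", "dog", "pig"]
first_output = [0.3, 0.6, 0.1]  # the example output for data 1
label, prob = predicted_category(first_output, categories)
```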
  • Step 503: the data processing apparatus determines a one-hot label corresponding to each of the multiple first output results.
  • the one-hot label is an n-dimensional label
  • the n-dimensional label includes 1 label value with a value of 1, and n-1 label values with a value of 0, and n is an integer greater than 1.
  • the data processing apparatus may determine the one-hot label corresponding to the first output result as follows: determine the dimension with the largest label value in the first output result, set the label value of that dimension to 1 and the label values of the other dimensions to 0, and thereby generate the one-hot label corresponding to the first output result.
  • for example, for the first output result {0.3, 0.6, 0.1}, the data processing apparatus may determine that the dimension with the largest label value is the second dimension (i.e., the dimension with the label value 0.6), so the data processing apparatus may generate the one-hot label {0, 1, 0} corresponding to the first output result.
  • a one-hot label is a label that includes 1 label value with a value of 1, and n-1 label values with a value of 0.
  • a one-hot label can also refer to a label that includes 1 label value with a value close to 1, and n-1 label values with a value close to 0.
  • the one-hot label can be {0.001, 0.997, 0.002}. This embodiment does not specifically limit the one-hot label.
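  • The one-hot construction in Step 503 can be sketched as below: find the dimension with the largest label value, set it to 1, and set every other dimension to 0. The function name `one_hot_from_output` is hypothetical, and ties are resolved by taking the first maximum, which is an assumption the text does not address.

```python
def one_hot_from_output(output):
    """Build the strict one-hot label for a network output.

    The dimension holding the largest value becomes 1; all others become 0.
    If several dimensions tie, the first one is used (an assumption).
    """
    if not output:
        raise ValueError("output must not be empty")
    best = max(range(len(output)), key=lambda i: output[i])
    return [1 if i == best else 0 for i in range(len(output))]


t = one_hot_from_output([0.3, 0.6, 0.1])
```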
  • Step 504: the data processing apparatus determines, for each of the plurality of first output results, the first similarity between that first output result and its one-hot label.
  • the data processing apparatus may calculate the first similarity between each first output result and its corresponding one-hot label.
  • the goal is to make the output of the classification network as identical as possible to the true labels of the training data.
  • the true labels of training data can usually be represented by one-hot labels. Therefore, for a trained classification network, the output of the classification network for its original training data will be very close to a one-hot label, that is, the output result is very similar to the one-hot label. For other data that are not original training data, since the classification network may not be able to accurately identify the data, the output results of the classification network for these data will not be very close to a one-hot label, that is, the similarity between the output results and the one-hot labels is not high.
  • the network to be compressed is obtained by training based on images related to dogs, cats and pigs
  • the original training data of the network to be compressed are images related to dogs, cats and pigs
  • the image data obtained by the data processing apparatus includes image 1 and image 2
  • image 1 is an animal image related to dogs
  • image 2 is an image of household appliances related to refrigerators. Input the image 1 and image 2 into the network to be compressed respectively.
  • since image 1 is similar to the original training data of the network to be compressed, the output result corresponding to image 1 can be {0.08, 0.91, 0.01};
  • since image 2 is quite different from the original training data of the network, it is difficult for the network to be compressed to effectively identify image 2, and the output result corresponding to image 2 can be {0.3, 0.3, 0.4}. It can be seen that the closer the output of an image is to a one-hot label, the closer the image is to the original training data.
  • the data processing apparatus determining the first similarity between each of the multiple first output results and its one-hot label may specifically include: the data processing apparatus calculates the relative entropy or a distance metric between each first output result and the one-hot label corresponding to that first output result, to determine the first similarity between the first output result and the one-hot label.
  • Relative entropy, also known as Kullback-Leibler divergence (KL divergence) or information divergence, is an asymmetric measure of the difference between two probability distributions.
  • $D_{KL}(y_T, t)$ represents the KL divergence between the first output result and the one-hot label
  • log( ) represents the logarithm. The smaller the KL divergence, the closer the first output result $y_T$ is to its corresponding one-hot label, that is, the greater the similarity between the two.
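  • The following sketch computes a KL divergence between an output result and its one-hot label for the two example images. The text here does not reproduce formula 1, so the direction used below (summing t_i * log(t_i / y_i), with 0 * log 0 treated as 0) and the `eps` clamp are assumptions; with a strict one-hot label this reduces to the negative log of the largest output value.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), with 0 * log(0) taken as 0.

    q_i is clamped to eps to avoid division by zero (an implementation choice).
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def one_hot_from_output(output):
    best = max(range(len(output)), key=lambda i: output[i])
    return [1.0 if i == best else 0.0 for i in range(len(output))]


image_1 = [0.08, 0.91, 0.01]  # close to a one-hot label
image_2 = [0.3, 0.3, 0.4]     # far from a one-hot label

kl_1 = kl_divergence(one_hot_from_output(image_1), image_1)
kl_2 = kl_divergence(one_hot_from_output(image_2), image_2)
# kl_1 is smaller than kl_2, so image 1 is the more similar of the two.
```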
  • a distance metric can also be called a similarity measure; by calculating the distance metric between two multidimensional data, the similarity between the two can be determined.
  • the distance metric may include a mean squared error (MSE) distance, an L1 distance, and the like.
  • the MSE distance refers to the expected value of the square of the difference between the estimated parameter value and the true value of the parameter, which can be used to evaluate the degree of change in the data.
  • the MSE distance between the first output result and the one-hot label can be shown in formula 2:
  • MSE($y_T$, t) represents the MSE distance between the first output result and the one-hot label.
  • the L1 distance, also known as the Manhattan distance, represents the sum of the absolute axis distances between two points in a standard coordinate system.
  • the first output result is $y_T$
  • the one-hot label corresponding to the first output result is t
  • the L1 distance between the first output result and the one-hot label can be shown in formula 3:
  • L1($y_T$, t) represents the L1 distance between the first output result and the one-hot label.
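  • As a sketch of the two distance metrics named above, the code below computes an MSE distance and an L1 distance between an output result and its one-hot label. Formulas 2 and 3 are not reproduced in the text here, so the exact forms are assumptions: the MSE is taken as the mean of squared per-dimension differences, and the L1 distance as the sum of absolute per-dimension differences.

```python
def mse_distance(y, t):
    """Mean of squared differences between corresponding dimensions (assumed form)."""
    return sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, t)) / len(y)


def l1_distance(y, t):
    """Sum of absolute differences between corresponding dimensions."""
    return sum(abs(y_i - t_i) for y_i, t_i in zip(y, t))


# Example images from the text, paired with their one-hot labels.
image_1, t_1 = [0.08, 0.91, 0.01], [0, 1, 0]
image_2, t_2 = [0.3, 0.3, 0.4], [0, 0, 1]

mse_1, mse_2 = mse_distance(image_1, t_1), mse_distance(image_2, t_2)
l1_1, l1_2 = l1_distance(image_1, t_1), l1_distance(image_2, t_2)
# Under both metrics image 1 is closer to its one-hot label than image 2.
```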
  • Step 505: the data processing apparatus determines at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results, and the at least one target data is used to compress the network to be compressed.
  • the data processing apparatus can identify the target data.
  • since each data obtained by the data processing apparatus has a corresponding first output result, and each first output result has a corresponding first similarity, each data obtained by the data processing apparatus has a corresponding first similarity.
  • the data processing apparatus can therefore select the data with a higher first similarity as the target data to realize the compression of the network to be compressed.
  • the data processing apparatus determining at least one target data in the plurality of data according to the first similarity corresponding to each of the plurality of first output results may specifically include: the data processing apparatus determines, according to the first similarity corresponding to each first output result, the N target data with the largest first similarity among the plurality of data, where N is the first preset threshold and N is an integer greater than 1.
  • the data processing apparatus may obtain the first preset threshold N in advance; the first preset threshold N may be preset in the data processing apparatus, or received by the data processing apparatus from other network devices in advance. Then, the data processing apparatus selects N target data from the plurality of data according to the first preset threshold N and in descending order of the first similarity; these target data are the N data with the largest first similarity among the plurality of data.
  • in practice, the data processing apparatus may, according to the first preset threshold N, select N target data from the plurality of data in ascending order of the KL divergence or the distance metric; these target data are the N data with the smallest KL divergence or distance metric among the plurality of data.
  • the value of N may be determined according to the actual computing capability of the data processing device and the compression precision of the network to be compressed.
  • the value range of N may be in the range of tens of thousands to hundreds of thousands.
  • the data processing apparatus may determine 100,000 pieces of data with the largest first similarity among the 1,000,000 pieces of data as the target data.
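  • The top-N selection can be sketched in two equivalent ways: ranking by the first similarity in descending order, or ranking by a divergence or distance in ascending order. The function names and the toy scores are assumptions for illustration; for large candidate sets `heapq.nsmallest` avoids sorting the whole list.

```python
import heapq


def select_top_n_by_similarity(data, similarities, n):
    """Return the n items with the largest first similarity."""
    ranked = sorted(range(len(data)), key=lambda i: similarities[i], reverse=True)
    return [data[i] for i in ranked[:n]]


def select_top_n_by_divergence(data, divergences, n):
    """Return the n items with the smallest KL divergence or distance."""
    indices = heapq.nsmallest(n, range(len(data)), key=lambda i: divergences[i])
    return [data[i] for i in indices]


data = ["a", "b", "c", "d"]
similarities = [0.2, 0.9, 0.5, 0.7]   # larger means more similar
divergences = [1.6, 0.09, 0.7, 0.3]   # smaller means more similar

by_similarity = select_top_n_by_similarity(data, similarities, n=2)
by_divergence = select_top_n_by_divergence(data, divergences, n=2)
```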
  • alternatively, the data processing apparatus determining at least one target data in the plurality of data according to the first similarity corresponding to each first output result may specifically include: the data processing apparatus determines, from the plurality of data, M pieces of target data whose first similarity is greater than a second preset threshold.
  • the second preset threshold may also be pre-acquired by the data processing apparatus, for example, the second preset threshold may be preset in the data processing apparatus, or received in advance by the data processing apparatus from other network devices.
  • the data processing apparatus can compare whether the first similarity corresponding to the data is greater than the second preset threshold, and if the first similarity corresponding to the data is greater than the second preset threshold, then This data can be determined as target data.
  • the value of the second preset threshold may also be determined according to the actual computing capability of the data processing device and the compression precision of the network to be compressed, which is not specifically limited in this embodiment.
  • the data processing apparatus may determine the corresponding first similarity by taking the reciprocal of the KL divergence or the reciprocal of the distance metric, and then determine M pieces of target data from the plurality of data according to the first similarity of each data.
  • the value of N is fixed, that is, the value of N is the first preset threshold
  • the value of M is not fixed; instead, it is determined by the first similarity corresponding to each of the plurality of data. If more of the plurality of data have a first similarity greater than the second preset threshold, M is larger; if fewer of the plurality of data have a first similarity greater than the second preset threshold, M is smaller.
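  • A sketch of the threshold-based variant: the first similarity is taken as the reciprocal of the divergence, and every item whose similarity exceeds the second preset threshold is kept, so the count M depends on the data. The `eps` guard against division by zero, the threshold value, and the names are assumptions.

```python
def similarity_from_divergence(divergence, eps=1e-12):
    """First similarity as the reciprocal of a KL divergence or distance."""
    return 1.0 / (divergence + eps)


def select_above_threshold(data, similarities, threshold):
    """Return every item whose first similarity is greater than the threshold."""
    return [item for item, s in zip(data, similarities) if s > threshold]


data = ["image_1", "image_2", "image_3"]
divergences = [0.094, 0.916, 0.25]
similarities = [similarity_from_divergence(d) for d in divergences]

targets = select_above_threshold(data, similarities, threshold=3.0)
M = len(targets)  # M is not fixed in advance; it follows from the scores
```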
  • in this embodiment, unlabeled data is input into the network to be compressed, the one-hot label of each obtained output result is determined, and the similarity between the output result and the one-hot label is measured, so that the unlabeled data corresponding to output results with high similarity is used as the data for compressing the network to be compressed.
  • the data processing apparatus may further compress the network to be compressed by the distillation method to obtain the target network.
  • the distillation method aims to extract useful information and knowledge from the teacher network as a guide in the training process of the student network, so as to realize the training of the student network.
  • the student network can obtain better performance than training alone. That is to say, a student network with high performance, low computational complexity and low storage consumption can be obtained by distillation.
  • to compress the network to be compressed by the distillation method, the data processing apparatus obtains a student network with low computational complexity in advance, uses the network to be compressed as the teacher network, then trains the student network based on the obtained target data while extracting useful information from the teacher network to guide the training of the student network, and finally obtains the trained target network.
  • FIG. 6 is a schematic flowchart of compressing a to-be-compressed network according to an embodiment of the present application.
  • the process of compressing the network to be compressed by the data processing apparatus by distillation may include the following steps.
  • Step 601: the data processing apparatus acquires the student network.
  • the student network may be a constructed neural network, which can be used to implement data classification, such as a deep neural network.
  • the data processing apparatus may obtain the student network in a number of ways.
  • one or more pre-built student networks may be preset in the data processing apparatus; the one or more student networks may be constructed by specific personnel and pre-installed in the data processing apparatus.
  • Different student networks can have different computational complexity and storage space requirements.
  • the data processing device can determine the student network that can meet the compression requirements according to the compression requirements of the network to be compressed, such as the size of the storage space occupied by compression, the computational complexity after compression and other indicators.
  • the user may simultaneously upload the network to be compressed and the student network to the data processing apparatus, and the data processing apparatus may obtain the student network by acquiring the data uploaded by the user.
  • the data processing apparatus may also automatically construct a student network that can meet the compression requirement according to the compression requirement of the user after acquiring the compression requirement of the user. For example, when the compression requirement of the user is that the storage space occupied by the compressed network is less than 1 gigabyte (Gigabyte, GB), the data processing apparatus may construct a student network with the storage space requirement lower than 1 GB based on the compression requirement.
  • Step 602: the data processing apparatus inputs the at least one target data into the student network and the network to be compressed respectively, and obtains the second output result of the student network and the third output result of the network to be compressed.
  • the data processing apparatus may train the student network based on the at least one target data. Specifically, the data processing apparatus may input one target data of the at least one target data into the student network and the network to be compressed respectively, and obtain the second output result corresponding to the target data in the student network and the third output result corresponding to the target data in the network to be compressed.
  • Step 603: the data processing apparatus determines a loss function according to the second output result and the third output result.
  • the loss function of the student network used for training may be determined based on the second output result corresponding to the student network and the third output result corresponding to the network to be compressed.
  • the loss function of the student network can be composed of two terms: one is the similarity between the output of the student network and the real label of the input data, and the other is the similarity between the output of the student network and the output of the teacher network. Exemplarily, when the similarity is represented by KL divergence, the loss function of the student network can be shown in formula 4:
  • loss represents the loss function
  • y S represents the output of the student network
  • y represents the true label of the input data
  • D KL (y S , y) represents the KL divergence between the output of the student network and the true label of the input data
  • y T represents the output of the teacher network
  • D KL (y S , y T ) represents the KL divergence between the output of the teacher network and the output of the student network.
  • the data processing apparatus may determine the loss function by: the data processing apparatus determines the second similarity between the second output result and the third output result, and determines the loss function at least according to the second similarity.
  • A loss function provided in this embodiment may be shown in Equation 5:
  • $$\mathrm{loss} = D_{KL}(y_S, y_T) \tag{5}$$
  • loss represents the loss function
  • $y_S$ represents the output result of the student network
  • $y_T$ represents the output result of the teacher network
  • $D_{KL}(y_S, y_T)$ represents the KL divergence between the output result of the teacher network and the output result of the student network.
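The single-term loss in Equation 5 can be sketched in a few lines. This is a minimal illustration, not the source's implementation: the source writes D_KL(y_S, y_T) without stating which argument is the reference distribution, so treating the teacher output as the reference is an assumption, as are the helper names `softmax`, `kl_divergence`, and `distillation_loss`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits):
    """Equation-5-style loss: KL divergence between teacher and student outputs.

    Assumption: the teacher output y_T is the reference distribution.
    """
    y_s = softmax(student_logits)   # second output result
    y_t = softmax(teacher_logits)   # third output result
    return kl_divergence(y_t, y_s)


same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distillation_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
```

The loss is zero when the two networks agree exactly and grows as the student's distribution moves away from the teacher's.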
  • Step 604 the data processing device trains the student network according to the loss function until the loss function converges, and the target network is obtained.
  • the data processing apparatus may train the student network based on the loss function until the loss function converges, so as to obtain a trained student network, that is, the compressed target network corresponding to the network to be compressed.
  • the process that the data processing device trains the student network based on the loss function may be as follows: the data processing device inputs one target data among multiple target data into the student network and the network to be compressed, and calculates the loss based on the output results of the two networks.
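The training procedure of steps 602 to 604 can be sketched with a toy example. Everything concrete here is an assumption made for illustration: both networks are small linear classifiers, the update rule is plain stochastic gradient descent, and "convergence" is taken to mean that the mean loss changes by less than `tol` between epochs. The source does not specify the optimizer or the convergence test.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """D_KL(p || q), skipping terms where p_i == 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def linear(weights, x):
    """Logits of a linear classifier with one weight row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def train_student(teacher_w, data, n_classes, lr=0.5, tol=1e-6, max_epochs=2000):
    """Train a linear student to match a fixed teacher on unlabeled data."""
    dim = len(data[0])
    student_w = [[0.0] * dim for _ in range(n_classes)]
    history = []
    for _ in range(max_epochs):
        total = 0.0
        for x in data:
            y_t = softmax(linear(teacher_w, x))   # third output result (teacher)
            y_s = softmax(linear(student_w, x))   # second output result (student)
            total += kl(y_t, y_s)
            # For softmax outputs, d KL(y_t || y_s) / d logit_k = y_s[k] - y_t[k].
            for k in range(n_classes):
                grad = y_s[k] - y_t[k]
                for j in range(dim):
                    student_w[k][j] -= lr * grad * x[j]
        history.append(total / len(data))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break                                  # loss has stopped changing
    return student_w, history


teacher = [[2.0, -1.0], [-1.0, 2.0]]               # hypothetical teacher weights
unlabeled = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
student, losses = train_student(teacher, unlabeled, n_classes=2)
```

No labels appear anywhere in the loop: the teacher's output is the only training signal, which is the situation the embodiment addresses.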
  • Since the target data obtained by the data processing device has no true label, the data processing device cannot obtain the similarity between the output result of the student network and the true label of the input data. That is, the data processing device cannot use that similarity to correct erroneous output results of the teacher network.
  • Based on this, in this embodiment the loss function is adjusted and a probability transition matrix is introduced to correct the wrong output results of the teacher network.
  • Determining the loss function according to the second output result and the third output result may further include the following. The data processing apparatus determines a fourth output result according to the second output result and the probability transition matrix; that is, the data processing apparatus multiplies the second output result by the probability transition matrix, correcting the second output result to obtain the fourth output result. The data processing apparatus determines the one-hot label corresponding to the third output result (that is, the output result of the teacher network); this one-hot label is the label predicted by the teacher network. The data processing apparatus then determines a third similarity between the fourth output result and the one-hot label corresponding to the third output result, and determines the loss function according to the second similarity and the third similarity.
  • Exemplarily, the loss function can be shown in Equation 6:
  • $$\mathrm{loss} = D_{KL}(y_S, y_T) + D_{KL}(Q(y_S), t) \tag{6}$$
  • loss is the loss function
  • $Q$ is the probability transition matrix
  • $y_S$ is the second output result (that is, the output result of the student network)
  • $Q(y_S)$ is the fourth output result (that is, the result of multiplying the output result of the student network by the probability transition matrix)
  • $y_T$ is the third output result (that is, the output result of the teacher network)
  • $t$ is the one-hot label corresponding to the third output result, that is, the label predicted by the teacher network
  • $D_{KL}(\cdot)$ denotes the KL divergence.
  • Compared with Equation 5, the loss function shown in Equation 6 introduces a new KL divergence term: the KL divergence between the product of the student network's output and the probability transition matrix, and the label predicted by the teacher network. Since the label predicted by the teacher network may be wrong, this embodiment introduces the probability transition matrix Q to account for the errors in the teacher's predicted labels, so that the output result of the student network itself can be the correct label. In other words, passing the correct output result of the student network through the noise transition matrix yields the erroneous label t predicted by the teacher network. During the training of the student network, the probability transition matrix Q can be trained together with the student network.
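The two-term loss built from the second and third similarities can be sketched as follows. Several points are assumptions rather than statements from the source: "multiplying the second output result by the probability transition matrix" is read as a row vector times a row-stochastic matrix, the teacher output and the one-hot label are used as the reference distributions of the two KL terms, and the function names are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """D_KL(p || q), skipping terms where p_i == 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def one_hot_of(probs):
    """One-hot label of a probability vector: 1 at the arg-max class, 0 elsewhere."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


def apply_transition(y_s, Q):
    """Row vector times matrix: (y_s Q)_j = sum_i y_s[i] * Q[i][j]."""
    n = len(Q)
    return [sum(y_s[i] * Q[i][j] for i in range(n)) for j in range(n)]


def corrected_loss(y_s, y_t, Q):
    """Second similarity plus third similarity, both measured with KL divergence."""
    t = one_hot_of(y_t)                    # label predicted by the teacher
    fourth = apply_transition(y_s, Q)      # fourth output result Q(y_S)
    second_similarity = kl(y_t, y_s)       # student output vs teacher output
    third_similarity = kl(t, fourth)       # corrected student output vs teacher label
    return second_similarity + third_similarity


identity = [[1.0, 0.0], [0.0, 1.0]]
loss_identity = corrected_loss([0.7, 0.3], [0.7, 0.3], identity)
```

With an identity matrix the second term reduces to the cross-entropy between the student output and the teacher's hard label; a learned Q lets the student disagree with that label without being penalized as heavily.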
  • the data processing device may simultaneously train the probability transition matrix based on the loss function. That is, the probability transition matrix is not fixed.
  • the data processing device can also adjust the probability transition matrix. By introducing the probability transition matrix to correct the predicted labels of the teacher network, it can improve the effect of network compression and ensure the prediction accuracy of the compressed network when the training data is unlabeled data.
  • the probability transition matrix may be an n*n matrix, and the sum of the elements of each row in the probability transition matrix is 1.
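  • Exemplarily, a possible probability transition matrix can be shown in Equation 7:
  • $$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \tag{7}$$
  • $A$ represents a probability transition matrix
  • $a_{11}$, $a_{1n}$, $a_{n1}$, and $a_{nn}$ are elements in the probability transition matrix
  • $a_{11} + a_{12} + \dots + a_{1n} = 1$.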
  • the probability transition matrix Q is (0.48, 0.52).
  • the result Q(y S ) of multiplying the second output result y S and the probability transition matrix Q is (0.585, 0.415 ).
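The source requires every row of the probability transition matrix to sum to 1 and says the matrix is trained with the student network, but it does not say how the row constraint is maintained during training. One common way, shown here purely as an assumption, is to keep unconstrained parameters and obtain the matrix by applying a softmax to each row; the function names and the near-identity initialization are hypothetical.

```python
import math


def row_softmax(params):
    """Turn an n*n matrix of free parameters into a row-stochastic matrix."""
    rows = []
    for row in params:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def init_params(n, diag=3.0):
    """Free parameters whose softmax is close to the identity matrix."""
    return [[diag if i == j else 0.0 for j in range(n)] for i in range(n)]


Q = row_softmax(init_params(3))
row_sums = [sum(row) for row in Q]
```

Because the softmax output always sums to 1, gradient updates to the free parameters can never produce an invalid transition matrix.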
  • This embodiment takes a deep neural network (Deep Neural Networks, DNN) used for image classification as an example to introduce the process of compressing the deep neural network.
  • the user has trained a deep neural network using some pictures taken or created by himself, and uploaded it to the server, requesting to compress the deep neural network.
  • The server can learn the specific structure of the deep neural network. However, because the training data of the deep neural network consists of pictures taken or created by the user, the user is unwilling to upload it to the server; that is, the server cannot obtain the original training data of the deep neural network.
  • this embodiment takes the CIFAR data set as an example to show the compression effect of the network compression method proposed in this embodiment on a Residual Network (Residual Network, ResNet).
  • The CIFAR dataset is a collection of small labeled images collected and organized by developers; CIFAR-10, which is used below, contains 60,000 images.
  • the ResNet-34 network can be used as the user-uploaded network to be compressed, the ImageNet dataset as the unlabeled dataset on the server, and the ResNet-18 network as the student network to be trained.
  • the ImageNet project is a large-scale visualization database for visual object recognition software research, and the ImageNet dataset can be part or all of the image data in the database.
  • the process of network compression can include the following steps:
  • Step S1: The server trains the ResNet-34 network structure with the CIFAR-10 dataset as training data to obtain a trained network.
  • Step S1 simulates the process in which the user trains the network on the user's own training set.
  • Step S2: The server can use the ImageNet dataset on the cloud as the unlabeled dataset and filter it with the above method 500 to obtain the target dataset. Specifically, the server can input the ImageNet images into the trained ResNet-34 network, calculate the KL divergence between the output result for each image and the one-hot label of that output result, and select the 500,000 images with the smallest KL divergence as the training set.
  • Step S3: The server can initialize the noise transition matrix Q and the student network, and compress the ResNet-34 network based on the above method 600 and the training set to obtain a compressed network.
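The filtering in step S2 can be sketched as follows. The sketch assumes the divergence is taken from the one-hot label to the network output, in which case it reduces to the negative log of the largest predicted probability (the reverse direction would be infinite whenever the output is not itself one-hot). The function names and the tiny example outputs are hypothetical.

```python
import math


def one_hot_kl(probs, eps=1e-12):
    """KL divergence from the one-hot label of `probs` to `probs`.

    With t = one-hot(argmax probs), sum_i t_i * log(t_i / p_i) = -log(max(probs)).
    """
    return -math.log(max(max(probs), eps))


def select_targets(outputs, n):
    """Indices of the n outputs whose divergence from their one-hot label is smallest."""
    ranked = sorted(range(len(outputs)), key=lambda i: one_hot_kl(outputs[i]))
    return ranked[:n]


# Hypothetical teacher outputs for four unlabeled images.
outputs = [[0.4, 0.6], [0.95, 0.05], [0.5, 0.5], [0.1, 0.9]]
chosen = select_targets(outputs, 2)
```

Images on which the teacher is confident are kept, on the premise stated in the source that such images are closer to the teacher's original training data.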
  • the network to be compressed is an uncompressed pre-training model, and its accuracy rate is 94.85%.
  • the accuracy of the obtained network is 94.34%.
  • the accuracy of the network compressed by the traditional distillation algorithm is 93.55%.
  • the accuracy rate is 94.02%. It can be seen that the method provided by this solution can not only solve the problem that the network cannot be compressed in the absence of original training data in the related art, but also can ensure that the accuracy of the compressed network remains at a high level.
  • FIG. 7 is a schematic flowchart of a network compression provided by an embodiment of the present application.
  • the user obtains the network to be compressed, that is, the teacher network, through the training of the original training data, and the original training data is not available.
  • the server selects and obtains unlabeled data that can be used to compress the teacher network according to the data processing method of this scheme.
  • the server implements the training of the student network by inputting unlabeled data into the teacher network and the student network, and based on the distillation algorithm.
  • the wrong results output by the teacher network can be corrected by the distillation algorithm in this scheme, so that the student network can output correct prediction results.
  • For example, the teacher network classifies a "panda" as a "spaceship" and a "fox" as a "dog", and these wrong results are corrected in the student network.
  • FIG. 8 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • A data processing apparatus provided by an embodiment of the present application includes an obtaining unit 801 and a processing unit 802. The obtaining unit 801 is configured to obtain a network to be compressed and a plurality of data, the network to be compressed being a classification network. The processing unit 802 is configured to input the plurality of data into the network to be compressed to obtain a plurality of first output results, the plurality of first output results being in one-to-one correspondence with the plurality of data. The processing unit 802 is further configured to: determine the one-hot label corresponding to each of the plurality of first output results; determine the first similarity between each of the plurality of first output results and its one-hot label; and determine, according to the first similarity corresponding to each of the plurality of first output results, at least one target data among the plurality of data, the at least one target data being used to compress the network to be compressed.
  • The one-hot label is an n-dimensional label that includes one label value of 1 and n-1 label values of 0, where n is an integer greater than 1.
  • The processing unit 802 is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, the N target data with the largest first similarity among the plurality of data, where N is a first preset threshold and an integer greater than 1.
  • The processing unit 802 is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, the M target data whose first similarity is greater than a second preset threshold among the plurality of data.
  • The processing unit 802 is further configured to determine the first similarity by calculating the relative entropy or a distance metric between each of the plurality of first output results and its one-hot label.
  • The distance metric includes the mean square error (MSE) distance or the L1 distance.
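The distance-based variants can be sketched as below. Because a smaller distance means a higher similarity, the sketch interprets "first similarity greater than a preset threshold" as "distance below a bound"; that interpretation, the function names, and the example values are assumptions for illustration.

```python
def one_hot(probs):
    """n-dimensional label with a single 1 at the arg-max class and 0 elsewhere."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


def mse_distance(p, q):
    """Mean square error between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def l1_distance(p, q):
    """Sum of absolute differences between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def select_within(outputs, max_distance, metric=l1_distance):
    """Indices of outputs whose distance to their own one-hot label is below the bound."""
    return [i for i, p in enumerate(outputs) if metric(p, one_hot(p)) < max_distance]


outputs = [[0.9, 0.1], [0.6, 0.4], [0.05, 0.95]]
kept = select_within(outputs, 0.3)
```

Unlike the top-N rule, the number of selected items M here depends on the data rather than being fixed in advance.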
  • the processing unit 802 is further configured to compress the to-be-compressed network by distillation to obtain the target network.
  • The obtaining unit 801 is further configured to obtain a student network. The processing unit 802 is further configured to: input the at least one target data into the student network and the network to be compressed respectively, to obtain the second output result of the student network and the third output result of the network to be compressed; determine a loss function according to the second output result and the third output result; and train the student network according to the loss function until the loss function converges, to obtain the target network.
  • The processing unit 802 is further configured to determine a second similarity between the second output result and the third output result, and to determine the loss function at least according to the second similarity.
  • The processing unit 802 is further configured to: determine a fourth output result according to the second output result and the probability transition matrix; determine the one-hot label corresponding to the third output result; determine a third similarity between the fourth output result and the one-hot label corresponding to the third output result; and determine the loss function according to the second similarity and the third similarity.
  • the plurality of data includes image data, text data, video data or voice data.
  • FIG. 9 is a schematic structural diagram of the execution device provided by an embodiment of the present application. The execution device may be, for example, a smart wearable device or a server, which is not limited here.
  • The data processing apparatus described in the embodiment corresponding to FIG. 8 may be deployed on the execution device 900 to implement the data processing function of that embodiment.
  • the execution device 900 includes: a receiver 901, a transmitter 902, a processor 903 and a memory 904 (wherein the number of processors 903 in the execution device 900 may be one or more, and one processor is taken as an example in FIG. 9 ) , wherein the processor 903 may include an application processor 9031 and a communication processor 9032 .
  • the receiver 901, the transmitter 902, the processor 903, and the memory 904 may be connected by a bus or otherwise.
  • Memory 904 which may include read-only memory and random access memory, provides instructions and data to processor 903 .
  • a portion of memory 904 may also include non-volatile random access memory (NVRAM).
  • The memory 904 stores operating instructions, executable modules, or data structures, or a subset or an extended set thereof, where the operating instructions may include various instructions for implementing various operations.
  • the processor 903 controls the operation of the execution device.
  • the various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus.
  • For clarity, however, the various buses are all referred to as the bus system in the figure.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 903 or implemented by the processor 903 .
  • the processor 903 may be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 903 or an instruction in the form of software.
  • The above-mentioned processor 903 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 903 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 904, and the processor 903 reads the information in the memory 904, and completes the steps of the above method in combination with its hardware.
  • the receiver 901 can be used to receive input numerical or character information, and to generate signal input related to the relevant setting and function control of the execution device.
  • the transmitter 902 can be used to output digital or character information through the first interface; the transmitter 902 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 902 can also include a display device such as a display screen .
  • the processor 903 is configured to execute the data processing method executed by the execution device in the embodiment corresponding to FIG. 5 .
  • Embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • Embodiments of the present application further provide a computer-readable storage medium that stores a program for performing signal processing. When the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device, or terminal device provided in this embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits, etc.
  • the processing unit can execute the computer executable instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiments, or the chip in the training device executes the data processing method described in the above embodiment.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit may also be a storage unit located outside the chip in the wireless access device, such as read-only memory (ROM) or another type of static storage device that can store static information and instructions, or random access memory (RAM).
  • FIG. 10 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • The chip may be embodied as a neural network processor (NPU) 1000. The NPU 1000 is mounted on the main CPU (Host CPU) as a co-processor, and the Host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1003, which is controlled by the controller 1004 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 1003 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1003 is a general-purpose matrix processor.
  • For example, given an input matrix A and a weight matrix B, the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1002 and buffers it on each PE in the arithmetic circuit.
  • The arithmetic circuit then fetches the data of matrix A from the input memory 1001, performs the matrix operation with matrix B, and stores the partial or final result of the resulting matrix in an accumulator 1008.
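The role of the accumulator can be illustrated with a purely functional sketch: the product of A and B is built up as a running sum of partial results, one slice of the shared dimension at a time. This models only the arithmetic, not the timing or data movement of a systolic array, and the function name is hypothetical.

```python
def matmul_accumulate(A, B):
    """Compute A x B by accumulating partial results over the shared dimension."""
    rows, inner, cols = len(A), len(B), len(B[0])
    acc = [[0.0] * cols for _ in range(rows)]      # plays the role of the accumulator
    for k in range(inner):                         # one partial result per slice k
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += A[i][k] * B[k][j]
    return acc


A = [[1.0, 2.0], [3.0, 4.0]]      # input matrix (from the input memory)
B = [[5.0, 6.0], [7.0, 8.0]]      # weight matrix (from the weight memory)
C = matmul_accumulate(A, B)
```

After the first slice the accumulator holds a partial result; only after the last slice does it hold the final matrix.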
  • Unified memory 1006 is used to store input data and output data.
  • Weight data is transferred to the weight memory 1002 directly through the direct memory access controller (DMAC) 1005.
  • Input data is also transferred to unified memory 1006 via the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 1013, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1009.
  • the bus interface unit 1013 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1009 to obtain instructions from the external memory, and also for the storage unit access controller 1005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1006 , the weight data to the weight memory 1002 , or the input data to the input memory 1001 .
  • the vector calculation unit 1007 includes a plurality of operation processing units, and further processes the output of the operation circuit 1003 if necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1007 can store the vector of processed outputs to the unified memory 1006 .
  • The vector calculation unit 1007 may apply a linear or nonlinear function to the output of the arithmetic circuit 1003, for example linear interpolation on the feature planes extracted by a convolutional layer, or a nonlinear function applied to a vector of accumulated values to generate activation values.
  • the vector computation unit 1007 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1003, eg, for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 1009 connected to the controller 1004 is used to store the instructions used by the controller 1004;
  • the unified memory 1006, the input memory 1001, the weight memory 1002 and the instruction fetch memory 1009 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
  • The device embodiments described above are only schematic. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a training device, a data center, or the like that includes an integration of one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.

Abstract

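A data processing method, applied to the field of artificial intelligence, includes: a data processing apparatus obtains a network to be compressed and a plurality of data, the network to be compressed being a classification network (501); the data processing apparatus inputs the plurality of data into the network to be compressed to obtain a plurality of first output results, which are in one-to-one correspondence with the plurality of data (502); the data processing apparatus determines the one-hot label corresponding to each of the plurality of first output results (503); the data processing apparatus determines the first similarity between each of the plurality of first output results and its one-hot label (504); and the data processing apparatus determines, according to the first similarity corresponding to each of the plurality of first output results, at least one target data among the plurality of data, the at least one target data being used to compress the network to be compressed (505). With this method, a large amount of data close to the original training data of the network to be compressed can be obtained, which ensures that the network can be compressed effectively.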

Description

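Data processing method and related apparatus
This application claims priority to Chinese Patent Application No. 202011381498.2, filed with the China Patent Office on November 30, 2020 and entitled "Data processing method and related apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technologies, and in particular to a data processing method and a related apparatus.
Background
In recent years, deep neural networks have achieved great success in many computer vision applications, such as image classification, object detection, and image segmentation. However, deep neural network models usually contain a large number of parameters and require heavy computation, which makes them difficult to deploy on devices with limited computing power (such as terminal devices, embedded devices, and integrated devices).
The related art proposes compression algorithms for deep neural networks, which can compress a teacher network model with large storage requirements and high computational complexity into a student network model with small storage requirements and low computational complexity, so that the student network can be deployed on low-power devices with limited computing power. In the related art, compressing a deep neural network with such an algorithm requires the training data of the original neural network.
However, in some cases the training data of the network to be compressed cannot be obtained, which makes it difficult to compress the neural network effectively.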
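Summary
This application provides a data processing method. The obtained unlabeled data is input into the network to be compressed, the one-hot label of each resulting output is determined, and the similarity between each output and its one-hot label is measured, so that the unlabeled data corresponding to outputs with higher similarity is used as the data for compressing the network to be compressed. With this method, a large amount of data close to the original training data of the network to be compressed can be obtained, which ensures that the network can be compressed effectively.
A first aspect of this application provides a data processing method, including: a data processing apparatus obtains a network to be compressed and a plurality of data, where the network to be compressed is a classification network that classifies input data to produce a classification result; the plurality of data may be image data, text data, video data, or voice data. The network to be compressed may, for example, be uploaded to the data processing apparatus by a user, and the plurality of data may be unlabeled data that the data processing apparatus obtains by accessing a specific image library. The data processing apparatus inputs the plurality of data into the network to be compressed one by one to obtain a first output result for each of the plurality of data. The first output result may be an n-dimensional label, where n is the number of classes, and each label value in the n-dimensional label represents the probability that the data corresponding to the first output result belongs to the respective class.
The data processing apparatus determines the one-hot label corresponding to each of the plurality of first output results; the one-hot label is, for example, an n-dimensional label that includes one label value of 1 and n-1 label values of 0, where n is an integer greater than 1. The data processing apparatus determines the first similarity between each of the plurality of first output results and its one-hot label. Because the first similarity can be used to measure how similar the data obtained by the data processing apparatus is to the original training data, the data processing apparatus can determine the target data among the plurality of data according to the first similarity corresponding to each of the first output results. Put simply, every piece of data obtained by the data processing apparatus has a corresponding first output result, and every first output result has a corresponding first similarity, so every piece of data has a corresponding first similarity. The higher the first similarity of a piece of data, the closer that data is to the original training data of the network to be compressed; the data processing apparatus can therefore select the data with higher first similarity as the target data to compress the network to be compressed.
In this solution, the obtained unlabeled data is input into the network to be compressed, the one-hot label of each resulting output is determined, and the similarity between each output and its one-hot label is measured, so that the unlabeled data corresponding to outputs with higher similarity is used as the data for compressing the network to be compressed. With this method, a large amount of data close to the original training data of the network to be compressed can be obtained, which ensures that the network can be compressed effectively.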
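Optionally, in a possible implementation, the data processing apparatus determining at least one target data among the plurality of data according to the first similarity corresponding to each of the plurality of first output results includes: determining, according to the first similarity corresponding to each of the plurality of first output results, the N target data with the largest first similarity among the plurality of data, where N is a first preset threshold and an integer greater than 1.
In this solution, by taking the data with the largest similarity as the target data, data close to the original training data can be selected for training from a large amount of unlabeled data, which ensures that the network can be compressed effectively.
Optionally, in a possible implementation, the data processing apparatus determining at least one target data among the plurality of data according to the first similarity corresponding to each of the plurality of first output results includes: the data processing apparatus determines, according to the first similarity corresponding to each of the plurality of first output results, the M target data whose first similarity is greater than a second preset threshold among the plurality of data.
In this solution, by taking the data whose similarity is greater than a threshold as the target data, data close to the original training data can be selected for training from a large amount of unlabeled data, which ensures that the network can be compressed effectively.
Optionally, in a possible implementation, the data processing apparatus determining the first similarity between each of the plurality of first output results and the one-hot label includes: determining the first similarity by calculating the relative entropy or a distance metric between each of the plurality of first output results and the one-hot label.
In this solution, determining the first similarity by calculating the relative entropy or a distance metric between the first output result and the one-hot label makes the similarity computable and ensures that the solution is implementable.
Optionally, in a possible implementation, the distance metric includes the mean square error (MSE) distance or the L1 distance.
Optionally, in a possible implementation, the method further includes: compressing the network to be compressed by distillation to obtain a target network.
Optionally, in a possible implementation, compressing the network to be compressed by distillation to obtain the target network includes: the data processing apparatus obtains a student network; the data processing apparatus inputs the at least one target data into the student network and the network to be compressed respectively, to obtain a second output result of the student network and a third output result of the network to be compressed; the data processing apparatus determines a loss function according to the second output result and the third output result; and the data processing apparatus trains the student network according to the loss function until the loss function converges, to obtain the target network.
Optionally, in a possible implementation, the data processing apparatus determining a loss function according to the second output result and the third output result includes: determining a second similarity between the second output result and the third output result; and determining the loss function at least according to the second similarity.
Optionally, in a possible implementation, determining a loss function according to the second output result and the third output result further includes: determining a fourth output result according to the second output result and a probability transition matrix; determining the one-hot label corresponding to the third output result; and determining a third similarity between the fourth output result and the one-hot label corresponding to the third output result. Determining the loss function at least according to the second similarity includes: determining the loss function according to the second similarity and the third similarity.
In this solution, introducing a probability transition matrix to correct the labels predicted by the teacher network improves the effect of network compression when the training data is unlabeled, and ensures the prediction accuracy of the compressed network.
Optionally, in a possible implementation, the plurality of data includes image data, text data, video data, or voice data.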
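A second aspect of this application provides a data processing apparatus, including an obtaining unit and a processing unit. The obtaining unit is configured to obtain a network to be compressed and a plurality of data, the network to be compressed being a classification network. The processing unit is configured to input the plurality of data into the network to be compressed to obtain a plurality of first output results, which are in one-to-one correspondence with the plurality of data. The processing unit is further configured to determine the one-hot label corresponding to each of the plurality of first output results; to determine the first similarity between each of the plurality of first output results and the one-hot label; and to determine, according to the first similarity corresponding to each of the plurality of first output results, at least one target data among the plurality of data, the at least one target data being used to compress the network to be compressed.
Optionally, in a possible implementation, the one-hot label is an n-dimensional label that includes one label value of 1 and n-1 label values of 0, where n is an integer greater than 1.
Optionally, in a possible implementation, the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, the N target data with the largest first similarity among the plurality of data, where N is a first preset threshold and an integer greater than 1.
Optionally, in a possible implementation, the processing unit is further configured to determine, according to the first similarity corresponding to each of the plurality of first output results, the M target data whose first similarity is greater than a second preset threshold among the plurality of data.
Optionally, in a possible implementation, the processing unit is further configured to determine the first similarity by calculating the relative entropy or a distance metric between each of the plurality of first output results and the one-hot label.
Optionally, in a possible implementation, the distance metric includes the mean square error (MSE) distance or the L1 distance.
Optionally, in a possible implementation, the processing unit is further configured to compress the network to be compressed by distillation to obtain a target network.
Optionally, in a possible implementation, the obtaining unit is further configured to obtain a student network; the processing unit is further configured to input the at least one target data into the student network and the network to be compressed respectively, to obtain a second output result of the student network and a third output result of the network to be compressed; the processing unit is further configured to determine a loss function according to the second output result and the third output result; and the processing unit is further configured to train the student network according to the loss function until the loss function converges, to obtain the target network.
Optionally, in a possible implementation, the processing unit is further configured to determine a second similarity between the second output result and the third output result, and to determine the loss function at least according to the second similarity.
Optionally, in a possible implementation, the processing unit is further configured to: determine a fourth output result according to the second output result and a probability transition matrix; determine the one-hot label corresponding to the third output result; determine a third similarity between the fourth output result and the one-hot label corresponding to the third output result; and determine the loss function according to the second similarity and the third similarity.
Optionally, in a possible implementation, the plurality of data includes image data, text data, video data, or voice data.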
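A third aspect of this application provides a data processing apparatus, which may include a processor coupled to a memory. The memory stores program instructions, and the method of the first aspect is implemented when the program instructions stored in the memory are executed by the processor. For the steps that the processor performs in each possible implementation of the first aspect, refer to the first aspect; details are not repeated here.
A fourth aspect of this application provides a computer-readable storage medium that stores a computer program which, when run on a computer, causes the computer to perform the method of the first aspect.
A fifth aspect of this application provides a circuit system that includes a processing circuit configured to perform the method of the first aspect.
A sixth aspect of this application provides a computer program which, when run on a computer, causes the computer to perform the method of the first aspect.
A seventh aspect of this application provides a chip system that includes a processor configured to support a server or a threshold value obtaining apparatus in implementing the functions involved in the above aspects, for example sending or processing the data and/or information involved in the above method. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the server or the communication device. The chip system may consist of a chip, or may include a chip and other discrete devices.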
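Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the main framework of artificial intelligence provided by an embodiment of this application;
FIG. 2a shows an image processing system provided by an embodiment of this application;
FIG. 2b shows another image processing system provided by an embodiment of this application;
FIG. 2c is a schematic diagram of devices related to image processing provided by an embodiment of this application;
FIG. 3a is a schematic diagram of the architecture of a system 100 provided by an embodiment of this application;
FIG. 3b is a schematic diagram of image semantic segmentation provided by an embodiment of this application;
FIG. 4a is a schematic diagram of the application of neural network compression in a real scenario;
FIG. 4b is a schematic flowchart of a network compression method provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of a data processing method 500 provided by an embodiment of this application;
FIG. 6 is a schematic flowchart of compressing a network to be compressed provided by an embodiment of this application;
FIG. 7 is a schematic flowchart of network compression provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a data processing apparatus provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of an execution device provided by an embodiment of this application;
FIG. 10 is a schematic structural diagram of a chip provided by an embodiment of this application.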
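Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. The terms used in the description of the embodiments are only intended to explain specific embodiments of the present invention and are not intended to limit the present invention.
The embodiments of this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the way in which objects with the same attributes are distinguished when the embodiments of this application are described. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not clearly listed or that are inherent to the process, method, product, or device.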
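The overall workflow of an artificial intelligence system is described first. Referring to FIG. 1, which is a schematic structural diagram of the main framework of artificial intelligence, the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition to data processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes the refinement of "data, information, knowledge, wisdom". The "IT value chain" runs from the underlying infrastructure of artificial intelligence and information (the provision and processing technologies) to the industrial ecology of the system, and reflects the value that artificial intelligence brings to the information technology industry.
(1) Infrastructure
The infrastructure provides computing power for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes platform assurance and support such as distributed computing frameworks and networks, and may include cloud storage and computing and interconnection networks. For example, sensors communicate with the outside to obtain data, and the data is provided to intelligent chips in the distributed computing system offered by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure represents the data sources of the artificial intelligence field. The data involves graphics, images, voice, and text, as well as Internet of Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on data.
Reasoning refers to the process in which a computer or intelligent system simulates human intelligent reasoning and, following a reasoning control strategy, uses formalized information to carry out machine thinking and solve problems; typical functions are searching and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned over, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing described above, some general capabilities can be formed based on the results, such as an algorithm or a general system, for example translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They package the overall artificial intelligence solution, turn intelligent information decision-making into products, and put it into practical use. Application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, and smart cities.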
接下来介绍几种本申请的应用场景。
图2a为本申请实施例提供的一种图像处理系统,该图像处理系统包括用户设备以及数据处理设备。其中,用户设备包括手机、个人电脑或者信息处理中心等智能终端。用户设备为图像处理的发起端,作为图像增强请求的发起方,通常由用户通过用户设备发起请求。
上述数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。数据处理设备通过交互接口接收来自智能终端的图像增强请求,再通过存储数据的存储器以及数据处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的图像处理。数据处理设备中的存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以在数据处理设备上,也可以在其它网络服务器上。
在图2a所示的图像处理系统中,用户设备可以接收用户的指令,例如用户设备可以获取用户输入/选择的一张图像,然后向数据处理设备发起请求,使得数据处理设备针对用户设备得到的该图像执行图像增强处理应用(例如图像超分辨率重构、图像去噪、图像去雾、图像去模糊以及图像对比度增强等),从而得到针对该图像的对应的处理结果。示例性的,用户设备可以获取用户输入的一张图像,然后向数据处理设备发起图像去噪请求,使得数据处理设备对该图像进行图像去噪,从而得到去噪后的图像。
在图2a中,数据处理设备可以执行本申请实施例的图像处理方法。
图2b为本申请实施例提供的另一种图像处理系统,在图2b中,用户设备直接作为数据处理设备,该用户设备能够直接获取来自用户的输入并直接由用户设备本身的硬件进行处理,具体过程与图2a相似,可参考上面的描述,在此不再赘述。
在图2b所示的图像处理系统中,用户设备可以接收用户的指令,例如用户设备可以获取用户在用户设备中所选择的一张图像,然后再由用户设备自身针对该图像执行图像处理 应用(例如图像超分辨率重构、图像去噪、图像去雾、图像去模糊以及图像对比度增强等),从而得到针对该图像的对应的处理结果。
在图2b中,用户设备自身就可以执行本申请实施例的图像处理方法。
图2c是本申请实施例提供的图像处理的相关设备的示意图。
上述图2a和图2b中的用户设备具体可以是图2c中的本地设备301或者本地设备302,图2a中的数据处理设备具体可以是图2c中的执行设备210,其中,数据存储系统250可以存储执行设备210的待处理数据,数据存储系统250可以集成在执行设备210上,也可以设置在云上或其它网络服务器上。
图2a和图2b中的处理器可以通过神经网络模型或者其它模型(例如,基于支持向量机的模型)进行数据训练/机器学习/深度学习,并利用数据最终训练或者学习得到的模型针对图像执行图像处理应用,从而得到相应的处理结果。
图3a是本申请实施例提供的一种系统100架构的示意图,在图3a中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:各个待调度任务、可调用资源以及其他参数。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理(比如进行本申请中神经网络的功能实现)过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则,该相应的目标模型/规则即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。其中,训练数据可以存储在数据库130中,且来自于数据采集设备160采集的训练样本。
在图3a中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图3a仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图3a中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110 中。如图3a所示,可以根据训练设备120训练得到神经网络。
本申请实施例还提供的一种芯片,该芯片包括神经网络处理器NPU。该芯片可以被设置在如图3a所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图3a所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则。
神经网络处理器NPU,NPU作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路,控制器控制运算电路提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路是二维脉动阵列。运算电路还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)中。
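The accumulation of partial results described above can be sketched in software. This is only an illustrative model, not the hardware design: the function name `matmul_with_accumulator` is hypothetical, and the loop order (one inner-dimension step at a time) is an assumption chosen to make the role of the accumulator visible.

```python
def matmul_with_accumulator(a, b):
    """Compute C = A x B by adding one partial product per inner-dimension step."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B must match")
    # The accumulator holds partial results until every step has been added.
    accumulator = [[0] * cols for _ in range(rows)]
    for k in range(inner):          # one slice of A and B per step
        for i in range(rows):
            for j in range(cols):
                accumulator[i][j] += a[i][k] * b[k][j]
    return accumulator


if __name__ == "__main__":
    input_a = [[1, 2], [3, 4]]      # input matrix A
    weight_b = [[5, 6], [7, 8]]     # weight matrix B
    print(matmul_with_accumulator(input_a, weight_b))
```

After the first step the accumulator holds only a partial result; it becomes the final output matrix C once all inner-dimension steps have been added.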
向量计算单元可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元能将经处理的输出的向量存储到统一缓存器。例如,向量计算单元可以将非线性函数应用到运算电路的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器用于存放输入数据以及输出数据。
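The direct memory access controller (DMAC) moves input data from the external memory to the input memory and/or the unified memory, stores weight data from the external memory into the weight memory, and stores data from the unified memory into the external memory.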
总线接口单元(bus interface unit,BIU),用于通过总线实现主CPU、DMAC和取指存储器之间进行交互。
与控制器连接的取指存储器(instruction fetch buffer),用于存储控制器使用的指令;
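The controller is configured to invoke the instructions cached in the instruction fetch buffer, so as to control the working process of the operation accelerator.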
一般地,统一存储器,输入存储器,权重存储器以及取指存储器均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
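A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:
$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network and converts the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units together, that is, the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of that local receptive field; a local receptive field may be a region composed of several neural units.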
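As a minimal sketch of the neural-unit output defined above, the following computes f(ΣW_s·x_s + b) with a sigmoid activation. The function names `sigmoid` and `neural_unit_output` are illustrative and do not come from the application.

```python
import math


def sigmoid(z):
    """Sigmoid activation, one possible choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit_output(xs, ws, b, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    if len(xs) != len(ws):
        raise ValueError("each input x_s needs exactly one weight W_s")
    weighted_sum = sum(w * x for w, x in zip(ws, xs)) + b
    return activation(weighted_sum)


if __name__ == "__main__":
    print(neural_unit_output([1.0, 2.0], [0.5, -0.25], 0.1))
```

Passing a different callable as `activation` swaps the nonlinearity without changing the weighted-sum step.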
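The work of each layer in a neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. At a physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by $W \cdot \vec{x}$, operation 4 is performed by $+b$, and operation 5 is performed by $a()$. The word "space" is used here because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. W is a weight vector, and each value in the vector represents the weight of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network (the weight matrix formed by the vectors W of many layers). The training process of a neural network is therefore essentially learning how to control the spatial transformation, and more specifically learning the weight matrices.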
因为希望神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么神经网络的训练就变成了尽可能缩小这个loss的过程。
(2)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动, 旨在得到最优的神经网络模型的参数,例如权重矩阵。
(3)图像增强
图像增强指的是对图像的亮度、颜色、对比度、饱和度、动态范围等进行处理,以满足某种特定指标。简单来说,通过在图像处理过程中,通过有目的地强调图像的整体或局部特性,将原来不清晰的图像变得清晰或强调某些感兴趣的特征,扩大图像中不同物体特征之间的差别,抑制不感兴趣的特征,从而起到改善图像质量、丰富图像信息量的作用,能够加强图像判读和识别效果,满足某些特殊分析的需要。示例性的,图像增强可以包括但不限于图像超分辨率重构、图像去噪、图像去雾、图像去模糊以及图像对比度增强。
(4)图像语义分割
图像语义分割是指将图像按照某种规则(如光照、类别)将像素细分成不同的类别。简单来说,图像语义分割的目标是给图像中的每一个像素点都标注一个标签,即标注出图像中每个像素所属的对象类别,这些标签可以包括人、动物、汽车、鲜花、家具等。可以参阅图3b,图3b为本申请实施例提供的一种图像语义分割的示意图。如图3b所示,通过图像语义分割可以将图像在像素级别按照类别划分成不同的子区域,如建筑物、天空、植物等子区域。
下面从神经网络的训练侧和神经网络的应用侧对本申请提供的方法进行描述。
本申请实施例提供的神经网络的训练方法,涉及图像的处理,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据(如本申请中的图像)进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的图像处理模型;并且,本申请实施例提供的图像处理方法可以运用上述训练好的图像处理模型,将输入数据(如本申请中的待处理图像)输入到训练好的图像处理模型中,得到输出数据(如本申请中目标图像)。需要说明的是,本申请实施例提供的图像处理模型的训练方法和图像处理方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。
随着深度学习技术的发展,神经网络已经被成功的应用于许多实际任务中(例如图片分类、物体检测、文本分类以及语音识别等任务)。一般地,神经网络需要巨大的计算资源才能够正常运行。然而,在一些终端设备上(例如手机、摄像头或车载终端等终端设备),这些终端设备的计算资源通常不足以支撑其运行具有复杂结构的神经网络。
因此,为了将神经网络能够应用到这些算力受限的终端设备上,人们提出了神经网络的压缩算法。通过采用压缩算法对计算复杂度较高且存储空间要求较大的神经网络进行压缩,能够得到计算复杂度较低且存储空间要求较小的压缩神经网络,以便于压缩神经网络能够运行于算力受限的终端设备上。
现有的压缩算法通常是将待压缩的神经网络作为教师网络,将计算复杂度较小的一个神经网络作为学生网络,并且将原始训练数据分别输入教师网络与学生网络中,通过由教师网络为学生网络提供有效的监督信息来实现学生网络的训练,从而得到压缩后的神经网络。因此,现有的压缩算法通常都需要待压缩网络的原始训练数据来进行网络的压缩。
然而,在大部分场景下,待压缩网络的原始训练数据往往是难以获取得到的。
可以参阅图4a,图4a为神经网络压缩在实际场景中的应用示意图。如图4a所示,以公有云的应用场景为例,用户基于自己的本地数据训练得到一个神经网络,并将该训练好的神经网络传输到公有云,要求对这个神经网络进行压缩,以便应用到手机等移动设备上。对于用于训练该神经网络的这些原始训练数据而言,由于用户的个人隐私保护,这些原始训练数据是不可获得的,或由于这些原始训练数据太大而导致其难以传输到云上。也就是说,公有云上通常无法获取到训练该神经网络的原始训练数据。
在另一个可能的场景下,用户从特定的机构或公司购买训练好的神经网络,并且将这个网络在公有云上进行压缩,以便应用到手机等移动设备上。由于该神经网络的原始训练数据属于商业机密,用户通常也无法获得该神经网络的原始训练数据。
在这些场景下,由于缺乏原始训练数据,现有的神经网络的压缩算法通常都难以实现神经网络的压缩。
有鉴于此,本申请实施例提供了一种数据处理方法,通过将获取到的无标签数据输入待压缩网络中,求取所得到的输出结果的独热标签,并衡量输出结果与独热标签之间的相似度,以将相似度较高的输出结果对应的无标签数据作为用于压缩该待压缩网络的数据。通过该方法,能够获得大量与待压缩网络的原训练数据相近的数据,从而保证能够有效地实现网络的压缩。
可以参阅图4b,图4b为本申请实施例提供的一种网络压缩方法的流程示意图。如图4b所示,用户将待压缩网络上传到公有云上,由公有云对待压缩网络进行压缩,并将压缩后得到的网络部署到移动设备上。具体地,公有云压缩网络的过程包括:公有云将无标签数据输入至待压缩网络中,并基于求取所得到的输出结果的独热标签,并衡量输出结果与独热标签之间的相似度,以将相似度较高的输出结果对应的无标签数据作为目标数据。然后,公有云基于目标数据和待压缩网络,采用蒸馏算法对待压缩网络进行压缩,得到压缩后的网络。
可以参阅图5,图5为本申请实施例提供的一种数据处理方法500的流程示意图。如图5所示,该数据处理方法500包括以下的步骤。
步骤501,数据处理装置获取待压缩网络和多个数据,该待压缩网络为分类网络。
本实施例中,该数据处理装置可以是用于压缩神经网络的装置,也可以是专门用于获取压缩神经网络所需的训练数据的装置。示例性地,数据处理装置可以是部署于云上的服务器,用于获取压缩神经网络所需的训练数据以及基于获取到的训练数据压缩神经网络。
该待压缩网络为分类网络,用于对输入的数据进行分类,以得到输出的分类结果。示例性地,假设待压缩网络为T,输入数据为x,待压缩网络的输出结果为y T,该y T为n维标签,其中n为分类类别的数目。输出结果y T中值最大的维度即为网络所判断的数据x的类别。例如,假设输出结果y T为3维标签,该3维标签的第一维表示分类类别为猫,第二维表示分类类别为狗,第三维表示分类类别为猪;那么,在输入数据x为图像,输出结果y T为{0,1,0}的情况下,表示该图像的分类类别为狗,即该图像中的动物为狗。
该数据处理装置可以通过获取其他的终端设备所发送的数据来获取该待压缩网络。例 如,在该数据处理装置为部署于云上的服务器时,该数据处理装置可以通过接收用户基于个人电脑或笔记本电脑等终端设备所发送的数据,来获得用户上传的待压缩网络。
该多个数据为类型相同的数据,例如该多个数据可以为图像数据、文本数据、视频数据或语音数据。例如,在该待压缩网络为图像分类网络时,该多个数据为图像数据,该待压缩网络用于对图像进行分类,比如根据图像上所显示的动物将图像分类为狗、猫、鱼等类型。又例如,在该待压缩网络为文本分类网络时,该多个数据为文本数据,用于对文本进行分类,比如将文本分类为正面情感文本或负面情感文本等类型。
其中,该数据处理装置获取该多个数据的方式可以有多种。以下将以该多个数据为图像数据为例,介绍数据处理装置获取该多个数据的方式。
在一种可能的方式中,在数据处理装置为服务器的情况下,由于服务器上通常保存有大量的图像数据,这些图像数据可能是由大量的用户所上传的图像数据,也可能是由专门的图像采集人员所录入的图像数据。即数据处理装置可以从保存图像数据的存储空间中获取到上述的多个数据。
在另一种可能的方式中,在网络上,通常存在有特定人员所创建并开放的图库,该图库中包括有大量的图像数据,以供开发人员访问并使用。也就是说,数据处理装置可以通过访问特定的网页,来访问该网页上对应的图库,从而获取到该图库中的大量数据作为上述的多个数据。此外,数据处理装置还可以是基于网络爬虫在网络上抓取图像数据,以获得上述的多个数据。
可以理解的是,数据处理装置获取到的数据通常是不具有分类标签的,即这些数据是没有经过特定分类且标记有相应标签的。由于数据处理装置能够获取到大量的未标记数据,而这些未标记数据中会存在有一部分与待压缩网络的原始训练数据相近的数据,因此数据处理装置通过本实施例的方法将这部分与待压缩网络的原始训练数据相近的数据筛选出来之后,这部分数据即可用于压缩该待压缩网络。
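In some cases, the data obtained by the data processing apparatus may already be coarsely classified; for example, the apparatus may obtain animal images, household-appliance images, plant images, and so on from different galleries. In that case, if the apparatus can also obtain the classification categories of the network to be compressed, it may preliminarily screen the obtained data and discard data that cannot be training data of the network to be compressed. For example, when the classification category of the network to be compressed is animals, that is, the network is used to classify animal images, the apparatus may discard non-animal images in advance, such as household-appliance images or plant images, to save computation.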
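The preliminary screening described above amounts to a simple filter over coarsely labelled data. The sketch below assumes each item carries a hypothetical `category` field naming the gallery it came from; the field name, the file names, and the function `prefilter_by_category` are illustrative only.

```python
def prefilter_by_category(items, network_category):
    """Keep only items whose coarse category matches the network's domain."""
    return [item for item in items if item["category"] == network_category]


if __name__ == "__main__":
    # Hypothetical gallery entries; only the coarse category is known.
    gallery = [
        {"path": "img_001.jpg", "category": "animal"},
        {"path": "img_002.jpg", "category": "appliance"},
        {"path": "img_003.jpg", "category": "plant"},
        {"path": "img_004.jpg", "category": "animal"},
    ]
    candidates = prefilter_by_category(gallery, "animal")
    print([item["path"] for item in candidates])
```

Only the surviving candidates then need to be passed through the network to be compressed, which is where the computation is saved.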
步骤502,数据处理装置将多个数据输入待压缩网络,得到多个第一输出结果,所述多个第一输出结果与所述多个数据之间一一对应。
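After obtaining the plurality of pieces of data, the data processing apparatus may input them into the network to be compressed one by one, to obtain the first output result corresponding to each piece of data. The first output result may be an n-dimensional label, where n is the number of classification categories, and each label value represents the probability that the corresponding data belongs to one category. For example, suppose the first dimension of the first output result for data 1 represents the category cat, the second dimension represents dog, and the third dimension represents pig; a first output result of {0.3, 0.6, 0.1} then means that data 1 belongs to the cat category with probability 0.3, to the dog category with probability 0.6, and to the pig category with probability 0.1.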
步骤503,数据处理装置确定多个第一输出结果中每个第一输出结果所对应的独热(one-hot)标签。
在一个可能的实施例中,one-hot标签为n维标签,n维标签包括1个值为1的标签值,以及n-1个值为0的标签值,n为大于1的整数。由于第一输出结果同样为一个n维标签,因此数据处理装置确定第一输出结果对应的one-hot标签的方式可以是:确定第一输出结果中标签值最大的维度,基于该维度的标签值为1,其他维度的标签值为0,生成第一输出结果对应的one-hot标签。例如,在第一输出结果为{0.3,0.6,0.1}的情况下,数据处理装置可以确定第一输出结果中标签值最大的维度为第二维(即标签值为0.6的维度),因此数据处理装置可以生成第一输出结果对应的one-hot标签,该one-hot标签为{0,1,0}。
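The construction of the one-hot label from a first output result can be sketched as follows. The function name `one_hot_label` is illustrative; ties are resolved in favour of the first maximal dimension, which is an assumption the text does not address.

```python
def one_hot_label(output):
    """Set the dimension with the largest label value to 1 and all others to 0."""
    if len(output) < 2:
        raise ValueError("a one-hot label needs n > 1 dimensions")
    top = max(range(len(output)), key=lambda i: output[i])
    return [1.0 if i == top else 0.0 for i in range(len(output))]


if __name__ == "__main__":
    print(one_hot_label([0.3, 0.6, 0.1]))
```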
可以理解的是,在上文的介绍中,one-hot标签是一个包括1个值为1的标签值,以及n-1个值为0的标签值的标签。实际上,one-hot标签也可以是指包括1个值接近1的标签值,以及n-1个值接近0的标签值的标签,例如该one-hot标签可以为{0.001,0.997,0.002}。本实施例并不对one-hot标签做具体限定。
步骤504,数据处理装置分别确定多个第一输出结果中每个第一输出结果与one-hot标签之间的第一相似度。
在得到每个第一输出结果对应的one-hot标签之后,数据处理装置可以计算每个第一输出结果与其对应的one-hot标签之间的第一相似度。
可以理解的是,在分类网络的训练中,其目标是让分类网络的输出结果和训练数据的真实标签尽可能相同。而训练数据的真实标签通常可以由one-hot标签来表示。所以,对于一个训练好的分类网络,该分类网络的原始训练数据在该分类网络中的输出结果会非常接近one-hot标签,即输出结果与one-hot标签的相似度很高。而对于非原始训练数据的其他数据,由于该分类网络并不一定能够准确识别到该数据,因此这些数据在该分类网络中的输出结果并非会很接近one-hot标签,即输出结果与one-hot标签的相似度不高。
示例性地,在待压缩网络是基于与狗、猫以及猪相关的图像训练得到的情况下,即待压缩网络的原始训练数据是与狗、猫以及猪相关的图像,数据处理装置获得的图像数据中包括图像1和图像2,图像1为与狗相关的动物类图像,图像2为与冰箱相关的家电类图像。将该图像1和图像2分别输入待压缩网络中,由于图像1与待压缩网络的原始训练数据类似,因此图像1对应的输出结果可以为{0.08,0.91,0.01};由于图像2与待压缩网络的原始训练数据相差较大,待压缩网络难以有效地识别图像2,该图像2对应的输出结果可以为{0.3,0.3,0.4}。由此可见,在图像的输出结果越接近one-hot标签的情况下,该图像与原始训练数据越接近。
基于此,本实施例中可以通过确定第一输出结果与one-hot标签之间的第一相似度来判断该第一输出结果对应的数据与原始训练数据是否接近。如果第一相似度较高,则可以认为该第一相似度对应的数据与原始训练数据较为接近;如果第一相似度较低,则可以认为该第一相似度对应的数据与原始训练数据相差较大。
在一个可能的实施例中,数据处理装置分别确定多个第一输出结果中每个第一输出结果与one-hot标签之间的第一相似度,具体可以包括:数据处理装置通过计算多个第一输出结果中每个第一输出结果与该第一输出结果对应的one-hot标签之间的相对熵或距离度量,来确定第一输出结果与one-hot标签之间的第一相似度。
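Relative entropy, also called Kullback-Leibler divergence (KL divergence) or information divergence, is an asymmetric measure of the difference between two probability distributions; it equals the cross-entropy between the two distributions minus the Shannon entropy of the first. For example, assuming that the first output result is $y_T$ and the one-hot label corresponding to the first output result is $t$, the KL divergence between the first output result and the one-hot label may be as shown in Formula 1:
$D_{KL}(y_T, t) = -[y_T \log t + (1 - y_T)\log(1 - t)]$     (Formula 1)
where $D_{KL}(y_T, t)$ denotes the KL divergence between the first output result and the one-hot label, and $\log()$ denotes the logarithm. The smaller the KL divergence, the closer the first output result $y_T$ is to its corresponding one-hot label, that is, the greater the similarity between the two.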
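A minimal sketch of scoring an output against its own one-hot label is given below. It uses the textbook definition KL(p || q) = Σ p_i·log(p_i / q_i) with the one-hot label as p, which is an assumption and is not a transcription of Formula 1; under that definition the score reduces to the negative logarithm of the largest output probability. The names `kl_divergence` and `kl_to_one_hot` and the clipping constant `eps` are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), treating 0 * log(0) as 0."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def kl_to_one_hot(output):
    """Divergence between an output and the one-hot label derived from it."""
    top = max(range(len(output)), key=lambda i: output[i])
    label = [1.0 if i == top else 0.0 for i in range(len(output))]
    return kl_divergence(label, output)


if __name__ == "__main__":
    confident = [0.08, 0.91, 0.01]   # output for an image close to the training data
    uncertain = [0.3, 0.3, 0.4]      # output for an unrelated image
    print(kl_to_one_hot(confident), kl_to_one_hot(uncertain))
```

The confident output yields the smaller divergence, matching the statement that a smaller KL divergence means a higher similarity to the one-hot label.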
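A distance metric may also be called a similarity measure. By calculating the distance metric between two multi-dimensional data items, the similarity between them can be determined. In general, the smaller the distance metric between two multi-dimensional data items, the higher their similarity; conversely, the larger the distance metric, the lower their similarity. For example, the distance metric may include a mean squared error (MSE) distance or an L1 distance. It should be understood that, besides the MSE distance and the L1 distance, the data processing apparatus may also determine the first similarity based on other distance metrics; this embodiment does not limit the distance metric.
The MSE distance is the expected value of the squared difference between an estimated parameter value and the true parameter value, and can be used to evaluate how much data varies; the smaller the MSE distance, the smaller the gap between the estimated value and the true value. Similarly, assuming that the first output result is $y_T$ and the one-hot label corresponding to the first output result is $t$, the MSE distance between the first output result and the one-hot label may be as shown in Formula 2:
$MSE(y_T, t) = (y_T - t)^2$       (Formula 2)
where $MSE(y_T, t)$ denotes the MSE distance between the first output result and the one-hot label.
The L1 distance, also called the Manhattan distance, is the sum of the absolute axis-wise differences between two points in a standard coordinate system. Similarly, assuming that the first output result is $y_T$ and the one-hot label corresponding to the first output result is $t$, the L1 distance between the first output result and the one-hot label may be as shown in Formula 3:
$L1(y_T, t) = |y_T - t|$     (Formula 3)
where $L1(y_T, t)$ denotes the L1 distance between the first output result and the one-hot label.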
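The two distance metrics can be sketched for n-dimensional labels as follows. Formulas 2 and 3 are written per component; averaging the squared differences over the n dimensions for MSE and summing the absolute differences for L1 are assumptions made here to obtain a single score per data item. The function names are illustrative.

```python
def mse_distance(output, label):
    """Mean of the squared component-wise differences."""
    if len(output) != len(label):
        raise ValueError("output and label must have the same dimension")
    return sum((y - t) ** 2 for y, t in zip(output, label)) / len(output)


def l1_distance(output, label):
    """Sum of the absolute component-wise differences (Manhattan distance)."""
    if len(output) != len(label):
        raise ValueError("output and label must have the same dimension")
    return sum(abs(y - t) for y, t in zip(output, label))


if __name__ == "__main__":
    output = [0.3, 0.6, 0.1]
    label = [0.0, 1.0, 0.0]
    print(mse_distance(output, label), l1_distance(output, label))
```

Either score can replace the KL divergence when ranking data: a smaller distance indicates a higher first similarity.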
步骤505,数据处理装置根据多个第一输出结果中每个第一输出结果对应的第一相似度,在多个数据中确定至少一个目标数据,至少一个目标数据用于压缩待压缩网络。
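Because the first similarity measures how similar the data obtained by the data processing apparatus is to the original training data, the apparatus may determine the target data among the plurality of pieces of data according to their first similarities. Put simply, every piece of obtained data has a corresponding first output result, and every first output result has a corresponding first similarity, so every piece of obtained data has a corresponding first similarity. The higher the first similarity of a piece of data, the closer that data is to the original training data of the network to be compressed; the apparatus may therefore select the data with higher first similarity as the target data for compressing the network.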
在一个可能的实施例中,数据处理装置根据多个第一输出结果中每个第一输出结果对应的第一相似度,在多个数据中确定至少一个目标数据,具体可以包括:数据处理装置根据多个第一输出结果中每个第一输出结果对应的第一相似度,在多个数据中确定第一相似度最大的N个目标数据,N为第一预设阈值且N为大于1的整数。简单来说,数据处理装置可以预先获取到第一预设阈值N,该第一预设阈值N可以是预置于数据处理装置中的,或者数据处理装置提前从其他网络设备接收到的;然后,数据处理装置根据该第一预设阈值N,按照第一相似度从大到小的顺序,从该多个数据中选择N个目标数据,这些目标数据则是该多个数据中第一相似度最大的N个数据。在第一相似度是通过KL散度或距离度量来表示的情况下,数据处理装置实际上可以是根据该第一预设阈值N,按照KL散度或距离度量从小到大的顺序,从该多个数据中选择N个目标数据,这些目标数据则是该多个数据中KL散度或距离度量最小的N个数据。
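The value of N may be determined according to the actual computing capability of the data processing apparatus and the compression accuracy required of the network to be compressed; for example, N may range from tens of thousands to hundreds of thousands. For instance, if the data processing apparatus obtains 1 million pieces of data and N is 100,000, the apparatus may determine the 100,000 pieces of data with the highest first similarity among the 1 million as the target data.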
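Selecting the N most similar items is equivalent to selecting the N items with the smallest divergence. The sketch below assumes the divergence to the one-hot label is measured as the negative logarithm of the largest output probability (the standard KL divergence from a one-hot distribution); `select_top_n` and `divergence_to_one_hot` are hypothetical names.

```python
import heapq
import math


def divergence_to_one_hot(output, eps=1e-12):
    """Negative log of the largest probability; smaller means more similar."""
    return -math.log(max(max(output), eps))


def select_top_n(outputs, n):
    """Return the indices of the n outputs closest to their one-hot labels."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return heapq.nsmallest(
        n, range(len(outputs)), key=lambda i: divergence_to_one_hot(outputs[i])
    )


if __name__ == "__main__":
    first_outputs = [
        [0.08, 0.91, 0.01],
        [0.30, 0.30, 0.40],
        [0.02, 0.03, 0.95],
    ]
    print(select_top_n(first_outputs, 2))
```

`heapq.nsmallest` avoids fully sorting the candidate pool, which matters when N is much smaller than the number of unlabeled items.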
在另一个可能的实施例中,数据处理装置根据多个第一输出结果中每个第一输出结果对应的第一相似度,在多个数据中确定至少一个目标数据,具体可以包括:数据处理装置根据多个第一输出结果中每个第一输出结果对应的第一相似度,在多个数据中确定第一相似度大于第二预设阈值的M个目标数据。其中,第二预设阈值也可以是数据处理装置预先获取到的,例如该第二预设阈值可以是预置于数据处理装置中的,或者数据处理装置提前从其他网络设备接收到的。对于该多个数据中的每个数据,数据处理装置都可以比对数据对应的第一相似度是否大于第二预设阈值,如果该数据对应的第一相似度大于第二预设阈值,则可以确定该数据为目标数据。其中,第二预设阈值的取值也可以是根据数据处理装置的实际计算能力以及待压缩网络的压缩精度来决定,本实施例并不做具体限定。
在数据处理装置采用KL散度或距离度量来表示第一相似度的情况下,数据处理装置可以通过求取KL散度的倒数或者距离度量的倒数的方式来确定对应的第一相似度,并根据每个数据的第一相似度在多个数据中确定M个目标数据。
需要注意的是,在上一个可能的实施例中,N的取值是固定的,即N的取值为第一预设阈值,而在本实施例中,M的取值并非是固定的,而是基于多个数据中每个数据对应的第一相似度确定的。如果该多个数据中第一相似度大于第二预设阈值的数据越多,则M越大;如果该多个数据中第一相似度大于第二预设阈值的数据越少,则M越小。
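The threshold-based variant can be sketched in the same way. Here the first similarity is taken as the reciprocal of the divergence, as the text suggests; treating a divergence of exactly zero as infinite similarity is an assumption, and the names `first_similarity` and `select_above_threshold` are illustrative.

```python
import math


def first_similarity(output, eps=1e-12):
    """Reciprocal of the divergence between an output and its one-hot label."""
    divergence = -math.log(max(max(output), eps))
    return math.inf if divergence == 0.0 else 1.0 / divergence


def select_above_threshold(outputs, threshold):
    """Return indices whose first similarity exceeds the second preset threshold."""
    return [i for i, out in enumerate(outputs) if first_similarity(out) > threshold]


if __name__ == "__main__":
    first_outputs = [
        [0.08, 0.91, 0.01],
        [0.30, 0.30, 0.40],
        [0.02, 0.03, 0.95],
    ]
    targets = select_above_threshold(first_outputs, threshold=5.0)
    print(targets, len(targets))
```

Unlike the top-N variant, the number of selected items M is not fixed in advance: it is simply the length of the returned list.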
本实施例中,通过将获取到的无标签数据输入待压缩网络中,求取所得到的输出结果的独热标签,并衡量输出结果与独热标签之间的相似度,以将相似度较高的输出结果对应的无标签数据作为用于压缩该待压缩网络的数据。通过该方法,能够获得大量与待压缩网络的原训练数据相近的数据,从而保证能够有效地实现网络的压缩。
在一个可能的实施例中,在数据处理装置获得上述的至少一个目标数据之后,数据处理装置还可以通过蒸馏法压缩待压缩网络,得到目标网络。
在蒸馏法中,通常包含有两个网络:一个为预训练好的,具有较强性能,但是计算复杂度高且要求较大存储空间的教师网络;另一个为待训练的,但是具有较低的计算复杂度以及存储空间要求小的学生网络。蒸馏法旨在从教师网络中提取出有用的信息和知识来作为学生网络训练过程中的指导,以实现学生网络的训练。在教师网络的指导下进行训练学习,学生网络可以获得比单独训练更加优良的性能。也就是说,通过蒸馏法可以得到高性能、低计算复杂度以及低存储消耗的学生网络。
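In this embodiment, compressing the network to be compressed by distillation may specifically be as follows: the data processing apparatus obtains in advance a student network with lower computational complexity and uses the network to be compressed as the teacher network; it then trains the student network on the obtained target data, extracting useful information from the teacher network to guide the training, and finally obtains the target network.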
示例性地,可以参阅图6,图6为本申请实施例提供的一种压缩待压缩网络的流程示意图。如图6所示,数据处理装置通过蒸馏法压缩待压缩网络的过程可以包括以下的步骤。
步骤601,数据处理装置获取学生网络。
本实施例中,该学生网络可以为一个构建好的神经网络,能够用于实现数据的分类,例如深度神经网络。数据处理装置可以是通过多种方式来获得学生网络。
在一种可能的实现方式中,在数据处理装置中可以预置有一个或多个预先构建好的学生网络,该一个或多个学生网络可以是特定人员构建好且预置于数据处理装置中的。不同的学生网络可以具有不同的计算复杂度,以及对存储空间的要求也不相同。数据处理装置可以根据待压缩网络的压缩要求,例如压缩后所占的存储空间的大小、压缩后的计算复杂度等指标,确定能够满足压缩要求的学生网络。
在另一种可能的实现方式中,用户可以是同时向数据处理装置上传待压缩网络以及学生网络,数据处理装置可以通过获取用户所上传的数据来获得学生网络。
在另一种可能的实现方式中,数据处理装置还可以是在获取到用户的压缩要求之后,根据用户的压缩要求,自动构建一个能够满足压缩要求的学生网络。例如,在用户的压缩要求为压缩后的网络所占据的存储空间低于1千兆(Gigabyte,GB)时,数据处理装置可以基于该压缩要求构建一个存储空间要求低于1GB的学生网络。
步骤602,数据处理装置将至少一个目标数据分别输入学生网络和待压缩网络,得到学生网络的第二输出结果和待压缩网络的第三输出结果。
在获得学生网络之后,数据处理装置可以基于该至少一个目标数据对学生网络进行训练。具体地,数据处理装置可以将该至少一个目标数据中的一个目标数据分别输入至学生网络和待压缩网络中,得到该目标数据在学生网络中对应的第二输出结果以及该目标数据在待压缩网络中对应的第三输出结果。
步骤603,数据处理装置根据第二输出结果和第三输出结果确定损失函数。
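During the training of the student network, the training needs to be guided by the knowledge extracted from the network to be compressed (that is, the teacher network). Therefore, the loss function used to train the student network may be determined based on the second output result of the student network and the third output result of the network to be compressed.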
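In general, in distillation, the loss function of the student network may consist of two terms: one is the similarity between the output of the student network and the true label of the input data, and the other is the similarity between the output of the student network and the output of the teacher network. For example, when similarity is expressed as KL divergence, the loss function of the student network may be as shown in Formula 4:
$loss = D_{KL}(y_S, y) + D_{KL}(y_S, y_T)$     (Formula 4)
where loss denotes the loss function, $y_S$ denotes the output of the student network, $y$ denotes the true label of the input data, $D_{KL}(y_S, y)$ denotes the KL divergence between the output of the student network and the true label of the input data, $y_T$ denotes the output of the teacher network, and $D_{KL}(y_S, y_T)$ denotes the KL divergence between the output of the teacher network and the output of the student network. Because the output of the teacher network is not necessarily accurate, the similarity between the output of the student network and the true label of the input data can effectively correct erroneous outputs of the teacher network.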
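In this embodiment, because the target data obtained by the data processing apparatus has no true labels, the apparatus cannot obtain the similarity between the output of the student network and the true label of the input data. The apparatus may therefore determine the loss function as follows: it determines a second similarity between the second output result and the third output result, and determines the loss function at least according to the second similarity. For example, one loss function provided in this embodiment may be as shown in Formula 5:
$loss = D_{KL}(y_S, y_T)$      (Formula 5)
where loss denotes the loss function, $y_S$ denotes the output of the student network, $y_T$ denotes the output of the teacher network, and $D_{KL}(y_S, y_T)$ denotes the KL divergence between the output of the teacher network and the output of the student network.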
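The label-free loss of Formula 5 can be sketched as a single KL term between the two networks' output distributions. The direction used below, KL(teacher || student), is the one commonly used in distillation and is an assumption; the text only writes $D_{KL}(y_S, y_T)$. The names `kl_divergence` and `distillation_loss` are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), treating 0 * log(0) as 0."""
    return sum(
        p_i * math.log(p_i / max(q_i, eps)) for p_i, q_i in zip(p, q) if p_i > 0.0
    )


def distillation_loss(student_probs, teacher_probs):
    """Second similarity only: how far the student output is from the teacher output."""
    return kl_divergence(teacher_probs, student_probs)


if __name__ == "__main__":
    teacher = [0.08, 0.91, 0.01]   # third output result
    student = [0.30, 0.50, 0.20]   # second output result
    print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge.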
步骤604,数据处理装置根据损失函数,训练学生网络,直至损失函数收敛,得到目标网络。
在得到损失函数之后,数据处理装置可以基于该损失函数对学生网络进行训练,直至损失函数收敛,以得到训练好的学生网络,即该待压缩网络对应的压缩后的目标网络。简单来说,数据处理装置基于损失函数训练学生网络的过程可以是:数据处理装置将多个目标数据中的一个目标数据输入学生网络和待压缩网络中,基于两个网络的输出结果计算得到损失函数,然后根据损失函数调整学生网络的参数,并且重复执行将多个目标数据中的下一个目标数据输入学生网络和待压缩网络以及基于新计算到的损失函数调整学生网络的参数的过程,直至计算得到的损失函数小于预设的阈值。
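The iterate-until-below-threshold procedure can be illustrated with a deliberately tiny stand-in for the student network. In the sketch below the "student" is just a vector of free logits fitted to one teacher output by gradient descent; a real student would be a neural network trained over many target data items. The learning rate, the threshold, and all names are assumptions. The gradient of KL(teacher || softmax(z)) with respect to the logits z is softmax(z) minus the teacher distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), treating 0 * log(0) as 0."""
    return sum(
        p_i * math.log(p_i / max(q_i, eps)) for p_i, q_i in zip(p, q) if p_i > 0.0
    )


def train_student(teacher_probs, lr=1.0, loss_threshold=1e-4, max_steps=10000):
    """Adjust the student's parameters until the loss drops below the threshold."""
    logits = [0.0] * len(teacher_probs)          # toy student parameters
    loss = float("inf")
    for _ in range(max_steps):
        student_probs = softmax(logits)
        loss = kl_divergence(teacher_probs, student_probs)
        if loss < loss_threshold:                # stopping criterion from the text
            break
        # Gradient of the loss with respect to each logit is (student - teacher).
        logits = [
            z - lr * (s - t) for z, s, t in zip(logits, student_probs, teacher_probs)
        ]
    return softmax(logits), loss


if __name__ == "__main__":
    probs, final_loss = train_student([0.08, 0.91, 0.01])
    print(probs, final_loss)
```

The `max_steps` cap is a safeguard added here so the loop terminates even if the threshold is never reached.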
在本申请实施例中,由于数据处理装置所获得的目标数据不具有真实标签,数据处理装置无法获得学生网络的输出结果与输入数据的真实标签的相似度,即数据处理装置无法基于学生网络的输出结果与输入数据的真实标签的相似度来纠正教师网络中错误的输出结果。基于此,在本实施例中,通过对损失函数进行调整,引入概率转移矩阵,以纠正教师网络中错误的输出结果。
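In a possible embodiment, determining the loss function according to the second output result and the third output result may further include the following. The data processing apparatus determines a fourth output result according to the second output result and a probability transition matrix, that is, it multiplies the second output result by the probability transition matrix to correct the second output result and obtain the fourth output result. The apparatus determines the one-hot label corresponding to the third output result (the output of the teacher network); this one-hot label is the label predicted by the teacher network. The apparatus then determines a third similarity between the fourth output result and the one-hot label corresponding to the third output result, and determines the loss function according to the second similarity and the third similarity.
For example, the loss function may be as shown in Formula 6:
$loss = D_{KL}(Q(y_S), t) + D_{KL}(y_S, y_T)$             (Formula 6)
where loss is the loss function, $Q$ is the probability transition matrix, $y_S$ is the second output result (the output of the student network), $Q(y_S)$ is the fourth output result (the product of the output of the student network and the probability transition matrix), $y_T$ is the third output result (the output of the teacher network), $t$ is the one-hot label corresponding to the third output result, that is, the label predicted by the teacher network, and $D_{KL}()$ denotes the KL divergence.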
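A sketch of the two-term loss of Formula 6 follows. It assumes that $Q(y_S)$ is the row vector $y_S$ multiplied by the matrix Q and that each KL term is evaluated with the teacher-side quantity as the reference distribution; both are assumptions about conventions the text leaves open. All function names are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), treating 0 * log(0) as 0."""
    return sum(
        p_i * math.log(p_i / max(q_i, eps)) for p_i, q_i in zip(p, q) if p_i > 0.0
    )


def apply_transition(probs, q_matrix):
    """Row vector times matrix: entry j is sum_i probs[i] * Q[i][j]."""
    n = len(probs)
    return [sum(probs[i] * q_matrix[i][j] for i in range(n)) for j in range(n)]


def one_hot(probs):
    """One-hot label of the most probable class."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


def corrected_distillation_loss(student_probs, teacher_probs, q_matrix):
    """Third similarity (through Q) plus second similarity (direct)."""
    fourth_output = apply_transition(student_probs, q_matrix)
    teacher_label = one_hot(teacher_probs)
    third_term = kl_divergence(teacher_label, fourth_output)
    second_term = kl_divergence(teacher_probs, student_probs)
    return third_term + second_term


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(corrected_distillation_loss([0.6, 0.4], [0.6, 0.4], identity))
```

With an identity matrix Q the first term compares the student output directly with the teacher's predicted label; a learned, non-identity Q lets the student's output differ from that label where the teacher is systematically wrong.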
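Compared with the loss function in Formula 5, the loss function in Formula 6 introduces a new KL divergence term: the KL divergence between the product of the student network's output and the probability transition matrix, and the label predicted by the teacher network. Because the label predicted by the teacher network may be wrong, this embodiment introduces the probability transition matrix Q to correct it, so that the output of the student network is the correct label. In other words, the correct output of the student network, after passing through the probability transition matrix (which acts as a noise transition matrix), yields the possibly erroneous label t predicted by the teacher network. In addition, the probability transition matrix Q may be trained together with the student network during the training of the student network.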
也就是说,在数据处理装置训练学生网络的过程中,数据处理装置可以是基于损失函数一并对概率转移矩阵进行训练。即概率转移矩阵并非是固定的,在数据处理装置训练学生网络的过程中,数据处理装置同样可以对概率转移矩阵进行调整。通过引入概率转移矩阵来对教师网络的预测标签进行纠正,可以在训练数据为无标签数据的情况下,提高网络压缩的效果,保证压缩后的网络的预测准确性。
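For example, assuming that the second output result is an n-dimensional label, the probability transition matrix may be an n*n matrix in which the elements of each row sum to 1. Specifically, one possible probability transition matrix may be as shown in Formula 7:
$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}$     (Formula 7)
where $A$ denotes the probability transition matrix, $a_{11}$, $a_{1n}$, $a_{n1}$, and $a_{nn}$ are elements of the probability transition matrix, and $a_{11} + a_{12} + \dots + a_{1n} = 1$.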
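For example, assume that the second output result $y_S$ is (0.6, 0.4) and the probability transition matrix Q is
$Q = \begin{pmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{pmatrix}$
Then the product $Q(y_S)$ of the second output result $y_S$ and the probability transition matrix Q is (0.48, 0.52). As another example, assume that the second output result $y_S$ is (0.95, 0.05) and the probability transition matrix Q is unchanged; the product $Q(y_S)$ is then (0.585, 0.415).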
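The two worked results above can be checked numerically. The matrix used below, with rows (0.6, 0.4) and (0.3, 0.7), is not copied from the original figure: it is the unique 2*2 matrix that reproduces both stated products when $y_S$ is treated as a row vector multiplied by Q, and its rows each sum to 1 as required. The function name `apply_transition` is illustrative.

```python
def apply_transition(probs, q_matrix):
    """Row vector times matrix: entry j is sum_i probs[i] * Q[i][j]."""
    n = len(probs)
    return [sum(probs[i] * q_matrix[i][j] for i in range(n)) for j in range(n)]


# Transition matrix inferred from the two worked examples (an assumption).
Q = [
    [0.6, 0.4],
    [0.3, 0.7],
]

if __name__ == "__main__":
    # Each row of a probability transition matrix must sum to 1.
    assert all(abs(sum(row) - 1.0) < 1e-12 for row in Q)
    print([round(v, 3) for v in apply_transition([0.6, 0.4], Q)])
    print([round(v, 3) for v in apply_transition([0.95, 0.05], Q)])
```

Because every row of Q sums to 1, the product of a probability vector with Q is again a probability vector, so the fourth output result can still be compared with a one-hot label.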
为便于理解,以下将结合具体的例子对本申请实施例提供的网络压缩方法进行介绍。
本实施例以一个用于图像分类的深度神经网络(Deep Neural Networks,DNN)为例,对压缩该深度神经网络的过程进行介绍。假设,用户使用自己拍摄或者创造的一些图片训练好了一个深度神经网络,并上传到服务器上,要求对该深度神经网络进行压缩。此时,服务器可以获知该深度神经网络的具体结构,但是由于该深度神经网络的训练数据是用户自己拍摄或创造的一些图片,出于数据过大或是包含个人隐私的考虑,用户并不愿意上传到服务器上,即服务器无法获取到该深度神经网络的原始训练数据。
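Specifically, this embodiment uses the CIFAR dataset as an example to show the compression effect of the proposed network compression method on a residual network (ResNet). The CIFAR dataset is a dataset collected and curated by developers that contains 60,000 tiny images.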
对于CIFAR数据集,可以使用ResNet-34网络作为用户上传的待压缩网络,并将ImageNet数据集作为服务器上的未标记数据集,ResNet-18网络作为待训练的学生网络。其中,ImageNet项目是一个用于视觉对象识别软件研究的大型可视化数据库,ImageNet数据集可以是该数据库中的部分或全部图像数据。
具体来说,网络压缩的过程可以包括以下步骤:
S1、服务器基于CIFAR-10数据集为训练数据,对ResNet-34网络结构进行训练,得到训练好的网络。通过步骤S1,可以模拟用户在基于自身的训练集上进行网络训练的过程。
S2、在得到训练好的ResNet-34网络后,服务器可以使用云上的ImageNet数据集作为无标签数据集,并且采用上述的方法500对ImageNet数据集进行筛选,得到目标数据集。具体地,服务器可以将ImageNet数据集输入到该训练好的ResNet-34网络中,并计算每张图像的输出结果及其输出结果的one-hot标签之间的KL散度,选择KL散度最小的500,000张图像作为训练集。
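S3. After obtaining the selected training set, the server may initialize the noise transition matrix Q and the student network, and compress the ResNet-34 network based on the distillation procedure of steps 601 to 604 described above and the training set, to obtain the compressed network.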
示例性地,本实施例中基于不同的算法以及不同的训练数据,进行了实验,具体的实验结果可以如表1所示:
表1
Figure PCTCN2021131686-appb-000006
Figure PCTCN2021131686-appb-000007
由表1可以看出,待压缩网络为未压缩的预训练模型,其准确率为94.85%。使用传统蒸馏算法以及原始训练数据对该待压缩网络进行压缩后,得到的网络的准确率为94.34%。在没有原始训练数据的情况下,采用本方案所提供的数据处理方法选择数据之后,采用传统蒸馏算法压缩得到的网络的准确率为93.55%。此外,在采用本方案所提供的数据处理方法选择数据之后,再采用本方案中改进后的蒸馏算法压缩得到的网络的准确率为94.02%。由此可见,采用本方案所提供的方法不仅能够解决相关技术中在缺乏原始训练数据的情况下无法压缩网络的问题,还能够保证压缩后的网络的准确率保持较高的水平。
具体地,可以参阅图7,图7为本申请实施例提供的一种网络压缩的流程示意图。如图7所示,用户通过原始训练数据训练获得了待压缩网络,即教师网络,且该原始训练数据是不可获得的。服务器根据本方案的数据处理方法选择得到了可以用于压缩教师网络的无标签数据。服务器通过将无标签数据输入至教师网络以及学生网络中,并基于蒸馏算法来实现学生网络的训练。对于教师网络中所输出的错误结果,可以通过本方案中的蒸馏算法进行纠正,从而使得学生网络能够输出正确的预测结果。具体地,在教师网络中,将“熊猫”分类为“太空船”,以及将“狐狸”分类为“狗”,而这些错误结果均在学生网络中纠正了过来。
可以参阅图8,图8为本申请实施例提供的一种数据处理装置的结构示意图。如图8所示,本申请实施例提供的一种数据处理装置,包括:获取单元801和处理单元802;所述获取单元801,用于获取待压缩网络和多个数据,所述待压缩网络为分类网络;所述处理单元802,用于将所述多个数据输入所述待压缩网络,得到多个第一输出结果,所述多个第一输出结果与所述多个数据之间一一对应;所述处理单元802,还用于确定所述多个第一输出结果中每个第一输出结果所对应的独热one-hot标签;所述处理单元802,还用于分别确定所述多个第一输出结果中每个第一输出结果与所述one-hot标签之间的第一相似度;所述处理单元802,还用于根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定至少一个目标数据,所述至少一个目标数据用于压缩所述待压缩网络。
可选地,在一种可能的实现方式中,所述one-hot标签为n维标签,所述n维标签包括1个值为1的标签值,以及n-1个值为0的标签值,所述n为大于1的整数。
可选地,在一种可能的实现方式中,所述处理单元802,还用于根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定第一相似度最大的N个目标数据,所述N为第一预设阈值且所述N为大于1的整数。
可选地,在一种可能的实现方式中,所述处理单元802,还用于根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定第一相似度大于第二预设阈值的M个目标数据。
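Optionally, in a possible implementation, the processing unit 802 is further configured to determine the first similarity by calculating the relative entropy or a distance metric between each of the plurality of first output results and the one-hot label.
Optionally, in a possible implementation, the distance metric includes a mean squared error (MSE) distance or an L1 distance.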
可选地,在一种可能的实现方式中,所述处理单元802,还用于通过蒸馏法压缩所述待压缩网络,得到目标网络。
可选地,在一种可能的实现方式中,所述获取单元801,还用于获取学生网络;所述处理单元802,还用于将所述至少一个目标数据分别输入所述学生网络和所述待压缩网络,得到所述学生网络的第二输出结果和所述待压缩网络的第三输出结果;所述处理单元802,还用于根据所述第二输出结果和所述第三输出结果确定损失函数;所述处理单元802,还用于根据所述损失函数,训练所述学生网络,直至所述损失函数收敛,得到所述目标网络。
可选地,在一种可能的实现方式中,所述处理单元802,还用于确定所述第二输出结果和所述第三输出结果之间的第二相似度;所述处理单元802,还用于至少根据所述第二相似度,确定所述损失函数。
可选地,在一种可能的实现方式中,所述处理单元802,还用于:根据所述第二输出结果和概率转移矩阵,确定第四输出结果;确定所述第三输出结果对应的one-hot标签;确定所述第四输出结果和所述第三输出结果对应的one-hot标签之间的第三相似度;根据所述第二相似度和所述第三相似度,确定所述损失函数。
可选地,在一种可能的实现方式中,所述多个数据包括图像数据、文本数据、视频数据或语音数据。
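Next, an execution device provided in the embodiments of this application is described. Refer to FIG. 9, which is a schematic structural diagram of the execution device. The execution device 900 may specifically be a mobile phone, a tablet, a notebook computer, a smart wearable device, a server, or the like, which is not limited here. The data processing apparatus described in the embodiment corresponding to FIG. 8 may be deployed on the execution device 900 to implement the data processing functions of that embodiment. Specifically, the execution device 900 includes a receiver 901, a transmitter 902, a processor 903, and a memory 904 (the execution device 900 may have one or more processors 903; FIG. 9 uses one processor as an example), where the processor 903 may include an application processor 9031 and a communication processor 9032. In some embodiments of this application, the receiver 901, the transmitter 902, the processor 903, and the memory 904 may be connected by a bus or in another manner.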
存储器904可以包括只读存储器和随机存取存储器,并向处理器903提供指令和数据。存储器904的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器904存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器903控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态 信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器903中,或者由处理器903实现。处理器903可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器903中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器903可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器903可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器904,处理器903读取存储器904中的信息,结合其硬件完成上述方法的步骤。
接收器901可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器902可用于通过第一接口输出数字或字符信息;发射器902还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器902还可以包括显示屏等显示设备。
本申请实施例中,在一种情况下,处理器903,用于执行图5对应实施例中的执行设备执行的数据处理方法。
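An embodiment of this application further provides a computer program product which, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing training device.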
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图10,图10为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 1000,NPU 1000作为协处理器挂载到主CPU(Host  CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1003,通过控制器1004控制运算电路1003提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1003内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1003是二维脉动阵列。运算电路1003还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1003是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1002中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1001中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1008中。
统一存储器1006用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1005,DMAC被搬运到权重存储器1002中。输入数据也通过DMAC被搬运到统一存储器1006中。
BIU为Bus Interface Unit即,总线接口单元1013,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1009的交互。
总线接口单元1013(Bus Interface Unit,简称BIU),用于取指存储器1009从外部存储器获取指令,还用于存储单元访问控制器1005从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
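The DMAC is mainly configured to move input data from the external memory DDR to the unified memory 1006, to move weight data to the weight memory 1002, or to move input data to the input memory 1001.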
向量计算单元1007包括多个运算处理单元,在需要的情况下,对运算电路1003的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1007能将经处理的输出的向量存储到统一存储器1006。例如,向量计算单元1007可以将线性函数;或,非线性函数应用到运算电路1003的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1007生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1003的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1004连接的取指存储器(instruction fetch buffer)1009,用于存储控制器1004使用的指令;
统一存储器1006,输入存储器1001,权重存储器1002以及取指存储器1009均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件 说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (25)

  1. 一种数据处理方法,其特征在于,包括:
    获取待压缩网络和多个数据,所述待压缩网络为分类网络;
    将所述多个数据输入所述待压缩网络,得到多个第一输出结果,所述多个第一输出结果与所述多个数据之间一一对应;
    确定所述多个第一输出结果中每个第一输出结果所对应的独热one-hot标签;
    分别确定所述多个第一输出结果中每个第一输出结果与所述one-hot标签之间的第一相似度;
    根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定至少一个目标数据,所述至少一个目标数据用于压缩所述待压缩网络。
  2. 根据权利要求1所述的数据处理方法,其特征在于,所述one-hot标签为n维标签,所述n维标签包括1个值为1的标签值,以及n-1个值为0的标签值,所述n为大于1的整数。
  3. 根据权利要求1或2所述的数据处理方法,其特征在于,所述根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定至少一个目标数据,包括:
    根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定第一相似度最大的N个目标数据,所述N为第一预设阈值且所述N为大于1的整数。
  4. 根据权利要求1或2所述的数据处理方法,其特征在于,所述根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定至少一个目标数据,包括:
    根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定第一相似度大于第二预设阈值的M个目标数据。
  5. 根据权利要求1-4任意一项所述的数据处理方法,其特征在于,所述分别确定所述多个第一输出结果中每个第一输出结果与所述one-hot标签之间的第一相似度,包括:
    通过计算所述多个第一输出结果中每个第一输出结果与所述one-hot标签之间的相对熵或距离度量,来确定所述第一相似度。
  6. The data processing method according to claim 5, wherein the distance metric includes a mean squared error (MSE) distance or an L1 distance.
  7. 根据权利要求1-6任意一项所述的数据处理方法,其特征在于,所述方法还包括:
    通过蒸馏法压缩所述待压缩网络,得到目标网络。
  8. 根据权利要求7所述的数据处理方法,其特征在于,所述通过蒸馏法压缩所述待压缩网络,得到目标网络,包括:
    获取学生网络;
    将所述至少一个目标数据分别输入所述学生网络和所述待压缩网络,得到所述学生网络的第二输出结果和所述待压缩网络的第三输出结果;
    根据所述第二输出结果和所述第三输出结果确定损失函数;
    根据所述损失函数,训练所述学生网络,直至所述损失函数收敛,得到所述目标网络。
  9. 根据权利要求8所述的数据处理方法,其特征在于,根据所述第二输出结果和所述第三输出结果确定损失函数,包括:
    确定所述第二输出结果和所述第三输出结果之间的第二相似度;
    至少根据所述第二相似度,确定所述损失函数。
  10. 根据权利要求9所述的数据处理方法,其特征在于,所述根据所述第二输出结果和所述第三输出结果确定损失函数,还包括:
    根据所述第二输出结果和概率转移矩阵,确定第四输出结果;
    确定所述第三输出结果对应的one-hot标签;
    确定所述第四输出结果和所述第三输出结果对应的one-hot标签之间的第三相似度;
    所述至少根据所述第二相似度,确定所述损失函数,包括:
    根据所述第二相似度和所述第三相似度,确定所述损失函数。
  11. 根据权利要求1-10任意一项所述的数据处理方法,其特征在于,所述多个数据包括图像数据、文本数据、视频数据或语音数据。
  12. 一种数据处理装置,其特征在于,包括获取单元和处理单元;
    所述获取单元,用于获取待压缩网络和多个数据,所述待压缩网络为分类网络;
    所述处理单元,用于将所述多个数据输入所述待压缩网络,得到多个第一输出结果,所述多个第一输出结果与所述多个数据之间一一对应;
    所述处理单元,还用于确定所述多个第一输出结果中每个第一输出结果所对应的独热one-hot标签;
    所述处理单元,还用于分别确定所述多个第一输出结果中每个第一输出结果与所述one-hot标签之间的第一相似度;
    所述处理单元,还用于根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定至少一个目标数据,所述至少一个目标数据用于压缩所述待压缩网络。
  13. 根据权利要求12所述的数据处理装置,其特征在于,所述one-hot标签为n维标签,所述n维标签包括1个值为1的标签值,以及n-1个值为0的标签值,所述n为大于1的整数。
  14. 根据权利要求12或13所述的数据处理装置,其特征在于,所述处理单元,还用于根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定第一相似度最大的N个目标数据,所述N为第一预设阈值且所述N为大于1的整数。
  15. 根据权利要求12或13所述的数据处理装置,其特征在于,所述处理单元,还用于根据所述多个第一输出结果中每个第一输出结果对应的所述第一相似度,在所述多个数据中确定第一相似度大于第二预设阈值的M个目标数据。
  16. 根据权利要求12-15任意一项所述的数据处理装置,其特征在于,所述处理单元,还用于通过计算所述多个第一输出结果中每个第一输出结果与所述one-hot标签之间的相对熵或距离度量,来确定所述第一相似度。
  17. The data processing apparatus according to claim 16, wherein the distance metric includes a mean squared error (MSE) distance or an L1 distance.
  18. 根据权利要求12-17任意一项所述的数据处理装置,其特征在于,所述处理单元,还用于通过蒸馏法压缩所述待压缩网络,得到目标网络。
  19. 根据权利要求18所述的数据处理装置,其特征在于,
    所述获取单元,还用于获取学生网络;
    所述处理单元,还用于将所述至少一个目标数据分别输入所述学生网络和所述待压缩网络,得到所述学生网络的第二输出结果和所述待压缩网络的第三输出结果;
    所述处理单元,还用于根据所述第二输出结果和所述第三输出结果确定损失函数;
    所述处理单元,还用于根据所述损失函数,训练所述学生网络,直至所述损失函数收敛,得到所述目标网络。
  20. 根据权利要求19所述的数据处理装置,其特征在于,所述处理单元,还用于确定所述第二输出结果和所述第三输出结果之间的第二相似度;
    所述处理单元,还用于至少根据所述第二相似度,确定所述损失函数。
  21. 根据权利要求20所述的数据处理装置,其特征在于,所述处理单元,还用于:
    根据所述第二输出结果和概率转移矩阵,确定第四输出结果;
    确定所述第三输出结果对应的one-hot标签;
    确定所述第四输出结果和所述第三输出结果对应的one-hot标签之间的第三相似度;
    根据所述第二相似度和所述第三相似度,确定所述损失函数。
  22. 根据权利要求12-21任意一项所述的数据处理装置,其特征在于,所述多个数据包括图像数据、文本数据、视频数据或语音数据。
  23. 一种数据处理装置,其特征在于,包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为执行所述代码,当所述代码被执行时,所述数据处理装置执行如权利要求1至11任意一项所述的方法。
  24. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有指令,所述指令在由计算机执行时使得所述计算机实施权利要求1至11任意一项所述的方法。
  25. 一种计算机程序产品,其特征在于,所述计算机程序产品存储有指令,所述指令在由计算机执行时使得所述计算机实施权利要求1至11任意一项所述的方法。
PCT/CN2021/131686 2020-11-30 2021-11-19 一种数据处理方法及相关装置 WO2022111387A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011381498.2A CN112529149A (zh) 2020-11-30 2020-11-30 一种数据处理方法及相关装置
CN202011381498.2 2020-11-30

Publications (1)

Publication Number Publication Date
WO2022111387A1 true WO2022111387A1 (zh) 2022-06-02

Family

ID=74995643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131686 WO2022111387A1 (zh) 2020-11-30 2021-11-19 一种数据处理方法及相关装置

Country Status (2)

Country Link
CN (1) CN112529149A (zh)
WO (1) WO2022111387A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529149A (zh) * 2020-11-30 2021-03-19 华为技术有限公司 一种数据处理方法及相关装置
CN113746870B (zh) * 2021-11-05 2022-02-08 山东万网智能科技有限公司 一种物联网设备数据智能传输方法及系统
CN114115282B (zh) * 2021-11-30 2024-01-19 中国矿业大学 一种矿山辅助运输机器人无人驾驶装置及其使用方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030103667A1 (en) * 2001-12-05 2003-06-05 New Mexico Technical Research Foundation Neural network model for compressing/decompressing image/acoustic data files
CN108846445A (zh) * 2018-06-26 2018-11-20 清华大学 一种基于相似性学习的卷积神经网络滤波器剪枝技术
CN110880036A (zh) * 2019-11-20 2020-03-13 腾讯科技(深圳)有限公司 神经网络压缩方法、装置、计算机设备及存储介质
CN111291860A (zh) * 2020-01-13 2020-06-16 哈尔滨工程大学 一种基于卷积神经网络特征压缩的异常检测方法
CN112529149A (zh) * 2020-11-30 2021-03-19 华为技术有限公司 一种数据处理方法及相关装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3355547B1 (en) * 2017-01-27 2020-04-15 Vectra AI, Inc. Method and system for learning representations of network flow traffic
CN111091177B (zh) * 2019-11-12 2022-03-08 腾讯科技(深圳)有限公司 一种模型压缩方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
CN112529149A (zh) 2021-03-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896888

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896888

Country of ref document: EP

Kind code of ref document: A1