CN114037074A - Model pruning method and device, electronic equipment and storage medium

Model pruning method and device, electronic equipment and storage medium

Info

Publication number
CN114037074A
CN114037074A (application CN202111322412.3A)
Authority
CN
China
Prior art keywords: pruning, mask, vector matrix, layer, head
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111322412.3A
Other languages
Chinese (zh)
Inventor
李建伟 (Li Jianwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111322412.3A
Publication of CN114037074A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model pruning method, a model pruning device, an electronic device, a readable storage medium, and a computer program product, and relates to the fields of computer vision and deep learning. The specific implementation scheme is as follows: a pruning mask corresponding to a multi-head self-attention module in a target sub-layer is determined according to the image feature vector matrix output by the target sub-layer, where the target sub-layer is a sub-layer of a visual converter model and the pruning mask identifies the self-attention heads to be pruned in the multi-head self-attention module; the multi-head self-attention module is then pruned using the pruning mask. The scheme determines a pruning mask identifying the self-attention heads to be pruned and uses it to prune the multi-head self-attention module of the target sub-layer of the visual converter model, thereby reducing the resources consumed by deploying and running the model.

Description

Model pruning method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning technology, and can be used in computer vision, deep learning, and other scenarios.
Background
A Transformer model is a deep neural network model based mainly on the self-attention mechanism, and was originally applied in Natural Language Processing (NLP). Because of its strong representation capability, the Transformer model has also seen rapid development in the field of computer vision in recent years.
Compared with the CNN-based models commonly used in the field of computer vision, deploying and running a Vision Transformer model generally consumes enormous resources. How to reduce the resources consumed by the deployment and computation of the Vision Transformer model has therefore become an urgent problem to be solved when applying it.
Disclosure of Invention
The present disclosure provides a model pruning method, apparatus, electronic device, readable storage medium, and computer program product to reduce resources consumed for visual transformer model deployment and computation.
According to an aspect of the present disclosure, there is provided a model pruning method, which may include the steps of:
determining a pruning mask corresponding to a multi-head self-attention module in a target sub-layer according to an image feature vector matrix output by the target sub-layer, wherein the target sub-layer is a sub-layer of a visual converter model, and the pruning mask is used for identifying a self-attention head to be pruned in the multi-head self-attention module;
and pruning the multi-head self-attention module by using a pruning mask.
According to a second aspect of the present disclosure, there is provided a model pruning device, which may include:
a pruning mask determining unit, configured to determine a pruning mask corresponding to the multi-head self-attention module in the target sub-layer according to an image feature vector matrix output by the target sub-layer, where the target sub-layer is a sub-layer of the visual converter model, and the pruning mask is used to identify a self-attention head to be pruned in the multi-head self-attention module;
and the module pruning unit is used for pruning the multi-head self-attention module by utilizing the pruning mask.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the method in any of the embodiments of the present disclosure.
The disclosed technology can determine a pruning mask identifying the self-attention heads to be pruned in the multi-head self-attention module and use it to prune those heads from the multi-head self-attention module of a target sub-layer of the visual converter model. The resources consumed by deploying and running the visual converter model can thereby be reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a model pruning method provided in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a Vision Transformer model provided in an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a self-attention head pruning process provided in an embodiment of the present disclosure;
fig. 4 is a flowchart of a pruning mask determination method provided in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model pruning device according to an embodiment of the present disclosure;
fig. 6 is a schematic view of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An embodiment of the present disclosure provides a model pruning method, and specifically, refer to fig. 1, which is a flowchart of a model pruning method provided in an embodiment of the present disclosure. The method may comprise the steps of:
step S101: and determining a pruning mask corresponding to the multi-head self-attention module in the target sub-layer according to the image feature vector matrix output by the target sub-layer, wherein the target sub-layer is a sub-layer of the visual converter model, and the pruning mask is used for identifying the self-attention head to be pruned in the multi-head self-attention module.
Step S102: and pruning the multi-head self-attention module by using a pruning mask.
The model pruning method provided by the embodiment of the disclosure can determine the pruning mask for identifying the self-attention head to be pruned in the multi-head self-attention module, and prune the self-attention head of the multi-head self-attention module of the target sub-layer of the visual converter model by using the pruning mask. Therefore, resources consumed by deployment and operation of the vision converter model can be reduced.
A Vision Transformer model generally includes an image vector conversion layer, an encoding layer, and a decoding layer (the encoding and decoding layers are collectively called Transformer Layers), and the encoding and decoding layers each comprise a plurality of sub-layers. Specifically, a coding sub-layer is a sub-layer of the encoding layer of the visual converter model, and a decoding sub-layer is a sub-layer of its decoding layer.
The model pruning method provided by the embodiment of the present disclosure applies to any one or more of the coding sub-layers and decoding sub-layers of the Vision Transformer model. That is, at least one sub-layer may be selected as a target sub-layer from among the plurality of coding and decoding sub-layers. Specifically, the target sub-layer can be chosen from the coding and decoding sub-layers according to the preset resource consumption and the model output accuracy required of the visual converter model.
The principle by which pruning self-attention heads in the multi-head self-attention module reduces the resources consumed by deploying and running the visual converter model is described in detail below with reference to fig. 2, a schematic structural diagram of a Vision Transformer model provided in an embodiment of the present disclosure. Note that fig. 2 shows only the image vector conversion layer and the coding sub-layers of the Vision Transformer model.
The image vector conversion layer mainly performs a linear transformation and/or pixel flattening on the input image, reshaping it into vectors that are fed to the encoding layer. Each coding sub-layer in the encoding layer corresponds to an encoder, which in turn consists of a normalization module, a multi-head self-attention module (Multi-Head Attention), and a multi-layer perceptron module (MLP, which generally has two layers).
In a practical use scenario, the image is typically divided into a plurality of patches of equal resolution. Each patch corresponds to one input position of the model, and after the image vector conversion layer, one feature vector is generated per patch, so the number of feature vectors equals the number of patches. The feature vectors are then passed sequentially through the coding sub-layers, and the feature vectors output by the last coding sub-layer are fed to a classifier to obtain a classification result. The classification result may be a probability value, for example a 90% probability that the input image is a dog and a 10% probability that it is a cat.
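To make these shapes concrete, here is a minimal numpy sketch of the patch-to-token conversion. The 48×48 image size, the zero-initialized projection matrix, and the variable names are assumptions chosen so that nine 16×16 patches yield the [9, 384] token matrix used in the example later in this description.

    import numpy as np

    # Assumed setup: a 48x48 RGB image cut into nine 16x16 patches, each
    # flattened and linearly projected to a 384-dim token.
    img = np.zeros((48, 48, 3))
    P, D = 16, 384

    # Rearrange into [9, 768]: one flattened 16x16x3 patch per row.
    patches = (img.reshape(3, P, 3, P, 3)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(9, P * P * 3))

    W = np.zeros((P * P * 3, D))  # stand-in for the learned projection weights
    tokens = patches @ W          # [9, 384]: one feature vector per patch
    print(tokens.shape)           # (9, 384)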
For a given coding sub-layer, the computation required to process multiple patches simultaneously can be derived from equations (1)-(4): equations (1)-(3) estimate the computation of the three main steps of the coding sub-layer, and equation (4) gives the total for the whole sub-layer. Here N is the number of input patches (equivalently, the number of input feature vectors), and D is the embedding dimension (embedding size/embedding dim), which during training is the product of the number of heads (also called self-attention heads or single self-attention computing heads) and the per-head dimension (dim, i.e. the length of each feature vector). [N, D] denotes a matrix of shape (N, D); [D, D], [N, N], and so on are analogous and are not detailed again here.
4 × ([N, D] × [D, D]) ⇒ 4ND² ——(1)
[N, D] × [D, N] + [N, N] × [N, D] ⇒ 2N²D ——(2)
[N, D] × [D, 4D] + [N, 4D] × [4D, D] ⇒ 8ND² ——(3)
12ND² + 2N²D ——(4)
As equations (1)-(4) show, the computation of a coding sub-layer can be reduced by reducing the number of self-attention heads in the multi-head self-attention module. Moreover, the heads of a multi-head self-attention module differ in importance: a head attending to an unimportant image region, for example, contributes little to the final output. Pruning the less important heads therefore reduces the computation of the coding sub-layer while keeping the output accurate, which in turn reduces the resources consumed by deploying and running the visual converter model.
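To make the savings concrete, the following sketch evaluates equation (4) before and after pruning two heads. The token count and head geometry (197 tokens, 6 heads of 64 dimensions each) are assumed, ViT-Small-like values, not figures from this disclosure.

    def sublayer_cost(n: int, d: int) -> int:
        """Multiply-accumulate estimate per coding sub-layer, from equation (4)."""
        return 12 * n * d**2 + 2 * n**2 * d

    N = 197                            # assumed tokens: 14x14 patches + class token
    full = sublayer_cost(N, 6 * 64)    # 6 heads x 64 dims per head -> D = 384
    pruned = sublayer_cost(N, 4 * 64)  # two of six heads pruned    -> D = 256
    print(f"full: {full:,}  pruned: {pruned:,}  saved: {1 - pruned / full:.1%}")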
Equations (1)-(4) also show that compressing the embedding dim (also referred to as feature dim) would reduce the resources consumed by the deployment and computation of the visual converter model. However, each self-attention head in a multi-head self-attention module has its own physical meaning, and merely shrinking the embedding (vector) dimension is likely to cut into heads that carry rich information. Pruning whole self-attention heads from the multi-head self-attention module of the target sub-layer of the visual converter model therefore preserves model output accuracy while still reducing the resources consumed by deployment and operation.
In order to accurately determine the self-attention head to be pruned, in the embodiment of the present disclosure, the pruning mask is obtained in the following manner, please refer to fig. 3 specifically, and fig. 3 is a schematic diagram of a self-attention head pruning process provided in the embodiment of the present disclosure.
Assume an image is divided into 9 patches. After the image vector conversion layer applies a linear mapping or flattening operation (linear projection or flatten) to the 9 patches, 9 tokens are generated; assuming each token has size 384, the matrix formed by the tokens has dimensions [9, 384]. The tokens are input to the Transformer Layers, and their matrix dimensions remain unchanged through each Transformer Layer until the final output layer; that is, each Transformer Layer outputs a corresponding image feature vector matrix. After Transformer Layer i (the i-th sub-layer) outputs its image feature vector matrix, a pruning mask can be determined from that matrix as follows:
First, a first vector matrix is obtained by applying a separation operation to the image feature matrix in the color channel dimension; that is, the first vector matrix is the result of splitting the image feature matrix along the color channel dimension. Specifically, the image feature vector matrix of dimensions [9, 384] is transformed by reshape (matrix transformation) into a vector matrix of dimensions [9, 6, 64], and a split (separation) operation then yields two first vector matrices of dimensions [9, 6, 32] each. In the notation [N, H, C], N = 9, H = 6, and C = 32, where N is the number of input patches (or input feature vectors), H is the number of heads in the multi-head self-attention module, and C is the image color channel dimension.
Second, the first vector matrices are averaged in the color channel dimension to obtain a second vector matrix and a third vector matrix. Specifically, an averaging operation is performed on each of the two [9, 6, 32] feature vector matrices, yielding two matrices of dimensions [6, 32]: the second vector matrix and the third vector matrix.
Third, the second vector matrix is averaged over the multi-head dimension corresponding to the multi-head self-attention module to obtain a fourth vector matrix. That is, a further averaging operation over the multi-head dimension reduces the second vector matrix to a fourth vector matrix of dimensions [1, 32], which is globally a single vector.
Fourth, the fourth vector matrix is expanded to obtain a fifth vector matrix. Specifically, an expand operation yields a fifth vector matrix of dimensions [6, 32]; concretely, the expansion may copy the [1, 32] fourth vector matrix six times.
Fifth, the third vector matrix and the fifth vector matrix, which have the same dimensions, are spliced to obtain a sixth vector matrix. Specifically, the [6, 32] fifth vector matrix is spliced with the [6, 32] third vector matrix to obtain a sixth vector matrix of dimensions [6, 64]. Matrix splicing merges matrices with the same number of rows side by side (left-right) or matrices with the same number of columns one above the other (up-down). For example, for a = np.array([1, 2]) and b = np.array([2, 4]), the left-right combination np.hstack((a, b)) prints [1, 2, 2, 4].
Sixth, the sixth vector matrix is linearly transformed in the multi-head dimension to obtain a seventh vector matrix whose second matrix dimension is 2. Specifically, the sixth vector matrix is input to a linear mapping layer (linear projection) to obtain a seventh vector matrix of dimensions [6, 2].
Seventh, based on the seventh vector matrix, a pruning mask is determined.
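Putting the first six steps together, here is a minimal numpy sketch of the mask-derivation pipeline with random inputs and an untrained projection (both assumptions). The first averaging is taken over the 9-token axis, since that is the reduction consistent with the [6, 32] shapes stated above.

    import numpy as np

    rng = np.random.default_rng(0)
    N, H, C = 9, 6, 32                        # tokens, heads, channels per split

    x = rng.normal(size=(N, 384))             # image feature vector matrix [9, 384]
    x = x.reshape(N, H, 2 * C)                # reshape -> [9, 6, 64]
    a, b = np.split(x, 2, axis=-1)            # split   -> two first matrices [9, 6, 32]

    second = a.mean(axis=0)                   # average over tokens -> [6, 32]
    third = b.mean(axis=0)                    # average over tokens -> [6, 32]
    fourth = second.mean(axis=0, keepdims=True)      # average over heads -> [1, 32]
    fifth = np.broadcast_to(fourth, (H, C))          # expand -> [6, 32]
    sixth = np.concatenate([third, fifth], axis=-1)  # splice -> [6, 64]

    W = rng.normal(size=(2 * C, 2))           # stand-in for the linear mapping layer
    seventh = sixth @ W                       # linear map -> [6, 2]
    print(seventh.shape)                      # (6, 2)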
Referring to fig. 4, a specific implementation manner of determining a pruning mask based on a seventh vector matrix in the embodiment of the present disclosure is shown, and fig. 4 is a flowchart of a pruning mask determining method provided in the embodiment of the present disclosure.
Step S401: and carrying out normalization processing on the seventh vector matrix to obtain the distribution probability of the seventh vector matrix in different dimensions.
Step S402: and according to the distribution probability, using the vector of the specified dimension as a pruning mask.
In the embodiment of the disclosure, taking the vector of the specified dimension as the pruning mask according to the distribution probability ensures that the mask matrix accurately identifies the self-attention heads to be pruned in the multi-head self-attention module.
Referring again to fig. 3: in the embodiment of the disclosure, the [6, 2] seventh vector matrix may be passed through a normalization sub-layer (log-softmax layer) to obtain 2-dimensional distribution probabilities. The multi-head dimension has value 6 (H = 6) and the other dimension has value 2. After the distribution probabilities are obtained, an argsort operation over the probability distribution of the second dimension yields the vector at the specified index (e.g., subscript 1 in fig. 3) as the pruning mask.
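Continuing the sketch above from the seventh vector matrix, the normalization and argsort selection can be written as follows; reading off index 1 as the mask follows the subscript-1 convention of fig. 3 and is an assumption of this illustration.

    # Continuing from `seventh` ([6, 2]) in the earlier sketch.
    logp = seventh - np.log(np.exp(seventh).sum(axis=-1, keepdims=True))  # log-softmax
    order = logp.argsort(axis=-1)  # per-head ascending order over the 2-dim axis
    mask = order[:, 1]             # index of the larger log-prob per head: length-6 0/1 vector
    print(mask)                    # e.g. [1 0 1 1 1 1]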
To prune the low-importance self-attention heads, reducing the computation of the target sub-layer while keeping the output accurate, in the embodiment of the present disclosure the self-attention heads may be pruned as follows once the pruning mask is determined. First, the correspondence between the mask values in the pruning mask and the self-attention heads in the multi-head self-attention module is determined; each mask value corresponds one-to-one with a self-attention head. Then, when the pruning mask contains a mask whose value is zero, the self-attention head corresponding to that zero-valued mask is pruned based on the correspondence.
Specifically, if the pruning mask is [1, 0, 1, 1, 1, 1], the second self-attention head of the multi-head self-attention module is pruned (the pruning mask is multiplied by the image feature vector matrix). If the pruning mask is [0, 1, 0, 1, 1, 1], the first and third self-attention heads in the multi-head self-attention module are pruned while the remaining heads are retained.
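The following sketch shows, with assumed shapes, both ways a zero mask value can take effect: multiplying by the mask (as in the parenthetical above, and as used during training) versus physically dropping the head at inference time.

    import numpy as np

    prune_mask = np.array([1, 0, 1, 1, 1, 1])  # one mask value per self-attention head
    heads_out = np.ones((9, 6, 64))            # hypothetical per-head outputs [N, H, dim]

    # Multiplicative masking: the pruned head's output becomes zero.
    masked = heads_out * prune_mask[None, :, None]

    # Physical pruning: the pruned head is removed outright.
    kept = heads_out[:, prune_mask.astype(bool), :]
    print(masked.shape, kept.shape)            # (9, 6, 64) (9, 5, 64)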
To determine the pruning mask conveniently and accurately, the embodiment of the disclosure may input the image feature vector matrix into a mask derivation module to obtain the pruning mask. The mask derivation module is a module for determining the pruning mask that is trained alongside the visual converter model during its training process.
Specifically, the visual converter model is trained as follows: a sample image and its corresponding label are input into the visual converter model to be trained, and the model's output is obtained; the parameters of the model are then adjusted iteratively according to the loss function value computed from the output and the label, until the number of training iterations reaches a preset count, yielding the trained visual converter model.
During the training phase of the visual converter model to be trained, the mask derivation module also executes its corresponding steps, and its parameters are likewise adjusted continuously. However, to keep the data consistent during model training, the multi-head self-attention module is not actually pruned; instead, the output of any self-attention head whose mask value is zero is set to zero. For example, with pruning mask [0, 1, 0, 1, 1, 1], the outputs of the first and third self-attention heads in the multi-head self-attention module are set to zero.
In addition, in the model training stage, the pruning mask is not obtained by applying an argsort operation to the probability distribution of the second dimension after the distribution probabilities are computed. Instead, the probability distribution of the second dimension is first input into a Gumbel-softmax layer, which converts it into a one-hot encoding, and the vector on channel 1 is then taken as the pruning mask.
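A minimal numpy sketch of that training-time path follows, assuming unit temperature and omitting the straight-through gradient trick that a trainable implementation would add:

    import numpy as np

    def gumbel_softmax_onehot(logits, tau=1.0, rng=np.random.default_rng(0)):
        """Sample a hard one-hot vector per row via Gumbel-softmax (forward pass only)."""
        g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
        y = np.exp((logits + g) / tau)
        y /= y.sum(axis=-1, keepdims=True)
        return (y == y.max(axis=-1, keepdims=True)).astype(y.dtype)

    logits = np.zeros((6, 2))               # per-head logits over the two channels
    onehot = gumbel_softmax_onehot(logits)  # [6, 2] one-hot rows
    mask = onehot[:, 1]                     # the vector on channel 1 as the pruning mask
    print(mask)                             # e.g. [1. 0. 0. 1. 1. 0.]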
As shown in fig. 5, an embodiment of the present disclosure provides a model pruning device, including:
a pruning mask determining unit 501, configured to determine a pruning mask corresponding to a multi-head self-attention module in a target sub-layer according to an image feature vector matrix output by the target sub-layer, where the target sub-layer is a sub-layer of a visual converter model, and the pruning mask is used to identify a self-attention head to be pruned in the multi-head self-attention module;
the module pruning unit 502 is configured to prune the multi-head self-attention module by using a pruning mask.
In one embodiment, the pruning mask determining unit 501 may include:
the first vector matrix processing subunit is used for averaging a first vector matrix corresponding to the image feature vector matrix in the color channel dimension to obtain a second vector matrix and a third vector matrix, wherein the first vector matrix is obtained after the image feature matrix is separated in the color channel dimension;
the second vector matrix processing subunit is used for averaging the second vector matrix in multi-head dimensions corresponding to the multi-head self-attention module to obtain a fourth vector matrix;
the third vector matrix processing subunit is configured to splice the third vector matrix with a fifth vector matrix with the same dimensionality to obtain a sixth vector matrix, and the fifth vector matrix is obtained by performing expansion operation on the fourth vector matrix;
a fourth vector matrix processing subunit, configured to perform linear change on the sixth vector matrix in the multi-head dimension to obtain a seventh vector matrix with a second matrix dimension;
and the pruning mask determining subunit is used for determining the pruning mask based on the seventh vector matrix.
In one embodiment, the pruning mask determination subunit may include:
the matrix normalization subunit is used for performing normalization processing on the seventh vector matrix to obtain the distribution probabilities of the seventh vector matrix in different dimensions;
and the pruning mask obtaining subunit is used for taking the vector of the specified dimensionality as the pruning mask according to the distribution probability.
In one embodiment, the module pruning unit 502 may include:
a correspondence determining subunit, configured to determine a correspondence between a mask value in the pruning mask and a self-attention head in the multi-head self-attention module;
and the module pruning subunit is used for pruning the self-attention head corresponding to the mask with the mask value of zero based on the corresponding relation under the condition that the pruning mask comprises the mask with the mask value of zero.
In one embodiment, the pruning mask determining unit 501 may include:
and the mask derivation module subunit is used for inputting the image feature vector matrix into the mask derivation module to obtain the pruning mask, and the mask derivation module is a module which is trained simultaneously in the training process of the visual converter model and is used for determining the pruning mask.
In one embodiment, the target sub-layer includes at least one of an encoding sub-layer and a decoding sub-layer, the encoding sub-layer being a sub-layer in an encoding layer of the visual transformer model, and the decoding sub-layer being a sub-layer in a decoding layer of the visual transformer model.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the model pruning method. For example, in some embodiments, the model pruning method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the model pruning method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the model pruning method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable model pruning device such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A model pruning method, comprising:
determining a pruning mask corresponding to a multi-head self-attention module in a target sub-layer according to an image feature vector matrix output by the target sub-layer, wherein the target sub-layer is a sub-layer of a visual converter model, and the pruning mask is used for identifying a self-attention head to be pruned in the multi-head self-attention module;
and pruning the multi-head self-attention module by using the pruning mask.
2. The method according to claim 1, wherein the determining a pruning mask corresponding to a multi-head self-attention module in the target sub-layer according to the image feature vector matrix output by the target sub-layer comprises:
averaging a first vector matrix corresponding to the image feature vector matrix in a color channel dimension to obtain a second vector matrix and a third vector matrix, wherein the first vector matrix is obtained by performing separation operation on the image feature matrix in the color channel dimension;
averaging the second vector matrix in the multi-head dimensionality corresponding to the multi-head self-attention module to obtain a fourth vector matrix;
splicing the third vector matrix with a fifth vector matrix with the same dimensionality to obtain a sixth vector matrix, wherein the fifth vector matrix is obtained by expanding the fourth vector matrix;
linearly changing the sixth vector matrix in the multi-head dimension to obtain a seventh vector matrix with a second matrix dimension;
determining the pruning mask based on the seventh vector matrix.
3. The method of claim 2, wherein the determining the pruning mask based on the seventh vector matrix comprises:
carrying out normalization processing on the seventh vector matrix to obtain the distribution probability of the seventh vector matrix in different dimensions;
and according to the distribution probability, using the vector of the specified dimensionality as the pruning mask.
4. The method of any of claims 1 to 3, wherein said pruning the multi-headed self-attention module with the pruning mask comprises:
determining a correspondence between a mask value in the pruning mask and a self-attention head in the multi-head self-attention module;
and under the condition that the pruning mask comprises a mask with a mask value of zero, pruning the self-attention head corresponding to the mask with the mask value of zero based on the corresponding relation.
5. The method according to claim 1, wherein the determining a pruning mask corresponding to a multi-head self-attention module in the target sub-layer according to the image feature vector matrix output by the target sub-layer comprises:
and inputting the image feature vector matrix into a mask derivation module to obtain the pruning mask, wherein the mask derivation module is a module which is trained simultaneously in the training process of the visual converter model and is used for determining the pruning mask.
6. The method of claim 1, wherein the target sub-layer comprises at least one of an encoding sub-layer that is a sub-layer in an encoding layer of the visual transformer model and a decoding sub-layer that is a sub-layer in a decoding layer of the visual transformer model.
7. A model pruning device, comprising:
a pruning mask determining unit, configured to determine a pruning mask corresponding to a multi-head self-attention module in a target sub-layer according to an image feature vector matrix output by the target sub-layer, where the target sub-layer is a sub-layer of a visual converter model, and the pruning mask is used to identify a self-attention head to be pruned in the multi-head self-attention module;
and the module pruning unit is used for pruning the multi-head self-attention module by utilizing the pruning mask.
8. The apparatus of claim 7, wherein the pruning mask determination unit comprises:
a first vector matrix processing subunit, configured to obtain a second vector matrix and a third vector matrix by averaging, in a color channel dimension, a first vector matrix corresponding to the image feature vector matrix, where the first vector matrix is obtained by performing a separation operation on the image feature matrix in the color channel dimension;
the second vector matrix processing subunit is configured to average the second vector matrix over multiple-head dimensions corresponding to the multiple-head self-attention module to obtain a fourth vector matrix;
a third vector matrix processing subunit, configured to splice the third vector matrix with a fifth vector matrix having the same dimensionality to obtain a sixth vector matrix, where the fifth vector matrix is obtained by performing expansion operation on the fourth vector matrix;
a fourth vector matrix processing subunit, configured to perform linear change on the sixth vector matrix in the multi-head dimension to obtain a seventh vector matrix with a second matrix dimension;
a pruning mask determining subunit, configured to determine the pruning mask based on the seventh vector matrix.
9. The apparatus of claim 8, wherein the pruning mask determination subunit comprises:
the matrix normalization subunit is used for performing normalization processing on the seventh vector matrix to obtain the distribution probabilities of the seventh vector matrix in different dimensions;
and the pruning mask obtaining subunit is used for taking the vector of the specified dimensionality as the pruning mask according to the distribution probability.
10. The apparatus of any one of claims 7 to 9, wherein the modular pruning unit comprises:
a correspondence determining subunit, configured to determine a correspondence between a mask value in the pruning mask and a self-attention head in the multi-head self-attention module;
and the module pruning subunit is configured to, when the pruning mask includes a mask whose mask value is zero, prune the self-attention head corresponding to the mask whose mask value is zero based on the correspondence.
11. The apparatus of claim 8, wherein the pruning mask determination unit comprises:
a mask derivation module subunit, configured to input the image feature vector matrix into a mask derivation module to obtain the pruning mask, where the mask derivation module is a module that is trained simultaneously in a training process of the visual converter model and is used to determine the pruning mask.
12. The apparatus of claim 7, wherein the target sub-layer comprises at least one of an encoding sub-layer that is a sub-layer in an encoding layer of the visual transformer model and a decoding sub-layer that is a sub-layer in a decoding layer of the visual transformer model.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
15. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 6.
CN202111322412.3A 2021-11-09 2021-11-09 Model pruning method and device, electronic equipment and storage medium Pending CN114037074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111322412.3A CN114037074A (en) 2021-11-09 2021-11-09 Model pruning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111322412.3A CN114037074A (en) 2021-11-09 2021-11-09 Model pruning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114037074A 2022-02-11

Family

ID=80136971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111322412.3A Pending CN114037074A (en) 2021-11-09 2021-11-09 Model pruning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114037074A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819140A (en) * 2022-03-31 2022-07-29 支付宝(杭州)信息技术有限公司 Model pruning method and device and computer equipment
WO2023185209A1 (en) * 2022-03-31 2023-10-05 支付宝(杭州)信息技术有限公司 Model pruning
CN115170917A (en) * 2022-06-20 2022-10-11 美的集团(上海)有限公司 Image processing method, electronic device, and storage medium
CN115170917B (en) * 2022-06-20 2023-11-07 美的集团(上海)有限公司 Image processing method, electronic device and storage medium
CN115147669A (en) * 2022-06-24 2022-10-04 北京百度网讯科技有限公司 Image processing method, training method and equipment based on visual converter model
CN116611477A (en) * 2023-05-31 2023-08-18 北京百度网讯科技有限公司 Training method, device, equipment and medium for data pruning method and sequence model
CN116611477B (en) * 2023-05-31 2024-05-17 北京百度网讯科技有限公司 Training method, device, equipment and medium for data pruning method and sequence model

Similar Documents

Publication Publication Date Title
CN114037074A (en) Model pruning method and device, electronic equipment and storage medium
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
CN113901904A (en) Image processing method, face recognition model training method, device and equipment
US20230162477A1 (en) Method for training model based on knowledge distillation, and electronic device
CN113642583B (en) Deep learning model training method for text detection and text detection method
US20220245764A1 (en) Method for image super-resolution, device and storage medium
US20220130495A1 (en) Method and Device for Determining Correlation Between Drug and Target, and Electronic Device
CN114495102A (en) Text recognition method, and training method and device of text recognition network
CN114202648B (en) Text image correction method, training device, electronic equipment and medium
CN114821063A (en) Semantic segmentation model generation method and device and image processing method
CN114861758A (en) Multi-modal data processing method and device, electronic equipment and readable storage medium
CN114494814A (en) Attention-based model training method and device and electronic equipment
CN114821255A (en) Method, apparatus, device, medium and product for fusion of multimodal features
CN114596431A (en) Information determination method and device and electronic equipment
CN117746125A (en) Training method and device of image processing model and electronic equipment
CN113947700A (en) Model determination method and device, electronic equipment and memory
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN116796287A (en) Pre-training method, device, equipment and storage medium for graphic understanding model
CN113554057B (en) Graph segmentation method and device and electronic equipment
CN113963176B (en) Model distillation method and device, electronic equipment and storage medium
CN113642654A (en) Image feature fusion method and device, electronic equipment and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN113239693B (en) Training method, device, equipment and storage medium of intention recognition model
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114445682A (en) Method, device, electronic equipment, storage medium and product for training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination