CN115063673A - Model compression method, image processing method and device and cloud equipment

Publication number
CN115063673A
Authority
CN
China
Prior art keywords
network model
visual network
image
compression
model
Legal status
Granted
Application number
CN202210902200.0A
Other languages
Chinese (zh)
Other versions
CN115063673B (en)
Inventor
汪振宇
罗浩
王帆
李昊
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202210902200.0A
Publication of CN115063673A
Application granted
Publication of CN115063063B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The application provides a model compression method, an image processing method and apparatus, and cloud equipment. The model compression method includes: receiving a first image sent by a user terminal, where the first image is an image of a preset application scene; determining a compression mode for the trained first visual network model according to the preset application scene, where the compression mode includes channel compression and/or feature vector compression; determining frequency domain information of the first image; and compressing the first visual network model in the determined compression mode based on the frequency domain information to obtain a second visual network model. The compressed visual network model obtained by the method occupies less memory and has higher computational efficiency.

Description

Model compression method, image processing method and device and cloud equipment
Technical Field
The application relates to the technical field of computers, and in particular to a model compression method, an image processing method and apparatus, and a cloud device.
Background
With the introduction of the Transformer (a deep learning network), visual network models (Vision Transformer, ViT) built on the Transformer surpass the Convolutional Neural Network (CNN) in accuracy on many tasks, such as image classification, object detection, and semantic segmentation, so the dominant position of the currently used CNN in the field of computer vision is being shaken.
However, compared with CNN, ViT occupies more memory and has lower computational efficiency during operation.
Disclosure of Invention
Aspects of the present application provide a model compression method, an image processing method and apparatus, and a cloud device, so as to solve the problems that ViT occupies more memory and has lower computational efficiency during operation.
In a first aspect, an embodiment of the present invention provides a model compression method, which is applied to a server, where the model compression method includes: receiving a first image sent by a user terminal, wherein the first image is an image of a preset application scene; determining a compression mode of the trained first visual network model according to a preset application scene, wherein the compression mode comprises channel compression and/or feature vector compression; determining frequency domain information of a first image; and compressing the first visual network model by adopting a compression mode based on the frequency domain information to obtain a second visual network model.
A second aspect of the embodiments of the present application provides a model compression method, applied to a server, including: receiving a remote sensing image sent by a user terminal; determining low-frequency information in the remote sensing image; and based on the low-frequency information, compressing the trained first remote sensing model by adopting a channel compression mode to obtain a second remote sensing model.
A third aspect of the embodiments of the present application provides an image processing method, which is applied to a terminal, and the image processing method includes: acquiring an image to be processed; sending the image to be processed to a server for the server to identify the image to be processed by adopting a visual network model to obtain a processing result, wherein the visual network model is obtained according to the model compression method of the first aspect or the second aspect; and receiving the processing result sent by the server.
A fourth aspect of the embodiments of the present application provides a model compression apparatus, including:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for receiving a first image sent by a user terminal, and the first image is an image of a preset application scene;
the first determining module is used for determining a compression mode of the trained first visual network model according to a preset application scene, wherein the compression mode comprises channel compression and/or feature vector compression;
a second determining module, configured to determine frequency domain information of the first image;
and the compression module is used for compressing the first visual network model by adopting a compression mode based on the frequency domain information to obtain a second visual network model.
A fifth aspect of the embodiments of the present application provides a model compression apparatus, including:
the receiving module is used for receiving the remote sensing image sent by the user terminal;
the determining module is used for determining low-frequency information in the remote sensing image;
and the compression module is used for compressing the trained first remote sensing model by adopting a channel compression mode based on the low-frequency information to obtain a second remote sensing model.
A sixth aspect of the embodiments of the present application provides an image processing apparatus, comprising:
the acquisition module is used for acquiring an image to be processed;
the sending module is used for sending the image to be processed to the server, so that the server can identify the image to be processed by adopting a visual network model to obtain a processing result, wherein the visual network model is obtained according to the model compression method of the first aspect or the second aspect;
and the receiving module is used for receiving the processing result sent by the server.
A seventh aspect of the present embodiment provides a cloud device, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the model compression method of the first aspect or the second aspect or the image processing method of the third aspect when executing the computer program.
An eighth aspect of embodiments of the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the computer-executable instructions are used to implement the model compression method of the first aspect or the second aspect or the image processing method of the third aspect.
A ninth aspect of embodiments of the present application provides a computer program product, including: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the execution of which by the at least one processor causes the electronic device to perform the model compression method of the first or second aspect or the image processing method of the third aspect.
The method and apparatus of the present application are applied to image recognition scenes: a first image sent by a user terminal is received, where the first image is an image of a preset application scene; a compression mode for the trained first visual network model is determined according to the preset application scene, where the compression mode includes channel compression and/or feature vector compression; frequency domain information of the first image is determined; and the first visual network model is compressed in the determined compression mode based on the frequency domain information to obtain a second visual network model. According to the embodiments of the present application, a compressed visual network model can be obtained that occupies less memory and has higher computational efficiency without affecting recognition accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a block diagram of a visual network model provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a model compression method provided in an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of another model compression method provided in an exemplary embodiment of the present application;
fig. 4 is a block diagram of a conversion module according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of a feed-forward network layer provided in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a process for selecting layers according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of yet another model compression method provided by an exemplary embodiment of the present application;
FIG. 8 is a flowchart illustrating steps of an image processing method according to an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a model compression apparatus according to an exemplary embodiment of the present disclosure;
FIG. 10 is a block diagram of another model compression apparatus provided in an exemplary embodiment of the present application;
fig. 11 is a block diagram of an image processing apparatus according to an exemplary embodiment of the present application;
fig. 12 is a schematic structural diagram of a cloud device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, the ViT model is compressed by pruning, but this still follows the experience obtained from CNN compression, for example that weight parameters with larger norm values are relatively important while weight parameters with smaller norm values are deleted, and the characteristics of ViT itself are not considered. One such characteristic is that ViT captures the low-frequency information of an image more effectively than CNN, which means that ViT has a different sensitivity to frequency-domain signals than CNN. Therefore, for the compression of ViT, it is more important to consider the low-frequency information than the high-frequency part.
Based on the foregoing background, the model compression method provided in the embodiments of the present application includes: receiving a first image sent by a user terminal, where the first image is an image of a preset application scene; determining a compression mode for the trained first visual network model according to the preset application scene, where the compression mode includes channel compression and/or feature vector compression; determining frequency domain information of the first image; and compressing the first visual network model in the determined compression mode based on the frequency domain information to obtain a second visual network model. In the embodiments of the present application, the visual network model (ViT) is compressed by taking into account its different sensitivities to different frequency-domain information of the image, for example its higher sensitivity to the low-frequency part, so that the loss of recognition accuracy after compression is reduced, and the compressed ViT model occupies less memory and has higher computational efficiency.
In this embodiment, the overall model compression method may be implemented by means of a cloud computing system. In addition, the server of the model compression method may be a cloud server in order to run various neural network models by virtue of resources on the cloud; as opposed to the cloud, the model compression method may also be applied to a server device such as a conventional server or a server array, and is not limited herein.
In addition, the model compression method provided by the embodiments of the present application applies to various compression scenarios of the visual network model, such as image recognition scenarios, which include image classification, target detection, semantic segmentation, and the like. Referring to fig. 1, for a target task (image classification, target detection, semantic segmentation, or the like), the trained visual network model includes an embedding module (token embedding), where the embedding module is configured to split an input image into a plurality of sub-images and then encode each sub-image to obtain a coding vector (token) of the sub-image. The embedding module outputs the plurality of coding vectors to a conversion module (Transformer) a1, the result output by conversion module a1 is input into conversion module a2, and so on, until the result output by the last conversion module an is the image recognition result of the visual network model. The embodiments of the present application compress the trained visual network model so that the compressed visual network model can still recognize images accurately, while the memory occupied by the visual network model is reduced and the computational efficiency is improved.
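For intuition, the structure described above (embedding module followed by cascaded conversion modules a1 to an) can be sketched as follows. This is a minimal PyTorch sketch under assumed patch size, embedding dimension, depth and head count; the class names, the classification head and the omission of positional embeddings are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Embedding module: splits the input image into sub-images and encodes each into a token."""
    def __init__(self, patch_size=16, in_chans=3, dim=384):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                       # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)          # (B, N, dim) coding vectors (tokens)

class VisualNetworkModel(nn.Module):
    """Embedding module followed by cascaded conversion modules a1..an, as in Fig. 1."""
    def __init__(self, depth=12, dim=384, heads=6, num_classes=1000):
        super().__init__()
        self.embed = PatchEmbedding(dim=dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # positional embeddings omitted for brevity
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
             for _ in range(depth)]
        )
        self.head = nn.Linear(dim, num_classes)                 # task head, e.g. image classification

    def forward(self, x):
        tokens = self.embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)
        for block in self.blocks:                               # output of module a_l is the input of a_{l+1}
            tokens = block(tokens)
        return self.head(tokens[:, 0])                          # recognition result read from the class token
```

For example, VisualNetworkModel()(torch.randn(1, 3, 224, 224)) returns one recognition result per input image.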
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating steps of a model compression method according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the model compression method specifically includes the following steps:
s201, receiving a first image sent by a user terminal.
The first image is an image of a preset application scene. Specifically, the first image is a captured natural image, such as a remote sensing image or a color image captured by a camera. The preset application scene is one of a detection scenario, a segmentation task scenario, a classification scenario, and a retrieval task scenario.
Further, the first image may be sent by the user terminal to the server; when the user terminal sends the first image to the server, it also sends the preset application scene of the first image and instructs the server to compress the first visual network model.
S202, determining a compression mode of the trained first visual network model according to a preset application scene.
The compression mode comprises channel compression and/or feature vector compression.
Specifically, in a detection scenario and a segmentation task scenario, the compression mode includes channel compression; in a classification scenario and a retrieval task scenario, the compression mode includes channel compression and feature vector compression.
In the embodiments of the present application, the detection scenario and the segmentation task scenario require the details in an image, so feature vector compression is not performed for them.
In addition, the structure of the first visual network model is as shown in fig. 1. The first visual network model is trained in advance, and for different preset application scenes the first visual network model performs the corresponding task; for a detection scenario, the first visual network model performs a detection task, for example, for an image input into the first visual network model, the first visual network model detects the target object in the image.
The preset application scenario may also be other application scenarios as needed, and is not limited herein.
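As a small illustration of S202, the mapping from the preset application scene to the compression mode described above can be sketched as follows; the scene names and the string identifiers are assumptions introduced for illustration, and additional scenes would be handled by whatever modes the server supports.

```python
def compression_modes(preset_scene: str):
    """S202: map the preset application scene to the compression mode(s)."""
    if preset_scene in ("detection", "segmentation"):
        return ["channel_compression"]                       # image details are needed, so tokens are kept
    if preset_scene in ("classification", "retrieval"):
        return ["channel_compression", "feature_vector_compression"]
    raise ValueError(f"unsupported preset application scene: {preset_scene}")

# Example: compression_modes("classification") returns both modes.
```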
Further, after determining a compression mode of the trained first visual network model according to a preset application scenario, the method further includes: sending a compression mode to a user terminal; if negative feedback information of the user terminal on the compression modes is received, sending a plurality of compression modes to the user terminal; and receiving at least one compression mode sent by the user terminal, wherein the at least one compression mode is used for compressing the first visual network model.
If the compression mode determined in S202 is channel compression, channel compression is sent to the user terminal. If the user terminal's feedback on channel compression is negative feedback information, it is determined that the user does not agree to use channel compression, and all compression modes, such as channel compression and feature vector compression, are sent to the user terminal for the user to select from; the user may select the combination of channel compression and feature vector compression, or may select feature vector compression only.
In the embodiment of the present application, the compression method of the first visual network model may also include other types of compression besides channel compression and feature vector compression, where a user may select one of the compression methods or a combination of the compression methods according to needs. The subsequent server may perform compression of the first visual network model according to a compression mode selected by the user.
S203, determining frequency domain information of the first image.
In practice, the low frequency information of the first image forms the basic grey scale of the first image. The high frequency information of the first image forms the edges and details of the image.
In the embodiment of the present application, since the first visual network model has a higher sensitivity to low-frequency information in the image, the frequency-domain information includes the low-frequency information of the first image.
In an optional embodiment, if the application scene of the first visual network model requires identifying high-frequency information in the image, the frequency domain information may be the high-frequency information, and the high-frequency information is subsequently used to compress the first visual network model, so that the compressed first visual network model can still accurately identify the high-frequency information in the image.
And S204, compressing the first visual network model by adopting a compression mode based on the frequency domain information to obtain a second visual network model.
In an optional embodiment, based on the low-frequency information, the first visual network model is compressed by adopting a channel compression and/or feature vector compression mode to obtain the second visual network model.
In another optional embodiment, based on the high-frequency information, the first visual network model is compressed in a channel compression and/or feature vector compression mode to obtain the second visual network model.
In the embodiments of the present application, because the first visual network model captures the low-frequency information of the first image more effectively, the present application is described below using the example of compressing the first visual network model based on the low-frequency information to obtain the second visual network model.
In the embodiments of the present application, the visual network model (ViT) is compressed by taking into account its different sensitivities to different frequency-domain information of the image, for example its higher sensitivity to the low-frequency part, so that the loss of recognition accuracy after compression is reduced, and the compressed ViT model occupies less memory and achieves higher computational efficiency.
Fig. 3 is a flowchart illustrating steps of a model compression method according to an exemplary embodiment of the present application. As shown in fig. 3, the model compression method specifically includes the following steps:
s301, receiving a first image sent by a user terminal.
The specific implementation process of this step refers to S201, and is not described herein again.
S302, determining a compression mode of the trained first visual network model according to a preset application scene.
The specific implementation process of this step refers to S202, and is not described herein again.
S303, converting the first image into a frequency domain image.
The first image is converted from a spatial domain image to a frequency domain image using a Fourier transform.

Specifically, the first image is $X_{x,y}$, and the first image $X$ is converted into the frequency domain image as shown in formula (1):

$$\mathcal{F}(X)_{u,v}=\sum_{x=0}^{H-1}\sum_{y=0}^{W-1}X_{x,y}\,e^{-2\pi i\left(\frac{ux}{H}+\frac{vy}{W}\right)} \qquad \text{(1)}$$

In formula (1), $\mathcal{F}(X)_{u,v}$ denotes the Fourier transform applied to the pixel in row $x$ and column $y$ of the first image, which yields the frequency domain image $\mathcal{F}(X)$. Here $H$ denotes the height of the first image, $W$ denotes the width of the first image, and $u/H$ and $v/W$ denote the normalized frequencies along the rows and columns of pixels, respectively.
S304, attenuating the high-frequency information of the frequency domain image to obtain a filtered spectrum image.
Because the first visual network model captures the low-frequency information of the first image more effectively, the high-frequency information in the frequency domain image is attenuated, and the resulting filtered spectrum image better represents, in the frequency domain, the low-frequency information that the visual network model is sensitive to.
Specifically, the attenuation uses a filter function $g_r(u,v)$ with a frequency cut-off ratio $r$ to attenuate the high-frequency information of the frequency domain image, yielding the filtered spectrum image of formula (2):

$$\mathcal{F}^{r}(X)_{u,v}=g_r(u,v)\cdot\mathcal{F}(X)_{u,v} \qquad \text{(2)}$$

In formula (2), $\mathcal{F}^{r}(X)$ is the filtered spectrum image. When the first image is converted into the filtered spectrum image, a binary filter that completely removes some frequency components would cause a ringing effect; to avoid this phenomenon, the filter function $g_r(u,v)$ follows a Gaussian distribution, as in formula (3):

$$g_r(u,v)=\exp\!\left(-\frac{u^{2}+v^{2}}{2\sigma_r^{2}}\right) \qquad \text{(3)}$$

where $\sigma_r$ is determined by the cut-off ratio $r$.
S305, inverse transformation is carried out on the filtered frequency spectrum image to obtain a second image.
The filtered spectrum image $\mathcal{F}^{r}(X)$ is converted back to the spatial domain through the inverse Fourier transform to obtain the second image $\tilde{X}$. The second image contains the low-frequency information; specifically, the second image is an image in which the high-frequency information of the first image is attenuated and the low-frequency information of the first image is relatively enhanced.
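A minimal sketch of S303 to S305 follows: the first image is transformed to the frequency domain, the high-frequency part is attenuated with a Gaussian low-pass filter, and the result is transformed back to obtain the second image. The patent only states that the filter is Gaussian with a frequency cut-off ratio r; the exact mapping from r to the Gaussian width used below is an assumption.

```python
import torch

def low_frequency_image(x: torch.Tensor, r: float = 0.25) -> torch.Tensor:
    """x: (..., H, W) first image; returns the second image containing its low-frequency information."""
    H, W = x.shape[-2:]
    spectrum = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))       # formula (1): to the frequency domain
    u = torch.arange(H, device=x.device, dtype=x.dtype) - H / 2
    v = torch.arange(W, device=x.device, dtype=x.dtype) - W / 2
    sigma = r * min(H, W) / 2.0                                          # assumed mapping of the cut-off ratio r
    g = torch.exp(-(u[:, None] ** 2 + v[None, :] ** 2) / (2 * sigma ** 2))  # formula (3): Gaussian low-pass
    filtered = spectrum * g                                              # formula (2): attenuate high frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1))).real  # S305: back to the spatial domain
```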
S306, inputting the first image into the first visual network model for recognition processing to obtain a first output result.
The first output result is an output result obtained after the first visual network model identifies the first image.
Specifically, the first visual network model includes a weight set $W$ representing all weight matrices of the visual network model. The set $W$ comprises a plurality of weight matrices $w_k$, each weight matrix $w_k$ comprises a plurality of channels (a channel is one row or one column of weight parameters of $w_k$), and each channel comprises a plurality of weight parameters $w_k^{i,j}$.
In this implementation, the first output result obtained by inputting the first image $X$ into the trained first visual network model is denoted $\mathcal{M}(X, W)$.
S307, for each channel of the first visual network model, deleting that channel to obtain a current visual network model.
All weight matrices are $W=\{w_1, w_2, \ldots, w_n\}$, where $n$ is an integer greater than 1 and $w_k$, with $k$ taking one of 1 to $n$, denotes one of the weight matrices of the visual network model. Each weight matrix satisfies $w_k \in \mathbb{R}^{x \times y}$, where $x$ and $y$ are integers greater than 1; $w_k^{i,:}$ denotes the set of weight parameters in row $i$ of the weight matrix $w_k$, e.g. $w_k^{1,:}=\{w_k^{1,1}, w_k^{1,2}, \ldots, w_k^{1,y}\}$; and $w_k^{i,j}$ denotes the weight parameter in row $i$ and column $j$ of the $k$-th weight matrix $w_k$.
First, the first channel of the weight matrix $w_1$ is deleted to obtain a weight matrix $w_1^{(-1)}$, so that all weight matrices of the corresponding current visual network model are $\{w_1^{(-1)}, w_2, \ldots, w_n\}$. Then the second channel of $w_1$ is deleted to obtain a weight matrix $w_1^{(-2)}$, so that all weight matrices of the corresponding current visual network model are $\{w_1^{(-2)}, w_2, \ldots, w_n\}$. After each channel of the weight matrix $w_1$ has been deleted in turn, m1 current visual network models are obtained. In the same way, each channel of $w_2$ is deleted in turn to obtain m2 current visual network models, and so on until each channel of $w_n$ has been deleted in turn to obtain mn current visual network models. The total number of current visual network models obtained is m = m1 + m2 + … + mn.
And S308, inputting the second image into the current visual network model for recognition processing to obtain a second output result.
And respectively inputting the second images into each current visual network model for recognition processing to obtain corresponding second output results.
Illustratively, m current visual network models are included, and m second output results are obtained correspondingly.
S309, channel compression is carried out on the first visual network model based on the first output result and the second output result, and a second visual network model is obtained.
In an optional embodiment, performing channel compression on the first visual network model based on the first output result and the second output result to obtain a second visual network model includes: determining a first loss value of the second output result relative to the first output result, wherein the magnitude of the first loss value represents the influence degree of the channel on the first visual network model; and deleting at least one channel of the first visual network model according to the first loss value to obtain a second visual network model, wherein the influence degree of the deleted channel on the first visual network model is smaller than the influence degree threshold value.
Specifically, the magnitude of the first loss value represents the influence degree of the channel on the first visual network model. The greater the first loss value, the greater the influence of the corresponding channel on the first visual network model is determined to be.
And the influence degree of the deleted channel on the first visual network model is smaller than the influence degree threshold value. In the embodiment of the application, the deleted channel does not influence the identification precision of the first visual network model.
Illustratively, after the first channel of the weight matrix $w_1$ is deleted, the corresponding first loss value is smaller than the preset loss value threshold; after the second channel of $w_1$ is deleted, the corresponding first loss value is also smaller than the preset loss value threshold; but after the third channel of $w_1$ is deleted, the corresponding first loss value is greater than the preset loss value threshold. In this case, the first and second channels of $w_1$ may be deleted and the third channel retained to obtain the second visual network model.
Optionally, deleting at least one channel of the first visual network model according to the first loss value to obtain the second visual network model includes: deleting at least one channel of the first visual network model in ascending order of the first loss value to obtain an intermediate visual network model; inputting the first image into the intermediate visual network model for recognition processing to obtain a third output result; determining a second loss value of the third output result relative to the first output result; if the second loss value is smaller than the loss value threshold, increasing the number of channels to be deleted and again deleting at least one channel of the first visual network model in ascending order of the first target loss value; and if the second loss value is greater than or equal to the loss value threshold, determining the intermediate visual network model as the second visual network model.
The final second visual network model is thus obtained by deleting channels incrementally. For example, two channels of the first visual network model are first deleted in ascending order of the first loss value to obtain an intermediate visual network model; if the second loss value at this point is smaller than the loss value threshold, four channels are deleted in ascending order of the first loss value; if the second loss value is still smaller than the loss value threshold, six channels are deleted in ascending order of the first loss value; and if the second loss value is then greater than or equal to the loss value threshold, the first visual network model with six channels deleted is taken as the second visual network model.
In the embodiment of the application, the output result obtained by inputting the first image into the first visual network model is the same as the output result obtained by inputting the first image into the second visual network model, so that the first visual network model does not reduce the recognition accuracy after being compressed, the occupation of a memory can be reduced, and the calculation speed is increased.
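The channel scoring and incremental deletion of S306 to S309 can be sketched as follows. The helper that zeroes a channel, the use of a squared-error distance between outputs as the loss, and the step size of the incremental deletion are assumptions for illustration; the patent deletes channels rather than zeroing them, which is equivalent for scoring purposes.

```python
import copy
import torch
import torch.nn.functional as F

def zero_out_channel(model, channel):
    """channel = (param_name, dim, index): zero one row (dim=0) or column (dim=1) of a weight matrix."""
    name, dim, idx = channel
    w = dict(model.named_parameters())[name]
    w.data.index_fill_(dim, torch.tensor([idx]), 0.0)

@torch.no_grad()
def first_loss_values(model, first_image, second_image, channels):
    """S306-S308: score each channel by how much removing it changes the output on the second image."""
    reference = model(first_image)                       # first output result M(X, W)
    scores = []
    for ch in channels:
        pruned = copy.deepcopy(model)
        zero_out_channel(pruned, ch)                     # current visual network model without this channel
        scores.append(F.mse_loss(pruned(second_image), reference).item())
    return scores

@torch.no_grad()
def incremental_prune(model, first_image, channels, scores, loss_threshold, step=2):
    """S309: delete channels in ascending score order while the second loss value stays under the threshold."""
    reference = model(first_image)
    order = sorted(range(len(channels)), key=lambda i: scores[i])
    kept, n = model, step
    while n <= len(order):
        candidate = copy.deepcopy(model)
        for i in order[:n]:
            zero_out_channel(candidate, channels[i])
        if F.mse_loss(candidate(first_image), reference) >= loss_threshold:
            break                                        # accuracy budget exceeded: keep the previous model
        kept, n = candidate, n + step                    # enlarge the set of deleted channels
    return kept
```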
In the channel compression mode, compressing the first visual network model based on the frequency domain information to obtain the second visual network model specifically includes the following steps:
receiving a true value result of a first image sent by a user terminal.
For example, the first visual network model is used to identify an object in the first image, and if the first image includes a target object, the true result is used to indicate the target object.
And secondly, determining a third loss value of the fourth output result relative to the true value result by adopting a second preset loss function according to the weight parameter.
And thirdly, determining a fourth loss value of the first output result relative to the true value result by adopting a second preset loss function.
And fourthly, determining the square of the difference value of the third loss value and the fourth loss value as the first target loss value.
In the embodiments of the present application, the first output result obtained by inputting the first image $X$ into the trained visual network model is $\mathcal{M}(X, W)$. When one of the weight parameters in all weight matrices $W$, say $w_k^{i,j}$, is set to 0, all weight matrices $W$ become $W'$. The second image $\tilde{X}$ is input into the first visual network model with that weight parameter set to 0, and the fourth output result is $\mathcal{M}(\tilde{X}, W')$. The first target loss value is then expressed as formula (4):

$$\mathcal{L}_{k}^{i,j}=\Big(L\big(\mathcal{M}(\tilde{X}, W'),\,Y\big)-L\big(\mathcal{M}(X, W),\,Y\big)\Big)^{2} \qquad \text{(4)}$$

In formula (4), the second image $\tilde{X}$ is obtained by applying the inverse Fourier transform to the filtered spectrum image $\mathcal{F}^{r}(X)$, $Y$ denotes the true value result, $\mathcal{M}(\tilde{X}, W')$ is the fourth output result, $L(\mathcal{M}(\tilde{X}, W'), Y)$ denotes the third loss value of the fourth output result relative to the true value result determined with the second preset loss function, $\mathcal{M}(X, W)$ is the first output result, $L(\mathcal{M}(X, W), Y)$ denotes the fourth loss value of the first output result relative to the true value result determined with the second preset loss function, and $\mathcal{L}_{k}^{i,j}$ denotes the first target loss value.

In an alternative embodiment, the first target loss value may be formulated as

$$\mathcal{L}_{k}^{i,j}=\Big(L\big(\mathcal{M}(X, W'),\,Y\big)-L\big(\mathcal{M}(X, W),\,Y\big)\Big)^{2},$$

where $\mathcal{L}_{k}^{i,j}$ denotes the first target loss value; that is, when the weight parameter $w_k^{i,j}$ of the first visual network model is set to 0 and the first image is input into the first visual network model, the first target loss value is the square of the difference between the loss value of the resulting output $\mathcal{M}(X, W')$ relative to the true value result and the loss value, relative to the true value result, of the output $\mathcal{M}(X, W)$ obtained by inputting the first image into the original first visual network model.
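A short sketch of formula (4) follows: one weight parameter is temporarily set to 0, the model is evaluated on the low-frequency second image, and the squared difference of the two task losses relative to the ground truth is returned. Cross-entropy is assumed here as the second preset loss function.

```python
import torch
import torch.nn.functional as F

def first_target_loss(model, first_image, second_image, target, param_name, i, j):
    """Formula (4) for the weight parameter at row i, column j of the named weight matrix."""
    with torch.no_grad():
        base = F.cross_entropy(model(first_image), target)        # L(M(X, W), Y)
        w = dict(model.named_parameters())[param_name]
        old = w[i, j].item()
        w[i, j] = 0.0                                             # W -> W' with w_k^{i,j} = 0
        perturbed = F.cross_entropy(model(second_image), target)  # L(M(X_tilde, W'), Y)
        w[i, j] = old                                             # restore the original parameter
    return (perturbed - base) ** 2                                # squared difference of the losses
```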
And fifthly, determining the target influence value of the weight parameter on the weight matrix according to the first target loss value.
The first target loss value is positively correlated with the target influence value. The visual network model comprises a plurality of weight matrices, and each weight matrix comprises a plurality of weight parameters.
Further, determining a target influence value of the weight parameter on the associated weight matrix according to the first target loss value includes: inputting the first image into a first visual network model aiming at the weight matrix to obtain a first sub-output result corresponding to the weight matrix; aiming at the weight parameter of the weight matrix, under the condition that the weight parameter is set to be 0, inputting the second image into the first visual network model to obtain a second sub-output result corresponding to the weight matrix; determining a second target loss value of the first sub-output result and the second sub-output result by adopting a first preset loss function; and performing weighted calculation on the first target loss value and the second target loss value by adopting a preset hyper-parameter to obtain a target influence value of the weight parameter.
Referring to fig. 4 and 5, the structure of a feedforward network layer is shown: the feedforward network layer includes a second normalization layer, a weight matrix $w_1^{l}$ and a weight matrix $w_2^{l}$. The second normalization layer applies the LayerNorm normalization technique to the embedded matrix $X_{l,3}$.

In the embodiments of the present application, referring to fig. 1, 4 and 5, the visual network model includes a plurality of conversion modules, and each conversion module includes a plurality of weight matrices, e.g. $w_q^{l,h}$, $w_k^{l,h}$, $w_v^{l,h}$, $w_o^{l}$, $w_1^{l}$ and $w_2^{l}$. Each weight matrix includes a plurality of weight parameters, and each weight matrix has a corresponding sub-output result when the visual network model performs recognition processing on an image; for example, referring to fig. 4, the sub-output result corresponding to the weight matrix $w_q^{l,h}$ is $Q_{l,h}$, and the sub-output result corresponding to the weight matrix $w_k^{l,h}$ is $K_{l,h}$.

The target influence value is calculated according to formula (5):

$$I\big(w_k^{i,j}\big)=KL\big(T,\hat{T}\big)+\alpha\,\mathcal{L}_{k}^{i,j} \qquad \text{(5)}$$

In formula (5), $T$ is the first sub-output result, $\hat{T}$ is the second sub-output result, $KL$ denotes the first preset loss function, and $\alpha$ is the preset hyper-parameter that weights the two loss terms.
Illustratively, for the weight matrix $w_q^{l,h}$, the first image is input into the trained first visual network model (the uncompressed visual network model), and the first sub-output result corresponding to this weight matrix is $T=Q_{l,h}$. For the weight parameter $w_q^{i,j}$ of the weight matrix $w_q^{l,h}$, with $w_q^{i,j}$ set to 0, the second image is input into the visual network model, and the second sub-output result corresponding to the weight matrix $w_q^{l,h}$ is $\hat{T}=\hat{Q}_{l,h}$. Substituting these into formula (5) yields the target influence value $I(w_q^{i,j})$ of the weight parameter $w_q^{i,j}$.
In the embodiments of the present application, considering that the number of weight parameters in the visual network model is large, computing the target influence value of every weight parameter with formula (5) would make the compression of the visual network model inefficient. A first-order Taylor expansion is therefore used to approximate formula (5), giving an approximate value that can represent the target influence value; the specific calculation uses formula (6):

$$I\big(w_k^{i,j}\big)\approx\left|\frac{\partial \mathcal{L}}{\partial w_k^{i,j}}\,w_k^{i,j}\right| \qquad \text{(6)}$$

In formula (6), $I(w_k^{i,j})$ represents the target influence value of the weight parameter $w_k^{i,j}$, $\partial$ denotes the partial derivative, $\mathcal{L}$ is the objective of formula (5), and $w_k^{i,j}$ denotes the weight parameter in row $i$ and column $j$ of the $k$-th weight matrix $w_k$.
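The Taylor-approximated influence of formulas (6) and (7) can be sketched as follows: a single backward pass gives the gradient of the objective, the absolute value of gradient times weight approximates the per-parameter influence, and summing over a row or column gives the per-channel score. The task loss is used here as the objective; in the patent the objective of formula (5) would take its place, and the standard |dL/dw * w| form is an assumption since the original formula image is not recoverable.

```python
import torch
import torch.nn.functional as F

def channel_influence(model, second_image, target, param_name, dim=0):
    """Per-channel score from formulas (6)-(7) for the named 2-D weight matrix.
    dim=0 treats rows as channels, dim=1 treats columns as channels."""
    model.zero_grad()
    loss = F.cross_entropy(model(second_image), target)   # objective whose Taylor expansion is taken
    loss.backward()
    w = dict(model.named_parameters())[param_name]
    per_param = (w.grad * w).abs()                         # formula (6): |dL/dw_k^{i,j} * w_k^{i,j}|
    return per_param.sum(dim=1 - dim)                      # formula (7): sum over the channel's parameters
```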
And sixthly, for each weight matrix, determining the sum of the target influence values of each channel of the weight matrix.
Each channel consists of one row or one column of weight parameters in a weight matrix, and the specific calculation refers to formula (7):

$$I\big(w_k^{i,:}\big)=\sum_{j}I\big(w_k^{i,j}\big) \qquad \text{(7)}$$

In formula (7), $I(w_k^{i,:})$ represents the sum of the target influence values of one channel.
The visual network model includes a plurality of conversion modules connected in cascade, where the output of the current conversion module serves as the input of the next conversion module. Deleting at least one channel in ascending order of the sum of the target influence values to obtain the compressed visual network model includes: for the weight matrix of the current conversion module, deleting at least one channel following the bottom-to-top cascade order of the plurality of conversion modules and the ascending order of the sum of the target influence values.
Further, the conversion module includes: a multi-head attention layer and a feedforward network layer; the multi-head attention layer comprises: a plurality of head processing units, wherein the number of deleted channels is the same for weight matrices of different head processing units.
In the embodiment of the application, in the same multi-head attention layer, different head processing units have the same weight matrix, and the number of deleted channels is the same.
Illustratively, referring to fig. 4, if the conversion module is a first-stage conversion module, the numbers of channels deleted from the weight matrices $w_q^{1,1}, w_q^{1,2}, \ldots, w_q^{1,H}$ are the same, e.g. 1 channel is deleted from each, and the numbers of channels deleted from the weight matrices $w_k^{1,1}, w_k^{1,2}, \ldots, w_k^{1,H}$ are the same, e.g. 2 channels are deleted from each. If the conversion module is a second-stage conversion module, the numbers of channels deleted from the weight matrices $w_q^{2,1}, w_q^{2,2}, \ldots, w_q^{2,H}$ are the same, e.g. 2 channels are deleted from each.
In addition, deleting at least one channel of the first visual network model to obtain the second visual network model includes: for weight matrices of the same type in different head processing units, the same number of channels is deleted, which ensures the parallelism of the different head processing units. Moreover, deleting at least one channel in ascending order of the sum of the target influence values does not affect the recognition accuracy of the visual network model. The sum of the target influence values is positively correlated with the first loss value.
In the embodiments of the present application, the first target loss value corresponding to each weight parameter $w_k^{i,j}$ can be obtained in the above manner; in order to keep the matrix usable, an optional way is to delete a whole channel (one row or one column of weight parameters) of the weight matrix.
For example, suppose each row of a weight matrix $w_k$ is one channel and the first target loss value of each weight parameter has been computed. If the preset loss value threshold is 0.7, the sum of the first target loss values of the first row is calculated as the first loss value corresponding to that channel; if this first loss value is smaller than the preset loss value threshold, the first row is deleted, yielding the compressed weight matrix. Compressing every weight matrix $w_k$ in this way yields the compressed weight matrices and thereby realizes the compression of the first visual network model. The compressed first visual network model has fewer weight parameters, occupies less memory and computes faster, and because the visual network model is compressed based on the low-frequency information, its recognition accuracy is not reduced.
S310 to S313 below describe the compression of feature vectors.
And S310, inputting the second image into the first visual network model for recognition processing to obtain a plurality of intermediate feature vectors corresponding to the second image.
An optional mode is to input the second image containing the low-frequency information into the first visual network model for recognition processing, to obtain a plurality of intermediate feature vectors corresponding to the second image, where the intermediate feature vectors are representative of the low-frequency information in the first image.
Another optional mode is that the first image is input into the first visual network model for recognition processing to obtain a plurality of feature vectors corresponding to the first image, then each feature vector of the first image is subjected to low-frequency information extraction to obtain an intermediate feature vector, and then feature vector compression is performed on the intermediate feature vector.
In an embodiment of the application, the intermediate feature vectors are the output of the multi-head attention layer of the first visual network model.
Specifically, referring to fig. 1, the visual network model includes a plurality of conversion modules, such as conversion module a1 through conversion module an in fig. 1. The conversion modules are connected in cascade, with the output of the current conversion module serving as the input of the next-stage conversion module; e.g. conversion module a1 is the first stage, conversion module a2 is the second stage and follows a1, and conversion module an is the nth stage.
Referring to fig. 4, each conversion module includes a Multi-Head Self-Attention layer (MHSA), a selection layer, and a Feed-Forward Network layer (FFN). The first image or the second image is input into the embedding module to obtain the embedded matrix X_{l,1}. The input of the multi-head attention layer is the embedded matrix X_{l,1}, which consists of a plurality of input feature vectors; the output of the multi-head attention layer is the embedded matrix X_{l,2}, which consists of a plurality of intermediate feature vectors. The input of the selection layer is the embedded matrix X_{l,2} and its output is the embedded matrix X_{l,3}, which consists of a plurality of feature vectors. The input of the feedforward network layer is the embedded matrix X_{l,3} and its output is the embedded matrix X_{l+1,1}, which consists of a plurality of feature vectors, where l denotes the l-th conversion module. The embedded matrix X_{l+1,1}, the target output of the current conversion module, serves as the input of the next conversion module.
Illustratively, the first image X is input into the embedding module, which outputs the embedded matrix X_{1,1} of the first image X. After this embedded matrix is input into conversion module a1, the target output is the embedded matrix X_{2,1}; after the embedded matrix X_{2,1} is input into conversion module a2, the target output is X_{3,1}; and so on, until the target output X_{n+1,1} of the last conversion module an is the output result of the visual network model.
Further, the multi-head attention layer comprises a plurality of head processing units, each head processing unit correspondingly comprises a plurality of weight matrixes, each weight matrix corresponds to one output, and the output corresponding to the weight matrix is the intermediate output corresponding to the multi-head attention layer.
Referring to fig. 4, the multi-head attention layer includes head processing units b1, b2 through bH, where H is an integer greater than 1. Each head processing unit includes the weight matrices $w_q^{l,h}$, $w_k^{l,h}$ and $w_v^{l,h}$, where q, k and v denote the type of the weight matrix and h, an integer from 1 to H, indexes the corresponding head processing unit. Further, the first normalization layer may use the LayerNorm normalization technique.
Illustratively, the embedded matrix X_{l,1} is first passed through the first normalization layer of the multi-head attention layer and then fed into each head processing unit for processing. For each head processing unit, the resulting intermediate outputs include: the matrix Q_{l,h} obtained through the weight matrix $w_q^{l,h}$, the matrix K_{l,h} obtained through the weight matrix $w_k^{l,h}$, and the matrix V_{l,h} obtained through the weight matrix $w_v^{l,h}$.
In the embodiments of the present application, for each head processing unit the product of Q_{l,h} and K_{l,h} is computed and the result is then multiplied with V_{l,h} to obtain the output c_h of that head processing unit. The outputs c1, c2 through cH of the head processing units are fused to obtain the total output C, and the total output C is passed through the weight matrix $w_o^{l}$ to obtain the target output X_{l,2}.
In the embodiments of the present application, the intermediate outputs of the multi-head attention layer are the matrices Q_{l,h}, K_{l,h} and V_{l,h}, and the target output is X_{l,2}.
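A minimal PyTorch sketch of the multi-head attention layer of fig. 4 follows, showing how the normalized embedded matrix X_{l,1} is projected by the q, k and v weight matrices into Q_{l,h}, K_{l,h} and V_{l,h}, how the per-head outputs are fused through the output weight matrix, and how the target output X_{l,2} and the attention maps are returned. The dimensions, the residual connection and the bias-free projections are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm = nn.LayerNorm(dim)                   # first normalization layer
        self.w_q = nn.Linear(dim, dim, bias=False)      # weight matrices w_q^{l,h} for all heads, split below
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)      # fusion weight matrix w_o^l

    def forward(self, x):                               # x: (B, N, dim) embedded matrix X_{l,1}
        B, N, _ = x.shape
        h = self.norm(x)
        split = lambda t: t.view(B, N, self.heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.w_q(h)), split(self.w_k(h)), split(self.w_v(h))   # Q_{l,h}, K_{l,h}, V_{l,h}
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)  # formula (8)
        fused = (attn @ v).transpose(1, 2).reshape(B, N, -1)    # head outputs c_1..c_H fused into C
        return x + self.w_o(fused), attn                # target output X_{l,2} (with residual) and A_{l,h}
```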
S311, determining the attention score and the low-frequency correction of the first visual network model according to the intermediate feature vectors.
The attention score of the multi-head attention layer is determined from the intermediate outputs Q_{l,h}, K_{l,h} and V_{l,h}.

Specifically, referring to fig. 4, the attention score of each head processing unit is calculated using formula (8):

$$A_{l,h}=\operatorname{softmax}\!\left(\frac{Q_{l,h}K_{l,h}^{\top}}{\sqrt{d}}\right) \qquad \text{(8)}$$

In formula (8), $A_{l,h}$ denotes the attention score of the $h$-th head processing unit of the $l$-th conversion module, and $d$ is the output dimension of the head processing unit output matrix $c_h$.

In the embodiments of the present application, the attention score of each head processing unit can be used to measure the information content of an intermediate feature vector (token); the attention score determines the degree to which one token influences the other tokens. It can be understood that, when the features of the first image are combined with the self-attention mechanism, a token with a larger attention score provides more information. The embodiments of the present application therefore determine the average attention score over the H head processing units of the multi-head attention layer using formula (9):

$$a^{l}_{j}=\frac{1}{H}\sum_{h=1}^{H}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}}A_{l,h}(i,j) \qquad \text{(9)}$$

The target output X_{l,2} of the multi-head attention layer includes a plurality of intermediate feature vectors (tokens); among them, a classification token represents the information of all the tokens and is therefore more important than the other tokens. $N_l$ denotes the total number of tokens, $j$ takes values from 1 to $N_l$, $H$ denotes the number of head processing units, and $a^{l}_{j}$ denotes the attention score of token $j$; the larger the attention score, the more important the token is to the classification token. On this basis, formula (9) is rewritten as formula (10):

$$a^{l}_{j}=\frac{1}{H}\sum_{h=1}^{H}\Big(A_{l,h}(\mathrm{cls},j)+\frac{1}{N_{l}}\sum_{i\neq \mathrm{cls}}A_{l,h}(i,j)\Big) \qquad \text{(10)}$$

In formula (10), $a^{l}_{j}$ is the attention score of the multi-head attention layer for token $j$, $A_{l,h}(\mathrm{cls},j)$ denotes the attention score of the classification token with respect to token $j$, and $A_{l,h}(i,j)$ with $i\neq\mathrm{cls}$ denotes the attention scores of the other tokens with respect to token $j$.
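A sketch of the attention score of S311 follows; since the exact form of formula (10) is only partially recoverable from the text, the combination of the classification-token row with the attention received from the other tokens shown here is one plausible reading, not the patent's definitive formula.

```python
import torch

def token_attention_scores(attn: torch.Tensor, cls_index: int = 0) -> torch.Tensor:
    """attn: (B, H, N, N) attention maps A_{l,h}; returns one score per token, shape (B, N)."""
    avg = attn.mean(dim=1)              # formula (9): average the attention maps over the H heads
    cls_row = avg[:, cls_index, :]      # attention paid by the classification token to each token j
    received = avg.mean(dim=1)          # average attention received by token j from all tokens
    return cls_row + received           # formula (10): combined attention score a_j^l
```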
In addition, for each intermediate feature vector in the target output, low-frequency filtering is performed on the intermediate feature vector to obtain its low-frequency correction.
Here, the token containing important low-frequency information cannot be clearly identified only by using the above attention score, and therefore, in the embodiment of the present application, such missing information is introduced from the frequency domain. Specifically, since the token is transformed according to the change of the image input to the visual network model, the embodiment of the present application proposes low frequency energy correction (LEC) to delete the token by using the characteristics of the visual network model in the frequency domain.
Here, it is necessary to determine how much low-frequency information each token contains. The specific implementation is to convert the token into frequency domain information using a Fourier transform, determine the total energy of the low-frequency components of the token in the frequency domain information, and process the frequency domain information with a low-pass filter $g_r$ whose cut-off ratio is $r$, so that the low-frequency information in the frequency domain information is highlighted.
Further, if the input to the first visual network model during feature-vector compression is the first image, the resulting low-frequency correction is given by formula (11):

$$E_i = \big\| G_r \odot \mathcal{F}(x_i) \big\|^{2} \qquad \text{formula (11)}$$

Wherein, in formula (11), E_i denotes the low-frequency correction, ||·|| is used to calculate the modulus, x_i represents the i-th token of the target output X l,2, F(x_i) represents the Fourier transform of the token x_i, and G_r is the low-pass filter with cut-off ratio r applied element-wise to the frequency-domain information.

In an alternative embodiment, if the input to the first visual network model during feature-vector compression is the second image, the low-frequency correction E_i is obtained in the same way from the tokens corresponding to the second image.
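A minimal Python sketch of the low-frequency energy correction is given below; the use of the real FFT, the rectangular low-pass mask and the default cut-off ratio are implementation assumptions.

```python
import torch

def low_frequency_correction(tokens: torch.Tensor, cutoff_ratio: float = 0.25) -> torch.Tensor:
    # tokens: (N, C) intermediate feature vectors of the target output X l,2
    spectrum = torch.fft.rfft(tokens, dim=-1)               # Fourier transform of each token
    keep = max(1, int(cutoff_ratio * spectrum.shape[-1]))   # bins kept by the low-pass filter
    low = spectrum[:, :keep]                                 # low-frequency component only
    return low.abs().pow(2).sum(dim=-1)                      # energy (squared modulus), one value per token
```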
And S312, integrating the attention score and the low-frequency correction to obtain the importance score of the intermediate feature vector.
Wherein the importance score is determined using formula (12):

$$\alpha_i = S_i \cdot E_i \qquad \text{formula (12)}$$

Wherein α_i denotes the importance score of the i-th intermediate feature vector, S_i its attention score and E_i its low-frequency correction. In the present embodiment, the larger the importance score α_i, the higher the importance of the corresponding intermediate feature vector and the more information of the first image it contains.
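By way of illustration, the fusion in formula (12) can be written as a single element-wise operation; the product form below is an assumption, since the embodiment only states that the attention score and the low-frequency correction are integrated.

```python
def importance_scores(attn_scores, low_freq_correction):
    # one importance score per intermediate feature vector; the larger the score,
    # the more information of the first image the token carries
    return attn_scores * low_freq_correction
```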
S313, based on the importance scores, a score threshold value is configured in the first visual network model, and a second visual network model is obtained.
Wherein the score threshold is used for instructing the second visual network model to delete the feature vectors with the importance scores lower than the score threshold.
Further, based on the importance score, configuring a score threshold in the first visual network model, and obtaining a second visual network model, including: deleting at least one intermediate feature vector of the first visual network model according to the importance score, wherein the influence degree of the deleted intermediate feature vector on the first visual network model is smaller than an influence degree threshold value; and taking the highest importance score in the deleted intermediate feature vectors as a score threshold, and configuring the score threshold in the first visual network model to obtain a second visual network model.
specifically, deleting at least one intermediate feature vector in the order of the importance scores from small to large; inputting the residual intermediate characteristic vectors into a subsequent structure of the first visual network model for identification processing to obtain a fifth output result output by the visual network model; determining a fifth loss value of the fifth output result relative to the true result; and if the fifth loss value is greater than or equal to the loss value threshold, determining the highest importance score in the deleted intermediate feature vectors as the score threshold of the selection layer. And if the fifth loss value is smaller than the loss value threshold, increasing the number of the deleted intermediate feature vectors, and continuing to delete at least one intermediate feature vector according to the sequence from small to large of the importance scores.
In the embodiment of the present application, the intermediate feature vectors may be deleted in N increasing batches until the loss requirement is reached, where N is an integer greater than 1.
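Under the assumption that the subsequent structure of the model and the loss function are available as callables, the incremental deletion in S313 can be sketched as follows; all names and the fallback behaviour are illustrative.

```python
import torch

def find_score_threshold(scores, tokens, forward_rest, target, loss_fn,
                         loss_threshold, step=1):
    # scores: (N,) importance scores; tokens: (N, C) intermediate feature vectors
    order = torch.argsort(scores)                  # ascending importance
    num_deleted = step
    while num_deleted < len(order):
        deleted = order[:num_deleted]
        kept, _ = torch.sort(order[num_deleted:])  # keep the original token order
        out = forward_rest(tokens[kept])           # fifth output result
        loss = loss_fn(out, target)                # fifth loss value
        if loss >= loss_threshold:
            return scores[deleted].max()           # score threshold of the selection layer
        num_deleted += step                        # delete more tokens and retry
    return scores.max()                            # fallback: loss threshold never reached
```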
S310 to S313 are performed for the selection layer of the conversion module. Referring to fig. 6, the target output X l,2 of the multi-head attention layer includes a plurality of intermediate feature vectors (token1 to token6), where token1 is the classification token and token2 to token6 are the other tokens. The target output X l,2 is input into the selection layer, which calculates the importance score of each token and ranks the tokens by importance score; in fig. 6 the ascending order of importance scores is token2, token4, token3, token6, token5, token1. The selection layer deletes at least one intermediate feature vector in ascending order of importance score; for example, in fig. 6 one token, token2, is deleted, and the target output X l,3 of the selection layer is obtained.
Wherein the subsequent structure comprises the feedforward network layer of the current conversion module and the conversion modules at the upper stages of the current conversion module, and the output X n+1,1 of the last conversion module is the fifth output result of the visual network model. The selection layer is used for determining the importance scores of the vectors in the output result of the multi-head attention layer and deleting the vectors whose importance scores are lower than the score threshold.
Illustratively, if the fifth loss value is smaller than the loss value threshold, it is determined that the currently deleted intermediate feature vectors have little influence on the recognition accuracy of the visual network model, and deletion continues. For example, referring to fig. 6, token4 is deleted on the basis of X l,3, so that the resulting embedded matrix includes token1, token3, token5 and token6; the embedded matrix is input into the subsequent visual network model to obtain the fifth output result, and if the fifth loss value is greater than or equal to the loss value threshold, the importance score corresponding to token4 is used as the score threshold of the current conversion module.
In this embodiment of the application, the score threshold and the subsequent compression step of each conversion module may be determined in order from the lower-level conversion module to the upper-level conversion module: after one conversion module has completed the determination of its score threshold and its compression, the same is done for the next upper-level conversion module, until the compression of the entire visual network model is completed.
Specifically, if the fifth loss value is greater than or equal to the loss value threshold, it may be determined that, of the deleted intermediate feature vectors, the intermediate feature vector with the highest importance score may affect the recognition accuracy of the visual network model after deletion, and therefore, the importance score is used as the score threshold of the current visual network model, and when the visual network model is compressed and used online, the intermediate feature vector with the importance score smaller than the score threshold may be deleted, so that subsequent calculation amount may be reduced, and the recognition efficiency of the visual network model may be accelerated.
In this embodiment of the application, the first visual network model includes a plurality of conversion modules, the plurality of conversion modules are connected in cascade, and in the compression process of the first visual network model, compression is performed on each conversion module in sequence according to the sequence from a lower level to an upper level of the conversion module.
Further, in the process of compressing the first visual network model by using a compression method based on the frequency domain information to obtain the second visual network model, the method further includes: sending the size of the current compressed model of the first visual network model to a user terminal; and if receiving a compression stop instruction sent by the user terminal, determining that the currently compressed first visual network model is the second visual network model.
In the embodiment of the present application, a cascaded compression manner is adopted. Referring to fig. 1, compression of conversion module a2 is performed after compression of conversion module a1 is completed. Therefore, after the compression of each conversion module is completed, the model size of the currently compressed first visual network model is sent to the user terminal, where the model size refers to the memory occupied by the currently compressed first visual network model. If the user determines that the current model size meets the requirement, a compression stop instruction is sent to the server through the user terminal; if the user determines that the current model size does not meet the requirement, a compression continuation instruction may be sent to the server through the user terminal, so that the server continues to compress the next conversion module.
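The cascaded interaction with the user terminal can be summarised by the sketch below; compress_module, report_model_size and wait_for_instruction stand for the functionality described above and are not part of any real API.

```python
def cascaded_compression(model, conversion_modules, compress_module,
                         report_model_size, wait_for_instruction):
    # conversion modules are visited from the lower stage to the upper stage
    for module in conversion_modules:
        compress_module(module)                # channel compression and/or token pruning
        report_model_size(model)               # send current model size to the user terminal
        if wait_for_instruction() == "stop":   # compression stop instruction received
            break                              # the current model is the second visual network model
    return model
```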
In the embodiment of the present application, the conversion modules are compressed from the lower level to the upper level. Within each conversion module, channel compression may be performed first through steps S301 to S309 and the score threshold of the current conversion module determined afterwards through steps S310 to S313, or the score threshold may be determined first and channel compression performed afterwards, until every conversion module has had its score threshold determined and has been compressed, thereby obtaining the compressed visual network model. In addition, because channels and tokens are deleted according to the preference of the visual network model for low-frequency information, the computational efficiency of the compressed visual network model is greatly improved in use and it occupies less memory, while the recognition accuracy of the visual network model is not affected.
Fig. 7 is a flowchart illustrating steps of another model compression method according to an exemplary embodiment of the present application. As shown in fig. 7, the method specifically includes the following steps:
and S701, receiving the remote sensing image sent by the user terminal.
A remote sensing image is a film or photograph that records the electromagnetic waves of various ground objects, and is divided into aerial images and satellite images. Remote sensing images place higher requirements on image detail.
S702, determining low-frequency information in the remote sensing image.
The determination of the low frequency information refers to the above embodiments, and is not described herein again.
And S703, compressing the trained first remote sensing model by adopting a channel compression mode based on the low-frequency information to obtain a second remote sensing model.
In the embodiment of the present application, the first remote sensing model is also a visual network model, and a channel compression manner is adopted for the remote sensing image, and specific compression manners refer to S301 to S309 in the above embodiment, which is not described herein again.
In addition, after the second remote sensing model is obtained, training data consisting of remote sensing images are acquired, and the second remote sensing model is optimised and trained on the training data to obtain the target remote sensing model, which can then identify and process remote sensing images.
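The optimisation training mentioned here amounts to a standard fine-tuning loop; the optimiser, loss function and hyper-parameters in the sketch are assumptions rather than values taken from the embodiment.

```python
import torch

def finetune_remote_sensing_model(model, dataloader, epochs=5, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in dataloader:      # remote sensing training data
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model                                # target remote sensing model
```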
In the embodiment of the application, the target remote sensing model can be obtained to accurately identify the remote sensing image, and the compressed target remote sensing model occupies a small memory and has a high calculation speed.
Fig. 8 is a flowchart illustrating steps of an image processing method according to an exemplary embodiment of the present application. As shown in fig. 8, the method specifically includes the following steps:
and S801, acquiring an image to be processed.
The image to be processed is a natural image to be identified by the visual network model; it may be the same kind of image as the first image, or it may be a remote sensing image.
S802, sending the image to be processed to the server, so that the server can identify the image to be processed by adopting the visual network model to obtain a processing result.
The visual network model is obtained according to the model compression method described above and may be the second visual network model or the target remote sensing model. Furthermore, the selection layer of each conversion module of the visual network model calculates the importance score of each token of the input embedded matrix and deletes the tokens whose importance scores are smaller than the corresponding score threshold, so that the calculation amount of the visual network model is reduced.
Illustratively, the visual network model includes: the score threshold of the first conversion module is 0.3, the score threshold of the second conversion module is 0.2, and the score threshold of the third conversion module is 0.1. Deleting the token with the importance score smaller than 0.3 when the first conversion module performs calculation, deleting the token with the importance score smaller than 0.2 when the second conversion module performs calculation, and deleting the token with the importance score smaller than 0.1 when the third conversion module performs calculation.
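At inference time each conversion module simply filters its tokens against its own score threshold; the sketch below reuses the example thresholds 0.3, 0.2 and 0.1 and is otherwise illustrative.

```python
def keep_tokens_per_module(scores_per_module, thresholds=(0.3, 0.2, 0.1)):
    # scores_per_module: for each conversion module, one importance score per token
    kept_indices = []
    for scores, threshold in zip(scores_per_module, thresholds):
        kept_indices.append([i for i, s in enumerate(scores) if s >= threshold])
    return kept_indices
```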
S803, the processing result transmitted by the server is received.
Aiming at the detection scene, the corresponding processing result is a detection result; for a segmentation task scene, the corresponding processing result is a segmentation result; for the classification scene, the corresponding processing result is a classification result; and for the retrieval task scene, the corresponding processing result is a retrieval result.
The visual network model in the embodiment of the application occupies a small memory, and the image to be processed is identified quickly and accurately.
In the embodiment of the present application, referring to fig. 9, in addition to providing a model compression method, there is provided a model compression apparatus 90, the model compression apparatus 90 including:
the acquiring module 91 is configured to receive a first image sent by a user terminal, where the first image is an image of a preset application scene;
a first determining module 92, configured to determine, according to a preset application scenario, a compression mode of the trained first visual network model, where the compression mode includes channel compression and/or compression of a feature vector;
a second determining module 93, configured to determine frequency domain information of the first image.
And a compression module 94, configured to compress the first visual network model in a compression manner based on the frequency domain information to obtain a second visual network model.
In an optional embodiment, the frequency domain information includes low frequency information, and the second determining module 93 is specifically configured to: converting the first image into a frequency domain image; weakening the high-frequency information of the frequency domain image to obtain a filtering frequency spectrum image; and performing inverse transformation on the filtered spectrum image to obtain a second image, wherein the second image comprises low-frequency information.
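The frequency-domain processing performed by the second determining module can be sketched as follows; the rectangular low-frequency window and the cut-off ratio are assumptions, since the embodiment only requires that high-frequency information be weakened before the inverse transform.

```python
import torch

def second_image_from_first(image: torch.Tensor, cutoff_ratio: float = 0.1) -> torch.Tensor:
    # image: (H, W) single-channel first image
    spectrum = torch.fft.fftshift(torch.fft.fft2(image))        # low frequencies moved to the centre
    h, w = image.shape
    cy, cx = h // 2, w // 2
    ry, rx = max(1, int(cutoff_ratio * h)), max(1, int(cutoff_ratio * w))
    mask = torch.zeros_like(image)
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0                 # keep only a low-frequency window
    filtered = spectrum * mask                                   # high-frequency information weakened
    return torch.fft.ifft2(torch.fft.ifftshift(filtered)).real  # second image (low-frequency information)
```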
In an alternative embodiment, the compression mode is channel compression, and the compression module 94 is specifically configured to: inputting the first image into a first visual network model for identification processing to obtain a first output result; deleting channels aiming at each channel of the first visual network model to obtain a current visual network model; inputting the second image into the current visual network model for recognition processing to obtain a second output result; and performing channel compression on the first visual network model based on the first output result and the second output result to obtain a second visual network model.
In an optional embodiment, when the compression module 94 performs channel compression on the first visual network model based on the first output result and the second output result to obtain the second visual network model, it is specifically configured to: determining a first loss value of the second output result relative to the first output result, wherein the magnitude of the first loss value represents the influence degree of the channel on the first visual network model; and deleting at least one channel of the first visual network model according to the first loss value to obtain a second visual network model, wherein the influence degree of the deleted channel on the first visual network model is smaller than the influence degree threshold value.
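A per-channel influence test consistent with this description can be sketched as follows; using an L2 distance as the first loss value is an assumption.

```python
import torch

def first_loss_value(first_output: torch.Tensor, second_output: torch.Tensor) -> float:
    # magnitude of the change caused by deleting one channel
    return torch.norm(first_output - second_output).item()

def channels_to_delete(outputs_without_channel, first_output, influence_threshold):
    # outputs_without_channel: {channel index: second output result with that channel deleted}
    return [ch for ch, out in outputs_without_channel.items()
            if first_loss_value(first_output, out) < influence_threshold]
```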
In an alternative embodiment, the compression manner is compression of the feature vector, and the compression module 94 is specifically configured to: inputting the second image into the first visual network model for identification processing to obtain a plurality of intermediate characteristic vectors corresponding to the second image; according to the intermediate feature vector, determining the attention score and the low-frequency correction quantity of the first visual network model; integrating the attention score and the low-frequency correction to obtain an importance score of the intermediate feature vector; and configuring a score threshold in the first visual network model based on the importance score to obtain a second visual network model, wherein the score threshold is used for indicating the second visual network model to delete the feature vectors with the importance scores lower than the score threshold.
In an optional embodiment, the compression module 94, when configuring a score threshold in the first visual network model based on the importance score to obtain the second visual network model, is specifically configured to: deleting at least one intermediate feature vector of the first visual network model according to the importance score, wherein the influence degree of the deleted intermediate feature vector on the first visual network model is smaller than an influence degree threshold value; and taking the highest importance score in the deleted intermediate feature vectors as a score threshold, and configuring the score threshold in the first visual network model to obtain a second visual network model.
In an optional embodiment, the first visual network model includes a plurality of conversion modules, the plurality of conversion modules are connected in cascade, and in the compression process of the first visual network model, compression is performed on each conversion module in sequence according to the sequence from a lower stage to an upper stage of the conversion module.
In an optional embodiment, the preset application scenario includes: the method comprises the steps of detecting one of a scene, a task segmentation scene, a classification scene and a task retrieval scene, wherein the compression mode comprises channel compression under the detection scene and the task segmentation scene, and the compression mode comprises channel compression and compression of characteristic vectors under the classification scene and the task retrieval scene.
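The mapping from preset application scene to compression mode can be written down directly; the scene names used as strings below are illustrative.

```python
def compression_modes_for_scene(scene: str):
    if scene in ("detection", "segmentation"):
        return ["channel"]                       # channel compression only
    if scene in ("classification", "retrieval"):
        return ["channel", "feature_vector"]     # channel compression plus feature-vector compression
    raise ValueError(f"unknown preset application scene: {scene}")
```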
In an optional embodiment, after the first determining module 92 determines the compression mode of the trained first visual network model according to a preset application scenario, the first determining module is further configured to: sending a compression mode to a user terminal; if negative feedback information of the user terminal on the compression modes is received, sending a plurality of compression modes to the user terminal; and receiving at least one compression mode sent by the user terminal, wherein the at least one compression mode is used for compressing the first visual network model.
In an optional embodiment, the compressing module 94 is further configured to, in the process of compressing the first visual network model by using a compression method based on the frequency domain information to obtain the second visual network model: sending the size of the current compressed model of the first visual network model to a user terminal; and if receiving a compression stop instruction sent by the user terminal, determining that the currently compressed first visual network model is the second visual network model.
The model compression device provided by the embodiment of the application can achieve the purpose of obtaining the compressed visual network model, and the compressed visual network model occupies a smaller memory and has higher calculation efficiency under the condition of not influencing the identification precision.
In the embodiment of the present application, referring to fig. 10, there is also provided another model compression apparatus 10, the model compression apparatus 10 including:
the receiving module 11 is used for receiving the remote sensing image sent by the user terminal;
the determining module 12 is used for determining low-frequency information in the remote sensing image;
and the compression module 13 is used for compressing the trained first remote sensing model by adopting a channel compression mode based on the low-frequency information to obtain a second remote sensing model.
In an alternative embodiment, the model compressing apparatus 10 further includes a training module (not shown) for acquiring training data, wherein the training data is a remote sensing image; and optimizing and training the second remote sensing model according to the training data to obtain a target remote sensing model.
In the embodiment of the application, the target remote sensing model which can accurately identify the remote sensing image, occupies small memory and has high calculation speed can be obtained.
In the embodiment of the present application, referring to fig. 11, there is also provided an image processing apparatus 110, where the image processing apparatus 110 includes:
an obtaining module 111, configured to obtain an image to be processed;
and a sending module 112, configured to send the image to be processed to the server, so that the server performs identification processing on the image to be processed by using a visual network model, and obtains a processing result, where the visual network model is obtained according to the model compression method.
A receiving module 113, configured to receive a processing result sent by the server.
The image processing device provided by the embodiment of the application can realize rapid and accurate identification of the image to be processed.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of order or in parallel as they appear in the present document, and only for distinguishing between the various operations, and the sequence number itself does not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 12 is a schematic structural diagram of a cloud device 120 according to an exemplary embodiment of the present application. The cloud device 120 is configured to run the above-described model compression method or image processing method. As shown in fig. 12, the cloud device includes: a memory 124 and a processor 125.
A memory 124 for storing computer programs and may be configured to store other various information to support operations on the cloud device. The Storage 124 may be an Object Storage Service (OSS).
The memory 124 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 125, coupled to the memory 124, for executing the computer program in the memory 124 to: receiving a first image sent by a user terminal, wherein the first image is an image of a preset application scene; determining a compression mode of the trained first visual network model according to a preset application scene, wherein the compression mode comprises channel compression and/or feature vector compression; determining frequency domain information of a first image; and compressing the first visual network model by adopting a compression mode based on the frequency domain information to obtain a second visual network model.
Further optionally, the frequency domain information includes low frequency information, and the processor 125 is specifically configured to, when determining the frequency domain information of the first image: converting the first image into a frequency domain image; weakening the high-frequency information of the frequency domain image to obtain a filtering frequency spectrum image; and performing inverse transformation on the filtered spectrum image to obtain a second image, wherein the second image comprises low-frequency information.
Further optionally, the compression method is channel compression, and the processor 125 is specifically configured to, when compressing the visual network model in a compression method based on the frequency domain information to obtain a compressed visual network model: inputting the first image into a first visual network model for identification processing to obtain a first output result; deleting channels aiming at each channel of the first visual network model to obtain a current visual network model; inputting the second image into the current visual network model for recognition processing to obtain a second output result; and performing channel compression on the first visual network model based on the first output result and the second output result to obtain a second visual network model.
Further optionally, when the processor 125 performs channel compression on the first visual network model based on the first output result and the second output result to obtain the second visual network model, the processor is specifically configured to: determining a first loss value of the second output result relative to the first output result, wherein the magnitude of the first loss value represents the influence degree of the channel on the first visual network model; and deleting at least one channel of the first visual network model according to the first loss value to obtain a second visual network model, wherein the influence degree of the deleted channel on the first visual network model is smaller than the influence degree threshold value.
Further optionally, the compression mode is compression of the feature vector, and the processor 125 is specifically configured to, when compressing the first visual network model in the compression mode based on the frequency domain information to obtain the second visual network model: inputting the second image into the first visual network model for identification processing to obtain a plurality of intermediate characteristic vectors corresponding to the second image; according to the intermediate feature vector, determining the attention score and the low-frequency correction quantity of the first visual network model; integrating the attention score and the low-frequency correction to obtain an importance score of the intermediate feature vector; and configuring a score threshold in the first visual network model based on the importance score to obtain a second visual network model, wherein the score threshold is used for indicating the second visual network model to delete the feature vectors with the importance scores lower than the score threshold.
Further optionally, the processor 125, when configuring a score threshold in the first visual network model based on the importance score to obtain the second visual network model, is specifically configured to: deleting at least one intermediate feature vector of the first visual network model according to the importance score, wherein the influence degree of the deleted intermediate feature vector on the first visual network model is smaller than an influence degree threshold value; and taking the highest importance score in the deleted intermediate feature vectors as a score threshold, and configuring the score threshold in the first visual network model to obtain a second visual network model.
Further optionally, the first visual network model includes a plurality of conversion modules, the plurality of conversion modules are connected in cascade, and in the compression process of the first visual network model, the processor 125 is further configured to sequentially compress for each conversion module according to an order of the conversion module from a lower level to an upper level.
In an optional embodiment, the processor 125, after determining the compression mode of the trained first visual network model according to the preset application scenario, is further configured to: sending a compression mode to a user terminal; if negative feedback information of the user terminal on the compression modes is received, sending a plurality of compression modes to the user terminal; and receiving at least one compression mode sent by the user terminal, wherein the at least one compression mode is used for compressing the first visual network model.
In an optional embodiment, the processor 125, in the process of compressing the first visual network model by using a compression method based on the frequency domain information to obtain the second visual network model, is further configured to: sending the size of the current compressed model of the first visual network model to a user terminal; and if receiving a compression stop instruction sent by the user terminal, determining that the currently compressed first visual network model is the second visual network model.
In an alternative embodiment, the processor 125, coupled to the memory 124, is configured to execute the computer program in the memory 124 to: receiving a remote sensing image sent by a user terminal; determining low-frequency information in the remote sensing image; and based on the low-frequency information, compressing the trained first remote sensing model by adopting a channel compression mode to obtain a second remote sensing model.
In an optional embodiment, the processor 125, after compressing the trained first remote sensing model by using a channel compression method based on the low-frequency information to obtain a second remote sensing model, is further configured to: acquiring training data, wherein the training data are remote sensing images; and optimizing and training the second remote sensing model according to the training data to obtain the target remote sensing model.
In an alternative embodiment, the processor 125, coupled to the memory 124, is configured to execute the computer program in the memory 124 to: acquiring an image to be processed; sending the image to be processed to a server for the server to identify the image to be processed by adopting a visual network model to obtain a processing result, wherein the visual network model is obtained according to the model compression method; and receiving the processing result sent by the server.
Further, as shown in fig. 12, the cloud device further includes: firewall 121, load balancer 122, communications component 126, power component 123, and other components. Only some of the components are schematically shown in fig. 12, and the cloud device is not meant to include only the components shown in fig. 12.
The cloud equipment provided by the embodiment of the application can obtain the compressed visual network model, and the compressed visual network model occupies a smaller memory and has higher calculation efficiency under the condition of not influencing the identification precision.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, which when executed by a processor causes the processor to implement the steps of the above-mentioned method.
Accordingly, embodiments of the present application also provide a computer program product, which includes a computer program/instructions that, when executed by a processor, cause the processor to implement the steps in the method shown above.
The communications component of fig. 12 described above is configured to facilitate communications between the device in which the communications component is located and other devices in a wired or wireless manner. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE, 5G, or the like, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component of fig. 12 provides power to the various components of the device in which it is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to complete all or part of the above described functions. For the specific working process of the system described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A model compression method is applied to a server, and comprises the following steps:
receiving a first image sent by a user terminal, wherein the first image is an image of a preset application scene;
determining a compression mode of the trained first visual network model according to the preset application scene, wherein the compression mode comprises channel compression and/or feature vector compression;
determining frequency domain information of the first image;
and compressing the first visual network model by adopting the compression mode based on the frequency domain information to obtain a second visual network model.
2. The model compression method of claim 1, wherein the frequency domain information includes low frequency information, and wherein the determining the frequency domain information for the first image comprises:
converting the first image into a frequency domain image;
weakening the high-frequency information of the frequency domain image to obtain a filtering frequency spectrum image;
and performing inverse transformation on the filtered spectrum image to obtain a second image, wherein the second image comprises the low-frequency information.
3. The model compression method according to claim 2, wherein the compression method is the channel compression, and the compressing the visual network model by using the compression method based on the frequency domain information to obtain a compressed visual network model comprises:
inputting the first image into the first visual network model for identification processing to obtain a first output result;
deleting each channel of the first visual network model to obtain a current visual network model;
inputting the second image into the current visual network model for recognition processing to obtain a second output result;
and performing channel compression on the first visual network model based on the first output result and the second output result to obtain the second visual network model.
4. The model compression method according to claim 3, wherein the channel compressing the first visual network model based on the first output result and the second output result to obtain the second visual network model comprises:
determining a first loss value of the second output result relative to the first output result, the magnitude of the first loss value representing the degree of influence of the channel on the first visual network model;
and deleting at least one channel of the first visual network model according to the first loss value to obtain the second visual network model, wherein the influence degree of the deleted channel on the first visual network model is smaller than an influence degree threshold value.
5. The model compression method according to claim 2, wherein the compression method is compression of the feature vector, and the compressing the first visual network model by using the compression method based on the frequency domain information to obtain a second visual network model comprises:
inputting the second image into the first visual network model for identification processing to obtain a plurality of intermediate feature vectors corresponding to the second image;
according to the intermediate feature vector, determining an attention score and a low-frequency correction quantity of the first visual network model;
fusing the attention score and the low-frequency correction to obtain an importance score of the intermediate feature vector;
and configuring a scoring threshold in the first visual network model based on the importance scores to obtain the second visual network model, wherein the scoring threshold is used for indicating the second visual network model to delete the feature vectors with the importance scores lower than the scoring threshold.
6. The model compression method of claim 5, wherein the configuring a score threshold in the first visual network model based on the importance score to obtain the second visual network model comprises:
deleting at least one intermediate feature vector of the first visual network model according to the importance score, wherein the influence degree of the deleted intermediate feature vector on the first visual network model is smaller than an influence degree threshold value;
and taking the highest importance score in the deleted intermediate feature vectors as a score threshold, and configuring the score threshold in the first visual network model to obtain the second visual network model.
7. The model compression method according to any one of claims 1 to 6, wherein the first visual network model includes a plurality of conversion modules, the plurality of conversion modules are connected in cascade, and compression is performed for each conversion module in order of the conversion module from a lower stage to an upper stage in the compression process of the first visual network model.
8. The model compression method according to any one of claims 1 to 6, wherein the preset application scenario comprises: the method comprises the steps of detecting one of a scene, a task segmentation scene, a classification scene and a task retrieval scene, wherein the compression mode comprises channel compression under the detection scene and the task segmentation scene, and the compression mode comprises channel compression and compression of feature vectors under the classification scene and the task retrieval scene.
9. The model compression method according to any one of claims 1 to 6, wherein after determining the compression mode of the trained first visual network model according to the preset application scenario, the method further comprises:
sending the compression mode to the user terminal;
if negative feedback information of the user terminal to the compression mode is received, sending a plurality of compression modes to the user terminal;
and receiving at least one compression mode sent by the user terminal, wherein the at least one compression mode is used for compressing the first visual network model.
10. The model compression method according to any one of claims 1 to 6, wherein in the process of compressing the first visual network model by the compression method based on the frequency domain information to obtain a second visual network model, the method further comprises:
sending the size of the current compressed model of the first visual network model to the user terminal;
and if receiving a compression stop instruction sent by the user terminal, determining that the currently compressed first visual network model is the second visual network model.
11. A model compression method is applied to a server, and comprises the following steps:
receiving a remote sensing image sent by a user terminal;
determining low-frequency information in the remote sensing image;
and based on the low-frequency information, compressing the trained first remote sensing model by adopting a channel compression mode to obtain a second remote sensing model.
12. The model compression method according to claim 11, wherein, after compressing the trained first remote sensing model by using a channel compression method based on the low-frequency information to obtain a second remote sensing model, the method further comprises:
acquiring training data, wherein the training data are remote sensing images;
and optimally training the second remote sensing model according to the training data to obtain a target remote sensing model.
13. An image processing method applied to a terminal, the image processing method comprising:
acquiring an image to be processed;
sending the image to be processed to a server, so that the server can identify the image to be processed by adopting a visual network model to obtain a processing result, wherein the visual network model is obtained according to the model compression method of any one of claims 1 to 12;
and receiving the processing result sent by the server.
14. A cloud device, comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the model compression method according to any one of claims 1 to 12 or the image processing method according to claim 13 when executing the computer program.
CN202210902200.0A 2022-07-29 2022-07-29 Model compression method, image processing method and device and cloud equipment Active CN115063673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210902200.0A CN115063673B (en) 2022-07-29 2022-07-29 Model compression method, image processing method and device and cloud equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210902200.0A CN115063673B (en) 2022-07-29 2022-07-29 Model compression method, image processing method and device and cloud equipment

Publications (2)

Publication Number Publication Date
CN115063673A true CN115063673A (en) 2022-09-16
CN115063673B CN115063673B (en) 2022-11-15

Family

ID=83205306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210902200.0A Active CN115063673B (en) 2022-07-29 2022-07-29 Model compression method, image processing method and device and cloud equipment

Country Status (1)

Country Link
CN (1) CN115063673B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953239A (en) * 2023-03-15 2023-04-11 无锡锡商银行股份有限公司 Surface examination video scene evaluation method based on multi-frequency flow network model

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160189388A1 (en) * 2014-12-24 2016-06-30 Canon Kabushiki Kaisha Video segmentation method
US20190354811A1 (en) * 2017-12-07 2019-11-21 Shanghai Cambricon Information Technology Co., Ltd Image compression method and related device
CN110782406A (en) * 2019-10-15 2020-02-11 深圳大学 Image denoising method and device based on information distillation network
CN112396179A (en) * 2020-11-20 2021-02-23 浙江工业大学 Flexible deep learning network model compression method based on channel gradient pruning
CN112749802A (en) * 2021-01-25 2021-05-04 深圳力维智联技术有限公司 Neural network model training method and device and computer readable storage medium
CN112906874A (en) * 2021-04-06 2021-06-04 南京大学 Convolutional neural network characteristic graph data compression method and device
CN113255433A (en) * 2021-04-06 2021-08-13 北京迈格威科技有限公司 Model training method, device and computer storage medium
WO2021208151A1 (en) * 2020-04-13 2021-10-21 商汤集团有限公司 Model compression method, image processing method and device
CN113657585A (en) * 2021-09-03 2021-11-16 南方电网电力科技股份有限公司 Pruning method and device for sparse network structure
CN114139705A (en) * 2021-12-03 2022-03-04 杭州电子科技大学 Structured pruning method based on image frequency response
WO2022057776A1 (en) * 2020-09-21 2022-03-24 华为技术有限公司 Model compression method and apparatus
CN114492731A (en) * 2021-12-23 2022-05-13 北京达佳互联信息技术有限公司 Training method and device of image processing model and electronic equipment
CN114548362A (en) * 2021-12-07 2022-05-27 广东机场白云信息科技有限公司 Deep learning knowledge distillation method and system based on frequency domain supervision
CN114757350A (en) * 2022-04-22 2022-07-15 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Convolutional network channel cutting method and system based on reinforcement learning
CN115049054A (en) * 2022-06-12 2022-09-13 中国科学院重庆绿色智能技术研究院 Channel self-adaptive segmented dynamic network pruning method based on characteristic diagram response

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ONAT DALMAZ et al.: "ResViT: Residual vision transformers for multi-modal medical image synthesis", arXiv *
XIAO Guangyi: "SAR image super-resolution reconstruction based on joint discrimination of high- and low-resolution images", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN115063673B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN110032926B (en) Video classification method and device based on deep learning
KR102082815B1 (en) Artificial intelligence based resolution improvement system
US8948540B2 (en) Optimized orthonormal system and method for reducing dimensionality of hyperspectral images
US8463025B2 (en) Distributed artificial intelligence services on a cell phone
CN110956202B (en) Image training method, system, medium and intelligent device based on distributed learning
WO2019012363A1 (en) Visual quality preserving quantization parameter prediction with deep neural network
CN107578453A (en) Compressed image processing method, apparatus, electronic equipment and computer-readable medium
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN107529098A (en) Real-time video is made a summary
CN110555527A (en) Method and equipment for generating delayed shooting video
CN109598250B (en) Feature extraction method, device, electronic equipment and computer readable medium
CN111260037B (en) Convolution operation method and device of image data, electronic equipment and storage medium
CN108198130A (en) Image processing method, device, storage medium and electronic equipment
CN108960314B (en) Training method and device based on difficult samples and electronic equipment
KR20200140713A (en) Method and apparatus for training neural network model for enhancing image detail
CN104063686A (en) System and method for performing interactive diagnosis on crop leaf segment disease images
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN114283351A (en) Video scene segmentation method, device, equipment and computer readable storage medium
CN115063673B (en) Model compression method, image processing method and device and cloud equipment
KR102177247B1 (en) Apparatus and method for determining manipulated image
CN111199540A (en) Image quality evaluation method, image quality evaluation device, electronic device, and storage medium
CN111062914B (en) Method, apparatus, electronic device and computer readable medium for acquiring facial image
CN115205613A (en) Image identification method and device, electronic equipment and storage medium
Zhong et al. Prediction system for activity recognition with compressed video
CN111160201A (en) Face image uploading method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant