CN115761437B - Image processing method, training method and electronic device based on vision transformer - Google Patents

Image processing method, training method and electronic device based on vision transformer


Publication number
CN115761437B
Authority
CN
China
Prior art keywords
image, marks, important, mark, model
Prior art date
Legal status
Active
Application number
CN202211400729.9A
Other languages
Chinese (zh)
Other versions
CN115761437A (en
Inventor
龙思凡
皮积敏
王井东
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211400729.9A priority Critical patent/CN115761437B/en
Publication of CN115761437A publication Critical patent/CN115761437A/en
Application granted granted Critical
Publication of CN115761437B publication Critical patent/CN115761437B/en


Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides an image processing method, a training method and an electronic device based on a vision transformer, relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be applied to scenes such as OCR. The method comprises the following steps: determining a plurality of original image marks of an image to be processed; splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a vision transformer model; aggregating the at least two important image marks through the vision transformer model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks; and performing image processing according to the new important image marks and the new secondary image marks through the vision transformer model to obtain an image processing result. The technical scheme can improve the accuracy of image processing.

Description

Image processing method, training method and electronic device based on vision transformer
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the fields of deep learning, image processing, and computer vision, and may be applied to scenes such as OCR (Optical Character Recognition). In particular, it relates to an image processing method, a training method and an electronic device based on a vision transformer.
Background
Transformers have become the most popular architecture in the fields of natural language processing and computer vision. The vision transformer (Vision Transformer, ViT) achieves excellent performance in a variety of vision tasks and outperforms standard convolutional neural networks (Convolutional Neural Networks, CNN). The most significant advantage of the transformer is that it can effectively capture long-range dependencies between patches in an input image through a self-attention mechanism.
However, the quadratic interaction between tokens significantly reduces computational efficiency. Therefore, how to improve image processing efficiency is an important problem.
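For reference, the quadratic cost comes from the standard self-attention formulation; this is textbook background rather than a definition specific to this disclosure:

```latex
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q,K,V\in\mathbb{R}^{N\times d},
```

where the N×N matrix QK⊤ makes the cost of every self-attention layer grow as O(N²d) in the number of tokens N.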
Disclosure of Invention
The disclosure provides an image processing method, a training method and an electronic device based on a vision transformer.
According to an aspect of the present disclosure, there is provided an image processing method based on a vision transformer, including:
determining a plurality of original image marks of an image to be processed;
splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a vision transformer model;
aggregating the at least two important image marks through the vision transformer model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
and performing image processing according to the new important image marks and the new secondary image marks through the vision transformer model to obtain an image processing result.
According to another aspect of the present disclosure, there is provided a training method of a vision transformer model, including:
determining a plurality of original image marks of an image to be trained;
splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through an initial model to be trained;
aggregating the at least two important image marks through the initial model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
performing image processing according to the new important image marks and the new secondary image marks through the initial model to obtain a prediction result of the image to be trained;
training the initial model according to the prediction result and the label result of the image to be trained to obtain a vision transformer model; the vision transformer model is used for carrying out image processing on an image to be processed to obtain an image processing result.
According to still another aspect of the present disclosure, there is provided an image processing apparatus based on a vision transformer, including:
an original image marking module for determining a plurality of original image marks of an image to be processed;
an image mark splitting module for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a vision transformer model;
an image mark aggregation module for aggregating the at least two important image marks through the vision transformer model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
and an image processing module for performing image processing according to the new important image marks and the new secondary image marks through the vision transformer model to obtain an image processing result.
According to yet another aspect of the present disclosure, there is provided a training apparatus of a vision transformer model, including:
an original image marking module for determining a plurality of original image marks of an image to be trained;
an image mark splitting module for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through an initial model to be trained;
an image mark aggregation module for aggregating the at least two important image marks through the initial model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
a prediction module for performing image processing according to the new important image marks and the new secondary image marks through the initial model to obtain a prediction result of the image to be trained;
and a model training module for training the initial model according to the prediction result and the label result of the image to be trained to obtain a vision transformer model; the vision transformer model is used for carrying out image processing on an image to be processed to obtain an image processing result.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method provided by any of the embodiments of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1a is a flow chart of a vision-transformer-based image processing method provided in accordance with an embodiment of the present disclosure;
fig. 1b is an effect schematic diagram of two pruning methods provided according to an embodiment of the present disclosure;
FIG. 2a is a flow chart of another vision-transformer-based image processing method provided in accordance with an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of an encoder in a vision transformer model provided by an embodiment of the present disclosure;
FIG. 3a is a flow chart of yet another vision-transformer-based image processing method provided in accordance with an embodiment of the present disclosure;
FIG. 3b is a schematic diagram of an encoder in a vision transformer model provided by an embodiment of the present disclosure;
FIG. 3c is a schematic illustration of the effect of three different pruning methods provided in accordance with embodiments of the present disclosure;
FIG. 4 is a flow chart of a method of training a vision transformer model provided in accordance with an embodiment of the present disclosure;
Fig. 5 is a schematic structural view of an image processing apparatus based on a vision transformer according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a training apparatus of a vision transformer model provided in accordance with an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device used to implement a vision-transformer-based image processing method or a training method of a vision transformer model in accordance with an embodiment of the present disclosure.
Detailed Description
Fig. 1a is a flowchart of an image processing method based on a vision transformer according to an embodiment of the present disclosure. The method is applicable to the case of performing image processing by using a vision transformer. The method may be performed by a vision-transformer-based image processing apparatus, which may be implemented in software and/or hardware and may be integrated in an electronic device. As shown in fig. 1a, the image processing method based on the vision transformer of the present embodiment may include:
S101, determining a plurality of original image marks of an image to be processed;
S102, splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a vision transformer model;
S103, aggregating the at least two important image marks through the vision transformer model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
S104, performing image processing according to the new important image marks and the new secondary image marks through the vision transformer model to obtain an image processing result.
The vision transformer model is a model trained on the basis of a vision transformer and is used for processing the image to be processed to obtain an image processing result.
Specifically, the image to be processed may be segmented by an image segmentation layer in the vision transformer model to obtain a plurality of local image blocks, and the local image blocks are linearly encoded by a linear encoding layer in the vision transformer model to obtain an original image mark (Image Token) of each local image block. The local image blocks are in one-to-one correspondence with the original image marks, i.e. each local image block has its own original image mark. A learnable class mark (Class Token, CLS) and the original image marks constitute a mark sequence, the class mark being located at the head of the mark sequence. The vision transformer model further comprises a plurality of encoders which are connected in sequence, and a prediction layer which is connected with the tail encoder; the image processing result is obtained by prediction through the prediction layer.
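For illustration only, the tokenization described above can be sketched in a few lines of PyTorch; the layer names, image size and embedding dimension below are assumptions of the example, not values prescribed by this disclosure:

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Illustrative sketch: segment an image into local image blocks, linearly
    encode each block into an original image mark (token), and prepend a
    learnable class mark at the head of the mark sequence."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A stride-`patch_size` convolution is equivalent to non-overlapping
        # patch splitting followed by a shared linear encoding.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D) original image marks
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # (B, 1, D) class mark
        return torch.cat([cls, tokens], dim=1)              # class mark at the head

seq = PatchTokenizer()(torch.randn(2, 3, 224, 224))         # (2, 1 + 196, 768)
```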
In the disclosed embodiment, each original image mark can be split by an encoder into an important group and a secondary group (also called a non-important group) according to the attention score of each original image mark. The important group includes at least two important image marks, the secondary group includes at least two secondary image marks, and the attention score of any important image mark is greater than the attention scores of all secondary image marks.
The encoder further aggregates the at least two important image marks to obtain new important image marks and aggregates the at least two secondary image marks to obtain new secondary image marks. The learnable class mark, the new important image marks and the new secondary image marks constitute a new mark sequence as the input of the next encoder. Because the number of new important image marks is less than the number of important image marks before aggregation and the number of new secondary image marks is also less than the number of secondary image marks before aggregation, the number of marks in the new mark sequence is less than the number of marks in the mark sequence before aggregation; pruning of the vision transformer is thereby realized, and the computational complexity of the encoder can be greatly reduced.
The related art proposes a pruning method of image marks (which may be simply referred to as the first pruning method) that retains only the important image marks and directly discards all the secondary image marks to achieve a desired pruning retention rate. However, the inventors have found during image processing that even the image background contributes to improving the image processing performance on foreground instances; therefore even the less important image marks, which may include semantic information useful for image processing, are an effective complement to information diversity. Moreover, the diversity of the image marks is very critical to optimizing the structure of the vision transformer, and proper preservation of the secondary image marks can increase the diversity of semantic information, thereby benefiting the pruning of image marks. The embodiment of the disclosure realizes a pruning method that considers both importance and diversity (which may be simply referred to as the second pruning method) by splitting the original image marks into important image marks and secondary image marks and aggregating the important image marks and the secondary image marks respectively.
Fig. 1b is an effect schematic diagram of two pruning methods provided according to an embodiment of the present disclosure. Referring to fig. 1b, the image processing accuracy and the diversity scores of the two pruning methods at different pruning retention rates are compared respectively. It can be seen that there is a positive correlation between image processing accuracy and diversity score. Whether for the first pruning method or for the second pruning method proposed in this embodiment, a higher diversity score always results in higher accuracy. In addition, it is also observed that in the first pruning method, as the pruning retention rate decreases, both the diversity score and the classification accuracy decrease rapidly. However, thanks to the diversity-preserving image mark aggregation strategy, the second pruning method can maintain a relatively high diversity score at different pruning retention rates, so that the accuracy of the second pruning method is always better than that of the first pruning method, even at low retention rates. In summary, maintaining high image mark diversity is critical to improving the accuracy of image processing.
According to the technical scheme provided by the embodiment of the disclosure, the vision transformer model obtains new important image marks and new secondary image marks by splitting a plurality of original image marks of an image to be processed into at least two important image marks and at least two secondary image marks and aggregating the at least two important image marks and the at least two secondary image marks respectively; a new mark sequence is determined according to the new important image marks and the new secondary image marks, and an image processing result is obtained according to the new mark sequence. Pruning that considers both importance and diversity is thereby realized, and the efficiency and accuracy of image processing can be improved.
Fig. 2a is a flowchart of another vision-transformer-based image processing method provided in accordance with an embodiment of the present disclosure. Referring to fig. 2a, the vision-transformer-based image processing method of the present embodiment may include:
S201, determining a plurality of original image marks of an image to be processed;
S202, determining the attention scores of the original image marks through a multi-head self-attention module in a vision transformer model;
S203, splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks through a decoupling module in the vision transformer model;
S204, aggregating the at least two important image marks through the vision transformer model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
S205, performing image processing according to the new important image marks and the new secondary image marks through the vision transformer model to obtain an image processing result.
Fig. 2b is a schematic diagram of an encoder in a vision transformer model provided in an embodiment of the disclosure. Referring to fig. 2b, the encoder in the vision transformer model includes a multi-head self-attention module (Multi-Head Self-Attention, MHSA), a decoupling module, and a feed-forward network module (Feed-Forward Network, FFN). The decoupling module is used for splitting the original image marks into an important group and a secondary group.
Specifically, the self-attention scores of the original image marks are obtained by the MHSA module in the encoder performing self-attention calculation on the original image marks, and the self-attention scores obtained by all heads are averaged to serve as the attention score of each original image mark. The plurality of original image marks are split into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks by the decoupling module in the encoder. By splitting the plurality of original image marks into the important group and the secondary group according to their attention scores, accurate splitting is realized, and a foundation is laid for the subsequent aggregation of the important image marks and the secondary image marks.
In an alternative embodiment, splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks through a decoupling module in a vision transformer model comprises: taking, through the decoupling module in the vision transformer model, the first K original image marks with relatively high attention scores as important image marks, and taking the remaining original image marks as secondary image marks; wherein K is determined according to the pruning retention rate and the total number of original image marks, and is a positive integer.
Specifically, the number of important image marks can be obtained by the following formula:
K=N×η;
wherein N is the total number of original image marks, η is the pruning retention rate, and K is the number of important image marks. N and K are positive integers, and the value interval of η is (0.1, 0.9). Specifically, the pruning retention rate of the vision transformer model and the total number of original image marks are obtained, and the number of important image marks is determined according to the pruning retention rate and the total number of original image marks. The first K original image marks with relatively high attention scores are selected as important image marks, and the remaining N−K original image marks are taken as secondary image marks. The number of important image marks is determined according to the pruning retention rate, and the original image marks are grouped accordingly, so that the requirements of different pruning retention rates can be met, which further improves the flexibility of pruning.
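A minimal sketch of this decoupling step is given below. It assumes that each image mark is scored by the attention it receives from the class mark, averaged over heads; the disclosure only requires that the per-head self-attention scores be averaged, so this particular scoring choice and the tensor layout are assumptions of the example:

```python
import torch

def split_tokens(attn, tokens, eta=0.5):
    """Illustrative decoupling of image marks into important / secondary groups.

    attn:   (B, H, N+1, N+1) self-attention maps from the MHSA module,
            where index 0 is the class mark.
    tokens: (B, N, D) image marks (class mark excluded).
    eta:    pruning retention rate, 0 < eta < 1.
    """
    scores = attn.mean(dim=1)[:, 0, 1:]                 # (B, N) head-averaged attention scores
    B, N, _ = tokens.shape
    K = max(1, int(round(N * eta)))                     # K = N x eta important marks
    topk = scores.topk(K, dim=1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, topk, True)
    important = tokens[mask].view(B, K, -1)             # top-K marks by attention score
    secondary = tokens[~mask].view(B, N - K, -1)        # remaining N - K marks
    return important, secondary
```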
In an alternative embodiment, the vision transformer model is deployed on a mobile terminal device; the image processing includes at least one of: image classification, image semantic segmentation, or image object detection.
Specifically, in the image processing process of the vision transformer model provided by the embodiment of the disclosure, image classification, image semantic segmentation, image object detection and the like can be performed on the image to be processed, for example OCR (Optical Character Recognition) and the like.
Let N×D be the input dimension of the encoder, where N is the number of original image marks and D is the embedding dimension of each image mark, both positive integers. The encoder has a computational complexity of 12ND²+2N²D. By splitting the original image marks into important image marks and secondary image marks and aggregating them respectively, the number of image marks can be reduced and the computational complexity greatly reduced, so that the vision transformer model can be deployed on mobile terminal equipment, which further improves the convenience of image processing.
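A worked illustration of this saving, under the assumption that the aggregation reduces the per-encoder mark count from N to some N′ < N (the exact N′ depends on the aggregation result):

```latex
\mathrm{FLOPs}(N) = 12ND^{2} + 2N^{2}D
\;\;\Longrightarrow\;\;
\frac{\mathrm{FLOPs}(N')}{\mathrm{FLOPs}(N)}
= \frac{12N'D^{2} + 2N'^{2}D}{12ND^{2} + 2N^{2}D}.
```

For example, with D = 768 and N = 196, halving the mark count to N′ = 98 gives a ratio of about 0.49, i.e. roughly half the per-encoder cost; these numbers are illustrative only and do not appear in the disclosure.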
According to the technical scheme provided by the embodiment of the disclosure, the original image marks are split into important image marks and secondary image marks according to the attention scores of the original image marks through the decoupling module in the encoder, which can improve the accuracy of image mark splitting; and the number of important image marks is flexibly determined according to the pruning retention rate, which improves the flexibility of pruning.
Fig. 3a is a flowchart of yet another vision-transformer-based image processing method provided in accordance with an embodiment of the present disclosure. This embodiment is an alternative to the embodiments described above. Referring to fig. 3a, the vision-transformer-based image processing method of the present embodiment may include:
S301, determining a plurality of original image marks of an image to be processed;
S302, splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a vision transformer model;
S303, aggregating the at least two important image marks through an important aggregation module in the vision transformer model to obtain new important image marks;
S304, aggregating the at least two secondary image marks through a secondary aggregation module in the vision transformer model to obtain new secondary image marks;
wherein the aggregation parameters of the important aggregation module are different from those of the secondary aggregation module;
S305, performing image processing according to the new important image marks and the new secondary image marks through the vision transformer model to obtain an image processing result.
Fig. 3b is a schematic diagram of an encoder in a vision transformer model provided in an embodiment of the disclosure. Referring to fig. 3b, the encoder of the vision transformer model further includes an important aggregation module and a secondary aggregation module between the MHSA module and the FFN module. The important aggregation module is used for aggregating the important image marks, and the secondary aggregation module is used for aggregating the secondary image marks. In addition, the encoder can further comprise a decoupling module, whose input end is connected with the multi-head self-attention module and whose output end is connected with the important aggregation module and the secondary aggregation module respectively.
Specifically, in the encoder, the self-attention scores of the original image marks are obtained by the MHSA module performing self-attention calculation on the original image marks, the plurality of original image marks are split into at least two important image marks and at least two secondary image marks according to the attention scores by the decoupling module, the important image marks are aggregated by the important aggregation module, and the secondary image marks are aggregated by the secondary aggregation module. The aggregation parameters used by the important aggregation module and the secondary aggregation module are different; for example, the retention rate of the secondary image marks may be lower than the retention rate of the important image marks, that is, the ratio of the number of new secondary image marks to the total number of secondary image marks may be lower than the ratio of the number of new important image marks to the total number of important image marks. The important aggregation module determines the similarity between different important image marks and performs similarity fusion on the important image marks, so that the number of important image marks can be reduced without reducing performance.
By adopting different aggregation parameters to aggregate the important image marks and the secondary image marks respectively, compared with aggregating both groups with the same clustering parameters, the method considers not only the influence of diversity on image processing performance but also the effect of importance, and realizes a pruning method that takes both importance and diversity into account.
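One plausible reading of the similarity fusion performed on the important image marks is sketched below; the disclosure does not fix the similarity measure or the merging rule, so the cosine similarity and pairwise averaging used here are assumptions of the example:

```python
import torch
import torch.nn.functional as F

def merge_most_similar(tokens, num_merges):
    """Illustrative similarity fusion for the important marks of one sample:
    repeatedly average the most similar pair until `num_merges` merges are done.

    tokens: (M, D) important image marks. Returns (M - num_merges, D)."""
    marks = [t for t in tokens]
    for _ in range(num_merges):
        if len(marks) < 2:
            break
        x = torch.stack(marks)                                        # (M, D)
        sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
        sim.fill_diagonal_(-1.0)                                      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (marks[i] + marks[j]) / 2                            # fuse the most similar pair
        marks = [t for k, t in enumerate(marks) if k not in (i, j)] + [merged]
    return torch.stack(marks)
```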
The embodiments of the present disclosure also investigated the following three different pruning methods: a first pruning method, a second pruning method and a third pruning method. The first pruning method only retains the important image marks and directly discards all the secondary image marks; it may also be referred to as an importance-based method. The second pruning method emphasizes both the importance and the diversity of the image marks, and aggregates the important image marks and the secondary image marks respectively. The third pruning method only considers diversity and not importance, performing similarity aggregation on all image marks. Fig. 3c is a schematic diagram of the effects of the three different pruning methods provided according to embodiments of the present disclosure. Referring to fig. 3c, the second pruning method always achieves the best performance, while the other two pruning methods achieve relatively weaker performance. The advantage of the second pruning method is especially pronounced at low pruning retention rates.
In an alternative embodiment, the aggregating, by the secondary aggregation module in the vision transformer model, the at least two secondary image marks to obtain new secondary image marks includes: determining, by the secondary aggregation module in the vision transformer model, the local density of each secondary image mark and its minimum distance from a higher local density respectively; and aggregating the at least two secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain new secondary image marks.
The minimum distance from any secondary image mark to a higher local density is the distance from that secondary image mark to the closest secondary image mark whose local density is greater than its own.
In the secondary aggregation module, the local density of each secondary image mark and its minimum distance from a higher local density can be determined respectively based on a density peak clustering algorithm (Clustering by fast search and find of Density Peaks, DPC), and the secondary image marks are fused by combining the local density and the minimum distance from the higher local density to obtain new secondary image marks. A study of a large number of clustering algorithms shows that the performance of the DPC clustering algorithm is superior to that of classical clustering algorithms such as the K-means algorithm. It should be noted that the clustering algorithm used for aggregating the secondary image marks may be the same as or different from the clustering algorithm used for aggregating the important image marks.
In an alternative embodiment, the aggregating the at least two secondary image marks to obtain a new secondary image mark according to the local density of the secondary image marks and the minimum distance from the higher local density comprises: clustering the secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain secondary categories to which the secondary image marks belong; and fusing the secondary image marks in the secondary category to obtain a new secondary image mark corresponding to the secondary category.
Specifically, the distance between any two secondary image marks can be determined based on the DPC clustering algorithm, the cut-off distance is calculated according to these distances, and the local density of each secondary image mark is obtained by combining the distances, the cut-off distance and the secondary image marks. For each secondary image mark, the adjacent secondary image mark with a larger local density and the closest distance is selected, and the distance between the secondary image mark and this adjacent secondary image mark is taken as the minimum distance from the secondary image mark to a higher local density. Cluster centers are selected by combining the local density and the minimum distance to a higher local density, and each remaining secondary image mark is assigned to the nearest cluster center with a higher local density, so as to obtain a plurality of secondary categories. The secondary image marks in each secondary category are fused respectively to obtain a new secondary image mark of that secondary category, thereby improving the accuracy of the new secondary image marks.
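A compact sketch of this density-peak clustering step for the secondary marks of a single sample is given below; the Gaussian density kernel and the quantile-based cut-off distance are illustrative choices, not values fixed by the disclosure:

```python
import torch

def dpc_cluster(tokens, num_centers, dc_quantile=0.2):
    """Illustrative DPC clustering of secondary image marks.

    tokens: (M, D) secondary marks of one sample.
    Returns per-mark cluster labels (M,) and the indices of the cluster centers."""
    M = tokens.shape[0]
    dist = torch.cdist(tokens, tokens)                      # pairwise distances
    dc = torch.quantile(dist, dc_quantile)                  # cut-off distance (assumed heuristic)
    rho = torch.exp(-(dist / dc) ** 2).sum(dim=1) - 1.0     # local density, excluding self
    delta = torch.zeros(M)                                  # min distance to a higher density
    nearest_higher = torch.full((M,), -1, dtype=torch.long)
    order = torch.argsort(rho, descending=True)
    for rank, i in enumerate(order.tolist()):
        if rank == 0:
            delta[i] = dist[i].max()                        # densest mark: use the largest distance
            continue
        higher = order[:rank]                               # marks with higher density
        d, idx = dist[i, higher].min(dim=0)
        delta[i], nearest_higher[i] = d, higher[idx]
    centers = torch.argsort(rho * delta, descending=True)[:num_centers]
    label = torch.full((M,), -1, dtype=torch.long)
    label[centers] = torch.arange(num_centers)
    for i in order.tolist():                                # assign remaining marks, densest first
        if label[i] < 0:
            j = int(nearest_higher[i])
            label[i] = label[j] if j >= 0 else 0
    return label, centers
```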
In an alternative embodiment, fusing the secondary image marks in a secondary category to obtain the new secondary image mark corresponding to that secondary category includes: determining the importance of each secondary image mark in each secondary category, and fusing the secondary image marks according to their importance to obtain the new secondary image mark corresponding to the secondary category.
In the disclosed embodiment, the class mark is also introduced to represent importance, which may be a weight value. The importance of each secondary image mark can be determined through the class mark, and the secondary image marks are fused by weighting with their importance to obtain a new secondary image mark. By introducing importance to distinguish the semantic weights of the secondary image marks within the same secondary category, the accuracy of the new secondary image marks can be further improved compared to directly averaging the secondary image marks.
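For illustration, the weighted fusion inside one secondary category can be sketched as follows; taking the class-mark attention of each mark as its importance weight is an assumption of the example:

```python
import torch

def fuse_category(tokens, importance):
    """Importance-weighted fusion of the secondary marks of one secondary category.

    tokens:     (M, D) marks assigned to the same category.
    importance: (M,) non-negative importance values, e.g. the class-mark
                attention of each mark (an illustrative choice).
    Returns one new secondary image mark of shape (D,)."""
    w = importance / importance.sum().clamp_min(1e-6)   # normalize to fusion weights
    return (w.unsqueeze(-1) * tokens).sum(dim=0)        # weighted average instead of plain mean
```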
According to the technical scheme provided by the embodiment of the disclosure, the important image marks and the secondary image marks are aggregated respectively by the important aggregation module and the secondary aggregation module in the encoder, and the aggregation parameters of the important aggregation module are different from those of the secondary aggregation module, so that both the semantic diversity and the importance involved in image processing can be taken into account, further improving image processing performance.
Fig. 4 is a flowchart of a training method of a vision transformer model provided in accordance with an embodiment of the present disclosure. The method is applicable to the case of training a vision transformer. The method may be performed by a training apparatus of the vision transformer model, which may be implemented in software and/or hardware and may be integrated in an electronic device. As shown in fig. 4, the training method of the vision transformer model of the present embodiment may include:
S401, determining a plurality of original image marks of an image to be trained;
S402, splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through an initial model to be trained;
S403, aggregating the at least two important image marks through the initial model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
S404, performing image processing according to the new important image marks and the new secondary image marks through the initial model to obtain a prediction result of the image to be trained;
S405, training the initial model according to the prediction result and the label result of the image to be trained to obtain a vision transformer model; the vision transformer model is used for carrying out image processing on an image to be processed to obtain an image processing result.
The initial model adopts the network structure of a vision transformer, and can comprise an image dividing layer, a linear coding layer, a plurality of encoders and a prediction layer, wherein the encoders are connected end to end: the output sequence of the linear coding layer is used as the input of the head encoder, the output of each encoder is used as the input of the next encoder, and the output of the tail encoder is used as the input of the prediction layer.
Each encoder may include an MHSA module, a decoupling module, an important aggregation module, a secondary aggregation module, and an FFN module. The output of the MHSA module is used as the input of the decoupling module, the important image marks output by the decoupling module are used as the input of the important aggregation module, the secondary image marks output by the decoupling module are used as the input of the secondary aggregation module, the new important image marks output by the important aggregation module and the new secondary image marks output by the secondary aggregation module are used as the input of the FFN module, and the FFN module processes the new mark sequence to obtain the encoding result of the encoder.
In the model training stage, the image to be trained and the label result of the image to be trained are obtained. Specifically, the image to be trained is segmented through the image segmentation layer to obtain a plurality of local image blocks, and the original image marks of the local image blocks are obtained through the linear coding layer to form a mark sequence; the mark sequence is used as the input of an encoder, and the encoder divides the original image marks into important image marks and secondary image marks according to their attention scores; the important image marks and the secondary image marks are aggregated respectively to obtain new important image marks and new secondary image marks, yielding a new mark sequence. The new mark sequence obtained by the tail encoder is predicted to obtain the prediction result of the image to be trained; a loss function is constructed according to the prediction result of the image to be trained and the label result of the image to be trained, and the loss function is used to update the parameters to be trained of each module in the initial model.
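A minimal training-loop sketch for the image-classification case is given below; it assumes the initial model returns class logits and that a standard cross-entropy loss and AdamW optimizer are used, which are illustrative choices rather than requirements of the disclosure:

```python
import torch
import torch.nn as nn

def train_epoch(initial_model, loader, lr=1e-4, device="cuda"):
    """One illustrative epoch: predict, compare with the label result,
    and update the parameters to be trained of every module."""
    initial_model.train().to(device)
    criterion = nn.CrossEntropyLoss()                     # loss between prediction and label result
    optimizer = torch.optim.AdamW(initial_model.parameters(), lr=lr)
    for images, labels in loader:                         # images to be trained and label results
        images, labels = images.to(device), labels.to(device)
        logits = initial_model(images)                    # prediction result
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```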
The above processing reduces the number of image marks and realizes pruning of the vision transformer, so the computational complexity of the encoder can be greatly reduced. In addition, the semantic information of the secondary image marks is retained and the different importance of the important image marks and the secondary image marks is considered, realizing a pruning method that takes both importance and diversity into account and improving the efficiency and performance of subsequent image processing.
According to the technical scheme provided by the embodiment of the disclosure, in the model training stage, a plurality of original image marks of the image to be trained are split into important image marks and secondary image marks, and the important image marks and the secondary image marks are aggregated respectively to obtain new important image marks and new secondary image marks; a new mark sequence is determined according to the new important image marks and the new secondary image marks, the prediction result of the image to be trained is obtained according to the new mark sequence, and the model is updated by combining the prediction result and the label result of the image to be trained. Pruning that considers both importance and diversity is thereby realized, and the efficiency and accuracy of image processing can be improved.
In an alternative embodiment, splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through an initial model to be trained comprises: determining the attention scores of the original image marks through a multi-head self-attention module in the initial model; and splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks through a decoupling module in the initial model.
In an alternative embodiment, the splitting, by a decoupling module in the initial model, the plurality of original image marks into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks includes: taking, through the decoupling module in the initial model, the first K original image marks with relatively high attention scores as important image marks, and taking the remaining original image marks as secondary image marks; wherein K is determined according to the pruning retention rate and the total number of original image marks, and is a positive integer.
In an alternative embodiment, aggregating the at least two important image marks to obtain new important image marks and aggregating the at least two secondary image marks to obtain new secondary image marks through the initial model includes: aggregating the at least two important image marks through an important aggregation module in the initial model to obtain new important image marks; and aggregating the at least two secondary image marks through a secondary aggregation module in the initial model to obtain new secondary image marks; wherein the aggregation parameters of the important aggregation module are different from those of the secondary aggregation module.
In an alternative embodiment, the aggregating, by the secondary aggregation module in the initial model, the at least two secondary image marks to obtain new secondary image marks includes: determining the local density of each secondary image mark and its minimum distance from a higher local density respectively through the secondary aggregation module in the initial model; and aggregating the at least two secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain new secondary image marks.
In an alternative embodiment, the aggregating the at least two secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain a new secondary image mark includes: clustering the secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain secondary categories to which the secondary image marks belong; and fusing the secondary image marks in the secondary category to obtain a new secondary image mark corresponding to the secondary category.
In an optional embodiment, the fusing the secondary image marks in the secondary category to obtain a new secondary image mark corresponding to the secondary category includes: determining the importance of each secondary image mark in each secondary category, and fusing the secondary image marks according to their importance to obtain the new secondary image mark corresponding to the secondary category.
In an alternative embodiment, the image processing includes at least one of: image classification, image semantic segmentation, or image object detection.
Fig. 5 is a schematic structural view of an image processing apparatus based on a vision transformer according to an embodiment of the present disclosure. The present embodiment is applicable to the case of performing image processing using a vision transformer. The apparatus may be implemented in software and/or hardware. As shown in fig. 5, the vision-transformer-based image processing apparatus 500 of the present embodiment may include:
a raw image marking module 510 for determining a plurality of raw image markings of an image to be processed;
an image mark splitting module 520 for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a vision transformer model;
an image mark aggregation module 530, configured to aggregate, through the vision transformer model, the at least two important image marks to obtain new important image marks, and aggregate the at least two secondary image marks to obtain new secondary image marks;
and the image processing module 540, configured to perform image processing according to the new important image marks and the new secondary image marks through the vision transformer model, so as to obtain an image processing result.
In an alternative embodiment, the image tag splitting module 520 includes:
an attention unit for determining the attention scores of the original image marks through the multi-head self-attention module in the vision transformer model;
an image mark splitting unit for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks through the decoupling module in the vision transformer model.
In an alternative embodiment, the image tag splitting unit is specifically configured to:
taking, through the decoupling module in the vision transformer model, the first K original image marks with relatively high attention scores as important image marks, and taking the remaining original image marks as secondary image marks; wherein K is determined according to the pruning retention rate and the total number of original image marks, and is a positive integer.
In an alternative embodiment, the image tag aggregation module 530 includes:
an important aggregation unit, configured to aggregate the at least two important image marks through the important aggregation module in the vision transformer model to obtain new important image marks;
a secondary aggregation unit, configured to aggregate the at least two secondary image marks through the secondary aggregation module in the vision transformer model to obtain new secondary image marks; wherein the aggregation parameters of the important aggregation module are different from those of the secondary aggregation module.
In an alternative embodiment, the secondary polymeric unit comprises:
a density distance subunit for determining, by the secondary aggregation module in the vision transformer model, the local density of each secondary image mark and its minimum distance from a higher local density respectively;
and the secondary aggregation subunit is used for aggregating the at least two secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain new secondary image marks.
In an alternative embodiment, the secondary polymeric subunits are specifically configured to:
clustering the secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain secondary categories to which the secondary image marks belong;
and fusing the secondary image marks in the secondary category to obtain a new secondary image mark corresponding to the secondary category.
In an alternative embodiment, the secondary polymeric subunits are specifically configured to:
and determining the importance of each secondary image mark in each secondary category, and fusing each secondary image mark according to the importance of each secondary image mark to obtain a new secondary image mark corresponding to the secondary category.
In an alternative embodiment, the vision transformer model is deployed on a mobile terminal device; the image processing includes at least one of: image classification, image semantic segmentation, or image object detection.
According to the technical scheme, the important image marks and the secondary image marks are aggregated respectively through the important aggregation module and the secondary aggregation module in the encoder, and the aggregation parameters of the important aggregation module are different from those of the secondary aggregation module, so that the semantic diversity and the importance involved in image processing can be taken into account, further improving image processing performance.
Fig. 6 is a schematic structural diagram of a training apparatus of a vision transformer model according to an embodiment of the present disclosure. The embodiment is applicable to the case of training a vision transformer. The apparatus may be implemented in software and/or hardware. As shown in fig. 6, the training apparatus 600 of the vision transformer model of the present embodiment may include:
An original image marking module 610, configured to determine a plurality of original image marks of an image to be trained;
an image tag splitting module 620, configured to split the plurality of original image tags into at least two important image tags and at least two secondary image tags by an initial model to be trained;
the image tag aggregation module 630 is configured to aggregate, through the initial model, the at least two important image tags to obtain new important image tags, and aggregate the at least two secondary image tags to obtain new secondary image tags;
the prediction module 640 is configured to perform image processing according to the new important image mark and the new secondary image mark through the initial model, so as to obtain a prediction result of the image to be trained;
the model training module 650 is configured to train the initial model according to the prediction result and the label result of the image to be trained, so as to obtain a vision transformer model; the vision transformer model is used for carrying out image processing on an image to be processed to obtain an image processing result.
According to the technical scheme, in the model training stage, the important image marks and the secondary image marks are aggregated respectively through the important aggregation module and the secondary aggregation module in the encoder, and the aggregation parameters of the important aggregation module are different from those of the secondary aggregation module, so that the semantic diversity and the importance involved in image processing can be taken into account, further improving image processing performance.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 is a block diagram of an electronic device used to implement the vision-transformer-based image processing method or the training method of the vision transformer model in accordance with an embodiment of the present disclosure.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the vision-transformer-based image processing method or the training method of the vision transformer model. For example, in some embodiments, the vision-transformer-based image processing method or the training method of the vision transformer model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the above-described vision-transformer-based image processing method or training method of the vision transformer model may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the vision-transformer-based image processing method or the training method of the vision transformer model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
Artificial intelligence is the discipline that studies how to make computers mimic certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning); it involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Cloud computing refers to a technical system in which an elastically extensible pool of shared physical or virtual resources is accessed over a network; the resources may include servers, operating systems, networks, software, applications, and storage devices, and may be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications such as artificial intelligence and blockchain, as well as for model training.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. An image processing method based on a vision converter, comprising:
determining a plurality of original image marks of an image to be processed;
splitting the plurality of primary image markers into at least two important image markers and at least two secondary image markers by a visual transducer model;
aggregating the at least two important image marks through the vision converter model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
performing image processing according to the new important image mark and the new secondary image mark through the vision converter model to obtain an image processing result;
wherein, through the vision transformer model, aggregating the at least two important image marks to obtain a new important image mark, and aggregating the at least two secondary image marks to obtain a new secondary image mark, comprises:
determining the similarity between different important image marks through an important aggregation module in the visual transducer model, and carrying out similarity fusion on the important image marks to obtain new important image marks;
clustering the at least two secondary image marks through a secondary aggregation module in the visual transducer model, and fusing the secondary image marks to obtain new secondary image marks;
wherein the aggregation parameters in the important aggregation module are different from the aggregation parameters in the secondary aggregation module.
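Claim 1 specifies that the important marks are fused according to their pairwise similarity but does not name a similarity measure or a merging rule. The following is only a minimal sketch, assuming cosine similarity and a greedy merge of the most similar remaining pair; the function name fuse_similar and the parameter num_keep are illustrative and not taken from the claims.

```python
# Sketch only: greedy similarity fusion; similarities are not re-computed after each merge.
import torch
import torch.nn.functional as F

def fuse_similar(marks: torch.Tensor, num_keep: int) -> torch.Tensor:
    """marks: (K, C) important image marks; returns (num_keep, C) fused marks."""
    sim = F.cosine_similarity(marks.unsqueeze(1), marks.unsqueeze(0), dim=-1)  # (K, K)
    sim.fill_diagonal_(-1.0)
    merged = marks.clone()
    keep = torch.ones(marks.shape[0], dtype=torch.bool)
    while int(keep.sum()) > num_keep:
        masked = sim.clone()
        masked[~keep, :] = -1.0
        masked[:, ~keep] = -1.0
        flat = int(masked.argmax())
        i, j = flat // masked.shape[1], flat % masked.shape[1]
        merged[j] = (merged[i] + merged[j]) / 2        # fuse the most similar remaining pair
        keep[i] = False
    return merged[keep]
```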
2. The method of claim 1, wherein the splitting the plurality of raw image markers into at least two important image markers and at least two secondary image markers by a visual transducer model comprises:
determining, by a multi-headed self-attention module in a vision transducer model, an attention score for the original image marker;
splitting, by a decoupling module in a visual transducer model, the plurality of primary image markers into at least two important image markers and at least two secondary image markers according to the attention scores of the primary image markers.
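Claim 2 does not fix how the attention score of each mark is read out of the multi-headed self-attention module. The sketch below follows a common convention, assumed here rather than stated in the claims: average the class token's attention row over all heads.

```python
# Sketch only: assumes a ViT-style attention map with the class token at index 0.
import torch

def token_attention_scores(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, N + 1, N + 1) softmax attention; returns (B, N) scores, one per image mark."""
    cls_to_patch = attn[:, :, 0, 1:]   # attention paid by the class token to each patch mark
    return cls_to_patch.mean(dim=1)    # average over the H heads
```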
3. The method of claim 2, wherein the splitting, by a decoupling module in a visual transducer model, the plurality of raw image markers into at least two important image markers and at least two secondary image markers according to the attention scores of the raw image markers comprises:
taking, through a decoupling module in the visual converter model, the first K original image marks with relatively high attention scores as important image marks and the remaining original image marks as secondary image marks; wherein K is determined according to the pruning retention ratio and the total number of the original image marks, and is a positive integer.
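A minimal sketch of the decoupling step in claim 3, assuming the pruning retention ratio is given as `keep_ratio`; tensor shapes and names are illustrative, not taken from the claims.

```python
# Sketch only: K = round(keep_ratio * N); the top-K marks by attention score become "important".
import torch

def split_marks(marks: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.7):
    """marks: (B, N, C) image marks; scores: (B, N) attention scores."""
    batch, num_marks, _ = marks.shape
    k = max(1, int(round(keep_ratio * num_marks)))              # K from the pruning retention ratio
    topk = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, topk, True)
    important = marks[mask].view(batch, k, -1)                   # the K highest-scoring marks
    secondary = marks[~mask].view(batch, num_marks - k, -1)      # everything else
    return important, secondary
```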
4. The method of claim 1, wherein clustering the at least two secondary image labels and fusing each secondary image label to obtain a new secondary image label by a secondary aggregation module in the visual transformer model, comprises:
determining, by a secondary aggregation module in the vision transducer model, a local density of the secondary image marker and a minimum distance from a higher local density, respectively;
and clustering the at least two secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density, and fusing the secondary image marks to obtain a new secondary image mark.
5. The method of claim 4, wherein the clustering the at least two secondary image markers according to the local density of the secondary image markers and the minimum distance from the higher local density and fusing the secondary image markers to obtain new secondary image markers comprises:
clustering the secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain secondary categories to which the secondary image marks belong;
and fusing the secondary image marks in the secondary category to obtain a new secondary image mark corresponding to the secondary category.
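Claims 4 and 5 describe the secondary aggregation in terms of each mark's local density and its minimum distance to a mark of higher local density, which matches density-peak clustering. The Gaussian kernel and the fixed number of cluster centres below are assumptions of this sketch, not requirements of the claims.

```python
# Sketch only: density-peak clustering of the secondary image marks of one image.
import torch

def density_peak_cluster(marks: torch.Tensor, num_centres: int = 4):
    """marks: (M, C) secondary marks; returns (labels (M,), centre indices (num_centres,))."""
    num_centres = min(num_centres, marks.shape[0])
    dist = torch.cdist(marks, marks)                          # pairwise distances
    density = torch.exp(-dist.pow(2)).sum(dim=1)              # local density (Gaussian kernel)
    higher = density.unsqueeze(0) > density.unsqueeze(1)      # higher[i, j]: density[j] > density[i]
    delta = dist.masked_fill(~higher, float("inf")).min(dim=1).values
    delta[density.argmax()] = dist.max()                      # densest mark: use the largest distance
    centres = (density * delta).topk(num_centres).indices     # peaks: high density and large delta
    labels = dist[:, centres].argmin(dim=1)                   # assign each mark to its nearest centre
    return labels, centres
```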
6. The method of claim 5, wherein the fusing the secondary image markers in the secondary category to obtain new secondary image markers corresponding to the secondary category comprises:
and determining the importance of each secondary image mark in each secondary category, and fusing each secondary image mark according to the importance of each secondary image mark to obtain a new secondary image mark corresponding to the secondary category.
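Claim 6 does not define the importance measure; re-using the attention scores of the secondary marks as importance weights is an assumption of this sketch.

```python
# Sketch only: importance-weighted fusion of the secondary marks inside each secondary category.
import torch

def fuse_by_importance(marks: torch.Tensor, importance: torch.Tensor,
                       labels: torch.Tensor, num_centres: int) -> torch.Tensor:
    """marks: (M, C), importance: (M,), labels: (M,); returns (num_centres, C) fused marks."""
    fused = []
    for c in range(num_centres):
        members = labels == c
        weights = torch.softmax(importance[members], dim=0)   # normalise importance within the category
        fused.append((weights.unsqueeze(-1) * marks[members]).sum(dim=0))
    return torch.stack(fused)
```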
7. The method of claim 1, wherein the visual transducer model is configured on a mobile end device; the image processing includes at least one of: image classification, image semantic segmentation, or image object detection.
8. A method of training a vision transducer model, comprising:
determining a plurality of original image marks of an image to be trained;
splitting the plurality of original image markers into at least two important image markers and at least two secondary image markers through an initial model to be trained;
aggregating, through the initial model, the at least two important image marks to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
performing image processing according to the new important image mark and the new secondary image mark through the initial model to obtain a prediction result of the image to be trained;
training the initial model according to the prediction result and the label result of the image to be trained to obtain a visual transducer model; the visual converter model is used for carrying out image processing on an image to be processed to obtain an image processing result;
wherein the aggregating the at least two important image marks to obtain a new important image mark and aggregating the at least two secondary image marks to obtain a new secondary image mark through the initial model includes:
determining the similarity between different important image marks through an important aggregation module in the initial model, and carrying out similarity fusion on the important image marks to obtain new important image marks;
clustering the at least two secondary image marks through a secondary aggregation module in the initial model, and fusing the secondary image marks to obtain new secondary image marks;
wherein the aggregation parameters in the important aggregation module are different from the aggregation parameters in the secondary aggregation module.
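A hedged sketch of one training iteration for claim 8, assuming an image-classification setting with a cross-entropy loss; `initial_model` and its call signature are hypothetical placeholders for the model to be trained.

```python
# Sketch only: fit the prediction result to the label result for one batch.
import torch
import torch.nn.functional as F

def train_step(initial_model, optimizer, images: torch.Tensor, labels: torch.Tensor) -> float:
    logits = initial_model(images)            # split, aggregate and predict inside the model
    loss = F.cross_entropy(logits, labels)    # compare the prediction result with the label result
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```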
9. An image processing apparatus based on a vision converter, comprising:
the original image marking module is used for determining a plurality of original image marks of the image to be processed;
an image mark splitting module for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through a visual transducer model;
the image mark aggregation module is used for aggregating the at least two important image marks through the visual transducer model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
the image processing module is used for performing image processing according to the new important image mark and the new secondary image mark through the vision converter model to obtain an image processing result;
wherein, the image mark aggregation module includes:
the important aggregation unit is used for determining the similarity between different important image marks through an important aggregation module in the visual transducer model and carrying out similarity fusion on the important image marks to obtain new important image marks;
the secondary aggregation unit is used for clustering the at least two secondary image marks through a secondary aggregation module in the visual transducer model, and fusing the secondary image marks to obtain new secondary image marks;
wherein the aggregation parameters in the important aggregation module are different from the aggregation parameters in the secondary aggregation module.
10. The apparatus of claim 9, wherein the image tag splitting module comprises:
an attention unit for determining an attention score of the original image marker by a multi-headed self-attention module in a visual transducer model;
the image mark splitting unit is used for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks according to the attention scores of the original image marks through a decoupling module in the visual converter model.
11. The apparatus according to claim 10, wherein the image marking splitting unit is specifically configured to:
taking, through a decoupling module in the visual converter model, the first K original image marks with relatively high attention scores as important image marks and the remaining original image marks as secondary image marks; wherein K is determined according to the pruning retention ratio and the total number of the original image marks, and is a positive integer.
12. The apparatus of claim 9, wherein the secondary aggregation unit comprises:
a density distance subunit for determining, by a secondary aggregation module in the visual transducer model, a local density of the secondary image markers and a minimum distance from a higher local density, respectively;
and the secondary aggregation subunit is used for clustering the at least two secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density, and fusing the secondary image marks to obtain new secondary image marks.
13. The apparatus of claim 12, wherein the secondary aggregation subunit is specifically configured to:
clustering the secondary image marks according to the local density of the secondary image marks and the minimum distance from the higher local density to obtain secondary categories to which the secondary image marks belong;
and fusing the secondary image marks in the secondary category to obtain a new secondary image mark corresponding to the secondary category.
14. The apparatus of claim 13, wherein the secondary aggregation subunit is specifically configured to:
and determining the importance of each secondary image mark in each secondary category, and fusing each secondary image mark according to the importance of each secondary image mark to obtain a new secondary image mark corresponding to the secondary category.
15. The apparatus of claim 9, wherein the visual transducer model is configured on a mobile end device; the image processing includes at least one of: image classification, image semantic segmentation, or image object detection.
16. A training apparatus for a vision transducer model, comprising:
the original image marking module is used for determining a plurality of original image marks of the image to be trained;
the image mark splitting module is used for splitting the plurality of original image marks into at least two important image marks and at least two secondary image marks through an initial model to be trained;
the image mark aggregation module is used for aggregating the at least two important image marks through the initial model to obtain new important image marks, and aggregating the at least two secondary image marks to obtain new secondary image marks;
the prediction module is used for performing image processing according to the new important image mark and the new secondary image mark through the initial model to obtain a prediction result of the image to be trained;
the model training module is used for training the initial model according to the prediction result and the label result of the image to be trained to obtain a visual converter model; the visual converter model is used for carrying out image processing on an image to be processed to obtain an image processing result;
wherein the image mark aggregation module is specifically configured to:
determining the similarity between different important image marks through an important aggregation module in the initial model, and carrying out similarity fusion on the important image marks to obtain new important image marks;
clustering the at least two secondary image marks through a secondary aggregation module in the initial model, and fusing the secondary image marks to obtain new secondary image marks;
wherein the aggregation parameters in the important aggregation module are different from the aggregation parameters in the secondary aggregation module.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202211400729.9A 2022-11-09 2022-11-09 Image processing method, training method and electronic equipment based on vision converter Active CN115761437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211400729.9A CN115761437B (en) 2022-11-09 2022-11-09 Image processing method, training method and electronic equipment based on vision converter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211400729.9A CN115761437B (en) 2022-11-09 2022-11-09 Image processing method, training method and electronic equipment based on vision converter

Publications (2)

Publication Number Publication Date
CN115761437A CN115761437A (en) 2023-03-07
CN115761437B true CN115761437B (en) 2024-02-06

Family

ID=85368707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211400729.9A Active CN115761437B (en) 2022-11-09 2022-11-09 Image processing method, training method and electronic equipment based on vision converter

Country Status (1)

Country Link
CN (1) CN115761437B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887610A (en) * 2021-09-29 2022-01-04 内蒙古工业大学 Pollen image classification method based on cross attention distillation transducer
CN114494791A (en) * 2022-04-06 2022-05-13 之江实验室 Attention selection-based transformer operation simplification method and device
CN114973049A (en) * 2022-01-05 2022-08-30 上海人工智能创新中心 Lightweight video classification method for unifying convolution and self attention

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240056B (en) * 2021-07-12 2022-05-17 北京百度网讯科技有限公司 Multi-mode data joint learning model training method and device
US20220012848A1 (en) * 2021-09-25 2022-01-13 Intel Corporation Methods and apparatus to perform dense prediction using transformer blocks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887610A (en) * 2021-09-29 2022-01-04 内蒙古工业大学 Pollen image classification method based on cross attention distillation transducer
CN114973049A (en) * 2022-01-05 2022-08-30 上海人工智能创新中心 Lightweight video classification method for unifying convolution and self attention
CN114494791A (en) * 2022-04-06 2022-05-13 之江实验室 Attention selection-based transformer operation simplification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Clustering by fast search and find of density peaks; Alex Rodriguez et al.; Science; Vol. 344, No. 6191; 1492–1496 *
Not all patches are what you need: Expediting vision transformers via token reorganizations; Youwei Liang et al.; arXiv preprint arXiv:2202.07800; 1-21 *

Also Published As

Publication number Publication date
CN115761437A (en) 2023-03-07

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
JP2022135991A (en) Method for training cross-modal retrieval model, device, apparatus and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN112560996A (en) User portrait recognition model training method, device, readable storage medium and product
CN114282670A (en) Neural network model compression method, device and storage medium
CN113129870A (en) Training method, device, equipment and storage medium of speech recognition model
CN113642583A (en) Deep learning model training method for text detection and text detection method
CN115829058A (en) Training sample processing method, cross-modal matching method, device, equipment and medium
CN112966140B (en) Field identification method, field identification device, electronic device, storage medium and program product
CN113947700A (en) Model determination method and device, electronic equipment and memory
CN115761437B (en) Image processing method, training method and electronic equipment based on vision converter
CN115457329B (en) Training method of image classification model, image classification method and device
CN114417856B (en) Text sparse coding method and device and electronic equipment
CN115879004A (en) Target model training method, apparatus, electronic device, medium, and program product
CN115082598A (en) Text image generation method, text image training method, text image processing method and electronic equipment
CN113326885A (en) Method and device for training classification model and data classification
CN116416500B (en) Image recognition model training method, image recognition device and electronic equipment
CN116611477B (en) Training method, device, equipment and medium for data pruning method and sequence model
CN115482422B (en) Training method of deep learning model, image processing method and device
CN112507712B (en) Method and device for establishing slot identification model and slot identification
CN116946159A (en) Motion trail prediction method, device, equipment and medium based on dynamic local map
CN117312849A (en) Training method and device for document format detection model and electronic equipment
CN116127075A (en) Text classification method, device, equipment and storage medium
CN116884025A (en) Table structure identification and model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant