CN113177562B - Vector determination method and device for merging context information based on self-attention mechanism - Google Patents
- Publication number
- CN113177562B (application CN202110488969.8A)
- Authority
- CN
- China
- Prior art keywords: vector, first-order context, context key, convolution operation, key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/25—Fusion techniques
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
Abstract
The application discloses a method and device for determining a vector that fuses context information based on a self-attention mechanism. One embodiment of the method comprises the following steps: determining the key vector, query vector and value vector of each feature point in a feature map; performing a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points; obtaining a second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and fusing the first-order context key vector and the second-order context key vector to determine a target vector. The application thus provides a method for determining a vector fused with context information based on a self-attention mechanism, which improves the expressive power of the target vector; furthermore, a more expressive target vector can be provided for machine vision tasks, improving the accuracy with which those tasks are processed.
Description
Technical Field
The embodiments of the application relate to the field of computer technology, and in particular to a method and device for determining a vector that fuses context information based on a self-attention mechanism.
Background
Inspired by the self-attention mechanism of the natural language processing field (self-attention, as in the Transformer), neural network designs in the machine vision recognition field have gradually begun to incorporate self-attention. A conventional self-attention mechanism generally computes the attention weight for each key vector from an independent query-key pair; the attention weights are then applied to the value vectors (values) to produce the output vector.
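The conventional mechanism described above can be sketched as follows. This is a minimal NumPy illustration, not code from the patent; the dimensions, weight names and random inputs are all chosen for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # attention weights come from independent query-key pairs and are
    # then applied to the value vectors to produce the output vectors
    q, k, v = x @ w_q, x @ w_k, x @ w_v                # (n, d) each
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n, n) weights
    return weights @ v                                 # (n, d) outputs

rng = np.random.default_rng(0)
x = rng.standard_normal((9, 8))  # 9 feature points of dimension 8
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```

Each output vector is a weighted mixture of all value vectors; the weight given to key j when producing output i depends only on the isolated pair (query i, key j), which is the limitation the patent's context-fused key vectors address.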
Disclosure of Invention
The embodiments of the application provide a method and device for determining a vector that fuses context information based on a self-attention mechanism.
In a first aspect, an embodiment of the present application provides a method for determining a vector that fuses context information based on a self-attention mechanism, including: determining the key vector, query vector and value vector of each feature point in a feature map; performing a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points; obtaining a second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and fusing the first-order context key vector and the second-order context key vector to determine a target vector.
In some embodiments, obtaining the second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector includes: splicing the first-order context key vector with a target query vector to obtain a spliced vector, wherein the target query vector is the query vector of the feature point at the center position of the receptive field of the convolution operation that produced the first-order context key vector; and obtaining the second-order context key vector according to the spliced vector and the value vectors of the feature points in that receptive field.
In some embodiments, obtaining the second-order context key vector according to the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that produced the first-order context key vector includes: performing several convolution operations on the spliced vector to obtain an attention matrix; and obtaining the second-order context key vector based on a local matrix multiplication between the value vectors of the feature points in the receptive field and the attention matrix.
In some embodiments, the size of the local matrix in the local matrix multiplication operation corresponds to the preset size.
In some embodiments, the self-attention mechanism is a multi-head self-attention mechanism; the method further comprises the following steps: and determining the target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
In some embodiments, the method further comprises: replacing, in a neural network, the output vector of the convolution operation whose kernel has the preset size with the finally determined target vector, and processing a visual recognition task through the neural network.
In a second aspect, an embodiment of the present application provides a device for determining a vector that fuses context information based on a self-attention mechanism, including: a first determination unit configured to determine the key vector, query vector and value vector of each feature point in a feature map; a convolution unit configured to perform a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points; an obtaining unit configured to obtain a second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and a fusion unit configured to fuse the first-order context key vector and the second-order context key vector to determine a target vector.
In some embodiments, the obtaining unit is further configured to: splice the first-order context key vector with a target query vector to obtain a spliced vector, wherein the target query vector is the query vector of the feature point at the center position of the receptive field of the convolution operation that produced the first-order context key vector; and obtain the second-order context key vector according to the spliced vector and the value vectors of the feature points in that receptive field.
In some embodiments, the obtaining unit is further configured to: perform several convolution operations on the spliced vector to obtain an attention matrix; and obtain the second-order context key vector based on a local matrix multiplication between the value vectors of the feature points in the receptive field and the attention matrix.
In some embodiments, the size of the local matrix in the local matrix multiplication operation corresponds to the preset size.
In some embodiments, the self-attention mechanism is a multi-head self-attention mechanism; the above apparatus further comprises: and a second determining unit configured to determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
In some embodiments, the device further comprises: a processing unit configured to replace, in a neural network, the output vector of the convolution operation whose kernel has the preset size with the finally determined target vector, and to process a visual recognition task through the neural network.
In a third aspect, embodiments of the present application provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
The method and device for determining a vector that fuses context information based on a self-attention mechanism provided by the embodiments of the application determine the key vector, query vector and value vector of each feature point in the feature map; perform a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points; obtain a second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and fuse the first-order context key vector and the second-order context key vector to determine the target vector. This provides a method for determining a vector fused with context information based on a self-attention mechanism and improves the expressive power of the target vector; furthermore, a more expressive target vector can be provided for machine vision tasks, improving the accuracy with which those tasks are processed.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of vector determination based on self-attention mechanism fusion context information according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a vector determination method based on self-attention mechanism fusion context information according to the present embodiment;
FIG. 4 is a flow chart of yet another embodiment of a method of vector determination based on self-attention mechanism fusion context information according to the present application;
FIG. 5 is a flow chart of deriving a target vector according to the multi-head self-attention mechanism of the present application;
FIG. 6 is a block diagram of one embodiment of a vector determination device based on self-attention mechanism fusing context information in accordance with the present application;
FIG. 7 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein merely serve to explain the relevant invention and do not limit it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary architecture 100 to which the self-attention mechanism fusion context information based vector determination method and apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections between the terminal devices 101, 102, 103 form a topological network, and the network 104 serves as the medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The terminal devices 101, 102, 103 may be hardware devices or software that support network connection for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display and processing, including but not limited to smartphones, tablet computers, e-book readers, laptop computers and desktop computers. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background processing server that, on the feature map of an image to be recognized transmitted by a user through the terminal devices 101, 102, 103 for machine vision recognition, obtains a target vector fusing the context information of the feature points based on a self-attention mechanism. Optionally, the server may use the resulting target vector for various downstream machine vision tasks, such as target object detection and semantic segmentation. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be further noted that the vector determination method based on self-attention mechanism fusion of context information provided by the embodiments of the present application may be performed by the server, by a terminal device, or by the server and a terminal device in cooperation. Accordingly, each part (for example, each unit) of the corresponding vector determination device may be provided entirely in the server, entirely in the terminal device, or split between the server and the terminal device.

It should be understood that the numbers of terminal devices, networks and servers in fig. 1 are merely illustrative; there may be any number of each, as required by the implementation. When the electronic device on which the vector determination method runs does not need to exchange data with other electronic devices, the system architecture may include only that electronic device (e.g., a server or a terminal device).
With continued reference to fig. 2, a flow 200 of one embodiment of a method of vector determination based on self-attention mechanism fusion context information is shown, comprising the steps of:
In step 201, the key vectors, query vectors and value vectors of the feature points in the feature map are determined.
In this embodiment, an execution body (e.g., a server in fig. 1) of a vector determination method based on self-attention mechanism fusion context information may determine a key vector (key), a query vector (query), and a value vector (value) of feature points in a feature map.
The image to be identified characterized by the feature map is an image to be subjected to a machine vision identification task, and the machine vision identification task comprises, but is not limited to, image identification, object detection, semantic segmentation and the like. Accordingly, any content may be included in the image to be identified.
As an example, the execution body may obtain the key vector, the query vector, and the value vector corresponding to each feature point by applying the transformation matrix corresponding to the key vector, the query vector, and the value vector, respectively, to the feature vector of the feature point in the feature map.
As yet another example, the execution body may determine a feature vector of a feature point in the feature map as one or more vectors of a key vector, a query vector, and a value vector corresponding to the feature point. Specifically, the execution body may determine the feature vector of the feature point in the feature map as the key vector and the query vector corresponding to the feature point, and apply the transformation matrix corresponding to the value vector to the feature vector of the feature point in the feature map to obtain the value vector corresponding to the feature point.
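The two variants above can be sketched together in NumPy. This is a hypothetical illustration of step 201, not the patent's code; the channel and spatial dimensions, the weight names and the random inputs are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 4, 5, 5
feat = rng.standard_normal((C, H, W))  # feature map, one C-dim vector per point
W_k, W_q, W_v = (rng.standard_normal((C, C)) for _ in range(3))

def project(fmap, weight):
    # apply a transformation matrix (a 1x1 convolution) to the
    # feature vector at every spatial position
    return np.einsum('oc,chw->ohw', weight, fmap)

# first variant: separate transformation matrices for key, query, value
keys = project(feat, W_k)
queries = project(feat, W_q)
values = project(feat, W_v)

# second variant: reuse the feature vector itself as key and query,
# transforming only the value
keys_identity = feat
queries_identity = feat
values_only = project(feat, W_v)
```

Either way, each feature point ends up with a key, a query and a value vector of the same spatial layout as the feature map, which is what the following convolution step consumes.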
Step 202, performing a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points.

In this embodiment, the execution body may perform a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points.
The preset size may be set according to the actual situation (for example, the computational cost of the convolution and the range of context information of the feature points to be fused); for example, the preset size may be 3×3. It should be understood that the term context information originates from the natural language processing field; in the machine vision recognition field it can be understood as the feature information of the feature points located around a given feature point.
The key vectors of the feature points form a key vector matrix. For each feature point in the feature map, the execution body may perform a convolution operation with a convolution kernel of the preset size to obtain the first-order context key vector of that feature point, which fuses the context information represented by the key vectors of the feature points within the receptive field of the convolution operation. In the receptive field corresponding to a feature point, that feature point is at the center position.

Taking a 3×3 convolution kernel as an example, the execution body performs the convolution operation on the key vectors of the feature points (3×3=9 of them) covered by the key vector matrix within the receptive field of each convolution step, obtaining the first-order context key vector of the feature point at the center of that receptive field.
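Step 202 can be sketched as a plain convolution over the key map. A minimal NumPy illustration under assumed dimensions (zero padding is an assumption made so that every feature point, including border points, gets a first-order context key):

```python
import numpy as np

def first_order_context_keys(keys, kernel):
    # kernel: (C_out, C_in, k, k); zero padding keeps the spatial size,
    # so every feature point gets a context key fusing its k x k
    # receptive field, with that point at the center
    C_out, C_in, k, _ = kernel.shape
    _, H, W = keys.shape
    p = k // 2
    padded = np.pad(keys, ((0, 0), (p, p), (p, p)))
    out = np.zeros((C_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[:, i:i + k, j:j + k]  # the 3x3 = 9 key vectors
            out[:, i, j] = np.einsum('ockl,ckl->o', kernel, patch)
    return out

rng = np.random.default_rng(2)
keys = rng.standard_normal((4, 5, 5))
kernel = rng.standard_normal((4, 4, 3, 3))  # preset size 3x3
k1 = first_order_context_keys(keys, kernel)
```

The explicit loops mirror the per-receptive-field description in the text; a real implementation would use a framework convolution instead.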
Step 203, obtaining a second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector.

In this embodiment, the execution body may obtain the second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector.

Here, the query vectors and value vectors corresponding to the receptive field are those of the feature points within the receptive field.
As an example, the execution body may obtain the second-order context key vector through operations such as vector splicing and multiplication between the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced it.
In some optional implementations of this embodiment, the execution body may perform step 203 as follows:

First, the first-order context key vector and the target query vector are spliced to obtain a spliced vector.

Here, the target query vector is the query vector of the feature point at the center position of the receptive field of the convolution operation that produced the first-order context key vector.

The first-order context key vector corresponds to the feature point at the center of that receptive field, and the target query vector corresponds to the same feature point; splicing the two yields the spliced vector of that feature point. It should be understood that performing this determination for every feature point in the feature map yields a spliced vector for each feature point.
Second, the second-order context key vector is obtained according to the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that produced the first-order context key vector.

As an example, the execution body may obtain, from the spliced vector, an attention matrix representing the attention paid to the context information in the receptive field, and then multiply the attention matrix with the value vectors of the feature points in that receptive field to obtain the second-order context key vector.
In some optional implementations of this embodiment, the execution body may perform the second step as follows:

First, several convolution operations are performed on the spliced vector to obtain an attention matrix.

As an example, the execution body may obtain the attention matrix through two convolution operations with 1×1 kernels, where the first of the two convolutions is followed by an activation function (e.g., ReLU) and the second is not.

Then, the second-order context key vector is obtained based on a local matrix multiplication between the value vectors of the feature points in the receptive field and the attention matrix.

The size of the local matrix in the local matrix multiplication may be set according to the actual situation and is not limited here.
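The splice, the two 1×1 convolutions and the local matrix multiplication can be sketched together in NumPy. This is a hedged illustration, not the patent's implementation: the channel sizes, the attention-matrix layout (k*k weights per feature point), the absence of any normalization and the weight names are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W, k = 4, 5, 5, 3
k1 = rng.standard_normal((C, H, W))       # first-order context keys
queries = rng.standard_normal((C, H, W))  # query vectors
values = rng.standard_normal((C, H, W))   # value vectors
W1 = rng.standard_normal((2 * C, 2 * C))  # first 1x1 conv (with ReLU)
W2 = rng.standard_normal((k * k, 2 * C))  # second 1x1 conv (no activation)

# splice the first-order key with the query of the center feature point
spliced = np.concatenate([k1, queries], axis=0)                # (2C, H, W)
hidden = np.maximum(np.einsum('oc,chw->ohw', W1, spliced), 0)  # ReLU
attn = np.einsum('oc,chw->ohw', W2, hidden)                    # (k*k, H, W)

# local matrix multiplication: for every feature point, the k x k value
# vectors of its receptive field are weighted by its attention entries
p = k // 2
padded = np.pad(values, ((0, 0), (p, p), (p, p)))
k2 = np.zeros_like(values)
for i in range(H):
    for j in range(W):
        patch = padded[:, i:i + k, j:j + k].reshape(C, k * k)
        k2[:, i, j] = patch @ attn[:, i, j]  # (C, k*k) @ (k*k,) -> (C,)
```

Note how the local matrix size (C × k*k) corresponds to the preset kernel size, matching the implementation in which the size of the local matrix corresponds to the preset size.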
In some optional implementations of this embodiment, the size of the local matrix in the local matrix multiplication operation corresponds to the preset size.
In step 204, the first-order context key vector and the second-order context key vector are fused to determine the target vector.
In this embodiment, the execution body may fuse the first-order context key vector and the second-order context key vector to determine the target vector.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the vector determination method based on self-attention mechanism fusion of context information according to this embodiment. In the application scenario of fig. 3, the server 301 first acquires the feature map 303 from the terminal device 302. The server 301 then determines the key vectors, query vectors and value vectors of the feature points in the feature map 303. Next, it convolves the key vectors of the feature points in the feature map 303 with a convolution kernel of the preset size 3×3 to obtain first-order context key vectors 304 fusing the context information of the feature points. It then obtains a second-order context key vector 307 according to the first-order context key vector 304 and the query vectors 305 and value vectors 306 corresponding to the receptive field of that convolution operation; finally, the first-order context key vector 304 and the second-order context key vector 307 are fused to determine the target vector 308.
The method provided by the embodiments of the application determines the key vector, query vector and value vector of each feature point in the feature map; performs a convolution operation with a convolution kernel of a preset size on the key vectors of the feature points in the feature map to obtain first-order context key vectors that fuse the context information of the feature points; obtains a second-order context key vector according to the first-order context key vector and the query vectors and value vectors corresponding to the receptive field of the convolution operation that produced the first-order context key vector; and fuses the first-order context key vector and the second-order context key vector to determine the target vector. This provides a method for determining a vector fused with context information based on a self-attention mechanism and improves the expressive power of the target vector; furthermore, a more expressive target vector can be provided for machine vision tasks, improving the accuracy with which those tasks are processed.
In some alternative implementations of the present embodiment, the self-attention mechanism is a multi-head self-attention mechanism. The execution body may further determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
As an example, the target vectors corresponding to the heads are spliced and then linearly transformed to obtain the target vector corresponding to the multi-head self-attention mechanism.
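The splice-then-transform step can be sketched briefly. An illustrative NumPy example: the number of heads, the per-head dimension and the output weight W_o are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
heads, d, H, W = 2, 4, 5, 5
# per-head target vectors, one (d, H, W) map per head
per_head = [rng.standard_normal((d, H, W)) for _ in range(heads)]
W_o = rng.standard_normal((heads * d, heads * d))  # output linear transform

spliced = np.concatenate(per_head, axis=0)      # (heads*d, H, W)
target = np.einsum('oc,chw->ohw', W_o, spliced)  # multi-head target vector
```

Each feature point thus gets one target vector combining the contributions of all heads.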
In some optional implementations of this embodiment, the execution body may further replace the output vector of the convolution operation that uses a convolution kernel of a preset size in the neural network with the finally determined target vector, and process the visual recognition task through the neural network.
In this implementation, replacing the output vector of the convolution operation that uses a convolution kernel of a preset size in the neural network with the finally determined target vector allows the neural network to process the visual recognition task using a context-fused target vector, improving the accuracy of visual recognition.
Specifically, the execution body may replace the convolution operation with a convolution kernel of a preset size in a neural network for processing a machine vision task with the network structure that obtains the target vector from the key vectors, query vectors and value vectors of the feature points.
With continued reference to fig. 4, there is shown a schematic flow 400 of one embodiment of a method of vector determination based on self-attention mechanism fusion context information according to the present application, comprising the steps of:
in step 401, key vectors, query vectors and value vectors of feature points in the feature map are determined.
And step 402, performing a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain first-order context key vectors that fuse the context information of the feature points.
And step 403, splicing the first-order context key vector and the target query vector to obtain a spliced vector.
The target query vector characterizes the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector.
And step 404, performing convolution operation on the spliced vector for a plurality of times to obtain an attention matrix.
Step 405, obtaining a second-order context key vector based on local matrix multiplication operation between the value vector of the feature point in the receptive field and the attention matrix.
In step 406, the first order context key vector and the second order context key vector are fused to determine the target vector.
Step 407, determining a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
As an example, as shown in fig. 5, a flow diagram 500 of a multi-head self-attention mechanism obtaining the target vector is shown. The feature vectors of the feature points in the feature map X are taken directly as the key vectors and query vectors of those feature points, and the value vector of each feature point is obtained by applying the transformation matrix corresponding to the value vector to its feature vector.
First, a K×K convolution operation is applied to the feature map X of size H×W×C to obtain the feature map corresponding to the first-order context key vectors, of size H×W×C. The feature map corresponding to the first-order context key vectors and the feature map corresponding to the query vectors are then spliced (Concat) to obtain a spliced feature map of size H×W×2C. Next, a 1×1 convolution operation θ is applied to the spliced feature map to obtain a feature map of size H×W×D, and a further 1×1 convolution operation is applied to obtain a feature map of size H×W×(K×K×C_H), where C_H is the number of heads of the multi-head self-attention mechanism. A local matrix multiplication operation is then performed between the resulting feature map of size H×W×(K×K×C_H) and the feature map corresponding to the value vectors; the operation result is taken as the feature map corresponding to the second-order context key vectors, which is fused (Fusion) with the feature map corresponding to the first-order context key vectors to obtain the feature map Y corresponding to the target vectors.
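Putting the stages of fig. 5 together, a single-head, loop-based NumPy sketch of the whole flow follows. All shapes, the random weight matrices, the ReLU between the two 1×1 convolutions, and the use of element-wise addition for the Fusion step are illustrative assumptions, not the patented implementation; note that a 1×1 convolution over an (H, W, C) map reduces to a matrix multiplication on the channel axis.

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, C, K, D = 6, 6, 4, 3, 8
X = rng.standard_normal((H, W, C))                 # feature map X (H x W x C)
keys = queries = X                                 # keys/queries = feature vectors
values = X @ rng.standard_normal((C, C))           # value transformation matrix

def conv_same(x, w):                               # naive 'same'-padded KxK conv
    k, p = w.shape[0], w.shape[0] // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    return np.array([[np.tensordot(xp[i:i + k, j:j + k], w, axes=3)
                      for j in range(x.shape[1])] for i in range(x.shape[0])])

# 1) KxK convolution over the keys -> first-order context keys (H x W x C)
first_order = conv_same(keys, rng.standard_normal((K, K, C, C)) / (K * K))

# 2) splice (Concat) first-order keys with queries -> H x W x 2C
spliced = np.concatenate([first_order, queries], axis=-1)

# 3) two 1x1 convolutions (channel matmuls) -> attention map of H x W x (K*K)
hidden = np.maximum(spliced @ rng.standard_normal((2 * C, D)), 0.0)  # theta + ReLU
attn = hidden @ rng.standard_normal((D, K * K))

# 4) local matrix multiplication with the values -> second-order context keys
p = K // 2
vp = np.pad(values, ((p, p), (p, p), (0, 0)))
second_order = np.empty_like(values)
for i in range(H):
    for j in range(W):
        second_order[i, j] = attn[i, j] @ vp[i:i + K, j:j + K].reshape(K * K, C)

Y = first_order + second_order                     # 5) Fusion -> target vectors
print(Y.shape)                                     # (6, 6, 4)
```

The sketch mirrors the figure stage by stage: the spliced map has 2C channels, the attention map carries K×K weights per position for a single head, and the fused output Y has the same spatial size as X.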
And step 408, replacing the output vector of the convolution operation with the convolution kernel of a preset size in the neural network with the finally determined target vector, and processing the visual recognition task through the neural network.
As can be seen from this embodiment, compared with the embodiment corresponding to fig. 2, the process 400 of the vector determination method for fusing context information based on a self-attention mechanism in this embodiment specifically illustrates the process of determining the target vector of the multi-head self-attention mechanism and the process of replacing the output vector of the convolution operation with a convolution kernel of a preset size in the neural network with the finally determined target vector, which improves the accuracy of the results obtained when the neural network processes the visual recognition task.
With continued reference to fig. 6, as an implementation of the methods shown in the foregoing figures, the present application provides an embodiment of a vector determination apparatus for fusing context information based on a self-attention mechanism. The apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied in various electronic devices.
As shown in fig. 6, the vector determination apparatus for fusing context information based on a self-attention mechanism includes: a first determining unit 601 configured to determine a key vector, a query vector, and a value vector for the feature points in a feature map; a convolution unit 602 configured to perform a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain first-order context key vectors that fuse the context information of the feature points; an obtaining unit 603 configured to obtain a second-order context key vector from the first-order context key vector and the query vector and value vector corresponding to the receptive field of the convolution operation that obtains the first-order context key vector; and a fusion unit 604 configured to fuse the first-order context key vector and the second-order context key vector to determine the target vector.
In some embodiments, the deriving unit 603 is further configured to: splice the first-order context key vector and a target query vector to obtain a spliced vector, wherein the target query vector characterizes the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector; and obtain the second-order context key vector from the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector.
In some embodiments, the deriving unit 603 is further configured to: performing convolution operation on the spliced vectors for a plurality of times to obtain an attention matrix; and obtaining a second-order context key vector based on local matrix multiplication operation between the value vector of the characteristic points in the receptive field and the attention matrix.
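The local matrix multiplication between the attention matrix and the value vectors can be read as a per-position weighted sum: at each feature point, the attention matrix supplies one weight per entry of the K×K receptive field, and those weights are applied to the value vectors inside that field. The NumPy sketch below illustrates this under assumed shapes (a 6×6 map, 4-channel values, a 3×3 field); the random attention weights stand in for the output of the preceding convolutions.

```python
import numpy as np

def local_matmul(values, attn, k):
    """values: (H, W, C) value vectors; attn: (H, W, k*k) attention weights,
    one weight per entry of each point's k x k receptive field."""
    p = k // 2
    vp = np.pad(values, ((p, p), (p, p), (0, 0)))
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            patch = vp[i:i + k, j:j + k].reshape(k * k, -1)  # (k*k, C)
            out[i, j] = attn[i, j] @ patch                   # weighted sum of values
    return out

rng = np.random.default_rng(2)
values = rng.standard_normal((6, 6, 4))
attn = rng.standard_normal((6, 6, 9))            # 3x3 field -> 9 weights per point
second_order_keys = local_matmul(values, attn, 3)
print(second_order_keys.shape)                   # (6, 6, 4)
```

Because each position has its own weight vector, this is a local (per-position) matrix multiplication rather than a single shared kernel, which is what distinguishes it from an ordinary convolution.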
In some embodiments, the size of the local matrix in the local matrix multiplication operation corresponds to the preset size.
In some embodiments, the self-attention mechanism is a multi-head self-attention mechanism; the above apparatus further comprises: a second determining unit (not shown in the figure) is configured to determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
In some embodiments, the apparatus further comprises: a processing unit (not shown in the figure) configured to replace the output vector of the convolution operation with the convolution kernel of a preset size in the neural network with the finally determined target vector, and process the visual recognition task through the neural network.
In the present embodiment, the first determining unit of the vector determination apparatus for fusing context information based on a self-attention mechanism determines the key vector, the query vector and the value vector of the feature points in the feature map; the convolution unit performs a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain first-order context key vectors that fuse the context information of the feature points; the obtaining unit obtains a second-order context key vector from the first-order context key vector and the query vector and value vector corresponding to the receptive field of the convolution operation that obtains the first-order context key vector; and the fusion unit fuses the first-order context key vector and the second-order context key vector to determine the target vector. An apparatus for determining a vector that fuses context information based on a self-attention mechanism is thereby provided, which improves the expressive power of the target vector; furthermore, a target vector with stronger expressive power can be provided for machine vision tasks, improving the accuracy with which such tasks are processed.
Referring now to FIG. 7, there is illustrated a schematic diagram of a computer system 700 suitable for use in implementing the apparatus of embodiments of the present application (e.g., apparatus 101, 102, 103, 105 illustrated in FIG. 1). The apparatus shown in fig. 7 is merely an example, and should not be construed as limiting the functionality and scope of use of the embodiments herein.
As shown in fig. 7, the computer system 700 includes a processor (e.g., CPU, central processing unit) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the system 700 are also stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 710 as necessary, so that a computer program read therefrom is installed into the storage section 708 as necessary.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. The above-described functions defined in the method of the present application are performed when the computer program is executed by the processor 701.
It should be noted that the computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computer, partly on the client computer, as a stand-alone software package, partly on the client computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the client computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units may also be provided in a processor, for example, described as: a processor includes a first determining unit, a convolution unit, an obtaining unit, and a fusion unit. The names of these units do not constitute a limitation on the units themselves in some cases; for example, the convolution unit may also be described as "a unit that performs a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain first-order context key vectors that fuse the context information of the feature points".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the computer device to: determine key vectors, query vectors and value vectors of the feature points in the feature map; perform a convolution operation on the key vectors of the feature points in the feature map with a convolution kernel of a preset size to obtain first-order context key vectors that fuse the context information of the feature points; obtain a second-order context key vector from the first-order context key vector and the query vector and value vector corresponding to the receptive field of the convolution operation that obtains the first-order context key vector; and fuse the first-order context key vector and the second-order context key vector to determine a target vector.
The foregoing description is only of the preferred embodiments of the present application and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, solutions formed by mutually replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.
Claims (14)
1. A method of vector determination based on self-attention mechanism fusion context information, comprising:
determining key vectors, query vectors and value vectors of feature points in a feature map, wherein an image to be identified, which is characterized by the feature map, is an image to be subjected to a machine vision identification task, and the machine vision identification task comprises image identification, object detection and semantic segmentation;
performing a convolution operation on the key vectors of the feature points in the feature map with convolution kernels of a preset size to obtain first-order context key vectors that fuse the context information of the feature points;
splicing and multiplying the first-order context key vector, the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector, and the value vectors of the feature points in the receptive field, to obtain a second-order context key vector;
and fusing the first-order context key vector and the second-order context key vector to determine a target vector.
2. The method of claim 1, wherein the obtaining a second-order context key vector from the first-order context key vector and the query vector and value vector corresponding to the receptive field of the convolution operation that obtains the first-order context key vector comprises:
splicing the first-order context key vector and a target query vector to obtain a spliced vector, wherein the target query vector characterizes the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector;
and obtaining the second-order context key vector according to the spliced vector and the value vector of the characteristic point in the receptive field of the convolution operation for obtaining the first-order context key vector.
3. The method of claim 2, wherein the obtaining the second-order context key vector from the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector comprises:
performing convolution operation on the spliced vectors for a plurality of times to obtain an attention matrix;
and obtaining the second-order context key vector based on local matrix multiplication operation between the value vector of the characteristic points in the receptive field and the attention matrix.
4. The method according to claim 3, wherein the size corresponding to the local matrix in the local matrix multiplication operation is the same as the preset size.
5. The method of claim 1, wherein the self-attention mechanism is a multi-head self-attention mechanism; and
further comprises:
and determining the target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
6. The method of any of claims 1-5, further comprising:
and replacing the output vector of the convolution operation with a convolution kernel of the preset size in the neural network with the finally determined target vector, and processing a visual recognition task through the neural network.
7. A vector determination apparatus that fuses context information based on a self-attention mechanism, comprising:
a first determining unit configured to determine a key vector, a query vector and a value vector of feature points in a feature map, wherein an image to be identified characterized by the feature map is an image to be subjected to a machine vision identification task, and the machine vision identification task comprises image identification, object detection and semantic segmentation;
the convolution unit is configured to perform a convolution operation on the key vectors of the feature points in the feature map with convolution kernels of a preset size to obtain first-order context key vectors that fuse the context information of the feature points;
the obtaining unit is configured to splice and multiply the first-order context key vector, the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector, and the value vectors of the feature points in the receptive field, to obtain a second-order context key vector;
and a fusion unit configured to fuse the first-order context key vector and the second-order context key vector, and determine a target vector.
8. The apparatus of claim 7, wherein the deriving unit is further configured to:
splicing the first-order context key vector and a target query vector to obtain a spliced vector, wherein the target query vector characterizes the query vector of the feature point at the central position in the receptive field of the convolution operation that obtains the first-order context key vector; and obtaining the second-order context key vector from the spliced vector and the value vectors of the feature points in the receptive field of the convolution operation that obtains the first-order context key vector.
9. The apparatus of claim 8, wherein the deriving unit is further configured to:
performing convolution operation on the spliced vectors for a plurality of times to obtain an attention matrix; and obtaining the second-order context key vector based on local matrix multiplication operation between the value vector of the characteristic points in the receptive field and the attention matrix.
10. The apparatus of claim 9, wherein a size corresponding to a local matrix in the local matrix multiplication operation is the same as the preset size.
11. The apparatus of claim 7, wherein the self-attention mechanism is a multi-head self-attention mechanism; and
further comprises:
and a second determining unit configured to determine a target vector corresponding to the multi-head self-attention mechanism according to the target vector corresponding to each head in the multi-head self-attention mechanism.
12. The apparatus of any of claims 7-11, further comprising:
and the processing unit is configured to replace the output vector of the convolution operation with a convolution kernel of the preset size in the neural network with the finally determined target vector, and process the visual recognition task through the neural network.
13. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-6.
14. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110488969.8A CN113177562B (en) | 2021-04-29 | 2021-04-29 | Vector determination method and device for merging context information based on self-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110488969.8A CN113177562B (en) | 2021-04-29 | 2021-04-29 | Vector determination method and device for merging context information based on self-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177562A CN113177562A (en) | 2021-07-27 |
CN113177562B true CN113177562B (en) | 2024-02-06 |
Family
ID=76928837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110488969.8A Active CN113177562B (en) | 2021-04-29 | 2021-04-29 | Vector determination method and device for merging context information based on self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177562B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018226960A1 (en) * | 2017-06-08 | 2018-12-13 | Facebook, Inc. | Key-value memory networks |
CN109948699A (en) * | 2019-03-19 | 2019-06-28 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating characteristic pattern |
CN110706302A (en) * | 2019-10-11 | 2020-01-17 | 中山市易嘀科技有限公司 | System and method for text synthesis image |
WO2020103721A1 (en) * | 2018-11-19 | 2020-05-28 | 腾讯科技(深圳)有限公司 | Information processing method and apparatus, and storage medium |
CN111259142A (en) * | 2020-01-14 | 2020-06-09 | 华南师范大学 | Specific target emotion classification method based on attention coding and graph convolution network |
CN111709902A (en) * | 2020-05-21 | 2020-09-25 | 江南大学 | Infrared and visible light image fusion method based on self-attention mechanism |
WO2020237188A1 (en) * | 2019-05-23 | 2020-11-26 | Google Llc | Fully attentional computer vision |
CN112232058A (en) * | 2020-10-15 | 2021-01-15 | 济南大学 | False news identification method and system based on deep learning three-layer semantic extraction framework |
- 2021-04-29: application CN202110488969.8A filed in China; granted as patent CN113177562B (status: active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018226960A1 (en) * | 2017-06-08 | 2018-12-13 | Facebook, Inc. | Key-value memory networks |
WO2020103721A1 (en) * | 2018-11-19 | 2020-05-28 | 腾讯科技(深圳)有限公司 | Information processing method and apparatus, and storage medium |
CN109948699A (en) * | 2019-03-19 | 2019-06-28 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating characteristic pattern |
WO2020237188A1 (en) * | 2019-05-23 | 2020-11-26 | Google Llc | Fully attentional computer vision |
CN110706302A (en) * | 2019-10-11 | 2020-01-17 | 中山市易嘀科技有限公司 | System and method for text synthesis image |
CN111259142A (en) * | 2020-01-14 | 2020-06-09 | 华南师范大学 | Specific target emotion classification method based on attention coding and graph convolution network |
CN111709902A (en) * | 2020-05-21 | 2020-09-25 | 江南大学 | Infrared and visible light image fusion method based on self-attention mechanism |
CN112232058A (en) * | 2020-10-15 | 2021-01-15 | 济南大学 | False news identification method and system based on deep learning three-layer semantic extraction framework |
Non-Patent Citations (1)
Title |
---|
Application of cross-lingual multi-task learning deep neural networks to Mongolian-Chinese machine translation; Zhang Zhen et al.; Computer Applications and Software; Vol. 38 (No. 01); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113177562A (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11640528B2 (en) | Method, electronic device and computer readable medium for information processing for accelerating neural network training | |
CN108520220B (en) | Model generation method and device | |
CN108427939B (en) | Model generation method and device | |
CN109359194B (en) | Method and apparatus for predicting information categories | |
CN111382228B (en) | Method and device for outputting information | |
CN111061881A (en) | Text classification method, equipment and storage medium | |
CN113378855A (en) | Method for processing multitask, related device and computer program product | |
CN115578570A (en) | Image processing method, device, readable medium and electronic equipment | |
CN112418249A (en) | Mask image generation method and device, electronic equipment and computer readable medium | |
US20230367972A1 (en) | Method and apparatus for processing model data, electronic device, and computer readable medium | |
CN113760438A (en) | Page display method and device for webpage application | |
CN113177562B (en) | Vector determination method and device for merging context information based on self-attention mechanism | |
CN109840072B (en) | Information processing method and device | |
CN112308205A (en) | Model improvement method and device based on pre-training model | |
CN108664610B (en) | Method and apparatus for processing data | |
CN112990046B (en) | Differential information acquisition method, related device and computer program product | |
CN114021565A (en) | Training method and device for named entity recognition model | |
CN109857838B (en) | Method and apparatus for generating information | |
CN111797263A (en) | Image label generation method, device, equipment and computer readable medium | |
CN111709784A (en) | Method, apparatus, device and medium for generating user retention time | |
CN111310858A (en) | Method and apparatus for generating information | |
CN116974684B (en) | Map page layout method, map page layout device, electronic equipment and computer readable medium | |
CN115098647B (en) | Feature vector generation method and device for text representation and electronic equipment | |
CN110858240A (en) | Front-end module loading method and device | |
CN114926830B (en) | Screen image recognition method, apparatus, device and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176 Applicant after: Jingdong Technology Holding Co.,Ltd. Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176 Applicant before: Jingdong Digital Technology Holding Co.,Ltd. |
|
GR01 | Patent grant | ||