CN115880545A - Feature extraction model training and feature extraction method and device - Google Patents

Feature extraction model training and feature extraction method and device

Info

Publication number
CN115880545A
Authority
CN
China
Prior art keywords
feature extraction
local features
feature
features
global
Prior art date
Legal status
Pending
Application number
CN202211617205.5A
Other languages
Chinese (zh)
Inventor
王发发
Current Assignee
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202211617205.5A
Publication of CN115880545A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

An embodiment of the invention relates to a feature extraction model training and feature extraction method and device. The method includes: inputting a training picture set into an initial model, where a local feature extraction layer extracts a number of local features from each training picture, a feature enhancement layer performs attention-based feature enhancement on the local features to obtain enhanced local features, a global feature extraction layer aggregates the enhanced local features into a global feature, and a classifier derives the predicted classification parameters of the training picture from the global feature; adjusting the model parameters of the initial model according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model; and constructing, from the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer of the classification model, a feature extraction model for extracting global features. This realizes an end-to-end global feature extraction scheme and reduces algorithm complexity.

Description

Feature extraction model training and feature extraction method and device
Field of the Invention
Embodiments of the invention relate to the field of feature engineering, and in particular to a feature extraction model training method and device and a feature extraction method and device.
Background
With the development of computer technology and the popularization of imaging devices, the internet has accumulated a massive volume of image data. Given a query image, efficiently and accurately retrieving images with similar content from these large-scale image collections has become an urgent need in many applications. For example, efficiently and accurately retrieving image frames that contain a specific landmark, street view, clothing item, or product from a video stream is a pressing need in applications such as travel promotion and merchandise promotion.
To this end, image retrieval algorithms have been proposed in the prior art. An image retrieval algorithm extracts image features, analyzes and indexes the images based on those features, and builds an index database for retrieval.
However, current image retrieval algorithms usually perform feature extraction and feature aggregation separately when extracting image features, which raises algorithm complexity and lowers feature extraction efficiency.
Disclosure of Invention
In view of this, to solve the above technical problems, embodiments of the present invention provide a feature extraction model training method and device and a feature extraction method and device.
In a first aspect, an embodiment of the present invention provides a feature extraction model training method, including: acquiring a training picture set;
inputting the training picture set into an initial model, where a local feature extraction layer in the initial model extracts a number of local features from a training picture and outputs them to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model, and the classifier obtains the predicted classification parameters of the training picture according to the global feature;
adjusting model parameters of the initial model according to the label parameters of the training pictures and the prediction classification parameters to obtain a trained classification model;
and constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer and the global feature extraction layer in the classification model, wherein the feature extraction model is used for extracting global features of the picture to be identified.
In a possible embodiment, the performing feature enhancement on the local features based on the attention mechanism to obtain enhanced local features includes:
calculating the local features by using a self-attention mechanism to obtain a visual marker sequence;
and performing feature enhancement on the local features by using the visual marker sequence to obtain the enhanced local features.
In a possible embodiment, the calculating the local features by using a self-attention mechanism to obtain a visual marker sequence includes:
re-encoding the local features by using a channel self-attention mechanism;
and grouping the re-encoded local features by using a spatial self-attention mechanism, and determining each group as a visual marker to obtain the visual marker sequence.
In a possible embodiment, the performing feature enhancement on the local features by using the visual marker sequence to obtain enhanced local features includes:
performing feature enhancement on each visual marker in the visual marker sequence by using a self-attention mechanism to obtain an enhanced visual marker sequence;
and performing feature enhancement on the local features by using the enhanced visual marker sequence based on a cross-attention mechanism to obtain the enhanced local features.
In a possible embodiment, the performing feature aggregation based on the enhanced local features to obtain a global feature includes:
performing feature aggregation on the enhanced local features together with the original local features to obtain the global feature.
In a second aspect, an embodiment of the present invention provides a feature extraction method, including:
inputting the picture to be recognized into the feature extraction model trained according to the method in any one of the first aspect, and obtaining the global features of the picture to be recognized.
In a possible embodiment, the inputting the picture to be recognized into the feature extraction model trained according to any method of the first aspect to obtain the global features of the picture to be recognized includes:
inputting the picture to be recognized into the feature extraction model, where a local feature extraction layer in the feature extraction model extracts a number of local features from the picture to be recognized and outputs them to a feature enhancement layer in the feature extraction model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the feature extraction model, and the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain the global feature.
In a possible embodiment, the performing feature enhancement on the local features based on the attention mechanism to obtain enhanced local features includes:
calculating the local features by using a self-attention mechanism to obtain a visual marker sequence;
and performing feature enhancement on the local features by using the visual marker sequence to obtain the enhanced local features.
In a possible embodiment, the calculating the local features by using a self-attention mechanism to obtain a visual marker sequence includes:
re-encoding the local features by using a channel self-attention mechanism;
and grouping the re-encoded local features by using a spatial self-attention mechanism, and determining each group as a visual marker to obtain the visual marker sequence.
In a possible embodiment, the performing feature enhancement on the local features by using the visual marker sequence to obtain enhanced local features includes:
performing feature enhancement on each visual marker in the visual marker sequence by using a self-attention mechanism to obtain an enhanced visual marker sequence;
and performing feature enhancement on the local features by using the enhanced visual marker sequence based on a cross-attention mechanism to obtain the enhanced local features.
In a possible embodiment, the performing feature aggregation based on the several enhanced local features to obtain a global feature includes:
performing feature aggregation on the plurality of enhanced local features and the plurality of local features to obtain a global feature.
In a third aspect, an embodiment of the present invention provides a feature extraction model training apparatus, including:
the acquisition module is used for acquiring a training picture set;
a model training module, configured to input the training picture set into an initial model, where a local feature extraction layer in the initial model extracts a number of local features from a training picture and outputs them to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model, and the classifier obtains the predicted classification parameters of the training picture according to the global feature; and further configured to adjust the model parameters of the initial model according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model;
and the feature extraction model construction module is used for constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer and the global feature extraction layer in the classification model, and the feature extraction model is used for extracting global features of the picture to be identified.
In a fourth aspect, an embodiment of the present invention provides a feature extraction apparatus, including:
an extraction module, configured to input the picture to be recognized into the feature extraction model trained according to any method of the first aspect, to obtain the global features of the picture to be recognized.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the processor being configured to execute a model training program stored in the memory to implement the feature extraction model training method of any one of the first aspects, or to execute a feature extraction program stored in the memory to implement the feature extraction method of any one of the second aspects.
In a sixth aspect, an embodiment of the present invention provides a storage medium storing one or more programs, which are executable by one or more processors to implement the feature extraction model training method according to any one of the first aspects or the feature extraction method according to any one of the second aspects.
According to the technical solution provided by the embodiments of the invention, a training picture set is acquired and input into an initial model; a local feature extraction layer in the initial model extracts a number of local features from each training picture and outputs them to a feature enhancement layer in the initial model; the feature enhancement layer performs attention-based feature enhancement to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model; the global feature extraction layer aggregates the enhanced local features into a global feature and outputs it to a classifier in the initial model; and the classifier obtains the predicted classification parameters of the training picture from the global feature. The model parameters of the initial model are adjusted according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model, and a feature extraction model for extracting the global features of a picture to be recognized is constructed from the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer of the classification model; in application, the global feature is obtained simply by inputting the picture to be recognized into the feature extraction model. This realizes an end-to-end global feature extraction scheme, which favors effective alignment of local and global features and performs feature extraction and feature aggregation jointly, improving feature extraction efficiency and reducing algorithm complexity compared with prior-art algorithms that perform the two separately. Meanwhile, enhancing the local features with an attention mechanism suppresses irrelevant background interference, weakens background noise, and reduces the influence of local occlusion on the features, further improving the characterization capability and richness of the features.
Drawings
Fig. 1 is a flowchart of an embodiment of a feature extraction model training method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of the initial model;
fig. 3 is a flowchart of an embodiment of a feature extraction method according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a process for performing feature enhancement on a plurality of local features based on an attention mechanism according to an embodiment of the present invention;
FIG. 5 is a block diagram of an embodiment of a model training apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of an embodiment of a feature extraction apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The feature extraction model training method and the feature extraction method provided by the present invention are further explained in the following with specific embodiments in conjunction with the drawings, which are not intended to limit the embodiments of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a feature extraction model training method according to an embodiment of the present invention is provided. As shown in fig. 1, the method comprises the following steps:
Step 101, a training picture set is obtained.
The training picture set includes a number of training pictures, and each training picture has corresponding label parameters that represent the classification information of the object contained in that training picture.
The object may be a landmark, a street view, a dress, a commodity, etc., which may depend on a specific application scenario. For example, in a landmark retrieval scene, landmarks may be included in the training pictures; for another example, in a clothing retrieval scenario, clothing may be included in the training picture.
It can be understood that, whatever the application scenario, the objects contained in the training pictures of the training picture set are not all identical. For example, in a landmark retrieval scene, one part of the training pictures may contain Tiananmen, another part the Monument to the People's Heroes, and yet another part the Great Hall of the People.
In one embodiment, the execution subject of the embodiment of the present invention may obtain a number of training pictures from an open-source picture database, after which a user sets corresponding label parameters for each training picture, yielding the training picture set.
In another embodiment, the execution subject of the embodiment of the present invention may extract a number of video frames from one or more video streams as training pictures, after which the user sets corresponding label parameters for each training picture, yielding the training picture set.
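By way of illustration only, the following sketch shows one way such frame sampling might be done; the file name, class id, and `sample_every` parameter are hypothetical and not taken from the patent.

```python
import cv2  # OpenCV, used here only for video decoding

def frames_to_training_pictures(video_path, label, sample_every=30):
    """Sample every Nth frame of a video stream as a training picture and
    pair it with the user-supplied label parameter (an integer class id)."""
    pictures = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            pictures.append((frame, label))  # (training picture, label parameter)
        idx += 1
    cap.release()
    return pictures

# e.g. every sampled frame of this hypothetical clip shows the same landmark
training_set = frames_to_training_pictures("landmark_clip.mp4", label=3)
```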
Step 102, inputting the training picture set into the initial model, where a local feature extraction layer in the initial model extracts a number of local features from the training picture and outputs them to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model, and the classifier obtains the predicted classification parameters of the training picture according to the global feature.
Step 103, adjusting the model parameters of the initial model according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model.
Step 104, constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer in the classification model, where the feature extraction model is used to extract global features of a picture to be recognized.
The steps 102 to 104 are described in the following with reference to the schematic structural diagram of the initial model illustrated in fig. 2:
In one embodiment, the initial model uses a convolutional neural network, such as ResNeSt-101, as the backbone network. Further, as shown in fig. 2, the initial model comprises, in order from model input to output, a local feature extraction layer, a feature enhancement layer, a global feature extraction layer, and a classifier. Of course, the initial model may include other parts besides these four, which are not specifically limited here.
The local feature extraction layer is used to extract a number of local features from an input picture and output them to the feature enhancement layer. A local feature is a local expression of the image's features and reflects local specificity in the image.
The local feature extraction layer generally extracts the local features of the input picture using local image descriptors. Blob detection methods mainly include Gaussian-operator detection (e.g., the Laplacian of Gaussian), methods based on the Hessian matrix of pixels and its determinant, the scale-invariant feature transform (SIFT), and the like; corner detection methods mainly include Harris corner detection (based on the first-derivative matrix of image gray levels), FAST feature detection, and the like.
As to how the local feature extraction layer extracts a number of local features from the input picture, those skilled in the art may refer to related descriptions in the prior art; details are not repeated here.
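For illustration, the classical detectors named above are all available in OpenCV; the following sketch runs them on a grayscale image (parameter values are arbitrary defaults, not choices from the patent):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Harris corner detection (first-derivative matrix of image gray levels)
harris_response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

# FAST feature detection
fast = cv2.FastFeatureDetector_create()
fast_keypoints = fast.detect(img, None)

# SIFT: scale-invariant keypoints plus a 128-d local descriptor per keypoint
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)  # descriptors: N x 128
```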
When processing a picture, existing convolutional neural networks do not consider the differing importance of regions in the picture but process all image blocks uniformly, and convolution operates only over small regions, so spatial interactions within the picture, such as containment relations and positional relations, are lost. In view of this, the embodiment of the present invention proposes adding an attention mechanism on top of the backbone network; specifically, a feature enhancement layer performs attention-based feature enhancement on the local features, which suppresses irrelevant background interference, weakens background noise, and reduces the influence of local occlusion on the features, further improving the characterization capability and richness of the features.
As to how the feature enhancement layer performs feature enhancement on several local features based on attention mechanism, those skilled in the art can refer to the related description in the following embodiments, which will not be detailed herein.
The global feature extraction layer is used for carrying out feature aggregation on the basis of a plurality of input enhanced local features to obtain global features, and outputting the global features to the classifier. Global features refer to the overall properties of an image. The global feature is obtained by feature aggregation based on the enhanced local feature after feature enhancement, so that the global feature can inhibit irrelevant background interference, weaken background noise and influence of local shielding on the feature, and further improve the feature characterization capability.
The classifier is used for performing classification identification according to the input global features to obtain prediction classification parameters of the input pictures.
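Taken together, the four parts can be sketched as a single module. The following is a minimal PyTorch sketch under stated assumptions: the torchvision ResNet-101 stands in for ResNeSt-101, the feature dimension and class count are illustrative, and the two attention-based layers are left as placeholders that later sketches flesh out; it is not the patent's implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class InitialModel(nn.Module):
    """Four parts in input-to-output order, as in fig. 2: local feature
    extraction -> feature enhancement -> global feature extraction -> classifier."""
    def __init__(self, num_classes=1000, dim=2048):
        super().__init__()
        backbone = models.resnet101(weights=None)  # stand-in for ResNeSt-101
        # local feature extraction layer: the backbone without its pooling/fc head
        self.local_extractor = nn.Sequential(*list(backbone.children())[:-2])
        self.enhancer = nn.Identity()                    # placeholder: attention-based enhancement
        self.global_extractor = nn.AdaptiveAvgPool2d(1)  # placeholder: feature aggregation
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x):
        local = self.local_extractor(x)                 # B x C x H x W local features
        enhanced = self.enhancer(local)                 # B x C x H x W enhanced local features
        g = self.global_extractor(enhanced).flatten(1)  # B x C global feature
        return self.classifier(g), g                    # predicted classification parameters
```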
Based on this, in the embodiment of the present invention, the training picture set is input into the initial model; the local feature extraction layer in the initial model extracts a number of local features from the training picture and outputs them to the feature enhancement layer in the initial model; the feature enhancement layer performs attention-based feature enhancement on the local features to obtain enhanced local features and outputs them to the global feature extraction layer in the initial model; the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to the classifier in the initial model; and the classifier obtains the predicted classification parameters of the training picture according to the global feature.
Further, the model parameters of the initial model are adjusted according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model. Specifically, a loss function is computed from the label parameters and the predicted classification parameters of the training pictures, and whether the current initial model has converged is determined from the loss; if so, training stops and the current initial model is taken as the trained classification model; if not, the model parameters of the current initial model are adjusted according to the loss function, and the procedure returns to the step of inputting training pictures into the current initial model, until the current initial model is determined to have converged.
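A hedged sketch of this adjust-until-convergence loop follows. The cross-entropy loss, SGD optimizer, epoch count, and convergence threshold are all assumptions for illustration; the patent only specifies that a loss is derived from the label parameters and the predicted classification parameters. The model is assumed to follow the `InitialModel` sketch above.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-3, target_loss=1e-3):
    criterion = nn.CrossEntropyLoss()  # loss from label vs. predicted classification parameters
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        running = 0.0
        for pictures, labels in loader:
            logits, _ = model(pictures)          # predicted classification parameters
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                     # adjust model parameters via the loss
            running += loss.item()
        if running / len(loader) < target_loss:  # crude convergence test
            break                                # current model is the trained classification model
    return model
```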
As described above, the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer in the trained classification model can be used jointly to extract the global feature from a picture to be recognized, from which the classifier classifies and recognizes the object in that picture. Therefore, in the embodiment of the present invention, a feature extraction model for extracting the global features of the picture to be recognized is constructed based on the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer in the classification model.
Further, in application, the picture to be recognized may be input to the feature extraction model trained according to the above method, so as to obtain the global feature of the picture to be recognized.
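Continuing the sketch above (the attribute names come from the hypothetical `InitialModel`, not from the patent), the feature extraction model is simply the first three layers of the trained classification model with the classifier discarded:

```python
import torch
import torch.nn as nn

def build_feature_extractor(trained_model: nn.Module) -> nn.Module:
    """Assemble the feature extraction model from the three retained layers."""
    return nn.Sequential(
        trained_model.local_extractor,   # local feature extraction layer
        trained_model.enhancer,          # feature enhancement layer
        trained_model.global_extractor,  # global feature extraction layer
        nn.Flatten(1),                   # global feature as a B x C vector
    )

# application: with torch.no_grad(), pass the picture to be recognized
# through the extractor to obtain its global feature.
```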
According to the technical solution provided by the embodiments of the invention, a training picture set is acquired and input into an initial model; a local feature extraction layer in the initial model extracts a number of local features from the training picture and outputs them to a feature enhancement layer in the initial model; the feature enhancement layer performs attention-based feature enhancement on the local features to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model; the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model; and the classifier obtains the predicted classification parameters of the training picture from the global feature. The model parameters of the initial model are adjusted according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model, and a feature extraction model for extracting the global features of a picture to be recognized is constructed from the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer of the classification model; in application, the global feature is obtained by inputting the picture to be recognized into the feature extraction model. This realizes an end-to-end global feature extraction scheme, which favors effective alignment of local and global features and performs feature extraction and feature aggregation jointly, improving feature extraction efficiency and reducing algorithm complexity compared with prior-art algorithms that perform the two separately. Meanwhile, enhancing the local features with an attention mechanism suppresses irrelevant background interference, weakens background noise, and reduces the influence of local occlusion on the features, further improving the characterization capability and richness of the features.
Referring to fig. 3, a flowchart of an embodiment of a feature extraction method according to an embodiment of the present invention is provided. As shown in fig. 3, the method comprises the following steps:
step 301, inputting the picture to be recognized into the feature extraction model.
Step 302, a local feature extraction layer in the feature extraction model extracts a plurality of local features from the picture to be recognized and outputs the local features to a feature enhancement layer in the feature extraction model.
And 303, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain a plurality of enhanced local features, and outputs the enhanced local features to a global feature extraction layer in the feature extraction model.
In an embodiment, the feature enhancement layer may implement feature enhancement on several local features based on an attention mechanism through the process shown in fig. 4, so as to obtain enhanced local features. As shown in fig. 4, the method comprises the following steps:
step 401, calculating a plurality of local features by using a self-attention mechanism to obtain a visual marker sequence.
In step 401, the feature enhancement layer first re-encodes the local features using a channel self-attention mechanism. The channel self-attention mechanism assigns a weight to each feature channel through a preset algorithm and then applies that weight to the corresponding channel of the local features, thereby re-encoding them. This processing identifies the local features that deserve attention and improves the characterization capability of the local features.
Then, the feature enhancement layer groups the re-encoded local features using a spatial self-attention mechanism and determines each group as a visual marker, obtaining the visual marker sequence. The spatial self-attention mechanism can handle various spatial deformations of the data and automatically capture important regional features, and then associates, i.e., groups, the re-encoded local features according to those regional features, yielding groups of visual markers that can represent the image. This processing improves the richness of the feature characterization.
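The patent gives no formulas for these two steps; the following is one plausible reading, with the channel weights computed squeeze-and-excitation style and the spatial grouping done by a soft assignment of positions to L markers, as in token-based visual transformers. The shapes, the reduction factor 16, and `num_tokens` are assumptions.

```python
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Re-encode local features with channel self-attention, then group them
    into a sequence of visual markers with spatial self-attention."""
    def __init__(self, dim=2048, num_tokens=8):
        super().__init__()
        self.channel_gate = nn.Sequential(          # one weight per feature channel
            nn.Linear(dim, dim // 16), nn.ReLU(),
            nn.Linear(dim // 16, dim), nn.Sigmoid(),
        )
        self.group = nn.Linear(dim, num_tokens)     # spatial attention over positions

    def forward(self, x):                           # x: B x C x H x W local features
        flat = x.flatten(2).transpose(1, 2)         # B x HW x C
        # channel self-attention: weight each channel, re-encoding the features
        gate = self.channel_gate(flat.mean(dim=1))  # B x C channel weights
        recoded = flat * gate.unsqueeze(1)          # B x HW x C re-encoded features
        # spatial self-attention: soft-assign each position to one of L groups
        assign = self.group(recoded).softmax(dim=1) # B x HW x L, normalized over HW
        tokens = assign.transpose(1, 2) @ recoded   # B x L x C visual marker sequence
        return tokens, recoded
```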
Step 402, performing feature enhancement on the local features using the visual marker sequence to obtain the enhanced local features.
In an embodiment, performing feature enhancement on the local features using the visual marker sequence may include: first performing feature enhancement on each visual marker in the visual marker sequence using a self-attention mechanism to obtain an enhanced visual marker sequence, and then performing feature enhancement on the local features using the enhanced visual marker sequence based on a cross-attention mechanism, obtaining the enhanced local features.
In the above embodiment, the main purpose of the cross-attention enhancement is to supplement the original local features with the different visual markers and then further screen out, from the supplemented local features, those that deserve particular attention, thereby strengthening the characterization capability of the local features.
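A sketch of these two enhancement steps using PyTorch's built-in multi-head attention; the head count, residual connection, and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class MarkerEnhancer(nn.Module):
    """Self-attention over the visual marker sequence, then cross-attention in
    which each local feature queries the enhanced markers."""
    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.marker_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, markers, local_feats):  # B x L x C markers, B x HW x C features
        # step 1: enhance each visual marker against the others
        enhanced_markers, _ = self.marker_self_attn(markers, markers, markers)
        # step 2: cross-attention -- queries are the local features, keys and
        # values are the enhanced markers, supplementing the original features
        supplement, _ = self.cross_attn(local_feats, enhanced_markers, enhanced_markers)
        return local_feats + supplement        # B x HW x C enhanced local features
```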
And step 304, the global feature extraction layer performs feature aggregation based on the plurality of enhanced local features to obtain global features.
In an embodiment, the global feature extraction layer may perform feature aggregation on the plurality of enhanced local features to obtain the global feature.
In another embodiment, feature aggregation may be performed on the enhanced local features together with the original local features to obtain the global feature. Compared with aggregating only the enhanced local features, this further improves the characterization capability of the resulting global feature.
Optionally, the algorithm used for feature aggregation may be VLAD (Vector of Locally Aggregated Descriptors), ASMK, DELG, or the like, which is not limited here.
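As one concrete possibility, a NetVLAD-style differentiable VLAD layer could serve as the global feature extraction layer; the cluster count and the normalization scheme below are assumptions, and per the embodiment above the input could be the enhanced local features concatenated with the original ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADAggregator(nn.Module):
    """Soft-assign local features to K clusters and aggregate the residuals
    into a single global feature (NetVLAD-style)."""
    def __init__(self, dim=2048, num_clusters=16):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, feats):                        # feats: B x N x C
        a = self.assign(feats).softmax(dim=-1)       # B x N x K soft assignments
        resid = feats.unsqueeze(2) - self.centroids  # B x N x K x C residuals
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)  # B x K x C aggregated residuals
        vlad = F.normalize(vlad, dim=-1).flatten(1)  # intra-normalize, then flatten
        return F.normalize(vlad, dim=-1)             # B x (K*C) global feature
```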
In the process shown in fig. 3, the attention mechanism is used to enhance the local features, and the global feature is then obtained from the enhanced local features; irrelevant background interference is thus suppressed, the influence of background noise and local occlusion on the features is weakened, and the feature characterization capability is further improved.
In addition, the attention mechanism is flexible to apply and can be inserted into different backbone networks as needed, so applying it to feature extraction improves the overall performance of the algorithm and facilitates subsequent joint optimization.
Referring to fig. 5, a block diagram of an embodiment of a feature extraction model training apparatus according to an embodiment of the present invention is provided. As shown in fig. 5, the apparatus includes:
an obtaining module 51, configured to obtain a training picture set;
a model training module 52, configured to input the training picture set into an initial model, where a local feature extraction layer in the initial model extracts a number of local features from a training picture and outputs them to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model, and the classifier obtains the predicted classification parameters of the training picture according to the global feature; and further configured to adjust the model parameters of the initial model according to the label parameters and the predicted classification parameters of the training pictures to obtain a trained classification model;
a feature extraction model building module 53, configured to build a feature extraction model based on the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer in the classification model, where the feature extraction model is used to extract global features of a picture to be identified.
In a possible embodiment, the model training module 52 includes:
a feature enhancement unit, configured to perform feature enhancement on the local features based on an attention mechanism to obtain enhanced local features, specifically by:
calculating the local features by using a self-attention mechanism to obtain a visual marker sequence, and performing feature enhancement on the local features by using the visual marker sequence to obtain the enhanced local features.
In a possible embodiment, the feature enhancing unit comprises:
a re-encoding subunit, configured to re-encode the local features using a channel self-attention mechanism;
and a grouping subunit, configured to group the re-encoded local features using a spatial self-attention mechanism and determine each group as a visual marker to obtain the visual marker sequence.
In a possible embodiment, the feature enhancing unit includes:
a first enhancement subunit, configured to perform feature enhancement on each visual marker in the visual marker sequence using a self-attention mechanism to obtain an enhanced visual marker sequence;
and a second enhancement subunit, configured to perform feature enhancement on the local features using the enhanced visual marker sequence based on a cross-attention mechanism to obtain the enhanced local features.
In a possible embodiment, a feature aggregation unit is specifically configured to:
perform feature aggregation on the enhanced local features together with the original local features to obtain the global feature.
Referring to fig. 6, a block diagram of an embodiment of a feature extraction apparatus according to an embodiment of the present invention is provided. As shown in fig. 6, the apparatus includes:
and the extraction module 61 is configured to input the picture to be recognized into the feature extraction model trained according to the feature extraction model training method, so as to obtain the global features of the picture to be recognized.
In a possible implementation manner, the extraction module 61 is specifically configured to:
inputting the picture to be recognized into the feature extraction model trained according to the feature extraction model training method, where a local feature extraction layer in the feature extraction model extracts a number of local features from the picture to be recognized and outputs them to a feature enhancement layer in the feature extraction model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the feature extraction model, and the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain the global feature.
In a possible embodiment, the extraction module 61 comprises:
a feature enhancement unit, configured to perform feature enhancement on the local features based on an attention mechanism to obtain enhanced local features, specifically by:
calculating the local features by using a self-attention mechanism to obtain a visual marker sequence, and performing feature enhancement on the local features by using the visual marker sequence to obtain the enhanced local features.
In a possible embodiment, the feature enhancing unit includes:
a re-encoding subunit, configured to re-encode the local features using a channel self-attention mechanism;
and a grouping subunit, configured to group the re-encoded local features using a spatial self-attention mechanism and determine each group as a visual marker to obtain the visual marker sequence.
In a possible embodiment, the feature enhancing unit includes:
a first enhancement subunit, configured to perform feature enhancement on each visual marker in the visual marker sequence using a self-attention mechanism to obtain an enhanced visual marker sequence;
and a second enhancement subunit, configured to perform feature enhancement on the local features using the enhanced visual marker sequence based on a cross-attention mechanism to obtain the enhanced local features.
In a possible embodiment, a feature aggregation unit is specifically configured to:
perform feature aggregation on the enhanced local features together with the original local features to obtain the global feature.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device 700 shown in fig. 7 includes: at least one processor 701, memory 702, at least one network interface 704, and other user interfaces 703. The various components in the electronic device 700 are coupled together by a bus system 705. It is understood that the bus system 705 is used to enable communications among the components. The bus system 705 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various busses are labeled in figure 7 as the bus system 705.
The user interface 703 may include, among other things, a display, a keyboard or a pointing device (e.g., a mouse, trackball), a touch pad or a touch screen, among others.
It is to be understood that the memory 702 in embodiments of the present invention may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 702 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 702 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 7021 and application programs 7022.
The operating system 7021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application 7022 includes various applications, such as a media player (MediaPlayer), a Browser (Browser), and the like, for implementing various application services. Programs that implement methods in accordance with embodiments of the present invention can be included in application program 7022.
In the embodiment of the present invention, the processor 701 is configured to execute the steps of the feature extraction model training method provided by each embodiment of the method by calling a program or an instruction stored in the memory 702, specifically, a program or an instruction stored in the application 7022, for example, including:
acquiring a training picture set;
inputting the training picture set into an initial model, where a local feature extraction layer in the initial model extracts a number of local features from a training picture and outputs them to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model, and the classifier obtains the predicted classification parameters of the training picture according to the global feature;
adjusting model parameters of the initial model according to the label parameters of the training pictures and the prediction classification parameters to obtain a trained classification model;
and constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer, and the global feature extraction layer in the classification model, where the feature extraction model is used to extract global features of a picture to be recognized.
Or for executing the steps of the feature extraction method provided by the method embodiments, for example, including:
and inputting the picture to be recognized into a feature extraction model trained according to any one of the feature extraction model training methods to obtain the global features of the picture to be recognized.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 702, and the processor 701 reads the information in the memory 702 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be the electronic device shown in fig. 7, and may execute all the steps of the feature extraction model training method or the feature extraction method in the embodiments described above, so as to achieve the technical effects of the feature extraction model training method or the feature extraction method in the embodiments described above.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium herein stores one or more programs. Among others, the storage medium may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; the memory may also comprise a combination of the above kinds of memories.
When the one or more programs in the storage medium are executed by one or more processors, the feature extraction model training method or the feature extraction method performed on the electronic device side as described above is implemented.
The processor is configured to execute a model training program stored in the memory to implement the following steps of the feature extraction model training method performed on the electronic device side:
acquiring a training picture set;
inputting the training picture set into an initial model, where a local feature extraction layer in the initial model extracts a number of local features from a training picture and outputs them to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the local features based on an attention mechanism to obtain enhanced local features and outputs them to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the enhanced local features to obtain a global feature and outputs it to a classifier in the initial model, and the classifier obtains the predicted classification parameters of the training picture according to the global feature;
adjusting model parameters of the initial model according to the label parameters of the training pictures and the prediction classification parameters to obtain a trained classification model;
and constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer and the global feature extraction layer in the classification model, wherein the feature extraction model is used for extracting global features of the picture to be identified.
Or implementing the following steps of the feature extraction method performed on the electronic device side:
and inputting the picture to be recognized into the feature extraction model trained according to any one of the feature extraction model training methods to obtain the global features of the picture to be recognized.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (15)

1. A feature extraction model training method is characterized by comprising the following steps:
acquiring a training picture set;
inputting the training picture set into an initial model, extracting a plurality of local features from a training picture by a local feature extraction layer in the initial model and outputting the local features to a feature enhancement layer in the initial model, performing feature enhancement on the local features by the feature enhancement layer based on an attention mechanism to obtain a plurality of enhanced local features and outputting the enhanced local features to a global feature extraction layer in the initial model, performing feature aggregation by the global feature extraction layer based on the enhanced local features to obtain global features and outputting the global features to a classifier in the initial model, and obtaining prediction classification parameters of the training picture by the classifier according to the global features;
adjusting model parameters of the initial model according to the label parameters of the training pictures and the prediction classification parameters to obtain a trained classification model;
and constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer and the global feature extraction layer in the classification model, wherein the feature extraction model is used for extracting global features of the picture to be identified.
2. The method of claim 1, wherein the performing feature enhancement on the local features based on the attention mechanism to obtain a plurality of enhanced local features comprises:
calculating the local features by using a self-attention mechanism to obtain a visual marker sequence;
and performing feature enhancement on the plurality of local features by using the visual marker sequence to obtain a plurality of enhanced local features.
3. The method of claim 2, wherein the calculating the local features by using a self-attention mechanism to obtain a visual marker sequence comprises:
re-encoding the local features by using a channel self-attention mechanism;
grouping the re-encoded local features by using a spatial self-attention mechanism, and determining each group as a visual marker to obtain the visual marker sequence.
4. The method of claim 2, wherein the performing feature enhancement on the plurality of local features by using the visual marker sequence to obtain a plurality of enhanced local features comprises:
performing feature enhancement on each visual marker in the visual marker sequence by using a self-attention mechanism, respectively, to obtain an enhanced visual marker sequence;
and performing feature enhancement on the plurality of local features by using the enhanced visual marker sequence based on a cross-attention mechanism to obtain a plurality of enhanced local features.
5. The method of claim 1, wherein the performing feature aggregation based on the plurality of enhanced local features to obtain a global feature comprises:
performing feature aggregation on the plurality of enhanced local features and the plurality of local features to obtain a global feature.
6. A method of feature extraction, comprising:
inputting the picture to be recognized into the feature extraction model trained according to the method of any one of claims 1 to 5, and obtaining the global features of the picture to be recognized.
7. The method according to claim 7, wherein the inputting the picture to be recognized into the feature extraction model trained according to any one of claims 1 to 5 to obtain the global features of the picture to be recognized comprises:
inputting a picture to be recognized into a feature extraction model trained according to the method of any one of claims 1 to 5, wherein a local feature extraction layer in the feature extraction model extracts a plurality of local features from the picture to be recognized and outputs the plurality of local features to a feature enhancement layer in the feature extraction model, the feature enhancement layer performs feature enhancement on the plurality of local features based on an attention mechanism to obtain a plurality of enhanced local features and outputs the plurality of enhanced local features to a global feature extraction layer in the feature extraction model, and the global feature extraction layer performs feature aggregation based on the plurality of enhanced local features to obtain a global feature.
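Reusing the hypothetical FeatureExtractor from the claim-1 sketch (all names assumed), inference is a single forward pass:

    import torch

    feature_extractor.eval()                       # a trained FeatureExtractor instance
    with torch.no_grad():
        image = torch.randn(1, 3, 224, 224)        # stand-in for a picture to be recognized
        global_feature = feature_extractor(image)  # (1, C) descriptor, e.g. for retrieval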
8. The method of claim 7, wherein performing feature enhancement on the plurality of local features based on the attention mechanism to obtain the plurality of enhanced local features comprises:
calculating the plurality of local features by using a self-attention mechanism to obtain a visual marker sequence;
and performing feature enhancement on the plurality of local features by using the visual marker sequence to obtain the plurality of enhanced local features.
9. The method of claim 8, wherein calculating the plurality of local features by using the self-attention mechanism to obtain the visual marker sequence comprises:
re-encoding the plurality of local features by utilizing a channel self-attention mechanism;
and grouping the plurality of re-encoded local features by utilizing a spatial self-attention mechanism, and determining each group as a visual marker to obtain the visual marker sequence.
10. The method of claim 8, wherein performing feature enhancement on the plurality of local features by using the visual marker sequence to obtain the plurality of enhanced local features comprises:
respectively performing feature enhancement on each visual marker in the visual marker sequence by using a self-attention mechanism to obtain an enhanced visual marker sequence;
and performing feature enhancement on the plurality of local features by utilizing the enhanced visual marker sequence based on a cross-attention mechanism to obtain the plurality of enhanced local features.
11. The method of claim 7, wherein performing feature aggregation based on the plurality of enhanced local features to obtain the global feature comprises:
performing feature aggregation on the plurality of enhanced local features and the plurality of local features to obtain the global feature.
12. A feature extraction model training device, comprising:
an acquisition module, used for acquiring a training picture set;
a model training module, used for inputting the training picture set into an initial model, wherein a local feature extraction layer in the initial model extracts a plurality of local features from a training picture and outputs the plurality of local features to a feature enhancement layer in the initial model, the feature enhancement layer performs feature enhancement on the plurality of local features based on an attention mechanism to obtain a plurality of enhanced local features and outputs the plurality of enhanced local features to a global feature extraction layer in the initial model, the global feature extraction layer performs feature aggregation based on the plurality of enhanced local features to obtain a global feature and outputs the global feature to a classifier in the initial model, and the classifier obtains prediction classification parameters of the training picture according to the global feature; and for adjusting model parameters of the initial model according to the label parameters of the training picture and the prediction classification parameters to obtain a trained classification model;
and a feature extraction model construction module, used for constructing a feature extraction model based on the local feature extraction layer, the feature enhancement layer and the global feature extraction layer in the classification model, wherein the feature extraction model is used for extracting a global feature of a picture to be recognized.
13. A feature extraction device, comprising:
an extraction module, used for inputting a picture to be recognized into the feature extraction model trained according to the method of any one of claims 1 to 5 to obtain a global feature of the picture to be recognized.
14. An electronic device, comprising: a processor and a memory, the processor being configured to execute a model training program stored in the memory to implement the feature extraction model training method of any one of claims 1 to 5, or to execute a feature extraction program stored in the memory to implement the feature extraction method of any one of claims 6 to 11.
15. A storage medium storing one or more programs executable by one or more processors to implement the feature extraction model training method of any one of claims 1 to 5 or to implement the feature extraction method of any one of claims 6 to 11.
CN202211617205.5A (priority and filing date 2022-12-13): Feature extraction model training and feature extraction method and device. Status: Pending. Publication: CN115880545A (en).

Priority Applications (1)

Application Number: CN202211617205.5A
Publication: CN115880545A (en)
Priority Date: 2022-12-13
Filing Date: 2022-12-13
Title: Feature extraction model training and feature extraction method and device


Publications (1)

Publication Number: CN115880545A (en)
Publication Date: 2023-03-31

Family

ID=85753808

Family Applications (1)

Application Number: CN202211617205.5A
Publication: CN115880545A (en)
Priority Date: 2022-12-13
Filing Date: 2022-12-13
Title: Feature extraction model training and feature extraction method and device

Country Status (1)

Country: CN (1)
Link: CN115880545A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination