CN111738012A - Method and device for extracting semantic alignment features, computer equipment and storage medium - Google Patents

Info

Publication number
CN111738012A
CN111738012A (application CN202010409366.XA; granted publication CN111738012B)
Authority
CN
China
Prior art keywords
picture
convolutional neural
target
neural network
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010409366.XA
Other languages
Chinese (zh)
Other versions
CN111738012B (en)
Inventor
韩浩瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202010409366.XA priority Critical patent/CN111738012B/en
Publication of CN111738012A publication Critical patent/CN111738012A/en
Application granted granted Critical
Publication of CN111738012B publication Critical patent/CN111738012B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the field of classification models in artificial intelligence and provides a method, an apparatus, computer equipment, and a storage medium for extracting semantic alignment features. The method comprises the following steps: extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling to obtain a global feature vector, while locating the index of each component of the global feature vector in the feature map; obtaining the constituent elements of each component of the global feature vector according to the indexes; sorting the network parameters of the constituent elements of all components by magnitude and taking the input vectors corresponding to the first N target network parameters as target input vectors; and combining the components of each target input vector with the global feature vector to obtain a multi-granularity semantic alignment feature vector. The multi-granularity semantic alignment feature vector incorporates the effective components and therefore has multi-granularity characteristics. In addition, the method and apparatus can be applied to the field of intelligent traffic, thereby promoting the construction of smart cities.

Description

Method and device for extracting semantic alignment features, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for extracting semantic alignment features, a computer device, and a storage medium.
Background
AI applications built on deep learning algorithms, such as modeling, face recognition, and vehicle recognition, have achieved good results, but further improvement has reached a bottleneck. Current deep learning models still have obvious shortcomings, mainly in the handling of semantic alignment and the extraction of multi-granularity features, as follows:
First, the region division that current deep learning models perform on the feature map is rigid: a region of the feature map that originally belongs to a single semantic unit may be split apart, so that no effective semantics can be formed. Second, after feature extraction is performed on the sub-regions into which the feature map is divided, the results are still high-level semantics and essentially lack semantic multi-granularity. Finally, the aligned features of the divided sub-regions may be severely semantically inconsistent.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a computer device, and a storage medium for extracting semantic alignment features, which aim to overcome the defect that currently extracted semantic features lack multi-granularity characteristics.
In order to achieve the above object, the present application provides a method for extracting semantic alignment features, comprising the following steps:
extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
acquiring the constituent elements of each component of the global feature vector according to the index of that component in the feature map; wherein the constituent elements of a component are the input vector of the pooling layer and the network parameter of the pooling layer that must be fed to the pooling layer for it to output that component of the global feature vector;
sorting the network parameters in the constituent elements of all the components from large to small, extracting the first N of them as target network parameters, and acquiring the input vector contained in the constituent elements where each target network parameter is located as a target input vector;
extracting components of each target input vector as effective components;
and combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector.
Further, before the step of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
obtaining picture sample data, and sampling the picture sample data with replacement to obtain three groups of training sample sets;
training the original convolutional neural network based on the three groups of training sample sets respectively to obtain three initial convolutional neural networks;
randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting unmarked first pictures into the target neural networks to extract multi-granularity semantic alignment features to obtain first multi-granularity features and second multi-granularity features;
judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, if so, marking any one multi-granularity characteristic to the first picture to form a first training pair;
and inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
Further, before the step of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
collecting a preselected picture of a pedestrian; the preselected picture comprises pedestrians;
judging whether the preselected picture meets a preset definition condition or not;
and if so, preprocessing the preselected picture, and inputting the preprocessed preselected picture serving as the target picture into a preset convolutional neural network.
Further, the step of judging whether the preselected picture meets a preset definition condition includes:
acquiring a gray image of the preselected picture, acquiring a gray value of each pixel point in the gray image, and calculating an average gray value;
acquiring a first gray value which is larger than the average gray value in the gray image and a second gray value which is smaller than the average gray value;
calculating first difference values between the first gray values and the average gray value, and calculating the average value of the first difference values to obtain a first value;
calculating second difference values between the average gray value and each second gray value, and calculating the average value of each second difference value to obtain a second value;
calculating the average value of the first value and the second value, and judging whether the average value of the first value and the second value is greater than a preset value;
and if so, judging that the preselected picture meets a preset definition condition.
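The definition (sharpness) check in the steps above amounts to a mean-absolute-deviation contrast measure. The following is a minimal NumPy sketch, not the application's implementation; the function name and the threshold value are assumptions for illustration:

```python
import numpy as np

def passes_definition_condition(gray_image, threshold):
    """Average the distances of above-mean and below-mean gray values from
    the mean gray value, then compare the average of those two values with
    a preset threshold, as described in the steps above."""
    mean = gray_image.mean()                       # average gray value
    above = gray_image[gray_image > mean]          # first gray values
    below = gray_image[gray_image < mean]          # second gray values
    first = (above - mean).mean() if above.size else 0.0   # first value
    second = (mean - below).mean() if below.size else 0.0  # second value
    return (first + second) / 2.0 > threshold
```

A high-contrast image (pixels far from the mean) passes a moderate threshold, while a flat image fails, matching the intent of the definition condition.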
Further, the step of preprocessing the preselected picture and inputting the preprocessed preselected picture into a preset convolutional neural network as the target picture includes:
extracting coordinate information of each pedestrian in the preselected picture through a preset DPM model;
according to the coordinate information of each pedestrian in the preselected picture, segmenting a target image of each pedestrian from the preselected picture;
and creating a blank layer, and paving the target image of each pedestrian in the blank layer to obtain the target picture.
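The cropping-and-tiling preprocessing can be sketched as follows. The DPM detector itself is not reimplemented here; the `boxes` argument stands in for the per-pedestrian (x, y, w, h) coordinates it would return, and the canvas size is an assumed parameter:

```python
import numpy as np

def build_target_picture(preselected, boxes, canvas_size):
    """Crop each pedestrian box from the preselected picture and lay the
    crops side by side on a blank layer, which becomes the target picture."""
    canvas_h, canvas_w = canvas_size
    canvas = np.zeros((canvas_h, canvas_w), dtype=preselected.dtype)  # blank layer
    x_cursor = 0
    for x, y, w, h in boxes:
        crop = preselected[y:y + h, x:x + w]       # segment the target image
        canvas[0:h, x_cursor:x_cursor + w] = crop  # tile left to right
        x_cursor += w
    return canvas
```

Real pedestrian crops would also need resizing to a common height before tiling; that step is omitted to keep the sketch minimal.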
The present application further provides a device for extracting semantic alignment features, including:
the first extraction unit is used for extracting a feature map of a target picture based on a preset convolutional neural network and carrying out global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
the first acquisition unit is used for acquiring the constituent elements of each component of the global feature vector according to the index of that component in the feature map; wherein the constituent elements of a component are the input vector of the pooling layer and the network parameter of the pooling layer that must be fed to the pooling layer for it to output that component of the global feature vector;
the extraction unit is used for sorting the network parameters in the constituent elements of all the components from large to small, extracting the first N of them as target network parameters, and acquiring the input vector contained in the constituent elements where each target network parameter is located as a target input vector;
an extracting unit configured to extract components of the respective target input vectors as effective components;
and the combination unit is used for sequentially combining the global feature vector and each effective component to obtain a multi-granularity semantic alignment feature vector.
Further, still include:
the second acquisition unit is used for acquiring picture sample data and sampling it with replacement to obtain three groups of training sample sets;
the first training unit is used for respectively training the original convolutional neural network based on the three groups of training sample sets to obtain three initial convolutional neural networks;
the selecting unit is used for randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting the unmarked first pictures into the target neural networks to extract the multi-granularity semantic alignment features so as to obtain a first multi-granularity feature and a second multi-granularity feature;
the first judging unit is used for judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, and if so, marking any one of the multi-granularity characteristics to the first picture to form a first training pair;
and the second training unit is used for inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
Further, still include:
the acquisition unit is used for acquiring preselected pictures of pedestrians; the preselected picture comprises pedestrians;
the second judgment unit is used for judging whether the preselected picture meets a preset definition condition or not;
and the input unit is used for, if the preselected picture satisfies the preset definition condition, preprocessing the preselected picture and inputting the preprocessed preselected picture into a preset convolutional neural network as the target picture.
Further, the second determination unit includes:
the first acquisition subunit is used for acquiring the gray image of the preselected picture, acquiring the gray value of each pixel point in the gray image and calculating the average gray value;
the second acquisition subunit is used for acquiring a first gray value which is greater than the average gray value in the gray image and a second gray value which is less than the average gray value;
the first calculating subunit is configured to calculate first differences between the first gray values and the average gray value, and calculate an average value of the first differences to obtain a first value;
the second calculating subunit is configured to calculate second differences between the average grayscale value and each of the second grayscale values, and calculate an average value of each of the second differences to obtain a second value;
the third calculating subunit is used for calculating an average value of the first value and the second value and judging whether the average value of the first value and the second value is larger than a preset value or not;
and the judging subunit is used for judging that the preselected picture satisfies the preset definition condition if the average value of the first value and the second value is greater than the preset value.
Further, the input unit includes:
the extraction subunit is used for extracting the coordinate information of each pedestrian in the preselected picture through a preset DPM model;
the segmentation subunit is used for segmenting a target image of each pedestrian from the preselected picture according to the coordinate information of each pedestrian in the preselected picture;
and the tiling subunit is used for creating a blank layer and tiling the target image of each pedestrian in the blank layer to obtain the target picture.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
The method, apparatus, computer equipment, and storage medium for extracting semantic alignment features provided by the application comprise the following steps: extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, wherein, during the global maximum pooling, the pooling layer locates the index of each component of the global feature vector in the feature map; acquiring the constituent elements of each component of the global feature vector according to those indexes, wherein the constituent elements of a component are the input vector of the pooling layer and the network parameter of the pooling layer that must be fed to the pooling layer for it to output that component; sorting the network parameters in the constituent elements of all the components from large to small, extracting the first N as target network parameters, and acquiring the input vectors contained in their constituent elements as target input vectors; extracting the components of each target input vector as effective components; and combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector. The granularity used by the method does not come from a hard division of the feature map but is formed automatically by training the convolutional neural network, which gives it good flexibility; the multi-granularity semantic alignment feature vector incorporates the effective components, i.e. the detail semantic features, so the extracted semantic features have multi-granularity characteristics.
Meanwhile, because the effective components are extracted based on the index of each component of the global feature vector in the feature map, semantic consistency is maintained.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a method for extracting semantic alignment features according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of an apparatus for extracting semantic alignment features according to an embodiment of the present disclosure;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for extracting semantic alignment features, including the following steps:
step S1, extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
step S2, obtaining the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; when the output result of the pooling layer of the convolutional neural network is the component of the global feature vector, the constituent element of each component needs to be correspondingly input to the input vector of the pooling layer and the network parameter of the pooling layer;
step S3, sorting the network parameters of the constituent elements of all the components from big to small, extracting N target network parameters arranged at the front, and acquiring the input vector included in the constituent element where each target network parameter is located as a target input vector;
step S4, extracting components of each of the target input vectors as effective components;
and step S5, combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector.
In this embodiment, the method is applied to scenarios in image recognition that require multi-granularity semantic alignment; it can also be applied to the field of intelligent traffic, thereby promoting the construction of smart cities. For example, in image processing and pedestrian recognition scenarios in intelligent traffic, it can effectively improve feature extraction and thereby improve the image processing results.
In the deep learning recognition algorithms currently in general use, the feature map must first be divided into regions (in the width, height, and channel directions), then fed into a back-end network (which usually reduces the dimensionality or normalizes the features to the format required by the subsequent network), and finally the feature vectors are output. In contrast, in this embodiment the feature map is not divided into hard regions; the granularity is formed automatically by training, which provides good flexibility.
Specifically, as described in step S1, the target picture is usually a pedestrian picture, and a convolutional neural network (CNN) performs feature extraction on it to form a feature map. The last layer of the convolutional neural network is a pooling layer, which applies global maximum pooling to the feature map in the forward pass to output a global feature vector. It should be understood that the global feature vector is the highest-level semantic feature and by itself has no semantic multi-granularity; detail semantic features therefore need to be fused into it.
In this embodiment, while performing the global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map; that is, after the pooling layer performs global maximum pooling over the input vectors of the feature map, it records, for each component of the global feature vector, which position in the feature map that component came from.
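As a concrete illustration of the index bookkeeping described above, the following is a minimal NumPy sketch (an illustrative stand-in, not the application's implementation): global maximum pooling over an (H, W, C) feature map that also records, for each component of the global feature vector, the spatial position it was taken from.

```python
import numpy as np

def global_max_pool_with_indices(feature_map):
    """Global max pooling over the spatial dimensions of an (H, W, C)
    feature map. Returns the C-dimensional global feature vector together
    with, per component, the (row, col) index of its maximum -- the index
    relationship the pooling layer is said to establish."""
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)             # one column per channel
    flat_idx = flat.argmax(axis=0)                   # position of each maximum
    global_vec = flat[flat_idx, np.arange(c)]        # the pooled vector
    indices = [divmod(int(i), w) for i in flat_idx]  # (row, col) per component
    return global_vec, indices

# toy 4x4 feature map with 3 channels and strictly increasing values
fmap = np.arange(48, dtype=float).reshape(4, 4, 3)
vec, idx = global_max_pool_with_indices(fmap)
```

On this toy feature map every channel's maximum sits at position (3, 3), so the recorded indexes make the backward lookup of step S2 trivial.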
In step S2, the components of the global feature vector are propagated backward to obtain the detail semantic features. In this embodiment, using the role the indexes played in the forward pass, backward propagation recovers the constituent elements of each component of the global feature vector: the input vector x_1 of the pooling layer and the network parameter W of the pooling layer that must be supplied for the global maximum pooling in the pooling layer to yield that component. It should be understood that every component x_i of the global feature vector has elements that constitute it; these are referred to as its constituent elements.
As described in steps S3-S4, the larger the value of a network parameter, the higher its reference value. The network parameters W in the constituent elements of all the components are therefore sorted from large to small, and the first N are extracted as the final target network parameters. From each target network parameter, the constituent elements it belongs to can be located, and the input vector x_1 contained in those constituent elements is taken as a final target input vector. Because the target input vectors have undergone neither region division nor the pooling layer, they still have multi-granularity characteristics, so their components are extracted as effective components; the effective components are the detail semantic features.
Finally, as stated in step S5, the global feature vector is combined with each of the effective components to obtain a multi-granularity semantic alignment feature vector, which both realizes semantic alignment and preserves multi-granularity characteristics. Multi-granularity characteristics and semantic alignment go to the essence of the recognition problem, and their effectiveness has been verified in practice; combining the two in this embodiment yields general effectiveness, can significantly improve recognition results, and is robust.
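Steps S2-S5 can be sketched as follows, again as a hedged illustration: since standard max pooling has no trainable parameters, the pooling layer's "network parameter W" and "input vector x_1" at each feature-map position are modeled here as hypothetical lookup tables keyed by the indexes recorded during pooling; the names `input_vectors`, `weights`, and `multi_granularity_feature` are assumptions for illustration, not identifiers from the application.

```python
import numpy as np

def multi_granularity_feature(global_vec, indices, input_vectors, weights, n):
    """Recover the constituent elements (W, x_1) of each component via its
    index, keep the input vectors belonging to the N largest parameters,
    and append their components to the global feature vector."""
    # constituent elements of each component of the global feature vector
    elements = [(weights[ix], input_vectors[ix]) for ix in indices]
    # sort network parameters from large to small, keep the first N
    elements.sort(key=lambda e: e[0], reverse=True)
    targets = [x for _, x in elements[:n]]           # target input vectors
    # effective components = components of each target input vector
    effective = np.concatenate(targets) if targets else np.empty(0)
    # combine global feature vector with the effective components in order
    return np.concatenate([global_vec, effective])

# toy data: two components located at feature-map positions (0, 0) and (1, 1)
g = np.array([1.0, 2.0])
inputs = {(0, 0): np.array([0.5]), (1, 1): np.array([0.7, 0.9])}
params = {(0, 0): 0.2, (1, 1): 0.8}
result = multi_granularity_feature(g, [(0, 0), (1, 1)], inputs, params, n=1)
```

With N=1, only the input vector under the largest parameter (0.8) contributes its components, which are appended to the global feature vector.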
Compared with current alignment schemes in face recognition and pedestrian re-recognition, this embodiment implements semantic alignment without imposing additional labeling requirements on the data set. The granularity it uses does not come from a hard division of the feature map but is formed automatically by training the convolutional neural network, which gives good flexibility. The multi-granularity semantic alignment feature vector incorporates the detail semantic features (i.e., the effective components) and therefore has multi-granularity characteristics; meanwhile, because the effective components are extracted based on the index of each component of the global feature vector in the feature map, semantic consistency is maintained.
Preferably, to ensure the privacy and security of data such as the target picture and the multi-granularity semantic alignment feature vector, all of the data may be stored in a node of a blockchain. The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In an embodiment, before the step S1 of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
a. obtaining picture sample data, and sampling the picture sample data with replacement to obtain three groups of training sample sets;
in this embodiment, the above-mentioned picture sample data is divided into three groups of training sample sets by using a sample return sampling method, where each training sample set is used for training an original convolutional neural network.
b. Training the original convolutional neural network based on the three groups of training sample sets respectively to obtain three initial convolutional neural networks;
wherein, the three groups of training sample sets differ slightly, so the prediction results of the three initial convolutional neural networks obtained also differ slightly.
c. Randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting unmarked first pictures into the target neural networks to extract multi-granularity semantic alignment features to obtain first multi-granularity features and second multi-granularity features;
Because labeling pictures is labor-intensive, the amount of current picture sample data is small. To increase the training data for the convolutional neural network, unlabeled first pictures are used: each first picture is input into the two target neural networks for feature extraction to obtain the multi-granularity semantic alignment features.
d. Judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, if so, marking any one multi-granularity characteristic to the first picture to form a first training pair;
if the extracted multi-granularity semantic alignment features of the two target neural networks are the same, it is indicated that the confidence degrees of the two target neural networks are high, and the confidence degrees of the correspondingly extracted features are also high, so that any one of the multi-granularity features can be labeled to the first picture to form a first training pair to serve as a training sample for training the unselected initial convolutional neural network, and the data volume of the training sample is remarkably increased. If the extracted multi-granularity semantic alignment features of the two target neural networks are different, the confidence degrees of the two target neural networks are not high, and the confidence degrees of the correspondingly extracted features are not high, so that the target neural networks need to be trained again to iteratively optimize parameters. Or, a non-labeled picture is reselected and input into the target neural network to extract the multi-granularity semantic alignment features until the multi-granularity semantic alignment features extracted by the two target neural networks are the same.
e. Inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network as the preset convolutional neural network. This embodiment provides the training process of the preset convolutional neural network. The process trains several models together and expresses model confidence through the consistency of their outputs: only when the results output by the several models are the same is the training considered accurate, which improves the accuracy of the trained models. Meanwhile, unlabeled first pictures are used for training, and with the increased amount of training samples, the convolutional neural network obtained from the corresponding training performs better.
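The agreement check in steps c. and d. above can be sketched as follows. This is a minimal illustration, not the patented implementation: `co_train_pseudo_labels` is a hypothetical helper name, and the two lambda "networks" are toy stand-ins for the trained target neural networks, whose real feature extraction would run a CNN forward pass.

```python
import numpy as np

def co_train_pseudo_labels(model_a, model_b, unlabeled_images):
    """Agreement-based pseudo-labeling: keep only those unlabeled pictures
    on which both target networks extract the same feature, and label the
    picture with that feature to form a first training pair."""
    training_pairs = []
    for img in unlabeled_images:
        feat_a = model_a(img)
        feat_b = model_b(img)
        if np.array_equal(feat_a, feat_b):  # identical features -> high confidence
            training_pairs.append((img, feat_a))  # either feature can be the label
    return training_pairs

# toy stand-ins for the two randomly selected target neural networks
model_a = lambda img: np.round(img.mean(axis=0))
model_b = lambda img: np.round(img.mean(axis=0))
pairs = co_train_pseudo_labels(model_a, model_b,
                               [np.ones((4, 3)), np.zeros((4, 3))])
```

The resulting `pairs` would then be fed to the unselected initial convolutional neural network for iterative training, as in step e.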
In an embodiment, before the step S1 of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
step S11, collecting a preselected picture of the pedestrian; the preselected picture comprises pedestrians;
step S12, judging whether the preselected picture meets the preset definition condition;
and step S13, if yes, preprocessing the preselected picture, and inputting the preprocessed preselected picture serving as the target picture into a preset convolutional neural network.
When the preset convolutional neural network extracts the multi-granularity semantic alignment features, the clearer the picture, the better the effect of the finally extracted features. Therefore, a preselected picture satisfying the definition condition needs to be selected in advance as the target picture.
In an embodiment, the step S12 of determining whether the preselected picture satisfies a preset definition condition includes:
s121, acquiring a gray image of the preselected picture, acquiring a gray value of each pixel point in the gray image, and calculating an average gray value; the grayscale image more easily shows the definition of one picture, so that the definition of the grayscale image is analyzed in this embodiment.
S122, acquiring the first gray values larger than the average gray value and the second gray values smaller than the average gray value in the gray image; in a picture, if the gray values of all pixel points are very close, the picture is not clear enough, whereas when the differences in gray value are large, the distinctions between pixel points are highlighted more easily and the picture finally appears clearer. Therefore, in this embodiment, the gray values of all pixel points are compared with the average value so that the differences can be analyzed in the following steps.
S123, calculating first difference values between the first gray values and the average gray value, and calculating the average value of the first difference values to obtain a first value;
s124, calculating second difference values between the average gray value and each second gray value, and calculating the average value of each second difference value to obtain a second value;
s125, calculating an average value of the first value and the second value, and judging whether the average value of the first value and the second value is larger than a preset value or not;
And S126, if so, judging that the preselected picture satisfies the preset definition condition. If the average value is greater than the preset value, the deviation from the average gray value is large and the displayed picture has higher definition; if it is less than the preset value, the picture is not clear enough.
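A minimal NumPy sketch of the definition check in steps S121 to S126; the function names and the threshold `preset_value` are assumptions for illustration, not names from the patent:

```python
import numpy as np

def definition_score(gray):
    """Average of (a) the mean excess of gray values above the image mean
    and (b) the mean shortfall of gray values below it (steps S121-S125)."""
    avg = gray.mean()
    above = gray[gray > avg]              # first gray values
    below = gray[gray < avg]              # second gray values
    first = (above - avg).mean() if above.size else 0.0   # first value
    second = (avg - below).mean() if below.size else 0.0  # second value
    return (first + second) / 2.0

def meets_definition_condition(gray, preset_value):
    """Step S126: the picture qualifies when the score exceeds the preset value."""
    return definition_score(gray) > preset_value

high_contrast = np.array([[0.0, 255.0], [0.0, 255.0]])  # sharp-looking extremes
flat = np.full((2, 2), 128.0)                           # uniform, "blurry" image
```

On the toy inputs, the high-contrast image scores 127.5 while the uniform image scores 0, matching the intuition in the text that larger deviations from the mean indicate higher definition.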
In an embodiment, the step S13 of inputting the pre-selected picture as the target picture into a preset convolutional neural network after preprocessing the pre-selected picture includes:
Extracting coordinate information of each pedestrian in the preselected picture through a preset DPM model; the DPM (Deformable Part Model) is a target detection algorithm and an important component in classification, segmentation, and human posture and behavior recognition.
According to the coordinate information of each pedestrian in the preselected picture, segmenting a target image of each pedestrian from the preselected picture;
And creating a blank layer, and tiling the target image of each pedestrian in the blank layer to obtain the target picture. It should be understood that tiling in this embodiment refers to dividing the blank layer into a plurality of horizontal, mutually parallel tiling rows, splicing the target images sequentially along one row in the horizontal direction, and, after that row is filled, continuing to tile the remaining target images in the next row.
In this embodiment, not all elements of the preselected picture are input into the preset convolutional neural network; only the pedestrian images in the preselected picture are extracted and combined to form the target picture. The target picture therefore contains fewer interference features and less data, which facilitates subsequent feature extraction and reduces the amount of computation.
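The segmentation-and-tiling preprocessing can be sketched as below. The DPM detection itself is not reproduced: `boxes` is a hypothetical stand-in for the top-left coordinates a DPM model would return, and for simplicity every crop is assumed to share the same tile size.

```python
import numpy as np

def crop_and_tile(picture, boxes, tile_h, tile_w, cols):
    """Crop each pedestrian region from the picture and tile the crops
    row by row into a blank layer, yielding the target picture."""
    rows = (len(boxes) + cols - 1) // cols
    canvas = np.zeros((rows * tile_h, cols * tile_w), dtype=picture.dtype)  # blank layer
    for i, (y, x) in enumerate(boxes):
        crop = picture[y:y + tile_h, x:x + tile_w]   # segment one pedestrian image
        r, c = divmod(i, cols)                       # position in the tiling grid
        canvas[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w] = crop
    return canvas

picture = np.arange(64).reshape(8, 8)                # toy grayscale preselected picture
target = crop_and_tile(picture, [(0, 0), (4, 4)], tile_h=4, tile_w=4, cols=2)
```

A production version would additionally resize crops of different sizes to a common tile shape before placement.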
Referring to fig. 2, an embodiment of the present application further provides an apparatus for extracting semantic alignment features, including:
the first extraction unit 10 is configured to extract a feature map of a target picture based on a preset convolutional neural network, and perform global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
A first obtaining unit 20, configured to obtain the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; wherein, when the output result of the pooling layer of the convolutional neural network is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer;
an extracting unit 30, configured to sort the network parameters of the constituent elements of all the components in descending order, extract the top N target network parameters, and obtain the input vector included in the constituent element where each target network parameter is located, as a target input vector;
an extracting unit 40 for extracting components of the respective target input vectors as effective components;
and the combining unit 50 is configured to sequentially combine the global feature vector and each of the effective components to obtain a multi-granularity semantic alignment feature vector.
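Under the simplifying assumption that the pooling layer applies one scalar weight per channel (a stand-in for the "network parameters" above), the pipeline of the five units can be sketched in NumPy as follows; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def multi_granularity_features(feature_map, weights, top_n):
    """Sketch of units 10-50: global max pooling that records each
    component's index in the feature map, then selection of the top-N
    effective components by the magnitude of the associated weights."""
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1)
    idx = flat.argmax(axis=1)                  # index of each component in the map
    global_vec = flat[np.arange(c), idx]       # global feature vector (one max per channel)
    order = np.argsort(weights)[::-1][:top_n]  # network parameters sorted large to small
    effective = global_vec[order]              # components of the target input vectors
    return np.concatenate([global_vec, effective])  # combine in sequence

fmap = np.array([[[1.0, 2.0], [3.0, 4.0]],
                 [[9.0, 0.0], [0.0, 0.0]]])    # toy 2-channel feature map
vec = multi_granularity_features(fmap, weights=np.array([0.1, 0.9]), top_n=1)
```

Here the global feature vector is `[4, 9]`, the larger weight selects channel 1, and the combined multi-granularity vector appends its component, giving `[4, 9, 9]`.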
In an embodiment, the apparatus for extracting semantic alignment features further includes:
the second acquisition unit is used for acquiring picture sample data, and performing return sampling on the picture sample data to obtain three groups of training sample sets;
the first training unit is used for respectively training the original convolutional neural network based on the three groups of training sample sets to obtain three initial convolutional neural networks;
the selecting unit is used for randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting the unmarked first pictures into the target neural networks to extract the multi-granularity semantic alignment features so as to obtain a first multi-granularity feature and a second multi-granularity feature;
the first judging unit is used for judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, and if so, marking any one of the multi-granularity characteristics to the first picture to form a first training pair;
and the second training unit is used for inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
In an embodiment, the apparatus for extracting semantic alignment features further includes:
the acquisition unit is used for acquiring preselected pictures of pedestrians; the preselected picture comprises pedestrians;
the second judgment unit is used for judging whether the preselected picture meets a preset definition condition or not;
And the input unit is used for, if so, preprocessing the preselected picture and inputting the preprocessed preselected picture as the target picture into a preset convolutional neural network.
In an embodiment, the second determining unit includes:
the first acquisition subunit is used for acquiring the gray image of the preselected picture, acquiring the gray value of each pixel point in the gray image and calculating the average gray value;
the second acquisition subunit is used for acquiring a first gray value which is greater than the average gray value in the gray image and a second gray value which is less than the average gray value;
the first calculating subunit is configured to calculate first differences between the first gray values and the average gray value, and calculate an average value of the first differences to obtain a first value;
the second calculating subunit is configured to calculate second differences between the average grayscale value and each of the second grayscale values, and calculate an average value of each of the second differences to obtain a second value;
the third calculating subunit is used for calculating an average value of the first value and the second value and judging whether the average value of the first value and the second value is larger than a preset value or not;
And the judging subunit is used for judging that the preselected picture satisfies the preset definition condition if the average value is greater than the preset value.
In one embodiment, the input unit includes:
the extraction subunit is used for extracting the coordinate information of each pedestrian in the preselected picture through a preset DPM model;
the segmentation subunit is used for segmenting a target image of each pedestrian from the preselected picture according to the coordinate information of each pedestrian in the preselected picture;
And the tiling subunit is used for creating a blank layer and tiling the target image of each pedestrian in the blank layer to obtain the target picture.
In this embodiment, please refer to the method described in the above embodiment for specific implementation of each unit and subunit in the above device embodiment, which is not described herein again.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing picture data, feature vector data, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of extracting semantic alignment features.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements a method for extracting semantic alignment features. It is to be understood that the computer-readable storage medium in the present embodiment may be a volatile or a non-volatile readable storage medium.
In summary, the method, apparatus, computer device, and storage medium for extracting semantic alignment features provided in the embodiments of the present application include: extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, wherein, during global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map; acquiring the constituent elements of each component of the global feature vector according to the index of each component in the feature map, wherein, when the output result of the pooling layer is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer; sorting the network parameters of the constituent elements of all the components in descending order, extracting the top N target network parameters, and acquiring the input vector included in the constituent element where each target network parameter is located as a target input vector; extracting the components of each target input vector as effective components; and combining the global feature vector with each effective component in sequence to obtain a multi-granularity semantic alignment feature vector. The granularity used by the method is not a hard division of the feature map but is formed automatically by training the convolutional neural network, so the method has good flexibility; the multi-granularity semantic alignment feature vector combines the effective components, that is, the detailed semantic features, and therefore has a multi-granularity character.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, apparatus, article, or method that includes the element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (10)

1. A method for extracting semantic alignment features, comprising the steps of:
extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
acquiring the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; wherein, when the output result of the pooling layer of the convolutional neural network is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer;
sorting the network parameters of the constituent elements of all the components in descending order, extracting the top N target network parameters, and acquiring the input vector included in the constituent element where each target network parameter is located as a target input vector;
extracting components of each target input vector as effective components;
and combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector.
2. The method for extracting semantic alignment features according to claim 1, wherein before the step of extracting the feature map of the target picture based on the preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method comprises:
obtaining picture sample data, and performing return sampling on the picture sample data to obtain three groups of training sample sets;
training the original convolutional neural network based on the three groups of training sample sets respectively to obtain three initial convolutional neural networks;
randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting unmarked first pictures into the target neural networks to extract multi-granularity semantic alignment features to obtain first multi-granularity features and second multi-granularity features;
judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, if so, marking any one multi-granularity characteristic to the first picture to form a first training pair;
and inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
3. The method for extracting semantic alignment features according to claim 1, wherein before the step of extracting the feature map of the target picture based on the preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method comprises:
collecting a preselected picture of a pedestrian; the preselected picture comprises pedestrians;
judging whether the preselected picture meets a preset definition condition or not;
and if so, preprocessing the preselected picture, and inputting the preprocessed preselected picture serving as the target picture into a preset convolutional neural network.
4. The method for extracting semantic alignment features according to claim 3, wherein the step of judging whether the pre-selected picture meets a preset definition condition comprises:
acquiring a gray image of the preselected picture, acquiring a gray value of each pixel point in the gray image, and calculating an average gray value;
acquiring a first gray value which is larger than the average gray value in the gray image and a second gray value which is smaller than the average gray value;
calculating first difference values between the first gray values and the average gray value, and calculating the average value of the first difference values to obtain a first value;
calculating second difference values between the average gray value and each second gray value, and calculating the average value of each second difference value to obtain a second value;
calculating the average value of the first value and the second value, and judging whether the average value of the first value and the second value is greater than a preset value;
and if so, judging that the preselected picture meets a preset definition condition.
5. The method for extracting semantic alignment features according to claim 3, wherein the step of preprocessing the pre-selected picture and inputting the preprocessed pre-selected picture as the target picture into a preset convolutional neural network comprises:
extracting coordinate information of each pedestrian in the preselected picture through a preset DPM model;
according to the coordinate information of each pedestrian in the preselected picture, segmenting a target image of each pedestrian from the preselected picture;
and creating a blank layer, and paving the target image of each pedestrian in the blank layer to obtain the target picture.
6. An apparatus for extracting semantic alignment features, comprising:
the first extraction unit is used for extracting a feature map of a target picture based on a preset convolutional neural network and carrying out global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
the first acquisition unit is used for acquiring the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; wherein, when the output result of the pooling layer of the convolutional neural network is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer;
the extraction unit is used for sorting the network parameters of the constituent elements of all the components in descending order, extracting the top N target network parameters, and acquiring the input vector included in the constituent element where each target network parameter is located, as a target input vector;
an extracting unit configured to extract components of the respective target input vectors as effective components;
and the combination unit is used for sequentially combining the global feature vector and each effective component to obtain a multi-granularity semantic alignment feature vector.
7. The apparatus for extracting semantic alignment features according to claim 6, further comprising:
the second acquisition unit is used for acquiring picture sample data, and performing return sampling on the picture sample data to obtain three groups of training sample sets;
the first training unit is used for respectively training the original convolutional neural network based on the three groups of training sample sets to obtain three initial convolutional neural networks;
the selecting unit is used for randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting the unmarked first pictures into the target neural networks to extract the multi-granularity semantic alignment features so as to obtain a first multi-granularity feature and a second multi-granularity feature;
the first judging unit is used for judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, and if so, marking any one of the multi-granularity characteristics to the first picture to form a first training pair;
and the second training unit is used for inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
8. The apparatus for extracting semantic alignment features according to claim 6, further comprising:
the acquisition unit is used for acquiring preselected pictures of pedestrians; the preselected picture comprises pedestrians;
the second judgment unit is used for judging whether the preselected picture meets a preset definition condition or not;
and the input unit is used for, if so, preprocessing the preselected picture and inputting the preprocessed preselected picture as the target picture into a preset convolutional neural network.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202010409366.XA 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features Active CN111738012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010409366.XA CN111738012B (en) 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010409366.XA CN111738012B (en) 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features

Publications (2)

Publication Number Publication Date
CN111738012A true CN111738012A (en) 2020-10-02
CN111738012B CN111738012B (en) 2023-08-18

Family

ID=72647228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010409366.XA Active CN111738012B (en) 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features

Country Status (1)

Country Link
CN (1) CN111738012B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001195592A (en) * 1999-11-16 2001-07-19 Stmicroelectronics Srl Digital image sorting method by contents
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN108830855A (en) * 2018-04-02 2018-11-16 华南理工大学 A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature
US20190050639A1 (en) * 2017-08-09 2019-02-14 Open Text Sa Ulc Systems and methods for generating and using semantic images in deep learning for classification and data extraction
US20190294661A1 (en) * 2018-03-21 2019-09-26 Adobe Inc. Performing semantic segmentation of form images using deep learning

Also Published As

Publication number Publication date
CN111738012B (en) 2023-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant