CN111738012A - Method and device for extracting semantic alignment features, computer equipment and storage medium - Google Patents

Info

Publication number
CN111738012A
CN111738012A (application CN202010409366.XA; granted publication CN111738012B)
Authority
CN
China
Prior art keywords
picture
convolutional neural
target
neural network
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010409366.XA
Other languages
Chinese (zh)
Other versions
CN111738012B (en)
Inventor
韩浩瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202010409366.XA priority Critical patent/CN111738012B/en
Publication of CN111738012A publication Critical patent/CN111738012A/en
Application granted granted Critical
Publication of CN111738012B publication Critical patent/CN111738012B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the field of classification models in artificial intelligence and provides a method, an apparatus, computer equipment, and a storage medium for extracting semantic alignment features. The method comprises the following steps: extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling to obtain a global feature vector, while locating the index of each component of the global feature vector in the feature map; obtaining the constituent elements of each component of the global feature vector according to the indexes; sorting the network parameters of the constituent elements of all components by magnitude and taking the input vectors corresponding to the first N target network parameters as target input vectors; and combining the components of each target input vector with the global feature vector to obtain a multi-granularity semantic alignment feature vector. The multi-granularity semantic alignment feature vector incorporates the effective components and therefore has multi-granularity characteristics. In addition, the method and apparatus can be applied to the field of intelligent traffic, thereby promoting the construction of smart cities.

Description

Method and device for extracting semantic alignment features, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for extracting semantic alignment features, a computer device, and a storage medium.
Background
AI applications built on deep learning algorithms, such as modeling, face recognition, and vehicle recognition, have achieved good results, but further improvement has reached a bottleneck. Current deep learning models still have obvious shortcomings, mainly in the handling of semantic alignment and the extraction of multi-granularity features, as follows:
First, the region division that current deep learning models perform on the feature map is rigid: a region of the feature map that originally belongs to a single semantic unit may be split apart, so that no effective semantics can be formed. Second, after feature extraction is performed on the sub-regions into which the feature map is divided, the results are still high-level semantics and essentially lack semantic multi-granularity. Finally, the aligned features of the divided sub-regions may be severely semantically inconsistent.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a computer device, and a storage medium for extracting semantic alignment features, which aim to overcome the defect that currently extracted semantic features lack multi-granularity characteristics.
In order to achieve the above object, the present application provides a method for extracting semantic alignment features, comprising the following steps:
extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
acquiring the constituent elements of each component of the global feature vector according to the index of that component in the feature map; wherein the constituent elements of a component are the input vector of the pooling layer and the network parameter of the pooling layer that must be fed to the pooling layer for it to output that component of the global feature vector;
sorting the network parameters in the constituent elements of all the components from large to small, extracting the first N of them as target network parameters, and acquiring the input vector contained in the constituent elements where each target network parameter is located as a target input vector;
extracting components of each target input vector as effective components;
and combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector.
Further, before the step of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
obtaining picture sample data, and sampling the picture sample data with replacement to obtain three groups of training sample sets;
training the original convolutional neural network based on the three groups of training sample sets respectively to obtain three initial convolutional neural networks;
randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting unmarked first pictures into the target neural networks to extract multi-granularity semantic alignment features to obtain first multi-granularity features and second multi-granularity features;
judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, if so, marking any one multi-granularity characteristic to the first picture to form a first training pair;
and inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
Further, before the step of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
collecting a preselected picture of a pedestrian; the preselected picture comprises pedestrians;
judging whether the preselected picture meets a preset definition condition or not;
and if so, preprocessing the preselected picture, and inputting the preprocessed preselected picture serving as the target picture into a preset convolutional neural network.
Further, the step of judging whether the preselected picture meets a preset definition condition includes:
acquiring a gray image of the preselected picture, acquiring a gray value of each pixel point in the gray image, and calculating an average gray value;
acquiring a first gray value which is larger than the average gray value in the gray image and a second gray value which is smaller than the average gray value;
calculating first difference values between the first gray values and the average gray value, and calculating the average value of the first difference values to obtain a first value;
calculating second difference values between the average gray value and each second gray value, and calculating the average value of each second difference value to obtain a second value;
calculating the average value of the first value and the second value, and judging whether the average value of the first value and the second value is greater than a preset value;
and if so, judging that the preselected picture meets a preset definition condition.
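The definition (sharpness) check in the steps above amounts to a mean-absolute-deviation contrast measure. The following is a minimal NumPy sketch, not the application's implementation; the function name and the threshold value are assumptions for illustration:

```python
import numpy as np

def passes_definition_condition(gray_image, threshold):
    """Average the distances of above-mean and below-mean gray values from
    the mean gray value, then compare the average of those two values with
    a preset threshold, as described in the steps above."""
    mean = gray_image.mean()                       # average gray value
    above = gray_image[gray_image > mean]          # first gray values
    below = gray_image[gray_image < mean]          # second gray values
    first = (above - mean).mean() if above.size else 0.0   # first value
    second = (mean - below).mean() if below.size else 0.0  # second value
    return (first + second) / 2.0 > threshold
```

A high-contrast image (pixels far from the mean) passes a moderate threshold, while a flat image fails, matching the intent of the definition condition.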
Further, the step of preprocessing the preselected picture and inputting the preprocessed preselected picture into a preset convolutional neural network as the target picture includes:
extracting coordinate information of each pedestrian in the preselected picture through a preset DPM model;
according to the coordinate information of each pedestrian in the preselected picture, segmenting a target image of each pedestrian from the preselected picture;
and creating a blank layer, and paving the target image of each pedestrian in the blank layer to obtain the target picture.
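The cropping-and-tiling preprocessing can be sketched as follows. The DPM detector itself is not reimplemented here; the `boxes` argument stands in for the per-pedestrian (x, y, w, h) coordinates it would return, and the canvas size is an assumed parameter:

```python
import numpy as np

def build_target_picture(preselected, boxes, canvas_size):
    """Crop each pedestrian box from the preselected picture and lay the
    crops side by side on a blank layer, which becomes the target picture."""
    canvas_h, canvas_w = canvas_size
    canvas = np.zeros((canvas_h, canvas_w), dtype=preselected.dtype)  # blank layer
    x_cursor = 0
    for x, y, w, h in boxes:
        crop = preselected[y:y + h, x:x + w]       # segment the target image
        canvas[0:h, x_cursor:x_cursor + w] = crop  # tile left to right
        x_cursor += w
    return canvas
```

Real pedestrian crops would also need resizing to a common height before tiling; that step is omitted to keep the sketch minimal.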
The present application further provides a device for extracting semantic alignment features, including:
the first extraction unit is used for extracting a feature map of a target picture based on a preset convolutional neural network and carrying out global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
the first acquisition unit is used for acquiring the constituent elements of each component of the global feature vector according to the index of that component in the feature map; wherein the constituent elements of a component are the input vector of the pooling layer and the network parameter of the pooling layer that must be fed to the pooling layer for it to output that component of the global feature vector;
the extraction unit is used for sorting the network parameters in the constituent elements of all the components from large to small, extracting the first N of them as target network parameters, and acquiring the input vector contained in the constituent elements where each target network parameter is located as a target input vector;
an extracting unit configured to extract components of the respective target input vectors as effective components;
and the combination unit is used for sequentially combining the global feature vector and each effective component to obtain a multi-granularity semantic alignment feature vector.
Further, still include:
the second acquisition unit is used for acquiring picture sample data and sampling it with replacement to obtain three groups of training sample sets;
the first training unit is used for respectively training the original convolutional neural network based on the three groups of training sample sets to obtain three initial convolutional neural networks;
the selecting unit is used for randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting the unmarked first pictures into the target neural networks to extract the multi-granularity semantic alignment features so as to obtain a first multi-granularity feature and a second multi-granularity feature;
the first judging unit is used for judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, and if so, marking any one of the multi-granularity characteristics to the first picture to form a first training pair;
and the second training unit is used for inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
Further, still include:
the acquisition unit is used for acquiring preselected pictures of pedestrians; the preselected picture comprises pedestrians;
the second judgment unit is used for judging whether the preselected picture meets a preset definition condition or not;
and the input unit is used for, if the preselected picture satisfies the preset definition condition, preprocessing the preselected picture and inputting the preprocessed preselected picture into a preset convolutional neural network as the target picture.
Further, the second determination unit includes:
the first acquisition subunit is used for acquiring the gray image of the preselected picture, acquiring the gray value of each pixel point in the gray image and calculating the average gray value;
the second acquisition subunit is used for acquiring a first gray value which is greater than the average gray value in the gray image and a second gray value which is less than the average gray value;
the first calculating subunit is configured to calculate first differences between the first gray values and the average gray value, and calculate an average value of the first differences to obtain a first value;
the second calculating subunit is configured to calculate second differences between the average grayscale value and each of the second grayscale values, and calculate an average value of each of the second differences to obtain a second value;
the third calculating subunit is used for calculating an average value of the first value and the second value and judging whether the average value of the first value and the second value is larger than a preset value or not;
and the judging subunit is used for judging that the preselected picture satisfies the preset definition condition if the average value of the first value and the second value is greater than the preset value.
Further, the input unit includes:
the extraction subunit is used for extracting the coordinate information of each pedestrian in the preselected picture through a preset DPM model;
the segmentation subunit is used for segmenting a target image of each pedestrian from the preselected picture according to the coordinate information of each pedestrian in the preselected picture;
and the tiling subunit is used for creating a blank layer and tiling the target image of each pedestrian in the blank layer to obtain the target picture.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the above.
The method, apparatus, computer equipment, and storage medium for extracting semantic alignment features provided by the application comprise the following steps: extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, wherein, during the global maximum pooling, the pooling layer locates the index of each component of the global feature vector in the feature map; acquiring the constituent elements of each component of the global feature vector according to those indexes, wherein the constituent elements of a component are the input vector of the pooling layer and the network parameter of the pooling layer that must be fed to the pooling layer for it to output that component; sorting the network parameters in the constituent elements of all the components from large to small, extracting the first N as target network parameters, and acquiring the input vectors contained in their constituent elements as target input vectors; extracting the components of each target input vector as effective components; and combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector. The granularity used by the method does not come from a hard division of the feature map but is formed automatically by training the convolutional neural network, which gives it good flexibility; the multi-granularity semantic alignment feature vector incorporates the effective components, i.e. the detail semantic features, so the extracted semantic features have multi-granularity characteristics.
Meanwhile, because the effective components are extracted based on the index of each component of the global feature vector in the feature map, semantic consistency is maintained.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a method for extracting semantic alignment features according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of an apparatus for extracting semantic alignment features according to an embodiment of the present disclosure;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for extracting semantic alignment features, including the following steps:
step S1, extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
step S2, obtaining the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; when the output result of the pooling layer of the convolutional neural network is the component of the global feature vector, the constituent element of each component needs to be correspondingly input to the input vector of the pooling layer and the network parameter of the pooling layer;
step S3, sorting the network parameters of the constituent elements of all the components from big to small, extracting N target network parameters arranged at the front, and acquiring the input vector included in the constituent element where each target network parameter is located as a target input vector;
step S4, extracting components of each of the target input vectors as effective components;
and step S5, combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector.
In this embodiment, the method is applied to scenarios in image recognition that require multi-granularity semantic alignment; it can also be applied to the field of intelligent traffic, thereby promoting the construction of smart cities. For example, in image processing and pedestrian recognition scenarios in intelligent traffic, it can effectively improve feature extraction and thereby improve the image processing results.
In the deep learning recognition algorithms currently in general use, the feature map must first be divided into regions (in the width, height, and channel directions), then fed into a back-end network (which usually reduces the dimensionality or normalizes the features to the format required by the subsequent network), and finally the feature vectors are output. In contrast, in this embodiment the feature map is not divided into hard regions; the granularity is formed automatically by training, which provides good flexibility.
Specifically, as described in step S1, the target picture is usually a pedestrian picture, and a convolutional neural network (CNN) performs feature extraction on it to form a feature map. The last layer of the convolutional neural network is a pooling layer, which applies global maximum pooling to the feature map in the forward pass to output a global feature vector. It should be understood that the global feature vector is the highest-level semantic feature and by itself has no semantic multi-granularity; detail semantic features therefore need to be fused into it.
In this embodiment, while performing the global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map; that is, after the pooling layer performs global maximum pooling over the input vectors of the feature map, it records, for each component of the global feature vector, which position in the feature map that component came from.
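As a concrete illustration of the index bookkeeping described above, the following is a minimal NumPy sketch (an illustrative stand-in, not the application's implementation): global maximum pooling over an (H, W, C) feature map that also records, for each component of the global feature vector, the spatial position it was taken from.

```python
import numpy as np

def global_max_pool_with_indices(feature_map):
    """Global max pooling over the spatial dimensions of an (H, W, C)
    feature map. Returns the C-dimensional global feature vector together
    with, per component, the (row, col) index of its maximum -- the index
    relationship the pooling layer is said to establish."""
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)             # one column per channel
    flat_idx = flat.argmax(axis=0)                   # position of each maximum
    global_vec = flat[flat_idx, np.arange(c)]        # the pooled vector
    indices = [divmod(int(i), w) for i in flat_idx]  # (row, col) per component
    return global_vec, indices

# toy 4x4 feature map with 3 channels and strictly increasing values
fmap = np.arange(48, dtype=float).reshape(4, 4, 3)
vec, idx = global_max_pool_with_indices(fmap)
```

On this toy feature map every channel's maximum sits at position (3, 3), so the recorded indexes make the backward lookup of step S2 trivial.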
In step S2, the components of the global feature vector are propagated backward to obtain the detail semantic features. In this embodiment, using the role the indexes played in the forward pass, backward propagation recovers the constituent elements of each component of the global feature vector: the input vector x_1 of the pooling layer and the network parameter W of the pooling layer that must be supplied for the global maximum pooling in the pooling layer to yield that component. It should be understood that every component x_i of the global feature vector has elements that constitute it; these are referred to as its constituent elements.
As described in steps S3-S4, the larger the value of a network parameter, the higher its reference value. The network parameters W in the constituent elements of all the components are therefore sorted from large to small, and the first N are extracted as the final target network parameters. From each target network parameter, the constituent elements it belongs to can be located, and the input vector x_1 contained in those constituent elements is taken as a final target input vector. Because the target input vectors have undergone neither region division nor the pooling layer, they still have multi-granularity characteristics, so their components are extracted as effective components; the effective components are the detail semantic features.
Finally, as stated in step S5, the global feature vector is combined with each of the effective components to obtain a multi-granularity semantic alignment feature vector, which both realizes semantic alignment and preserves multi-granularity characteristics. Multi-granularity characteristics and semantic alignment go to the essence of the recognition problem, and their effectiveness has been verified in practice; combining the two in this embodiment yields general effectiveness, can significantly improve recognition results, and is robust.
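Steps S2-S5 can be sketched as follows, again as a hedged illustration: since standard max pooling has no trainable parameters, the pooling layer's "network parameter W" and "input vector x_1" at each feature-map position are modeled here as hypothetical lookup tables keyed by the indexes recorded during pooling; the names `input_vectors`, `weights`, and `multi_granularity_feature` are assumptions for illustration, not identifiers from the application.

```python
import numpy as np

def multi_granularity_feature(global_vec, indices, input_vectors, weights, n):
    """Recover the constituent elements (W, x_1) of each component via its
    index, keep the input vectors belonging to the N largest parameters,
    and append their components to the global feature vector."""
    # constituent elements of each component of the global feature vector
    elements = [(weights[ix], input_vectors[ix]) for ix in indices]
    # sort network parameters from large to small, keep the first N
    elements.sort(key=lambda e: e[0], reverse=True)
    targets = [x for _, x in elements[:n]]           # target input vectors
    # effective components = components of each target input vector
    effective = np.concatenate(targets) if targets else np.empty(0)
    # combine global feature vector with the effective components in order
    return np.concatenate([global_vec, effective])

# toy data: two components located at feature-map positions (0, 0) and (1, 1)
g = np.array([1.0, 2.0])
inputs = {(0, 0): np.array([0.5]), (1, 1): np.array([0.7, 0.9])}
params = {(0, 0): 0.2, (1, 1): 0.8}
result = multi_granularity_feature(g, [(0, 0), (1, 1)], inputs, params, n=1)
```

With N=1, only the input vector under the largest parameter (0.8) contributes its components, which are appended to the global feature vector.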
Compared with current alignment schemes in face recognition and pedestrian re-recognition, this embodiment implements semantic alignment without imposing additional labeling requirements on the data set. The granularity it uses does not come from a hard division of the feature map but is formed automatically by training the convolutional neural network, which gives good flexibility. The multi-granularity semantic alignment feature vector incorporates the detail semantic features (i.e., the effective components) and therefore has multi-granularity characteristics; meanwhile, because the effective components are extracted based on the index of each component of the global feature vector in the feature map, semantic consistency is maintained.
Preferably, to ensure the privacy and security of data such as the target picture and the multi-granularity semantic alignment feature vector, all of the data may be stored in a node of a blockchain. The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
In an embodiment, before the step S1 of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
a. obtaining picture sample data, and sampling the picture sample data with replacement to obtain three groups of training sample sets;
in this embodiment, the above-mentioned picture sample data is divided into three groups of training sample sets by using a sample return sampling method, where each training sample set is used for training an original convolutional neural network.
b. Training the original convolutional neural network based on the three groups of training sample sets respectively to obtain three initial convolutional neural networks;
wherein, the three groups of training sample sets differ slightly, so the prediction results of the three initial convolutional neural networks obtained also differ slightly.
c. Randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting unmarked first pictures into the target neural networks to extract multi-granularity semantic alignment features to obtain first multi-granularity features and second multi-granularity features;
Because labeling pictures is labor-intensive, the amount of current picture sample data is small. To increase the training data for the convolutional neural network, unlabeled first pictures are used: each first picture is input into the two target neural networks for feature extraction to obtain the multi-granularity semantic alignment features.
d. Judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, if so, marking any one multi-granularity characteristic to the first picture to form a first training pair;
if the extracted multi-granularity semantic alignment features of the two target neural networks are the same, it is indicated that the confidence degrees of the two target neural networks are high, and the confidence degrees of the correspondingly extracted features are also high, so that any one of the multi-granularity features can be labeled to the first picture to form a first training pair to serve as a training sample for training the unselected initial convolutional neural network, and the data volume of the training sample is remarkably increased. If the extracted multi-granularity semantic alignment features of the two target neural networks are different, the confidence degrees of the two target neural networks are not high, and the confidence degrees of the correspondingly extracted features are not high, so that the target neural networks need to be trained again to iteratively optimize parameters. Or, a non-labeled picture is reselected and input into the target neural network to extract the multi-granularity semantic alignment features until the multi-granularity semantic alignment features extracted by the two target neural networks are the same.
e. Inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network as the preset convolutional neural network. This embodiment provides the training process of the preset convolutional neural network. The process trains several models together and expresses model confidence through the consistency of their outputs: only when the results output by the several models are the same is the training considered accurate, which improves the accuracy of the trained models. Meanwhile, unlabeled first pictures are used for training, and with the increased amount of training samples, the convolutional neural network obtained from the corresponding training performs better.
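The agreement check in steps c. and d. above can be sketched as follows. This is a minimal illustration, not the patented implementation: `co_train_pseudo_labels` is a hypothetical helper name, and the two lambda "networks" are toy stand-ins for the trained target neural networks, whose real feature extraction would run a CNN forward pass.

```python
import numpy as np

def co_train_pseudo_labels(model_a, model_b, unlabeled_images):
    """Agreement-based pseudo-labeling: keep only those unlabeled pictures
    on which both target networks extract the same feature, and label the
    picture with that feature to form a first training pair."""
    training_pairs = []
    for img in unlabeled_images:
        feat_a = model_a(img)
        feat_b = model_b(img)
        if np.array_equal(feat_a, feat_b):  # identical features -> high confidence
            training_pairs.append((img, feat_a))  # either feature can be the label
    return training_pairs

# toy stand-ins for the two randomly selected target neural networks
model_a = lambda img: np.round(img.mean(axis=0))
model_b = lambda img: np.round(img.mean(axis=0))
pairs = co_train_pseudo_labels(model_a, model_b,
                               [np.ones((4, 3)), np.zeros((4, 3))])
```

The resulting `pairs` would then be fed to the unselected initial convolutional neural network for iterative training, as in step e.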
In an embodiment, before the step S1 of extracting a feature map of a target picture based on a preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method includes:
step S11, collecting a preselected picture of the pedestrian; the preselected picture comprises pedestrians;
step S12, judging whether the preselected picture meets the preset definition condition;
and step S13, if yes, preprocessing the preselected picture, and inputting the preprocessed preselected picture serving as the target picture into a preset convolutional neural network.
When the preset convolutional neural network extracts the multi-granularity semantic alignment features, the clearer the picture, the better the effect of the finally extracted features. Therefore, a preselected picture satisfying the definition condition needs to be selected in advance as the target picture.
In an embodiment, the step S12 of determining whether the preselected picture satisfies a preset definition condition includes:
s121, acquiring a gray image of the preselected picture, acquiring a gray value of each pixel point in the gray image, and calculating an average gray value; the grayscale image more easily shows the definition of one picture, so that the definition of the grayscale image is analyzed in this embodiment.
S122, acquiring the first gray values larger than the average gray value and the second gray values smaller than the average gray value in the gray image; in a picture, if the gray values of all pixel points are very close, the picture is not clear enough, whereas when the differences in gray value are large, the distinctions between pixel points are highlighted more easily and the picture finally appears clearer. Therefore, in this embodiment, the gray values of all pixel points are compared with the average value so that the differences can be analyzed in the following steps.
S123, calculating first difference values between the first gray values and the average gray value, and calculating the average value of the first difference values to obtain a first value;
s124, calculating second difference values between the average gray value and each second gray value, and calculating the average value of each second difference value to obtain a second value;
s125, calculating an average value of the first value and the second value, and judging whether the average value of the first value and the second value is larger than a preset value or not;
And S126, if so, judging that the preselected picture satisfies the preset definition condition. If the average value is greater than the preset value, the deviation from the average gray value is large and the displayed picture has higher definition; if it is less than the preset value, the picture is not clear enough.
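A minimal NumPy sketch of the definition check in steps S121 to S126; the function names and the threshold `preset_value` are assumptions for illustration, not names from the patent:

```python
import numpy as np

def definition_score(gray):
    """Average of (a) the mean excess of gray values above the image mean
    and (b) the mean shortfall of gray values below it (steps S121-S125)."""
    avg = gray.mean()
    above = gray[gray > avg]              # first gray values
    below = gray[gray < avg]              # second gray values
    first = (above - avg).mean() if above.size else 0.0   # first value
    second = (avg - below).mean() if below.size else 0.0  # second value
    return (first + second) / 2.0

def meets_definition_condition(gray, preset_value):
    """Step S126: the picture qualifies when the score exceeds the preset value."""
    return definition_score(gray) > preset_value

high_contrast = np.array([[0.0, 255.0], [0.0, 255.0]])  # sharp-looking extremes
flat = np.full((2, 2), 128.0)                           # uniform, "blurry" image
```

On the toy inputs, the high-contrast image scores 127.5 while the uniform image scores 0, matching the intuition in the text that larger deviations from the mean indicate higher definition.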
In an embodiment, the step S13 of inputting the pre-selected picture as the target picture into a preset convolutional neural network after preprocessing the pre-selected picture includes:
Extracting coordinate information of each pedestrian in the preselected picture through a preset DPM model; the DPM (Deformable Part Model) is a target detection algorithm and an important component in classification, segmentation, and human posture and behavior recognition.
According to the coordinate information of each pedestrian in the preselected picture, segmenting a target image of each pedestrian from the preselected picture;
And creating a blank layer, and tiling the target image of each pedestrian in the blank layer to obtain the target picture. It should be understood that tiling in this embodiment refers to dividing the blank layer into a plurality of horizontal, mutually parallel tiling rows, splicing the target images sequentially along one row in the horizontal direction, and, after that row is filled, continuing to tile the remaining target images in the next row.
In this embodiment, not all elements of the preselected picture are input into the preset convolutional neural network; only the pedestrian images in the preselected picture are extracted and combined to form the target picture. The target picture therefore contains fewer interference features and less data, which facilitates subsequent feature extraction and reduces the amount of computation.
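The segmentation-and-tiling preprocessing can be sketched as below. The DPM detection itself is not reproduced: `boxes` is a hypothetical stand-in for the top-left coordinates a DPM model would return, and for simplicity every crop is assumed to share the same tile size.

```python
import numpy as np

def crop_and_tile(picture, boxes, tile_h, tile_w, cols):
    """Crop each pedestrian region from the picture and tile the crops
    row by row into a blank layer, yielding the target picture."""
    rows = (len(boxes) + cols - 1) // cols
    canvas = np.zeros((rows * tile_h, cols * tile_w), dtype=picture.dtype)  # blank layer
    for i, (y, x) in enumerate(boxes):
        crop = picture[y:y + tile_h, x:x + tile_w]   # segment one pedestrian image
        r, c = divmod(i, cols)                       # position in the tiling grid
        canvas[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w] = crop
    return canvas

picture = np.arange(64).reshape(8, 8)                # toy grayscale preselected picture
target = crop_and_tile(picture, [(0, 0), (4, 4)], tile_h=4, tile_w=4, cols=2)
```

A production version would additionally resize crops of different sizes to a common tile shape before placement.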
Referring to fig. 2, an embodiment of the present application further provides an apparatus for extracting semantic alignment features, including:
the first extraction unit 10 is configured to extract a feature map of a target picture based on a preset convolutional neural network, and perform global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
A first obtaining unit 20, configured to obtain the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; wherein, when the output result of the pooling layer of the convolutional neural network is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer;
an extracting unit 30, configured to sort the network parameters of the constituent elements of all the components in descending order, extract the top N target network parameters, and obtain the input vector included in the constituent element where each target network parameter is located, as a target input vector;
an extracting unit 40 for extracting components of the respective target input vectors as effective components;
and the combining unit 50 is configured to sequentially combine the global feature vector and each of the effective components to obtain a multi-granularity semantic alignment feature vector.
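Under the simplifying assumption that the pooling layer applies one scalar weight per channel (a stand-in for the "network parameters" above), the pipeline of the five units can be sketched in NumPy as follows; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def multi_granularity_features(feature_map, weights, top_n):
    """Sketch of units 10-50: global max pooling that records each
    component's index in the feature map, then selection of the top-N
    effective components by the magnitude of the associated weights."""
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1)
    idx = flat.argmax(axis=1)                  # index of each component in the map
    global_vec = flat[np.arange(c), idx]       # global feature vector (one max per channel)
    order = np.argsort(weights)[::-1][:top_n]  # network parameters sorted large to small
    effective = global_vec[order]              # components of the target input vectors
    return np.concatenate([global_vec, effective])  # combine in sequence

fmap = np.array([[[1.0, 2.0], [3.0, 4.0]],
                 [[9.0, 0.0], [0.0, 0.0]]])    # toy 2-channel feature map
vec = multi_granularity_features(fmap, weights=np.array([0.1, 0.9]), top_n=1)
```

Here the global feature vector is `[4, 9]`, the larger weight selects channel 1, and the combined multi-granularity vector appends its component, giving `[4, 9, 9]`.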
In an embodiment, the apparatus for extracting semantic alignment features further includes:
the second acquisition unit is used for acquiring picture sample data, and performing return sampling on the picture sample data to obtain three groups of training sample sets;
the first training unit is used for respectively training the original convolutional neural network based on the three groups of training sample sets to obtain three initial convolutional neural networks;
the selecting unit is used for randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting the unmarked first pictures into the target neural networks to extract the multi-granularity semantic alignment features so as to obtain a first multi-granularity feature and a second multi-granularity feature;
the first judging unit is used for judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, and if so, marking any one of the multi-granularity characteristics to the first picture to form a first training pair;
and the second training unit is used for inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
In an embodiment, the apparatus for extracting semantic alignment features further includes:
the acquisition unit is used for acquiring preselected pictures of pedestrians; the preselected picture comprises pedestrians;
the second judgment unit is used for judging whether the preselected picture meets a preset definition condition or not;
And the input unit is used for, if so, preprocessing the preselected picture and inputting the preprocessed preselected picture as the target picture into a preset convolutional neural network.
In an embodiment, the second determining unit includes:
the first acquisition subunit is used for acquiring the gray image of the preselected picture, acquiring the gray value of each pixel point in the gray image and calculating the average gray value;
the second acquisition subunit is used for acquiring a first gray value which is greater than the average gray value in the gray image and a second gray value which is less than the average gray value;
the first calculating subunit is configured to calculate first differences between the first gray values and the average gray value, and calculate an average value of the first differences to obtain a first value;
the second calculating subunit is configured to calculate second differences between the average grayscale value and each of the second grayscale values, and calculate an average value of each of the second differences to obtain a second value;
the third calculating subunit is used for calculating an average value of the first value and the second value and judging whether the average value of the first value and the second value is larger than a preset value or not;
And the judging subunit is used for judging that the preselected picture satisfies the preset definition condition if the average value is greater than the preset value.
In one embodiment, the input unit includes:
the extraction subunit is used for extracting the coordinate information of each pedestrian in the preselected picture through a preset DPM model;
the segmentation subunit is used for segmenting a target image of each pedestrian from the preselected picture according to the coordinate information of each pedestrian in the preselected picture;
And the tiling subunit is used for creating a blank layer and tiling the target image of each pedestrian in the blank layer to obtain the target picture.
In this embodiment, please refer to the method described in the above embodiment for specific implementation of each unit and subunit in the above device embodiment, which is not described herein again.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing picture data, feature vector data, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of extracting semantic alignment features.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements a method for extracting semantic alignment features. It is to be understood that the computer-readable storage medium in the present embodiment may be a volatile or a non-volatile readable storage medium.
In summary, the method, apparatus, computer device, and storage medium for extracting semantic alignment features provided in the embodiments of the present application include: extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, wherein, during global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map; acquiring the constituent elements of each component of the global feature vector according to the index of each component in the feature map, wherein, when the output result of the pooling layer is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer; sorting the network parameters of the constituent elements of all the components in descending order, extracting the top N target network parameters, and acquiring the input vector included in the constituent element where each target network parameter is located as a target input vector; extracting the components of each target input vector as effective components; and combining the global feature vector with each effective component in sequence to obtain a multi-granularity semantic alignment feature vector. The granularity used by the method is not a hard division of the feature map but is formed automatically by training the convolutional neural network, so the method has good flexibility; the multi-granularity semantic alignment feature vector combines the effective components, that is, the detailed semantic features, and therefore has a multi-granularity character.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, apparatus, article, or method that includes the element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (10)

1. A method for extracting semantic alignment features, comprising the steps of:
extracting a feature map of a target picture based on a preset convolutional neural network, and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
acquiring the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; wherein, when the output result of the pooling layer of the convolutional neural network is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer;
sorting the network parameters of the constituent elements of all the components in descending order, extracting the top N target network parameters, and acquiring the input vector included in the constituent element where each target network parameter is located as a target input vector;
extracting components of each target input vector as effective components;
and combining the global feature vector and each effective component in sequence to obtain a multi-granularity semantic alignment feature vector.
2. The method for extracting semantic alignment features according to claim 1, wherein before the step of extracting the feature map of the target picture based on the preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method comprises:
obtaining picture sample data, and performing return sampling on the picture sample data to obtain three groups of training sample sets;
training the original convolutional neural network based on the three groups of training sample sets respectively to obtain three initial convolutional neural networks;
randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting unmarked first pictures into the target neural networks to extract multi-granularity semantic alignment features to obtain first multi-granularity features and second multi-granularity features;
judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, if so, marking any one multi-granularity characteristic to the first picture to form a first training pair;
and inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
3. The method for extracting semantic alignment features according to claim 1, wherein before the step of extracting the feature map of the target picture based on the preset convolutional neural network and performing global maximum pooling on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector, the method comprises:
collecting a preselected picture of a pedestrian; the preselected picture comprises pedestrians;
judging whether the preselected picture meets a preset definition condition or not;
and if so, preprocessing the preselected picture, and inputting the preprocessed preselected picture serving as the target picture into a preset convolutional neural network.
4. The method for extracting semantic alignment features according to claim 3, wherein the step of judging whether the pre-selected picture meets a preset definition condition comprises:
acquiring a gray image of the preselected picture, acquiring a gray value of each pixel point in the gray image, and calculating an average gray value;
acquiring a first gray value which is larger than the average gray value in the gray image and a second gray value which is smaller than the average gray value;
calculating first difference values between the first gray values and the average gray value, and calculating the average value of the first difference values to obtain a first value;
calculating second difference values between the average gray value and each second gray value, and calculating the average value of each second difference value to obtain a second value;
calculating the average value of the first value and the second value, and judging whether the average value of the first value and the second value is greater than a preset value;
and if so, judging that the preselected picture meets a preset definition condition.
5. The method for extracting semantic alignment features according to claim 3, wherein the step of preprocessing the pre-selected picture and inputting the preprocessed pre-selected picture as the target picture into a preset convolutional neural network comprises:
extracting coordinate information of each pedestrian in the preselected picture through a preset DPM model;
according to the coordinate information of each pedestrian in the preselected picture, segmenting a target image of each pedestrian from the preselected picture;
and creating a blank layer, and paving the target image of each pedestrian in the blank layer to obtain the target picture.
6. An apparatus for extracting semantic alignment features, comprising:
the first extraction unit is used for extracting a feature map of a target picture based on a preset convolutional neural network and carrying out global maximum pooling processing on the feature map through a pooling layer of the convolutional neural network to obtain a global feature vector; wherein, in the process of global maximum pooling, the pooling layer of the convolutional neural network locates the index of each component of the global feature vector in the feature map;
the first acquisition unit is used for acquiring the constituent elements of each component of the global feature vector according to the index of each component of the global feature vector in the feature map; wherein, when the output result of the pooling layer of the convolutional neural network is a component of the global feature vector, the constituent elements of that component comprise the input vector input to the pooling layer and the corresponding network parameters of the pooling layer;
the extraction unit is used for sorting the network parameters of the constituent elements of all the components in descending order, extracting the top N target network parameters, and acquiring the input vector included in the constituent element where each target network parameter is located, as a target input vector;
an extracting unit configured to extract components of the respective target input vectors as effective components;
and the combination unit is used for sequentially combining the global feature vector and each effective component to obtain a multi-granularity semantic alignment feature vector.
7. The apparatus for extracting semantic alignment features according to claim 6, further comprising:
the second acquisition unit is used for acquiring picture sample data, and performing return sampling on the picture sample data to obtain three groups of training sample sets;
the first training unit is used for respectively training the original convolutional neural network based on the three groups of training sample sets to obtain three initial convolutional neural networks;
the selecting unit is used for randomly selecting two initial convolutional neural networks as target neural networks, and respectively inputting the unmarked first pictures into the target neural networks to extract the multi-granularity semantic alignment features so as to obtain a first multi-granularity feature and a second multi-granularity feature;
the first judging unit is used for judging whether the first multi-granularity characteristic and the second multi-granularity characteristic are the same or not, and if so, marking any one of the multi-granularity characteristics to the first picture to form a first training pair;
and the second training unit is used for inputting the first training pair into the unselected initial convolutional neural network for iterative training to obtain a trained convolutional neural network serving as the preset convolutional neural network.
8. The apparatus for extracting semantic alignment features according to claim 6, further comprising:
the acquisition unit is used for acquiring preselected pictures of pedestrians; the preselected picture comprises pedestrians;
the second judgment unit is used for judging whether the preselected picture meets a preset definition condition or not;
and the input unit is used for, if so, preprocessing the preselected picture and inputting the preprocessed preselected picture as the target picture into a preset convolutional neural network.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202010409366.XA 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features Active CN111738012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010409366.XA CN111738012B (en) 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010409366.XA CN111738012B (en) 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features

Publications (2)

Publication Number Publication Date
CN111738012A true CN111738012A (en) 2020-10-02
CN111738012B CN111738012B (en) 2023-08-18

Family

ID=72647228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010409366.XA Active CN111738012B (en) 2020-05-14 2020-05-14 Method, device, computer equipment and storage medium for extracting semantic alignment features

Country Status (1)

Country Link
CN (1) CN111738012B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001195592A (en) * 1999-11-16 2001-07-19 Stmicroelectronics Srl Digital image sorting method by contents
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN108830855A (en) * 2018-04-02 2018-11-16 华南理工大学 A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature
US20190050639A1 (en) * 2017-08-09 2019-02-14 Open Text Sa Ulc Systems and methods for generating and using semantic images in deep learning for classification and data extraction
US20190294661A1 (en) * 2018-03-21 2019-09-26 Adobe Inc. Performing semantic segmentation of form images using deep learning

Also Published As

Publication number Publication date
CN111738012B (en) 2023-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant