WO2022120996A1 - Visual position recognition method and apparatus, computer device and readable storage medium - Google Patents

Visual position recognition method and apparatus, computer device and readable storage medium

Info

Publication number
WO2022120996A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
model
image
position recognition
visual position
Prior art date
Application number
PCT/CN2020/139639
Other languages
English (en)
French (fr)
Inventor
张锲石
程俊
许震宇
任子良
康宇航
高向阳
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2022120996A1 publication Critical patent/WO2022120996A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present application relates to the technical field of machine vision, and in particular, to a visual position recognition method, a visual position recognition device, computer equipment, and a non-volatile computer-readable storage medium.
  • Visual position recognition has important application value in many fields, such as loop-closure detection for SLAM systems, and it can also be used for visual-content-based image search, 3D modeling, and vehicle navigation.
  • Visual position recognition faces many challenges: environmental changes caused by climate and illumination, occlusion by dynamic objects, the different viewpoints from which cameras acquire content, and the real-time requirements of the system all affect the accuracy of visual position recognition.
  • An autoencoder is an unsupervised learning deep network model.
  • An autoencoder consists of an encoder and a decoder. The encoder compresses the input of the model into a deep representation, which is then restored to the input representation by the decoder.
  • However, current autoencoders usually use traditional handcrafted features as their training constraints, so they cannot extract the effective information of a scene well, and the accuracy of visual position recognition is low.
  • Embodiments of the present application provide a visual position recognition method, a visual position recognition device, a computer device, and a non-volatile computer-readable storage medium to solve the problem of low accuracy of visual position recognition.
  • the visual position recognition method of the embodiments of the present application includes: constructing an autoencoder model, the autoencoder model including an encoder model and a decoder model connected in sequence; inputting a training image into a pre-trained VGG-16 model to output first information of the training image; inputting the training image into the autoencoder model to output second information of the training image; calculating the difference between the first information and the second information; determining that training of the autoencoder model is complete when the difference is less than a preset value; modifying the parameters of the autoencoder model when the difference is greater than the preset value, and returning to the step of inputting the training image into the autoencoder model to output the second information of the training image; and performing visual position recognition using the encoder model in the trained autoencoder model.
  • the encoder model includes multiple convolutional layers and multiple pooling layers, and the decoder model includes multiple fully connected layers.
  • a penalty term is added to at least one of the convolutional layers; and/or a dropout layer is provided between at least two adjacent fully-connected layers.
  • calculating the difference between the first information and the second information includes: using an L2 loss function to calculate the difference between the first information and the second information.
  • before the step of performing visual position recognition using the encoder model in the trained autoencoder model, the visual position recognition method further includes: inputting a test image into the trained autoencoder model to obtain third information of the test image; inputting a retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image; calculating the similarity between the test image and the retrieval image according to the third information and the fourth information; determining a visual position recognition result according to the similarity; calculating the index difference between the index of the test image and the index of the retrieval image; and determining the accuracy of the visual recognition result according to the index difference.
  • calculating the similarity between the test image and the retrieval image according to the third information and the fourth information includes: calculating the cosine similarity between the test image and the retrieval image according to the third information and the fourth information; determining a visual position recognition result according to the similarity includes: confirming that the test image and the retrieval image correspond to the same scene when the cosine similarity is greater than a preset similarity, and confirming that the test image and the retrieval image correspond to different scenes when the cosine similarity is less than the preset similarity.
  • determining the accuracy of the visual recognition result according to the index difference includes: when the index difference is smaller than a preset index difference, determining that the accuracy of the visual recognition result is greater than a predetermined threshold; when the index difference is greater than the preset index difference, determining that the accuracy of the visual recognition result is less than the predetermined threshold.
  • the visual position recognition device of the embodiment of the present application includes a building module, a first input module, a second input module, a first calculation module, a first determination module, and an identification module.
  • the building module is used to construct an autoencoder model, which includes a sequentially connected encoder model and a decoder model.
  • the first input module is used to input the training image into the pre-trained VGG-16 model to output the first information of the training image.
  • the second input module is used for inputting the training image into the autoencoder model to output second information of the training image.
  • the first calculation module is configured to calculate the difference between the first information and the second information.
  • the first determination module is used to: determine that the autoencoder model is trained when the difference is less than a preset value; modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image.
  • the recognition module is used for visual position recognition using the encoder model in the trained autoencoder model.
  • the computer device of the embodiments of the present application includes a processor, a memory, and one or more programs, where the one or more programs are stored in the memory and are executed by the processor to implement the visual position recognition method described in any of the above embodiments.
  • the non-volatile computer-readable storage medium of the embodiment of the present application contains a computer program.
  • when the computer program is executed by the processor, the visual position recognition method described in any of the above embodiments is implemented.
  • the visual position recognition method, visual position recognition device, computer device, and non-volatile computer-readable storage medium of the embodiments of the present application use the features of the deep network VGG-16 as the constraint to train the encoder in the autoencoder, thereby replacing traditional handcrafted features with deep network features and further compressing the features through the autoencoder to obtain more accurate and powerful features.
  • the robustness to the influence of illumination and viewing angle is improved, and a higher accuracy of visual position recognition is achieved.
  • FIG. 1 is a schematic flowchart of a visual recognition method according to some embodiments of the present application.
  • FIG. 2 is a schematic diagram of a module of a visual recognition device according to some embodiments of the present application.
  • FIG. 3 is a schematic diagram of the principle of a visual recognition method according to some embodiments of the present application.
  • FIG. 4 is a schematic flowchart of a visual recognition method according to some embodiments of the present application.
  • FIG. 5 is a schematic block diagram of a visual recognition device according to some embodiments of the present application.
  • FIG. 6 is a schematic diagram of the principle of a visual recognition method according to some embodiments of the present application.
  • FIG. 7 is a schematic diagram of a computer device according to some embodiments of the present application.
  • FIG. 8 is a schematic diagram of interaction between a non-volatile computer-readable storage medium and a processor according to some embodiments of the present application.
  • the present application discloses a visual position recognition method, which is characterized in that it includes:
  • 011: Construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model;
  • 012: Input a training image into the pre-trained VGG-16 model to output first information of the training image;
  • 013: Input the training image into the autoencoder model to output second information of the training image;
  • 014: Calculate the difference between the first information and the second information;
  • 015: Determine that training of the autoencoder model is complete when the difference is less than a preset value;
  • 016: Modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
  • 017: Use the encoder model in the trained autoencoder model for visual position recognition.
  • the present application also discloses a visual position recognition device 10 .
  • the visual position recognition method of the embodiment of the present application can be implemented by the visual position recognition apparatus 10 of the embodiment of the present application.
  • the visual position recognition device 10 includes a construction module 111 , a first input module 112 , a second input module 113 , a first calculation module 114 , a first determination module 115 and an identification module 116 .
  • Step 011 may be implemented by building module 111 .
  • Step 012 may be implemented by the first input module 112 .
  • Step 013 may be implemented by the second input module 113 .
  • Step 014 may be implemented by the first computing module 114 .
  • Steps 015 and 016 may be implemented by the first determination module 115 .
  • Step 017 may be implemented by the identification module 116 .
  • the building module 111 can be used to construct an autoencoder model, which includes a sequentially connected encoder model and a decoder model.
  • the first input module 112 may be used to input the training image into the pre-trained VGG-16 model to output the first information of the training image.
  • the second input module 113 may be used to input the training image into the autoencoder model to output the second information of the training image.
  • the first calculation module 114 may be used to calculate the difference between the first information and the second information.
  • the first determination module 115 can be used to determine that training of the autoencoder model is complete when the difference is less than the preset value, to modify the parameters of the autoencoder model when the difference is greater than the preset value, and to return to the step of inputting the training image into the autoencoder model to output the second information of the training image.
  • the recognition module 116 may be used for visual position recognition using the encoder model in the trained autoencoder model.
  • the VGG-16 model is a pre-trained model.
  • the VGG-16 model can be weight-trained on ImageNet to make the pre-trained VGG-16 model capable of feature extraction.
  • the VGG-16 model can extract highly abstract features from images. These features summarize some of the spatial characteristics and shape features of the image, which can greatly reduce the effects caused by changes in illumination and viewing angle. For example, if two images of the same scene are taken, where image A is taken under strong light and image B is taken under weak light, then the features of image A and image B extracted by the VGG-16 model differ only slightly.
  • the autoencoder model includes an encoder model and a decoder model, and the encoder model and the decoder model are sequentially connected.
  • the encoder model includes multiple convolutional layers and multiple pooling layers, wherein the number of convolutional layers can be 3 or 4, and the number of pooling layers can also be 3 or 4.
  • in the embodiment shown in FIG. 3, the number of convolutional layers is 4, the number of pooling layers is also 4, and each pooling layer follows a convolutional layer.
  • the number of kernels in a convolutional layer can be in the range [4, 256], which is not limited here. Setting the number of convolutional layers to 4 avoids both the weak feature extraction ability caused by too few convolutional layers and the slow feature extraction caused by too many convolutional layers.
  • the decoder model includes multiple fully connected layers, wherein the number of fully connected layers can be 2 or 3.
  • the number of fully connected layers is two, and the two fully connected layers function similarly to an upsampling layer and a deconvolution layer. Since a fully connected layer can directly set its output dimension, using 2 fully connected layers can output the desired dimension without computing specific convolutional layers.
  • the autoencoder model needs to be trained. Due to the unsupervised nature of the autoencoder model, training does not require a large number of labeled images. Therefore, the training of the autoencoder model focuses on the selection of image types.
  • the training set should include images in various states of the same scene, so that the autoencoder model can be robust to environmental changes.
  • the Places365-Standard dataset can be chosen as the training set.
  • the Places365-Standard dataset contains 1.8 million images from 365 different scenes, with each scene providing 5,000 images of the same kind of scene in different states. Using this dataset as the training set enables the autoencoder model to extract the salient features of the same scene and thus achieve strong generalization ability.
  • the images in the training set can first be scaled, for example to a size of 224x224x3; the scaled training images Im are then input into the pre-trained VGG-16 model to output the first information V1 of the training images Im (which can also be understood as labels), and the training images Im are input into the autoencoder model to output the second information V2 of the training images Im, where the first information V1 and the second information V2 are represented by vectors. Then, the difference between the first information V1 and the second information V2 can be calculated. It should be noted that the first information V1 and the second information V2 used to calculate the difference here correspond to the same training image Im.
  • the L2 loss function can be used to calculate the difference between the first information V1 and the second information V2, that is, ||V1, V2||.
  • the L2 loss function is often used in regression problems. Through the L2 loss function, the output of the autoencoder can be made to fit the VGG-16 features as closely as possible.
  • if the difference is greater than the preset value, the parameters of the autoencoder model are modified and the process returns to step 013; at this point, the training image Im is input into the autoencoder model with the modified parameters to output the second information V2 of the training image Im. This cycle repeats until the difference between the first information V1 and the second information V2 is less than or equal to the preset value. It can be understood that the criterion for judging whether a model has been trained successfully is the difference between the output of the model and the label the model targets.
  • the autoencoder model is the model to be trained, and the output of the VGG-16 model is the label that the autoencoder model targets. Therefore, the performance of the autoencoder model can be judged by the difference between the second information V2 output by the autoencoder model and the first information V1 output by the VGG-16 model. When this difference is less than or equal to the preset value, it indicates that the autoencoder model performs well and its training is complete.
  • an epoch denotes one pass of training: there are 365 scenes during training, each scene has a training subset, and training over all training subsets once is called one epoch.
  • the encoder model in the trained autoencoder model (that is, the trained encoder) can be used for visual position recognition.
  • a modeling device for indoor 3D modeling can move around indoors and acquire images in real time.
  • the modeling device uses the trained encoder model to extract features from the images and performs image matching based on the extracted features, so that it can determine which of the acquired images indicate the same scene, and it performs indoor 3D modeling based on the acquired images and the image matching results.
  • since the features extracted by the encoder model of the embodiments of the present application are less affected by illumination and viewing angle, the image matching results are more accurate and, further, the 3D modeling results are also more accurate. In addition, since the number of convolutional layers in the encoder model is small, the feature extraction time is reduced, which helps to increase the speed of image matching and further reduce the time required for 3D modeling.
  • the visual position recognition method and visual position recognition device 10 of the embodiments of the present application use the features of the deep network VGG-16 as the constraint to train the encoder in the autoencoder, thereby replacing traditional handcrafted features with deep network features; the autoencoder further compresses the features to obtain more accurate and powerful features.
  • the robustness to the influence of illumination and viewing angle is improved, and a higher accuracy of visual position recognition is achieved.
  • since the number of convolutional layers in the encoder model is small, the feature extraction time is reduced, which helps to increase the speed of image matching during visual position recognition.
  • a penalty term is added to at least one of the convolutional layers; and/or a dropout layer is provided between at least two adjacent fully-connected layers.
  • Adding a penalty term to at least one convolutional layer may mean that the penalty term is added to one, two, or three of the convolutional layers, or to all of the convolutional layers, which is not limited here.
  • the penalty term may be L1 regularization, L2 regularization, etc., which is not limited here. By adding a penalty term, overfitting of the autoencoder model can be avoided.
  • the penalty term may be L2 regularization. It can be understood that L2 regularization is to add the sum of the squares of the weight parameters to the original loss function. L2 regularization can penalize the weights of unimportant features, thus avoiding overfitting of the autoencoder model.
  • in the embodiment shown in FIG. 3, the number of fully connected layers is two, and a dropout layer may be added between the two fully connected layers.
  • the number of dropout layers can be one or more, which is not limited here.
  • the dropout rate of a dropout layer can take a value in [0.5, 0.8], and the dropout rates of different dropout layers can be the same or different, which is also not limited here.
  • in one embodiment of the present application, one dropout layer is provided between the two fully connected layers, and its dropout rate is 0.5. It can be understood that the dropout layer ensures that any two given neurons do not necessarily appear in the same sub-network structure every time, which prevents some features from being effective only in the presence of other features and forces the autoencoder model to learn more generalizable features, improving the feature extraction effect of the autoencoder.
  • the visual position recognition method further includes:
  • 018: Input a test image into the trained autoencoder model to obtain third information of the test image;
  • 019: Input a retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image;
  • 020: Calculate the similarity between the test image and the retrieval image according to the third information and the fourth information;
  • 021: Determine a visual position recognition result according to the similarity;
  • 022: Calculate the index difference between the index of the test image and the index of the retrieval image;
  • 023: Determine the accuracy of the visual recognition result according to the index difference.
  • the visual position recognition device 10 further includes a third input module 117, a fourth input module 118, a second calculation module 119, a second determination module 120, a third calculation module 121, and a third determination module 122.
  • Step 018 may be implemented by the third input module 117.
  • Step 019 may be implemented by the fourth input module 118 .
  • Step 020 may be implemented by the second computing module 119 .
  • Step 021 may be implemented by the second determination module 120 .
  • Step 022 may be implemented by the third computing module 121 .
  • Step 023 may be implemented by the third determination module 122 .
  • the third input module 117 can be used to input the test image into the trained autoencoder model to obtain the third information of the test image.
  • the fourth input module 118 may be configured to input the retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image.
  • the second calculation module 119 may be configured to calculate the similarity between the test image and the search image according to the third information and the fourth information.
  • the second determination module 120 may be configured to determine the visual position recognition result according to the similarity.
  • the third calculation module 121 may be configured to calculate the index difference between the index of the test image and the index of the retrieved image.
  • the third determination module 122 can be used to determine the accuracy of the visual recognition result according to the index difference.
  • the second calculation module 119 may be further configured to calculate the cosine similarity between the test image and the retrieval image according to the third information and the fourth information.
  • the second determination module 120 may also be configured to confirm that the test image and the retrieval image correspond to the same scene when the cosine similarity is greater than the preset similarity, and to confirm that the test image and the retrieval image correspond to different scenes when the cosine similarity is less than the preset similarity.
  • the third determination module 122 may also be configured to determine that the accuracy of the visual recognition result is greater than a predetermined threshold when the index difference is smaller than the preset index difference, and to determine that the accuracy of the visual recognition result is less than the predetermined threshold when the index difference is greater than the preset index difference.
  • The test images can also be taken from the Places365-Standard dataset.
  • the images in the Places365-Standard dataset can be divided into a training set and a test set. Since each scene in the Places365-Standard dataset has 5,000 images, 4,200 images per scene can be used as the training set and 800 images as the test set; the images in the test set do not participate in the training of the autoencoder model.
  • the images in the test set can be further divided into a test image set and a retrieval image set. The test images in the test image set (the query images shown in FIG. 6) and the retrieval images in the retrieval image set (the reference images shown in FIG. 6) are both time-series images, and each test image and each retrieval image has an index (which can also be understood as a number).
  • for example, the 800 test-set images of each scene are divided into a test image set and a retrieval image set, with 400 images in each test image set and 400 images in each retrieval image set.
  • for a test image set and a retrieval image set corresponding to the same scene S, the 400 test images in the test image set can be multiple consecutive images of scene S taken in the morning, indexed 1-400, and the 400 retrieval images in the retrieval image set can be multiple consecutive images of scene S taken in the evening, indexed 1-400.
  • during testing, multiple (for example, N x 224 x 224 x 3, where N is a positive integer) test images are selected and input into the trained encoder model to obtain the third information (Qi), and multiple (for example, N x 224 x 224 x 3, where N is a positive integer) retrieval images are selected and input into the trained encoder model to obtain the fourth information (Ri).
  • the similarity between the test image and the retrieval image is then calculated according to the third information and the fourth information; for example, the cosine similarity between the test image and the retrieval image can be calculated from the third information and the fourth information according to formula (1).
  • a preset similarity is set, which may be, for example, 0.8 or another value. If the cosine similarity between the third information and the fourth information is greater than or equal to the preset similarity, it is confirmed that the test image and the retrieval image correspond to the same scene; if their cosine similarity is less than the preset similarity, it is confirmed that the test image and the retrieval image correspond to different scenes.
  • since multiple test images and multiple retrieval images are input into the trained encoder model, a cosine similarity matrix can be calculated from the multiple pieces of third information and fourth information; the maximum value of each row of the matrix is the best match.
  • the rightmost picture in FIG. 3 is the heat map corresponding to the cosine similarity matrix between the test images and the retrieval images. The heat map is used to display the difference between two values: when the pixels on the main diagonal of the heat map appear in a first predetermined color and the pixels at the remaining positions appear in a second predetermined color, the matching degree of the images is relatively high.
  • since many images in the test set are consecutive sequence images, a tolerance (i.e., a preset index difference) can be added to define whether the encoder model has recognized the correct scene.
  • the tolerance can be defined as formula (2): |frame_query - frame_search| < 4.
  • the features extracted by the deep network have richer geometric and semantic information than traditional handcrafted features, and the autoencoder architecture enables the model to learn deeper and more compact representations of the deep network features. Therefore, the present application can extract more robust features from scene images and achieve a feature extraction capability similar to that of VGG-16, while reducing the feature extraction time by a factor of nearly 4 compared with the deep network VGG-16. It not only effectively improves the accuracy of scene recognition, but also reduces the running time of position recognition, meeting the real-time requirements of position recognition.
  • Computer device 20 includes a processor 21, a memory 22, and one or more programs. One or more programs are stored in the memory 22, and the one or more programs are executed by the processor 21 to implement the visual position recognition method described in any of the above embodiments.
  • one or more programs are executed by the processor 21 to realize the following steps:
  • 011: Construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model;
  • 012: Input a training image into the pre-trained VGG-16 model to output first information of the training image;
  • 013: Input the training image into the autoencoder model to output second information of the training image;
  • 014: Calculate the difference between the first information and the second information;
  • 015: Determine that training of the autoencoder model is complete when the difference is less than a preset value;
  • 016: Modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
  • 017: Use the encoder model in the trained autoencoder model for visual position recognition.
  • an embodiment of the present application further discloses a non-volatile computer-readable storage medium 30 .
  • the non-volatile computer-readable storage medium 30 contains computer programs. When the computer program is executed by the processor 21, the visual position recognition method described in any one of the above embodiments is implemented.
  • 011: Construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model;
  • 012: Input a training image into the pre-trained VGG-16 model to output first information of the training image;
  • 013: Input the training image into the autoencoder model to output second information of the training image;
  • 014: Calculate the difference between the first information and the second information;
  • 015: Determine that training of the autoencoder model is complete when the difference is less than a preset value;
  • 016: Modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
  • 017: Use the encoder model in the trained autoencoder model for visual position recognition.
  • any process or method description in the flowcharts or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specified logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a visual position recognition method, a visual position recognition device, a computer device, and a non-volatile computer-readable storage medium. The visual position recognition method includes: constructing an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model; inputting a training image into a pre-trained VGG-16 model to output first information of the training image; inputting the training image into the autoencoder model to output second information of the training image; calculating the difference between the first information and the second information; determining that training of the autoencoder model is complete when the difference is less than a preset value; modifying the parameters of the autoencoder model when the difference is greater than the preset value, and returning to the step of inputting the training image into the autoencoder model to output the second information of the training image; and performing visual position recognition using the encoder model in the trained autoencoder model.

Description

Visual position recognition method and apparatus, computer device and readable storage medium
Technical Field
The present application relates to the technical field of machine vision, and in particular to a visual position recognition method, a visual position recognition device, a computer device, and a non-volatile computer-readable storage medium.
Background
Visual position recognition has important application value in many fields; for example, it is used for loop-closure detection in SLAM systems, and it can also be used for visual-content-based image search, 3D modeling, vehicle navigation, and the like. Visual position recognition faces many challenges: environmental changes caused by climate and illumination, occlusion by dynamic objects, the different viewpoints from which cameras acquire content, the real-time requirements of the system, and so on all affect the accuracy of visual position recognition.
At present, visual position recognition can be performed by visual localization methods based on deep learning. However, current deep-learning-based visual localization methods have some problems. For example, the robustness of the models in these methods comes at the cost of a large amount of memory, which is occupied by the excessive parameters of the deep networks, leading to long feature extraction times; in addition, these methods require considerable effort to generate labeled images for training the models. These problems can be addressed by introducing an autoencoder. An autoencoder is an unsupervised deep network model consisting of two parts, an encoder and a decoder: the encoder compresses the input of the model into a deep representation, and the decoder then restores it to a representation of the input. However, current autoencoders usually use traditional handcrafted features as their constraints, so the autoencoder cannot extract the effective information of a scene well, and the accuracy of visual position recognition is not high.
Summary of the Invention
Embodiments of the present application provide a visual position recognition method, a visual position recognition device, a computer device, and a non-volatile computer-readable storage medium to solve the problem of low accuracy in visual position recognition.
The visual position recognition method of the embodiments of the present application includes: constructing an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model; inputting a training image into a pre-trained VGG-16 model to output first information of the training image; inputting the training image into the autoencoder model to output second information of the training image; calculating the difference between the first information and the second information; determining that training of the autoencoder model is complete when the difference is less than a preset value; modifying the parameters of the autoencoder model when the difference is greater than the preset value, and returning to the step of inputting the training image into the autoencoder model to output the second information of the training image; and performing visual position recognition using the encoder model in the trained autoencoder model.
In some embodiments, the encoder model includes multiple convolutional layers and multiple pooling layers, and the decoder model includes multiple fully connected layers.
In some embodiments, a penalty term is added to at least one of the convolutional layers; and/or a dropout layer is provided between at least two adjacent fully connected layers.
In some embodiments, calculating the difference between the first information and the second information includes: using an L2 loss function to calculate the difference between the first information and the second information.
In some embodiments, before the step of performing visual position recognition using the encoder model in the trained autoencoder model, the visual position recognition method further includes: inputting a test image into the trained autoencoder model to obtain third information of the test image; inputting a retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image; calculating the similarity between the test image and the retrieval image according to the third information and the fourth information; determining a visual position recognition result according to the similarity; calculating the index difference between the index of the test image and the index of the retrieval image; and determining the accuracy of the visual recognition result according to the index difference.
In some embodiments, calculating the similarity between the test image and the retrieval image according to the third information and the fourth information includes: calculating the cosine similarity between the test image and the retrieval image according to the third information and the fourth information; and determining a visual position recognition result according to the similarity includes: confirming that the test image and the retrieval image correspond to the same scene when the cosine similarity is greater than a preset similarity, and confirming that the test image and the retrieval image correspond to different scenes when the cosine similarity is less than the preset similarity.
In some embodiments, determining the accuracy of the visual recognition result according to the index difference includes: determining that the accuracy of the visual recognition result is greater than a predetermined threshold when the index difference is less than a preset index difference, and determining that the accuracy of the visual recognition result is less than the predetermined threshold when the index difference is greater than the preset index difference.
The visual position recognition device of the embodiments of the present application includes a construction module, a first input module, a second input module, a first calculation module, a first determination module, and a recognition module. The construction module is used to construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model. The first input module is used to input a training image into a pre-trained VGG-16 model to output first information of the training image. The second input module is used to input the training image into the autoencoder model to output second information of the training image. The first calculation module is used to calculate the difference between the first information and the second information. The first determination module is used to: determine that training of the autoencoder model is complete when the difference is less than a preset value; and modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image. The recognition module is used to perform visual position recognition using the encoder model in the trained autoencoder model.
The computer device of the embodiments of the present application includes a processor, a memory, and one or more programs, where the one or more programs are stored in the memory and are executed by the processor to implement the visual position recognition method described in any of the above embodiments.
The non-volatile computer-readable storage medium of the embodiments of the present application contains a computer program. When the computer program is executed by a processor, the visual position recognition method described in any of the above embodiments is implemented.
The visual position recognition method, visual position recognition device, computer device, and non-volatile computer-readable storage medium of the embodiments of the present application use the features of the deep network VGG-16 as the constraint to train the encoder in the autoencoder, thereby replacing traditional handcrafted features with deep network features; the autoencoder further compresses the features to obtain more accurate and powerful features. The robustness to the influence of illumination, viewing angle, and the like is improved, and a higher accuracy of visual position recognition is achieved.
Additional aspects and advantages of the embodiments of the present application will be given in part in the following description, and in part will become apparent from the following description or will be learned through practice of the present application.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a visual recognition method according to some embodiments of the present application;
FIG. 2 is a schematic block diagram of a visual recognition device according to some embodiments of the present application;
FIG. 3 is a schematic diagram of the principle of a visual recognition method according to some embodiments of the present application;
FIG. 4 is a schematic flowchart of a visual recognition method according to some embodiments of the present application;
FIG. 5 is a schematic block diagram of a visual recognition device according to some embodiments of the present application;
FIG. 6 is a schematic diagram of the principle of a visual recognition method according to some embodiments of the present application;
FIG. 7 is a schematic diagram of a computer device according to some embodiments of the present application;
FIG. 8 is a schematic diagram of the interaction between a non-volatile computer-readable storage medium and a processor according to some embodiments of the present application.
Detailed Description of the Embodiments
The embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary and are intended only to explain the embodiments of the present application, and should not be construed as limiting the embodiments of the present application.
Referring to FIG. 1, the present application discloses a visual position recognition method, which includes:
011: Construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model;
012: Input a training image into a pre-trained VGG-16 model to output first information of the training image;
013: Input the training image into the autoencoder model to output second information of the training image;
014: Calculate the difference between the first information and the second information;
015: Determine that training of the autoencoder model is complete when the difference is less than a preset value;
016: Modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
017: Perform visual position recognition using the encoder model in the trained autoencoder model.
Referring to FIG. 2, the present application also discloses a visual position recognition device 10. The visual position recognition method of the embodiments of the present application can be implemented by the visual position recognition device 10 of the embodiments of the present application. The visual position recognition device 10 includes a construction module 111, a first input module 112, a second input module 113, a first calculation module 114, a first determination module 115, and a recognition module 116. Step 011 may be implemented by the construction module 111. Step 012 may be implemented by the first input module 112. Step 013 may be implemented by the second input module 113. Step 014 may be implemented by the first calculation module 114. Steps 015 and 016 may be implemented by the first determination module 115. Step 017 may be implemented by the recognition module 116.
In other words, the construction module 111 can be used to construct an autoencoder model including a sequentially connected encoder model and decoder model. The first input module 112 can be used to input a training image into the pre-trained VGG-16 model to output first information of the training image. The second input module 113 can be used to input the training image into the autoencoder model to output second information of the training image. The first calculation module 114 can be used to calculate the difference between the first information and the second information. The first determination module 115 can be used to determine that training of the autoencoder model is complete when the difference is less than a preset value, to modify the parameters of the autoencoder model when the difference is greater than the preset value, and to return to the step of inputting the training image into the autoencoder model to output the second information of the training image. The recognition module 116 can be used to perform visual position recognition using the encoder model in the trained autoencoder model.
Referring to FIG. 3, the VGG-16 model is a pre-trained model. In one example, the VGG-16 model can be weight-trained on ImageNet so that the pre-trained VGG-16 model is capable of feature extraction. The VGG-16 model can extract highly abstract features from images; these features summarize some of the spatial characteristics and shape features in an image, which can greatly reduce the effects caused by changes in illumination and viewing angle. For example, if two images of the same scene are taken, where image A is taken under strong light and image B is taken under weak light, then the features of image A and image B extracted by the VGG-16 model differ only slightly. The autoencoder model includes an encoder model and a decoder model, and the encoder model and the decoder model are connected in sequence. The encoder model includes multiple convolutional layers and multiple pooling layers, where the number of convolutional layers can be 3 or 4 and the number of pooling layers can also be 3 or 4. In the embodiment shown in FIG. 3, the number of convolutional layers is 4, the number of pooling layers is also 4, and each pooling layer follows a convolutional layer. The number of kernels in a convolutional layer can be in the range [4, 256], which is not limited here. Setting the number of convolutional layers to 4 avoids both the weak feature extraction ability caused by too few convolutional layers and the slow feature extraction caused by too many convolutional layers. The decoder model includes multiple fully connected layers, where the number of fully connected layers can be 2 or 3. In the embodiment shown in FIG. 3, the number of fully connected layers is 2; the two fully connected layers play a role similar to an upsampling layer and a deconvolution layer. Since a fully connected layer can directly set its output dimension, using 2 fully connected layers can output the desired dimension without computing specific convolutional layers.
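The architecture described above could be sketched in Keras roughly as follows. This is an illustrative sketch only, not the implementation of the present application: the kernel counts (16, 32, 64, 128), the width of the first fully connected layer (1024), and the output dimension of 4096 (chosen here to match the fc2 layer of VGG-16) are assumptions rather than values fixed by the text.

```python
# Minimal sketch of the autoencoder: 4 convolutional layers, each followed by a pooling
# layer (encoder), and 2 fully connected layers (decoder). Layer widths are illustrative.
from tensorflow.keras import layers, models

def build_autoencoder(input_shape=(224, 224, 3), feature_dim=4096):
    inp = layers.Input(shape=input_shape)
    x = inp
    # Encoder: 4 convolutional layers with kernel counts in [4, 256], each followed by pooling.
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten(name="encoder_output")(x)   # compressed representation of the encoder
    # Decoder: 2 fully connected layers; the last one directly sets the desired output dimension.
    x = layers.Dense(1024, activation="relu")(x)
    out = layers.Dense(feature_dim)(x)
    return models.Model(inp, out, name="autoencoder")

autoencoder = build_autoencoder()
```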
After the autoencoder model has been constructed, it needs to be trained. Because the autoencoder model is unsupervised, training does not require a large number of labeled images; therefore, the training of the autoencoder model focuses on the selection of image types. The training set should include images of the same scene in various states so that the autoencoder model can be robust to environmental changes. In one example, the Places365-Standard dataset can be chosen as the training set. The Places365-Standard dataset contains 1.8 million images from 365 different scenes, with each scene providing 5,000 images of the same kind of scene in different states; using this dataset as the training set enables the autoencoder model to extract the salient features of the same scene and thus achieve strong generalization ability. Specifically, the images in the training set can first be scaled, for example to a size of 224x224x3; the scaled training images Im are then input into the pre-trained VGG-16 model to output first information V1 of the training images Im (which can also be understood as labels), and the training images Im are input into the autoencoder model to output second information V2 of the training images Im, where the first information V1 and the second information V2 are represented by vectors. The difference between the first information V1 and the second information V2 can then be calculated; it should be noted that the first information V1 and the second information V2 used to calculate the difference here correspond to the same training image Im. In one example, the L2 loss function can be used to calculate the difference between the first information V1 and the second information V2, that is, ||V1, V2||. The L2 loss function is often used in regression problems; through the L2 loss function, the output of the autoencoder can be made to fit the VGG-16 features as closely as possible. Specifically, if the difference between the two is less than or equal to the preset value, it is determined that training of the autoencoder model is complete; if the difference between the two is greater than the preset value, the parameters of the autoencoder model are modified (for example, the weights of the autoencoder model are modified) and the process returns to step 013. It should be noted that, at this point, the training image Im is input into the autoencoder model with the modified parameters to output the second information V2 of the training image Im. This cycle repeats until the difference between the first information V1 and the second information V2 is less than or equal to the preset value. It can be understood that the criterion for judging whether a model has been trained successfully is the difference between the output of the model and the label the model targets: the smaller the difference between the output and the label, the better the model; the larger the difference, the worse the model. In the embodiments of the present application, the autoencoder model is the model to be trained, and the output of the VGG-16 model is the label that the autoencoder model targets. Therefore, the performance of the autoencoder model can be judged by the difference between the second information V2 output by the autoencoder model and the first information V1 output by the VGG-16 model; when this difference is less than or equal to the preset value, it indicates that the autoencoder model performs well and its training is complete.
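Purely as an illustration of this training target, the sketch below shows one way the first information V1 could be taken from a pre-trained VGG-16 and compared with the autoencoder output V2. The choice of the fc2 layer of VGG-16 as the feature layer and the use of the sum of squared differences as the L2 loss are assumptions, not details specified by this application.

```python
# Hedged sketch: obtain V1 (the label) from a pre-trained VGG-16 and compute the L2
# difference with V2 (the autoencoder output). The fc2 layer is an assumed choice.
import numpy as np
from tensorflow.keras import models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

vgg = VGG16(weights="imagenet", include_top=True)
vgg_features = models.Model(vgg.input, vgg.get_layer("fc2").output)  # 4096-d feature vector

def l2_difference(images, autoencoder):
    """images: float array of shape (N, 224, 224, 3), scaled as described above."""
    v1 = vgg_features.predict(preprocess_input(images.copy()), verbose=0)  # first information V1
    v2 = autoencoder.predict(images, verbose=0)                            # second information V2
    return float(np.mean(np.sum((v1 - v2) ** 2, axis=1)))                  # L2 difference
```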
Further, during training, Keras can be used to train the autoencoder model, and an Adam optimizer with a learning rate of 0.001 can be used to adjust the parameters of the autoencoder model so as to reduce the value of the loss function. An early-stopping technique can also be used: early stopping is a function for stopping training ahead of time, so that training stops when the loss on the test set is less than 1.5. This avoids the problem that, if training lasts too long, the autoencoder model treats some useless information in the images as useful information, lowering the recognition rate. The whole training process takes about 8 epochs, where an epoch denotes one pass of training: there are 365 scenes during training, each scene has a training subset, and training over all training subsets once is called one epoch.
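One possible Keras rendering of this training configuration is sketched below. The custom callback that stops training once the held-out loss drops below 1.5 is an assumption about how the early-stopping rule could be realized, mean squared error stands in for the L2 loss, and the array names train_images, v1_train, test_images, and v1_test are placeholders.

```python
# Hedged sketch of the training setup: Adam with learning rate 0.001, about 8 epochs,
# and an early stop once the held-out loss falls below 1.5.
import tensorflow as tf

class StopBelowLoss(tf.keras.callbacks.Callback):
    """Stop training as soon as the validation loss falls below a threshold."""
    def __init__(self, threshold=1.5):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("val_loss", float("inf")) < self.threshold:
            self.model.stop_training = True

autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                    loss="mse")                      # stands in for the L2 loss between V1 and V2
autoencoder.fit(train_images, v1_train,              # v1_train: VGG-16 features of the training images
                validation_data=(test_images, v1_test),
                epochs=8,
                callbacks=[StopBelowLoss(1.5)])
```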
After the autoencoder model has been trained, the encoder model in the trained autoencoder model (that is, the trained encoder) can be used for visual position recognition. Specifically, taking indoor 3D modeling as an example, a modeling device used for indoor 3D modeling can move around indoors and acquire images in real time. The modeling device uses the trained encoder model to extract features from the images and performs image matching based on the extracted features, so that it can determine which of the acquired images indicate the same scene, and it performs indoor 3D modeling based on the acquired images and the image matching results. Since the features extracted by the encoder model of the embodiments of the present application are less affected by illumination and viewing angle, the image matching results are more accurate and, further, the 3D modeling results are also more accurate. In addition, since the number of convolutional layers in the encoder model is small, the feature extraction time is reduced, which helps to increase the speed of image matching and further reduces the time required for 3D modeling.
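For illustration only, the trained encoder could be split off from the autoencoder of the earlier sketch and used on its own as a feature extractor; the layer name "encoder_output" comes from that sketch and is not part of this application.

```python
# Hedged sketch: take the encoder part of the trained autoencoder as a feature extractor.
from tensorflow.keras import models

encoder = models.Model(autoencoder.input,
                       autoencoder.get_layer("encoder_output").output)
features = encoder.predict(frames, verbose=0)   # frames: (N, 224, 224, 3), scaled as during training
```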
The visual position recognition method and visual position recognition device 10 of the embodiments of the present application use the features of the deep network VGG-16 as the constraint to train the encoder in the autoencoder, thereby replacing traditional handcrafted features with deep network features; the autoencoder further compresses the features to obtain more accurate and powerful features. The robustness to the influence of illumination, viewing angle, and the like is improved, and a higher accuracy of visual position recognition is achieved. Moreover, since the number of convolutional layers in the encoder model is small, the feature extraction time is reduced, which helps to increase the speed of image matching during visual position recognition.
In some embodiments, a penalty term is added to at least one of the convolutional layers; and/or a dropout layer is provided between at least two adjacent fully connected layers.
Adding a penalty term to at least one convolutional layer may mean that the penalty term is added to one, two, or three of the convolutional layers, or to all of the convolutional layers, which is not limited here. The penalty term may be L1 regularization, L2 regularization, or the like, which is not limited here. Adding a penalty term can avoid overfitting of the autoencoder model. As an example, the penalty term may be L2 regularization. It can be understood that L2 regularization adds the sum of the squares of the weight parameters to the original loss function. L2 regularization can penalize the weights of unimportant features, thereby avoiding overfitting of the autoencoder model.
In the embodiment shown in FIG. 3, the number of fully connected layers is two, and a dropout layer may be added between the two fully connected layers. The number of dropout layers can be one or more, which is not limited here. The dropout rate of a dropout layer can take a value in [0.5, 0.8], and the dropout rates of different dropout layers can be the same or different, which is also not limited here. In one embodiment of the present application, one dropout layer is provided between the two fully connected layers, and its dropout rate is 0.5. It can be understood that the dropout layer ensures that any two given neurons do not necessarily appear in the same sub-network structure every time, which prevents some features from being effective only in the presence of other features and forces the autoencoder model to learn more generalizable features, improving the feature extraction effect of the autoencoder.
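As a sketch under the same assumptions as before, the earlier architecture could be varied as follows to add an L2 penalty term to each convolutional layer and a dropout layer with a dropout rate of 0.5 between the two fully connected layers; the regularization strength of 1e-4 is an assumption.

```python
# Hedged sketch: the same autoencoder with an L2 penalty term on the convolutional layers
# and a dropout layer (rate 0.5) between the two fully connected layers.
from tensorflow.keras import layers, models, regularizers

def build_autoencoder_regularized(input_shape=(224, 224, 3), feature_dim=4096):
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (16, 32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          kernel_regularizer=regularizers.l2(1e-4))(x)   # penalty term
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten(name="encoder_output")(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)                  # dropout layer between the two fully connected layers
    out = layers.Dense(feature_dim)(x)
    return models.Model(inp, out, name="autoencoder_regularized")
```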
Referring to FIG. 4, in some embodiments, before the step of performing visual position recognition using the encoder model in the trained autoencoder model, the visual position recognition method further includes:
018: Input a test image into the trained autoencoder model to obtain third information of the test image;
019: Input a retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image;
020: Calculate the similarity between the test image and the retrieval image according to the third information and the fourth information;
021: Determine a visual position recognition result according to the similarity;
022: Calculate the index difference between the index of the test image and the index of the retrieval image;
023: Determine the accuracy of the visual recognition result according to the index difference.
Further, step 020, calculating the similarity between the test image and the retrieval image according to the third information and the fourth information, includes:
calculating the cosine similarity between the test image and the retrieval image according to the third information and the fourth information;
Step 021, determining a visual position recognition result according to the similarity, includes:
confirming that the test image and the retrieval image correspond to the same scene when the cosine similarity is greater than a preset similarity;
confirming that the test image and the retrieval image correspond to different scenes when the cosine similarity is less than the preset similarity.
Step 023, determining the accuracy of the visual recognition result according to the index difference, includes:
determining that the accuracy of the visual recognition result is greater than a predetermined threshold when the index difference is less than a preset index difference;
determining that the accuracy of the visual recognition result is less than the predetermined threshold when the index difference is greater than the preset index difference.
Referring to FIG. 5, in some embodiments, the visual position recognition device 10 further includes a third input module 117, a fourth input module 118, a second calculation module 119, a second determination module 120, a third calculation module 121, and a third determination module 122. Step 018 may be implemented by the third input module 117. Step 019 may be implemented by the fourth input module 118. Step 020 may be implemented by the second calculation module 119. Step 021 may be implemented by the second determination module 120. Step 022 may be implemented by the third calculation module 121. Step 023 may be implemented by the third determination module 122.
In other words, the third input module 117 can be used to input a test image into the trained autoencoder model to obtain third information of the test image. The fourth input module 118 can be used to input a retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image. The second calculation module 119 can be used to calculate the similarity between the test image and the retrieval image according to the third information and the fourth information. The second determination module 120 can be used to determine a visual position recognition result according to the similarity. The third calculation module 121 can be used to calculate the index difference between the index of the test image and the index of the retrieval image. The third determination module 122 can be used to determine the accuracy of the visual recognition result according to the index difference.
Further, the second calculation module 119 can also be used to calculate the cosine similarity between the test image and the retrieval image according to the third information and the fourth information. The second determination module 120 can also be used to confirm that the test image and the retrieval image correspond to the same scene when the cosine similarity is greater than the preset similarity, and to confirm that they correspond to different scenes when the cosine similarity is less than the preset similarity. The third determination module 122 can also be used to determine that the accuracy of the visual recognition result is greater than a predetermined threshold when the index difference is less than the preset index difference, and to determine that the accuracy of the visual recognition result is less than the predetermined threshold when the index difference is greater than the preset index difference.
Referring to FIG. 6, after the autoencoder model has been trained, the encoder model in the trained autoencoder model can be tested. The test images can also be images from the Places365-Standard dataset. For example, the images in the Places365-Standard dataset can be divided into a training set and a test set; since each scene in the Places365-Standard dataset has 5,000 images, 4,200 of them can be used as the training set and 800 as the test set, and the images in the test set do not participate in the training of the autoencoder model. The images in the test set can further be divided into a test image set and a retrieval image set; the test images in the test image set (the query images shown in FIG. 6) and the retrieval images in the retrieval image set (the reference images shown in FIG. 6) are both time-series images, and each test image and each retrieval image has an index (which can also be understood as a number). For example, the 800 test-set images of each scene are divided into a test image set and a retrieval image set, with 400 images in each test image set and 400 images in each retrieval image set. For a test image set and a retrieval image set corresponding to the same scene S, the 400 test images may be consecutive images of scene S taken in the morning, indexed 1-400, and the 400 retrieval images may be consecutive images of scene S taken in the evening, indexed 1-400. Before testing, all images in the test set can be scaled, for example to a size of 224 x 224 x 3. Then, during testing, multiple (for example, N x 224 x 224 x 3, where N is a positive integer) test images can be selected and input into the trained encoder model to obtain third information (Qi), and multiple (for example, N x 224 x 224 x 3, where N is a positive integer) retrieval images can be selected and input into the trained encoder model to obtain fourth information (Ri). The similarity between the test images and the retrieval images is then calculated according to the third information and the fourth information; for example, the cosine similarity between the test images and the retrieval images can be calculated from the third information and the fourth information according to formula (1):
cos(Qi, Ri) = (Qi · Ri) / (||Qi|| ||Ri||)    (1)
Generally, when the cosine similarity is close to 1, the two images are more likely to represent the same scene; when the cosine similarity is close to -1, the two images are more likely to represent different scenes. In one embodiment of the present application, a preset similarity is set, which may be, for example, 0.8 or another value. If the cosine similarity between the third information and the fourth information is greater than or equal to the preset similarity, it is confirmed that the test image and the retrieval image correspond to the same scene; if their cosine similarity is less than the preset similarity, it is confirmed that the test image and the retrieval image correspond to different scenes. Since multiple test images and multiple retrieval images are input into the trained encoder model, a cosine similarity matrix can be calculated from the multiple pieces of third information and fourth information; the maximum value of each row of the cosine similarity matrix is the best match. The rightmost picture in FIG. 3 is the heat map corresponding to the cosine similarity matrix between the test images and the retrieval images. The heat map is used to display the difference between two values: when the pixels on the main diagonal of the heat map appear in a first predetermined color and the pixels at the remaining positions appear in a second predetermined color, the matching degree of the images is relatively high.
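A small NumPy/matplotlib sketch of this matching step is given below for illustration; the arrays Q and R stand for the encoder outputs of the query and reference images and are placeholders rather than names used by this application.

```python
# Hedged sketch: cosine similarity matrix between query features Q (N x d) and reference
# features R (M x d), row-wise best match, and the corresponding heat map.
import numpy as np
import matplotlib.pyplot as plt

def cosine_similarity_matrix(Q, R, eps=1e-12):
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    Rn = R / (np.linalg.norm(R, axis=1, keepdims=True) + eps)
    return Qn @ Rn.T                    # formula (1) applied to every query/reference pair

S = cosine_similarity_matrix(Q, R)      # Q, R: encoder outputs of the query / reference images
best_match = S.argmax(axis=1)           # best-matching reference index for each query image
same_scene = S.max(axis=1) >= 0.8       # preset similarity of 0.8, as in the text

plt.imshow(S, cmap="hot")               # heat map; a bright main diagonal indicates good matching
plt.colorbar()
plt.show()
```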
Since many images in the test set are consecutive sequence images, a tolerance (i.e., a preset index difference) can be added to define whether the encoder model has recognized the correct scene. The tolerance can be defined as formula (2):
|frame_query - frame_search| < 4.    (2)
That is to say, when the index of a test image is 3, if the index of the retrieval image it matches is 5, then 5 - 3 = 2 and 2 < 4, which means the accuracy of the visual recognition is above the preset threshold; if the index of the retrieval image it matches is 10, then 10 - 3 = 7 and 7 > 4, which means the accuracy of the visual recognition is below the preset threshold.
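Continuing the previous sketch, the tolerance check of formula (2) could be evaluated over a whole query sequence as follows; it assumes that the query and reference sequences share the same frame indexing.

```python
# Hedged sketch: tolerance-based accuracy from formula (2); a match counts as correct when
# the query index and the matched reference index differ by fewer than 4 frames.
import numpy as np

query_idx = np.arange(len(best_match))         # query frames form an ordered sequence
correct = np.abs(query_idx - best_match) < 4   # formula (2)
accuracy = correct.mean()
print(f"recognition accuracy with tolerance 4: {accuracy:.3f}")
```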
In summary, in the visual position recognition method and visual position recognition device 10 of the embodiments of the present application, the features extracted by the deep network have richer geometric and semantic information than traditional handcrafted features, and the autoencoder architecture enables the model to learn deeper and more compact representations of the deep network features. Therefore, the present application can extract more robust features from scene images and achieve a feature extraction capability similar to that of VGG-16, while reducing the feature extraction time by a factor of nearly 4 compared with the deep network VGG-16. It not only effectively improves the accuracy of scene recognition, but also reduces the running time of position recognition, meeting the real-time requirements of position recognition.
Referring to FIG. 7, an embodiment of the present application also discloses a computer device 20. The computer device 20 includes a processor 21, a memory 22, and one or more programs. The one or more programs are stored in the memory 22 and are executed by the processor 21 to implement the visual position recognition method described in any of the above embodiments.
For example, referring to FIG. 1 and FIG. 7, the one or more programs are executed by the processor 21 to implement the following steps:
011: Construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model;
012: Input a training image into a pre-trained VGG-16 model to output first information of the training image;
013: Input the training image into the autoencoder model to output second information of the training image;
014: Calculate the difference between the first information and the second information;
015: Determine that training of the autoencoder model is complete when the difference is less than a preset value;
016: Modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
017: Perform visual position recognition using the encoder model in the trained autoencoder model.
Referring to FIG. 8, an embodiment of the present application also discloses a non-volatile computer-readable storage medium 30. The non-volatile computer-readable storage medium 30 contains a computer program. When the computer program is executed by the processor 21, the visual position recognition method described in any of the above embodiments is implemented.
For example, referring to FIG. 1 and FIG. 8, when the computer program is executed by the processor 21, the following steps are implemented:
011: Construct an autoencoder model, the autoencoder model including a sequentially connected encoder model and decoder model;
012: Input a training image into a pre-trained VGG-16 model to output first information of the training image;
013: Input the training image into the autoencoder model to output second information of the training image;
014: Calculate the difference between the first information and the second information;
015: Determine that training of the autoencoder model is complete when the difference is less than a preset value;
016: Modify the parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
017: Perform visual position recognition using the encoder model in the trained autoencoder model.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an illustrative embodiment", "an example", "a specific example", or "some examples" and the like means that the specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, illustrative references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine and integrate the different embodiments or examples described in this specification, as well as the features of the different embodiments or examples, provided they do not contradict each other.
Any process or method description in the flowcharts or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specified logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
Although the embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present application, and those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.

Claims (10)

  1. A visual position recognition method, characterized by comprising:
    constructing an autoencoder model, the autoencoder model comprising a sequentially connected encoder model and decoder model;
    inputting a training image into a pre-trained VGG-16 model to output first information of the training image;
    inputting the training image into the autoencoder model to output second information of the training image;
    calculating a difference between the first information and the second information;
    determining that training of the autoencoder model is complete when the difference is less than a preset value;
    modifying parameters of the autoencoder model when the difference is greater than the preset value, and returning to the step of inputting the training image into the autoencoder model to output the second information of the training image;
    performing visual position recognition using the encoder model in the trained autoencoder model.
  2. The visual position recognition method according to claim 1, characterized in that the encoder model comprises a plurality of convolutional layers and a plurality of pooling layers, and the decoder model comprises a plurality of fully connected layers.
  3. The visual position recognition method according to claim 2, characterized in that a penalty term is added to at least one of the convolutional layers; and/or
    a dropout layer is provided between at least two adjacent fully connected layers.
  4. The visual position recognition method according to claim 1, characterized in that calculating the difference between the first information and the second information comprises:
    using an L2 loss function to calculate the difference between the first information and the second information.
  5. The visual position recognition method according to claim 1, characterized in that, before the step of performing visual position recognition using the encoder model in the trained autoencoder model, the visual position recognition method further comprises:
    inputting a test image into the trained autoencoder model to obtain third information of the test image;
    inputting a retrieval image into the trained autoencoder model to obtain fourth information of the retrieval image;
    calculating a similarity between the test image and the retrieval image according to the third information and the fourth information;
    determining a visual position recognition result according to the similarity;
    calculating an index difference between an index of the test image and an index of the retrieval image;
    determining an accuracy of the visual recognition result according to the index difference.
  6. The visual position recognition method according to claim 5, characterized in that calculating the similarity between the test image and the retrieval image according to the third information and the fourth information comprises:
    calculating a cosine similarity between the test image and the retrieval image according to the third information and the fourth information;
    and determining the visual position recognition result according to the similarity comprises:
    confirming that the test image and the retrieval image correspond to the same scene when the cosine similarity is greater than a preset similarity;
    confirming that the test image and the retrieval image correspond to different scenes when the cosine similarity is less than the preset similarity.
  7. The visual position recognition method according to claim 5, characterized in that determining the accuracy of the visual recognition result according to the index difference comprises:
    determining that the accuracy of the visual recognition result is higher than a predetermined threshold when the index difference is less than a preset index difference;
    determining that the accuracy of the visual recognition result is lower than the predetermined threshold when the index difference is greater than the preset index difference.
  8. A visual position recognition device, characterized by comprising:
    a construction module configured to construct an autoencoder model, the autoencoder model comprising a sequentially connected encoder model and decoder model;
    a first input module configured to input a training image into a pre-trained VGG-16 model to output first information of the training image;
    a second input module configured to input the training image into the autoencoder model to output second information of the training image;
    a first calculation module configured to calculate a difference between the first information and the second information;
    a first determination module configured to:
    determine that training of the autoencoder model is complete when the difference is less than a preset value;
    modify parameters of the autoencoder model when the difference is greater than the preset value, and return to the step of inputting the training image into the autoencoder model to output the second information of the training image;
    a recognition module configured to perform visual position recognition using the encoder model in the trained autoencoder model.
  9. A computer device, characterized by comprising:
    a processor;
    a memory; and
    one or more programs, the one or more programs being stored in the memory and executed by the processor to implement the visual position recognition method according to any one of claims 1-7.
  10. A non-volatile computer-readable storage medium containing a computer program, characterized in that, when the computer program is executed by a processor, the visual position recognition method according to any one of claims 1-7 is implemented.
PCT/CN2020/139639 2020-12-10 2020-12-25 Visual position recognition method and apparatus, computer device and readable storage medium WO2022120996A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011436657.4A CN112463999A (zh) 2020-12-10 2020-12-10 Visual position recognition method and apparatus, computer device and readable storage medium
CN202011436657.4 2020-12-10

Publications (1)

Publication Number Publication Date
WO2022120996A1 true WO2022120996A1 (zh) 2022-06-16

Family

ID=74801175

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139639 WO2022120996A1 (zh) 2020-12-10 2020-12-25 Visual position recognition method and apparatus, computer device and readable storage medium

Country Status (2)

Country Link
CN (1) CN112463999A (zh)
WO (1) WO2022120996A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113899675B (zh) * 2021-10-13 2022-05-27 淮阴工学院 Machine-vision-based automatic detection method and device for concrete impermeability
CN115418475B (zh) * 2022-09-09 2024-03-01 苏州新凌电炉有限公司 Intelligent monitoring and management method and device for a mesh belt furnace

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805185A (zh) * 2018-05-29 2018-11-13 腾讯科技(深圳)有限公司 Model training method and apparatus, storage medium, and computer device
CN110281983A (zh) * 2019-06-28 2019-09-27 清华大学 Precise stopping system for rail trains based on visual scene recognition
CN111160409A (zh) * 2019-12-11 2020-05-15 浙江大学 Heterogeneous neural network knowledge recombination method based on common feature learning
JP2020160743A (ja) * 2019-03-26 2020-10-01 日本電信電話株式会社 Evaluation device, evaluation method, and evaluation program
CN111967573A (zh) * 2020-07-15 2020-11-20 中国科学院深圳先进技术研究院 Data processing method, apparatus, device, and computer-readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059597B (zh) * 2019-04-04 2022-09-06 南京理工大学 Scene recognition method based on a depth camera
CN110222588B (zh) * 2019-05-15 2020-03-27 合肥进毅智能技术有限公司 Face sketch image aging synthesis method, device, and storage medium
CN110390336B (zh) * 2019-06-05 2023-05-23 广东工业大学 Method for improving feature point matching accuracy
CN110363290B (zh) * 2019-07-19 2023-07-25 广东工业大学 Image recognition method, apparatus, and device based on a hybrid neural network model
CN110428009B (zh) * 2019-08-02 2020-06-16 南京航空航天大学 Fully convolutional neural network and corresponding mesostructure recognition method
CN110941734B (zh) * 2019-11-07 2022-09-27 南京理工大学 Deep unsupervised image retrieval method based on a sparse graph structure
CN112016531A (zh) * 2020-10-22 2020-12-01 成都睿沿科技有限公司 Model training method, object recognition method, apparatus, device, and storage medium


Also Published As

Publication number Publication date
CN112463999A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
CN114782691B (zh) 基于深度学习的机器人目标识别与运动检测方法、存储介质及设备
CN109960742B (zh) 局部信息的搜索方法及装置
CN112215119B (zh) 一种基于超分辨率重建的小目标识别方法、装置及介质
CN108805016B (zh) 一种头肩区域检测方法及装置
CN112364931B (zh) 一种基于元特征和权重调整的少样本目标检测方法及网络系统
CN110879982B (zh) 一种人群计数系统及方法
EP4404148A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN114663502A (zh) 物体姿态估计、图像处理方法及相关设备
WO2022120996A1 (zh) 视觉位置识别方法及装置、计算机设备及可读存储介质
CN113313763A (zh) 一种基于神经网络的单目相机位姿优化方法及装置
CN105719248A (zh) 一种实时的人脸变形方法及其系统
CN110827312A (zh) 一种基于协同视觉注意力神经网络的学习方法
CN112836625A (zh) 人脸活体检测方法、装置、电子设备
CN112329662B (zh) 基于无监督学习的多视角显著性估计方法
CN116266387A (zh) 基于重参数化残差结构和坐标注意力机制的yolov4的图像识别算法及系统
CN115063447A (zh) 一种基于视频序列的目标动物运动追踪方法及相关设备
CN111597913A (zh) 一种基于语义分割模型的车道线图片检测分割方法
CN112396042A (zh) 实时更新的目标检测方法及系统、计算机可读存储介质
CN114419102B (zh) 一种基于帧差时序运动信息的多目标跟踪检测方法
CN113112547A (zh) 机器人及其重定位方法、定位装置及存储介质
CN116977674A (zh) 图像匹配方法、相关设备、存储介质及程序产品
CN114612545A (zh) 图像分析方法及相关模型的训练方法、装置、设备和介质
CN117095300B (zh) 建筑图像处理方法、装置、计算机设备和存储介质
CN117011819A (zh) 基于特征引导注意力的车道线检测方法、装置及设备
CN114820755B (zh) 一种深度图估计方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964925

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964925

Country of ref document: EP

Kind code of ref document: A1