WO2022117096A1 - First-person perspective image recognition method, apparatus, and computer-readable storage medium - Google Patents

First-person perspective image recognition method, apparatus, and computer-readable storage medium

Info

Publication number
WO2022117096A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
person perspective
images
image recognition
person
Prior art date
Application number
PCT/CN2021/135527
Other languages
English (en)
French (fr)
Inventor
高瑞东
陈勃霖
蔡锦霖
Original Assignee
影石创新科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 影石创新科技股份有限公司
Publication of WO2022117096A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes

Definitions

  • The present application relates to the technical field of video recognition, and in particular to a first-person perspective image recognition method, device, and computer-readable storage medium.
  • The first-person perspective, abbreviated POV (Point of View), originally refers to a point-of-view character writing technique; simply put, a camera is mounted on a particular person or animal to record everything seen from that person's or animal's viewpoint.
  • The first-person perspective is often used in games, where it is defined as watching the entire game demonstration from the perspective of the game operator: it is equivalent to standing behind the operator, so that what you see is what the operator sees.
  • Similarly, "first-person perspective video" refers to video data captured from the first-person (user) perspective by a photographing device worn by the user.
  • With the popularization of wearable photographing devices, users can either wear the device on their bodies to shoot "first-person perspective" videos during extreme sports such as mountain climbing, surfing, and bungee jumping, or use the device directly to shoot non-"first-person perspective" videos. The difference in perspective leads to obvious differences in the content of the two types of video; for example, "first-person perspective" videos are more immersive and let viewers feel more directly what the photographer felt while shooting, so the post-processing of the two types of video is very different.
  • The present invention aims to overcome the defects of existing first-person perspective video detection methods and provides a first-person perspective image recognition method, device, and computer-readable storage medium.
  • In a first aspect, the present invention discloses a first-person perspective image recognition method, the method comprising: S1: acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; S2: classifying and labeling the images according to whether they contain hand features and according to the hand feature information; S3: enhancing the classified and labeled images to obtain diverse image training samples; S4: inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model; S5: inputting the image to be recognized into the first-person perspective image recognition model; S6: determining whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model; wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
  • In a second aspect, the present invention discloses a first-person perspective image recognition device, the device comprising:
  • an acquisition module for acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; a classification and labeling module for classifying and labeling the images according to whether they contain hand features and according to the hand feature information;
  • the enhancement processing module is used to enhance the classified and annotated images to obtain diverse image training samples;
  • the training module is used to input the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model;
  • the input module is used to input the image to be recognized into the first-person perspective image recognition model;
  • a judgment module for determining whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model; wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
  • In a third aspect, the present invention discloses a computer-readable storage medium on which executable instructions are stored; when the executable instructions are executed by a processor, the above first-person perspective image recognition method is implemented.
  • The solution of the present invention can automatically identify whether an image is from a first-person perspective; when applied to video, it can automatically determine whether the current video frame is from a first-person perspective, eliminating the need for manual judgment. The user only needs to provide the input video frame, and the method automatically distinguishes whether that frame is from a first-person perspective, which facilitates video post-processing and offers fast processing speed and high accuracy.
  • FIG. 1 is a flowchart of a method for constructing a first-person perspective image recognition model in Embodiment 1 of the present invention.
  • FIG. 2 is an example of typical images classified into the first category in Embodiment 1 of the present invention.
  • FIG. 3 is an example of typical images classified into the second category in Embodiment 1 of the present invention.
  • FIG. 4 is an example of typical images classified into the third category in Embodiment 1 of the present invention.
  • FIG. 5 is an example of typical images classified into the fourth category in Embodiment 1 of the present invention.
  • FIG. 6 is a schematic diagram of comparison before and after the image training sample is scaled in Embodiment 1 of the present invention.
  • FIG. 7 is a structural block diagram of a first-person perspective recognition device in Embodiment 2 of the present invention.
  • FIG. 8 is a flowchart of a processing procedure of a video frame to be identified in Embodiment 1 of the present invention.
  • the method for constructing a first-person perspective image recognition model in this embodiment includes the following steps.
  • S1 Acquire multiple images with first-person perspective and multiple images without first-person perspective.
  • In this embodiment, the acquired images are photos or video frames captured by a wearable photographing device, and the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand. An image data set can be pre-built that contains photos taken with the wearable photographing device fixed on the photographer (i.e., first-person perspective images) and photos taken using the wearable photographing device as an ordinary camera (i.e., non-first-person perspective images).
  • S2 Classify and label images according to whether they contain hand features and hand feature information.
  • In this step, the images are manually classified and labeled. The basis for classification and labeling is as follows: when the wearable photographing device is used as an ordinary camera to take photos or videos, the photographer's hand features (such as arms and fingers) generally do not appear in the photos or videos (except for selfies), whereas when the wearable photographing device is fixed on the photographer (for example, on clothing, a hat, or a headband) to take photos or videos, hand features tend to appear in them.
  • The photos or video frames captured by the wearable photographing device are classified according to the following features: images containing finger or arm features protruding from the edge of the image are classified into the first category and marked as first-person perspective images (Figure 2 shows a typical example); images containing selfies are classified into the second category and marked as first-person perspective images (Figure 3 shows a typical example); images containing complete arm and finger features are classified into the third category and marked as non-first-person perspective (Figure 4 shows a typical example); and images containing no arm or finger features are classified into the fourth category and marked as non-first-person perspective (Figure 5 shows a typical example).
  • S3 Enhance the classified and annotated images to obtain diverse image training samples.
  • The main purpose of this step is to obtain more image training samples from fewer photos or video frames; that is, one image can yield multiple training samples after enhancement. Note that the category and label of each training sample are consistent with those of the original image. Specifically, the classified and labeled images can undergo image scaling, center cropping, random horizontal flipping, and random small changes in brightness, contrast, saturation, and hue.
  • Because the detector needs to detect fingers or arms protruding from the bottom of the image, and rotation would destroy this arrangement, the rotation operation used in conventional data augmentation is removed in the enhancement stage; in addition, to avoid cropping away the hand features at the image edges entirely, the random cropping used in conventional data augmentation is replaced with center cropping in this embodiment.
  • S4 Input the image training sample into a pre-built neural network for training to obtain a first-view image recognition model.
  • The neural network in this embodiment is the MobileNetV2 model, and its training process includes the following sub-steps.
  • transforms.Normalize is used to normalize the scaled image to improve the detection performance of the model.
  • dropout is used to train the neural network in this step to alleviate overfitting.
  • the preset condition in this embodiment is to train the neural network for a predetermined number of rounds.
  • Alternatively, the model can be validated or tested, and training is stopped once the relevant parameters of the model reach a set threshold, thereby completing the construction of the first-person perspective image recognition model.
  • A first-person perspective image recognition model may also be constructed based on other neural networks, such as a VGG network, a deep residual network (ResNet, Deep Residual Network), GoogleNet-series models (such as InceptionV1-V4), SqueezeNet models, or ShuffleNet-series models.
  • S5 Input the image to be recognized into the first-person perspective image recognition model.
  • the to-be-recognized image is scaled to a preset size and then input to the first-person perspective image recognition model.
  • transforms.Resize in the torchvision library is used to scale the photo or video frame to a 224x224 image.
  • the image to be recognized can also be directly input into the first-person perspective image recognition model.
  • S6 Determine whether the to-be-recognized image is a first-person perspective image according to the output of the first-person perspective image recognition model.
  • The processing of the video frame to be recognized in this embodiment is as follows: the image is processed by multiple convolution layers and max-pooling layers, then global spatial information is aggregated by a global average pooling layer to obtain a feature vector of length 1280; this vector passes through a fully connected layer for final classification, yielding an output vector of length 4 (the number of categories; for a different number of categories the length changes accordingly), which then passes through a softmax activation function to produce the probability distribution of the image over the categories, which is output.
  • When judging video frames, whether the video is a first-person perspective video can be determined from the probability-distribution results of multiple consecutive frames. For example, if the proportion of video frames judged to be first-person perspective is greater than a set threshold (e.g., 60%), the video is judged to be a first-person perspective video.
  • This embodiment discloses a first-person perspective image recognition device, comprising: an acquisition module for acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; a classification and labeling module for classifying and labeling the images according to whether they contain hand features and according to the hand feature information; an enhancement processing module for enhancing the classified and labeled images to obtain diverse image training samples; a training module for inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model; an input module for inputting the image to be recognized into the first-person perspective image recognition model; and a judgment module for determining, according to the output of the first-person perspective image recognition model, whether the image to be recognized is a first-person perspective image; wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
  • This embodiment provides a computer-readable storage medium on which executable instructions are stored; when the executable instructions are executed by a processor, they implement the first-person perspective image recognition method of Embodiment 1.
  • The executable instructions in the above embodiments may, but need not, correspond to files in a file system; they may be stored in part of a file that holds other programs or data, for example in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files storing one or more modules, subroutines, or code sections).
  • The executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • The storage medium can be a computer-readable storage medium, for example, a ferroelectric memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM); it can also be any device including one of or any combination of the above memories.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A first-person perspective image recognition method, comprising: S1: acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; S2: classifying and labeling the images according to whether they contain hand features and according to the hand feature information; S3: enhancing the classified and labeled images to obtain diverse image training samples; S4: inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model; S5: inputting the image to be recognized into the first-person perspective image recognition model; S6: determining whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model. The method can automatically identify whether an image is from a first-person perspective: the user only needs to provide the input video frame, and the method automatically distinguishes whether that frame is from a first-person perspective, which facilitates video post-processing and offers fast processing speed and high accuracy.

Description

First-person perspective image recognition method, apparatus, and computer-readable storage medium
Technical Field
The present application relates to the technical field of video recognition, and in particular to a first-person perspective image recognition method, apparatus, and computer-readable storage medium.
Background
The first-person perspective, abbreviated POV (Point of View), originally refers to a point-of-view character writing technique; simply put, a camera is mounted on a particular person or animal to record everything seen from that person's or animal's viewpoint. The first-person perspective is often used in games, where it is defined as watching the entire game demonstration from the perspective of the game operator: it is equivalent to standing behind the operator, so that what you see is what the operator sees. Similarly, "first-person perspective video" refers to video data captured from the first-person (user) perspective by a photographing device worn by the user.
With the popularization of wearable photographing devices, users can either wear the device on their bodies to shoot "first-person perspective" videos during extreme sports such as mountain climbing, surfing, and bungee jumping, or use the device directly to shoot non-"first-person perspective" videos. The difference in perspective leads to obvious differences in the content of the two types of video; for example, "first-person perspective" videos are more immersive and let viewers feel more directly what the photographer felt while shooting. The post-processing of the two types of video is therefore very different.
Technical Problem
However, whether a video frame is "first-person perspective" is not visually obvious, and distinguishing it with traditional computer-vision methods is relatively cumbersome and not very accurate.
It is therefore necessary to improve existing first-person perspective video detection methods.
Technical Solution
The present invention aims to overcome the defects of existing first-person perspective video detection methods and provides a first-person perspective image recognition method, apparatus, and computer-readable storage medium.
In a first aspect, the present invention discloses a first-person perspective image recognition method, comprising: S1: acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; S2: classifying and labeling the images according to whether they contain hand features and according to the hand feature information; S3: enhancing the classified and labeled images to obtain diverse image training samples; S4: inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model; S5: inputting the image to be recognized into the first-person perspective image recognition model; S6: determining whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model; wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
In a second aspect, the present invention discloses a first-person perspective image recognition apparatus, comprising:
an acquisition module for acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; a classification and labeling module for classifying and labeling the images according to whether they contain hand features and according to the hand feature information; an enhancement processing module for enhancing the classified and labeled images to obtain diverse image training samples; a training module for inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model; an input module for inputting the image to be recognized into the first-person perspective image recognition model; and a judgment module for determining, according to the output of the model, whether the image to be recognized is a first-person perspective image; wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
In a third aspect, the present invention discloses a computer-readable storage medium on which executable instructions are stored; when the executable instructions are executed by a processor, the above first-person perspective image recognition method is implemented.
Beneficial Effects
Compared with the prior art, the solution of the present invention can automatically identify whether an image is from a first-person perspective; when applied to video, it can automatically determine whether the current video frame is from a first-person perspective, eliminating the need for manual judgment. The user only needs to provide the input video frame, and the method automatically distinguishes whether that frame is from a first-person perspective, which facilitates video post-processing and offers fast processing speed and high accuracy.
Brief Description of the Drawings
Figure 1 is a flowchart of the method for constructing a first-person perspective image recognition model in Embodiment 1 of the present invention.
Figure 2 shows typical examples of images classified into the first category in Embodiment 1 of the present invention.
Figure 3 shows typical examples of images classified into the second category in Embodiment 1 of the present invention.
Figure 4 shows typical examples of images classified into the third category in Embodiment 1 of the present invention.
Figure 5 shows typical examples of images classified into the fourth category in Embodiment 1 of the present invention.
Figure 6 is a schematic comparison of an image training sample before and after scaling in Embodiment 1 of the present invention.
Figure 7 is a structural block diagram of the first-person perspective recognition apparatus in Embodiment 2 of the present invention.
Figure 8 is a flowchart of the processing of a video frame to be recognized in Embodiment 1 of the present invention.
Embodiments of the Present Invention
To make the objectives, technical solution, and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it.
The technical solution of the present invention is illustrated by the following specific embodiments.
Embodiment 1
As shown in Figures 1 to 6, the method for constructing a first-person perspective image recognition model in this embodiment includes the following steps.
S1: Acquire multiple images containing a first-person perspective and multiple images not containing a first-person perspective.
In this embodiment, the acquired images are photos or video frames captured by a wearable photographing device, and a first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand. In this embodiment, an image data set can be pre-built that contains photos taken with the wearable photographing device fixed on the photographer (i.e., first-person perspective images) and photos taken using the wearable photographing device as an ordinary camera (i.e., non-first-person perspective images).
S2: Classify and label the images according to whether they contain hand features and according to the hand feature information.
In this step the images are manually classified and labeled. The basis for classification and labeling is as follows: when the wearable photographing device is used as an ordinary camera to take photos or videos, the photographer's hand features (such as arms and fingers) generally do not appear in the photos or videos (except for selfies), whereas when the wearable photographing device is fixed on the photographer (for example, on clothing, a hat, or a headband) to take photos or videos, hand features tend to appear in them. Based on this principle, the photos or video frames captured by the wearable photographing device are classified in this step according to the following features: images containing finger or arm features protruding from the edge of the image are classified into the first category and marked as first-person perspective images (Figure 2 shows a typical example); images containing selfies are classified into the second category and marked as first-person perspective images (Figure 3 shows a typical example); images containing complete arm and finger features are classified into the third category and marked as non-first-person perspective (Figure 4 shows a typical example); and images containing no arm or finger features are classified into the fourth category and marked as non-first-person perspective (Figure 5 shows a typical example). The above is only the classification scheme used in this embodiment; those skilled in the art may further refine or optimize the classification according to features such as the position of the hand features or the proportion of the image they occupy, with the number of categories changing accordingly. An illustrative label mapping is given below.
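For illustration only, the four categories above can be written as a label mapping; the dictionary below is not part of the patent, but its numeric indices follow the probability vector p = [p0, p1, p2, p3] used later in step S6, where the first two entries are the first-person categories.

    # Hypothetical encoding of the four categories; indices follow the
    # probability vector p = [p0, p1, p2, p3] used in step S6.
    CATEGORIES = {
        0: ("fingers or arms protruding from the image edge", True),   # first-person
        1: ("selfie", True),                                           # first-person
        2: ("complete arm and finger features", False),                # non-first-person
        3: ("no arm or finger features", False),                       # non-first-person
    }

    def label_is_first_person(class_id: int) -> bool:
        # Categories 0 and 1 are labeled first-person perspective.
        return CATEGORIES[class_id][1]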
S3: Enhance the classified and labeled images to obtain diverse image training samples.
The main purpose of this step is to obtain more image training samples from fewer photos or video frames; that is, one image can yield multiple training samples after enhancement. Note that the category and label of each training sample are consistent with those of the original image. Specifically, the classified and labeled images can undergo image scaling, center cropping, random horizontal flipping, and random small changes in brightness, contrast, saturation, and hue. Note that in this embodiment, because the detector needs to detect fingers or arms protruding from the bottom of the image and rotation would destroy this arrangement, the rotation operation used in conventional data augmentation is removed in the enhancement stage; in addition, to avoid cropping away the hand features at the image edges entirely, the random cropping used in conventional data augmentation is replaced with center cropping. A sketch of such a pipeline follows.
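A minimal sketch of this augmentation pipeline, assuming PyTorch/torchvision (the library the embodiment itself uses for resizing and normalization); the crop size, jitter magnitudes, and flip probability are illustrative assumptions, since the embodiment names the operations but not their parameters.

    # Augmentation as described in S3: scale, center crop, random horizontal
    # flip, and small random color changes; no rotation, no random crop.
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize(256),             # image scaling
        transforms.CenterCrop(224),         # center crop, preserving edge hand features
        transforms.RandomHorizontalFlip(),  # random horizontal flip (p=0.5)
        transforms.ColorJitter(brightness=0.1, contrast=0.1,
                               saturation=0.1, hue=0.05),  # small color changes
        # Rotation is deliberately absent: it would break the cue of fingers
        # or arms protruding from the bottom edge of the image.
        transforms.ToTensor(),
    ])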
S4: Input the image training samples into the pre-built neural network for training to obtain the first-person perspective image recognition model.
The neural network in this embodiment is the MobileNetV2 model, and its training process includes the following sub-steps.
S41: Scale the image training samples to a preset size.
First, the photos or video frames captured by the wearable photographing device are input into the pre-built neural network pipeline, and transforms.Resize from the torchvision library is used to scale each photo or video frame to a 224x224 image. As shown in Figure 6, this greatly reduces the size of the input image without affecting its key features (the hand features), speeding up subsequent model training and inference.
S42: Normalize the scaled image training samples.
In this step, transforms.Normalize is used to normalize the scaled images and improve the detection performance of the model. The specific process is I' = (I - E) / STD, where I is the original image, E is the image mean estimated over the data set, and STD is the image standard deviation estimated over the data set.
S43: Input the normalized image training samples into the pre-built neural network for training.
Since the training data in this embodiment is not large, dropout is used when training the neural network in this step to alleviate overfitting. A training sketch follows.
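A minimal training sketch under these choices, assuming PyTorch/torchvision; MobileNetV2, the normalization, dropout, and a preset number of training rounds come from the embodiment, while the optimizer, learning rate, epoch count, and the ImageNet mean/std placeholders are assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models, transforms

    # S42: I' = (I - E) / STD. The mean/std below are the common ImageNet
    # estimates, standing in for values estimated on the actual data set.
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    # MobileNetV2 with a 4-way head; its classifier already applies dropout
    # before the final linear layer, which alleviates overfitting as in S43.
    model = models.mobilenet_v2(num_classes=4)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def train(loader, epochs=30):  # S44: a preset number of rounds (assumed value)
        model.train()
        for _ in range(epochs):
            for images, labels in loader:  # images already scaled and normalized
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()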
S44: Complete the construction of the first-person perspective image recognition model once the training of the neural network reaches a preset condition.
The preset condition in this embodiment is training the neural network for a predetermined number of rounds. Alternatively, the model can be validated or tested, and training stopped once the relevant parameters of the model reach a set threshold, thereby completing the construction of the first-person perspective image recognition model.
In addition, the first-person perspective image recognition model in this embodiment may also be constructed based on other neural networks, such as a VGG network, a deep residual network (ResNet, Deep Residual Network), GoogleNet-series models (such as InceptionV1-V4), SqueezeNet models, or ShuffleNet-series models.
S5: Input the image to be recognized into the first-person perspective image recognition model.
In this embodiment, the image to be recognized is scaled to a preset size before being input into the first-person perspective image recognition model. Specifically, as shown in Figure 6, transforms.Resize from the torchvision library is used to scale the photo or video frame to a 224x224 image. Alternatively, the image to be recognized can be input into the first-person perspective image recognition model directly.
S6: Determine whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model.
As shown in Figure 8, the processing of a video frame to be recognized in this embodiment is as follows: the image passes through multiple convolution and max-pooling layers, then a global average pooling layer aggregates global spatial information to obtain a feature vector of length 1280; this vector passes through a fully connected layer for final classification, yielding an output vector of length 4 (the number of categories; for a different number of categories the length changes accordingly), and then through a softmax activation function to obtain the probability distribution of the image to be recognized over the categories, which is output. In this embodiment, the probability distribution is p = [p0, p1, p2, p3] with p0 + p1 + p2 + p3 = 1, where p0 is the probability that the video frame belongs to the first category of Embodiment 1, p1 to the second category, p2 to the third category, and p3 to the fourth category. Whether the image to be recognized is a first-person perspective image is then judged from the actually obtained probability distribution p = [p0, p1, p2, p3]; one judgment criterion is p0 + p1 > p2 + p3, as in the sketch below.
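A single-frame inference sketch of this step, reusing the model and the normalize transform from the training sketch above; the PIL-image input and the exact preprocessing composition are assumptions.

    import torch
    import torch.nn.functional as F
    from torchvision import transforms

    infer_transform = transforms.Compose([
        transforms.Resize((224, 224)),  # S5: scale to the preset 224x224 size
        transforms.ToTensor(),
        normalize,                      # same normalization as in training
    ])

    @torch.no_grad()
    def frame_is_first_person(model, pil_frame) -> bool:
        model.eval()
        x = infer_transform(pil_frame).unsqueeze(0)  # add a batch dimension
        p = F.softmax(model(x), dim=1)[0]            # p = [p0, p1, p2, p3]
        # Judgment criterion from the description: p0 + p1 > p2 + p3.
        return bool((p[0] + p[1]) > (p[2] + p[3]))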
When judging video frames, whether a video is a first-person perspective video can be determined from the probability-distribution results of multiple consecutive video frames. For example, if the proportion of video frames in the video judged to be first-person perspective exceeds a set threshold (e.g., 60%), the video is judged to be a first-person perspective video; the sketch after this paragraph shows this vote.
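The frame-level rule extends to whole videos as a simple vote; in the sketch below the 60% threshold is the example given above, while the choice of which frames to sample is an assumption left open by the description.

    def video_is_first_person(model, frames, threshold=0.6) -> bool:
        # frames: an iterable of decoded video frames (e.g., PIL images).
        votes = [frame_is_first_person(model, f) for f in frames]
        return bool(votes) and sum(votes) / len(votes) > threshold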
Embodiment 2
As shown in Figure 7, this embodiment discloses a first-person perspective image recognition apparatus, comprising: an acquisition module for acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective; a classification and labeling module for classifying and labeling the images according to whether they contain hand features and according to the hand feature information; an enhancement processing module for enhancing the classified and labeled images to obtain diverse image training samples; a training module for inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model; an input module for inputting the image to be recognized into the first-person perspective image recognition model; and a judgment module for determining, according to the output of the first-person perspective image recognition model, whether the image to be recognized is a first-person perspective image; wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
Embodiment 3
This embodiment provides a computer-readable storage medium on which executable instructions are stored; when the executable instructions are executed by a processor, they implement the first-person perspective image recognition method of Embodiment 1.
It should be noted that the executable instructions in the above embodiments may, but need not, correspond to files in a file system; they may be stored in part of a file that holds other programs or data, for example in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files storing one or more modules, subroutines, or code sections). As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The storage medium can be a computer-readable storage medium, for example, a ferroelectric memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disk, or a compact disc read-only memory (CD-ROM); it can also be any device including one of or any combination of the above memories.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

  1. A first-person perspective image recognition method, characterized by comprising:
    S1: acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective;
    S2: classifying and labeling the images according to whether they contain hand features and according to the hand feature information;
    S3: enhancing the classified and labeled images to obtain diverse image training samples;
    S4: inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model;
    S5: inputting the image to be recognized into the first-person perspective image recognition model;
    S6: determining whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model;
    wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
  2. The first-person perspective image recognition method according to claim 1, characterized in that step S2 comprises: classifying images containing finger or arm features protruding from the edge of the image into a first category and marking them as first-person perspective images; classifying images containing selfies into a second category and marking them as first-person perspective images; classifying images containing complete arm and finger features into a third category and marking them as non-first-person perspective; and classifying images containing no arm or finger features into a fourth category and marking them as non-first-person perspective.
  3. The first-person perspective image recognition method according to claim 1, characterized in that the enhancement processing in step S3 comprises: performing image scaling, center cropping, random horizontal flipping, and random small changes in brightness, contrast, saturation, and hue on the classified and labeled images.
  4. The first-person perspective image recognition method according to claim 1, characterized in that the neural network is one of MobileNetV2, VGG, ResNet, GoogleNet, SqueezeNet, or ShuffleNet.
  5. The first-person perspective image recognition method according to claim 1, characterized in that step S4 comprises:
    S41: scaling the image training samples to a preset size;
    S42: normalizing the scaled image training samples;
    S43: inputting the normalized image training samples into the pre-built neural network for training;
    S44: completing the construction of the first-person perspective image recognition model once the training of the neural network reaches a preset condition.
  6. The first-person perspective image recognition method according to claim 5, characterized in that step S43 comprises training the neural network using dropout.
  7. The first-person perspective image recognition method according to claim 5, characterized in that the preset condition in step S44 is training the neural network for a predetermined number of rounds.
  8. The first-person perspective image recognition method according to claim 1, characterized in that step S5 comprises: scaling the image to be recognized to a preset size before inputting it into the first-person perspective image recognition model.
  9. The first-person perspective image recognition method according to claim 1, characterized in that the image to be recognized is processed in the first-person perspective model as follows: after the image passes through multiple convolution and max-pooling layers, a global average pooling layer aggregates global spatial information to obtain a feature vector of length 1280; this passes through a fully connected layer for final classification to obtain an output vector whose length equals the number of categories, and then through a softmax activation function to obtain and output the probability distribution of the image over the categories.
  10. A first-person perspective image recognition apparatus, characterized by comprising:
    an acquisition module for acquiring a plurality of images containing a first-person perspective and a plurality of images not containing a first-person perspective;
    a classification and labeling module for classifying and labeling the images according to whether they contain hand features and according to the hand feature information;
    an enhancement processing module for enhancing the classified and labeled images to obtain diverse image training samples;
    a training module for inputting the image training samples into a pre-built neural network for training to obtain a first-person perspective image recognition model;
    an input module for inputting the image to be recognized into the first-person perspective image recognition model;
    a judgment module for determining whether the image to be recognized is a first-person perspective image according to the output of the first-person perspective image recognition model;
    wherein the first-person perspective image is a photo or video frame that contains at least the features of the photographer's hand.
  11. A computer-readable storage medium, characterized in that executable instructions are stored on the computer-readable storage medium, and when the executable instructions are executed by a processor they implement the first-person perspective image recognition method of any one of claims 1 to 9.
PCT/CN2021/135527 2020-12-03 2021-12-03 First-person perspective image recognition method, apparatus, and computer-readable storage medium WO2022117096A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011392225.8A CN112381055A (zh) 2020-12-03 2020-12-03 First-person perspective image recognition method, apparatus, and computer-readable storage medium
CN202011392225.8 2020-12-03

Publications (1)

Publication Number Publication Date
WO2022117096A1 true WO2022117096A1 (zh) 2022-06-09

Family

ID=74589811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/135527 WO2022117096A1 (zh) 2020-12-03 2021-12-03 First-person perspective image recognition method, apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN112381055A (zh)
WO (1) WO2022117096A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381055A (zh) * 2020-12-03 2021-02-19 影石创新科技股份有限公司 第一人称视角图像识别方法、装置及计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171042B2 (en) * 2000-12-04 2007-01-30 Intel Corporation System and method for classification of images and videos
CN105718879A (zh) * 2016-01-19 2016-06-29 华南理工大学 First-person-view finger keypoint detection method for unconstrained scenes based on a deep convolutional neural network
CN110348572A (zh) * 2019-07-09 2019-10-18 上海商汤智能科技有限公司 Neural network model processing method and apparatus, electronic device, and storage medium
CN111062871A (zh) * 2019-12-17 2020-04-24 腾讯科技(深圳)有限公司 Image processing method and apparatus, computer device, and readable storage medium
CN111814810A (zh) * 2020-08-11 2020-10-23 Oppo广东移动通信有限公司 Image recognition method and apparatus, electronic device, and storage medium
CN112381055A (zh) * 2020-12-03 2021-02-19 影石创新科技股份有限公司 First-person perspective image recognition method, apparatus, and computer-readable storage medium

Also Published As

Publication number Publication date
CN112381055A (zh) 2021-02-19

Similar Documents

Publication Publication Date Title
Matern et al. Exploiting visual artifacts to expose deepfakes and face manipulations
US10438077B2 (en) Face liveness detection method, terminal, server and storage medium
CN108898579B (zh) Image sharpness recognition method, apparatus, and storage medium
US11182592B2 (en) Target object recognition method and apparatus, storage medium, and electronic device
US10832069B2 (en) Living body detection method, electronic device and computer readable medium
US11074436B1 (en) Method and apparatus for face recognition
WO2018188453A1 (zh) Face region determination method, storage medium, and computer device
CN108829900B (zh) Deep-learning-based face image retrieval method, apparatus, and terminal
CN111738243B (zh) Face image selection method, apparatus, device, and storage medium
WO2017185630A1 (zh) Emotion-recognition-based information recommendation method, apparatus, and electronic device
CN109657533A (zh) Pedestrian re-identification method and related products
CN108710847A (zh) Scene recognition method, apparatus, and electronic device
JP7286010B2 (ja) Human body attribute recognition method, apparatus, electronic device, and computer program
WO2022028184A1 (zh) Shooting control method, apparatus, electronic device, and storage medium
US9690980B2 (en) Automatic curation of digital images
CN112733802B (zh) Image occlusion detection method, apparatus, electronic device, and storage medium
CN109299658B (zh) Face detection method, face image rendering method, apparatus, and storage medium
WO2021127841A1 (zh) Attribute recognition method, apparatus, storage medium, and electronic device
Liu et al. 3d high-fidelity mask face presentation attack detection challenge
CN106815803B (zh) Picture processing method and apparatus
CN109670517A (zh) Target detection method, apparatus, electronic device, and target detection model
CN112036209A (zh) Portrait photo processing method and terminal
WO2022117096A1 (zh) First-person perspective image recognition method, apparatus, and computer-readable storage medium
CN110516572B (zh) Method for recognizing sports event video clips, electronic device, and storage medium
CN114051116A (zh) Video monitoring method, apparatus, and system for driving-test vehicles

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21900118

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21900118

Country of ref document: EP

Kind code of ref document: A1