WO2020258286A1 - Image processing method, apparatus, photographing device and movable platform - Google Patents

图像处理方法、装置、拍摄装置和可移动平台 (Image processing method, apparatus, photographing device and movable platform) Download PDF

Info

Publication number
WO2020258286A1
WO2020258286A1 (PCT/CN2019/093835; CN2019093835W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
image
information
semantic
target scene
Prior art date
Application number
PCT/CN2019/093835
Other languages
English (en)
French (fr)
Inventor
王涛
李思晋
刘政哲
李然
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2019/093835 priority Critical patent/WO2020258286A1/zh
Priority to CN201980011444.6A priority patent/CN111837158A/zh
Publication of WO2020258286A1 publication Critical patent/WO2020258286A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/45Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from two or more image sensors being of different type or operating in different modes, e.g. with a CMOS sensor for moving images in combination with a charge-coupled device [CCD] for still images

Definitions

  • the present invention relates to the field of image processing, in particular to an image processing method, device, photographing device and movable platform.
  • In the related art, when targets in a scene are to be recognized, semantic segmentation is performed on a single image captured of the scene to obtain a semantic segmentation map, and targets are identified from that map.
  • Such target recognition methods have difficulty distinguishing categories that lack distance cues and have similar textures, especially in scenes with relatively complex backgrounds; for example, distinguishing grass from the ground, or a front vehicle from a rear vehicle, is hard to achieve in this way.
  • the invention provides an image processing method, device, photographing device and a movable platform.
  • the present invention is implemented through the following technical solutions:
  • According to a first aspect of the present invention, an image processing method is provided, including: acquiring a binocular image of a target scene; determining depth information of the target scene according to the binocular image; and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • According to a second aspect of the present invention, an image processing device is provided, comprising:
  • a storage device for storing program instructions; and
  • one or more processors that call the program instructions stored in the storage device and, when the program instructions are executed, are individually or collectively configured to: acquire a binocular image of a target scene; determine depth information of the target scene according to the binocular image; and obtain semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • According to a third aspect of the present invention, a photographing device is provided, comprising:
  • an image acquisition module for collecting binocular images of the target scene;
  • a storage device for storing program instructions; and
  • one or more processors that call the program instructions stored in the storage device and, when the program instructions are executed, are individually or collectively configured to: acquire the binocular image of the target scene collected by the image acquisition module; determine depth information of the target scene according to the binocular image; and obtain semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • According to a fourth aspect of the present invention, a movable platform is provided, comprising:
  • an image acquisition module for collecting binocular images of the target scene;
  • a storage device for storing program instructions; and
  • one or more processors that call the program instructions stored in the storage device and, when the program instructions are executed, are individually or collectively configured to: acquire the binocular image of the target scene collected by the image acquisition module; determine depth information of the target scene according to the binocular image; and obtain semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • When performing target recognition, the present invention combines the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene, so the semantic information and position information of each target in the target scene can be obtained more accurately; categories in the target scene that lack distance cues or have similar textures can be distinguished, which provides support for constructing accurate and practical semantic maps. The target recognition method of the present invention is particularly suitable for target scenes with relatively complex backgrounds.
  • Fig. 1 is a flowchart of an image processing method in an embodiment of the present invention;
  • Fig. 2a is one monocular image of a binocular image of a target scene in an embodiment of the present invention;
  • Fig. 2b is a schematic representation of the depth information of the target scene shown in Fig. 2a;
  • Fig. 2c is a flowchart of a specific implementation of determining the depth information of a target scene from a binocular image in an embodiment of the present invention;
  • Fig. 3 is a flowchart of one specific implementation of obtaining the semantic information and position information of each target in a target scene according to depth information and a semantic segmentation map of one monocular image of a binocular image, in an embodiment of the present invention;
  • Fig. 4a is a schematic diagram of an application scenario of the image processing method in an embodiment of the present invention;
  • Fig. 4b is a schematic diagram of the depth map and binocular image of the scene in Fig. 4a;
  • Fig. 5a is a schematic diagram of another application scenario of the image processing method in an embodiment of the present invention;
  • Fig. 5b is a schematic diagram of the depth map and binocular image of the scene in Fig. 5a;
  • Fig. 6a is a schematic diagram of yet another application scenario of the image processing method in an embodiment of the present invention;
  • Fig. 6b is a schematic diagram of the depth map and binocular image of the scene in Fig. 6a;
  • Fig. 7 is a flowchart of another specific implementation of obtaining the semantic information and position information of each target in a target scene according to depth information and a semantic segmentation map of one monocular image of a binocular image, in an embodiment of the present invention;
  • Fig. 8 is a flowchart of one specific example of the image processing method in an embodiment of the present invention;
  • Fig. 9 is a structural block diagram of an image processing device in an embodiment of the present invention;
  • Fig. 10 is a structural block diagram of a photographing device in an embodiment of the present invention;
  • Fig. 11 is a schematic structural diagram of a movable platform in an embodiment of the present invention.
  • When performing target recognition, the present invention combines the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene, can obtain the semantic information and position information of each target in the target scene more accurately, and can distinguish categories in the target scene that lack distance cues or have similar textures.
  • The present invention can provide pixel-level semantic recognition of the scene seen from the perspective of a movable platform, and can construct a semantic map that supports key semantic categories needed for planning, such as drivable areas, people and vehicles.
  • A single monocular image of a binocular image lacks distance information and color information, so its semantic segmentation results are relatively poor, especially for categories that are hard to tell apart such as grass and the ground, or front and rear vehicles. For this reason, the present invention relies on the binocular image: when performing target recognition it combines the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene, so the semantic information and position information of each target in the scene seen from the movable platform's perspective can be obtained more accurately. This makes it possible to distinguish categories with missing distance cues and similar textures in that scene, and provides support for other intelligent functions of the movable platform.
  • the movable platform of the present invention has a shooting function
  • the movable platform can be a vehicle, an unmanned aerial vehicle, a handheld platform, an unmanned ship, and the like.
  • the vehicle can be an unmanned vehicle, a remote control car, etc.
  • the unmanned aerial vehicle can be an aerial photography drone or other unmanned aerial vehicle with shooting functions.
  • Fig. 1 is a flowchart of an image processing method in an embodiment of the present invention. Referring to Fig. 1, the image processing method of an embodiment of the present invention may include the following steps:
  • S101: Acquire a binocular image of the target scene. The acquisition method can be chosen as needed; for example, in some embodiments, a binocular camera is used for shooting to obtain the binocular image of the target scene.
  • The above-mentioned binocular camera may be the camera of a stereo (binocular) camera device, which can be mounted on a movable platform or used on its own; of course, the binocular camera may also be integrated into the movable platform.
  • a monocular camera is used to shoot at different positions to obtain binocular images of the target scene.
  • the different positions correspond to the shooting positions of the binocular camera.
  • The monocular camera in this embodiment may be the camera of a monocular camera device, which can be mounted on a movable platform or used on its own; of course, the monocular camera can also be integrated into the movable platform.
  • For the same target scene, one or more sets of binocular images can be acquired; each set of binocular images consists of two monocular images, namely a left-eye image and a right-eye image.
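  • As a concrete illustration of this acquisition step (not part of the patent text), a stereo pair could be grabbed from the two imaging sensors of a binocular camera roughly as follows; OpenCV and the device indices 0 and 1 are assumptions of the sketch:

```python
import cv2  # OpenCV is assumed to be available; the device indices below are hypothetical

# Open the two imaging sensors of a binocular camera
# (a monocular camera shot from two positions would yield an equivalent pair).
cap_left, cap_right = cv2.VideoCapture(0), cv2.VideoCapture(1)

ok_l, left = cap_left.read()    # left-eye image of one set of binocular images
ok_r, right = cap_right.read()  # right-eye image
if ok_l and ok_r:
    assert left.shape == right.shape, "the two monocular images should have the same size"

cap_left.release()
cap_right.release()
```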
  • S102: Determine the depth information of the target scene according to the binocular image.
  • the depth information may include: relative distance information of each target in the target scene in a preset coordinate system.
  • the depth information includes: distance information of each target in the target scene relative to a shooting device that photographed the target scene, such as the distance of each target relative to the lens, or the distance of each target relative to other positions of the shooting device.
  • the aforementioned preset coordinate system may be a world coordinate system or a custom coordinate system. Understandably, in other embodiments, absolute distance information may also be used to represent depth information.
  • the depth information can be presented in the form of feature maps or in the form of data.
  • Fig. 2a is a monocular image in a binocular image of a target scene;
  • Fig. 2b is the depth information of the target scene shown in Fig. 2a, and this embodiment uses a feature map method to present the depth information of the target scene.
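  • For orientation only: a depth map such as the one in Fig. 2b encodes a per-pixel distance. The conventional, non-learned way to obtain one from a rectified binocular pair is block matching plus the triangulation relation depth = focal_length × baseline / disparity. The sketch below uses OpenCV's semi-global matcher with hypothetical calibration values; it is not the learned approach the patent describes next:

```python
import cv2
import numpy as np

def depth_from_disparity(left_bgr, right_bgr, focal_px=700.0, baseline_m=0.12):
    """Conventional triangulation baseline: depth = focal_length * baseline / disparity.
    focal_px and baseline_m are hypothetical calibration values."""
    left_g = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right_g = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)

    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
    disparity = matcher.compute(left_g, right_g).astype(np.float32) / 16.0  # SGBM output is fixed-point x16

    depth = np.full_like(disparity, np.inf)                  # start with "infinitely far"
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]  # metres relative to the camera
    return depth
```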
  • In the related art, the depth information is determined using the principle of similar triangles; this computation is relatively complex and time-consuming. To reduce the time needed to determine the depth information, this embodiment uses a deep-learning approach. Fig. 2c shows a specific implementation of determining the depth information of the target scene from the binocular image: referring to Fig. 2c, the image information of the binocular image is input into a pre-trained first convolutional neural network, which determines the depth information of the target scene.
  • The image information includes the color information of each channel of the corresponding monocular image, such as the RGB components. In this embodiment, the image information of one or more sets of binocular images is input into the first convolutional neural network to determine the depth information of the target scene. The finally determined depth information can be represented by a feature map whose length and width are the same as those of the monocular images of the binocular image.
  • the network structure of the first convolutional neural network can be designed as required.
  • For example, in one feasible implementation, the first convolutional neural network includes a plurality of first network units connected in sequence, each first network unit performing feature extraction on its input. Optionally, the first convolutional neural network includes three sequentially connected first network units: the input of the first unit is the image information of the binocular image, the input of the middle unit is the output of the first unit, and the input of the last unit is the output of the middle unit. Optionally, the output of the first unit and the output of the middle unit are jointly used as the input of the last unit, which deepens the network of the first convolutional neural network.
  • the first network unit of this embodiment may include at least one of a convolutional layer, a batch normalization layer, and a nonlinear activation layer.
  • The convolutional layer, batch normalization layer and nonlinear activation layer all use conventional operations, which are not described in detail here.
  • the first network unit may also include other network layers, and is not limited to a convolutional layer, a batch normalization layer, and/or a nonlinear activation layer.
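  • A minimal sketch of such a first convolutional neural network is given below, assuming PyTorch; the channel widths, the 3x3 kernels and the single-channel depth head are illustrative guesses, not the patented architecture. It only mirrors the structure described above: three conv + batch-norm + activation units in sequence, with the outputs of the first and middle units jointly feeding the last one.

```python
import torch
import torch.nn as nn

class FirstNetworkUnit(nn.Module):
    """One 'first network unit': convolution + batch normalization + nonlinear activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class FirstCNN(nn.Module):
    """Sketch of the first convolutional neural network: three units in sequence,
    with the outputs of the first and middle units jointly fed to the last unit."""
    def __init__(self, width=32):
        super().__init__()
        self.unit1 = FirstNetworkUnit(6, width)          # input: stacked left + right RGB images
        self.unit2 = FirstNetworkUnit(width, width)
        self.unit3 = FirstNetworkUnit(width * 2, width)  # takes unit1 and unit2 outputs together
        self.head  = nn.Conv2d(width, 1, kernel_size=1)  # one-channel depth feature map

    def forward(self, left, right):
        x  = torch.cat([left, right], dim=1)             # (N, 6, H, W)
        f1 = self.unit1(x)
        f2 = self.unit2(f1)
        f3 = self.unit3(torch.cat([f1, f2], dim=1))
        return self.head(f3)                             # (N, 1, H, W), same H and W as the input

depth_map = FirstCNN()(torch.rand(1, 3, 256, 512), torch.rand(1, 3, 256, 512))
```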
  • In addition, referring again to Fig. 2c, in some embodiments, before determining the depth information of the target scene from the binocular image, the image processing method may further include: preprocessing the binocular image so that the two monocular images constituting the binocular image have the same size and the corresponding image points on the two monocular images match. In some embodiments, the method may also include eliminating the distortion of the two monocular images through binocular rectification. This preprocessing improves the matching quality of the binocular image and thereby the accuracy of the depth information.
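  • The preprocessing just described could look roughly like the following OpenCV sketch; the `calib` dictionary of intrinsics and extrinsics is a hypothetical result of an offline stereo calibration, not something the patent specifies:

```python
import cv2
import numpy as np

def preprocess_stereo(left, right, calib):
    """Sketch of the preprocessing step: make the two monocular images the same size and
    remove lens distortion via binocular (stereo) rectification."""
    h, w = left.shape[:2]
    right = cv2.resize(right, (w, h))  # enforce identical sizes for the two monocular images

    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(
        calib["K1"], calib["D1"], calib["K2"], calib["D2"], (w, h), calib["R"], calib["T"])
    map_lx, map_ly = cv2.initUndistortRectifyMap(calib["K1"], calib["D1"], R1, P1, (w, h), cv2.CV_32FC1)
    map_rx, map_ry = cv2.initUndistortRectifyMap(calib["K2"], calib["D2"], R2, P2, (w, h), cv2.CV_32FC1)

    left_rect  = cv2.remap(left,  map_lx, map_ly, cv2.INTER_LINEAR)   # corresponding image points now match
    right_rect = cv2.remap(right, map_rx, map_ry, cv2.INTER_LINEAR)
    return left_rect, right_rect
```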
  • S103: Obtain semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of a monocular image in the binocular image.
  • Combining the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene for target recognition yields more accurate semantic information and position information for each target in the target scene, and makes it possible to distinguish categories in the target scene that lack distance cues or have similar textures.
  • the semantic information includes at least information used to indicate the category of the target, for example, the target is a vehicle, a pedestrian, a road, or the sky.
  • Step S103 can be implemented in several ways. Optionally, referring to Fig. 3, in some embodiments the process of obtaining the semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of one monocular image of the binocular image may include, but is not limited to, the following steps:
  • S301: Perform semantic segmentation on a monocular image in the binocular image to obtain a semantic segmentation map of the monocular image.
  • The monocular image can be either the left-eye image or the right-eye image; since the left-eye image is usually used as the reference when the binocular image is captured, this embodiment chooses to perform semantic segmentation on the left-eye image to obtain the semantic segmentation map of the left-eye image.
  • An existing semantic segmentation algorithm can be used to implement step S301.
  • In this embodiment, one monocular image of the binocular image is input into the second convolutional neural network, and the second convolutional neural network determines the semantic segmentation map of the monocular image according to preset target classification rules and the image information of the monocular image.
  • For the second convolutional neural network, please refer to the subsequent description.
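  • Step S301 only requires some semantic segmentation of the reference (left-eye) image. Purely as an illustrative stand-in, and not the second convolutional neural network of this patent, an off-the-shelf model such as torchvision's DeepLabV3 could produce the segmentation map (the input below is a placeholder tensor; a real image should be normalized first):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Off-the-shelf stand-in for "an existing semantic segmentation algorithm".
model = deeplabv3_resnet50(weights="DEFAULT").eval()

left_image = torch.rand(1, 3, 256, 512)      # placeholder left-eye image
with torch.no_grad():
    logits = model(left_image)["out"]        # (1, num_classes, H, W) per-pixel class scores
segmentation_map = logits.argmax(dim=1)      # per-pixel class ids = the semantic segmentation map
```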
  • S302: Determine the semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map.
  • Determining the semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map may include, but is not limited to, the following steps: (1) according to the depth information and the initial semantic information of each target in the semantic segmentation map, distinguish multiple adjacent targets in the semantic segmentation map whose target categories are the same or similar; (2) obtain the semantic information and boundary information of these multiple targets.
  • For multiple adjacent targets whose categories are the same or similar, the initial semantic information may identify them as a single target, making the recognition inaccurate. The multiple targets in this step may be, for example, grass and the ground, front and rear vehicles, or adjacent walls in the target scene, or any other adjacent targets of the same or similar categories.
  • Of course, the position information of each target determined from the depth information and the semantic segmentation map may also be other position information of the corresponding target, and is not limited to boundary information.
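  • One crude, hand-rolled way to picture step (1) above — not the learned fusion the patent ultimately uses — is to take all pixels that the initial semantics assigned to one category and split them wherever the depth differs by more than some gap; SciPy and the 2 m gap are assumptions of the sketch:

```python
import numpy as np
from scipy import ndimage  # SciPy is assumed available for connected-component labelling

def split_class_by_depth(class_mask, depth, gap=2.0):
    """Split one semantic region (e.g. everything initially labelled 'vehicle') into separate
    targets whenever their depths differ by more than `gap` (hypothetical threshold, metres)."""
    labeled, n = ndimage.label(class_mask)            # spatially connected regions of one class
    targets = []
    for idx in range(1, n + 1):
        region = labeled == idx
        d = depth[region]
        for lo in np.arange(d.min(), d.max() + gap, gap):
            layer = region & (depth >= lo) & (depth < lo + gap)
            if layer.any():
                targets.append(layer)                 # one mask per distance layer = one target
    return targets                                    # boundary information follows from each mask
```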
  • In the following embodiments, the above-mentioned depth information is represented by a depth map.
  • The image processing method provided in this embodiment can be applied to semantic segmentation of images with occlusion. In the top view of Fig. 4a, when car A and car B are in the positions shown, then from the observation direction indicated by the arrow, car A is partially occluded by car B for the binocular camera (which includes camera a and camera b). After the binocular camera shoots along this observation direction, a monocular image a (collected by camera a) and a monocular image b (collected by camera b) are obtained as shown in the lower part of Fig. 4b. From monocular image a and monocular image b, the depth map shown in the upper part of Fig. 4b can be obtained (different filling patterns indicate different depths). By combining the depth map with the initial semantic information of each target in the semantic segmentation map of monocular image a, or with that of monocular image b, it can be determined that there are two vehicles ahead at different distances, and the type of vehicle (that is, the target category) can be further distinguished.
  • This embodiment can also resolve more complicated occlusions. In the top view of Fig. 5a, from the observation direction indicated by the arrow, car C occludes part of car B and part of car A, and car B occludes part of car A. After the binocular camera shoots along this observation direction, monocular image a (collected by camera a) and monocular image b (collected by camera b) are obtained as shown in the lower part of Fig. 5b. From monocular image a and monocular image b, the depth map shown in the upper part of Fig. 5b can be obtained. By combining the depth map with the initial semantic information of each target in the semantic segmentation map of monocular image a, or with that of monocular image b, the different occlusion relationships ahead can be distinguished, and the type of vehicle (that is, the target category) can be further distinguished.
  • The image processing method provided by this embodiment can also be applied to semantic segmentation of images containing objects with similar textures. In the top view of Fig. 6a, in front of the binocular camera is a wall with a corner: wall D is closer to the binocular camera than wall E, and wall D and wall E have similar textures. After the binocular camera shoots along the observation direction, monocular image a (collected by camera a) and monocular image b (collected by camera b) are obtained as shown in the lower part of Fig. 6b. From monocular image a and monocular image b, the depth map shown in the upper part of Fig. 6b can be obtained. By combining the depth map with the initial semantic information of each target in the semantic segmentation map of monocular image a, or with that of monocular image b, the front-to-back relationship between wall D and wall E can be distinguished, and the boundary can further be identified as a wall with a corner.
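  • The wall example boils down to finding a depth discontinuity inside a region of uniform texture. A minimal sketch of that idea follows (the 0.5 m jump threshold is a hypothetical value, and this is an illustration rather than the patented method):

```python
import numpy as np

def depth_edges_in_region(depth, region_mask, jump=0.5):
    """Mark pixels inside one uniform-texture region (e.g. the two similar walls) where the
    depth changes by more than `jump` metres between neighbours; such a depth edge separates
    wall D from wall E even though their textures look alike."""
    grad_y, grad_x = np.gradient(depth)
    return (np.hypot(grad_x, grad_y) > jump) & region_mask   # candidate boundary (the corner)
```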
  • Referring to Fig. 7, in some embodiments, the process of obtaining the semantic information and position information of each target in the target scene according to the depth information and the semantic segmentation map of one monocular image of the binocular image may include, but is not limited to, the following step:
  • S701: Input the depth information and the image information of one monocular image of the binocular image into a pre-trained second convolutional neural network to obtain the semantic information and position information of each target in the target scene.
  • Here, the second convolutional neural network is used to determine the semantic segmentation map of the monocular image according to preset target classification rules and the image information of the monocular image, and to obtain the semantic information and position information of each target in the target scene based on the depth information and the semantic segmentation map.
  • In this embodiment, to classify targets more precisely, the image training set used to train the second convolutional neural network includes image training sets for multiple target categories, and the image training set of each category includes at least one sub-category image training set.
  • the target categories include at least two of the following: vehicles, sky, roads, static obstacles, and dynamic obstacles; of course, the target categories are not limited to the categories listed above, and can also be set to other categories.
  • the subcategories of vehicles can be specifically divided into cars, trucks, buses, trains, RVs, etc.
  • the subcategories of static obstacles can be specifically divided into buildings, walls, guardrails, telephone poles, traffic lights, traffic signs, etc.
  • Subcategories of dynamic obstacles can include pedestrians, bicycles, motorcycles, etc.
  • the target classification rule in this embodiment corresponds to the target category, that is, the second convolutional neural network can identify the target belonging to the above target category in the monocular image.
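  • For illustration only, the classification rule can be pictured as a simple label table; the sub-category lists below just mirror the examples given above, and the numeric ids are arbitrary assumptions:

```python
# Hypothetical label table mirroring the target categories and sub-categories listed above.
TARGET_CATEGORIES = {
    "vehicle":          ["car", "truck", "bus", "train", "rv"],
    "sky":              ["sky"],
    "road":             ["road"],
    "static_obstacle":  ["building", "wall", "guardrail", "pole", "traffic_light", "traffic_sign"],
    "dynamic_obstacle": ["pedestrian", "bicycle", "motorcycle"],
}

# Flatten to the per-pixel class ids that the second network would actually predict.
CLASS_IDS = {sub: i for i, sub in enumerate(s for subs in TARGET_CATEGORIES.values() for s in subs)}
```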
  • the network structure of the second convolutional neural network can be designed as required.
  • For example, in one feasible implementation, the second convolutional neural network includes a plurality of second network units connected in sequence, each second network unit performing target classification on its input. Optionally, the second convolutional neural network includes three sequentially connected second network units: the input of the first unit is the depth information together with the image information of one monocular image of the binocular image, the input of the middle unit is the output of the first unit, and the input of the last unit is the output of the middle unit. Optionally, the output of the first unit and the output of the middle unit are jointly used as the input of the last unit, which deepens the network of the second convolutional neural network.
  • The second network unit in this embodiment includes at least one of a convolutional layer, a batch normalization layer, and a nonlinear activation layer; all of these use conventional operations, which are not described in detail here.
  • the second network unit may also include other network layers, and is not limited to a convolutional layer, a batch normalization layer, and/or a nonlinear activation layer.
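  • As with the first network, a minimal PyTorch sketch of such a second convolutional neural network is given below; concatenating the depth map with the monocular image as a fourth input channel, the channel widths and kernel sizes are all assumptions for illustration, not the patented design:

```python
import torch
import torch.nn as nn

class SecondNetworkUnit(nn.Module):
    """One 'second network unit': convolution + batch normalization + nonlinear activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.block(x)

class SecondCNN(nn.Module):
    """Sketch of the second network: depth map + one monocular image in, per-pixel class scores out.
    Three units in sequence; the first and middle outputs jointly feed the last unit."""
    def __init__(self, num_classes=15, width=32):
        super().__init__()
        self.unit1 = SecondNetworkUnit(4, width)         # 3 RGB channels + 1 depth channel
        self.unit2 = SecondNetworkUnit(width, width)
        self.unit3 = SecondNetworkUnit(width * 2, width)
        self.head  = nn.Conv2d(width, num_classes, 1)

    def forward(self, image, depth):
        x  = torch.cat([image, depth], dim=1)
        f1 = self.unit1(x)
        f2 = self.unit2(f1)
        logits = self.head(self.unit3(torch.cat([f1, f2], dim=1)))
        return logits                                    # semantic scores; positions follow from the per-class masks

logits = SecondCNN()(torch.rand(1, 3, 256, 512), torch.rand(1, 1, 256, 512))
```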
  • the position information in step S701 may be boundary information of each target in the target scene, or other position information of each target in the target scene.
  • step S301 and step S302 are implemented in the above-mentioned second convolutional neural network.
  • Referring to Fig. 8, in some embodiments, the preprocessed image information of the binocular image is input into the first convolutional neural network to determine the depth information of the target scene; the depth information of the target scene and the image information of one monocular image of the binocular image are then input into the second convolutional neural network to obtain the semantic information and position information of each target in the target scene.
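  • Gluing the pieces together, the Fig. 8 pipeline could be expressed as the following sketch, reusing the two hypothetical networks from the earlier snippets (again an illustration under those assumptions, not the patented implementation):

```python
import torch

def recognise_targets(left, right, first_cnn, second_cnn):
    """End-to-end sketch of Fig. 8: binocular image -> depth -> fused semantic result."""
    with torch.no_grad():
        depth  = first_cnn(left, right)      # S102: depth feature map with the input's height and width
        logits = second_cnn(left, depth)     # S103: fuse the depth with the left (reference) image
    probs   = logits.softmax(dim=1)
    classes = probs.argmax(dim=1)            # per-pixel semantic information (class ids)
    conf    = probs.max(dim=1).values        # per-pixel recognition confidence
    return classes, conf
```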
  • the semantic information may include: a recognition result and a corresponding recognition confidence.
  • The recognition result is used to indicate the category of the target.
  • The recognition confidence is used to indicate the reliability of the recognition result; misrecognized targets can be removed based on the confidence, which improves the accuracy of target recognition.
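  • A sketch of that confidence filter is shown below; the 0.6 threshold and the ignore id 255 are hypothetical values, not numbers from the patent:

```python
import torch

def filter_by_confidence(classes, conf, threshold=0.6, ignore_id=255):
    """Ignore recognition results whose confidence does not exceed the preset threshold."""
    filtered = classes.clone()
    filtered[conf <= threshold] = ignore_id   # possible mis-recognitions are dropped from the map
    return filtered
```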
  • Further, in some embodiments, after the semantic information and position information of each target in the target scene are obtained, the image processing method may further include: generating a semantic map of the target scene according to the recognition results, the corresponding recognition confidences and the position information, so that the target recognition results are presented intuitively in the semantic map.
  • Generating the semantic map of the target scene according to the recognition result, the corresponding recognition confidence and the position information may include, but is not limited to, the following steps:
  • (1) Determine, according to the recognition result and the position information, the target in the semantic segmentation map to which the recognition result corresponds; the contour of that target can be displayed in the semantic segmentation map.
  • (2) If the recognition confidence corresponding to the recognition result is greater than a preset confidence threshold, annotate the corresponding target in the semantic segmentation map with the preset label of the target category of the recognition result.
  • The annotated semantic segmentation map is the semantic map of the target scene; through the annotation, this embodiment presents the target category of each recognition result intuitively in the semantic segmentation map.
  • the label of each target category is preset, and the label of the target category can be represented by color, pattern, etc., where different target categories correspond to different labels.
  • Optionally, different target categories correspond to different colors; for example, the color for the sky is blue, the color for the ground is brown, and the color for grass is green. Optionally, the different subcategories under one target category share the same base color but are rendered in different shades.
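  • The color scheme can be pictured with a small palette helper; the base colors and the shading rule below are arbitrary choices used only to illustrate 'same color per category, different shades per sub-category':

```python
import numpy as np

# Hypothetical base colours: one hue per target category.
BASE_COLOURS = {"sky": (60, 120, 255), "road": (139, 90, 60), "vehicle": (200, 0, 0)}

def colour_for(category, sub_index, n_subs):
    """Same hue for every sub-category of a category, progressively darker shades."""
    r, g, b = BASE_COLOURS[category]
    scale = 1.0 - 0.5 * sub_index / max(n_subs - 1, 1)
    return int(r * scale), int(g * scale), int(b * scale)

def render_semantic_map(classes, id_to_cat_sub):
    """Paint each labelled pixel of a 2-D class-id array; `id_to_cat_sub` maps a class id
    to (category, sub_index, n_subs)."""
    canvas = np.zeros((*classes.shape, 3), np.uint8)
    for cid, key in id_to_cat_sub.items():
        canvas[classes == cid] = colour_for(*key)
    return canvas
```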
  • If the recognition confidence corresponding to a recognition result is less than or equal to the preset confidence threshold, the recognition result is judged to be a possible misrecognition. For such a target, its target category information can simply be ignored so that it does not affect the semantic segmentation result.
  • an embodiment of the present invention also provides an image processing device.
  • the image processing device 100 includes: a first storage device 110 and one or more first processors 120.
  • The first storage device 110 is used to store program instructions. The one or more first processors 120 call the program instructions stored in the first storage device 110 and, when the program instructions are executed, are individually or collectively configured to: acquire a binocular image of the target scene; determine the depth information of the target scene according to the binocular image; and obtain the semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • The first processor 120 can implement the image processing method of the embodiments shown in Fig. 1, Fig. 2c, Fig. 3, Fig. 7 and Fig. 8; the image processing device 100 of this embodiment can therefore be understood with reference to the image processing method of those embodiments.
  • The image processing device 100 of this embodiment can be a computer or other equipment with image processing capability, or a shooting device with a camera function, such as a camera, a video camera, a smartphone, a smart terminal, a shooting stabilizer, an unmanned aerial vehicle, and so on.
  • an embodiment of the present invention also provides a photographing device.
  • Referring to Fig. 10, the photographing device 200 includes: a first image acquisition module 210, a second storage device 220, and one or more second processors 230.
  • The first image acquisition module 210 is used to collect binocular images of the target scene; the second storage device 220 is used to store program instructions; and the one or more second processors 230 call the program instructions stored in the second storage device 220. When the program instructions are executed, the one or more second processors 230 are individually or collectively configured to: acquire the binocular image of the target scene collected by the first image acquisition module 210; determine the depth information of the target scene according to the binocular image; and obtain the semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • Optionally, the first image acquisition module 210 includes a lens and an imaging sensor matched with the lens, such as a CCD or CMOS image sensor.
  • The second processor 230 can implement the image processing method of the embodiments shown in Fig. 1, Fig. 2c, Fig. 3, Fig. 7 and Fig. 8 of the present invention; the photographing device 200 of this embodiment can therefore be understood with reference to the image processing method of those embodiments.
  • The photographing device 200 can be a camera with a shooting function, a video camera, a smartphone, a smart terminal, a shooting stabilizer (such as a handheld gimbal), an unmanned aerial vehicle (such as a drone), and so on.
  • An embodiment of the present invention also provides a movable platform. Referring to Fig. 11, the movable platform 300 includes: a second image acquisition module 310, a third storage device 320, and one or more third processors 330.
  • the second image acquisition module 310 is used to collect binocular images of the target scene;
  • The third storage device 320 is used to store program instructions, and the one or more third processors 330 call the program instructions stored in the third storage device 320.
  • When the program instructions are executed, the one or more third processors 330 are individually or collectively configured to: acquire the binocular image of the target scene collected by the second image acquisition module 310; determine the depth information of the target scene according to the binocular image; and obtain the semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image.
  • The second image acquisition module 310 of this embodiment may be a camera, or a structure with a shooting function formed by combining a lens with an imaging sensor (such as a CCD or CMOS sensor).
  • The third processor 330 can implement the image processing method of the embodiments shown in Fig. 1, Fig. 2c, Fig. 3, Fig. 7 and Fig. 8; the movable platform 300 of this embodiment can therefore be understood with reference to the image processing method of those embodiments.
  • the movable platform 300 is an unmanned aerial vehicle.
  • the unmanned aerial vehicle is an aerial photography unmanned aerial vehicle, and other unmanned aerial vehicles that do not have a camera function do not belong to the protection subject of this embodiment.
  • the unmanned aerial vehicle may be a multi-rotor unmanned aerial vehicle or a fixed-wing unmanned aerial vehicle.
  • the embodiment of the present invention does not specifically limit the type of the unmanned aerial vehicle.
  • Further, the second image acquisition module 310 can be mounted on the fuselage (not shown) via a gimbal (not shown), which stabilizes the second image acquisition module 310. The gimbal may be a two-axis gimbal or a three-axis gimbal, which is not specifically limited in the embodiments of the present invention.
  • The aforementioned storage devices may include volatile memory, such as random-access memory (RAM); they may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); a storage device may also include a combination of the foregoing types of memory.
  • volatile memory such as random-access memory (RAM)
  • non-volatile memory such as flash memory ( flash memory, hard disk drive (HDD) or solid-state drive (SSD)
  • SSD solid-state drive
  • the storage device 110 may also include a combination of the foregoing types of memory.
  • the foregoing processor may be a central processing unit (CPU).
  • The processor can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the image processing method of the foregoing embodiment are implemented.
  • The computer-readable storage medium may be an internal storage unit of the gimbal described in any of the foregoing embodiments, such as a hard disk or a memory.
  • The computer-readable storage medium may also be an external storage device of the gimbal, such as a plug-in hard disk, a smart media card (SMC), an SD card or a flash card equipped on the device.
  • Further, the computer-readable storage medium may also include both an internal storage unit of the gimbal and an external storage device.
  • The computer-readable storage medium is used to store the computer program and the other programs and data required by the gimbal, and can also be used to temporarily store data that has been output or will be output.
  • Those of ordinary skill in the art can understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method, an image processing apparatus, a photographing device and a movable platform. The method includes: acquiring a binocular image of a target scene; determining depth information of the target scene according to the binocular image; and obtaining semantic information and position information of each target in the target scene according to the depth information and a semantic segmentation map of one monocular image of the binocular image. When performing target recognition, the present invention combines the depth information of the target scene with the semantic segmentation map of a monocular image of the target scene, so the semantic information and position information of each target in the target scene can be obtained more accurately; categories in the target scene that lack distance cues or have similar textures can be distinguished, which provides support for constructing accurate and practical semantic maps. The target recognition method of the present invention is particularly suitable for target scenes with relatively complex backgrounds.

Description

图像处理方法、装置、拍摄装置和可移动平台 技术领域
本发明涉及图像处理领域,尤其涉及一种图像处理方法、装置、拍摄装置和可移动平台。
背景技术
相关技术中,在进行场景中的目标进行识别时,对针对该场景拍摄的单张图像进行语义分割,获得该语义分割图,根据语义分割图识别目标。上述目标识别方式难以对场景特别是背景较为复杂的场景中距离缺失、纹理相近的类别进行区分,如草丛和地面的区分以及前后车辆的区分,通过上述目标识别方式很难实现。
发明内容
本发明提供一种图像处理方法、装置、拍摄装置和可移动平台。
具体地,本发明是通过如下技术方案实现的:
根据本发明的第一方面,提供一种图像处理方法,所述方法包括:
获取目标场景的双目图像;
根据所述双目图像,确定所述目标场景的深度信息;
根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息。
根据本发明的第二方面,提供一种图像处理装置,所述装置包括:
存储装置,用于存储程序指令;
一个或多个处理器,调用所述存储装置中存储的程序指令,当所述程序指令被执行时,所述一个或多个处理器单独地或共同地被配置成用于:
获取目标场景的双目图像;
根据所述双目图像,确定所述目标场景的深度信息;
根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息。
根据本发明的第三方面,提供一种拍摄装置,所述拍摄装置包括:
图像采集模块,用于采集目标场景的双目图像;
存储装置,用于存储程序指令;
一个或多个处理器,调用所述存储装置中存储的程序指令,当所述程序指令被执行时,所述一个或多个处理器单独地或共同地被配置成用于:
获取所述图像采集模块采集的目标场景的双目图像;
根据所述双目图像,确定所述目标场景的深度信息;
根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息。
根据本发明的第四方面,提供一种可移动平台,所述可移动平台包括:
图像采集模块,用于采集目标场景的双目图像;
存储装置,用于存储程序指令;
一个或多个处理器,调用所述存储装置中存储的程序指令,当所述程序指令被执行时,所述一个或多个处理器单独地或共同地被配置成用于:
获取所述图像采集模块采集的目标场景的双目图像;
根据所述双目图像,确定所述目标场景的深度信息;
根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息。
由以上本发明实施例提供的技术方案可见,本发明在进行目标识别时,结合了目标场景的深度信息与目标场景的单目图像的语义分割图,能够更精确的获取目标场景中各目标的语义信息和位置信息,实现了对目标场景中距离缺失、纹理相近的类别的区分,为构建精确实用的语义地图提供了支撑;本发明的目标识别方法尤其适用于背景较为复杂的目标场景。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是本发明一实施例中的一种图像处理方法的方法流程图;
图2a是本发明一实施例中的一目标场景的双目图像中的一个单目图像;
图2b为图2a所示目标场景的深度信息的表征示意图;
图2c是本发明一实施例中的一种根据双目图像,确定目标场景的深度信息的一种具体实现方式的流程图;
图3是本发明一实施例中的一种根据深度信息以及双目图像中的一个单目图像的 语义分割图,获得目标场景中各目标的语义信息和位置信息的一种具体实现方式的流程图;
图4a是本发明一实施例中的图像处理方法的应用场景示意图;
图4b是图4a场景的深度图和双目图像的示意图;
图5a是本发明一实施例中的图像处理方法的另一应用场景示意图;
图5b是图5a场景的深度图和双目图像的示意图;
图6a是本发明一实施例中的图像处理方法的又一应用场景示意图;
图6b是图6a场景的深度图和双目图像的示意图;
图7是本发明一实施例中的一种根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息的另一种具体实现方式的流程图;
图8是本发明一实施例中的一种图像处理方法的一具体的方法流程图;
图9是本发明一实施例中的一种图像处理装置的结构框图;
图10是本发明一实施例中的一种拍摄装置的结构框图;
图11是本发明一实施例中的一种可移动平台的结构示意图。
具体实施方式
传统目标识别方式难以对场景中距离缺失、纹理相近的类别进行区分,如场景中的草丛和地面、前后车辆、相邻墙壁等类别。
本发明在进行目标识别时,结合了目标场景的深度信息与目标场景的单目图像的语义分割图,能够更精确的获取目标场景中各目标的语义信息和位置信息,实现了对目标场景中距离缺失、纹理相近的类别的区分。
本发明可以为可移动平台视角下的场景提供像素级语义识别,构建语义图提供关键策略语义类别支持,例如可行驶区域、人、车等信息。双目图像中的单幅单目图像由于缺乏距离信息、颜色信息,语义分割的结果较差,对于一些难以分辨的类别,例如草丛和地面、以及前后车辆等,对于此,本发明依赖于双目图像,在进行目标识别时,结合了目标场景的深度信息与目标场景的单目图像的语义分割图,能够更精确的获取可移动平台视角下的场景中各目标的语义信息和位置信息,实现了对可移动平台视角下的场景中距离缺失、纹理相近的类别的区分,为可移动平台的其他智能功能提供支持。
本发明的可移动平台具备拍摄功能,该可移动平台可以为车辆、无人飞行器、手持云台、无人船等。其中,车辆可以为无人驾驶车辆、遥控车等,无人飞行器可以为 航拍无人机或其他具有拍摄功能的无人飞行器。
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
需要说明的是,在不冲突的情况下,下述的实施例及实施方式中的特征可以相互组合。
图1是本发明一实施例中的一种图像处理方法的方法流程图;参见图1,本发明实施例的图像处理方法可以包括如下步骤:
S101:获取目标场景的双目图像;
双目图像的获取方式可根据需要选择,例如,在某些实施例中,采用双目拍摄像头进行拍摄,获取目标场景的双目图像。上述双目摄像头可以为双目相机的摄像头,该双目相机搭载在可移动平台上使用,也可以直接使用;当然,也可将上述双目摄像头集成在可移动平台上。
在某些实施例中,采用单目摄像头在不同的位置进行拍摄,获取目标场景的双目图像。其中,不同位置与双目拍摄像头的拍摄位置相对应。本实施例的单目摄像头可以为单目相机的摄像头,该单目相机可以搭载在可移动平台上使用,也可以直接使用;当然,也可以将上述单目摄像头集成在可移动平台上。
其中,对于同一目标场景的双目图像,可以获取该目标场景的一组或多组双目图像,一组双目图像包括两幅单目图像,即左目图像和右目图像。
S102:根据双目图像,确定目标场景的深度信息;
其中,深度信息可以包括:目标场景中的各目标在预设坐标系下的相对距离信息。可选的,深度信息包括:目标场景中的各目标相对拍摄目标场景的拍摄装置的距离信息,如各目标相对镜头的距离,或各目标相对拍摄装置的其他位置的距离。上述预设坐标系可以为世界坐标系,也可以为自定义坐标系。可以理解地,在其他实施例中,也可以采用绝对距离信息来表示深度信息。
深度信息可以采用特征图方式呈现,也可以采用数据方式呈现。图2a为一目标场景的双目图像中的一个单目图像;图2b为图2a所示目标场景的深度信息,本实施例采用特征图方式来呈现目标场景的深度信息。
相关技术中,利用三角形相似原理确定深度信息,该方式计算过程较为复杂,使用时间较长。为减小确定深度信息的时间,本实施例中,采用深度学习方式确定深度信息,如图2c所示,为根据双目图像,确定目标场景的深度信息的一种具体实现方式。参见图2c,在根据双目图像,确定目标场景的深度信息时,将双目图像的图像信息输 入预先训练的第一卷积神经网络中,确定目标场景的深度信息。其中,图像信息包括对应单目图像的各通道的颜色信息,如RGB分量;另外,本实施例中,将一组或多组双目图像的图像信息输入第一卷积神经网络中,确定目标场景的深度信息。并且,最终确定的深度信息可以通过一幅特征图表现,该特征图的长度和宽度与双目图像中的单目图像的长度和宽度相同。
第一卷积神经网络的网络结构可根据需要设计,例如,在一个可行的实现方式中,第一卷积神经网络可以包括多个依次连接的第一网络单元,第一网络单元用于对各自的输入进行特征提取;可选的,第一卷积神经网络包括三个依次连接的第一网络单元,首个第一网络单元的输入为双目图像的图像信息,中间的第一网络单元的输入为第一网络单元的输出,最后一个第一网络单元的输入为中间的第一网络单元的输出;可选的,将首个第一网络单元的输出和中间的第一网络单元的输出共同作为最后一个第一网络单元的输入,以加深第一卷积神经网络的网络的深度。
本实施例的第一网络单元可以包括卷积层、批量标准化层和非线性激活层中的至少一个,其中,卷积层、批量标准化层和非线性激活层均选择常规操作,本实施例对此不作具体说明。当然,第一网络单元也可以包括其他网络层,不限于卷积层、批量标准化层和/或非线性激活层。
此外,请再次参见图2c,在某些实施例中,在根据双目图像,确定目标场景的深度信息之前,所述图像处理方法还可以包括:对双目图像进行预处理,使得构成双目图像的两幅单目图像尺寸一致,使得上述两幅单目图像上对应的像点匹配;在某些实施例中,在根据双目图像,确定目标场景的深度信息之前,所述图像处理方法还可以包括:通过双目校正来消除双目图像的两幅单目图像的畸变。通过上述预处理,提高双目图像的匹配度,从而提高深度信息的精度。
S103:根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息。
同时结合目标场景的深度信息与目标场景的单目图像的语义分割图进行目标识别,获得更加精确的目标场景中各目标的语义信息和位置信息的精度,实现了对目标场景中距离缺失、纹理相近的类别的区分。
本实施例中,语义信息至少包括用于表示目标所在的类别的信息,如目标为车辆、行人、道路或天空等表示目标所在的类别的信息。
步骤S103的实现方式可以包括多种,可选的,参见图3,在某些实施例中,根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息的实现过程可以包括但不限于如下步骤:
S301:对双目图像中的一个单目图像进行语义分割,获得单目图像的语义分割图;
该单目图像可以为左目图像,也可以为右目图像;由于双目图像获取时,通常以左目图像作为拍摄时的基准,故本实施例选择对左目图像进行语义分割,获得上述左目图像的语义分割图。
可以采用现有语义分割算法来实现步骤S301,本实施例中,将双目图像中的一个单目图像输入第二卷积神经网络,由第二卷积神经网络根据预设的目标分类规则以及单目图像的图像信息,确定单目图像的语义分割图,对于第二卷积神经网络,请参见后续描述。
S302:根据深度信息和语义分割图,确定目标场景中各目标的语义信息和位置信息。
其中,根据深度信息和语义分割图,确定目标场景中各目标的语义信息和位置信息的实现过程可以包括但不限于如下步骤:
(1)根据深度信息以及语义分割图中各目标的初始语义信息,对语义分割图中目标类别相同或相近的、位置相邻的多个目标进行区分;
对于语义分割图中目标类别相同或相近的、位置相邻的多个目标,初始语义信息可能将这些目标识别成同一个目标,导致目标识别不准确。该步骤中的多个目标可以为目标场景中的草丛和地面、前后车辆、相邻墙壁等,也可以为其他目标类别相同或相近的、位置相邻的多个目标。
(2)获得多个目标的语义信息和边界信息。
当然,根据深度信息和语义分割图,确定目标场景中各目标的位置信息也可以为对应目标的其他位置信息,不限于边界信息。
下述实施例中,基于深度图表示上述深度信息。
本实施例提供的图像处理方法,可应用于对于存在遮挡的图像进行语义分割,如图4a所示的俯视图,当车A和车B处于如图所示的位置时,双目相机(包括摄像头a和摄像头b)从箭头方向的观察视角看,车A被车B遮挡了一部分,双目相机沿观察视角拍摄后,可得到如图4b下图所示的单目图像a(由摄像头a采集)和单目图像b(摄像头b采集),根据单目图像a和单目图像b,可得到如图4b上图所示的深度图(其中不同的填充图案表示不同的深度),通过将深度图与单目图像a的语义分割图中各目标的初始语义信息结合,或通过将深度图与单目图像b的语义分割图中各目标的初始语义信息结合,即可分辨出前方为距离不同的两辆车,也可进一步分辨出车辆的类型(也即目标类别)。
当然本实施例也可分辨出更复杂的遮挡的情况,如图5a所示的俯视图,双目相机从箭头方向的观察视角看,车C遮挡了部分车B和车A,车B遮挡了部分车A,双目相机沿观察视角拍摄后,可得到如图5b下图所示的单目图像a(由摄像头a采集)和 单目图像b(摄像头b采集),根据单目图像a和单目图像b,可得到图5b上图所示的深度图,通过将深度图与单目图像a的语义分割图中各目标的初始语义信息结合,或通过将深度图与单目图像b的语义分割图中各目标的初始语义信息结合,即可分辨出前方不同的遮挡关系,也可进一步分辨出车辆的类型(也即目标类别)。
本实施例提供的图像处理方法,也可应用于对于存在纹理相近物体的图像进行语义分割,例如如图6a所示的俯视图,双目相机前方为存在转角的墙面,墙面D相对于墙面E更靠近双目相机,且墙面D和墙面E具有相近的纹理,双目相机沿观察视角拍摄后,可得到如图6b下图所示的单目图像a(由摄像头a采集)和单目图像b(摄像头b采集),根据单目图像a和单目图像b,可得到图6b上图所示的深度图,通过将深度图与单目图像a的语义分割图中各目标的初始语义信息结合,或通过将深度图与单目图像b的语义分割图中各目标的初始语义信息结合,即可分辨出墙面D与墙面E的前后关系,也可进一步分辨出边界信息为存在转角的墙面。
参见图7,在某些实施例中,根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息的实现过程可以包括但不限于如下步骤:
S701:将深度信息和双目图像中的一个单目图像的图像信息输入预先训练的第二卷积神经网络中,获得目标场景中各目标的语义信息和位置信息;
其中,第二卷积神经网络用于根据预设的目标分类规则以及单目图像的图像信息,确定单目图像的语义分割图;并基于深度信息和语义分割图获得目标场景中各目标的语义信息和位置信息。
本实施例中,为更精确地实现目标分类,训练第二卷积神经网络所使用的图像训练集包括多个目标类别的图像训练集,每个类别的图像训练集包括至少一个子类别图像训练集;可选的,目标类别包括如下至少两种:车辆、天空、道路、静态障碍物和动态障碍物;当然,目标类别不限于上述列举的类别,还可以设置成其他类别。此外,车辆的子类别具体可分为轿车、卡车、公交、火车、房车等,静态障碍物的子类别具体可分为建筑物、墙、护栏、电线杆、交通灯、交通标志等,动态障碍物的子类别可包括行人、自行车、摩托车等。
本实施例的目标分类规则与目标类别相对应,也即,第二卷积神经网络能够识别出单目图像中属于上述目标类别的目标。
第二卷积神经网络的网络结构可根据需要设计,例如,在一个可行的实现方式中,第二卷积神经网络包括多个依次连接的第二网络单元,第二网络单元用于对各自的输入进行目标分类;可选的,第二卷积神经网络包括三个依次连接的第二网络单元,首个第二网络单元的输入深度信息和双目图像中的一个单目图像的图像信息,中间的第二网络单元的输入为第一网络单元的输出,最后一个第二网络单元的输入为中间的第 二网络单元的输出;可选的,将首个第二网络单元的输出和中间的第二网络单元的输出共同作为最后一个第二网络单元的输入,以加深第二卷积神经网络的网络的深度。
本实施例的第二网络层包括卷积层、批量标准化层和非线性激活层中的至少一个,其中,卷积层、批量标准化层和非线性激活层均选择常规操作,本实施例对此不作具体说明。当然,第二网络单元也可以包括其他网络层,不限于卷积层、批量标准化层和/或非线性激活层。
此外,步骤S701中的位置信息可以为目标场景中各目标的边界信息,也可以为目标场景中各目标的其他位置信息。
在某些实施例中,步骤S301和步骤S302均在上述第二卷积神经网络中实现。
参见图8,在某些实施例中,将双目图像预处理后的图像信息输入第一卷积神经网络,确定目标场景的深度信息;再将目标场景的深度信息和双目图像中的一幅单目图像的图像信息输入第二卷积神经网络,获得目标场景中各目标的语义信息和位置信息。
可选的,在一些实施例中,语义信息可以包括:识别结果和对应的识别置信度。其中,识别结果用于表示目标所在的类别的信息,识别置信度用于表示该识别结果的准确性,通过置信度可以去除误识别的目标,提高目标识别的准确度。
进一步地,参见图8,在一些实施例中,获得目标场景中各目标的语义信息和位置信息之后,所述图像处理方法还可以包括:根据识别结果、对应的识别置信度以及位置信息,生成目标场景的语义图,从而基于语义图直观呈现目标识别结果。其中,根据识别结果、对应的识别置信度以及位置信息,生成目标场景的语义图的实现过程可以包括但不限于如下步骤:
(1)、根据识别结果以及位置信息,确定语义分割图中识别结果对应的目标;
可根据识别结果以及位置信息,在语义分割图中显示识别结果对应的目标的轮廓。
(3)、若识别结果对应的识别置信度大于预设置信度阈值,则在语义分割图中将识别结果对应的目标标注为预设的识别结果所在目标类别的标注。
标注后的语义分割图即为目标场景的语义图,本实施例通过标注,将目标识别结果所在的目标类别直观呈现在语义分割图中。
本实施例中,各目标类别的标注预先设定,目标类别的标注可以通过颜色、图案等来表示,其中,不同目标类别对应的标注不同。可选的,不同目标类别对应的颜色不同,例如,天空对应的颜色为蓝色,地面对应的颜色为褐色,草丛对应的颜色为绿色等;可选的,同一目标类别下的不同子类别对应的颜色为同一颜色,但同一目标类别下的不同子类别对应的颜色具有不同的深度。
此外,若识别结果对应的识别置信度小于或等于预设置信度阈值,则确定该识别结果存在误识别的可能,而对于存在误识别的目标,可直接忽略该目标的目标类别信息,以避免对语义分割结果造成影响。
对应于上述实施例的图像处理方法,本发明实施例还提供一种图像处理装置,参见图9,所述图像处理装置100包括:第一存储装置110和一个或多个第一处理器120。
其中,第一存储装置110,用于存储程序指令;一个或多个第一处理器120,调用第一存储装置110中存储的程序指令,当程序指令被执行时,一个或多个第一处理器120单独地或共同地被配置成用于:获取目标场景的双目图像;根据双目图像,确定目标场景的深度信息;根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息。
第一处理器120可以实现如本发明图1、图2c、图3、图7以及图8所示实施例的图像处理方法,可参见上述实施例的图像处理方法对本实施例的图像处理装置100进行说明。
需要说明的是,本实施例的图像处理装置100可以为电脑等具备图像处理能力的设备,也可以为带有摄像功能的拍摄装置,如照相机,摄像机,智能手机,智能终端,拍摄稳定器,无人飞行器等等。
对应于上述实施例的图像处理方法,本发明实施例还提供一种拍摄装置,参见图10,该拍摄装置200包括:第一图像采集模块210、第二存储装置220和一个或多个第二处理器230。
其中,第一图像采集模块210,用于采集目标场景的双目图像;第二存储装置220,用于存储程序指令;一个或多个第二处理器230,调用第二存储装置220中存储的程序指令,当程序指令被执行时,一个或多个第二处理器230单独地或共同地被配置成用于:获取第一图像采集模块210采集的目标场景的双目图像;根据双目图像,确定目标场景的深度信息;根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息。
可选的,第一图像采集模块210包括镜头和与镜头相配合的成像传感器,如CCD、CMOS等图像传感器。
第二处理器230可以实现如本发明图1、图2c、图3、图7以及图8所示实施例的图像处理方法,可参见上述实施例的图像处理方法对本实施例的拍摄装置200进行说明。
该拍摄装置200可为带有摄像功能的照相机,摄像机,智能手机,智能终端,拍摄稳定器(如手持云台),无人飞行器(如无人机)等等。
本发明实施例提供一种可移动平台,参见图11,所述可移动平台300包括:第二 图像采集模块310、第三存储装置320和一个或多个第三处理器330。
其中,第二图像采集模块310,用于采集目标场景的双目图像;第三存储装置320,用于存储程序指令;一个或多个第三处理器330,调用第三存储装置320中存储的程序指令,当程序指令被执行时,一个或多个第三处理器330单独地或共同地被配置成用于:获取第二图像采集模块310采集的目标场景的双目图像;根据双目图像,确定目标场景的深度信息;根据深度信息以及双目图像中的一个单目图像的语义分割图,获得目标场景中各目标的语义信息和位置信息。
本实施例的第二图像采集模块310可以为相机,也可以为镜头和成像传感器(如CCD、CMOS等)组合形成的具有拍摄功能的结构。
第三处理器330可以实现如本发明图1、图2c、图3、图7以及图8所示实施例的图像处理方法,可参见上述实施例的图像处理方法对本实施例的可移动平台300进行说明。
在一可行的实现方式中,所述可移动平台300为无人机,可以理解地,该无人机为航拍无人机,其他不具有摄像功能的无人机不属于本实施例的保护主体。所述无人机可为多旋翼无人机,也可为固定翼无人机,本发明实施例对无人机的类型不作具体限定。进一步的,所述第二图像采集模块310可通过云台(未标出)搭载在机身(未标出),通过云台对第二图像采集模块310进行增稳,其中,该云台可为两轴云台,也可为三轴云台,本发明实施例对此不作具体限定。
上述存储装置可以包括易失性存储器(volatile memory),例如随机存取存储器(random-access memory,RAM);存储装置也可以包括非易失性存储器(non-volatile memory),例如快闪存储器(flash memory),硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储装置110还可以包括上述种类的存储器的组合。
上述处理器可以是中央处理器(central processing unit,CPU)。该处理器还可以是其它通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)或者其它可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
此外,本发明实施例还提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例的图像处理方法的步骤。
所述计算机可读存储介质可以是前述任一实施例所述的云台的内部存储单元,例如硬盘或内存。所述计算机可读存储介质也可以是云台的外部存储设备,例如所述设备上配备的插接式硬盘、智能存储卡(Smart Media Card,SMC)、SD卡、闪存卡(Flash Card)等。进一步的,所述计算机可读存储介质还可以既包括云台的内部存储单元也 包括外部存储设备。所述计算机可读存储介质用于存储所述计算机程序以及所述云台所需的其他程序和数据,还可以用于暂时地存储已经输出或者将要输出的数据。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random Access Memory,RAM)等。
以上所揭露的仅为本发明部分实施例而已,当然不能以此来限定本发明之权利范围,因此依本发明权利要求所作的等同变化,仍属本发明所涵盖的范围。

Claims (33)

  1. 一种图像处理方法,其特征在于,所述方法包括:
    获取目标场景的双目图像;
    根据所述双目图像,确定所述目标场景的深度信息;
    根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息。
  2. 根据权利要求1所述的方法,其特征在于,所述深度信息包括:所述目标场景中的各目标在预设坐标系下的相对距离信息。
  3. 根据权利要求2所述的方法,其特征在于,所述深度信息包括:所述目标场景中的各目标相对拍摄所述目标场景的拍摄装置的距离信息。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述根据所述双目图像,确定所述目标场景的深度信息,包括:
    将所述双目图像的图像信息输入预先训练的第一卷积神经网络中,确定所述目标场景的深度信息。
  5. 根据权利要求4所述的方法,其特征在于,所述第一卷积神经网络包括多个依次连接的第一网络单元,所述第一网络单元用于对各自的输入进行特征提取;
    所述第一网络单元包括卷积层、批量标准化层和非线性激活层中的至少一个。
  6. 根据权利要求1所述的方法,其特征在于,所述根据所述双目图像,确定所述目标场景的深度信息之前,还包括:
    对所述双目图像进行预处理,使得构成所述双目图像的两幅单目图像尺寸一致。
  7. 根据权利要求1所述的方法,其特征在于,所述根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息,包括:
    对所述双目图像中的一个单目图像进行语义分割,获得所述单目图像的语义分割图;
    根据所述深度信息和所述语义分割图,确定所述目标场景中各目标的语义信息和位置信息。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述深度信息和所述语义分割图,确定所述目标场景中各目标的语义信息和位置信息,包括:
    根据所述深度信息以及所述语义分割图中各目标的初始语义信息,对所述语义分割图中目标类别相同或相近的、位置相邻的多个目标进行区分;
    获得所述多个目标的语义信息和边界信息。
  9. 根据权利要求1或7所述的方法,其特征在于,所述根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息,包括:
    将所述深度信息和所述双目图像中的一个单目图像的图像信息输入预先训练的第 二卷积神经网络中,获得所述目标场景中各目标的语义信息和位置信息;
    其中,所述第二卷积神经网络用于,根据预设的目标分类规则以及所述单目图像的图像信息,确定所述单目图像的语义分割图;并基于所述深度信息和所述语义分割图获得所述目标场景中各目标的语义信息和位置信息。
  10. 根据权利要求9所述的方法,其特征在于,训练所述第二卷积神经网络所使用的图像训练集包括多个目标类别的图像训练集,每个类别的图像训练集包括至少一个子类别图像训练集;
    所述目标分类规则与所述目标类别相对应。
  11. 根据权利要求10所述的方法,其特征在于,所述目标类别包括如下至少两种:
    车辆、天空、道路、静态障碍物和动态障碍物。
  12. 根据权利要求9所述的方法,其特征在于,所述第二卷积神经网络包括多个依次连接的第二网络单元,所述第二网络单元用于对各自的输入进行目标分类;
    所述第二网络层包括卷积层、批量标准化层和非线性激活层中的至少一个。
  13. 根据权利要求1所述的方法,其特征在于,所述语义信息包括:识别结果和对应的识别置信度。
  14. 根据权利要求13所述的方法,其特征在于,所述获得所述目标场景中各目标的语义信息和位置信息之后,还包括:
    根据所述识别结果、对应的识别置信度以及所述位置信息,生成所述目标场景的语义图。
  15. 根据权利要求14所述的方法,其特征在于,所述根据所述识别结果、对应的识别置信度以及所述位置信息,生成所述目标场景的语义图,包括:
    根据所述识别结果以及所述位置信息,确定所述语义分割图中所述识别结果对应的目标;
    若所述识别结果对应的识别置信度大于预设置信度阈值,则在所述语义分割图中将所述识别结果对应的目标标注为预设的所述识别结果所在目标类别的标注。
  16. 一种图像处理装置,其特征在于,所述装置包括:
    存储装置,用于存储程序指令;
    一个或多个处理器,调用所述存储装置中存储的程序指令,当所述程序指令被执行时,所述一个或多个处理器单独地或共同地被配置成用于:
    获取目标场景的双目图像;
    根据所述双目图像,确定所述目标场景的深度信息;
    根据所述深度信息以及所述双目图像中的一个单目图像的语义分割图,获得所述目标场景中各目标的语义信息和位置信息。
  17. 根据权利要求16所述的图像处理装置,其特征在于,所述深度信息包括:所述目标场景中的各目标在预设坐标系下的相对距离信息。
  18. 根据权利要求17所述的图像处理装置,其特征在于,所述深度信息包括:所 述目标场景中的各目标相对拍摄所述目标场景的拍摄装置的距离信息。
  19. 根据权利要求16至18任一项所述的图像处理装置,其特征在于,所述一个或多个处理器单独地或共同地进一步被配置成用于:
    将所述双目图像的图像信息输入预先训练的第一卷积神经网络中,确定所述目标场景的深度信息。
  20. 根据权利要求19所述的图像处理装置,其特征在于,所述第一卷积神经网络包括多个依次连接的第一网络单元,所述第一网络单元用于对各自的输入进行特征提取;
    所述第一网络单元包括卷积层、批量标准化层和非线性激活层中的至少一个。
  21. 根据权利要求16所述的图像处理装置,其特征在于,所述一个或多个处理器在根据所述双目图像,确定所述目标场景的深度信息之前,还单独地或共同地进一步被配置成用于:
    对所述双目图像进行预处理,使得构成所述双目图像的两幅单目图像尺寸一致。
  22. 根据权利要求16所述的图像处理装置,其特征在于,所述一个或多个处理器单独地或共同地进一步被配置成用于:
    对所述双目图像中的一个单目图像进行语义分割,获得所述单目图像的语义分割图;
    根据所述深度信息和所述语义分割图,确定所述目标场景中各目标的语义信息和位置信息。
  23. 根据权利要求22所述的图像处理装置,其特征在于,所述一个或多个处理器单独地或共同地进一步被配置成用于:
    根据所述深度信息以及所述语义分割图中各目标的初始语义信息,对所述语义分割图中目标类别相同或相近的、位置相邻的多个目标进行区分;
    获得所述多个目标的语义信息和边界信息。
  24. 根据权利要求16或22所述的图像处理装置,其特征在于,所述一个或多个处理器单独地或共同地进一步被配置成用于:
    将所述深度信息和所述双目图像中的一个单目图像的图像信息输入预先训练的第二卷积神经网络中,获得所述目标场景中各目标的语义信息和位置信息;
    其中,所述第二卷积神经网络用于,根据预设的目标分类规则以及所述单目图像的图像信息,确定所述单目图像的语义分割图;并基于所述深度信息和所述语义分割图获得所述目标场景中各目标的语义信息和位置信息。
  25. 根据权利要求24所述的图像处理装置,其特征在于,训练所述第二卷积神经网络所使用的图像训练集包括多个目标类别的图像训练集,每个类别的图像训练集包括至少一个子类别图像训练集;
    所述目标分类规则与所述目标类别相对应。
  26. 根据权利要求25所述的图像处理装置,其特征在于,所述目标类别包括如下 至少两种:
    车辆、天空、道路、静态障碍物和动态障碍物。
  27. 根据权利要求24所述的图像处理装置,其特征在于,所述第二卷积神经网络包括多个依次连接的第二网络单元,所述第二网络单元用于对各自的输入进行目标分类;
    所述第二网络层包括卷积层、批量标准化层和非线性激活层中的至少一个。
  28. 根据权利要求16所述的图像处理装置,其特征在于,所述语义信息包括:识别结果和对应的识别置信度。
  29. 根据权利要求28所述的图像处理装置,其特征在于,所述一个或多个处理器在获得所述目标场景中各目标的语义信息和位置信息之后,还单独地或共同地进一步被配置成用于:
    根据所述识别结果、对应的识别置信度以及所述位置信息,生成所述目标场景的语义图。
  30. 根据权利要求29所述的图像处理装置,其特征在于,所述一个或多个处理器单独地或共同地进一步被配置成用于:
    根据所述识别结果以及所述位置信息,确定所述语义分割图中所述识别结果对应的目标;
    若所述识别结果对应的识别置信度大于预设置信度阈值,则在所述语义分割图中将所述识别结果对应的目标标注为预设的所述识别结果所在目标类别的标注。
  31. 一种拍摄装置,其特征在于,所述拍摄装置包括:
    图像采集模块,用于获得目标场景的双目图像;
    存储装置,用于存储程序指令;
    一个或多个处理器,调用所述存储装置中存储的程序指令,当所述程序指令被执行时,所述一个或多个处理器单独地或共同地被配置成用于实施权利要求1-15之一所述的方法。
  32. 一种可移动平台,其特征在于,所述可移动平台包括:
    图像采集模块,用于获得目标场景的双目图像;
    存储装置,用于存储程序指令;
    一个或多个处理器,调用所述存储装置中存储的程序指令,当所述程序指令被执行时,所述一个或多个处理器单独地或共同地被配置成用于实施权利要求1-15之一所述的方法。
  33. 根据权利要求32所述的可移动平台,其特征在于,所述可移动平台为无人飞行器和车辆中的至少一种。
PCT/CN2019/093835 2019-06-28 2019-06-28 图像处理方法、装置、拍摄装置和可移动平台 WO2020258286A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/093835 WO2020258286A1 (zh) 2019-06-28 2019-06-28 图像处理方法、装置、拍摄装置和可移动平台
CN201980011444.6A CN111837158A (zh) 2019-06-28 2019-06-28 图像处理方法、装置、拍摄装置和可移动平台

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/093835 WO2020258286A1 (zh) 2019-06-28 2019-06-28 图像处理方法、装置、拍摄装置和可移动平台

Publications (1)

Publication Number Publication Date
WO2020258286A1 true WO2020258286A1 (zh) 2020-12-30

Family

ID=72912596

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093835 WO2020258286A1 (zh) 2019-06-28 2019-06-28 图像处理方法、装置、拍摄装置和可移动平台

Country Status (2)

Country Link
CN (1) CN111837158A (zh)
WO (1) WO2020258286A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784693A (zh) * 2020-12-31 2021-05-11 珠海金山网络游戏科技有限公司 图像处理方法及装置
CN114049444B (zh) * 2022-01-13 2022-04-15 深圳市其域创新科技有限公司 一种3d场景生成方法及装置
WO2024005707A1 (en) * 2022-07-01 2024-01-04 Grabtaxi Holdings Pte. Ltd. Method, device and system for detecting dynamic occlusion

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101742349A (zh) * 2010-01-05 2010-06-16 浙江大学 一种对三维场景的表达方法及其电视系统
CN106778614A (zh) * 2016-12-16 2017-05-31 中新智擎有限公司 一种人体识别方法和装置
US20190080462A1 (en) * 2017-09-14 2019-03-14 Samsung Electronics Co., Ltd. Method and apparatus for calculating depth map based on reliability
US20190096125A1 (en) * 2017-09-28 2019-03-28 Nec Laboratories America, Inc. Generating occlusion-aware bird eye view representations of complex road scenes
CN108734713A (zh) * 2018-05-18 2018-11-02 大连理工大学 一种基于多特征图的交通图像语义分割方法
CN109002837A (zh) * 2018-06-21 2018-12-14 网易(杭州)网络有限公司 一种图像语义分类方法、介质、装置和计算设备
CN108986136A (zh) * 2018-07-23 2018-12-11 南昌航空大学 一种基于语义分割的双目场景流确定方法及系统
CN109490926A (zh) * 2018-09-28 2019-03-19 浙江大学 一种基于双目相机和gnss的路径规划方法

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950699A (zh) * 2021-03-30 2021-06-11 深圳市商汤科技有限公司 深度测量方法、装置、电子设备及存储介质
CN112967283A (zh) * 2021-04-22 2021-06-15 上海西井信息科技有限公司 基于双目摄像头的目标识别方法、系统、设备及存储介质
CN112967283B (zh) * 2021-04-22 2023-08-18 上海西井科技股份有限公司 基于双目摄像头的目标识别方法、系统、设备及存储介质
CN113570631A (zh) * 2021-08-28 2021-10-29 西安安森智能仪器股份有限公司 一种基于图像的指针式仪表智能识别方法及设备
CN113570631B (zh) * 2021-08-28 2024-04-26 西安安森智能仪器股份有限公司 一种基于图像的指针式仪表智能识别方法及设备
CN113762267A (zh) * 2021-09-02 2021-12-07 北京易航远智科技有限公司 一种基于语义关联的多尺度双目立体匹配方法及装置
CN113762267B (zh) * 2021-09-02 2024-03-12 北京易航远智科技有限公司 一种基于语义关联的多尺度双目立体匹配方法及装置
CN115661668A (zh) * 2022-12-13 2023-01-31 山东大学 一种辣椒花待授粉花朵识别方法、装置、介质及设备
CN115661668B (zh) * 2022-12-13 2023-03-31 山东大学 一种辣椒花待授粉花朵识别方法、装置、介质及设备

Also Published As

Publication number Publication date
CN111837158A (zh) 2020-10-27

Similar Documents

Publication Publication Date Title
WO2020258286A1 (zh) 图像处理方法、装置、拍摄装置和可移动平台
JP6844038B2 (ja) 生体検出方法及び装置、電子機器並びに記憶媒体
WO2021004312A1 (zh) 一种基于双目立体视觉系统的车辆智能测轨迹方法
WO2017054314A1 (zh) 一种建筑物高度计算方法、装置和存储介质
US20150138310A1 (en) Automatic scene parsing
WO2020154990A1 (zh) 目标物体运动状态检测方法、设备及存储介质
CN106971185B (zh) 一种基于全卷积网络的车牌定位方法及装置
WO2020258297A1 (zh) 图像语义分割方法、可移动平台及存储介质
CN109741241B (zh) 鱼眼图像的处理方法、装置、设备和存储介质
WO2021217398A1 (zh) 图像的处理方法及装置、可移动平台及其控制终端、计算机可读存储介质
CN111243003B (zh) 车载双目摄像机及其检测道路限高杆的方法、装置
CN111209840B (zh) 一种基于多传感器数据融合的3d目标检测方法
CN109883433B (zh) 基于360度全景视图的结构化环境中车辆定位方法
CN113673584A (zh) 一种图像检测方法及相关装置
WO2023016082A1 (zh) 三维重建方法、装置、电子设备及存储介质
CN115035235A (zh) 三维重建方法及装置
WO2022047701A1 (zh) 图像处理方法和装置
CN110667474A (zh) 通用障碍物检测方法、装置与自动驾驶系统
JP4882577B2 (ja) 物体追跡装置およびその制御方法、物体追跡システム、物体追跡プログラム、ならびに該プログラムを記録した記録媒体
CN108564654B (zh) 三维大场景的画面进入方式
WO2024067732A1 (zh) 神经网络模型的训练方法、车辆视图的生成方法和车辆
EP4287137A1 (en) Method, device, equipment, storage media and system for detecting drivable space of road
CN108090930A (zh) 基于双目立体相机的障碍物视觉检测系统及方法
CN116051736A (zh) 一种三维重建方法、装置、边缘设备和存储介质
CN116342632A (zh) 一种基于深度信息的抠图方法及抠图网络训练方法

Legal Events

Date  Code  Title / Description
      121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19935082; Country of ref document: EP; Kind code of ref document: A1)
      NENP  Non-entry into the national phase (Ref country code: DE)
      122   Ep: pct application non-entry in european phase (Ref document number: 19935082; Country of ref document: EP; Kind code of ref document: A1)