WO2022179604A1 - Method and device for determining a confidence level of a segmentation map - Google Patents

Method and device for determining a confidence level of a segmentation map

Info

Publication number
WO2022179604A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
segmentation
segmentation map
confidence level
map
Prior art date
Application number
PCT/CN2022/077911
Other languages
English (en)
French (fr)
Inventor
杨录
宋晴
邵滨
李志豪
许松岑
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022179604A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30168 Image quality inspection

Definitions

  • the present application relates to the field of image processing, and in particular, to a method and device for determining a confidence level of a segmentation map.
  • Computer vision is an integral part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military.
  • in essence, it equips a computer with eyes (cameras/camcorders) and a brain (algorithms) so that it can identify, track and measure targets in place of the human eye and thereby perceive its environment.
  • perception can be viewed as extracting information from sensory signals
  • computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data.
  • computer vision uses various imaging systems in place of the visual organs to obtain input information, and a computer in place of the brain to process and interpret that information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
  • image segmentation is a commonly used technology in the field of computer vision.
  • a target detection network outputs the category, the detection frame (bounding box) and the segmentation map of a target object, and the detection frame comes with a corresponding confidence.
  • in some applications, the confidence of the segmentation map is needed. For example, when displaying AR special effects, the effects should be displayed on objects whose segmentation maps have high confidence, but not on objects whose segmentation maps have low confidence.
  • in existing approaches, the confidence of the detection frame is directly reused as the confidence of the segmentation map. Because the confidence of the detection frame only represents how reliably the detection frame is located, it cannot adequately express the image segmentation quality of the segmentation map.
  • the present application provides a method, device, etc. for determining the confidence level of a segmentation map, so as to improve the accuracy of the confidence level of the segmentation map.
  • the following describes the content of the invention of the present application through different aspects. It can be understood that the implementation manners and beneficial effects of the following aspects can be referred to each other.
  • the present application provides a method for determining the confidence level of a segmentation map, the method comprising:
  • the terminal device can obtain an input video stream, which may be captured by the shooting device in the terminal device. Specifically, in an AR special-effects display scenario, the video stream may be the real-time video stream captured by the terminal device. The video stream includes a first image (also referred to as a first image frame).
  • the user may select a video stream on the terminal device, wherein the selected video stream includes the first image.
  • the terminal device may acquire the input first image, where the first image may be captured by a photographing device in the terminal device.
  • the user may select the first image on the terminal device (e.g., from an album on the terminal device or an album on the cloud side).
  • a pre-trained neural network model can be used to perform target detection on the first image to obtain a target detection result of the target object. The target detection result can include the detection frame of the target object and the corresponding first confidence level of the detection frame, and the first confidence level indicates the positioning accuracy of the detection frame.
  • a pre-trained neural network model may also be used to perform image segmentation on the image in the detection frame, so as to obtain a segmentation map corresponding to the target object.
  • the first confidence level is adjusted according to the image segmentation quality of the segmentation map to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
  • that is, the image segmentation quality of the segmentation map is obtained and used to adjust the first confidence level of the detection frame. The adjusted confidence then indicates both the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map.
  • the image segmentation quality is used to indicate the boundary definition of the target object in the segmentation map, wherein the higher the boundary definition is, the higher the image segmentation quality is.
  • the image segmentation quality of the segmentation map is used to adjust the first confidence level of the detection frame, and the result is taken as the second confidence level, i.e., the confidence level of the segmentation map. The second confidence level therefore contains both the positioning accuracy of the detection frame and the segmentation quality information of the segmentation map itself. This yields a more accurate confidence level for the segmentation map.
  • the method in this embodiment does not require an additional network to calculate the confidence level of the segmentation map. The confidence level can be obtained directly from the target detection result and the segmentation result without a large amount of extra computation. This is friendlier to the terminal side and makes the solution easy to deploy on terminal devices.
  • the value of each pixel in the segmentation map can be between 0 and 1. Pixels in the background area have values of 0 or close to 0, while pixels in the boundary area of the target object have values around 0.5.
  • when the segmentation map is binarized with a threshold of 0.5, pixels whose values are around 0.5 are easily assigned to the wrong side, so errors occur at the edges.
  • therefore, the fewer pixels there are with values around 0.5, the better the quality of the segmentation map.
  • the second confidence level of the segmentation map is used to indicate both the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map. The higher the positioning accuracy and the higher the image segmentation quality, the higher the second confidence level.
  • the segmentation map includes an object area where the target object is located, a background area of the target object, and a boundary area between the object area and the background area.
  • obtaining the image segmentation quality of the segmentation map includes:
  • the ratio of the number of pixels included in the object area to the number of pixels included in the target area is used as the image segmentation quality of the segmentation map, where the target area is the union of the object area and the boundary area.
  • that is, the smaller the boundary area (i.e., the fewer pixels it contains), the higher the image segmentation quality.
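The ratio-based quality measure described above can be sketched in Python with NumPy as follows. This is a minimal illustration, not the applicant's implementation. Several details are assumptions:

- the function name `ratio_quality`;
- the default thresholds, taken from the example values 0.25 and 0.7 given for the background and foreground thresholds;
- counting values that fall exactly on a threshold as boundary pixels;
- returning 0 when there is no object or boundary pixel at all.

```python
import numpy as np


def ratio_quality(seg_map, bg_thr=0.25, fg_thr=0.7):
    """Image segmentation quality as |object area| / |object area ∪ boundary area|.

    seg_map: 2-D array of per-pixel values in [0, 1].
    Assumption: values exactly equal to a threshold count as boundary pixels.
    """
    values = np.asarray(seg_map, dtype=float)
    if not 0.0 <= bg_thr < fg_thr < 1.0:
        raise ValueError("expected 0 <= bg_thr < fg_thr < 1")
    object_count = np.count_nonzero(values > fg_thr)
    # Target area = object area ∪ boundary area = every pixel outside the background area.
    target_count = np.count_nonzero(values >= bg_thr)
    if target_count == 0:
        return 0.0  # assumption: no foreground evidence at all -> lowest quality
    return float(object_count / target_count)


seg = np.array([[0.0, 0.1, 0.5],
                [0.9, 0.95, 0.6]])
print(ratio_quality(seg))  # 0.5: two object pixels out of four target pixels
```

A sharper mask, with fewer pixels between the two thresholds, pushes the ratio toward 1.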
  • the method further includes:
  • determining the area where pixels of the segmentation map with values greater than the foreground threshold are located as the object area;
  • determining the area where pixels of the segmentation map with values less than the background threshold are located as the background area;
  • determining the area where pixels of the segmentation map with values greater than the background threshold and less than the foreground threshold are located as the boundary area.
  • specifically, the segmentation map includes the object area where the target object is located, the background area of the target object, and the boundary area between the object area and the background area. The three areas are determined as follows:
    - the area containing pixels whose values are greater than the foreground threshold is the object area;
    - the area containing pixels whose values are less than the background threshold is the background area;
    - the area containing pixels whose values are greater than the background threshold and less than the foreground threshold is the boundary area.
  • for example, the background threshold can be 0.25, so pixels in the background area have values less than 0.25. The foreground threshold can be 0.7, so pixels in the object (foreground) area have values greater than 0.7. Pixels in the boundary area then have values greater than 0.25 and less than 0.7.
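As a worked example of the threshold rule, the following Python snippet uses the example values above (background threshold 0.25, foreground threshold 0.7). It assigns a few hypothetical pixel values to the three areas and then computes the object-to-target ratio. The pixel values are made up for illustration. Treating values exactly equal to a threshold as boundary pixels is an assumption, since the text does not cover that case.

```python
from collections import Counter

BG_THR = 0.25  # example background threshold from the text
FG_THR = 0.7   # example foreground threshold from the text


def classify_pixel(value, bg_thr=BG_THR, fg_thr=FG_THR):
    """Return the area a segmentation-map value falls into."""
    if value > fg_thr:
        return "object"
    if value < bg_thr:
        return "background"
    # bg_thr <= value <= fg_thr; exact-threshold values are treated as boundary (assumption).
    return "boundary"


values = [0.05, 0.2, 0.3, 0.5, 0.69, 0.71, 0.9, 1.0]  # hypothetical pixel values
counts = Counter(classify_pixel(v) for v in values)
# background: 0.05, 0.2 | boundary: 0.3, 0.5, 0.69 | object: 0.71, 0.9, 1.0
quality = counts["object"] / (counts["object"] + counts["boundary"])
print(quality)  # 0.5
```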
  • obtaining the image segmentation quality of the segmentation map includes:
  • obtaining initial pixel values of some pixels included in the segmentation map. The initial pixel values of these pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
  • adjusting the initial pixel values according to a target mapping relationship to obtain adjusted pixel values of the pixels included in the segmentation map. The target mapping relationship maps the initial pixel value to the adjusted pixel value as follows: as the initial pixel value changes from the preset value to 1, the adjusted pixel value gradually changes from 0 to 1, and as the initial pixel value increases, the slope of the mapping stays the same or increases;
  • determining the image segmentation quality of the segmentation map according to the adjusted pixel values of the pixels included in the segmentation map, where the larger the average of the adjusted pixel values, the higher the image segmentation quality.
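The mapping-based variant can be sketched as follows. The text only constrains the target mapping relationship: it maps the preset value to 0 and 1 to 1, and its slope stays the same or increases. The following choices are therefore assumptions:

- the power curve f(x) = ((x - p) / (1 - p))^k with k >= 1, used here as one example of such a mapping;
- the function name `mapped_quality`;
- the preset value 0.5, chosen to lie between the example thresholds 0.25 and 0.7;
- averaging over only the selected pixels (those above the preset value).

```python
import numpy as np


def mapped_quality(seg_map, preset=0.5, power=2.0):
    """Mean adjusted value of the pixels whose initial value exceeds `preset`.

    Hypothetical target mapping: f(x) = ((x - preset) / (1 - preset)) ** power.
    f(preset) = 0 and f(1) = 1; power == 1 keeps the slope constant and
    power > 1 makes it increase with x, matching the stated constraints.
    """
    if not 0.0 <= preset < 1.0:
        raise ValueError("preset must be in [0, 1)")
    if power < 1.0:
        raise ValueError("power must be >= 1 so the slope never decreases")
    values = np.asarray(seg_map, dtype=float)
    selected = values[values > preset]          # the "some pixels" above the preset value
    if selected.size == 0:
        return 0.0                              # assumption: nothing to average
    adjusted = ((selected - preset) / (1.0 - preset)) ** power
    return float(adjusted.mean())


seg = np.array([0.2, 0.5, 0.75, 1.0])
print(mapped_quality(seg))             # 0.625: mean of 0.25 and 1.0
print(mapped_quality(seg, power=1.0))  # 0.75: mean of 0.5 and 1.0
```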
  • the pixel values of the pixels in the segmentation map are mapped based on a preset mapping relationship, and the image segmentation quality of the segmentation map is expressed as the average of the mapped pixel values. This is equivalent to using each pixel's mapped value as its weight when calculating the image segmentation quality.
  • a pixel with a larger weight raises the image segmentation quality more, and a pixel with a smaller weight lowers it more.
  • the weight assigned to these pixels is therefore important.
  • when the initial pixel value is close to the preset value, the assigned weight is close to 0.
  • as the initial pixel value increases, the assigned weight gradually increases, and when the initial pixel value is 1, the assigned weight reaches 1.
  • the slope of the target mapping relationship may gradually increase, so the target mapping relationship becomes increasingly steep.
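To make the weighting argument concrete, the snippet below evaluates one assumed family of target mappings, f(x) = ((x - p) / (1 - p))^k, as per-pixel weights. It also checks that the slope never decreases. The preset value p = 0.5 and the exponents are illustrative choices, not values from the text.

With p = 0.5, a pixel with initial value 0.75 gets weight 0.5 under the linear mapping (k = 1) but only 0.125 under the cubic one (k = 3). A steeper mapping therefore penalizes uncertain pixels more strongly.

```python
def weight(x, preset=0.5, power=1.0):
    """Hypothetical target mapping used as a per-pixel weight (0 at preset, 1 at x = 1)."""
    if x <= preset:
        return 0.0
    return ((x - preset) / (1.0 - preset)) ** power


def slopes(preset=0.5, power=1.0, steps=10):
    """Finite-difference slopes of the mapping on [preset, 1]."""
    xs = [preset + (1.0 - preset) * i / steps for i in range(steps + 1)]
    ws = [weight(x, preset, power) for x in xs]
    return [(ws[i + 1] - ws[i]) / (xs[i + 1] - xs[i]) for i in range(steps)]


for power in (1.0, 2.0, 3.0):
    sample = [round(weight(x, power=power), 3) for x in (0.55, 0.75, 0.95, 1.0)]
    s = slopes(power=power)
    rising = all(a <= b + 1e-9 for a, b in zip(s, s[1:]))
    print(f"power={power}: weights={sample}, slope non-decreasing={rising}")
```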
  • adjusting the first confidence level according to the image segmentation quality to obtain the second confidence level of the segmentation map includes:
  • the product of the image segmentation quality and the first confidence level is determined as the second confidence level of the segmentation map.
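The product rule can be sketched as below. The function name and the [0, 1] range check are assumptions. The text does not state input ranges, but both the ratio-based and the mapping-based quality measures fall in that range, and detection confidences usually do too. The two calls show how a detection frame with the same first confidence can yield different segmentation-map confidences depending on mask quality. The quality values are made up.

```python
def second_confidence(first_confidence, segmentation_quality):
    """Second confidence = image segmentation quality x first confidence."""
    for name, v in (("first_confidence", first_confidence),
                    ("segmentation_quality", segmentation_quality)):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {v}")  # assumed range
    return segmentation_quality * first_confidence


box_conf = 0.9                                # hypothetical detection-frame confidence
print(second_confidence(box_conf, 0.95))      # sharp mask: stays close to 0.9
print(second_confidence(box_conf, 0.40))      # blurry mask: drops to about 0.36
```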
  • the second confidence level of the segmentation map can thus indicate both the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map. The higher the positioning accuracy and the higher the image segmentation quality, the higher the second confidence level.
  • the method further includes:
  • displaying an AR object around the target object in the first image based on the second confidence level being higher than a threshold; or, based on the second confidence level being higher than a threshold, replacing the target object in the first image with a first object, the first object being different from the target object.
  • the terminal device may display AR objects around the target object in the first image based on the second confidence level being higher than a threshold. Specifically, the AR objects may be displayed around the target object based on the 3D position information of the target object.
  • the terminal device may replace the target object with a first object based on the second confidence level being higher than a threshold. Here the first object is the background of the target object's area, taken from other image frames. That is, in the first image, area A is where the target object is located. In other image frames, area A shows background because the target object has moved elsewhere in the image or out of the image.
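The confidence-gated replacement described above can be illustrated with a NumPy sketch. It copies area A from another frame in which that area shows background. Everything specific here is an assumption:

- the function and parameter names;
- the confidence threshold of 0.5;
- binarizing the segmentation map at 0.5 to get the object mask;
- the premise that the two frames are already aligned (e.g., a static camera).

The text does not say how frames are aligned or which frame supplies the background.

```python
import numpy as np


def replace_target(frame, background_frame, seg_map, second_conf,
                   conf_threshold=0.5, mask_threshold=0.5):
    """Replace the target object's pixels with background taken from another frame.

    frame, background_frame: H x W x C arrays, assumed to be spatially aligned.
    seg_map: H x W segmentation map with values in [0, 1].
    The replacement only happens when the segmentation map's second confidence
    exceeds conf_threshold; otherwise an unchanged copy of the frame is returned.
    """
    out = np.array(frame, copy=True)
    if second_conf <= conf_threshold:
        return out
    mask = np.asarray(seg_map) > mask_threshold       # H x W boolean object mask (area A)
    out[mask] = np.asarray(background_frame)[mask]    # copy area A pixel by pixel
    return out


frame = np.zeros((2, 2, 3), dtype=np.uint8)
background = np.full((2, 2, 3), 255, dtype=np.uint8)
seg = np.array([[0.9, 0.1], [0.2, 0.8]])
print(replace_target(frame, background, seg, second_conf=0.8)[..., 0])  # object pixels now 255
```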
  • the present application provides an apparatus for determining a confidence level of a segmentation map, the apparatus comprising:
  • an acquisition module configured to acquire a first image;
  • a target detection module configured to perform target detection on the first image to obtain a detection frame of a target object in the first image and a first confidence level of the detection frame, where the first confidence level indicates the positioning accuracy of the detection frame;
  • an image segmentation module configured to perform image segmentation on the image in the detection frame, to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map;
  • a confidence level determination module configured to adjust the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
  • the first confidence level of the detection frame is adjusted using the image segmentation quality of the segmentation map, and the result is taken as the second confidence level of the segmentation map. The second confidence level thus contains both the positioning accuracy of the detection frame and the segmentation quality information of the segmentation map itself, yielding a more accurate confidence level for the segmentation map.
  • the segmentation map includes an object area where the target object is located, a background area of the target object, and a boundary area between the object area and the background area, and the acquiring module is configured to:
  • the ratio of the number of pixels included in the object area to the number of pixels included in the target area is used as the image segmentation quality of the segmentation map, where the target area is the union of the object area and the boundary area.
  • the acquiring module is configured to:
    - determine the area where pixels of the segmentation map with values greater than the foreground threshold are located as the object area;
    - determine the area where pixels with values less than the background threshold are located as the background area;
    - determine the area where pixels with values greater than the background threshold and less than the foreground threshold are located as the boundary area.
  • the image segmentation module is configured to:
    - perform image segmentation on the image in the detection frame to obtain an initial segmentation map corresponding to the target object, where the initial segmentation map includes a plurality of pixels and the probability that each pixel belongs to each category;
    - take, for each pixel, the maximum of its probabilities over all categories as that pixel's value in the segmentation map;
    - obtain the pixel values of some pixels included in the segmentation map, where the pixel values of these pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
    - determine the image segmentation quality of the segmentation map according to the pixel values of these pixels, where the larger the average of the adjusted pixel values of the pixels included in the segmentation map, the higher the image segmentation quality.
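Taking each pixel's maximum category probability as its value in the segmentation map can be written as a single NumPy reduction. The C x H x W layout of the probability array and the function name are assumptions. The text does not specify the tensor layout produced by the segmentation network.

```python
import numpy as np


def seg_map_from_probs(probs):
    """Collapse per-category probabilities into a single-channel segmentation map.

    probs: array of shape (C, H, W), probs[c, i, j] = probability that
    pixel (i, j) belongs to category c (layout is an assumption).
    Returns an (H, W) array holding each pixel's maximum category probability.
    """
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 3:
        raise ValueError("expected a (C, H, W) array")
    return probs.max(axis=0)


probs = np.array([[[0.8, 0.3]],    # category 0
                  [[0.2, 0.7]]])   # category 1
print(seg_map_from_probs(probs))   # [[0.8 0.7]]
```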
  • the image segmentation module is configured to:
    - acquire initial pixel values of some pixels included in the segmentation map, where the initial pixel values of these pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
    - adjust the initial pixel values according to a target mapping relationship to obtain adjusted pixel values of the pixels included in the segmentation map. The target mapping relationship maps the initial pixel value to the adjusted pixel value: as the initial pixel value changes from the preset value to 1, the adjusted pixel value gradually changes from 0 to 1, and as the initial pixel value increases, the slope of the mapping stays the same or increases;
    - determine the image segmentation quality of the segmentation map according to the adjusted pixel values, where the larger the average of the adjusted pixel values of the pixels included in the segmentation map, the higher the image segmentation quality.
  • the confidence level determination module is configured to determine the product of the image segmentation quality and the first confidence level as the second confidence level of the segmentation map.
  • the apparatus further includes an image processing module. The module is configured to display an AR object around the target object in the first image based on the second confidence level being higher than a threshold. Alternatively, based on the second confidence level being higher than a threshold, it replaces the target object in the first image with a first object, the first object being different from the target object.
  • an embodiment of the present application provides a model training apparatus, which may include a memory, a processor, and a bus system. The memory is configured to store a program, and the processor is configured to execute the program in the memory to perform the method of the first aspect or any optional implementation thereof.
  • an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to perform the methods of the first and third aspects or any optional implementation thereof.
  • an embodiment of the present application provides a computer program including code which, when executed, performs the method of the first aspect or any optional implementation thereof.
  • the present application provides a system-on-a-chip
  • the system-on-a-chip includes a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory configured to store the program instructions and data necessary for the execution device or the training device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • An embodiment of the present application provides a method for determining a confidence level of a segmentation map.
  • the method includes:
    - acquiring a first image;
    - performing target detection on the first image to obtain a detection frame of a target object in the first image and a first confidence level of the detection frame, where the first confidence level indicates the positioning accuracy of the detection frame;
    - performing image segmentation on the image in the detection frame to obtain the segmentation map corresponding to the target object, and obtaining the image segmentation quality of the segmentation map;
    - adjusting the first confidence level according to the image segmentation quality to obtain the second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
  • because the first confidence level of the detection frame is adjusted using the image segmentation quality of the segmentation map to obtain the second confidence level, the second confidence level contains both the positioning accuracy of the detection frame and the segmentation quality information of the segmentation map itself. A more accurate confidence level of the segmentation map can therefore be obtained.
  • FIG. 1 is a schematic structural diagram of a terminal provided in an embodiment of the present application.
  • FIG. 2a is a block diagram of a software structure of a terminal according to an embodiment of the present application.
  • FIG. 2b is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 2c is a schematic diagram of a segmentation map confidence level determination system according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a method for determining a confidence level of a segmentation map according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a backbone network.
  • FIG. 7 is a schematic diagram of a head.
  • FIG. 8 is a schematic diagram of a segmentation map according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of various target mapping relationships in this embodiment.
  • FIG. 10 is a schematic structural diagram of an apparatus for determining a confidence level of a segmentation map provided in an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the terminal 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, Antenna 1, Antenna 2, Mobile Communication Module 150, Wireless Communication Module 160, Audio Module 170, Speaker 170A, Receiver 170B, Microphone 170C, Headphone Interface 170D, Sensor Module 180, Key 190, Motor 191, Indicator 192, Camera 193, Display screen 194, and subscriber identification module (subscriber identification module, SIM) card interface 195 and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the terminal 100.
  • the terminal 100 may include more or fewer components than shown, some components may be combined or separated, or the components may be arranged differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may generate an operation control signal based on an instruction operation code and a timing signal, to control instruction fetching and execution.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache. It may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs those instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
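The reuse behavior described above can be modeled in a few lines. This is a minimal sketch, not the processor's actual design: the source does not say how cache entries are replaced, so least-recently-used (LRU) eviction is assumed, and the class `InstructionCache` and its `fetch` callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class InstructionCache:
    """Toy model of the cache in processor 110 (LRU eviction is an assumption)."""

    def __init__(self, capacity: int, fetch: Callable[[int], int]) -> None:
        self.capacity = capacity
        self.fetch = fetch  # slow path, e.g. a read from main memory
        self.lines: "OrderedDict[int, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, address: int) -> int:
        if address in self.lines:
            # Hit: recently used data is returned without a slow memory access.
            self.lines.move_to_end(address)
            self.hits += 1
            return self.lines[address]
        # Miss: fetch from slower memory and keep a copy for later reuse.
        self.misses += 1
        value = self.fetch(address)
        self.lines[address] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used entry
        return value


# Usage: revisiting address 0 while it is still cached produces one hit.
main_memory = {addr: addr * 10 for addr in range(8)}
cache = InstructionCache(capacity=2, fetch=main_memory.__getitem__)
for addr in [0, 1, 0, 2, 1]:
    cache.read(addr)
print(cache.hits, cache.misses)  # prints: 1 4
```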
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.
  • the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 110 may contain multiple sets of I2C buses.
  • the processor 110 may be separately coupled to the touch sensor 180K, a charger, a flash, the camera 193, and the like through different I2C bus interfaces.
  • for example, the processor 110 may be coupled to the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate through the I2C bus interface to implement the touch function of the terminal 100.
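For illustration, every I2C transfer starts with an address byte that selects the target device on the shared SDA/SCL pair. The sketch below shows the standard 7-bit form (address in the upper seven bits, read/write flag in bit 0); the address 0x38 for the touch sensor 180K is hypothetical and not taken from the source.

```python
I2C_WRITE = 0
I2C_READ = 1


def i2c_address_byte(address: int, read: bool) -> int:
    """First byte sent on SDA after START: 7-bit address followed by the R/W bit."""
    if not 0 <= address <= 0x7F:
        raise ValueError("7-bit I2C addresses must be in 0x00..0x7F")
    return (address << 1) | (I2C_READ if read else I2C_WRITE)


TOUCH_CONTROLLER_ADDR = 0x38  # hypothetical address of touch sensor 180K
print(hex(i2c_address_byte(TOUCH_CONTROLLER_ADDR, read=True)))  # prints 0x71
```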
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering calls through a Bluetooth headset.
  • the PCM interface may also be used for audio communication, to sample, quantize, and encode an analog signal.
  • the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface.
  • the audio module 170 may also transmit an audio signal to the wireless communication module 160 through the PCM interface, to implement a function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
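The sample, quantize, and encode steps of PCM can be sketched as follows. The 8 kHz sampling rate and 16-bit signed linear words are common voice settings assumed for illustration; the source does not state them, and `pcm_encode` is a hypothetical helper.

```python
import math
from typing import Callable, List

SAMPLE_RATE_HZ = 8000  # assumed narrowband voice sampling rate
BITS = 16              # assumed signed linear PCM word length


def pcm_encode(signal: Callable[[float], float], duration_s: float) -> List[int]:
    """Sample an analog signal in [-1.0, 1.0], quantize it, and encode it as integers."""
    full_scale = 2 ** (BITS - 1) - 1  # 32767 for 16-bit words
    n_samples = round(duration_s * SAMPLE_RATE_HZ)
    codes = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE_HZ                 # 1) sampling instant
        x = max(-1.0, min(1.0, signal(t)))     # clip to the converter's input range
        codes.append(round(x * full_scale))    # 2) quantize and 3) encode as integer
    return codes


def tone_1khz(t: float) -> float:
    """1 kHz sine at half of full scale, standing in for a microphone signal."""
    return 0.5 * math.sin(2 * math.pi * 1000 * t)


samples = pcm_encode(tone_1khz, duration_s=0.01)
print(len(samples), samples[:3])
```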
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus may be a bidirectional communication bus. It converts data to be transmitted between serial and parallel forms.
  • a UART interface is typically used to connect the processor 110 to the wireless communication module 160.
  • the processor 110 communicates with a Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function.
  • the audio module 170 may transmit an audio signal to the wireless communication module 160 through the UART interface, to implement a function of playing music through a Bluetooth headset.
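The serial/parallel conversion performed by a UART can be pictured as framing each byte with start and stop bits. This sketch assumes the common 8N1 format (eight data bits sent least-significant bit first, no parity, one stop bit, idle-high line); the source does not specify a frame format, and the helper names are hypothetical.

```python
from typing import List


def uart_frame(byte: int) -> List[int]:
    """Serialize one parallel byte into an 8N1 UART frame (line idles high)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # least-significant bit first
    return [0] + data_bits + [1]                      # start bit, data bits, stop bit


def uart_deframe(bits: List[int]) -> int:
    """Recover the parallel byte from a received 8N1 frame."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))


frame = uart_frame(ord("A"))  # 'A' is 0x41
print(frame)
```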
  • the MIPI interface may be used to connect the processor 110 to peripheral devices such as the display screen 194 and the camera 193.
  • MIPI interfaces include a camera serial interface (CSI), a display serial interface (DSI), and the like.
  • the processor 110 communicates with the camera 193 through the CSI interface to implement the photographing function of the terminal 100.
  • the processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the terminal 100.
  • the GPIO interface can be configured by software.
  • the GPIO interface may be configured to carry a control signal or a data signal.
  • the GPIO interface may be used to connect the processor 110 to the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like.
  • the GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, or the like.
  • the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the terminal 100, and can also be used to transmit data between the terminal 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the terminal 100 .
  • the terminal 100 may alternatively use an interface connection manner different from that in the foregoing embodiment, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through a wireless charging coil of the terminal 100. While charging the battery 142, the charging management module 140 may also supply power to the electronic device through the power management module 141.
  • the power management module 141 is configured to connect the battery 142, the charging management module 140, and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the power management module 141 may also be configured to monitor parameters such as battery capacity, battery cycle count, and battery health status (electric leakage and impedance).
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the terminal 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • each antenna in the terminal 100 may be used to cover one or more communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization.
  • for example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in combination with a tuning switch.
  • the mobile communication module 150 may provide wireless communication solutions applied to the terminal 100, including 2G, 3G, 4G, and 5G.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 may also amplify a signal modulated by the modulation and demodulation processor and convert the amplified signal into an electromagnetic wave radiated through the antenna 1.
  • at least some functional modules of the mobile communication module 150 may be provided in the processor 110.
  • at least some functional modules of the mobile communication module 150 may be provided in the same device as at least some modules of the processor 110.
  • the modem processor may include a modulator and a demodulator.
  • the modulator is configured to modulate a to-be-sent low-frequency baseband signal into a medium- or high-frequency signal.
  • the demodulator is configured to demodulate a received electromagnetic wave signal into a low-frequency baseband signal, and then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • after being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor.
  • the application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, and the like), or displays an image or a video through the display screen 194.
  • the modem processor may be a separate device.
  • the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 may provide wireless communication solutions applied to the terminal 100, including a wireless local area network (WLAN) (for example, a wireless fidelity (Wi-Fi) network), Bluetooth (BT), a global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives an electromagnetic wave through the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signal, and sends the processed signal to the processor 110.
  • the wireless communication module 160 may also receive a to-be-sent signal from the processor 110, perform frequency modulation and amplification on it, and convert it into an electromagnetic wave radiated through the antenna 2.
  • the antenna 1 of the terminal 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the terminal 100 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, and the like.
  • the GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
  • the terminal 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • the terminal 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is configured to process data fed back by the camera 193.
  • for example, during photographing, the shutter is opened, light is transmitted through the lens to the photosensitive element of the camera, and the optical signal is converted into an electrical signal; the photosensitive element then transmits the electrical signal to the ISP, which processes it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise and brightness.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • an optical image of an object is generated through the lens and projected onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
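As an illustration of that conversion, the sketch below maps one 8-bit RGB pixel to YUV using the full-range BT.601 YCbCr coefficients used by JPEG/JFIF. Which matrix a given DSP applies is not stated in the source, so this choice is an assumption.

```python
from typing import Tuple


def clamp8(x: float) -> int:
    """Round and clamp a value to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, round(x)))


def rgb_to_yuv(r: int, g: int, b: int) -> Tuple[int, int, int]:
    """Full-range BT.601 RGB to YCbCr conversion for 8-bit pixels (assumed matrix)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b           # luma
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference chroma
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference chroma
    return clamp8(y), clamp8(cb), clamp8(cr)


print(rgb_to_yuv(255, 255, 255), rgb_to_yuv(0, 0, 0))
```

Neutral grays land on Cb = Cr = 128, which is why the chroma channels are offset by 128.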
  • the terminal 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • the digital signal processor is configured to process digital signals; in addition to digital image signals, it can process other digital signals. For example, when the terminal 100 selects a frequency, the digital signal processor performs a Fourier transform on the frequency energy.
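The frequency-energy computation can be illustrated with a single discrete Fourier transform bin. This is a sketch under assumptions: evaluating one bin directly (instead of a full FFT), the normalization by N, and the helper `bin_energy` are illustrative choices, not details from the source.

```python
import cmath
import math
from typing import Sequence


def bin_energy(samples: Sequence[float], freq_hz: float, sample_rate_hz: float) -> float:
    """Energy |X(f)|^2 / N of a single DFT component at freq_hz."""
    n = len(samples)
    x_f = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * k / sample_rate_hz)
        for k, x in enumerate(samples)
    )
    return abs(x_f) ** 2 / n


fs = 8000.0
tone = [math.sin(2 * math.pi * 1000 * k / fs) for k in range(800)]  # 1 kHz input
for f in (500.0, 1000.0, 1500.0):
    print(f, round(bin_energy(tone, f, fs), 3))
```

Only the 1 kHz bin carries significant energy, because the other probe frequencies fall on orthogonal DFT bins for this window length.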
  • Video codecs are used to compress or decompress digital video.
  • Terminal 100 may support one or more video codecs.
  • the terminal 100 can play or record videos in multiple encoding formats, for example, moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, and MPEG-4.
  • the NPU is a neural-network (NN) computing processor.
  • the NPU can implement applications such as intelligent cognition on the terminal 100, for example, image recognition, face recognition, speech recognition, and text understanding.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, storing files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the terminal 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the terminal 100 may implement audio functions, for example, music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
  • the audio module 170 is configured to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be configured to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A, also referred to as a "loudspeaker", is configured to convert an audio electrical signal into a sound signal.
  • the terminal 100 may be used to listen to music or answer a hands-free call through the speaker 170A.
  • the receiver 170B, also referred to as an "earpiece", is configured to convert an audio electrical signal into a sound signal.
  • the user can hear a voice by placing the receiver 170B close to the ear.
  • the microphone 170C, also called a "mic" or "mouthpiece", is configured to convert a sound signal into an electrical signal.
  • the user can speak with the mouth close to the microphone 170C to input a sound signal into it.
  • the terminal 100 may be provided with at least one microphone 170C. In other embodiments, the terminal 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone interface 170D is configured to connect a wired earphone.
  • the earphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • the pressure sensor 180A may be provided on the display screen 194 .
  • a capacitive pressure sensor may include at least two parallel plates made of a conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes.
  • the terminal 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the terminal 100 detects the intensity of the touch operation according to the pressure sensor 180A.
  • the terminal 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
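The threshold rule above amounts to a single comparison. In this sketch the threshold value 0.6 (on a normalized 0 to 1 intensity scale) and the instruction names are hypothetical; only the rule itself (below the first pressure threshold views the message, at or above it creates a new one) comes from the text.

```python
FIRST_PRESSURE_THRESHOLD = 0.6  # hypothetical normalized touch intensity (0.0 to 1.0)


def sms_icon_instruction(intensity: float,
                         threshold: float = FIRST_PRESSURE_THRESHOLD) -> str:
    """Pick the instruction for a touch on the short message icon from its intensity."""
    if intensity < threshold:
        return "view_short_message"       # lighter press
    return "create_new_short_message"     # press at or above the first threshold


print(sms_icon_instruction(0.3), sms_icon_instruction(0.8))
```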
  • the gyro sensor 180B may be used to determine the motion attitude of the terminal 100 .
  • in some embodiments, the gyro sensor 180B may be used to determine the angular velocity of the terminal 100 about three axes (namely, the x, y, and z axes).
  • the gyro sensor 180B can be used for image stabilization.
  • the gyroscope sensor 180B detects the angle at which the terminal 100 shakes, calculates, based on the angle, the distance that the lens module needs to compensate, and lets the lens cancel the shake of the terminal 100 through reverse motion, thereby implementing image stabilization.
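One common first-order way to turn the detected shake into a lens correction is to integrate the gyro's angular velocity into an angle and use the image shift of a rotated camera, approximately focal length × tan(angle). The sketch assumes this small-rotation model, a 4.5 mm focal length, and fixed-interval gyro samples; none of these come from the source, and real optical stabilization loops are considerably more elaborate.

```python
import math
from typing import Sequence


def shake_angle_deg(rates_dps: Sequence[float], dt_s: float) -> float:
    """Integrate gyro angular-velocity samples (deg/s) into a shake angle (deg)."""
    return sum(rate * dt_s for rate in rates_dps)


def lens_compensation_mm(angle_deg: float, focal_length_mm: float = 4.5) -> float:
    """Lens shift that cancels the image motion; the sign opposes the shake."""
    image_shift_mm = focal_length_mm * math.tan(math.radians(angle_deg))
    return -image_shift_mm  # move the lens against the shake (reverse motion)


angle = shake_angle_deg([10.0, 10.0, 10.0, 10.0, 10.0], dt_s=0.01)  # 0.5 degrees
print(round(lens_compensation_mm(angle), 4))
```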
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
  • the air pressure sensor 180C is used to measure air pressure.
  • the terminal 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
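A common way to derive altitude from the measured pressure is the standard-atmosphere barometric formula, h ≈ 44330 × (1 − (p / p0)^(1/5.255)) meters. Using this formula, and the sea-level reference p0 = 1013.25 hPa, are assumptions for illustration; devices usually calibrate p0 from local weather data.

```python
SEA_LEVEL_HPA = 1013.25  # standard-atmosphere reference; assumed, normally calibrated


def altitude_m(pressure_hpa: float, sea_level_hpa: float = SEA_LEVEL_HPA) -> float:
    """Approximate altitude in meters from barometric pressure (troposphere model)."""
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))


for p in (1013.25, 950.0, 899.0):
    print(p, round(altitude_m(p), 1))
```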
  • the magnetic sensor 180D includes a Hall sensor.
  • the terminal 100 can detect opening and closing of a flip leather case by using the magnetic sensor 180D.
  • when the terminal 100 is a flip phone, it can detect opening and closing of the flip cover by using the magnetic sensor 180D. A feature such as automatic unlocking upon flipping open can then be set based on the detected open or closed state of the leather case or the flip cover.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the terminal 100 in various directions (generally three axes). When the terminal 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.
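Horizontal/vertical screen switching can be sketched from the gravity components reported while the terminal is roughly stationary. The axis convention (x across the screen, y along it, values in m/s²), the 3.0 m/s² flat-detection limit, and the orientation labels are assumptions; practical implementations also add hysteresis and tilt limits.

```python
FLAT_LIMIT_MS2 = 3.0  # hypothetical: below this on both axes the screen faces up or down


def screen_orientation(ax: float, ay: float) -> str:
    """Classify the screen orientation from gravity along the screen's x and y axes."""
    if max(abs(ax), abs(ay)) < FLAT_LIMIT_MS2:
        return "flat"  # lying flat: keep the current orientation
    if abs(ay) >= abs(ax):
        return "portrait" if ay > 0 else "portrait_reversed"
    return "landscape" if ax > 0 else "landscape_reversed"


print(screen_orientation(0.2, 9.7), screen_orientation(9.6, 0.8))
```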
  • the distance sensor 180F is configured to measure distance; the terminal 100 may measure distance by using infrared light or a laser. In some embodiments, in a photographing scenario, the terminal 100 can use the distance sensor 180F to measure distance to implement fast focusing.
  • the proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode.
  • the light-emitting diode may be an infrared light-emitting diode.
  • the terminal 100 emits infrared light outward through the light-emitting diode.
  • the terminal 100 detects infrared light reflected from nearby objects by using the photodiode. When sufficient reflected light is detected, the terminal 100 may determine that there is an object near it; when insufficient reflected light is detected, the terminal 100 may determine that there is no object near it.
  • the terminal 100 can use the proximity light sensor 180G to detect that the user holds the terminal 100 close to the ear to talk, so as to automatically turn off the screen to save power.
  • the proximity light sensor 180G may also be used in leather case mode and pocket mode to automatically unlock and lock the screen.
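The near/far decision described above reduces to comparing the reflected infrared intensity against a threshold. The sketch adds a second, lower release threshold (hysteresis) so the screen does not flicker on and off near the boundary; the threshold values and the hysteresis itself are assumptions, as the source only distinguishes sufficient from insufficient reflected light.

```python
NEAR_THRESHOLD = 200  # hypothetical ADC counts: "sufficient" reflected light
FAR_THRESHOLD = 150   # hypothetical release level, below NEAR_THRESHOLD


class ProximityDetector:
    """Tracks whether an object is near, using two thresholds to avoid flicker."""

    def __init__(self) -> None:
        self.object_near = False

    def update(self, reflected_counts: int) -> bool:
        """Return True while an object (e.g. the user's ear) is judged to be near."""
        if not self.object_near and reflected_counts >= NEAR_THRESHOLD:
            self.object_near = True   # e.g. turn the screen off during a call
        elif self.object_near and reflected_counts < FAR_THRESHOLD:
            self.object_near = False  # e.g. turn the screen back on
        return self.object_near


det = ProximityDetector()
states = [det.update(c) for c in [50, 210, 180, 140, 160]]
print(states)  # prints [False, True, True, False, False]
```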
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the terminal 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the terminal 100 is in a pocket, so as to prevent accidental touch.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal 100 can use characteristics of the collected fingerprint to implement fingerprint unlocking, application lock access, fingerprint-based photographing, fingerprint-based call answering, and the like.
  • the temperature sensor 180J is used to detect the temperature.
  • the terminal 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the terminal 100 reduces the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection.
  • in some other embodiments, when the temperature is lower than another threshold, the terminal 100 heats the battery 142 to avoid abnormal shutdown of the terminal 100 caused by low temperature.
  • in some other embodiments, when the temperature is lower than still another threshold, the terminal 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
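The temperature processing strategy can be sketched as a function from one reading to a list of actions. The threshold values and action names below are hypothetical, since the source gives no concrete numbers; the relative ordering of the two low-temperature thresholds is also an assumption.

```python
from typing import List

# Hypothetical thresholds in degrees Celsius; the source gives no values.
HIGH_TEMP_THRESHOLD_C = 45.0
BATTERY_HEAT_THRESHOLD_C = 0.0
VOLTAGE_BOOST_THRESHOLD_C = -10.0


def temperature_actions(temp_c: float) -> List[str]:
    """Map one reading from temperature sensor 180J to thermal-policy actions."""
    actions: List[str] = []
    if temp_c > HIGH_TEMP_THRESHOLD_C:
        # Lower the performance of the processor near the sensor (thermal protection).
        actions.append("reduce_nearby_processor_performance")
    if temp_c < BATTERY_HEAT_THRESHOLD_C:
        # Heat battery 142 so the terminal does not shut down from the cold.
        actions.append("heat_battery")
    if temp_c < VOLTAGE_BOOST_THRESHOLD_C:
        # Boost the battery's output voltage for the same reason.
        actions.append("boost_battery_output_voltage")
    return actions


for reading in (25.0, 50.0, -5.0, -20.0):
    print(reading, temperature_actions(reading))
```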
  • the touch sensor 180K is also called a "touch device".
  • the touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 194 .
  • the touch sensor 180K may alternatively be disposed on a surface of the terminal 100, at a position different from that of the display screen 194.
  • the bone conduction sensor 180M can acquire vibration signals.
  • the bone conduction sensor 180M may acquire a vibration signal of the vibrating bone of the human vocal part.
  • the bone conduction sensor 180M may also be in contact with the human pulse to receive a blood pressure pulse signal.
  • the bone conduction sensor 180M may also be disposed in an earphone to form a bone conduction earphone.
  • the audio module 170 may parse out a voice signal based on the vibration signal of the vibrating bone of the vocal part acquired by the bone conduction sensor 180M, to implement a voice function.
  • the application processor may parse out heart rate information based on the blood pressure pulse signal acquired by the bone conduction sensor 180M, to implement a heart rate detection function.
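Extracting a heart rate from the blood pressure pulse signal can be sketched by counting beats. This sketch assumes a clean, uniformly sampled signal and treats each upward crossing of the signal mean as one beat; real pipelines filter and validate the signal first, and the method and names here are illustrative, not from the source.

```python
import math
from typing import List


def heart_rate_bpm(signal: List[float], sample_rate_hz: float) -> float:
    """Estimate beats per minute from upward crossings of the signal mean."""
    mean = sum(signal) / len(signal)
    crossings = [
        i for i in range(1, len(signal))
        if signal[i - 1] < mean <= signal[i]
    ]
    if len(crossings) < 2:
        raise ValueError("need at least two beats to estimate a rate")
    beat_intervals = len(crossings) - 1
    span_s = (crossings[-1] - crossings[0]) / sample_rate_hz
    return 60.0 * beat_intervals / span_s


fs = 50.0
pulse = [math.sin(2 * math.pi * 1.2 * n / fs) for n in range(500)]  # synthetic 1.2 Hz pulse
print(round(heart_rate_bpm(pulse, fs)))
```

Measuring the span between the first and last detected beats, rather than dividing by the whole recording length, keeps partial beats at either end from skewing the estimate.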
  • the keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the terminal 100 may receive key input and generate key signal input related to user settings and function control of the terminal 100 .
  • the motor 191 may generate a vibration prompt.
  • the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • touch operations acting on different applications can correspond to different vibration feedback effects.
  • the motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 .
  • different application scenarios (for example, time reminders, message receiving, alarm clocks, and games) may also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 192 may be an indicator light, and may be used to indicate a charging state or a change in battery level, or to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card may be inserted into or removed from the SIM card interface 195 to come into contact with or be separated from the terminal 100.
  • the terminal 100 may support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • the SIM card interface 195 can support a Nano SIM card, a Micro SIM card, a SIM card, and the like. Multiple cards can be inserted into the same SIM card interface 195 at the same time, and the cards may be of the same type or different types.
  • the SIM card interface 195 can also be compatible with different types of SIM cards.
  • the SIM card interface 195 is also compatible with external memory cards.
  • the terminal 100 interacts with the network through the SIM card to realize functions such as calls and data communication.
  • the terminal 100 may use an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal 100 and cannot be separated from the terminal 100 .
  • the software system of the terminal 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present invention use an Android system with a layered architecture as an example to illustrate the software structure of the terminal 100.
  • FIG. 2a is a software structural block diagram of the terminal 100 according to an embodiment of the present disclosure.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (API) and a programming framework for applications at the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, content provider, view system, telephony manager, resource manager, notification manager, etc.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the telephony manager is used to provide the communication function of the terminal 100, for example, management of the call status (including connecting, hanging up, and the like).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • The notification manager can also display notifications in the status bar at the top of the system in the form of charts or scrolling text, such as notifications from applications running in the background, and can display notifications on the screen in the form of dialog windows. For example, text information is shown in the status bar, a prompt tone is played, the electronic device vibrates, or the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other is the core library of Android.
  • The application layer and the application framework layer run in virtual machines.
  • The virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • The workflow of the software and hardware of the terminal 100 is described below by way of example in conjunction with a photo-capture scenario.
  • When the touch sensor receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
  • The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. For example, if the touch operation is a tap and the control corresponding to the tap is the camera application icon, the camera application calls an interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
  • the camera 193 captures still images or video.
  • This embodiment of the present application further provides a server 1300.
  • the server 1300 may include a processor 1310 and a transceiver 1320, and the transceiver 1320 may be connected with the processor 1310, as shown in FIG. 2b.
  • the transceiver 1320 may include a receiver and a transmitter, and may be used to receive or transmit messages or data, and the transceiver 1320 may be a network card.
  • the server 1300 may further include an acceleration component (which may be referred to as an accelerator). When the acceleration component is a network acceleration component, the acceleration component may be a network card.
  • the processor 1310 may be the control center of the server 1300, and uses various interfaces and lines to connect various parts of the entire server 1300, such as the transceiver 1320 and the like.
  • the processor 1310 may be a central processing unit (Central Processing Unit, CPU).
  • the processor 1310 may include one or more processing units.
  • the processor 1310 may also be a digital signal processor, an application specific integrated circuit, a field programmable gate array, a GPU, or other programmable logic device, or the like.
  • the server 1300 may further include a memory 1330, which can be used to store software programs and modules.
  • the processor 1310 reads the software codes and modules stored in the memory 1330 to execute various functional applications and data processing of the server 1300.
  • The embodiment of the present application also provides a system for determining the confidence level of a segmentation map. As shown in FIG. 2c, the system may include a terminal device and a server.
  • the terminal device may be a mobile terminal, a human-computer interaction device, or a vehicle-mounted visual perception device, such as a mobile phone, an intelligent robot, an unmanned vehicle, an intelligent monitor, an Augmented Reality (AR) wearable device, and the like.
  • the methods provided by the embodiments of the present disclosure can be used in application fields such as human-computer interaction, vehicle-mounted visual perception, augmented reality, intelligent monitoring, and unmanned driving.
  • The terminal device may send an image (for example, the first image in this embodiment of the present application) to the server. The server performs image processing and analysis to obtain the confidence level of the segmentation map (for example, the second confidence in this embodiment of the present application) and transmits the second confidence to the terminal device.
  • For example, in an AR special-effects scenario, special effects can be added to the human objects in a video; for instance, wing effects can be displayed on a person's back (as shown in Figure 3). Image segmentation is performed on the image frames of the video to obtain a segmentation map of the human object, and the special effect is displayed at the corresponding position based on the segmentation map and the position of the human object in the image.
  • In some cases, however, the human object is recognized inaccurately in the segmentation map. It is therefore necessary to obtain the confidence level of the segmentation map and decide, based on that confidence level, whether to display the special effect. A low confidence level means either that the object in the segmentation map may not be a person (for example, it is some other object), or that the object is a person but its position, outline (boundary), and other information cannot be determined accurately from the segmentation map. In this case, the special effect is not displayed.
  • the image 1401 and the image 1402 in FIG. 4 are images that can be used for fusion processing.
  • The moving object 1411 can first be segmented out of image 1401. The region left by the removed moving object is then filled with the background image 1412 taken from the same position in image 1402, which completes image 1401.
  • Two or more images may be used for fusion. If the image contains many moving objects to remove and two images are not enough to produce a complete image without moving objects, more images can be fused to obtain one.
  • In the prior art, the detection result of the object (including the detection frame and the classification result) is obtained, and the confidence of this detection result is used directly as the confidence of the segmentation map.
  • However, the confidence of the detection result indicates the positioning accuracy of the detection frame. It can therefore only express whether the detection frame lies at the object's position; it cannot reflect the quality of the segmentation map.
  • In an AR special-effects display scenario, the boundary of the target object in the segmentation map may need to be located precisely. When the detection frame is positioned very accurately but the segmentation map is of very low quality, the prior art still displays the AR effect on the object. Because the segmentation map is poor, however, the displayed AR effect (for example, its display position and direction) is also poor. Taking wings as an example, the mutual occlusion relationship between the wings and the human object is disrupted, producing the so-called virtual-real occlusion problem. The confidence of the segmentation map should therefore reflect both the positioning accuracy of the detection frame and the segmentation quality of the object in the segmentation map, such as whether its boundary is clear.
  • FIG. 5 is a schematic diagram of an embodiment of a method for determining a confidence level of a segmentation map provided by an embodiment of the present application.
  • the method for determining the confidence level of the segmentation map includes:
  • Step 501 may be executed by a terminal device or by a server. Specifically, the terminal device may acquire the first image itself. Alternatively, the terminal device may acquire the first image and send it to a server on the cloud side, so that the server acquires the first image.
  • The terminal device can obtain an input video stream, which can be captured by the camera of the terminal device. In the AR special-effects display scenario, the video stream is a real-time video stream captured by the camera of the terminal device, and it includes the first image (also referred to as the first image frame).
  • the user may select a video stream on the terminal device, wherein the selected video stream includes the first image.
  • the terminal device may acquire the input first image, where the first image may be captured by a photographing device in the terminal device.
  • the user may select the first image on the terminal device (eg, an album on the terminal device or an album on the cloud side).
  • the first image may include a target object, where the target object may be a human object or a non-human object.
  • the first image may include a plurality of objects, and the target object is one of the plurality of objects.
  • The first image may be preprocessed, for example by normalization in the RGB domain; this is not limited in this application.
  • A pre-trained neural network model may be used to perform target detection on the first image to obtain a target detection result for the target object. The target detection result may include the detection box of the target object. Correspondingly, the first confidence level indicates the positioning accuracy of the detection box.
  • a pre-trained neural network model may also be used to perform image segmentation on the image in the detection frame, so as to obtain a segmentation map corresponding to the target object.
  • segmentation map in this embodiment of the present application may also be referred to as a Mask or a mask activation map.
  • The neural network model used for target detection may include a backbone network and a head (Header).
  • FIG. 6 is a schematic structural diagram of a backbone network.
  • The backbone network receives the input first image, performs convolution processing on it, and outputs feature maps of different resolutions, that is, of different sizes, corresponding to the image (feature maps C1, C2, C3, and C4). The backbone network thus extracts the basic features and provides the corresponding features for subsequent detection.
  • the backbone network can perform a series of convolution processing on the input image to obtain feature maps at different scales (with different resolutions). These feature maps will provide basic features for subsequent detection modules.
  • the backbone network can take various forms, such as visual geometry group (VGG), residual neural network (resnet), the core structure of GoogLeNet (Inception-net), etc.
  • the backbone network can perform convolution processing on the input image to generate several convolution feature maps of different scales.
  • Each feature map is an H*W*C matrix, where H is the height of the feature map, W is its width, and C is its number of channels.
  • The backbone can use a variety of existing convolutional network frameworks, such as VGG16, Resnet50, and Inception-Net. The following uses Resnet18 as the Backbone as an example.
  • Suppose the resolution of the input image is H*W*3 (height H, width W, and 3 channels, namely the R, G, and B channels).
  • the input image can be convolved through a convolutional layer Res18-Conv1 of Resnet18 to generate Featuremap (feature map) C1.
  • This feature map is downsampled twice (by a factor of 2 each time) relative to the input image, and the number of channels is expanded to 64, so the resolution of C1 is H/4*W/4*64.
  • C1 then undergoes a convolution operation through Res18-Conv2 of Resnet18 to obtain Featuremap C2, whose resolution is the same as that of C1.
  • C2 continues through Res18-Conv3 to generate Featuremap C3, which is further downsampled relative to C2 and has twice as many channels, so its resolution is H/8*W/8*128. Finally, C3 undergoes a convolution operation through Res18-Conv4 to generate Featuremap C4, whose resolution is H/16*W/16*256.
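  • As a rough illustration of the shape progression described above, the following sketch computes the (height, width, channels) of C1 to C4 for a given input size. It assumes the input height and width are divisible by 16, and it takes C4 to have 256 channels (the standard ResNet18 stage-4 width). `backbone_shapes` is a hypothetical helper, not part of any library.

```python
def backbone_shapes(h, w):
    """Return {name: (height, width, channels)} for C1..C4 of an H*W*3 input,
    following the Resnet18 downsampling/channel pattern in the text.
    Assumes h and w are divisible by 16."""
    stages = {
        # name: (total downsampling factor vs. the input, channel count)
        "C1": (4, 64),    # Res18-Conv1: downsampled twice, 64 channels
        "C2": (4, 64),    # Res18-Conv2: same resolution as C1
        "C3": (8, 128),   # Res18-Conv3: downsampled again, channels doubled
        "C4": (16, 256),  # Res18-Conv4: downsampled again, assumed 256 channels
    }
    return {name: (h // s, w // s, c) for name, (s, c) in stages.items()}


if __name__ == "__main__":
    for name, shape in backbone_shapes(720, 1280).items():
        print(name, shape)
```

Only the spatial bookkeeping is shown here. A real backbone would also carry batch and channel-ordering conventions that depend on the framework used.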
  • The backbone network in this embodiment of the present application may also be referred to as the Backbone; this is not limited here.
  • The network used for target detection may further include one or more parallel heads (headers).
  • Each parallel head detects the task object of one task (such as the target object in this embodiment) and outputs the detection frame of the region where the task object is located, together with the first confidence level corresponding to that detection frame. Different heads in the network can detect the task objects of different tasks, where a task object is the object to be detected in that task. The higher the first confidence level, the greater the probability that the target object exists in the corresponding detection frame. In other words, the first confidence level indicates the positioning accuracy of the detection frame: the higher the positioning accuracy, the higher the probability that the target object exists in the detection frame.
  • different heads can complete different object detection tasks.
  • For example, one head can perform car detection and output detection frames and confidences for Car/Truck/Bus. Another head can perform person detection and output detection frames and confidences for Pedestrian/Cyclist/Tricycle. A third head can perform traffic-light detection and output detection frames and confidences for Red_Trafficlight/Green_Trafficlight/Yellow_TrafficLight/Black_TrafficLight.
  • The network used for target detection may also include multiple serial heads, each connected to a parallel head. Serial heads are not actually required: in scenarios where only the detection box needs to be detected, no serial head is needed.
  • The serial head can perform image segmentation on the region where the detection frame is located (that is, step 503 in this embodiment).
  • Specifically, a serial head uses the detection frame of its task's task object, provided by the parallel head connected to it, to extract the features of the region where the detection frame is located from the feature map. Based on those features, it predicts the 3D information, Mask information (also called the segmentation map), or Keypoint information of the task object.
  • the serial head can complete the 3D/Mask/Keypoint detection of objects inside the detection frame based on the detection frame of the task.
  • For example, serial 3D_head0 estimates the vehicle's orientation, center of mass, and length, width, and height, and thereby outputs the vehicle's 3D box;
  • serial Mask_head0 predicts a fine mask of the vehicle and thereby segments the vehicle;
  • serial Keypoint_head0 estimates the key points of the vehicle.
  • Serial heads are not always necessary. Some tasks do not require 3D/Mask/Keypoint detection and therefore need no serial head. For example, traffic-light detection only needs the detection frame, so no serial head is required.
  • Other tasks can connect one or more serial heads as their specific needs require. For example, parking-slot (Parkingslot) detection needs the key points of the parking space in addition to the detection frame. This task therefore needs only one serial Keypoint_head, and no 3D or Mask head is required.
  • The header detects the detection frames of a task according to the feature map provided by the FPN and outputs the detection frames of the task's objects together with their confidence levels.
  • FIG. 7 is a schematic diagram of a header.
  • the header includes three modules: Region Proposal Network (RPN), ROI-ALIGN and RCNN.
  • The RPN module can be used to predict the region where the task object is located on the feature map and to output candidate detection frames that match that region. In other words, the RPN predicts the regions of the feature map where the task object may exist and gives the boxes of these regions, which are called proposals. For example, when the head is responsible for detecting cars, its RPN layer predicts candidate frames that may contain a car; when the head is responsible for detecting people, its RPN layer predicts candidate frames that may contain a person. These proposals are, of course, inaccurate: they do not necessarily contain objects of the task, and the boxes are not compact.
  • the detection candidate region prediction process can be implemented by the RPN module of the head, which predicts the regions where the task object may exist according to the feature map provided by the FPN, and gives the candidate frames (also called candidate regions, Proposal) of these regions.
  • the RPN layer can generate the feature map RPN Hidden by, for example, a 3*3 convolution on the feature map.
  • The RPN layer of the head then predicts proposals from RPN Hidden. Specifically, it predicts the coordinates and confidence of a proposal at each position of RPN Hidden through a 1*1 convolution. The higher the confidence, the greater the probability that the proposal contains an object of the task. For example, the higher the score of a proposal in the car head, the more likely that proposal is to contain a car.
  • The proposals predicted by each RPN layer pass through a proposal merging module. This module removes redundant proposals according to the degree of overlap between them (this can use, but is not limited to, the NMS algorithm) and selects the N (N ≤ K) highest-scoring proposals from the remaining K proposals as candidate regions where objects may exist.
  • These proposals are inaccurate: they do not necessarily contain objects of the task, and the boxes are not compact. The RPN module therefore performs only coarse detection, and the subsequent RCNN module is needed for refinement. When the RPN module regresses the coordinates of a proposal, it regresses them relative to an Anchor rather than as absolute values. The better these Anchors match the actual objects, the more likely the RPN is to detect the objects.
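  • The proposal merging step described above can be sketched with plain greedy NMS followed by top-N selection. This is a minimal sketch under stated assumptions: boxes are (x1, y1, x2, y2), greedy IoU-based NMS is used (the text allows but does not mandate NMS), and the defaults `iou_thresh=0.7` and `top_n=300` are illustrative values, not values from the source. `iou` and `merge_proposals` are hypothetical names.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)


def merge_proposals(boxes, scores, iou_thresh=0.7, top_n=300):
    """Greedy NMS, then keep the top_n highest-scoring survivors.
    Returns indices into `boxes`, in descending score order."""
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                   # best remaining proposal survives
        if order.size == 1:
            break
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thresh]  # drop redundant overlaps
    return np.array(keep, dtype=int)[:top_n]
```

Because greedy NMS picks survivors in descending score order, slicing the first `top_n` survivors yields the N highest-scoring proposals among the K that remain.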
  • The ROI-ALIGN module extracts the features of the region where each candidate detection frame is located from a feature map provided by the FPN, according to the regions predicted by the RPN module. That is, based on the proposals provided by the RPN module, the ROI-ALIGN module crops the features of each proposal's region from a feature map and resizes them to a fixed size, yielding the features of each proposal.
  • ROI-ALIGN can use but is not limited to ROI-POOLING (region of interest pooling)/ROI-ALIGN (region of interest extraction)/PS-ROIPOOLING (position-sensitive region of interest pooling)/ Feature extraction methods such as PS-ROIALIGN (Position Sensitive Region of Interest Extraction).
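  • To make the crop-and-resize step concrete, here is a simplified ROI Align in NumPy. It takes one bilinear sample at the centre of each output bin, whereas production implementations typically average several samples per bin. The names `bilinear`, `roi_align`, and `spatial_scale` are hypothetical. `spatial_scale` is assumed to convert image coordinates to feature-map coordinates (for example 1/16 for C4).

```python
import numpy as np


def bilinear(feat, y, x):
    """Bilinearly sample feat (H, W, C) at a real-valued location (y, x)."""
    h, w, _ = feat.shape
    y = min(max(y, 0.0), h - 1.0)        # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0, x0] * (1 - dx) + feat[y0, x1] * dx
    bottom = feat[y1, x0] * (1 - dx) + feat[y1, x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, box, out_size=14, spatial_scale=1.0):
    """Crop box = (x1, y1, x2, y2) from feat (H, W, C) and resize the crop to
    (out_size, out_size, C), sampling once at the centre of each bin."""
    x1, y1, x2, y2 = (v * spatial_scale for v in box)
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size, feat.shape[2]), dtype=np.float64)
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bin_h  # centre of bin (i, j)
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear(feat, cy, cx)
    return out
```

With the default `out_size=14` and a 256-channel feature map, each proposal yields a 14*14*256 feature, which matches the per-proposal size given in the text for the ROI-ALIGN output.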
  • the RCNN module is used to perform convolution processing on the features of the region where the candidate detection frame is located through a neural network to obtain the confidence that the candidate detection frame belongs to each object category; adjust the coordinates of the candidate detection frame through the neural network , so that the adjusted detection candidate frame is more matched to the shape of the actual object than the candidate detection frame, and the adjusted detection candidate frame with a confidence greater than a preset threshold is selected as the detection frame of the region.
  • The RCNN module mainly refines the features of each proposal produced by the ROI-ALIGN module. It obtains the confidence that each proposal belongs to each category (for example, for the car task, four scores are given for Background/Car/Truck/Bus) and adjusts the coordinates of the proposal's detection frame to output a more compact detection frame. These detection frames are merged by non-maximum suppression (NMS) and then output as the final detection frames.
  • Refinement of the detection candidate regions is mainly implemented by the RCNN module of the head in Figure 7. Based on the features of each proposal extracted by the ROI-ALIGN module, the RCNN module regresses more compact detection frame coordinates. At the same time, it classifies the proposal and outputs the confidence that the proposal belongs to each category. RCNN can be implemented in many forms.
  • The feature size output by the ROI-ALIGN module can be N*14*14*256 (Feature of proposals). In the RCNN module, these features are first processed by convolution module 4 of Resnet18 (Res18-Conv5), giving an output feature size of N*7*7*512.
  • They are then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features in each channel to obtain N*512 features, where each 1*512-dimensional feature vector represents the feature of one proposal.
  • Next, fully connected (FC) layers perform box regression and classification. The regression FC outputs an N*4 vector whose four values are the x/y coordinates of the box center and the width and height of the box. The classification FC outputs the confidence of each category (in head0, the scores that the box is Background/Car/Truck/Bus).
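  • The tensor flow through this part of the RCNN module can be traced with a short NumPy sketch covering global average pooling, a regression FC, and a classification FC. It is only a shape-level illustration: the weights are random placeholders rather than trained parameters, and turning the classification outputs into confidences with a softmax is an assumption, since the text does not say how the scores are normalized. `rcnn_head` is a hypothetical name.

```python
import numpy as np


def rcnn_head(proposal_feats, w_reg, b_reg, w_cls, b_cls):
    """proposal_feats: (N, 7, 7, 512) features after Res18-Conv5.
    Returns box regression (N, 4) as (cx, cy, w, h) and per-class
    confidences (N, num_classes)."""
    pooled = proposal_feats.mean(axis=(1, 2))     # Global Avg Pool -> (N, 512)
    boxes = pooled @ w_reg + b_reg                # regression FC -> (N, 4)
    logits = pooled @ w_cls + b_cls               # classification FC
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # assumed softmax
    return boxes, probs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((3, 7, 7, 512))   # N = 3 proposals
    # head0 classes: Background/Car/Truck/Bus -> 4 outputs
    w_reg, b_reg = rng.standard_normal((512, 4)) * 0.01, np.zeros(4)
    w_cls, b_cls = rng.standard_normal((512, 4)) * 0.01, np.zeros(4)
    boxes, probs = rcnn_head(feats, w_reg, b_reg, w_cls, b_cls)
    print(boxes.shape, probs.shape)
```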
  • the perception network can also include other heads, which can further perform 3D/Mask/Keypoint detection on the basis of detecting the detection frame.
  • The ROI-ALIGN module extracts the features of the region where each detection frame is located from the feature map output by the FPN, according to the accurate detection frames provided by the head. Assuming there are M detection frames, the feature size output by the ROI-ALIGN module is M*14*14*256. These features are first processed by Res18-Conv5 of Resnet18, giving an output feature size of M*7*7*512, and then pass through a Global Avg Pool (average pooling layer).
  • The header shown in FIG. 7 is only one implementation and does not constitute a limitation on the present application.
  • Target detection may be performed on the first image to obtain the first confidence level of the detection frame of the target object in the first image.
  • Image segmentation may be performed on the image in the detection frame to obtain the segmentation map corresponding to the target object.
  • For example, image segmentation may be performed on the image in the detection frame to obtain an initial segmentation map corresponding to the target object. The initial segmentation map includes multiple pixels and, for each pixel, the probability that it belongs to each category. The maximum of these per-category probabilities is taken as the value of that pixel in the segmentation map.
  • The initial segmentation map may be represented as a tensor of shape (N, C, H, W), where N represents the number of pictures, C represents the number of segmentation categories, and H and W represent the height and width of the picture, respectively.
  • The value at each position (n, c, h, w) lies between 0 and 1. The larger the value, the greater the probability that the pixel belongs to category c.
  • Taking the maximum over the category dimension yields a maximum-value matrix of shape (N, H, W) and a maximum-index matrix of shape (N, H, W).
  • The maximum-index matrix (N, H, W) can be used as a segmentation map.
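  • The max/argmax reduction described above maps directly onto NumPy. The sketch below assumes the probabilities are stored in (N, C, H, W) layout. `segmentation_maps` is a hypothetical helper name.

```python
import numpy as np


def segmentation_maps(prob):
    """prob: (N, C, H, W) per-pixel class probabilities in [0, 1].
    Returns (max_value, max_index), each of shape (N, H, W)."""
    max_value = prob.max(axis=1)     # highest class probability per pixel
    max_index = prob.argmax(axis=1)  # category c achieving that probability
    return max_value, max_index
```

The `max_value` array carries how certain each pixel's label is, and the later quality measures are computed from it. The `max_index` array is the label map that serves as the segmentation map.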
  • The image segmentation quality of the segmentation map can be obtained, and the first confidence level of the detection frame can be adjusted based on it. The adjusted first confidence level then indicates both the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map.
  • the image segmentation quality is used to indicate the boundary definition of the target object in the segmentation map, wherein the higher the boundary definition is, the higher the image segmentation quality is.
  • The value of each pixel in the segmentation map can be between 0 and 1. Pixels in the background area have values of 0 or close to 0, pixels in the boundary area of the target object have values around 0.5, and most pixels inside the human body have values of 1 or close to 1. This is shown in Figure 8, where (a) is the input image, (b) is the probability map, (c) is the segmentation map, and (d), (e), and (f) show the pixels whose values are less than 0.4, 0.6, and 0.8, respectively. When the segmentation map is used, it can be binarized with a threshold of 0.5, so errors occur at the edges. Ideally, the fewer pixels near the edge of the human body have values around 0.5, the better the quality of the segmentation map.
  • The segmentation map includes the object area where the target object is located, the background area of the target object, and the boundary area between the object area and the background area. These areas are determined as follows:
  • the object area is the area of pixels whose values are greater than the foreground threshold;
  • the background area is the area of pixels whose values are less than the background threshold;
  • the boundary area is the area of pixels whose values are greater than the background threshold and less than the foreground threshold.
  • For example, the background threshold can be 0.25, so pixels in the background area have values less than 0.25. The foreground threshold can be 0.7, so pixels in the object area have values greater than 0.7. Pixels in the boundary area then have values greater than 0.25 and less than 0.7.
  • The ratio of the number of pixels in the object area to the number of pixels in the target area may be used as the image segmentation quality of the segmentation map, where the target area is the union of the object area and the boundary area. That is, the smaller the boundary area (the fewer pixels it contains), the higher the image segmentation quality.
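  • The image segmentation quality can be calculated as follows:
    $$Q = \frac{\sum_{i} \mathbb{1}\left[p_i > \text{high\_fg\_threshold}\right]}{\sum_{i} \mathbb{1}\left[p_i > \text{low\_fg\_threshold}\right]}$$
  • where p_i is the value of pixel i in the segmentation map, high_fg_threshold represents the foreground threshold, and low_fg_threshold represents the background threshold.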
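  • The area-ratio quality measure can be computed in a few lines of NumPy. This sketch uses the example thresholds 0.7 (foreground) and 0.25 (background) given earlier. Returning 0.0 when the map has no pixel above the background threshold is an assumed convention that the text does not specify. `segmentation_quality` is a hypothetical name.

```python
import numpy as np


def segmentation_quality(seg, high_fg_threshold=0.7, low_fg_threshold=0.25):
    """seg: array of per-pixel foreground probabilities in [0, 1].
    Quality = |object area| / |object area ∪ boundary area|."""
    object_area = seg > high_fg_threshold
    boundary_area = (seg > low_fg_threshold) & (seg < high_fg_threshold)
    target_area = object_area | boundary_area
    n_target = target_area.sum()
    if n_target == 0:
        return 0.0  # assumed convention: no foreground means zero quality
    return float(object_area.sum() / n_target)
```

For a map with three confidently foreground pixels and one pixel at 0.5, the quality is 3/4. Shrinking the blurry boundary band pushes the value toward 1.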
  • In another possible implementation, the pixel values of some of the pixels in the segmentation map may be acquired, namely the pixels whose values are greater than a preset value. Here the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1. The image segmentation quality of the segmentation map is then determined from these pixel values: the larger their average value, the higher the image segmentation quality.
  • For example, the high-foreground area can be obtained, the mean of the maximum-value matrix over this area computed, and a vector of dimension N returned, with one value per picture.
  • A per-category confidence can also be calculated. All C categories are traversed. For each category, its area is obtained from the maximum-value matrix, the foreground area within that category is taken, and the mean value over it is used as the confidence of the category. Finally, an N*C matrix is returned.
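  • The per-picture and per-category means can be sketched as follows. The sketch assumes the "high-foreground area" means pixels whose maximum probability exceeds the preset value (0.5 here). It also assumes each category's area is selected with the maximum-index matrix. Pictures or categories with no qualifying pixels get 0. `image_confidence` and `class_confidence` are hypothetical names.

```python
import numpy as np


def image_confidence(max_value, preset=0.5):
    """max_value: (N, H, W). Returns an (N,) vector: the mean of max_value
    over pixels above `preset`, one value per picture."""
    n = max_value.shape[0]
    flat = max_value.reshape(n, -1)
    mask = flat > preset
    counts = mask.sum(axis=1)
    sums = np.where(mask, flat, 0.0).sum(axis=1)
    return np.divide(sums, counts, out=np.zeros(n), where=counts > 0)


def class_confidence(max_value, max_index, num_classes, preset=0.5):
    """Returns an (N, C) matrix: for each picture and category c, the mean of
    max_value over pixels labelled c whose value exceeds `preset`."""
    n = max_value.shape[0]
    conf = np.zeros((n, num_classes))
    for c in range(num_classes):
        mask = (max_index == c) & (max_value > preset)
        counts = mask.reshape(n, -1).sum(axis=1)
        sums = np.where(mask, max_value, 0.0).reshape(n, -1).sum(axis=1)
        conf[:, c] = np.divide(sums, counts, out=np.zeros(n), where=counts > 0)
    return conf
```

Passing `out=` and `where=` to `np.divide` avoids divide-by-zero warnings for empty masks. The entries for empty masks keep their initial value of 0.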
  • Alternatively, the initial pixel values of some of the pixels in the segmentation map may be acquired, namely the pixels whose initial values are greater than a preset value. Here the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1. The initial pixel values are adjusted according to a target mapping relationship to obtain adjusted pixel values, where the target mapping relationship represents the mapping between initial and adjusted pixel values.
  • The image segmentation quality of the segmentation map is then determined from the adjusted pixel values: the larger the average of the adjusted pixel values of the pixels in the segmentation map, the higher the image segmentation quality.
  • the pixel values of the pixel points in the segmentation map are mapped based on a preset mapping relationship, wherein the image segmentation quality of the segmentation map is expressed based on the average value of the mapped pixel values, which is equivalent to mapping the image segmentation quality of the segmentation map.
  • the latter pixel value is used as the weight when calculating the image segmentation quality of the pixel.
  • the larger the weight, the more the pixel contributes to raising the image segmentation quality; the smaller the weight, the more it contributes to lowering it.
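  • the weight assigned to this portion of pixels (those with values less than the foreground threshold and greater than or equal to the background threshold) should therefore be very small.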
  • for initial pixel values near the preset value, the assigned weight is close to 0.
  • as the initial pixel value increases, the assigned weight gradually increases, and when the initial pixel value is 1, the assigned weight reaches 1.
  • as the initial pixel value increases, the slope of the target mapping relationship may gradually increase, so the target mapping relationship becomes increasingly steep.
  • the target mapping relationship may be a preset functional relationship.
  • FIG. 9 is a schematic diagram of various target mapping relationships in this embodiment. The abscissa in FIG. 9 is the initial pixel value (the minimum initial pixel value is greater than the preset value, which is 0.5 in FIG. 9), and the ordinate is a value from 0 to 1. For the lower two of the functional relationships shown in FIG. 9, the slope of the target mapping relationship becomes larger as the initial pixel value increases.
  • for the functional relationship ranked second from the top, the slope of the target mapping relationship remains constant as the initial pixel value increases.
  • the image segmentation quality of the segmentation map may be determined according to the adjusted pixel values of the pixels in the segmentation map, where the larger the average of these adjusted pixel values, the higher the image segmentation quality.
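  • the image segmentation quality Q of the segmentation map can be calculated by the following formula, where $x_p$ is the initial pixel value of pixel $p$: $Q = \sum_{p:\,x_p > 0.5} F(x_p) \,/\, \lvert\{p : x_p > 0.5\}\rvert$, in which: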
  • F is the target mapping relationship, which maps each pixel value in the interval [0.5, 1.0] (that is, each pixel whose value is greater than 0.5) to another value.
  • the key point is that values closer to 1 are given higher mapped values. The numerator is the sum of the adjusted pixel values of the pixels in the segmentation map, and the denominator is the number of pixels whose initial pixel value is greater than the preset value (e.g., 0.5).
  • the first confidence level may be adjusted based on the image segmentation quality to determine the second confidence level of the segmentation map.
  • the product of the image segmentation quality and the first confidence level may be determined as the second confidence level of the segmentation map.
  • the second confidence level of the segmentation map can indicate both the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map, where the higher the positioning accuracy, the higher the second confidence level.
  • the terminal device may display an AR object around the target object in the first image based on the second confidence level being higher than a threshold; specifically, the AR object may be displayed around the target object based on the 3D position information of the target object.
  • the terminal device may replace the target object in the first image with a first object that is different from the target object, based on the second confidence level being higher than a threshold. The first object is the background area at the target object's location in other image frames (that is, area A is where the target object is located in the first image, whereas in other image frames area A is background because the target object has moved to another location in the image or is outside the image).
  • to train the neural network for target detection, first label the portrait segmentation data (I) and the 3D information data of the portrait (V) to obtain a dataset, then train the joint network: compare the model output with the ground-truth labels, compute the difference, and backpropagate it to update the network parameters until the preset number of training iterations is reached. During inference, the RGB video stream is fed into the joint network; for each frame, the 3D information of the frame and the segmentation mask (also called the segmentation map) are obtained, and it is determined whether the frame should display the AR wing effect.
  • the segmentation map confidence determination method of the embodiment corresponding to FIG. 5 can be used to estimate a confidence level for the segmentation mask. If the confidence level is greater than 0.6, the AR wing effect is displayed for that frame. If it is determined that the AR wing effect is to be output, the 3D information is used to reconstruct the wing position, and virtual-real occlusion is then performed with the segmentation mask.
  • An embodiment of the present application provides a method for determining a confidence level of a segmentation map.
  • the method includes: acquiring a first image; performing target detection on the first image to acquire a detection frame of a target object in the first image and a first confidence level of the detection frame, where the first confidence level indicates the positioning accuracy of the detection frame; performing image segmentation on the image in the detection frame to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map; and adjusting the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
  • the first confidence level of the detection frame is adjusted using the image segmentation quality of the segmentation map to obtain the second confidence level of the segmentation map, so that the second confidence level reflects not only the positioning accuracy of the detection frame but also the segmentation quality of the segmentation map itself, yielding a more accurate segmentation map confidence.
  • the method in this embodiment does not need an additional network to calculate the confidence level of the segmentation map: the confidence level can be obtained directly from the target detection result and the segmentation result without much extra computation, which is friendlier to the terminal side and makes the solution easy to deploy there.
  • FIG. 10 is a schematic structural diagram of an apparatus for determining a confidence level of a segmentation map provided by an embodiment of the present application.
  • the device 1000 for determining a confidence level of a segmentation map provided by an embodiment of the present application includes:
  • an acquisition module 1001 configured to acquire a first image;
  • for a specific description of the acquisition module 1001, refer to the description of step 501; details are not repeated here.
  • a target detection module 1002 configured to perform target detection on the first image to obtain a detection frame of the target object in the first image and a first confidence level of the detection frame, where the first confidence level is used to indicate the positioning accuracy of the detection frame;
  • for a specific description of the target detection module 1002, refer to the description of step 502; details are not repeated here.
  • An image segmentation module 1003 configured to perform image segmentation on the image in the detection frame, to obtain a segmentation map corresponding to the target object and to obtain the image segmentation quality of the segmentation map;
  • for a specific description of the image segmentation module 1003, refer to the description of step 503; details are not repeated here.
  • a confidence level determination module 1004 configured to adjust the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
  • for a specific description of the confidence level determination module 1004, refer to the description of step 504; details are not repeated here.
  • the image segmentation quality is used to indicate the boundary definition of the target object in the segmentation map, wherein the higher the boundary definition is, the higher the image segmentation quality is.
  • the second confidence level of the segmentation map is used to indicate the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map, where the higher the positioning accuracy, the higher the second confidence level.
  • the segmentation map includes an object area where the target object is located, a background area of the target object, and a boundary area between the object area and the background area, and the acquisition module is configured to:
  • the ratio of the number of pixels included in the object area to the number of pixels included in the target area is used as the image segmentation quality of the segmentation map, where the target area is the union of the object area and the boundary area.
  • the acquisition module is configured to:
  • determine, as the object area, the area in which pixels of the segmentation map with values greater than the foreground threshold are located;
  • determine, as the background area, the area in which pixels of the segmentation map with values less than the background threshold are located;
  • determine, as the boundary area, the area in which pixels of the segmentation map with values greater than the background threshold and less than the foreground threshold are located.
  • the image segmentation module is configured to perform image segmentation on the image in the detection frame to obtain an initial segmentation map corresponding to the target object, where the initial segmentation map includes a plurality of pixels and the probability of each pixel belonging to each category, and to use, for each pixel, the maximum of these probabilities as that pixel's value in the segmentation map;
  • the image segmentation quality of the segmentation map is determined according to the pixel values of some pixels in the segmentation map, where the larger the average of the adjusted pixel values of the pixels in the segmentation map, the higher the image segmentation quality.
  • the image segmentation module is configured to acquire initial pixel values of some pixels in the segmentation map, where the initial pixel values of these pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
  • the initial pixel values are adjusted according to the target mapping relationship to obtain adjusted pixel values of the pixels in the segmentation map, where the target mapping relationship represents the mapping between the initial pixel values and the adjusted pixel values;
  • the image segmentation quality of the segmentation map is determined according to the adjusted pixel values of the pixels in the segmentation map, where the larger the average of the adjusted pixel values, the higher the image segmentation quality.
  • the confidence level determination module is configured to determine the product of the image segmentation quality and the first confidence level as the second confidence level of the segmentation map.
  • the apparatus further includes:
  • an image processing module configured to display an AR object around the target object in the first image based on the second confidence level being higher than a threshold;
  • or, based on the second confidence level being higher than the threshold, replace the target object in the first image with a first object, the first object being different from the target object.
  • An embodiment of the present application provides an apparatus for determining a confidence level of a segmentation map.
  • the apparatus includes: an acquisition module for acquiring a first image; a target detection module for performing target detection on the first image to acquire the detection frame of the target object in the first image and the first confidence level of the detection frame, where the first confidence level indicates the positioning accuracy of the detection frame; an image segmentation module for performing image segmentation on the image in the detection frame to obtain the segmentation map corresponding to the target object and the image segmentation quality of the segmentation map; and a confidence level determination module configured to adjust the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
  • FIG. 11 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1100 is implemented by one or more servers and may vary considerably in configuration or performance. It may include one or more central processing units (CPUs) 1111 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) that store application programs 1142 or data 1144.
  • the memory 1132 and the storage medium 1130 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 1111 may be configured to communicate with the storage medium 1130 to execute a series of instruction operations in the storage medium 1130 on the server 1100 .
  • the server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the server may execute the method for determining the confidence level of the segmentation map in the embodiment corresponding to FIG. 5 .
  • the segmentation map confidence level determination apparatus 1000 described in FIG. 10 may be a module in a server, and a processor in the server may execute the segmentation map confidence level determination method performed by the segmentation map confidence level determination apparatus 1000 .
  • Embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned server.
  • Embodiments of the present application further provide a computer-readable storage medium storing a program for signal processing; when the program runs on a computer, the computer performs the steps performed by the aforementioned execution device, or the steps performed by the aforementioned training device.
  • the server or terminal device provided in the embodiments of the present application may specifically be a chip, and the chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer executable instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiments, or the chip in the training device executes the data processing method described in the above embodiment.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may alternatively be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
  • FIG. 12 is a schematic structural diagram of a chip provided by an embodiment of the application.
  • the chip may be implemented as a neural network processor NPU 1200; the NPU 1200 is mounted to the host CPU as a coprocessor, and the host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 1203, which is controlled by the controller 1204 to extract the matrix data in the memory and perform multiplication operations.
  • the NPU 1200 can implement the model training method provided in the embodiment described in FIG. 6 through cooperation among its internal components, or perform inference with the trained model.
  • the operation circuit 1203 in the NPU 1200 may perform the steps of acquiring the first neural network model and performing model training on the first neural network model.
  • the arithmetic circuit 1203 in the NPU 1200 includes multiple processing units (Process Engine, PE).
  • in some implementations, the arithmetic circuit 1203 is a two-dimensional systolic array.
  • the arithmetic circuit 1203 may alternatively be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition.
  • in some implementations, the arithmetic circuit 1203 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 1202 and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A from the input memory 1201, performs a matrix operation with matrix B, and stores partial or final results of the resulting matrix in the accumulator 1208.
  • Unified memory 1206 is used to store input data and output data.
  • weight data is transferred directly to the weight memory 1202 through the direct memory access controller (DMAC) 1205.
  • Input data is also moved to unified memory 1206 via the DMAC.
  • the bus interface unit (BIU) 1210 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1209.
  • the bus interface unit 1210 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1209 to obtain instructions from the external memory, and also for the storage unit access controller 1205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1206 , the weight data to the weight memory 1202 , or the input data to the input memory 1201 .
  • the vector calculation unit 1207 includes a plurality of operation processing units and, if necessary, further processes the output of the operation circuit 1203, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for network computation of non-convolutional/fully connected layers in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1207 can store the vector of the processed output to the unified memory 1206 .
  • the vector calculation unit 1207 can apply a linear or nonlinear function to the output of the operation circuit 1203, for example performing linear interpolation on feature planes extracted by a convolutional layer, or applying the function to a vector of accumulated values to generate activation values.
  • the vector computation unit 1207 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to arithmetic circuitry 1203, eg, for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 1209 connected to the controller 1204 is used to store the instructions used by the controller 1204.
  • the unified memory 1206, the input memory 1201, the weight memory 1202 and the instruction fetch memory 1209 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
  • the apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave).
  • the computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a training device or a data center, that integrates one or more usable media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

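The present application discloses a method for determining a confidence level of a segmentation map. The method includes: obtaining the confidence level of a detection frame of a target object in an image, obtaining the image segmentation quality of the segmentation map of the target object, and adjusting the foregoing confidence level according to the image segmentation quality to obtain an adjusted confidence level that serves as the confidence level of the segmentation map. Because the confidence level of the detection frame is adjusted using the image segmentation quality of the segmentation map, the resulting confidence level of the segmentation map incorporates the segmentation quality information of the segmentation map itself, which improves the accuracy of the segmentation map confidence level.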

Description

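Method and Apparatus for Determining Confidence Level of Segmentation Map
This application claims priority to Chinese Patent Application No. 202110221912.1, filed with the Chinese Patent Office on February 27, 2021 and entitled "Method and Apparatus for Determining Confidence Level of Segmentation Map", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of image processing, and in particular, to a method and an apparatus for determining a confidence level of a segmentation map.
Background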
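Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military affairs. It is the study of how to use cameras/video cameras and computers to obtain the data and information about a photographed subject that people need. Figuratively speaking, it equips a computer with eyes (cameras/video cameras) and a brain (algorithms) to identify, track, and measure targets in place of human eyes, so that the computer can perceive its environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems in place of visual organs to obtain input information, and then uses a computer in place of the brain to process and interpret that input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously. Image segmentation is a technique commonly used in computer vision.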
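Currently, instance-related segmentation algorithms based on deep learning are widely used in industry. Generally, a target detection network outputs the category, detection frame, and segmentation map of a target object, and each detection frame has a corresponding confidence level. In practice, the confidence level is used to select which detection frames are retained for subsequent use. Some applications also require the confidence level of the segmentation map. For example, in AR special-effect display, the AR effect should be displayed on objects whose segmentation maps have a high confidence level and not on objects whose segmentation maps have a low confidence level.
In existing implementations, the confidence level of the detection frame is directly reused as the confidence level of the segmentation map. However, the confidence level of the detection frame represents only the localization confidence of the detection frame and cannot adequately express the image segmentation quality of the segmentation map.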
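Summary
This application provides a method, an apparatus, and the like for determining a confidence level of a segmentation map, to improve the accuracy of the segmentation map confidence level. The content of the invention is introduced below from different aspects; the implementations and beneficial effects of these aspects may be cross-referenced.
According to a first aspect, this application provides a method for determining a confidence level of a segmentation map. The method includes:
obtaining a first image. In one implementation, the terminal device may obtain an input video stream captured by a photographing apparatus of the terminal device; specifically, in an AR special-effect display scenario, the video stream may be a real-time video stream captured by the photographing apparatus of the terminal device. The video stream includes the first image (also referred to as the first image frame). In one implementation, a user may select a video stream on the terminal device, where the selected video stream includes the first image. In one implementation, the terminal device may obtain an input first image captured by the photographing apparatus of the terminal device. In one implementation, the user may select the first image on the terminal device (for example, from an album on the terminal device or a cloud-side album);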
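performing target detection on the first image to obtain a detection frame of a target object in the first image and a first confidence level of the detection frame, where the first confidence level indicates the positioning accuracy of the detection frame;
A pre-trained neural network model may be used to perform target detection on the first image to obtain a target detection result of the target object, where the target detection result may include the detection frame of the target object; correspondingly, the first confidence level indicates the positioning accuracy of the detection frame.
performing image segmentation on the image within the detection frame to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map;
A pre-trained neural network model may also be used to perform image segmentation on the image within the detection frame to obtain the segmentation map corresponding to the target object.
adjusting the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.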
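To calculate the confidence level of the segmentation map, the image segmentation quality of the segmentation map may be obtained and used to adjust the first confidence level of the detection frame, so that the adjusted first confidence level indicates not only the positioning accuracy of the detection frame but also the image segmentation quality of the segmentation map. In one implementation, the image segmentation quality indicates the boundary clarity of the target object in the segmentation map, where the higher the boundary clarity, the higher the image segmentation quality.
In this embodiment, the first confidence level of the detection frame is adjusted using the image segmentation quality of the segmentation map to obtain the second confidence level, which serves as the confidence level of the segmentation map. The second confidence level therefore reflects not only the positioning accuracy of the detection frame but also the segmentation quality of the segmentation map itself, yielding a more accurate segmentation map confidence level. Moreover, the method in this embodiment does not require an additional network to calculate the confidence level of the segmentation map: an accurate segmentation map confidence level can be obtained directly from the target detection result and the segmentation result without much additional computation, which is friendlier to the terminal side and makes the solution easy to deploy there.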
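The steps above can be sketched as a small Python pipeline. This is a minimal sketch under stated assumptions, not the applicant's implementation: `Detection`, `segmentation_map_confidence`, and the `detect`, `segment`, and `quality` callables are hypothetical stand-ins for the pre-trained networks and the quality measure, the segmentation map is flattened to a list of per-pixel values in [0, 1], and the adjustment uses the product rule that the application gives as one possible implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format (x0, y0, x1, y1); the application does not fix one.
Box = Tuple[int, int, int, int]
Image = Sequence[Sequence[float]]


@dataclass
class Detection:
    box: Box
    first_confidence: float  # indicates the positioning accuracy of the box


def segmentation_map_confidence(
    image: Image,
    detect: Callable[[Image], List[Detection]],
    segment: Callable[[Image, Box], List[float]],
    quality: Callable[[List[float]], float],
) -> List[Tuple[Detection, float]]:
    """Return (detection, second_confidence) for every detected target object."""
    results = []
    for det in detect(image):
        # Segment only the image inside the detection frame; values lie in [0, 1].
        seg_map = segment(image, det.box)
        q = quality(seg_map)  # image segmentation quality, assumed to lie in [0, 1]
        # Product rule: higher segmentation quality gives a higher second confidence.
        results.append((det, q * det.first_confidence))
    return results


if __name__ == "__main__":
    # Stub models standing in for the pre-trained detection/segmentation networks.
    img = [[0.0, 1.0], [1.0, 0.0]]
    dets = segmentation_map_confidence(
        img,
        detect=lambda im: [Detection((0, 0, 2, 2), 0.9)],
        segment=lambda im, box: [0.95, 0.9, 0.4, 0.05],
        quality=lambda seg: 0.5,
    )
    print(dets[0][1])
```

No extra network is involved: the second confidence is computed from outputs the detector and segmenter already produce, which is the property the paragraph above emphasizes.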
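In a possible implementation, the image segmentation quality indicates the boundary clarity of the target object in the segmentation map, where the higher the boundary clarity, the higher the image segmentation quality.
Taking a person as an example of the target object, the value of each pixel in the segmentation map may lie between 0 and 1: pixels in the background region have values of 0 or close to 0, and pixels in the boundary region of the target object have values around 0.5. When the segmentation map is used, it may be binarized with a threshold of 0.5, so errors arise at the edges. Ideally, therefore, the fewer pixels with values near 0.5 a segmentation map has at the edge of the human body, the better the quality of that segmentation map.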
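The intuition in the preceding paragraph can be illustrated with a short Python sketch. It is an illustration only: the segmentation maps are made-up flattened lists of per-pixel values, and `near_threshold_fraction` with its `margin` parameter is a hypothetical helper for counting pixels "near 0.5", not a measure defined in the application. Treating a value exactly equal to 0.5 as foreground is also an assumption.

```python
from typing import List, Sequence


def binarize(seg_map: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Binarize per-pixel values; values >= threshold count as foreground (assumption)."""
    return [1 if p >= threshold else 0 for p in seg_map]


def near_threshold_fraction(
    seg_map: Sequence[float], threshold: float = 0.5, margin: float = 0.2
) -> float:
    """Fraction of pixels whose value lies within `margin` of the threshold."""
    if not seg_map:
        return 0.0
    ambiguous = sum(1 for p in seg_map if abs(p - threshold) < margin)
    return ambiguous / len(seg_map)


sharp = [0.0, 0.02, 0.97, 1.0]   # clear boundary: values near 0 or 1
blurry = [0.0, 0.45, 0.55, 1.0]  # fuzzy boundary: values near 0.5

# Both maps binarize to the same mask [0, 0, 1, 1] ...
assert binarize(sharp) == binarize(blurry) == [0, 0, 1, 1]
# ... but half of the blurry map's pixels sit close to the threshold, where a
# small change in the predicted value would flip the binary decision.
print(near_threshold_fraction(sharp), near_threshold_fraction(blurry))  # prints 0.0 0.5
```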
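In a possible implementation, the second confidence level of the segmentation map indicates the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map, where the higher the positioning accuracy, the higher the second confidence level.
In a possible implementation, the segmentation map includes an object region where the target object is located, a background region of the target object, and a boundary region between the object region and the background region, and the obtaining the image segmentation quality of the segmentation map includes:
using the ratio of the number of pixels in the object region to the number of pixels in a target region as the image segmentation quality of the segmentation map, where the target region is the union of the object region and the boundary region.
In this embodiment of this application, the ratio of the number of pixels in the object region to the number of pixels in the target region may be used as the image segmentation quality of the segmentation map, where the target region is the union of the object region and the boundary region. In other words, the smaller the boundary region (that is, the fewer pixels it contains), the higher the image segmentation quality.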
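In a possible implementation, the method further includes:
determining, as the object region, the region in which pixels of the segmentation map with values greater than a foreground threshold are located;
determining, as the background region, the region in which pixels of the segmentation map with values less than a background threshold are located;
determining, as the boundary region, the region in which pixels of the segmentation map with values greater than the background threshold and less than the foreground threshold are located.
Specifically, the segmentation map includes the object region where the target object is located, the background region of the target object, and the boundary region between the object region and the background region. The region of pixels with values greater than the foreground threshold may be determined as the object region, the region of pixels with values less than the background threshold as the background region, and the region of pixels with values greater than the background threshold and less than the foreground threshold as the boundary region. For example, if the background threshold is 0.25, pixels in the background region have values less than 0.25; if the foreground threshold is 0.7, pixels in the foreground region have values greater than 0.7; pixels in the boundary region then have values greater than 0.25 and less than 0.7.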
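The region split and the ratio-based quality of the preceding implementations can be sketched in Python as follows. This is a minimal sketch, assuming a flattened list of per-pixel values; the function names are hypothetical, the thresholds default to the example values 0.25 and 0.7 given above, and returning 0.0 when the target region is empty is an assumption the application does not address. Because the text uses strict inequalities, a pixel exactly equal to either threshold falls into none of the three regions here.

```python
from typing import Sequence, Tuple

BACKGROUND_THRESHOLD = 0.25  # example value from the description
FOREGROUND_THRESHOLD = 0.7   # example value from the description


def region_counts(
    seg_map: Sequence[float],
    bg_thr: float = BACKGROUND_THRESHOLD,
    fg_thr: float = FOREGROUND_THRESHOLD,
) -> Tuple[int, int, int]:
    """Count pixels in the (object, boundary, background) regions."""
    obj = sum(1 for p in seg_map if p > fg_thr)
    boundary = sum(1 for p in seg_map if bg_thr < p < fg_thr)
    background = sum(1 for p in seg_map if p < bg_thr)
    return obj, boundary, background


def ratio_quality(
    seg_map: Sequence[float],
    bg_thr: float = BACKGROUND_THRESHOLD,
    fg_thr: float = FOREGROUND_THRESHOLD,
) -> float:
    """Image segmentation quality = |object region| / |object ∪ boundary region|."""
    obj, boundary, _ = region_counts(seg_map, bg_thr, fg_thr)
    target = obj + boundary  # the two regions are disjoint, so the counts add up
    return obj / target if target else 0.0


seg = [0.9, 0.8, 0.5, 0.1, 0.05]
print(region_counts(seg))  # (2, 1, 2)
print(ratio_quality(seg))  # 2 object pixels out of 3 target pixels
```

A thinner boundary region shrinks the denominator toward the numerator, so the ratio approaches 1, matching the statement that fewer boundary pixels mean higher quality.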
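In a possible implementation, the obtaining the image segmentation quality of the segmentation map includes:
obtaining initial pixel values of some pixels in the segmentation map, where the initial pixel values of these pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
adjusting the initial pixel values according to a target mapping relationship to obtain adjusted pixel values of the pixels in the segmentation map, where the target mapping relationship represents the mapping between initial pixel values and adjusted pixel values; in the target mapping relationship, as the initial pixel value changes from the preset value to 1, the adjusted pixel value changes gradually from 0 to 1, and as the initial pixel value increases, the slope of the target mapping relationship stays constant or increases;
determining the image segmentation quality of the segmentation map according to the adjusted pixel values of the pixels in the segmentation map, where the larger the average of the adjusted pixel values, the higher the image segmentation quality.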
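Written as a formula, the quality measure described in the three steps above takes the following form. The notation is an assumption: $x_p$ is the initial value of pixel $p$, $\tau$ the preset value, $F$ the target mapping relationship, and $P$ the set of pixels whose initial value exceeds $\tau$. The power family for $F$ on the last line is only one illustrative choice that satisfies the stated constraints and is not taken from the application; $\gamma = 1$ gives a constant slope and $\gamma > 1$ a slope that increases with $x$.

```latex
P = \{\, p : x_p > \tau \,\}, \qquad
Q = \frac{1}{\lvert P \rvert} \sum_{p \in P} F(x_p)

F(\tau) = 0, \qquad F(1) = 1, \qquad F' \ \text{constant or non-decreasing on } [\tau, 1]

\text{e.g. } F(x) = \left( \frac{x - \tau}{1 - \tau} \right)^{\gamma}, \qquad \gamma \ge 1
```

Since $F$ maps $[\tau, 1]$ into $[0, 1]$, $Q$ also lies in $[0, 1]$, so multiplying it with a first confidence level in $[0, 1]$ keeps the result in the same range.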
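In this embodiment, the pixel values of the pixels in the segmentation map are mapped according to a preset mapping relationship, and the image segmentation quality of the segmentation map is expressed by the average of the mapped pixel values. This is equivalent to using each mapped pixel value as the weight of its pixel when calculating the image segmentation quality: a larger weight contributes more to raising the image segmentation quality, and a smaller weight contributes more to lowering it. Specifically, pixels with values less than the foreground threshold and greater than or equal to the background threshold are what lower the image segmentation quality of the segmentation map, so the weights assigned to these pixels should be very small, for example 0 or a value close to 0. Therefore, as the initial pixel value changes from the preset value to 1, the adjusted pixel value changes gradually from 0 to 1, and as the initial pixel value increases, the slope of the target mapping relationship stays constant or increases. That is, initial pixel values near the preset value are assigned weights close to 0; as the initial pixel value increases, the assigned weight gradually increases, reaching 1 when the initial pixel value is 1.
In this embodiment, as the initial pixel value increases, the slope of the target mapping relationship may increase gradually, so the target mapping relationship becomes increasingly steep.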
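A minimal Python sketch of the mapped-average quality follows. The power mapping and its `gamma` parameter are hypothetical choices, not the mapping used in the application (FIG. 9 shows several candidates whose exact form is not given in this text); any function that rises from 0 at the preset value to 1 at 1 with a constant or increasing slope fits the description. Returning 0.0 when no pixel exceeds the preset value is also an assumption.

```python
from typing import Sequence


def power_mapping(x: float, preset: float = 0.5, gamma: float = 2.0) -> float:
    """Hypothetical target mapping F with F(preset) = 0 and F(1) = 1.

    gamma == 1 gives a straight line (constant slope); gamma > 1 gives a curve
    whose slope grows as x approaches 1, i.e. an increasingly steep mapping.
    """
    if not 0.0 <= preset < 1.0 or gamma < 1.0:
        raise ValueError("need 0 <= preset < 1 and gamma >= 1")
    # Normalize [preset, 1] to [0, 1]; clamp so out-of-range inputs stay in [0, 1].
    t = min(max((x - preset) / (1.0 - preset), 0.0), 1.0)
    return t ** gamma


def mapped_quality(
    seg_map: Sequence[float], preset: float = 0.5, gamma: float = 2.0
) -> float:
    """Average of F(x) over pixels whose initial value x exceeds the preset value."""
    selected = [x for x in seg_map if x > preset]
    if not selected:
        return 0.0  # assumption: no confident pixels means lowest quality
    return sum(power_mapping(x, preset, gamma) for x in selected) / len(selected)


sharp = [0.99, 0.98, 0.97, 0.02]  # foreground pixels close to 1
blurry = [0.75, 0.7, 0.6, 0.4]    # many foreground pixels just above 0.5
print(mapped_quality(sharp) > mapped_quality(blurry))  # True
```

Each selected pixel's mapped value acts as its weight in the average: a pixel at 0.75 contributes only 0.25 with `gamma=2.0`, while a pixel at 1.0 contributes 1.0, so values just above the preset value pull the quality down.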
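In a possible implementation, the adjusting the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map includes:
determining the product of the image segmentation quality and the first confidence level as the second confidence level of the segmentation map.
In this way, the second confidence level of the segmentation map indicates both the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map: the higher the positioning accuracy of the detection frame, the higher the second confidence level, and the higher the image segmentation quality, the higher the second confidence level.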
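The product rule is small enough to show directly. In this Python sketch the function name and the range check are assumptions (the application does not say what happens outside [0, 1]); the calls illustrate that, for a fixed detection frame, a better segmentation yields a higher second confidence level, and for a fixed segmentation quality, a better-localized frame does too.

```python
def second_confidence(first_confidence: float, quality: float) -> float:
    """Second confidence of a segmentation map = quality x first confidence."""
    for name, value in (("first_confidence", first_confidence), ("quality", quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return quality * first_confidence


# Same detection frame (first confidence 0.9), two segmentation qualities:
print(second_confidence(0.9, 0.8))  # about 0.72
print(second_confidence(0.9, 0.5))  # about 0.45
```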
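In a possible implementation, after the obtaining a second confidence level of the segmentation map, the method further includes:
displaying an AR object around the target object in the first image based on the second confidence level being higher than a threshold; or
replacing the target object in the first image with a first object based on the second confidence level being higher than a threshold, where the first object is different from the target object.
In an AR special-effect display scenario, the terminal device may display an AR object around the target object in the first image based on the second confidence level being higher than the threshold; specifically, the AR object may be displayed around the target object based on the 3D position information of the target object.
In an AI passer-by removal scenario, if the target object is an identified moving passer-by, the terminal device may replace the target object with the first object based on the second confidence level being higher than the threshold, where the first object is the background region at the target object's location in other image frames (that is, region A is where the target object is located in the first image, whereas in other image frames region A is background because the target object has moved to another position in the image or is outside the image).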
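The threshold-gated replacement in the passer-by removal scenario can be sketched as follows. This is a hypothetical illustration: images are small nested lists of grayscale values, the binary mask marks region A, the background frame is assumed to be already aligned with the first image (for example, captured by a static camera), and the default threshold of 0.6 is borrowed from the AR wing example elsewhere in this document rather than prescribed for this scenario.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]


def remove_passerby(
    frame: Frame,
    background: Frame,
    mask: Sequence[Sequence[int]],
    second_confidence: float,
    threshold: float = 0.6,
) -> List[List[int]]:
    """Replace masked pixels with background pixels if the map is trusted."""
    if second_confidence <= threshold:
        # Segmentation map not trusted: leave the frame unchanged.
        return [list(row) for row in frame]
    return [
        [bg if m else px for px, bg, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(frame, background, mask)
    ]


frame = [[10, 20], [30, 40]]
background = [[1, 2], [3, 4]]  # same scene from another frame, passer-by absent
mask = [[1, 0], [0, 1]]        # region A: where the passer-by is in `frame`
print(remove_passerby(frame, background, mask, second_confidence=0.8))  # [[1, 20], [30, 4]]
print(remove_passerby(frame, background, mask, second_confidence=0.4))  # [[10, 20], [30, 40]]
```

Gating on the second confidence level rather than the detection confidence alone means a well-localized but poorly segmented passer-by is left untouched, avoiding visible cut-out artifacts.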
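According to a second aspect, this application provides an apparatus for determining a confidence level of a segmentation map. The apparatus includes:
an acquisition module, configured to obtain a first image;
a target detection module, configured to perform target detection on the first image to obtain a detection frame of a target object in the first image and a first confidence level of the detection frame, where the first confidence level indicates the positioning accuracy of the detection frame;
an image segmentation module, configured to perform image segmentation on the image within the detection frame to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map; and
a confidence level determination module, configured to adjust the first confidence level according to the image segmentation quality to obtain a second confidence level of the segmentation map, where the higher the image segmentation quality, the higher the second confidence level.
The first confidence level of the detection frame is adjusted using the image segmentation quality of the segmentation map to obtain the second confidence level of the segmentation map, so that the second confidence level reflects not only the positioning accuracy of the detection frame but also the segmentation quality of the segmentation map itself, yielding a more accurate segmentation map confidence level. Moreover, the method in this embodiment does not require an additional network to calculate the confidence level of the segmentation map: an accurate segmentation map confidence level can be obtained directly from the target detection result and the segmentation result without much additional computation, which is friendlier to the terminal side and makes the solution easy to deploy there.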
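In a possible implementation, the image segmentation quality indicates the boundary clarity of the target object in the segmentation map, where the higher the boundary clarity, the higher the image segmentation quality.
In a possible implementation, the second confidence level of the segmentation map indicates the positioning accuracy of the detection frame and the image segmentation quality of the segmentation map, where the higher the positioning accuracy, the higher the second confidence level.
In a possible implementation, the segmentation map includes an object region where the target object is located, a background region of the target object, and a boundary region between the object region and the background region, and the acquisition module is configured to:
use the ratio of the number of pixels in the object region to the number of pixels in a target region as the image segmentation quality of the segmentation map, where the target region is the union of the object region and the boundary region.
In a possible implementation, the acquisition module is configured to: determine, as the object region, the region in which pixels of the segmentation map with values greater than a foreground threshold are located; determine, as the background region, the region in which pixels of the segmentation map with values less than a background threshold are located; and determine, as the boundary region, the region in which pixels of the segmentation map with values greater than the background threshold and less than the foreground threshold are located.
In a possible implementation, the image segmentation module is configured to: perform image segmentation on the image within the detection frame to obtain an initial segmentation map corresponding to the target object, where the initial segmentation map includes a plurality of pixels and the probability of each pixel belonging to each category; use, for each pixel, the maximum of its category probabilities as the value of that pixel in the segmentation map; obtain the pixel values of some pixels in the segmentation map, where the pixel values of these pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1; and determine the image segmentation quality of the segmentation map according to the pixel values of these pixels, where the larger the average of the adjusted pixel values of the pixels in the segmentation map, the higher the image segmentation quality.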
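The step that turns per-category probabilities into the segmentation map, together with the per-image and per-category statistics mentioned earlier in this document (one value per image and an N×C matrix over a batch of N images), can be sketched in Python as below. Interpreting the "high foreground area" as the pixels whose maximum probability exceeds the preset value, and returning 0.0 for an empty selection, are assumptions; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def max_prob_map(class_probs: Sequence[Sequence[float]]) -> Tuple[List[float], List[int]]:
    """class_probs[i][c] = probability that pixel i belongs to category c.

    Returns the segmentation map (max probability per pixel) and the arg-max category.
    """
    values, labels = [], []
    for probs in class_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        values.append(probs[best])
        labels.append(best)
    return values, labels


def _mean(xs: List[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def image_confidence(values: Sequence[float], preset: float = 0.5) -> float:
    """Mean of the max-probability map over the high-foreground pixels."""
    return _mean([v for v in values if v > preset])


def category_confidences(
    values: Sequence[float], labels: Sequence[int], num_classes: int, preset: float = 0.5
) -> List[float]:
    """Per-category mean over high-foreground pixels assigned to that category."""
    return [
        _mean([v for v, lab in zip(values, labels) if lab == c and v > preset])
        for c in range(num_classes)
    ]


# One image with four pixels and two categories; a batch of N images would
# give N image confidences and an N x C matrix of category confidences.
probs = [[0.1, 0.9], [0.8, 0.2], [0.45, 0.55], [0.3, 0.7]]
values, labels = max_prob_map(probs)
print(values, labels)  # [0.9, 0.8, 0.55, 0.7] [1, 0, 1, 1]
print(image_confidence(values, preset=0.6))          # mean of 0.9, 0.8, 0.7
print(category_confidences(values, labels, 2, 0.6))  # [0.8, mean of 0.9 and 0.7]
```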
在一种可能的实现中,所述图像分割模块,用于获取所述分割图中包括的部分像素点的初始像素值,其中,所述部分像素点的初始像素值大于预设值,所述预设值小于所述前景阈值且大于或等于所述背景阈值,所述前景阈值小于1;根据目标映射关系,对所述初始像素值进行调整,以获取所述分割图中包括的像素点的调整后的像素值,其中,所述目标映射关系表示所述初始像素值与所述调整后的像素值之间的映射关系,在所述目标映射关系中,随着所述初始像素值由所述预设值变化至1,所述调整后的像素值由0逐渐变化至1,且随着所述初始像素值变大,所述目标映射关系的斜率不变或变大;根据所述分割图中包括的像素点的调整后的像素值确定所述分割图的图像分割质量,其中,所述分割图中包括的像素点的调整后的像素值的平均值越大,所述图像分割质量越高。
在一种可能的实现中,所述置信度确定模块,用于确定所述图像分割质量与所述第一置信度的乘积作为所述分割图的第二置信度。
在一种可能的实现中,所述装置还包括:图像处理模块,用于基于所述第二置信度高于阈值,在所述第一图像中的所述目标对象周围显示AR对象;或,基于所述第二置信度高于阈值,将所述第一图像中的所述目标对象替换为第一对象,所述第一对象与所述目标对象不同。
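According to a third aspect, an embodiment of this application provides a model training apparatus, which may include a memory, a processor, and a bus system. The memory is configured to store a program, and the processor is configured to execute the program in the memory to perform the method in any optional implementation of the first aspect.
According to a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program. When the computer program runs on a computer, the computer is enabled to perform the method in the first aspect, the third aspect, or any optional implementation thereof.
According to a fifth aspect, an embodiment of this application provides a computer program including code. When the code is executed, the method in any optional implementation of the first aspect is performed.
According to a sixth aspect, this application provides a chip system. The chip system includes a processor configured to support an execution device or a training device in implementing the functions in the foregoing aspects, for example, sending or processing the data or information involved in the foregoing method. In a possible design, the chip system further includes a memory configured to store program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete components.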
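Embodiments of this application provide a segmentation map confidence determining method. The method includes:
- obtaining a first image;
- performing target detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, where the first confidence indicates the localization accuracy of the detection box;
- performing image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map; and
- adjusting the first confidence based on the image segmentation quality to obtain a second confidence of the segmentation map, where the higher the image segmentation quality, the greater the second confidence.
In this way, the first confidence of the detection box is adjusted using the image segmentation quality to obtain the second confidence of the segmentation map. The second confidence therefore reflects both the localization accuracy of the detection box and the segmentation quality of the segmentation map itself, which yields a more accurate segmentation map confidence. Moreover, the method needs no additional network to compute this confidence. An accurate confidence is obtained directly from the target detection result and the segmentation result without much extra computation, which makes the solution friendly to, and easy to deploy on, the device side.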
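BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic structural diagram of a terminal according to an embodiment of this application;
FIG. 2a is a block diagram of the software structure of a terminal according to an embodiment of this disclosure;
FIG. 2b is a schematic structural diagram of a server according to an embodiment of this application;
FIG. 2c is a schematic diagram of a segmentation map confidence determining system according to an embodiment of this application;
FIG. 3 is a schematic diagram of an application scenario according to an embodiment of this application;
FIG. 4 is a schematic diagram of another application scenario according to an embodiment of this application;
FIG. 5 is a schematic flowchart of a segmentation map confidence determining method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a backbone network;
FIG. 7 is a schematic diagram of a header;
FIG. 8 is a schematic diagram of a segmentation map according to an embodiment of this application;
FIG. 9 is a schematic diagram of several target mapping relationships in this embodiment;
FIG. 10 is a schematic structural diagram of a segmentation map confidence determining apparatus according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of a server according to an embodiment of this application; and
FIG. 12 is a schematic structural diagram of a chip according to an embodiment of this application.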
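DESCRIPTION OF EMBODIMENTS
The following describes the embodiments of the present invention with reference to the accompanying drawings. Terms used in the description of embodiments are only intended to explain specific embodiments of the present invention and are not intended to limit the present invention.
The following describes embodiments of this application with reference to the accompanying drawings. A person of ordinary skill in the art may know that, as technologies develop and new scenarios emerge, the technical solutions provided in embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application distinguish between similar objects and do not necessarily describe a specific order or sequence. Terms used in this way are interchangeable in appropriate circumstances; they merely distinguish objects with the same attributes when describing embodiments of this application. In addition, the terms "include", "have", and any variants thereof are intended to cover a non-exclusive inclusion. A process, method, system, product, or device that includes a series of units is therefore not necessarily limited to those units, and may include other units not expressly listed or inherent to such a process, method, product, or device.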
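For ease of understanding, the following describes the structure of the terminal 100 provided in embodiments of this application by way of example. Refer to FIG. 1, which is a schematic structural diagram of a terminal device according to an embodiment of this application.
As shown in FIG. 1, the terminal 100 may include the following components, among others:
- a processor 110, an external memory interface 120, an internal memory 121, and a universal serial bus (USB) interface 130;
- a charging management module 140, a power management module 141, and a battery 142;
- an antenna 1, an antenna 2, a mobile communication module 150, and a wireless communication module 160;
- an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, and a headset jack 170D;
- a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identification module (SIM) card interface 195.
The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, an optical proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure shown in this embodiment of the present invention does not constitute a specific limitation on the terminal 100. In some other embodiments of this application, the terminal 100 may include more or fewer components than those shown, combine or split some components, or arrange components differently. The components shown may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, for example:
- an application processor (AP);
- a modem processor;
- a graphics processing unit (GPU);
- an image signal processor (ISP);
- a controller;
- a video codec;
- a digital signal processor (DSP);
- a baseband processor; and/or
- a neural-network processing unit (NPU).
Different processing units may be independent components or may be integrated into one or more processors.
The controller may generate an operation control signal based on an instruction operation code and a timing signal, to control instruction fetching and execution.
A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, this memory is a cache. It may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs them again, it can call them directly from this memory. This avoids repeated access and reduces the waiting time of the processor 110, which improves system efficiency.
In some embodiments, the processor 110 may include one or more of the following interfaces:
- an inter-integrated circuit (I2C) interface;
- an inter-integrated circuit sound (I2S) interface;
- a pulse code modulation (PCM) interface;
- a universal asynchronous receiver/transmitter (UART) interface;
- a mobile industry processor interface (MIPI);
- a general-purpose input/output (GPIO) interface;
- a subscriber identity module (SIM) interface; and/or
- a universal serial bus (USB) interface.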
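The I2C interface is a bidirectional synchronous serial bus consisting of a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may include a plurality of groups of I2C buses. Through different I2C bus interfaces, the processor 110 may be separately coupled to the touch sensor 180K, a charger, a flash, the camera 193, and the like. For example, the processor 110 may be coupled to the touch sensor 180K through an I2C interface and communicate with it over the I2C bus, which implements the touch function of the terminal 100.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may include a plurality of groups of I2S buses. The processor 110 may be coupled to the audio module 170 through an I2S bus to communicate with it. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, so that calls can be answered through a Bluetooth headset.
The PCM interface may also be used for audio communication; it samples, quantizes, and encodes an analog signal. In some embodiments, the audio module 170 may be coupled to the wireless communication module 160 through a PCM bus interface. In some embodiments, the audio module 170 may also transmit an audio signal to the wireless communication module 160 through the PCM interface, so that calls can be answered through a Bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communication. The bus may be bidirectional and converts data to be transmitted between serial and parallel forms. In some embodiments, the UART interface is usually used to connect the processor 110 to the wireless communication module 160. For example, the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the UART interface, so that music can be played through a Bluetooth headset.
The MIPI interface may be used to connect the processor 110 to peripheral components such as the display 194 and the camera 193. MIPI interfaces include a camera serial interface (CSI), a display serial interface (DSI), and the like. In some embodiments, the processor 110 communicates with the camera 193 through the CSI interface to implement the photographing function of the terminal 100. It communicates with the display 194 through the DSI interface to implement the display function of the terminal 100.
The GPIO interface may be configured by software, either as a control signal or as a data signal. In some embodiments, it may be used to connect the processor 110 to the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C, I2S, UART, or MIPI interface, among others.
The USB interface 130 conforms to the USB standard specification and may be, for example, a Mini USB, Micro USB, or USB Type-C interface. It may be used to connect a charger to charge the terminal 100, or to transmit data between the terminal 100 and a peripheral device. It may also connect a headset to play audio, or connect other electronic devices such as AR devices.
It can be understood that the interface connections between the modules shown in this embodiment of the present invention are merely illustrative and do not limit the structure of the terminal 100. In some other embodiments of this application, the terminal 100 may use interface connection manners different from those in the foregoing embodiment, or a combination of several of them.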
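The charging management module 140 receives charging input from a charger, which may be wireless or wired. In some wired charging embodiments, the charging management module 140 may receive the charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, it may receive wireless charging input through a wireless charging coil of the terminal 100. While charging the battery 142, the charging management module 140 may also supply power to the electronic device through the power management module 141.
The power management module 141 connects the battery 142, the charging management module 140, and the processor 110. It receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. It may also monitor parameters such as battery capacity, battery cycle count, and battery health (leakage and impedance). In some other embodiments, the power management module 141 may be disposed in the processor 110. In still other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same component.
The wireless communication function of the terminal 100 may be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 transmit and receive electromagnetic wave signals. Each antenna in the terminal 100 may cover one or more communication frequency bands. Antennas may also be multiplexed to improve antenna utilization; for example, the antenna 1 may serve as a diversity antenna of a wireless local area network. In some other embodiments, an antenna may be used together with a tuning switch.
The mobile communication module 150 may provide wireless communication solutions for the terminal 100, including 2G/3G/4G/5G. It may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
- It may receive electromagnetic waves through the antenna 1, filter and amplify them, and pass them to the modem processor for demodulation.
- It may also amplify a signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna 1.
In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the same component as at least some modules of the processor 110.
The modem processor may include a modulator and a demodulator.
- The modulator modulates a low-frequency baseband signal to be sent into a medium- or high-frequency signal.
- The demodulator demodulates a received electromagnetic wave signal into a low-frequency baseband signal and passes it to the baseband processor.
After processing by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, and the like), or displays an image or video on the display 194. In some embodiments, the modem processor may be an independent component. In some other embodiments, it may be independent of the processor 110 and disposed in the same component as the mobile communication module 150 or another functional module.
The wireless communication module 160 may provide the following wireless communication solutions for the terminal 100:
- wireless local area networks (WLAN), such as wireless fidelity (Wi-Fi) networks;
- Bluetooth (BT);
- global navigation satellite system (GNSS);
- frequency modulation (FM);
- near field communication (NFC); and
- infrared (IR).
The wireless communication module 160 may be one or more components integrating at least one communication processing module. It receives electromagnetic waves through the antenna 2, performs frequency modulation and filtering on the signal, and sends the processed signal to the processor 110. It may also receive a signal to be sent from the processor 110, frequency-modulate and amplify it, and convert it into electromagnetic waves radiated through the antenna 2.
In some embodiments, the antenna 1 of the terminal 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160. The terminal 100 can then communicate with networks and other devices through wireless communication technologies. These may include:
- global system for mobile communications (GSM);
- general packet radio service (GPRS);
- code division multiple access (CDMA);
- wideband code division multiple access (WCDMA);
- time-division synchronous code division multiple access (TD-SCDMA);
- long term evolution (LTE); and/or
- BT, GNSS, WLAN, NFC, FM, and IR technologies.
The GNSS may include the following:
- the global positioning system (GPS);
- the global navigation satellite system (GLONASS);
- the BeiDou navigation satellite system (BDS);
- the quasi-zenith satellite system (QZSS); and/or
- satellite based augmentation systems (SBAS).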
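The terminal 100 implements the display function through the GPU, the display 194, the application processor, and the like. The GPU is an image-processing microprocessor that connects the display 194 and the application processor and performs mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display 194 displays images, videos, and the like, and includes a display panel. The panel may use any of the following:
- a liquid crystal display (LCD);
- an organic light-emitting diode (OLED);
- an active-matrix organic light emitting diode (AMOLED);
- a flex light-emitting diode (FLED);
- Mini LED, Micro LED, or Micro OLED;
- quantum dot light emitting diodes (QLED); or the like.
In some embodiments, the terminal 100 may include one or N displays 194, where N is a positive integer greater than 1.
The terminal 100 may implement the photographing function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The ISP processes data fed back by the camera 193. For example, during photographing the shutter opens and light passes through the lens onto the camera's photosensitive element. The element converts the optical signal into an electrical signal and passes it to the ISP, which converts it into a visible image. The ISP may also apply algorithmic optimization to image noise and brightness, and may optimize parameters of the shooting scene such as exposure and color temperature. In some embodiments, the ISP may be disposed in the camera 193.
The camera 193 captures still images or videos. The lens forms an optical image of the object and projects it onto the photosensitive element, which may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and passes it to the ISP. The ISP converts it into a digital image signal and outputs it to the DSP, which converts it into a standard format such as RGB or YUV. In some embodiments, the terminal 100 may include one or N cameras 193, where N is a positive integer greater than 1.
The digital signal processor processes digital signals; besides digital image signals, it may process other digital signals. For example, when the terminal 100 performs frequency selection, the digital signal processor performs a Fourier transform or similar operation on the frequency energy.
The video codec compresses or decompresses digital video. The terminal 100 may support one or more video codecs, so it can play or record videos in multiple encoding formats, for example moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The NPU is a neural-network (NN) computing processor. It borrows from the structure of biological neural networks, for example the transfer pattern between neurons in the human brain, to process input information quickly, and it can also continuously self-learn. The NPU enables intelligent cognition applications on the terminal 100, such as image recognition, face recognition, speech recognition, and text understanding.
The external memory interface 120 may connect an external memory card, for example a Micro SD card, to expand the storage capacity of the terminal 100. The card communicates with the processor 110 through the external memory interface 120 to implement data storage, for example saving music and video files on the card.
The internal memory 121 may store computer-executable program code, which includes instructions. It may include a program storage area and a data storage area.
- The program storage area may store an operating system and applications required by at least one function, such as sound playback or image playback.
- The data storage area may store data created while the terminal 100 is used, such as audio data and a phone book.
In addition, the internal memory 121 may include high-speed random access memory and may further include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS). The processor 110 executes the functional applications and data processing of the terminal 100 by running instructions stored in the internal memory 121 and/or in the memory disposed in the processor.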
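The terminal 100 may implement audio functions such as music playback and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 converts digital audio information into an analog audio output signal and converts analog audio input into a digital audio signal. It may also encode and decode audio signals. In some embodiments, the audio module 170, or some of its functional modules, may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", converts an audio electrical signal into a sound signal. The terminal 100 may play music or hands-free calls through the speaker 170A.
The receiver 170B, also called an "earpiece", converts an audio electrical signal into a sound signal. When the terminal 100 answers a call or plays a voice message, the user can listen by holding the receiver 170B close to the ear.
The microphone 170C, also called a "mic" or "mouthpiece", converts a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak close to the microphone 170C to input a sound signal. The terminal 100 may have at least one microphone 170C.
- In some other embodiments, the terminal 100 may have two microphones 170C, which can implement noise reduction in addition to collecting sound.
- In still other embodiments, the terminal 100 may have three, four, or more microphones 170C to collect sound, reduce noise, identify sound sources, implement directional recording, and so on.
The headset jack 170D connects a wired headset. It may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.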
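The pressure sensor 180A senses a pressure signal and converts it into an electrical signal. In some embodiments, it may be disposed on the display 194. There are many types of pressure sensors, such as resistive, inductive, and capacitive ones. A capacitive pressure sensor may include at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes, and the terminal 100 determines the pressure intensity from this change.
When a touch operation is performed on the display 194, the terminal 100 detects its intensity through the pressure sensor 180A. It may also calculate the touch position from the detection signal of the pressure sensor 180A. In some embodiments, touch operations at the same position but with different intensities may correspond to different operation instructions. For example:
- a touch on the SMS application icon with intensity less than a first pressure threshold executes an instruction to view SMS messages;
- a touch on the same icon with intensity greater than or equal to the first pressure threshold executes an instruction to create a new SMS message.
The gyroscope sensor 180B may determine the motion posture of the terminal 100. In some embodiments, it may determine the angular velocities of the terminal 100 around three axes (the x, y, and z axes). The gyroscope sensor 180B may be used for image stabilization during photographing. For example, when the shutter is pressed, it detects the angle by which the terminal 100 shakes and calculates the distance the lens module needs to compensate. The lens then counteracts the shake through reverse motion. The gyroscope sensor 180B may also be used in navigation and motion-sensing game scenarios.
The barometric pressure sensor 180C measures air pressure. In some embodiments, the terminal 100 calculates altitude from the measured air pressure to assist positioning and navigation.
The magnetic sensor 180D includes a Hall sensor. The terminal 100 may use it to detect the opening and closing of a flip cover. In some embodiments, when the terminal 100 is a flip phone, it may detect the opening and closing of the flip through the magnetic sensor 180D. It can then set features such as automatic unlocking when the flip is opened, based on the detected state of the cover or the flip.
The acceleration sensor 180E may detect the magnitude of acceleration of the terminal 100 in various directions (usually three axes), and the magnitude and direction of gravity when the terminal 100 is stationary. It may also recognize the posture of the electronic device, for landscape/portrait switching, pedometers, and similar applications.
The distance sensor 180F measures distance, by infrared or laser. In some embodiments, in a photographing scenario, the terminal 100 may use the distance sensor 180F to measure distance for fast focusing.
The optical proximity sensor 180G may include a light-emitting diode (LED) and a light detector such as a photodiode. The LED may be an infrared LED. The terminal 100 emits infrared light outward through the LED and uses the photodiode to detect infrared light reflected from nearby objects.
- When sufficient reflected light is detected, the terminal 100 can determine that there is an object nearby.
- When insufficient reflected light is detected, it can determine that there is no object nearby.
The terminal 100 may use the optical proximity sensor 180G to detect that the user is holding it to the ear during a call, and then turn off the screen automatically to save power. The optical proximity sensor 180G may also be used for automatic unlocking and screen locking in cover mode and pocket mode.
The ambient light sensor 180L senses ambient light brightness. The terminal 100 may adaptively adjust the brightness of the display 194 based on it. The ambient light sensor 180L may also adjust white balance automatically during photographing. It may cooperate with the optical proximity sensor 180G to detect whether the terminal 100 is in a pocket, to prevent accidental touches.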
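The fingerprint sensor 180H collects fingerprints. The terminal 100 may use the collected fingerprint characteristics for fingerprint unlocking, application lock access, fingerprint-based photographing, fingerprint-based call answering, and the like.
The temperature sensor 180J detects temperature. In some embodiments, the terminal 100 executes a temperature processing policy based on the detected temperature, for example:
- when the reported temperature exceeds a threshold, the terminal 100 reduces the performance of a processor near the temperature sensor 180J, to lower power consumption and provide thermal protection;
- in some other embodiments, when the temperature is below another threshold, the terminal 100 heats the battery 142 to avoid an abnormal shutdown caused by low temperature;
- in still other embodiments, when the temperature is below yet another threshold, the terminal 100 boosts the output voltage of the battery 142 to avoid an abnormal shutdown caused by low temperature.
The touch sensor 180K is also called a "touch component". It may be disposed on the display 194, and together they form a touchscreen, also called a "touch screen". The touch sensor 180K detects touch operations on or near it and may pass a detected operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 194. In some other embodiments, the touch sensor 180K may instead be disposed on a surface of the terminal 100, at a position different from the display 194.
The bone conduction sensor 180M may obtain vibration signals. In some embodiments, it may obtain the vibration signal of the vibrating bone of the human vocal part. It may also contact the human pulse to receive a blood pressure beat signal. In some embodiments, the bone conduction sensor 180M may be disposed in a headset to form a bone conduction headset.
- The audio module 170 may parse a speech signal from the vocal-part bone vibration signal obtained by the bone conduction sensor 180M, to implement a voice function.
- The application processor may parse heart rate information from the blood pressure beat signal obtained by the bone conduction sensor 180M, to implement heart rate detection.
The button 190 includes a power button, volume buttons, and the like. The button 190 may be mechanical or touch-based. The terminal 100 may receive button input and generate key signal input related to user settings and function control of the terminal 100.
The motor 191 may generate vibration alerts. It may be used for incoming-call vibration alerts and for touch vibration feedback.
- Touch operations on different applications, such as photographing and audio playback, may correspond to different vibration feedback effects.
- Touch operations on different areas of the display 194 may also correspond to different vibration feedback effects of the motor 191.
- Different application scenarios, such as time reminders, message receiving, alarm clocks, and games, may also correspond to different vibration feedback effects.
Touch vibration feedback effects may also be customized.
The indicator 192 may be an indicator light. It may indicate the charging status and battery level changes, or indicate messages, missed calls, notifications, and the like.
The SIM card interface 195 connects a SIM card. A SIM card is brought into contact with or separated from the terminal 100 by inserting it into or removing it from the SIM card interface 195. The terminal 100 may support one or N SIM card interfaces, where N is a positive integer greater than 1.
The SIM card interface 195 may support Nano SIM, Micro SIM, and standard SIM cards, among others. Multiple cards, of the same or different types, may be inserted into the same SIM card interface 195 at the same time. The SIM card interface 195 may also be compatible with different types of SIM cards and with external memory cards.
The terminal 100 interacts with a network through the SIM card to implement functions such as calls and data communication. In some embodiments, the terminal 100 uses an eSIM, that is, an embedded SIM card. The eSIM is embedded in the terminal 100 and cannot be separated from it.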
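The software system of the terminal 100 may use a layered, event-driven, microkernel, microservice, or cloud architecture. This embodiment of the present invention uses an Android system with a layered architecture as an example to describe the software structure of the terminal 100.
FIG. 2a is a block diagram of the software structure of the terminal 100 according to an embodiment of this disclosure.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system has four layers, from top to bottom:
- the application layer;
- the application framework layer;
- the Android runtime and system libraries; and
- the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 2a, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Videos, and Messages.
The application framework layer provides application programming interfaces (APIs) and a programming framework for the applications in the application layer. It includes some predefined functions.
As shown in FIG. 2a, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager manages window programs. It can obtain the display size, determine whether there is a status bar, lock the screen, take screenshots, and so on.
The content provider stores and retrieves data and makes it accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, a phone book, and the like.
The view system includes visual controls, such as controls that display text and controls that display pictures, and may be used to build applications. A display interface may consist of one or more views. For example, a display interface that includes an SMS notification icon may include a view that displays text and a view that displays pictures.
The phone manager provides the communication functions of the terminal 100, for example managing call states (connecting, hanging up, and so on).
The resource manager provides applications with resources such as localized strings, icons, pictures, layout files, and video files.
The notification manager lets applications display notification information in the status bar. It can convey notification-type messages that disappear automatically after a short time without user interaction, for example to report a completed download or a new message. A notification may also appear in the system's top status bar as a chart or scrolling text, such as a notification from an application running in the background, or on the screen as a dialog window. Examples include text shown in the status bar, a prompt tone, device vibration, or a blinking indicator light.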
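The Android Runtime includes core libraries and a virtual machine, and is responsible for scheduling and managing the Android system.
The core libraries consist of two parts: functions that the Java language needs to call, and the Android core libraries.
The application layer and the application framework layer run in the virtual machine, which executes the Java files of these two layers as binary files. The virtual machine handles object lifecycle management, stack management, thread management, security and exception management, garbage collection, and similar functions.
The system libraries may include multiple functional modules, for example a surface manager, media libraries, a 3D graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager manages the display subsystem and provides fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of many common audio and video formats, as well as static image files. They may support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The 3D graphics processing library implements 3D graphics drawing, image rendering, composition, layer processing, and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer sits between hardware and software. It contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The following describes, by way of example, how the software and hardware of the terminal 100 work together in a photo-capture scenario.
The steps are as follows:
1. When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer.
2. The kernel layer turns the touch operation into a raw input event, including information such as the touch coordinates and timestamp, and stores it.
3. The application framework layer obtains the raw input event from the kernel layer and identifies the control that corresponds to it.
4. Suppose the touch operation is a tap and the corresponding control is the camera application icon. The camera application calls an interface of the application framework layer to start.
5. The camera application then starts the camera driver by calling the kernel layer, and a still image or video is captured through the camera 193.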
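An embodiment of this application further provides a server 1300.
As shown in FIG. 2b, the server 1300 may include a processor 1310 and a transceiver 1320 connected to the processor 1310.
- The transceiver 1320 may include a receiver and a transmitter, may be used to receive or send messages or data, and may be a network interface card.
- The server 1300 may further include an acceleration component (an accelerator); when it is a network acceleration component, it may be a network interface card.
- The processor 1310 may be the control center of the server 1300, connecting all parts of the server, such as the transceiver 1320, through various interfaces and lines. In the present invention, the processor 1310 may be a central processing unit (CPU) and may optionally include one or more processing units. It may also be a digital signal processor, an application-specific integrated circuit, a field programmable gate array, a GPU, or another programmable logic device.
- The server 1300 may further include a memory 1330 for storing software programs and modules. The processor 1310 executes the functional applications and data processing of the server 1300 by reading the software code and modules stored in the memory 1330.
An embodiment of this application further provides a segmentation map confidence determining system. As shown in FIG. 2c, the system may include a terminal device and a server. The terminal device may be a mobile terminal, a human-computer interaction device, or an in-vehicle visual perception device. Examples include a mobile phone, an intelligent robot, a self-driving vehicle, an intelligent monitor, or an augmented reality (AR) wearable device. Accordingly, the method provided in embodiments of this disclosure may be applied in human-computer interaction, in-vehicle visual perception, augmented reality, intelligent monitoring, autonomous driving, and similar fields.
For example, the terminal device may send an image (for example, the first image in embodiments of this application) to the server. The server performs image processing and analysis to obtain the confidence of the segmentation map (for example, the second confidence in embodiments of this application) and returns the second confidence to the terminal device.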
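For ease of understanding, the segmentation map confidence determining method provided in embodiments of this application is described in detail below with reference to the accompanying drawings and application scenarios.
Several application scenarios of embodiments of this application are described first:
1. AR special-effect display
In this scenario, special effects may be added to a person in a video, for example a wing effect displayed on the person's back (as shown in FIG. 3). In a specific implementation, the steps are as follows:
- obtain a video captured by the terminal device;
- perform image segmentation on its image frames to obtain a segmentation map of the person;
- display the special effect at the corresponding position, based on the segmentation map and the person's position information in the image.
In some cases, poor image quality or interference from occluders makes recognition of the person in the segmentation map inaccurate. The confidence of the segmentation map is therefore needed to decide whether to display the special effect. A low confidence indicates one of two cases:
- the object in the segmentation map may not be a person, for example it is some other object; or
- the object is a person, but its position, contour (also called boundary), and similar information cannot be determined accurately from the segmentation map.
In these cases, the special effect is not displayed.
2. AI passer-by removal
In this scenario, the main character and the passers-by in the background are all segmented by instance segmentation, and high-confidence segmentation maps are selected. The image regions of moving passers-by are then cut out and replaced with the background regions at the corresponding positions in other image frames. As shown in FIG. 4, images 1401 and 1402 can be used for fusion:
- a moving target 1411 is first segmented from image 1401;
- image 1401 is then completed using the background image 1412 of image 1402 at the position of the removed moving target.
When the sharpest image is completed in this way, the other image's background at the position of the removed moving target must not contain a moving target. For example, in FIG. 4, the background image 1412 of image 1402 at the position of the removed moving target contains no moving target. Two or more images may be used for fusion. When many moving targets must be removed and two images are not enough to produce a complete image free of moving targets, more images may be fused to obtain such an image.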
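The background fill described above can be sketched as a per-pixel selection. This is a minimal illustration under a few assumptions: the frames are NumPy arrays, they are already aligned (same viewpoint, same size), and a boolean mask of the removed moving target is available. The function name `remove_moving_target` is hypothetical and not part of the original description.

```python
import numpy as np


def remove_moving_target(frame: np.ndarray, reference: np.ndarray,
                         target_mask: np.ndarray) -> np.ndarray:
    """Fill the moving-target pixels of `frame` with the co-located pixels
    of `reference`, a frame whose pixels in that area show only background.

    Assumes both frames are aligned and share a shape of (H, W) or
    (H, W, channels), and that `target_mask` has shape (H, W).
    """
    if frame.shape != reference.shape:
        raise ValueError("frame and reference must have the same shape")
    if target_mask.shape != frame.shape[:2]:
        raise ValueError("target_mask must have shape (H, W)")
    mask = target_mask.astype(bool)
    if frame.ndim == 3:
        mask = mask[..., np.newaxis]  # broadcast the mask over colour channels
    # Take the reference pixel inside the mask; keep the original pixel elsewhere.
    return np.where(mask, reference, frame)
```

When no single reference frame is clean at every masked position, the same selection can be applied several times with different reference frames and different sub-masks. This matches the multi-image fusion case in the text.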
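3. Vehicle object segmentation in autonomous driving
In autonomous driving, instance segmentation results for vehicles on the road need to be provided to the in-vehicle terminal device, to help the autonomous driving system make better driving decisions.
For example, advanced driver assistance systems (ADAS) and advanced driver systems (ADS) need to detect many types of objects in real time, including:
- dynamic obstacles: Pedestrian, Cyclist, Tricycle, Car, Truck, Bus;
- static obstacles: TrafficCone, TrafficStick, FireHydrant, Motorcycle, Bicycle;
- traffic signs: TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign.
To fuse with LiDAR data, masks of dynamic obstacles must be obtained, so that the LiDAR points that hit dynamic obstacles can be filtered out.
In existing methods for determining segmentation map confidence, target detection on the image yields an object detection result, including a detection box and a classification result. The confidence of this detection result is then used directly as the confidence of the segmentation map. Because the detection confidence indicates the localization accuracy of the detection box, it only expresses whether the box lies where the object is. It cannot reflect the quality of the segmentation map.
Take AR special-effect display as an example. Here, the precise location of the target object's boundary in the segmentation map may be needed. When the detection box is accurately placed but the segmentation map is of poor quality, existing techniques still display the AR effect on the object. Because the segmentation map is poor, however, the displayed effect (for example its position and direction) is also poor. With wings as the AR effect, the mutual occlusion between the wings and the person is rendered incorrectly, which is the so-called virtual-real occlusion problem. The segmentation map confidence should therefore capture not only the localization accuracy of the detection box but also the segmentation quality of the object in the segmentation map, for example whether its boundary is sharp.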
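Refer to FIG. 5, which is a schematic diagram of an embodiment of the segmentation map confidence determining method according to an embodiment of this application.
As shown in FIG. 5, the segmentation map confidence determining method provided in this embodiment of this application includes the following steps.
501: Obtain a first image.
Step 501 may be performed by a terminal device or a server. Specifically, the terminal device may obtain the first image. Alternatively, the terminal device may obtain the first image and send it to a server on the cloud side, so that the server obtains it.
The following uses the terminal device as the execution body as an example. The first image may be obtained in any of the following ways:
- The terminal device may obtain an input video stream, which may be captured by its own shooting apparatus. In the AR special-effect display scenario, the video stream is a real-time stream captured by the terminal device's shooting apparatus and includes the first image (also called the first image frame).
- The user may select a video stream on the terminal device, where the selected video stream includes the first image.
- The terminal device may obtain an input first image, which may be captured by its shooting apparatus.
- The user may select the first image on the terminal device, for example from an album on the terminal device or in the cloud.
In this embodiment of this application, the first image visually includes the target object, which may be a person or a non-person object. The first image may include multiple objects, and the target object is one of them.
After the first image is obtained, it may be preprocessed, for example normalized in the RGB domain; this is not limited in this application.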
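502: Perform target detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, where the first confidence indicates the localization accuracy of the detection box.
503: Perform image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map.
In this embodiment of this application, a pre-trained neural network model may be used to perform target detection on the first image and obtain a detection result for the target object. The result may include the detection box of the target object; correspondingly, the first confidence indicates the localization accuracy of the detection box.
In this embodiment of this application, a pre-trained neural network model may also be used to segment the image within the detection box and obtain the segmentation map corresponding to the target object.
The segmentation map in embodiments of this application may also be called a mask or a mask activation map.
For example, the neural network model used for target detection may include a backbone network and a head (header). Refer to FIG. 6, which is a schematic structural diagram of a backbone network. As shown in FIG. 6, the backbone network receives the input first image, convolves it, and outputs feature maps of the image at different resolutions (feature maps C1, C2, C3, and C4), that is, feature maps of different sizes. The backbone network extracts basic features and provides them for subsequent detection.
Specifically, the backbone network may apply a series of convolutions to the input image to obtain feature maps at different scales (with different resolutions). These feature maps provide basic features for the subsequent detection module. The backbone network may take various forms, such as a visual geometry group (VGG) network, a residual neural network (ResNet), or the core structure of GoogLeNet (Inception-Net).
The backbone network may convolve the input image to generate several convolutional feature maps at different scales. Each feature map is an H*W*C matrix, where H is its height, W its width, and C its number of channels. The backbone may use any of many existing convolutional network frameworks, such as VGG16, ResNet50, or Inception-Net. The following uses ResNet18 as the backbone as an example.
Assume that the input image has resolution H*W*3 (height H, width W, and 3 channels, namely RGB). The feature maps are produced as follows:
- The input image passes through the ResNet18 convolutional layer Res18-Conv1 to generate feature map C1. C1 is downsampled twice relative to the input image and has 64 channels, so its resolution is H/4*W/4*64.
- C1 passes through Res18-Conv2 to obtain feature map C2, whose resolution is the same as C1.
- C2 passes through Res18-Conv3 to generate feature map C3. C3 is further downsampled relative to C2 and has twice as many channels, giving resolution H/8*W/8*128.
- C3 passes through Res18-Conv4 to generate feature map C4, with resolution H/16*W/16*256.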
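The shape bookkeeping above can be checked with a small helper. The stage table simply encodes the downsampling factors and channel counts stated in the text: C1 and C2 at 1/4 resolution with 64 channels, C3 at 1/8 with 128, and C4 at 1/16 with 256. The helper name and the divisibility check are illustrative assumptions, not properties of ResNet18 itself.

```python
# (feature map, total downsampling factor w.r.t. the input, channels),
# taken from the description of the ResNet18 backbone above.
RESNET18_STAGES = [
    ("C1", 4, 64),
    ("C2", 4, 64),
    ("C3", 8, 128),
    ("C4", 16, 256),
]


def feature_map_shapes(height: int, width: int, stages=RESNET18_STAGES):
    """Return {name: (H', W', C)} for an H*W*3 input image."""
    shapes = {}
    for name, stride, channels in stages:
        # Illustrative assumption: require exact division so shapes stay integral.
        if height % stride or width % stride:
            raise ValueError(f"{height}x{width} is not divisible by {stride}")
        shapes[name] = (height // stride, width // stride, channels)
    return shapes
```

For a 224*224 input, this gives 56*56*64 for C1 and C2, 28*28*128 for C3, and 14*14*256 for C4.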
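The backbone network in embodiments of this application may also be called a trunk network; this is not limited herein.
The backbone network shown in FIG. 6 is merely one implementation and does not constitute a limitation on this application.
In this embodiment of this application, the target detection network may further include one or more parallel heads (also called headers). Each head detects, based on the feature maps output by the backbone network, the task objects of one task, for example the target object in this embodiment. It outputs the detection box of the region in which the target object is located and the first confidence of that box. Each head in the network may detect different task objects, where a task object is an object to be detected in that task. A higher first confidence means a higher probability that the target object is present in the corresponding detection box. Equivalently, the first confidence indicates the localization accuracy of the detection box: the higher the localization accuracy, the higher that probability.
In this embodiment of this application, different heads may perform different object detection tasks.
For example:
- one head may detect vehicles and output detection boxes and confidences for Car/Truck/Bus;
- another may detect people and output detection boxes and confidences for Pedestrian/Cyclist/Tricycle;
- another may detect traffic lights and output detection boxes and confidences for Red_TrafficLight/Green_TrafficLight/Yellow_TrafficLight/Black_TrafficLight.
In this embodiment of this application, the target detection network may include multiple serial heads, each connected to a parallel head. Serial heads are not mandatory: in scenarios that only need detection boxes, no serial head is required. A serial head may perform image segmentation based on the region in which the detection box is located, which is step 503 in this embodiment.
A serial head works as follows:
- It takes the detection boxes of its task's objects, provided by the parallel head it is connected to.
- It extracts the features of those boxes' regions from the feature map.
- Based on those features, it predicts the 3D information, mask information (also called the segmentation map), or keypoint information of the task's objects.
In other words, once the boxes for a task have been detected, a serial head performs 3D/Mask/Keypoint detection of the objects inside them. For example:
- serial 3D_head0 estimates a vehicle's orientation, centroid, and length, width, and height to output its 3D box;
- serial Mask_head0 predicts a fine mask of the vehicle to segment it;
- serial Keypoint_head0 estimates the vehicle's keypoints.
Serial heads are optional. Tasks that do not need 3D/Mask/Keypoint detection, such as traffic light detection, only need detection boxes and no serial head. Some tasks may also connect one or more serial heads according to their specific requirements. For example, parking slot (Parkingslot) detection needs the keypoints of the slot in addition to the detection box, so only one serial Keypoint_head is connected for this task, without 3D or Mask heads.
In this embodiment of this application, a header may detect the boxes of one task based on the feature maps provided by the FPN, and output detection boxes of the task's objects with the corresponding confidences. One header structure is described next. Refer to FIG. 7, which is a schematic diagram of a header. As shown in FIG. 7, the head includes three modules: a region proposal network (RPN), ROI-ALIGN, and RCNN.
The RPN module predicts, on the feature map, the regions in which task objects are located, and outputs candidate detection boxes that match these regions. In other words, the RPN predicts regions of the feature map that may contain task objects and gives boxes for them; these regions are called proposals. For example, when a head detects vehicles, its RPN layer predicts candidate boxes that may contain vehicles; when a head detects people, its RPN layer predicts candidate boxes that may contain people. These proposals are imprecise: they do not necessarily contain the task's objects, and the boxes are not tight.
Candidate region prediction may be performed by the head's RPN module. From the feature maps provided by the FPN, it predicts the regions that may contain task objects and gives candidate boxes for them (also called candidate regions or proposals). In this embodiment, if the head detects vehicles, its RPN layer predicts candidate boxes that may contain vehicles.
The RPN works as follows:
1. The RPN layer generates a feature map, RPN Hidden, from the input feature map, for example through a 3*3 convolution.
2. The head's RPN layer predicts proposals from RPN Hidden. Using separate 1*1 convolutions, it predicts the coordinates and the confidence of the proposal at each position of RPN Hidden. A higher confidence means a higher probability that the proposal contains an object of the task; for example, the higher a proposal's score in a vehicle head, the more likely it contains a vehicle.
3. The proposals from each RPN layer pass through a proposal merging module. This module removes redundant proposals according to their overlap; it may use, but is not limited to, the NMS algorithm. It then selects, from the remaining K proposals, the N (N<K) with the highest scores as candidate regions that may contain objects.
These proposals are still imprecise: they do not necessarily contain objects of the task, and the boxes are not tight. The RPN module is therefore only a coarse detection step, and the subsequent RCNN module refines its output. When the RPN module regresses proposal coordinates, it regresses coordinates relative to anchors rather than absolute values. The better the anchors match the actual objects, the more likely the RPN is to detect them.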
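The proposal merging step can be sketched with greedy non-maximum suppression followed by top-N selection. This is one common variant, written with NumPy. The (x1, y1, x2, y2) box layout, the IoU threshold of 0.7, and the default `top_n` are illustrative assumptions, not values given in the text. Anchor decoding is omitted.

```python
import numpy as np


def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU of one (x1, y1, x2, y2) box against an (M, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)


def merge_proposals(boxes: np.ndarray, scores: np.ndarray,
                    iou_threshold: float = 0.7, top_n: int = 300) -> np.ndarray:
    """Greedy NMS, then keep the top_n highest-scoring survivors (N < K)."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        overlaps = iou(boxes[best], boxes[order[1:]])
        # Drop proposals that overlap the kept one too much.
        order = order[1:][overlaps <= iou_threshold]
    # `keep` is already in descending score order, so slicing keeps the best N.
    return np.array(keep, dtype=int)[:top_n]
```

The same greedy procedure is reused later for the final box merging in the RCNN stage. Only the inputs (refined boxes and class scores) differ.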
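The ROI-ALIGN module extracts, from a feature map provided by the FPN, the features of the candidate box regions predicted by the RPN module. Based on the proposals from the RPN module, it crops each proposal's region out of a feature map and resizes it to a fixed size, which gives the features of each proposal. The ROI-ALIGN module may use, but is not limited to, the following feature extraction methods:
- ROI-POOLING (region-of-interest pooling);
- ROI-ALIGN (region-of-interest alignment);
- PS-ROIPOOLING (position-sensitive region-of-interest pooling);
- PS-ROIALIGN (position-sensitive region-of-interest alignment).
The RCNN module performs three operations:
- It convolves the features of the candidate box regions through a neural network to obtain the confidence that each candidate box belongs to each object category.
- It adjusts the candidate box coordinates through a neural network, so that the adjusted boxes match the shape of the actual objects better than the original candidates.
- It selects adjusted boxes whose confidence exceeds a preset threshold as the detection boxes of the regions.
In other words, the RCNN module refines the features of each proposal from the ROI-ALIGN module. It obtains each proposal's confidence for each category; for the vehicle task, for example, it gives four scores for Background/Car/Truck/Bus. It also adjusts the proposal box coordinates to output tighter detection boxes. These boxes are merged through non-maximum suppression (NMS) and output as the final detection boxes.
Fine classification of candidate regions is mainly performed by the RCNN module of the head in FIG. 7. From the features of each proposal extracted by the ROI-ALIGN module, it regresses tighter box coordinates and classifies the proposal, outputting its confidence for each category. The RCNN can be implemented in many ways; one implementation is:
1. The ROI-ALIGN module outputs features of size N*14*14*256 (features of proposals).
2. These features pass through a ResNet18 convolution module (Res18-Conv5), producing features of size N*7*7*512.
3. A global average pooling layer (Global Avg Pool) averages the 7*7 features in each channel, giving N*512 features; each 1*512 feature vector represents one proposal.
4. Two fully connected (FC) layers then regress, respectively, the precise box coordinates and the box's class confidences. The coordinates are an N*4 output whose four values are the x/y coordinates of the box center and the box width and height. In head0, the class confidences are the scores for Background/Car/Truck/Bus.
5. Finally, a box merging operation selects the boxes with the highest scores, and NMS removes duplicates, producing tight box outputs.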
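The shape flow of the RCNN module after Res18-Conv5 (global average pooling, then two fully connected branches) can be sketched in NumPy. The weights here are random placeholders. Using a softmax to turn the class scores into probabilities is an assumption for illustration; the text only says that the layer outputs per-category confidences.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def rcnn_head(features, w_box, b_box, w_cls, b_cls):
    """features: (N, 7, 7, 512) proposal features after Res18-Conv5.

    Returns box regressions (N, 4) as (cx, cy, w, h) and per-category
    confidences (N, num_classes).
    """
    pooled = features.mean(axis=(1, 2))       # global average pooling -> (N, 512)
    boxes = pooled @ w_box + b_box            # FC branch 1 -> (N, 4)
    scores = softmax(pooled @ w_cls + b_cls)  # FC branch 2 -> (N, num_classes)
    return boxes, scores


rng = np.random.default_rng(0)
num_classes = 4  # e.g. Background/Car/Truck/Bus in head0
feats = rng.standard_normal((3, 7, 7, 512))
boxes, scores = rcnn_head(
    feats,
    rng.standard_normal((512, 4)), np.zeros(4),
    rng.standard_normal((512, num_classes)), np.zeros(num_classes),
)
```

The regressed boxes and scores would then go through the NMS-based box merging described above.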
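In some practical application scenarios, the perception network may include other heads that perform 3D/Mask/Keypoint detection on top of the detected boxes. Taking 3D as an example, the steps are as follows:
1. Based on the accurate boxes provided by the head, the ROI-ALIGN module extracts the features of each box's region from the feature maps output by the FPN. Assuming there are M detection boxes, it outputs features of size M*14*14*256.
2. These features pass through Res18-Conv5 of ResNet18, producing features of size M*7*7*512.
3. A global average pooling layer averages the 7*7 features in each channel, giving M*512 features; each 1*512 feature vector represents one detection box.
4. Three FC layers then regress, respectively:
- the orientation angle of the object in the box (orientation, an M*1 vector);
- the centroid coordinates (centroid, an M*2 vector whose two values are the x/y coordinates of the centroid);
- the length, width, and height (dimension).
The header shown in FIG. 7 is merely one implementation and does not constitute a limitation on this application.
In this embodiment of this application, target detection may be performed on the first image to obtain the first confidence of the detection box of the target object. Image segmentation may then be performed on the image within the detection box to obtain the segmentation map corresponding to the target object.
In a possible implementation, image segmentation on the image within the detection box yields an initial segmentation map of the target object. This map includes a plurality of pixels and, for each pixel, the probability that it belongs to each category. The maximum of each pixel's per-category probabilities is then used as that pixel's value in the segmentation map.
Specifically, consider a probability segmentation map of dimensions (N, C, H, W), where N is the number of images, C the number of segmentation categories, and H and W the image height and width. The value at each position (n, c, h, w) lies between 0 and 1; the larger the value, the more likely the pixel belongs to category c. For this map, a maximum-value matrix (N, H, W) and a maximum-index matrix (N, H, W) are first computed along the second dimension. The maximum-index matrix (N, H, W) may be used as the segmentation map.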
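The reduction along the category dimension is a max/argmax over axis 1. Below is a minimal NumPy sketch, assuming the probability map is an array of shape (N, C, H, W). The function name is hypothetical.

```python
import numpy as np


def reduce_probability_map(prob: np.ndarray):
    """prob: (N, C, H, W) per-category probabilities in [0, 1].

    Returns the maximum-value matrix (N, H, W), i.e. each pixel's highest
    category probability, and the maximum-index matrix (N, H, W), i.e. the
    category label of each pixel.
    """
    if prob.ndim != 4:
        raise ValueError("expected an array of shape (N, C, H, W)")
    max_values = prob.max(axis=1)
    max_indices = prob.argmax(axis=1)
    return max_values, max_indices


prob = np.array([[[[0.2, 0.9]],     # category 0
                  [[0.8, 0.1]]]])   # category 1; shape (1, 2, 1, 2)
values, labels = reduce_probability_map(prob)
```

The maximum-value matrix is the per-pixel value that the quality measures below operate on. The index matrix gives the label map.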
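To compute the confidence of the segmentation map, its image segmentation quality may be obtained and used to adjust the first confidence of the detection box. The adjusted first confidence then indicates the image segmentation quality of the segmentation map in addition to the localization accuracy of the detection box.
The following describes how the image segmentation quality of the segmentation map is obtained.
In an implementation, the image segmentation quality indicates the boundary sharpness of the target object in the segmentation map, where the higher the boundary sharpness, the higher the image segmentation quality.
Take a person as the target object. Each pixel value of the segmentation map may lie between 0 and 1:
- pixels in the background region have values of 0 or close to 0;
- pixels in the target object's boundary region have values around 0.5;
- most pixels inside the body have values of 1 or close to 1.
This is shown in FIG. 8, where (a) is the input image, (b) the probability map, (c) the segmentation map, and (d), (e), and (f) the pixels with values less than 0.4, 0.6, and 0.8, respectively. When the segmentation map is used, it may be binarized with a threshold of 0.5, which produces errors at the edges. Ideally, therefore, the fewer pixels with values near 0.5 a segmentation map has along the body edge, the better its quality.
Specifically, the segmentation map includes the object region in which the target object is located, the background region of the target object, and the boundary region between them. The three regions are determined as follows:
- the object region consists of pixels whose values are greater than a foreground threshold;
- the background region consists of pixels whose values are less than a background threshold;
- the boundary region consists of pixels whose values are greater than the background threshold and less than the foreground threshold.
For example, with a background threshold of 0.25 and a foreground threshold of 0.7, background pixels have values less than 0.25, foreground pixels have values greater than 0.7, and boundary pixels have values between 0.25 and 0.7.
In this embodiment of this application, the image segmentation quality of the segmentation map may be the ratio of the number of pixels in the object region to the number of pixels in the target region, where the target region is the union of the object region and the boundary region. In other words, the smaller the boundary region (the fewer pixels it contains), the higher the image segmentation quality.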
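For example, the image segmentation quality may be calculated by the following formula:
$$\mathrm{Pixel\_scoring} = \frac{\left|\{\,p : v_p > \mathrm{high\_fg\_threshold}\,\}\right|}{\left|\{\,p : v_p > \mathrm{low\_fg\_threshold}\,\}\right|}$$
Here v_p is the value of pixel p in the segmentation map, high_fg_threshold is the foreground threshold, and low_fg_threshold is the background threshold. The coefficient Pixel_scoring is the number of pixels whose values exceed the foreground threshold divided by the number of pixels whose values exceed the background threshold. Pixel_scoring may be used as the image segmentation quality of the segmentation map.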
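The Pixel_scoring ratio can be computed directly from the segmentation map values. Below is a minimal NumPy sketch using the example thresholds 0.25 and 0.7 given above. Returning 0.0 when no pixel exceeds the background threshold is an assumption, because the text does not cover that case.

```python
import numpy as np


def pixel_scoring(seg: np.ndarray, low_fg_threshold: float = 0.25,
                  high_fg_threshold: float = 0.7) -> float:
    """Ratio of confidently-foreground pixels to all non-background pixels."""
    if not low_fg_threshold < high_fg_threshold:
        raise ValueError("low_fg_threshold must be below high_fg_threshold")
    n_object = np.count_nonzero(seg > high_fg_threshold)         # object region
    n_non_background = np.count_nonzero(seg > low_fg_threshold)  # object + boundary
    if n_non_background == 0:
        return 0.0  # assumption: an empty mask gets the lowest quality
    return n_object / n_non_background


seg = np.array([0.1, 0.3, 0.5, 0.8, 0.9, 1.0])
quality = pixel_scoring(seg)  # 3 object pixels out of 5 non-background pixels
```

A mask with a wide band of values between the two thresholds inflates the denominator and so lowers the score. This is the boundary-sharpness behaviour the text describes.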
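In another implementation, the values of a subset of the pixels in the segmentation map are obtained. These values are greater than a preset value; the preset value is less than the foreground threshold and greater than or equal to the background threshold; and the foreground threshold is less than 1. The image segmentation quality of the segmentation map is then determined from the values of this subset of pixels: the greater the average of the adjusted pixel values of the pixels in the segmentation map, the higher the image segmentation quality.
Specifically, the steps are as follows:
1. Take the pixels of the segmentation map whose values exceed a threshold T (the preset value above) as the high-foreground region.
2. Compute the mean of the maximum-value matrix over the high-foreground region. This returns a matrix of dimension N that gives the instance confidence of each image; the instance confidence may be used as the image segmentation quality in this embodiment.
3. Optionally, compute category confidences as well. For each of the C categories, obtain that category's region from the maximum-value matrix, find the foreground region within it, and use the mean over that region as the category's confidence. The result is an N*C matrix.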
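The instance and per-category confidences described above can be sketched as masked means over the maximum-value matrix. This sketch makes several assumptions: the probability map has shape (N, C, H, W); each category's region is taken from the argmax index matrix; images or categories with no high-foreground pixels get 0; and T defaults to 0.5. The function name is hypothetical.

```python
import numpy as np


def instance_and_class_confidence(prob: np.ndarray, t: float = 0.5):
    """prob: (N, C, H, W). Returns instance confidences (N,) and
    per-category confidences (N, C) as means of the maximum-value matrix
    over the high-foreground region (pixels whose maximum value exceeds t)."""
    max_values = prob.max(axis=1)   # (N, H, W) maximum-value matrix
    labels = prob.argmax(axis=1)    # (N, H, W) maximum-index matrix
    foreground = max_values > t     # high-foreground region
    n, c = prob.shape[:2]
    instance = np.zeros(n)
    per_class = np.zeros((n, c))
    for i in range(n):
        if foreground[i].any():
            instance[i] = max_values[i][foreground[i]].mean()
        for k in range(c):
            region = foreground[i] & (labels[i] == k)  # category k's foreground
            if region.any():
                per_class[i, k] = max_values[i][region].mean()
    return instance, per_class


prob = np.array([[[[0.2, 0.9]],
                  [[0.8, 0.1]]]])  # shape (1, 2, 1, 2)
instance, per_class = instance_and_class_confidence(prob)
```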
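In another implementation, the image segmentation quality is determined as follows:
1. Obtain initial pixel values of a subset of the pixels in the segmentation map. These initial values are greater than a preset value; the preset value is less than the foreground threshold and greater than or equal to the background threshold; and the foreground threshold is less than 1.
2. Adjust the initial pixel values according to a target mapping relationship to obtain adjusted pixel values. The target mapping relationship maps initial pixel values to adjusted pixel values. As the initial value changes from the preset value to 1, the adjusted value changes gradually from 0 to 1, and the slope of the mapping stays constant or increases as the initial value increases.
3. Determine the image segmentation quality of the segmentation map from the adjusted pixel values. The greater their average, the higher the image segmentation quality.
In this embodiment, the pixel values of the segmentation map are mapped through a preset mapping relationship, and the image segmentation quality is the average of the mapped values. Each mapped value therefore acts as the weight of its pixel when the quality is computed. A larger weight does more to raise the quality, and a smaller weight does more to lower it.
Pixels whose values are less than the foreground threshold and greater than or equal to the background threshold are what lower the image segmentation quality. They should therefore receive very small weights, for example 0 or values close to 0. This is why, as the initial pixel value changes from the preset value to 1, the adjusted value changes gradually from 0 to 1, and the slope of the mapping stays constant or increases as the initial value increases. Initial values near the preset value receive weights close to 0. The weight grows as the initial value increases and reaches 1 when the initial value is 1.
In this embodiment, the slope of the target mapping relationship may increase gradually as the initial pixel value increases, so that the mapping becomes steeper and steeper.
In this embodiment, the target mapping relationship may be a preset function. For example, refer to FIG. 9, which is a schematic diagram of several target mapping relationships in this embodiment. The horizontal axis of FIG. 9 is the initial pixel value; the smallest initial value is greater than the preset value, which is 0.5 in FIG. 9. The vertical axis takes values from 0 to 1. Of the functions shown in FIG. 9:
- the bottom two have slopes that increase as the initial pixel value increases;
- the top one has a slope that decreases as the initial value increases;
- the second from the top has a constant slope.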
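One mapping family with the required shape is a rescaling of [T, 1] onto [0, 1] followed by a power. This is an illustrative choice and not necessarily the functions plotted in FIG. 9:
- `gamma = 1` gives a constant slope;
- `gamma > 1` gives a slope that increases with the initial value;
- `gamma < 1` gives a decreasing slope, the case the text does not prefer.

T = 0.5 follows the example in FIG. 9, and the helper names are hypothetical.

```python
import numpy as np


def target_mapping(values, preset: float = 0.5, gamma: float = 2.0):
    """Map initial pixel values in [preset, 1] to adjusted values in [0, 1].

    gamma == 1 gives a constant slope; gamma > 1 gives a slope that grows
    as the initial value grows, so values near `preset` get weights near 0.
    """
    scaled = np.clip((np.asarray(values, dtype=float) - preset) / (1.0 - preset),
                     0.0, 1.0)
    return scaled ** gamma


def finite_difference_slopes(mapping, preset: float = 0.5, n: int = 101, **kwargs):
    """Approximate the slope of `mapping` on a uniform grid over [preset, 1]."""
    xs = np.linspace(preset, 1.0, n)
    ys = mapping(xs, preset=preset, **kwargs)
    return np.diff(ys) / np.diff(xs)
```

With `gamma = 2`, an initial value of 0.75, halfway between the preset value and 1, receives a weight of only 0.25. This pushes uncertain pixels toward zero influence.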
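In this embodiment of this application, the image segmentation quality of the segmentation map may be determined from the adjusted pixel values of its pixels: the greater their average, the higher the image segmentation quality.
For example, the image segmentation quality of the segmentation map may be calculated by the following formula:
$$Q = \frac{\sum_{p\,:\,v_p > 0.5} F(v_p)}{\left|\{\,p : v_p > 0.5\,\}\right|}$$
Here Q is the image segmentation quality, v_p is the initial value of pixel p, and F is the target mapping relationship. F remaps the values of pixels greater than 0.5, that is, values in the interval [0.5, 1.0]; importantly, values closer to 1 receive higher mapped values. The numerator is the sum of the adjusted pixel values of the pixels in the segmentation map. The denominator is the number of pixels whose initial values exceed the preset value (for example, 0.5).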
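The formula above averages the mapped values over the pixels whose initial value exceeds the preset value. This minimal sketch assumes that F is a power-law mapping with gamma = 2. That is one admissible choice: it maps [0.5, 1] onto [0, 1], as described for the target mapping relationship. The sketch also returns 0.0 when no pixel exceeds the preset value. The function name is hypothetical.

```python
import numpy as np


def mapped_segmentation_quality(seg: np.ndarray, preset: float = 0.5,
                                gamma: float = 2.0) -> float:
    """Q = mean of F(v) over pixels with v > preset,
    using the assumed mapping F(v) = ((v - preset) / (1 - preset)) ** gamma."""
    values = seg[seg > preset]          # pixels counted in the denominator
    if values.size == 0:
        return 0.0                      # assumption: no foreground -> lowest quality
    mapped = ((values - preset) / (1.0 - preset)) ** gamma  # numerator terms F(v_p)
    return float(mapped.mean())


seg = np.array([[0.2, 0.75], [1.0, 0.4]])
quality = mapped_segmentation_quality(seg)  # (0.25 + 1.0) / 2
```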
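504: Adjust the first confidence based on the image segmentation quality to obtain a second confidence of the segmentation map, where the higher the image segmentation quality, the greater the second confidence.
In this embodiment of this application, once the image segmentation quality of the segmentation map is obtained, the first confidence may be adjusted based on it to determine the second confidence of the segmentation map.
In an implementation, the product of the image segmentation quality and the first confidence may be used as the second confidence of the segmentation map.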
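With the product rule, step 504 reduces to one multiplication. The range check below is an added safeguard. It assumes both inputs are normalized to [0, 1], as they are for the quality measures above. The function name is hypothetical.

```python
def second_confidence(first_confidence: float, segmentation_quality: float) -> float:
    """Second confidence = image segmentation quality x first confidence."""
    for name, value in (("first_confidence", first_confidence),
                        ("segmentation_quality", segmentation_quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return first_confidence * segmentation_quality


# An accurately placed box (0.9) with a blurry mask (0.5) is down-weighted.
conf = second_confidence(0.9, 0.5)
```

Because the quality factor is at most 1, the product never exceeds the detection confidence. A perfect mask (quality 1) leaves the first confidence unchanged.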
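In this way, the second confidence of the segmentation map can indicate both the localization accuracy of the detection box and the image segmentation quality of the segmentation map: the higher the localization accuracy, the higher the second confidence, and the higher the image segmentation quality, the higher the second confidence.
In an implementation, the terminal device may display an AR object around the target object in the first image based on the second confidence being higher than a threshold.
In this embodiment of this application, in the AR special-effect display scenario, the terminal device may display an AR object around the target object in the first image when the second confidence is higher than the threshold. Specifically, the AR object may be displayed around the target object based on the target object's 3D position information.
In an implementation, the terminal device may replace the target object in the first image with a first object based on the second confidence being higher than the threshold, where the first object is different from the target object.
In this embodiment of this application, in the AI passer-by removal scenario, the target object may be a moving passer-by that has been identified. In that case the terminal device may replace the target object with a first object when the second confidence is higher than the threshold. The first object is the background region at the target object's location, taken from another image frame. That is, region A is where the target object is located in the first image, while in the other image frame region A shows only background, because the target object has moved elsewhere in the image or out of the image.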
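The following uses AR special-effect display as an example to describe an application of this embodiment:
Training: the neural network used for target detection is trained first. A dataset is built from annotated portrait segmentation data (I) and portrait 3D information data (V), and a joint network is trained on it. The model's outputs are compared with the ground-truth labels and the difference is computed. The network parameters are then updated by backpropagating this difference, until a preset number of training iterations is reached.
Inference: an RGB video stream is input to the joint model, which produces the 3D information and the segmentation mask (also called the segmentation map) of each frame. The next step is to decide whether the frame should display the AR wing effect. Specifically, the segmentation map confidence determining method in the embodiment corresponding to FIG. 5 may be used to estimate a confidence for the segmentation mask. If this confidence is greater than 0.6, the AR wing effect is displayed in that frame. In that case, the wing position is reconstructed from the 3D information, and virtual-real occlusion is then handled using the segmentation mask.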
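The per-frame decision in this example can be sketched as follows. It combines the product rule with the 0.6 threshold from the text. The `FrameResult` record, its field names, and the function name are hypothetical stand-ins for whatever the joint model actually returns.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FrameResult:
    """Hypothetical per-frame output of the joint detection/segmentation model."""
    box_confidence: float  # first confidence of the detection box
    mask_quality: float    # image segmentation quality of the mask


def frames_showing_wings(results: Iterable[FrameResult],
                         threshold: float = 0.6) -> List[int]:
    """Indices of frames whose mask confidence exceeds the threshold."""
    selected = []
    for index, result in enumerate(results):
        mask_confidence = result.box_confidence * result.mask_quality
        if mask_confidence > threshold:
            selected.append(index)
    return selected


frames = [FrameResult(0.9, 0.9), FrameResult(0.95, 0.5), FrameResult(0.7, 0.9)]
shown = frames_showing_wings(frames)  # 0.81 and 0.63 pass, 0.475 does not
```

The second frame has the most confident detection box, yet it is rejected because its mask is blurry. This is the failure case that a detection-only confidence would miss.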
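Embodiments of this application provide a segmentation map confidence determining method. The method includes:
- obtaining a first image;
- performing target detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, where the first confidence indicates the localization accuracy of the detection box;
- performing image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map; and
- adjusting the first confidence based on the image segmentation quality to obtain a second confidence of the segmentation map, where the higher the image segmentation quality, the greater the second confidence.
In this way, the first confidence of the detection box is adjusted using the image segmentation quality to obtain the second confidence of the segmentation map. The second confidence therefore reflects both the localization accuracy of the detection box and the segmentation quality of the segmentation map itself, which yields a more accurate segmentation map confidence. Moreover, the method needs no additional network to compute this confidence. An accurate confidence is obtained directly from the target detection result and the segmentation result without much extra computation, which makes the solution friendly to, and easy to deploy on, the device side.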
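Refer to FIG. 10, which is a schematic structural diagram of a segmentation map confidence determining apparatus according to an embodiment of this application. As shown in FIG. 10, the segmentation map confidence determining apparatus 1000 provided in this embodiment includes:
an obtaining module 1001, configured to obtain a first image;
For a detailed description of the obtaining module 1001, refer to the description of step 501; details are not repeated here.
a target detection module 1002, configured to perform target detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, where the first confidence indicates the localization accuracy of the detection box;
For a detailed description of the target detection module 1002, refer to the description of step 502; details are not repeated here.
an image segmentation module 1003, configured to perform image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map;
For a detailed description of the image segmentation module 1003, refer to the description of step 503; details are not repeated here.
a confidence determining module 1004, configured to adjust the first confidence based on the image segmentation quality to obtain a second confidence of the segmentation map, where the higher the image segmentation quality, the greater the second confidence.
For a detailed description of the confidence determining module 1004, refer to the description of step 504; details are not repeated here.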
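In a possible implementation, the image segmentation quality indicates the boundary sharpness of the target object in the segmentation map, where the higher the boundary sharpness, the higher the image segmentation quality.
In a possible implementation, the second confidence of the segmentation map indicates the localization accuracy of the detection box and the image segmentation quality of the segmentation map, where the higher the localization accuracy, the higher the second confidence.
In a possible implementation, the segmentation map includes an object region in which the target object is located, a background region of the target object, and a boundary region between the object region and the background region, and the obtaining module is configured to:
use, as the image segmentation quality of the segmentation map, the ratio of the number of pixels in the object region to the number of pixels in a target region, where the target region is the union of the object region and the boundary region.
In a possible implementation, the obtaining module is configured to:
determine the region of pixels in the segmentation map whose values are greater than a foreground threshold as the object region;
determine the region of pixels in the segmentation map whose values are less than a background threshold as the background region; and
determine the region of pixels in the segmentation map whose values are greater than the background threshold and less than the foreground threshold as the boundary region.
In a possible implementation, the image segmentation module is configured to perform image segmentation on the image within the detection box to obtain an initial segmentation map of the target object, where the initial segmentation map includes a plurality of pixels and, for each pixel, the probability that it belongs to each category;
use the maximum of each pixel's per-category probabilities as that pixel's value in the segmentation map;
obtain the values of a subset of the pixels in the segmentation map, where these values are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1; and
determine the image segmentation quality of the segmentation map based on the values of this subset of pixels, where the greater the average of the adjusted pixel values of the pixels in the segmentation map, the higher the image segmentation quality.
In a possible implementation, the image segmentation module is configured to obtain initial pixel values of a subset of the pixels in the segmentation map, where these initial values are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
adjust the initial pixel values according to a target mapping relationship to obtain adjusted pixel values of the pixels in the segmentation map, where the target mapping relationship maps initial pixel values to adjusted pixel values such that, as the initial pixel value changes from the preset value to 1, the adjusted pixel value changes gradually from 0 to 1, and the slope of the target mapping relationship stays constant or increases as the initial pixel value increases; and
determine the image segmentation quality of the segmentation map based on the adjusted pixel values of its pixels, where the greater the average of the adjusted pixel values, the higher the image segmentation quality.
In a possible implementation, the confidence determining module is configured to determine the product of the image segmentation quality and the first confidence as the second confidence of the segmentation map.
In a possible implementation, the apparatus further includes:
an image processing module, configured to display an AR object around the target object in the first image based on the second confidence being higher than a threshold; or
replace the target object in the first image with a first object based on the second confidence being higher than the threshold, where the first object is different from the target object.
Embodiments of this application provide a segmentation map confidence determining apparatus. The apparatus includes:
- an obtaining module, configured to obtain a first image;
- a target detection module, configured to perform target detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, where the first confidence indicates the localization accuracy of the detection box;
- an image segmentation module, configured to perform image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and the image segmentation quality of the segmentation map; and
- a confidence determining module, configured to adjust the first confidence based on the image segmentation quality to obtain a second confidence of the segmentation map, where the higher the image segmentation quality, the greater the second confidence.
In this way, the first confidence of the detection box is adjusted using the image segmentation quality to obtain the second confidence of the segmentation map. The second confidence therefore reflects both the localization accuracy of the detection box and the segmentation quality of the segmentation map itself, which yields a more accurate segmentation map confidence. Moreover, the approach needs no additional network to compute this confidence. An accurate confidence is obtained directly from the target detection result and the segmentation result without much extra computation, which makes the solution friendly to, and easy to deploy on, the device side.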
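An embodiment of this application further provides a server. Refer to FIG. 11, which is a schematic structural diagram of a server according to an embodiment of this application. Specifically, the server 1100 is implemented by one or more servers and may vary considerably in configuration or performance. It may include:
- one or more central processing units (CPUs) 1111, for example one or more processors;
- a memory 1132;
- one or more storage media 1130 (for example, one or more mass storage devices) storing applications 1142 or data 1144.
The memory 1132 and the storage media 1130 may provide transient or persistent storage. A program stored in the storage medium 1130 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Further, the central processing unit 1111 may be configured to communicate with the storage medium 1130 and to execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may further include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, and one or more input/output interfaces 1158. It may also include one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
Specifically, the server may perform the segmentation map confidence determining method in the embodiment corresponding to FIG. 5.
The segmentation map confidence determining apparatus 1000 described in FIG. 10 may be a module in the server, and the processor in the server may perform the segmentation map confidence determining method performed by the apparatus 1000.
An embodiment of this application further provides a computer program product. When it runs on a computer, the computer is enabled to perform the steps performed by the foregoing execution device, or the steps performed by the foregoing server.
An embodiment of this application further provides a computer-readable storage medium storing a program for signal processing. When the program runs on a computer, the computer is enabled to perform the steps performed by the foregoing execution device, or the steps performed by the foregoing training device.
The server or terminal device provided in embodiments of this application may specifically be a chip. The chip includes a processing unit and a communication unit.
- The processing unit may be, for example, a processor.
- The communication unit may be, for example, an input/output interface, a pin, or a circuit.
The processing unit may execute computer-executable instructions stored in a storage unit. This enables a chip in the execution device, or a chip in the training device, to perform the data processing method described in the foregoing embodiments. Optionally, the storage unit is inside the chip, such as a register or a cache. Alternatively, it may be a storage unit in the wireless access device but outside the chip, such as a read-only memory (ROM), another type of static storage device that can store static information and instructions, or a random access memory (RAM).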
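Specifically, refer to FIG. 12, which is a schematic structural diagram of a chip according to an embodiment of this application. The chip may take the form of a neural network processing unit, NPU 1200. The NPU 1200 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core of the NPU is an operation circuit 1203. A controller 1204 controls the operation circuit 1203 to fetch matrix data from memory and perform multiplication.
Through cooperation among its internal components, the NPU 1200 may implement the model training method provided in the embodiment described in FIG. 6, or perform inference with the trained model.
The operation circuit 1203 in the NPU 1200 may perform the steps of obtaining a first neural network model and training the first neural network model.
More specifically, in some implementations, the operation circuit 1203 in the NPU 1200 includes multiple processing engines (PEs). In some implementations, the operation circuit 1203 is a two-dimensional systolic array. It may also be a one-dimensional systolic array or another electronic circuit capable of mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1203 is a general-purpose matrix processor.
For example, assume an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data of matrix B from the weight memory 1202 and buffers it on each PE. It then fetches the data of matrix A from the input memory 1201 and performs a matrix operation with matrix B. Partial or final results of the resulting matrix are stored in an accumulator 1208.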
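The accumulation of partial results can be illustrated in software. The shared dimension of A and B is split into tiles, and the partial products are summed into an accumulator. This NumPy sketch only mirrors the data flow conceptually. The tile size and function name are assumptions, and the sketch says nothing about how the PEs or the systolic array are actually wired.

```python
import numpy as np


def tiled_matmul(a: np.ndarray, b: np.ndarray, tile_k: int = 4) -> np.ndarray:
    """Compute C = A @ B by summing partial products over tiles of the shared
    dimension K, the way partial results are collected in an accumulator."""
    m, k = a.shape
    k_b, n = b.shape
    if k != k_b:
        raise ValueError("inner dimensions of A and B must match")
    accumulator = np.zeros((m, n), dtype=np.result_type(a, b))
    for start in range(0, k, tile_k):
        stop = min(start + tile_k, k)
        # Partial result for this slice of K; added into the accumulator.
        accumulator += a[:, start:stop] @ b[start:stop, :]
    return accumulator


rng = np.random.default_rng(1)
a = rng.standard_normal((3, 10))
b = rng.standard_normal((10, 5))
c = tiled_matmul(a, b, tile_k=4)  # tiles of K: 0-3, 4-7, 8-9
```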
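The unified memory 1206 stores input data and output data. Weight data is moved into the weight memory 1202 directly through a direct memory access controller (DMAC) 1205. Input data is likewise moved into the unified memory 1206 through the DMAC.
The bus interface unit (BIU) 1210 handles interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1209.
Through the bus interface unit 1210, the instruction fetch buffer 1209 obtains instructions from external memory, and the direct memory access controller 1205 obtains the raw data of input matrix A or weight matrix B from external memory.
The DMAC mainly moves data out of the external DDR memory: input data to the unified memory 1206 or to the input memory 1201, and weight data to the weight memory 1202.
The vector calculation unit 1207 includes multiple operation processing units. When needed, it further processes the output of the operation circuit 1203, for example through vector multiplication, vector addition, exponential and logarithmic operations, and magnitude comparison. It is mainly used for non-convolution/fully-connected layer computation in neural networks, such as batch normalization, pixel-wise summation, and upsampling of feature planes.
In some implementations, the vector calculation unit 1207 can store processed output vectors in the unified memory 1206. For example, it may apply a linear or non-linear function to the output of the operation circuit 1203. This may mean linearly interpolating feature planes extracted by a convolutional layer, or applying such a function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1207 generates normalized values, pixel-wise summed values, or both. In some implementations, processed output vectors can be used as activation inputs to the operation circuit 1203, for example in subsequent layers of the neural network.
The instruction fetch buffer 1209 connected to the controller 1204 stores the instructions used by the controller 1204.
The unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch buffer 1209 are all on-chip memories. The external memory is private to the NPU hardware architecture.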
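Any processor mentioned above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the foregoing programs.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in this application, connections between modules indicate communication connections. These may be implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus necessary general-purpose hardware. It may also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the hardware structures that implement the same function may vary, for example analog, digital, or dedicated circuits. For this application, however, a software implementation is the better option in most cases. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, may be implemented in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc. It includes several instructions that instruct a computer device (a personal computer, a training device, a network device, or the like) to perform the methods described in embodiments of this application.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented completely or partially in the form of a computer program product.
The computer program product includes one or more computer instructions. When these are loaded and executed on a computer, the procedures or functions according to embodiments of this application are generated completely or partially. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, they may be transmitted from one website, computer, training device, or data center to another by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave).
The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a training device or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), a semiconductor medium (for example a solid state disk (SSD)), or the like.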

Claims (20)

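  1. A segmentation map confidence determination method, characterized in that the method comprises:
    obtaining a first image;
    performing object detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, wherein the first confidence indicates the localization accuracy of the detection box;
    performing image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and to obtain an image segmentation quality of the segmentation map; and
    adjusting the first confidence according to the image segmentation quality to obtain a second confidence of the segmentation map, wherein a higher image segmentation quality yields a larger second confidence.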
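  2. The method according to claim 1, wherein the image segmentation quality indicates the boundary sharpness of the target object in the segmentation map, and a higher boundary sharpness corresponds to a higher image segmentation quality.
  3. The method according to claim 1 or 2, wherein the second confidence of the segmentation map indicates the localization accuracy of the detection box and the image segmentation quality of the segmentation map, and a higher localization accuracy corresponds to a higher second confidence.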
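  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    determining, as an object region, the region containing the pixels in the segmentation map whose pixel values are greater than a foreground threshold;
    determining, as a background region, the region containing the pixels in the segmentation map whose pixel values are less than a background threshold;
    determining, as a boundary region, the region containing the pixels in the segmentation map whose pixel values are greater than the background threshold and less than the foreground threshold; and
    the obtaining an image segmentation quality of the segmentation map comprises:
    using, as the image segmentation quality of the segmentation map, the ratio of the number of pixels in the object region to the number of pixels in a target region, wherein the target region is the union of the object region and the boundary region.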
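  5. The method according to any one of claims 1 to 4, wherein the performing image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object comprises:
    performing image segmentation on the image within the detection box to obtain an initial segmentation map corresponding to the target object, wherein the initial segmentation map comprises a plurality of pixels and the probability that each pixel belongs to each category; and
    using, for each pixel, the maximum of its per-category probabilities as the pixel value of that pixel in the segmentation map; and
    the obtaining an image segmentation quality of the segmentation map comprises:
    obtaining the pixel values of some of the pixels included in the segmentation map, wherein the pixel values of those pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1; and
    determining the image segmentation quality of the segmentation map according to the pixel values of those pixels, wherein a larger average of the adjusted pixel values of the pixels included in the segmentation map corresponds to a higher image segmentation quality.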
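  6. The method according to any one of claims 1 to 4, wherein the obtaining an image segmentation quality of the segmentation map comprises:
    obtaining the initial pixel values of some of the pixels included in the segmentation map, wherein the initial pixel values of those pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
    adjusting the initial pixel values according to a target mapping relationship to obtain adjusted pixel values of the pixels included in the segmentation map, wherein the target mapping relationship represents the mapping between the initial pixel values and the adjusted pixel values; in the target mapping relationship, as the initial pixel value changes from the preset value to 1, the adjusted pixel value gradually changes from 0 to 1, and as the initial pixel value increases, the slope of the target mapping relationship remains unchanged or increases; and
    determining the image segmentation quality of the segmentation map according to the adjusted pixel values of the pixels included in the segmentation map, wherein a larger average of the adjusted pixel values corresponds to a higher image segmentation quality.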
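  7. The method according to any one of claims 1 to 6, wherein the adjusting the first confidence according to the image segmentation quality to obtain a second confidence of the segmentation map comprises:
    determining the product of the image segmentation quality and the first confidence as the second confidence of the segmentation map.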
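  8. The method according to any one of claims 1 to 7, wherein, after the obtaining a second confidence of the segmentation map, the method further comprises:
    displaying an AR object around the target object in the first image based on the second confidence being higher than a threshold; or
    replacing the target object in the first image with a first object based on the second confidence being higher than a threshold, wherein the first object is different from the target object.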
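  9. A segmentation map confidence determination apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain a first image;
    an object detection module, configured to perform object detection on the first image to obtain a detection box of a target object in the first image and a first confidence of the detection box, wherein the first confidence indicates the localization accuracy of the detection box;
    an image segmentation module, configured to perform image segmentation on the image within the detection box to obtain a segmentation map corresponding to the target object and to obtain an image segmentation quality of the segmentation map; and
    a confidence determination module, configured to adjust the first confidence according to the image segmentation quality to obtain a second confidence of the segmentation map, wherein a higher image segmentation quality yields a larger second confidence.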
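  10. The apparatus according to claim 9, wherein the image segmentation quality indicates the boundary sharpness of the target object in the segmentation map, and a higher boundary sharpness corresponds to a higher image segmentation quality.
  11. The apparatus according to claim 9 or 10, wherein the second confidence of the segmentation map indicates the localization accuracy of the detection box and the image segmentation quality of the segmentation map, and a higher localization accuracy corresponds to a higher second confidence.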
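  12. The apparatus according to any one of claims 9 to 11, wherein the obtaining module is configured to:
    determine, as an object region, the region containing the pixels in the segmentation map whose pixel values are greater than a foreground threshold;
    determine, as a background region, the region containing the pixels in the segmentation map whose pixel values are less than a background threshold;
    determine, as a boundary region, the region containing the pixels in the segmentation map whose pixel values are greater than the background threshold and less than the foreground threshold; and
    use, as the image segmentation quality of the segmentation map, the ratio of the number of pixels in the object region to the number of pixels in a target region, wherein the target region is the union of the object region and the boundary region.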
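  13. The apparatus according to any one of claims 9 to 11, wherein the image segmentation module is configured to perform image segmentation on the image within the detection box to obtain an initial segmentation map corresponding to the target object, wherein the initial segmentation map comprises a plurality of pixels and the probability that each pixel belongs to each category;
    use, for each pixel, the maximum of its per-category probabilities as the pixel value of that pixel in the segmentation map;
    obtain the pixel values of some of the pixels included in the segmentation map, wherein the pixel values of those pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1; and
    determine the image segmentation quality of the segmentation map according to the pixel values of those pixels, wherein a larger average of the adjusted pixel values of the pixels included in the segmentation map corresponds to a higher image segmentation quality.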
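  14. The apparatus according to any one of claims 9 to 11, wherein the image segmentation module is configured to obtain the initial pixel values of some of the pixels included in the segmentation map, wherein the initial pixel values of those pixels are greater than a preset value, the preset value is less than the foreground threshold and greater than or equal to the background threshold, and the foreground threshold is less than 1;
    adjust the initial pixel values according to a target mapping relationship to obtain adjusted pixel values of the pixels included in the segmentation map, wherein the target mapping relationship represents the mapping between the initial pixel values and the adjusted pixel values; in the target mapping relationship, as the initial pixel value changes from the preset value to 1, the adjusted pixel value gradually changes from 0 to 1, and as the initial pixel value increases, the slope of the target mapping relationship remains unchanged or increases; and
    determine the image segmentation quality of the segmentation map according to the adjusted pixel values of the pixels included in the segmentation map, wherein a larger average of the adjusted pixel values corresponds to a higher image segmentation quality.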
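  15. The apparatus according to any one of claims 9 to 14, wherein the confidence determination module is configured to determine the product of the image segmentation quality and the first confidence as the second confidence of the segmentation map.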
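  16. The apparatus according to any one of claims 9 to 15, wherein the apparatus further comprises:
    an image processing module, configured to display an AR object around the target object in the first image based on the second confidence being higher than a threshold; or
    replace the target object in the first image with a first object based on the second confidence being higher than a threshold, wherein the first object is different from the target object.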
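  17. A model training apparatus, characterized in that the apparatus comprises a memory and a processor, wherein the memory stores code, and the processor is configured to obtain the code and perform the method according to any one of claims 1 to 8.
  18. A computer storage medium, characterized in that the computer storage medium stores one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method according to any one of claims 1 to 8.
  19. A computer program product, comprising code, characterized in that the code, when executed, is used to implement the method according to any one of claims 1 to 8.
  20. A chip system, characterized by comprising a processor, wherein the processor is configured to perform processing according to obtained computer instructions, so as to implement the method according to any one of claims 1 to 8.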
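The scoring steps of claims 4 to 8 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the detector and segmentation network are out of scope, the segmentation map is given as nested lists of per-pixel values in [0, 1], and the power-law curve used for the claim 6 "target mapping relationship" is a hypothetical choice (any curve rising from 0 at the preset value to 1 at 1 with non-decreasing slope would fit the claim). All function names are hypothetical.

```python
from typing import List, Sequence

Map = Sequence[Sequence[float]]


def max_prob_map(probs: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Claim 5: each pixel's value is its largest per-class probability."""
    return [[max(p) for p in row] for row in probs]


def region_quality(seg: Map, fg_thr: float, bg_thr: float) -> float:
    """Claim 4: |object region| / |object region U boundary region|."""
    obj = boundary = 0
    for row in seg:
        for v in row:
            if v > fg_thr:
                obj += 1        # object region
            elif bg_thr < v < fg_thr:
                boundary += 1   # boundary region (background pixels are ignored)
    total = obj + boundary
    return obj / total if total else 0.0


def mapped_quality(seg: Map, preset: float, power: float = 2.0) -> float:
    """Claim 6, hypothetical mapping f(v) = ((v - preset) / (1 - preset)) ** power.

    f rises from 0 at v = preset to 1 at v = 1; for power >= 1 its slope never
    decreases. The claim additionally requires bg_thr <= preset < fg_thr < 1.
    """
    if not 0.0 <= preset < 1.0 or power < 1.0:
        raise ValueError("need 0 <= preset < 1 and power >= 1")
    vals = [((v - preset) / (1.0 - preset)) ** power
            for row in seg for v in row if v > preset]
    return sum(vals) / len(vals) if vals else 0.0


def second_confidence(first_conf: float, quality: float) -> float:
    """Claim 7: second confidence = segmentation quality x first confidence."""
    return quality * first_conf


def should_apply_effect(second_conf: float, threshold: float) -> bool:
    """Claim 8: show the AR object / replace the target only above the threshold."""
    return second_conf > threshold
```

For example, with the map `[[0.95, 0.9, 0.6], [0.2, 0.05, 0.97]]`, a foreground threshold of 0.8 and a background threshold of 0.3, three pixels fall in the object region and one in the boundary region, so `region_quality` gives 3/4 = 0.75; a detection confidence of 0.9 then becomes a second confidence of about 0.675. Multiplying (claim 7) means a sharp mask cannot rescue a poorly localized box, and vice versa.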
PCT/CN2022/077911 2021-02-27 2022-02-25 Segmentation map confidence determination method and apparatus WO2022179604A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110221912.1A CN113066048A (zh) 2021-02-27 2021-02-27 Segmentation map confidence determination method and apparatus
CN202110221912.1 2021-02-27

Publications (1)

Publication Number Publication Date
WO2022179604A1 true WO2022179604A1 (zh) 2022-09-01

Family

ID=76559221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077911 WO2022179604A1 (zh) 2021-02-27 2022-02-25 Segmentation map confidence determination method and apparatus

Country Status (2)

Country Link
CN (1) CN113066048A (zh)
WO (1) WO2022179604A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
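CN113066048A (zh) * 2021-02-27 2021-07-02 Huawei Technologies Co., Ltd. Segmentation map confidence determination method and apparatus
CN113469997B (zh) * 2021-07-19 2024-02-09 Jingdong Technology Holding Co., Ltd. Flat glass detection method, apparatus, device, and medium
CN114858200B (zh) * 2022-04-19 2023-06-27 Hozon New Energy Automobile Co., Ltd. Method and apparatus for evaluating the quality of objects detected by vehicle sensors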

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
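CN109409371A (zh) * 2017-08-18 2019-03-01 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN110852285A (zh) * 2019-11-14 2020-02-28 Tencent Technology (Shenzhen) Co., Ltd. Object detection method and apparatus, computer device, and storage medium
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
CN113066048A (zh) * 2021-02-27 2021-07-02 Huawei Technologies Co., Ltd. Segmentation map confidence determination method and apparatus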

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
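CN110555358B (zh) * 2018-06-01 2023-09-12 Apple Inc. Method and device for detecting and recognizing features in AR/VR scenes
CN110288625B (zh) * 2019-07-04 2021-09-03 Beijing ByteDance Network Technology Co., Ltd. Method and apparatus for processing images
CN110675407B (zh) * 2019-09-17 2022-08-05 Beijing Dajia Internet Information Technology Co., Ltd. Image instance segmentation method and apparatus, electronic device, and storage medium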

Also Published As

Publication number Publication date
CN113066048A (zh) 2021-07-02

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22758968

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 22758968

Country of ref document: EP

Kind code of ref document: A1