WO2023103915A1 - Target recognition method, electronic device and storage medium - Google Patents

Target recognition method, electronic device and storage medium Download PDF

Info

Publication number
WO2023103915A1
WO2023103915A1
Authority
WO
WIPO (PCT)
Prior art keywords
image information
target
target object
recognition result
image
Prior art date
Application number
PCT/CN2022/136329
Other languages
English (en)
French (fr)
Inventor
孙洪玲
刘彦宾
高文婷
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2023103915A1 publication Critical patent/WO2023103915A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/96Management of image or video recognition tasks

Definitions

  • the present application relates to the technical field of computer vision, in particular to an object recognition method, electronic equipment and a storage medium.
  • In the related art, in order to bring the target recognition function to the terminal, the target recognition algorithm has been made lightweight: for example, the front-end network of the original Single Shot MultiBox Detector (SSD) network structure, namely the Visual Geometry Group network (VGG-16), is replaced with a lightweight model based on the SqueezeNet architecture and combined with a series of additional feature layers, reducing the model size to roughly tens of megabytes so that it can run directly on mobile or embedded devices.
  • In a first aspect, an embodiment of the present application provides a target recognition method applied to a terminal, including: collecting first image information; sending the first image information to a cloud server, so that the cloud server recognizes the target object in the first image information through a target detection model to generate an image recognition result; and acquiring the image recognition result according to the cloud server's response to the first image information.
  • In a second aspect, an embodiment of the present application provides a target recognition method applied to a cloud server, including: receiving first image information from a terminal; recognizing the target object in the first image information through a target detection model to obtain an image recognition result; and sending the image recognition result to the terminal.
  • In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the target recognition method described in any one of the embodiments of the first aspect of the present application, or the target recognition method described in any one of the embodiments of the second aspect of the present application, is implemented.
  • In a fourth aspect, the embodiments of the present application provide a computer-readable storage medium storing a program which, when executed by a processor, implements the target recognition method described in any one of the embodiments of the first aspect of the present application, or the target recognition method described in any one of the embodiments of the second aspect of the present application.
  • Fig. 1 is a schematic flow chart of a target recognition method provided by an embodiment of the present application;
  • Fig. 2 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 3 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 4 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 5 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 6 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 7 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 8 is a schematic flow chart of a target recognition method provided by another embodiment of the present application;
  • Fig. 9 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • In the description of the present application, orientation descriptions such as up, down, front, back, left and right indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience in describing the present application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the embodiments of the present application.
  • In the description of this specification, references to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application.
  • In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example.
  • Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
  • In the related art, in order to bring the target recognition function to the terminal, the target recognition algorithm has been made lightweight: for example, the front-end network of the original Single Shot MultiBox Detector (SSD) network structure, namely the Visual Geometry Group network (VGG-16), is replaced with a lightweight model based on the SqueezeNet architecture and combined with a series of additional feature layers, reducing the model size to roughly tens of megabytes so that it can run directly on mobile or embedded devices.
  • However, from the test results, the target recognition accuracy of the lightweight model is low, and it is constrained by the mobile phone system platform.
  • Although the computing platforms of mobile phones are developing very rapidly, their computing power is still very limited, and it is difficult to achieve both real-time performance and accuracy.
  • In order to solve the above problems, embodiments of the present application provide a target recognition method, an electronic device and a storage medium, which relieve the terminal of the heavy computing burden it would otherwise bear when performing the target recognition function.
  • the embodiment of the present application provides a target recognition method applied to a terminal, including:
  • Step S101 collecting first image information
  • It should be noted that the first image information refers to information represented in the form of an image, which provides the material for target recognition. It should be understood that the first image information includes, but is not limited to, a single picture, a single frame, a group of pictures or a piece of video.
  • The first image information may be collected through a camera of the terminal, or in other ways, for example by downloading it from the network or receiving it over Bluetooth. There are various ways to collect the first image information, which are not repeated here.
  • Step S102 sending the first image information to the cloud server, so that the cloud server can identify the target object in the first image information through the target detection model to generate an image recognition result;
  • Step S103 acquiring an image recognition result according to the cloud server responding to the first image information.
  • data transmission between the terminal and the cloud server may be performed in various ways. It mainly includes wired transmission and wireless transmission.
  • wired transmission refers to the way of transmitting information by using tangible media such as metal wires and optical fibers.
  • Optical or electrical signals can carry information such as byte arrays and images.
  • Wired transmission mainly uses wires or optical cables to realize communication transmission.
  • Wireless transmission refers to long-distance communication between multiple nodes without propagation through conductors or cables.
  • Common long-distance wireless transmission methods mainly include GPRS/CDMA, digital radio, spread-spectrum microwave, wireless bridges, satellite communication and short-wave communication; widely used short-range wireless communication standards with good prospects include ZigBee, Bluetooth, Wi-Fi, ultra-wideband (UWB) and near-field communication (NFC).
  • the image recognition result acquired by the terminal includes, but is not limited to: information such as the category, size, orientation, and spatial position of the target object in the first image information.
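  • As a purely illustrative aid (not part of the patent text), the following Python sketch shows one way such a recognition result could be represented on the terminal side; the field names and types are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RecognitionResult:
    """Hypothetical container for one recognized object; all field names are assumed."""
    category: str                                 # class of the target object
    bbox_2d: Tuple[float, float, float, float]    # x_min, y_min, x_max, y_max in pixels
    dimensions_3d: Tuple[float, float, float]     # length, width, height
    position_3d: Tuple[float, float, float]       # spatial position of the object center
    yaw: float                                    # orientation (rotation about the z axis)
```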
  • the task undertaken by the terminal is to collect the first image information and transmit the first image information to the cloud server, and receive the recognition result fed back by the cloud server.
  • In a terminal-cloud collaborative manner, the image recognition process is transferred to the cloud server, freeing up the storage space and computing power of the terminal device.
  • It should be understood that the terminals in the various embodiments of the present application include, but are not limited to, mobile phones, desktop computers, tablet computers, personal digital assistants, point-of-sale terminals, vehicle-mounted computers, notebook computers, palmtop computers, cameras, navigation devices, wearable devices, smart bracelets and other terminals.
  • the target recognition method applied to the terminal may further include the following steps before performing step S102:
  • Step S201 encoding the first image information, converting the first image information into a byte array
  • It should be noted that, to facilitate transmission, the first image information needs to be broken into digestible chunks or packets. In step S201 the first image information is converted into the form of a byte array; in subsequent steps only the byte-array form of the first image information needs to be transmitted, and once the cloud server has decoded and restored it, image recognition can be performed.
  • Step S202 establishing a data transmission connection, and sending the storage capacity of the byte array to the cloud server;
  • Step S203 sending the image information packet to the cloud server based on the data transmission protocol.
  • It should be noted that, in some embodiments of the present application, the transmission protocols that may be adopted include, but are not limited to, the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP).
  • The UDP protocol is a connectionless, unreliable transmission mode. Although it is often used to transmit video streams, the subsequent detection results in this application need to be sent back in a more reliable manner, so some embodiments of this application use the TCP protocol as the data transmission protocol for data interaction between the terminal and the cloud server.
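  • As a minimal sketch of steps S201 to S203 (assuming OpenCV for JPEG encoding and a simple length-prefixed framing over TCP; the server address, port and framing are illustrative and not specified by the patent):

```python
import socket
import struct

import cv2  # OpenCV, assumed available on the terminal side


def send_frame(frame, host="203.0.113.10", port=9000):
    """Encode one frame to a JPEG byte array and push it to the cloud server.

    The length prefix plays the role of step S202 (telling the server how many
    bytes to expect); the payload itself corresponds to step S203.
    """
    ok, encoded = cv2.imencode(".jpg", frame)            # step S201: image -> byte array
    if not ok:
        raise RuntimeError("failed to encode frame")
    payload = encoded.tobytes()

    with socket.create_connection((host, port)) as sock:  # TCP, as preferred over UDP here
        sock.sendall(struct.pack("!I", len(payload)))      # announce the byte-array size
        sock.sendall(payload)                              # send the image packet
```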
  • In some embodiments provided by the present application, after the terminal obtains the image recognition result, it can perform various types of processing on it, including but not limited to outputting the image recognition result, displaying it on the terminal screen, and tracking the target object according to the image recognition result.
  • For example, the terminal analyzes the image recognition result to obtain its label, bounding box and overlay effect, renders the result and displays it on the terminal screen.
  • According to some embodiments of the present application, after obtaining the image recognition result, the terminal can also track the target object according to the image recognition result, collect second image information of the target area where the target object is located, and then send the second image information to the cloud server to obtain an updated image recognition result.
  • It should be noted that in common deep-learning application scenarios, such as object detection on an industrial production line or face recognition, the camera is usually in a fixed position and the position of the target in the picture is relatively certain, so only conventional preprocessing is needed before the image is passed to the recognition algorithm. In augmented reality (AR) scenarios, however, such as AR manuals or AR remote guidance, the camera is controlled by the user and the scene keeps changing; the picture is unstable, the position and size of the object to be recognized in the picture are unconstrained, and the image carries redundant information beyond the target object, which increases the complexity of the computation. Therefore, by tracking the target object using the previously obtained image recognition result, collecting the second image information of the target area where the target object is located, and then sending the second image information to the cloud server, the cloud server does not need to process redundant information other than the target object, which improves the efficiency of recognizing the target object.
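  • A sketch of how the terminal might cut the second image information out of a new frame once a target area has been predicted; the (x_min, y_min, x_max, y_max) box format and the margin are assumptions, since the patent does not fix a representation for the target area:

```python
def crop_target_region(frame, box, margin=0.15):
    """Cut the predicted target area (the 'second image information') out of a frame.

    `frame` is a numpy image array (as produced by OpenCV); `box` is assumed to be
    (x_min, y_min, x_max, y_max) in pixels. A small margin keeps the object inside
    the crop even if the prediction is slightly off.
    """
    h, w = frame.shape[:2]
    x_min, y_min, x_max, y_max = box
    dx = int((x_max - x_min) * margin)
    dy = int((y_max - y_min) * margin)
    x0, y0 = max(0, int(x_min) - dx), max(0, int(y_min) - dy)
    x1, y1 = min(w, int(x_max) + dx), min(h, int(y_max) + dy)
    return frame[y0:y1, x0:x1]
```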
  • Referring to Fig. 1 and Fig. 3, according to some embodiments provided by the present application, after the image recognition result is acquired in step S103 according to the cloud server's response to the first image information, the method may further include:
  • Step S301 locking the target object in the first image information according to the image recognition result, and collecting the third image information
  • It should be noted that the third image information is used to collect motion data of the target object, and includes but is not limited to a group of pictures or a piece of video. It should be understood that if a single picture or a single frame is sufficient to reflect the movement of the target object over a period of time and can be used to collect its motion data, that picture should still be regarded as third image information.
  • The third image information may be collected through the camera of the terminal, or in other ways, for example by downloading it from the network or receiving it over Bluetooth; these are not repeated here.
  • Step S302 acquiring the motion data of the target object in the third image information according to the third image information, and predicting the position of the target object according to the motion data to obtain the target area, wherein the target area is the area in which the target object will appear in the next group of image information collected by the terminal.
  • Step S303 collecting the second image information of the target area, and sending the second image information to the cloud server, so as to obtain an updated image recognition result.
  • According to some embodiments provided by the present application, once steps S301 and S302 have predicted the target area in which the target object is about to appear, pictures in which the target object cannot appear need not be included in the second image information; the second image information is therefore the image information collected for the target area, and it includes but is not limited to a single picture, a single frame, a group of pictures or a piece of video.
  • The second image information may be collected through the camera of the terminal, or in other ways, for example by downloading it from the network or receiving it over Bluetooth; these are not repeated here.
  • the cloud server needs to process all frames included in the first image information and obtain an image recognition result.
  • According to some embodiments provided by the present application, after the terminal obtains the image recognition result, it can lock the target object in the first image information. After locking the target object, the third image information is captured and the motion data of the target object in the third image information is acquired. From the motion data, the movement trend of the target object can be judged, so that its position can be predicted and the area where the target object may appear in the next picture can be obtained; this possible area is the target area. Collecting the second image information for the target area and sending it to the cloud server saves computing power in the cloud server's image recognition processing and accelerates that processing, so that the terminal can obtain the updated image recognition result from the cloud server more quickly. It should be understood that the image recognition result reflects the result of recognizing the first image information, while the updated image recognition result reflects the result of recognizing the third image information.
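  • Putting the pieces together, a hypothetical terminal-side loop might look as follows; `camera`, `server` and `predict_target_area` are placeholders for the terminal camera, the cloud connection and the prediction of steps S301 to S302, none of which are defined by the patent:

```python
def recognition_loop(camera, server, predict_target_area):
    """Illustrative only: recognize a full frame once, then keep sending cropped regions."""
    frame = camera.read()                     # first image information (step S101)
    result = server.recognize(frame)          # steps S102-S103 over the network
    while True:
        clip = [camera.read(), camera.read()]                # third image information (step S301)
        x0, y0, x1, y1 = predict_target_area(result, clip)   # predicted target area (step S302)
        crop = clip[-1][y0:y1, x0:x1]                        # second image information
        result = server.recognize(crop)                      # updated recognition result (step S303)
```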
  • the motion data includes a displacement gradient and a time gradient of the target object in the third image information.
  • Step S302 acquires the motion data of the target object in the third image information according to the third image information, and performs position prediction on the target object according to the motion data to obtain the target area, including:
  • Step S401 selecting the sampling point (x, y) of the target object
  • Step S402 recording the displacement gradient and time gradient of the sampling point in the third image information
  • Step S403 calculating the motion vector of the sampling point according to the displacement gradient and time gradient of the sampling point;
  • It should be noted that, by substituting the displacement gradients and the time gradient of the sampling point into the optical-flow constraint equation f_x·u + f_y·v + f_t = 0, the motion vector (u, v) of the sampling point can be obtained, where f_x and f_y are the displacement gradients of the sampling point and f_t is the time gradient of the sampling point;
  • Step S404 predicting the position of the target object according to the motion vector to obtain the target area.
  • the optical flow is the instantaneous speed of the pixel movement of the space moving object on the observation imaging plane.
  • the optical flow method uses the changes of pixels in the image sequence in the time domain and the correlation between adjacent frames to find the corresponding relationship between the previous frame and the current frame, thereby calculating the motion of objects between adjacent frames.
  • In the above embodiments, steps S401 to S404 use the optical flow method to detect the slight movement changes of the target object in the continuous image sequence, so as to estimate the position to which the target object may move, and then work out the target area in which the target object will appear in the next group of image information collected by the terminal; the target area delimits the collection range of the subsequent second image information.
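  • The patent does not reproduce the equation in text, but the surrounding description matches the classical optical-flow constraint; the sketch below estimates the motion vector of a sampling point by solving f_x·u + f_y·v + f_t = 0 in the least-squares sense over a small window (a Lucas-Kanade-style assumption, not necessarily the exact formulation used in the patent):

```python
import numpy as np


def motion_vector(prev_gray, curr_gray, x, y, win=7):
    """Estimate the motion vector (u, v) at sampling point (x, y) from two grayscale frames.

    f_x and f_y are the spatial (displacement) gradients, f_t is the temporal gradient;
    the constraint f_x*u + f_y*v + f_t = 0 is solved in the least-squares sense over a
    (2*win+1) x (2*win+1) neighbourhood around the sampling point.
    """
    prev = prev_gray.astype(np.float32)
    curr = curr_gray.astype(np.float32)
    fx = (np.roll(prev, -1, axis=1) - np.roll(prev, 1, axis=1)) / 2.0   # displacement gradient in x
    fy = (np.roll(prev, -1, axis=0) - np.roll(prev, 1, axis=0)) / 2.0   # displacement gradient in y
    ft = curr - prev                                                     # time gradient

    ys = slice(y - win, y + win + 1)
    xs = slice(x - win, x + win + 1)
    A = np.stack([fx[ys, xs].ravel(), fy[ys, xs].ravel()], axis=1)
    b = -ft[ys, xs].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares motion vector
    return u, v
```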
  • the motion data includes the motion direction and motion speed of the target object.
  • Step S302 acquires the motion data of the target object in the third image information according to the third image information, and performs position prediction on the target object according to the motion data to obtain the target area, which also includes:
  • Step S501 acquiring two consecutive frames with sequence numbers N-1 and N respectively from the third image information
  • Step S502 obtaining, from the two consecutive frames, the pose matrix of the (N-1)-th frame, the pose matrix of the N-th frame and the center point position of the target object, wherein Twc(N-1) is the pose matrix of the (N-1)-th frame and Twc(N) is the pose matrix of the N-th frame;
  • Step S503 calculating the motion direction and the motion speed T_velocity of the target object from the pose matrix of the (N-1)-th frame, the pose matrix of the N-th frame and the center point position, where T_velocity = Twc(N) * (Twc(N-1))^-1;
  • Step S504 calculating, from the motion direction and the motion speed T_velocity, the pose matrix Twc(N+1) of the target object in the (N+1)-th frame, where Twc(N+1) = T_velocity * Twc(N), and the (N+1)-th frame is the next consecutive frame after the N-th frame;
  • Step S505 predicting, according to Twc(N+1), the target area where the target object will appear in the (N+1)-th frame.
  • It should be noted that Twc(N+1) = T_velocity * Twc(N) directly gives the position and rotation of the center point of the target object in the (N+1)-th frame. From the size of the target object detected in the previous image recognition result, the maximum radius of the target object is obtained, and the maximum three-dimensional region occupied by the target object in space is delimited by spreading out from the center point position. The boundary of this three-dimensional region is projected onto the two-dimensional plane through the matrix to obtain the upper, lower, left and right boundary values in the plane; these boundary values are then expanded outward by some pixels, and finally the target area is obtained.
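  • A numpy sketch of steps S503 to S505, treating Twc as a 4x4 homogeneous pose matrix; the projection helper, the camera intrinsics K and the world-coordinate center point are assumptions added for illustration, not details given by the patent:

```python
import numpy as np


def predict_next_pose(Twc_prev, Twc_curr):
    """Steps S503-S504: velocity transform and predicted pose for frame N+1.

    Twc_prev and Twc_curr are the 4x4 pose matrices of frames N-1 and N.
    T_velocity = Twc(N) * Twc(N-1)^-1, and Twc(N+1) = T_velocity * Twc(N).
    """
    T_velocity = Twc_curr @ np.linalg.inv(Twc_prev)
    return T_velocity @ Twc_curr


def project_center(Twc_next, center_world, K):
    """Project the predicted 3D center point into the image (pixel coordinates).

    `center_world` is the object's center in world coordinates and `K` the 3x3
    camera intrinsics; both are assumed inputs for this illustration.
    """
    Tcw = np.linalg.inv(Twc_next)              # world -> camera transform
    pc = Tcw @ np.append(center_world, 1.0)    # homogeneous point in camera frame
    uvw = K @ pc[:3]
    return uvw[:2] / uvw[2]                    # pixel coordinates of the center
```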
  • the target area predicted by the optical flow method in steps S401 to S404 is limited by the following two conditions: first, the brightness is constant. That is, when the same target moves between different frames, its brightness will not change. This is the assumption of the basic optical flow method, which is used to obtain the basic equation of the optical flow method. Second, time continuity or motion is "small motion". That is, changes in time will not cause drastic changes in the target position, and the displacement between adjacent frames should be relatively small, which is also an indispensable assumption for the optical flow method. Therefore, the actual application of the optical flow method to predict the target area has relatively high requirements for ambient light and motion range, and it is prone to tracking loss or drift.
  • For the above reasons, some embodiments of the present application use the motion-model detection algorithm provided in steps S501 to S505 to supplement steps S401 to S404, so that the terminal can make more accurate predictions of the motion trend of the target object and obtain a more reliable target area.
  • the embodiment of the present application provides a target recognition method applied to a cloud server, including:
  • Step S601 receiving first image information from a terminal
  • Step S602 using the target detection model to identify the target object in the first image information to obtain an image recognition result
  • the target recognition model is a neural network-based object recognition model.
  • Some embodiments of the present application use a convolutional neural network (Convolutional Neural Networks, CNN) model as the target recognition model.
  • A CNN is a class of feedforward neural networks that involve convolution operations and have a deep structure, and is one of the representative algorithms of deep learning.
  • the convolutional neural network has the ability to learn representations and can perform translation-invariant classification of input information according to its hierarchical structure, so it is also called "translation-invariant artificial neural network”.
  • Convolutional neural networks can perform object recognition through three types of methods: sliding window (sliding window), selective search (selective search) and YOLO (You Only Look Once).
  • the sliding window appeared first and was used for gesture recognition and other issues, but due to the large amount of calculation, it has been eliminated by the latter two.
  • Selective search corresponds to the region-based convolutional neural network (Region-based CNN): the algorithm first judges, through generic steps, whether a window is likely to contain a target object, and then feeds it into a complex recognizer.
  • the YOLO algorithm defines object recognition as a regression problem for the probability of occurrence of each target in the segmentation frame in the image, and uses the same convolutional neural network to output the probability of each target, the center coordinates and the size of the frame for all segmentation frames.
  • Step S603 sending the image recognition result to the terminal.
  • It should be noted that the task undertaken by the cloud server is to first receive the first image information sent by the terminal, recognize the first image information through the target detection model to obtain the image recognition result, and then send the image recognition result to the terminal as the response to the first image information.
  • This application carries out the target recognition method in a terminal-cloud collaborative manner and transfers the computing burden of the image recognition process from the terminal side to the cloud server side, freeing up the storage space and computing power of the terminal and thereby increasing the speed at which the terminal obtains the target recognition result.
  • step S602 uses the target detection model to identify the target object in the first image information, and obtains the image recognition result, including:
  • Step S701 processing the first image information through a target detection model to obtain a target feature map of the target object
  • Step S702 performing a convolution operation on the target feature map to obtain an image recognition result.
  • It should be noted that traditional CNN networks are difficult to train: as learning progresses, the improvement in the accuracy of a traditional CNN model often saturates, for example when an Hourglass Network or DLANet is used as the backbone feature network model.
  • Although such deep neural networks can also serve as the backbone feature network of the target detection model and perform the function of recognizing the target object, because of their large number of parameters and other reasons, some embodiments of the present application use a target detection model whose backbone feature network is a residual network; using a residual network as the backbone reduces the difficulty of training, makes optimization easier, and helps to improve accuracy as the network deepens.
  • Therefore, the target detection model in some embodiments of the present application includes, but is not limited to, the CenterNet object recognition algorithm with Resnet50 as the backbone feature network.
  • In some embodiments of the present application, processing the first image information through the target detection model to obtain the target feature map of the target object in step S701 includes: processing the first image information through the CenterNet object recognition algorithm with Resnet50 as the backbone feature network, obtaining the last feature layer of the first image information, and then upsampling that last feature layer through deconvolution operations to obtain a high-resolution target feature map.
  • Performing a convolution operation on the target feature map to obtain the image recognition result in step S702 includes: performing heat map convolution, center point convolution and three-dimensional attribute convolution on the target feature map to obtain the image recognition result.
  • Step S801 the cloud server internally loads the CenterNet object recognition algorithm with Resnet50 as the backbone feature network;
  • Step S802 inputting the first image information into Resnet50 to obtain a last feature layer whose shape is (16, 16, 2048); for this feature layer, CenterNet performs upsampling with three successive deconvolutions to obtain a high-resolution target feature map;
  • It should be noted that upsampling according to step S802 yields a higher-resolution output. Each deconvolution doubles the height and width of the feature layer, so after three deconvolution upsampling steps the height and width of the resulting feature layer are 8 times the original; at this point the feature layer is 128x128 with 64 channels, giving a high-resolution target feature map that reflects the target object. A high-resolution target feature map can improve the recognition accuracy of the target object.
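  • A PyTorch-style sketch of the feature extraction described here (a Resnet50 trunk followed by three deconvolutions taking a 16x16x2048 feature layer to a 128x128x64 target feature map); the intermediate deconvolution channel widths (256, 128, 64) and the 512x512 input size are assumptions, and this is not the patent's reference implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class CenterNetBackbone(nn.Module):
    """Resnet50 trunk followed by three deconvolution (upsampling) stages.

    Each ConvTranspose2d doubles height and width, so a 16x16 feature layer with
    2048 channels becomes a 128x128 high-resolution target feature map with 64 channels.
    """

    def __init__(self):
        super().__init__()
        trunk = resnet50(weights=None)
        self.trunk = nn.Sequential(*list(trunk.children())[:-2])  # drop avgpool and fc
        layers, in_ch = [], 2048
        for out_ch in (256, 128, 64):                              # assumed channel schedule
            layers += [
                nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
            in_ch = out_ch
        self.deconv = nn.Sequential(*layers)

    def forward(self, x):                   # x: (B, 3, 512, 512)
        return self.deconv(self.trunk(x))   # -> (B, 64, 128, 128)
```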
  • Step S803 after obtaining the high-resolution target feature map, perform a convolution operation on the target feature map to generate heat map prediction, central point prediction, and three-dimensional attribute prediction, and obtain an image recognition result.
  • Step S803 is explained in the following three respects. First, heat map prediction: a separate heat map is produced for each category of target object, so the number of output channels is num_classes. For each heat map, when a certain coordinate contains the center point of a target, a key point is generated at that target; the output is (128, 128, num_classes), representing whether an object exists at each heat point and the class of that object.
  • Second, center point prediction: here the number of convolution channels is 6 and the final result is (128, 128, 6), representing the offset and rotation of each object center relative to the heat point. It should be understood that, compared with traditional detection algorithms that predict bounding boxes directly, CenterNet uses center point prediction to predict the offset and the width and height of the target center point and thereby obtain the bounding box of the target.
  • Third, three-dimensional attribute prediction: the three-dimensional bounding box of each target is predicted, regressing the properties of the bounding box, namely the two-dimensional offset of the center point, the z-axis position, the 3D dimensions (length, width and height) and the z-axis rotation angle. The number of output channels here is 8, and the final result is (128, 128, 8), representing the prediction of the 3D bounding box of each object.
  • Based on the above, the final image recognition result includes, but is not limited to: whether an object exists, the class of the target object, the offset of the target object, the rotation of the target object, the offset of the center point of the target object, the 3D dimensions of the target object and the orientation of the target object.
  • It should be understood that the shapes (128, 128, num_classes), (128, 128, 6) and (128, 128, 8) mentioned in the above embodiments should not be regarded as limiting the application; those skilled in the art can make various modified or substituted embodiments of a similar kind without departing from the essence of the application.
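  • Continuing the sketch, three small convolutional heads can produce the (128, 128, num_classes), (128, 128, 6) and (128, 128, 8) outputs described above; the internal head width and the sigmoid on the heat map are assumptions:

```python
import torch
import torch.nn as nn


class CenterNetHeads(nn.Module):
    """Heat map, center point and three-dimensional-attribute convolutions over a 128x128x64 map."""

    def __init__(self, num_classes, head_width=64):
        super().__init__()

        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(64, head_width, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(head_width, out_channels, kernel_size=1),
            )

        self.heatmap = head(num_classes)   # is an object present at each heat point, and its class
        self.center = head(6)              # offset / rotation of each object center, 6 channels
        self.attr3d = head(8)              # 3D box: 2D offset, z position, dimensions, z rotation

    def forward(self, feat):               # feat: (B, 64, 128, 128)
        return {
            "heatmap": torch.sigmoid(self.heatmap(feat)),   # (B, num_classes, 128, 128)
            "center": self.center(feat),                    # (B, 6, 128, 128)
            "attr3d": self.attr3d(feat),                    # (B, 8, 128, 128)
        }
```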
  • According to some embodiments provided by the present application, before the cloud server returns the final recognition result to the terminal over the network, the method also includes performing non-maximum suppression (NMS) processing on the image recognition result.
  • NMS is an important link in the process of target recognition. It is used to extract effective feature points on the entire image, and then perform local search to extract the feature points with the highest local score. In the process of target detection, a large number of candidate boxes will be generated at the position of the same target. These candidate boxes may overlap with each other. At this time, we need to use NMS to find the best target bounding box and eliminate redundant bounding boxes. Therefore, in some embodiments of the present application, before the cloud server returns the final recognition result to the terminal through the network, it also includes performing NMS algorithm processing on the image recognition result.
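  • As an illustration, a generic IoU-based non-maximum suppression over axis-aligned 2D boxes, as one might run on the server before returning results; the IoU threshold and the box format are assumptions, and the patent does not specify which NMS variant is used:

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box and drop other boxes that overlap it too much.

    boxes: (N, 4) array of (x_min, y_min, x_max, y_max); scores: (N,) array.
    Returns the indices of the boxes that survive suppression.
    """
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        x0 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y0 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x1 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y1 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou < iou_threshold]      # discard redundant overlapping boxes
    return keep
```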
  • FIG. 9 shows an electronic device 900 provided by an embodiment of the present application.
  • the electronic device 900 includes: a processor 901 , a memory 902 , and a computer program stored on the memory 902 and operable on the processor 901 , and the computer program is used to execute the above object recognition method when running.
  • the processor 901 and the memory 902 may be connected through a bus or in other ways.
  • the memory 902 as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the object recognition method described in the embodiment of the present application.
  • the processor 901 executes the non-transitory software programs and instructions stored in the memory 902 to implement the above object recognition method.
  • The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data involved in executing the above target recognition method.
  • the memory 902 may include a high-speed random access memory 902, and may also include a non-transitory memory 902, such as at least one storage device, a flash memory device or other non-transitory solid-state storage devices.
  • In some implementations, the memory 902 includes memories located remotely from the processor 901, and these remote memories may be connected to the electronic device 900 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the above target recognition method are stored in the memory 902, and when executed by one or more processors 901 they carry out the above target recognition method, for example method steps S101 to S103 in Fig. 1, method steps S201 to S203 in Fig. 2, method steps S301 to S303 in Fig. 3, method steps S401 to S404 in Fig. 4, method steps S501 to S505 in Fig. 5, method steps S601 to S603 in Fig. 6, method steps S701 to S702 in Fig. 7, and method steps S801 to S803 in Fig. 8.
  • the embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to execute the above object recognition method.
  • In one embodiment, the computer-readable storage medium stores computer-executable instructions which are executed by one or more control processors, for example to execute method steps S110 to S140 in Fig. 1, method steps S210 to S280 in Fig. 2, method steps S310 to S340 in Fig. 3, method steps S410 to S440 in Fig. 4, method steps S510 to S530 in Fig. 5, method steps S610 to S640 in Fig. 6, method steps S710 to S730 in Fig. 7, and method steps S810 to S830 in Fig. 8.
  • The embodiments of the present application have at least the following beneficial effects: the target recognition method in the embodiments of the present application can be applied to a terminal or a cloud server.
  • the task undertaken by the terminal is to collect the first image information and transmit the first image information to the cloud server, and receive the recognition result fed back by the cloud server.
  • The task undertaken by the cloud server is to first receive the first image information sent by the terminal, recognize the first image information through the target detection model to obtain the image recognition result, and then send the image recognition result to the terminal as the response to the first image information, whereupon the execution of the target recognition method is complete.
  • the work of collecting the first image information is completed by the terminal, and the process of image recognition processing of the first image information is completed by the cloud server.
  • This application carries out the target recognition method in a terminal-cloud collaborative manner and transfers the computing burden of the image recognition process from the terminal side to the cloud server side, freeing up the storage space and computing power of the terminal and thereby increasing the speed at which the terminal obtains the target recognition result.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • In addition, it is well known to those of ordinary skill in the art that communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a target recognition method, an electronic device and a storage medium. In a first aspect, the present application provides a target recognition method applied to a terminal, including: collecting first image information (S101); sending the first image information to a cloud server, so that the cloud server recognizes the first image information through a target detection model to generate an image recognition result (S102); and acquiring the image recognition result according to the cloud server's response to the first image information (S103). In a second aspect, the target recognition method is applied to a cloud server and includes: receiving first image information from a terminal (S601); recognizing the first image information through a target detection model to obtain an image recognition result (S602); and sending the image recognition result to the terminal (S603).

Description

目标识别方法、电子设备及存储介质
相关申请的交叉引用
本申请基于申请号为202111512596.X、申请日为2021年12月8日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及计算机视觉技术领域,特别涉及一种目标识别方法、电子设备及存储介质。
背景技术
随着科技的迅速发展,目标识别在人工智能、增强现实等相关领域具有重要意义,深度学习在目标识别任务中表现出优越的性能。然而,由于神经网络模型三维属性比较庞大,对存储空间和计算能力要求很高,因此神经网络模型难以在终端上正常运行。
相关技术中,为了能够将目标识别功能带入终端,将目标识别算法进行了轻量化操作,例如,将原始的单射多目标检测器(Single Shot MultiBox Detector,SSD)网络结构的前置网络,即视觉几何群网络(Visual Geometry Group Network,VGG-16)替换成压缩网(SqueezeNet)架构的轻量化模型,并结合一系列附加特征层,将模型三维属性减小到约几十兆左右,使其能够在运动设备或嵌入式设备上直接运行。然而从测试结果来看,尽管终端计算平台的发展非常迅速,但终端的计算能力仍然十分有限,难以达到正常应用的要求。
发明内容
以下是对本文详细描述的主题的概述。本概述并非是为了限制权利要求的保护范围。
第一方面,本申请实施例提供了一种目标识别方法,应用于终端,包括:采集第一图像信息;将所述第一图像信息发送至云端服务器,以使所述云端服务器通过目标检测模型对所述第一图像信息中的目标物体进行识别而生成图像识别结果;根据所述云端服务器响应于所述第一图像信息,获取所述图像识别结果。
第二方面,本申请实施例提供了一种目标识别方法,应用于云端服务器,包括:接收来自于终端的第一图像信息;通过目标检测模型对所述第一图像信息中的目标物体进行识别,得出图像识别结果;将所述图像识别结果发送至所述终端。
第三方面,本申请实施例提供了一种电子设备,包括:存储器、处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现如本申请第一方面实施例中任意一项所述的目标识别方法,或本申请第二方面实施例中任意一项所述的目标识别方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述存储介质存储有程序,所述程序被处理器执行实现如本申请第一方面实施例中任意一项所述的目标识别方法,或本申请第二方面实施例中任意一项所述的目标识别方法。
本申请的其他特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本申请而了解。本申请的目的和其他优点可通过在说明书、权利要求书以及附图中所特别指出的结构来实现和获得。
附图说明
附图用来提供对本申请技术方案的进一步理解,并且构成说明书的一部分,与本申请的 实施例一起用于解释本申请的技术方案,并不构成对本申请技术方案的限制。
图1是本申请一个实施例提供的目标识别方法的流程示意图;
图2是本申请另一个实施例提供的目标识别方法的流程示意图;
图3是本申请另一个实施例提供的目标识别方法的流程示意图;
图4是本申请另一个实施例提供的目标识别方法的流程示意图;
图5是本申请另一个实施例提供的目标识别方法的流程示意图;
图6是本申请另一个实施例提供的目标识别方法的流程示意图;
图7是本申请另一个实施例提供的目标识别方法的流程示意图;
图8是本申请另一个实施例提供的目标识别方法的流程示意图;
图9是本申请一个实施例提供的电子设备的示意图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行详细说明。应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。
在本申请的描述中,需要理解的是,涉及到方位描述,例如上、下、前、后、左、右等指示的方位或位置关系为基于附图所示的方位或位置关系,仅是为了便于描述本申请和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本申请实施例的限制。
在本申请的描述中,若干的含义是一个或者多个,多个的含义是两个以上,大于、小于、超过等理解为不包括本数,以上、以下、以内等理解为包括本数。如果有描述到第一、第二只是用于区分技术特征为目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量或者隐含指明所指示的技术特征的先后关系。
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示意性实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本申请的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。
本申请的描述中,除非另有明确的限定,设置、安装、连接等词语应做广义理解,所属技术领域技术人员可以结合技术方案的具体内容合理确定上述词语在本申请中的具体含义。
随着科技的迅速发展,目标识别在人工智能、增强现实等相关领域具有重要意义,深度学习在目标识别任务中表现出优越的性能。然而,由于神经网络模型三维属性比较庞大,对存储空间和计算能力要求很高,因此神经网络模型难以在终端上正常运行。
相关技术中,为了能够将目标识别功能带入终端,将目标识别算法进行了轻量化操作,例如,将原始的单射多目标检测器(Single Shot MultiBox Detector,SSD)网络结构的前置网络,即视觉几何群网络(Visual Geometry Group Network,VGG-16)替换成压缩网(SqueezeNet)架构的轻量化模型,并结合一系列附加特征层,将模型三维属性减小到约几十兆左右,使其能够在运动设备或嵌入式设备上直接运行。然而从测试结果来看,轻量化模型进行目标识别的精确度较低,并且受手机系统平台限制。尽管运动手机的计算平台发展非常迅速,但是计算能力仍然十分有限,难以达到实时性和准确性的统一。
为了解决以上问题,本申请实施例提供了一种目标识别方法、电子设备及存储介质,能够使得终端在执行目标识别功能时免于承受繁重的算力负担。
下面结合附图进行详细说明。
参照图1,第一方面,本申请实施例提供了一种目标识别方法,应用于终端,包括:
步骤S101,采集第一图像信息;
需要说明的是,第一图像信息指的是以图像形式表现的信息,用于为目标识别提供材料。应理解,第一图像信息包括但不限于:单张图片、单帧画面、一组图片或者一段视频。另外,采集第一图像信息可以通过终端的摄像头设备进行采集,也可以通过其他方式进行采集,例如:以网络下载的方式获取第一图像信息、以蓝牙传输的方式获取第一图像信息等。采集第一图像信息的方式多种多样,在此不过多赘述。
步骤S102,将第一图像信息发送至云端服务器,以使云端服务器通过目标检测模型对第一图像信息中的目标物体进行识别而生成图像识别结果;
步骤S103,根据云端服务器响应于第一图像信息,获取图像识别结果。
需要说明的是,终端与云端服务器之间可通过多种方式进行数据传输。主要包括有线传输和无线传输。其中有线传输即利用金属导线、光纤等有形媒质传送信息的方式,光或电信号可以承载字节数组,图像等信息,有线传输主要以电线或者光缆实现通讯传导。无线传输是指多个节点间不经由导体或缆线传播进行的远距离传输通讯方式,常见的远距离无线传输方式有:主要有GPRS/CDMA、数传电台、扩频微波、无线网桥及卫星通信、短波通信技术等;应用较为广泛及具有较好发展前景的短距离无线通信标准有:Zig-Bee、蓝牙(Bluetooth)、无线宽带(Wi-Fi)、超宽带(UWB)和近场通信(NFC)等。可以理解的是,终端获取到的图像识别结果包括但不限于:第一图像信息中目标物体的类别、尺寸、朝向、空间位置等信息。
在本申请提供的一些实施例中,终端所承担的任务是采集第一图像信息并将第一图像信息传输至云端服务器,以及接收云端服务器反馈回来的识别结果。以端云协同的方式,将图像识别的处理过程转移到了云端服务器中进行,以解放终端设备的存储空间以及计算能力。
应理解,本申请各实施例中的终端包括但不限于手机、台式电脑、平板电脑、个人数字助理、销售终端、车载电脑、笔记本电脑、掌上电脑、摄像机、导航装置、可穿戴设备、智能手环等终端。
参照图1、图2,根据本申请提供的一些实施例,应用于终端的目标识别方法中,执行步骤S102之前还可以包括下述步骤:
步骤S201,对第一图像信息进行编码,将第一图像信息转化为字节数组;
需要说明的是,为了便于第一图像信息的传输,需要将第一图像信息分解成可消化的块或者信息包。而步骤S201中则将第一图像信息转换为字节数组的形式,后续步骤中仅需将字节数组形式的第一图像信息进行传输,再于云端服务器中将字节数组形式的第一图像信息进行解码还原,即可进行图像识别。
步骤S202,建立数据传输连接,并将字节数组的存储容量发送至云端服务器;
步骤S203,基于数据传输协议将图像信息包发送至云端服务器。
需要说明的是,在本申请一些实施例中,可以采取的传输协议包括但不限于用户数据包协议(User Datagram Protocol,UDP)、传输控制协议(Transmission Control Protocol,TCP)。UDP协议是面向无连接、不可靠的传输方式,虽然常常被应用于传输视频流,但由于 本申请中后续的检测结果需要以较可靠的传输方式传回,因此本申请中一些实施例采用TCP协议作为终端与云端服务器之间进行数据交互的数据传输协议。
在本申请提供的一些实施例中,终端在获取图像识别结果后,可以对图像识别结果进行各种类型的处理,包括但不限于:输出图像识别结果,将图像识别结果显示于终端屏幕,根据图像识别结果追踪目标物体等。例如,终端对图像识别结果分析得到图像识别结果的标签、包围盒、叠加效果,针对图像识别结果进行渲染并于终端屏幕显示。
根据本申请提供的一些实施例,终端在获取图像识别结果后还可以根据图像识别结果追踪目标物体,并采集目标物体所在目标区域的第二图像信息,进而将第二图像信息发送至云端服务器,以获取更新状态的图像识别结果。需要说明的是,在常见的深度学习应用场景中,如工业流水线中的物体检测、人脸识别等,通常摄像机是在一个固定的位置,目标在画面中出现的位置相对来说比较确定,所以只需要常规的预处理操作即可将图像输出给识别算法。但是在增强现实(Augmented Reality,AR)场景中,如AR说明书、AR远程指导等等,相机由用户控制,且场景也在变化,画面不稳定,待识别物体在画面中出现的位置和尺寸都不受限,图像中会携带除目标物体以外的冗余信息,增加计算的复杂度。因此,通过先前获得的图像识别结果对目标物体进行追踪,采集目标物体所在目标区域的第二图像信息,进而将第二图像信息发送至云端服务器,能够使得云端服务器无需处理目标物体以外的冗余信息,从而提高对目标物体进行识别的效率。
参照图1、图3,根据本申请提供的一些实施例,步骤S103根据云端服务器响应于第一图像信息,获取图像识别结果之后,还可以包括:
步骤S301,根据图像识别结果锁定第一图像信息中的目标物体,并采集第三图像信息;
需要说明的是,第三图像信息用于收集目标物体的运动数据,其中第三图像信息包括但不限于:一组图片、一段视频。应理解,若存在单张图片或者单帧画面足以反映目标物体在一段时间的运动状态并能够用于收集目标物体的运动数据,那么这张图片仍然应当被认定为第三图像信息。另外,采集第三图像信息可以通过终端的摄像头设备进行采集,也可以通过其他方式进行采集,例如:以网络下载的方式获取第三图像信息、以蓝牙传输的方式获取第三图像信息等。采集第三图像信息的方式多种多样,在此不过多赘述。
步骤S302,根据第三图像信息获取目标物体在第三图像信息中的运动数据,并根据运动数据对目标物体进行位置预测,得出目标区域,其中目标区域即为目标物体在终端采集的下一组图像信息中即将出现的区域。
步骤S303,对目标区域采集第二图像信息,并将第二图像信息发送至云端服务器,以获得更新状态的图像识别结果。
根据本申请提供的一些实施例,步骤S301与步骤S302预测了目标物体即将出现的目标区域后,就无需把目标物体不可能出现的画面纳入第二图像信息,因此第二图像信息是针对目标区域所采集的图像信息。其中第二图像信息包括但不限于:单张图片、单帧画面、一组图片或者一段视频。另外,采集第二图像信息可以通过终端的摄像头设备进行采集,也可以通过其他方式进行采集,例如:以网络下载的方式获取第二图像信息、以蓝牙传输的方式获取第二图像信息等。采集第二图像信息的方式多种多样,在此不过多赘述。
需要说明的是,云端服务器需要对囊括于第一图像信息的所有画面进行处理并得出图像识别结果。根据本申请提供的一些实施例,当终端获取到图像识别结果之后,即可锁定第一 图像信息中的目标物体。在锁定目标物体之后,采取第三图像信息,并获取第三图像信息中目标物体的运动数据。根据目标物体的运动数据即可对目标物体的运动趋势作出判断,从而对目标物体进行位置预测,得出目标物体在接下来的画面中可能出现的区域,而上述目标物体在接下来的画面中可能出现的区域即为目标区域。针对目标区域采集第二图像信息,并将第二图像信息发送至云端服务器,即可节省云端服务器在图像识别处理过程中的算力,加速云端服务器的图像识别处理过程,以使得终端可以更加快速地从云端服务器获取更新状态的图像识别结果。应理解,图像识别结果反映的是对第一图像信息进行识别而获取的结果,更新状态的图像识别结果反映的是对第三图像信息进行识别而获取的结果。
参考图3、图4,根据本申请提供的一些实施例,运动数据包括第三图像信息中目标物体的位移梯度与时间梯度。步骤S302根据第三图像信息获取目标物体在第三图像信息中的运动数据,并根据运动数据对目标物体进行位置预测,得出目标区域,包括:
步骤S401,选取目标物体的采样点(x,y);
步骤S402,记录第三图像信息中采样点的位移梯度与时间梯度;
步骤S403,根据采样点的位移梯度与时间梯度求出采样点的动作向量;
需要说明的是,将采样点的位移梯度与时间梯度代入方程:
Figure PCTCN2022136329-appb-000001
即可求出采样点的动作向量
Figure PCTCN2022136329-appb-000002
其中,f x和f y为采样点的位移梯度,f t为采样点的时间梯度;
步骤S404,根据动作向量对目标物体进行位置预测,得出目标区域。
需要说明的是,光流(optical flow)是空间运动物体在观察成像平面上的像素运动的瞬时速度。光流法是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到上一帧跟当前帧之间存在的对应关系,从而计算出相邻帧之间物体的运动信息的一种方法。上述实施例中,步骤S401至步骤S404则通过光流法来检测连续的图像序列中目标物体微小的动作变化,从而推测出目标物体可能的运动位置,进而求出目标物体在终端采集的下一组图像信息中所出现的目标区域,以目标区域限定后续第二图像信息的采集范围。
参考图3、图5,根据本申请提供的一些实施例,运动数据包括目标物体的运动方向与运动速度。步骤S302根据第三图像信息获取目标物体在第三图像信息中的运动数据,并根据运动数据对目标物体进行位置预测,得出目标区域,还包括:
步骤S501,从第三图像信息中获取序号分别为N-1、N的两帧连续画面;
步骤S502,从两帧连续画面中分别获取第N-1帧画面的位姿矩阵、第N帧画面的位姿矩阵以及目标物体的中心点位置,其中Twc (N-1)为第N-1帧画面的位姿矩阵、Twc (N)为第N帧画面的位姿矩阵;
步骤S503,根据第N-1帧画面的位姿矩阵、第N帧画面的位姿矩阵与中心点位置,求出目标物体的运动方向以及运动速度T velocity,运动速度T velocity=Twc (N)*(Twc (N-1)) -1
步骤S504,根据运动方向与运动速度T velocity,求出第N+1帧画面中目标物体的位姿矩阵Twc (N+1),其中Twc (N+1)=T velocity*Twc (N),第N+1帧画面为第N帧画面的下一帧连续画面;
步骤S505,根据Twc (N+1)预测得出目标物体在第N+1帧画面中即将出现的目标区域。
需要说明的是,根据Twc (N+1)=T velocity*Twc (N)可以直接得出第N+1帧画面中目标物体中心点的位置以及旋转情况,再根据之前图像识别结果所检测出的目标物体尺寸,得到目标物体的最大半径,以中心点位置发散划定目标物体在空间中的最大三维区域,并将该三维 区域的边界经过矩阵投影到二维平面,得到在二维平面中上下左右的边界值,再根据边界值向外扩张一些像素值,最终得出目标区域。
参照图4、图5,由于步骤S401至步骤S404中的光流法预测目标区域受如下两个条件的限制:其一,亮度恒定不变。即同一目标在不同帧之间运动时,其亮度不会发生改变。这是基本光流法的假定,用于得到光流法基本方程。其二,时间连续或运动是“小运动”。即时间的变化不会引起目标位置的剧烈变化,相邻帧之间位移要比较小,同样也是光流法不可或缺的假定。因此,实际应用光流法预测目标区域时对环境光照以及运动幅度的要求比较高,易出现跟踪丢失或者漂移的情况。基于上述原因,本申请的一些实施例采用了步骤S501至步骤S505提供的运动模型检测算法对步骤S401至步骤S404进行补充,使得终端对目标物体的运动趋势产生更为准确的预测,以获取更为可靠的目标区域。
参照图6,第二方面,本申请实施例提供了一种目标识别方法,应用于云端服务器,包括:
步骤S601,接收来自于终端的第一图像信息;
步骤S602,通过目标检测模型对第一图像信息中的目标物体进行识别,得出图像识别结果;
需要说明的是,目标识别模型是一种基于神经网络的物体识别模型。本申请一些实施例选用卷积神经网络(Convolutional Neural Networks,CNN)模型作为目标识别模型。CNN是一类包含卷积计算且具有深度结构的前馈神经网络(Feedforward Neural Networks),是深度学习的代表算法之一。卷积神经网络具有表征学习能力,能够按其阶层结构对输入信息进行平移不变分类,因此也被称为“平移不变人工神经网络”。卷积神经网络可以通过三类方法进行物体识别:滑动窗口(sliding window)、选择性搜索(selective search)和YOLO(You Only Look Once)。滑动窗口出现最早,并被用于手势识别等问题,但由于计算量大,已经被后两者淘汰。选择性搜索对应区域卷积神经网络(Region-based CNN),该算法首先通过一般性步骤判断一个窗口是否可能有目标物体,并将其输入复杂的识别器中。YOLO算法将物体识别定义为对图像中分割框内各目标出现概率的回归问题,并对所有分割框使用同一个卷积神经网络输出各个目标的概率,中心坐标和框的尺寸。
步骤S603,将图像识别结果发送至终端。
需要说明的是,云端服务器所承担的任务是先接收终端所发出的第一图像信息,通过目标检测模型对第一图像信息进行识别并得出图像识别结果,之后再将图像识别结果作为对第一图像信息的响应,发送给终端。本申请以端云协同的方式执行目标识别方法,将图像识别处理过程中的算力负担从终端侧转移到云端服务器侧,以解放终端侧的存储空间和计算能力,从而提升终端侧获取目标识别结果的速度。
参照图6、图7,根据本申请提供的一些实施例,步骤S602通过目标检测模型对第一图像信息中的目标物体进行识别,得出图像识别结果,包括:
步骤S701,通过目标检测模型对第一图像信息进行处理,获取目标物体的目标特征图;
步骤S702,对目标特征图进行卷积运算,得出图像识别结果。
需要说明的是,传统CNN网络都很难训练,随着学习进度的加深,传统CNN模型的正确率的提升常常会出现饱和,例如以Hourglass Network或DLANet等为主干特征网络模型。虽然这部分深度神经网络也可以作为目标检测模型的主干特征网络而执行识别目标物体的功能, 然而基于参数量太大等原因,本申请的一些实施例选用以残差网络为主干特征网络的目标检测模型。采用残差网络作为目标检测模型的主干特征网络可以减轻训练的难度,更容易优化,随着网络加深也有助于提高正确率。因此,本申请一些实施例中的目标检测模型包括但不限于以Resnet50为主干特征网络的CenterNet物体识别算法。
参照图7、图8,本申请的一些实施例中,步骤S701中通过目标检测模型对第一图像信息进行处理,获取目标物体的目标特征图,包括:通过以Resnet50为主干特征网络的CenterNet物体识别算法对第一图像信息进行处理,获取第一图像信息的最后一个特征层,再通过反卷积运算对最后一个特征层进行上采样,即可获取高分辨率的目标特征图。而步骤S702中对目标特征图进行卷积运算,得出图像识别结果,包括:对目标特征图进行热力图卷积、中心点卷积以及三维属性卷积,并得出图像识别结果。
本申请其中一个实施例包括以下步骤:
步骤S801,云端服务器内部加载以Resnet50为主干特征网络的CenterNet物体识别算法;
步骤S802,将第一图像信息输入Resnet50,得到最后一个特征层的shape为(16,16,2048),对于该特征层,CenterNet利用三次反卷积进行上采样,获取高分辨率的目标特征图;
需要说明的是,按照步骤S802进行上采样能够获取更高的分辨率输出。每一次反卷积,特征层的高和宽会变为原来的两倍,因此在进行三次反卷积上采样后,获得的特征层的高和宽变为原来的8倍,此时特征层的高和宽为128x128,通道数为64,即可获取反映目标物体的高分辨率目标特征图,高分辨率的目标特征图能够提高目标物体的识别准确率。
步骤S803,在获取高分辨率的目标特征图之后,对目标特征图进行卷积运算以生成热力图预测、中心点预测以及三维属性预测,并得出图像识别结果。
针对步骤S803作出以下三个方面的说明。其一,热力图预测,针对每个目标物体的类别会单独产生一个热力图,因此输出的通道数为num_classes,对于每张热力图而言,当某个坐标处包含目标的中心点时,则会在该目标处产生一个关键点,输出结果为(128,128,num_classes),代表每一个热力点是否有物体存在,以及物体的种类。其二,中心点预测,此时卷积的通道数为6,最终结果为(128,128,6),代表每一个物体中心距离热力点偏移以及旋转的情况。应理解,与传统预测包围框的目标检测算法相比,CenterNet通过中心点预测来预测目标中心点的偏移量与宽高来获得目标的包围框。其三,三维属性预测,对每个目标的三维包围框进行预测,回归包围框的属性:中心点二维偏移量、z轴位置预测、3D维度尺寸(即长宽高)以及z轴旋转角度。此时输出的通道数为8,最终结果为(128,128,8),代表每一个物体3D包围框的预测情况。基于上述说明,最终得出的图像识别结果包括但不限于:是否有物体存在、目标物体的种类、目标物体的偏移情况、目标物体的旋转情况、目标物体中心点的偏移量、目标物体的3D维度尺寸以及目标物体的朝向。应理解,上述实施例中提到的(128,128,num_classes)、(128,128,6)、(128,128,8),不应视为对本申请的限制,熟悉本领域的技术人员在不违背本申请本质的共享条件下还可作出种种变形或替换的同类近似实施例。
根据本申请提供的一些实施例,云端服务器将最终的识别结果通过网络返回给终端之前,还包括对图像识别结果进行非极大值抑制(Non-Maximum Suppression,NMS)算法处理。NMS是目标识别过程中的重要的一个环节,用于在整个图像上提取有效的特征点,然后进行局部 搜索,取出局部得分最高的特征点。目标检测的过程中在同一目标的位置上会产生大量的候选框,这些候选框相互之间可能会有重叠,此时我们需要利用NMS找到最佳的目标边界框,消除冗余的边界框。因此,在本申请一些的实施例中,云端服务器将最终的识别结果通过网络返回给终端之前,还包括对图像识别结果进行NMS算法处理。
图9示出了本申请实施例提供的电子设备900。电子设备900包括:处理器901、存储器902及存储在存储器902上并可在处理器901上运行的计算机程序,计算机程序运行时用于执行上述的目标识别方法。
处理器901和存储器902可以通过总线或者其他方式连接。
存储器902作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序,如本申请实施例描述的目标识别方法。处理器901通过运行存储在存储器902中的非暂态软件程序以及指令,从而实现上述的目标识别方法。
存储器902可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储执行上述的目标识别方法。此外,存储器902可以包括高速随机存取存储器902,还可以包括非暂态存储器902,例如至少一个储存设备存储器件、闪存器件或其他非暂态固态存储器件。在一些实施方式中,存储器902包括相对于处理器901远程设置的存储器902,这些远程存储器902可以通过网络连接至该电子设备900。上述网络的实例包括但不限于互联网、企业内部网、局域网、运动通信网及其组合。
实现上述的目标识别方法所需的非暂态软件程序以及指令存储在存储器902中,当被一个或者多个处理器901执行时,执行上述的目标识别方法,例如,执行图1中的方法步骤S101至步骤S103、图2中的方法步骤S201至步骤S203、图3中的方法步骤S301至步骤S303、图4中的方法步骤S401至步骤S404、图5中的方法步骤S501至步骤S505、图6中的方法步骤S601至步骤S603、图7中的方法步骤S701至步骤S702、图8中的方法步骤S801至步骤S803。
本申请实施例还提供了计算机可读存储介质,存储有计算机可执行指令,计算机可执行指令用于执行上述的目标识别方法。
在一实施例中,该计算机可读存储介质存储有计算机可执行指令,该计算机可执行指令被一个或多个控制处理器执行,例如,执行图1中的方法步骤S110至步骤S140、图2中的方法步骤S210至步骤S280、图3中的方法步骤S310至步骤S340、图4中的方法步骤S410至步骤S440、图5中的方法步骤S510至步骤S530、图6中的方法步骤S610至步骤S640、图7中的方法步骤S710至步骤S730、图8中的方法步骤S810至步骤S830。
本申请实施例至少包括以下有益效果:本申请实施例中的目标识别方法可应用在终端或者云端服务器上。其中,终端所承担的任务是采集第一图像信息并将第一图像信息传输至云端服务器,以及接收云端服务器反馈回来的识别结果。而云端服务器所承担的任务是先接收终端所发出的第一图像信息,通过目标检测模型对第一图像信息进行识别并得出图像识别结果,之后再将图像识别结果作为对第一图像信息的响应,发送给终端,目标识别方法执行完毕。整个目标识别方法的执行步骤中,采集第一图像信息的工作由终端完成,对第一图像信息进行图像识别处理的过程由云端服务器完成。本申请以端云协同的方式执行目标识别方法,将图像识别处理过程中的算力负担从终端侧转移到云端服务器侧,以解放终端侧的存储空间和计算能力,从而提升终端侧获取目标识别结果的速度。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统可以被实施为软件、固件、硬件及其适当的组合。某些物理组件或所有物理组件可以被实施为由处理器,如中央处理器、数字信号处理器或微处理器执行的软件,或者被实施为硬件,或者被实施为集成电路,如专用集成电路。这样的软件可以分布在计算机可读介质上,计算机可读介质可以包括计算机存储介质(或非暂时性介质)和通信介质(或暂时性介质)。如本领域普通技术人员公知的,术语计算机存储介质包括在用于存储信息(诸如计算机可读指令、数据结构、程序模块或其他数据)的任何方法或技术中实施的易失性和非易失性、可移除和不可移除介质。计算机存储介质包括但不限于RAM、ROM、EEPROM、闪存或其他存储器技术、CD-ROM、数字多功能盘(DVD)或其他光盘存储、磁盒、磁带、储存设备存储或其他磁存储装置、或者可以用于存储期望的信息并且可以被计算机访问的任何其他的介质。此外,本领域普通技术人员公知的是,通信介质通常包括计算机可读指令、数据结构、程序模块或者诸如载波或其他传输机制之类的调制数据信号中的其他数据,并且可包括任何信息递送介质。
还应了解,本申请实施例提供的各种实施方式可以任意进行组合,以实现不同的技术效果。
以上是对本申请的一些实施进行说明,但本申请并不局限于上述实施方式,熟悉本领域的技术人员在不违背本申请本质的共享条件下还可作出种种等同的变形或替换,这些等同的变形或替换均包括在本申请权利要求所限定的范围内。

Claims (15)

  1. A target recognition method, applied to a terminal, comprising:
    collecting first image information;
    sending the first image information to a cloud server, so that the cloud server recognizes a target object in the first image information to generate an image recognition result;
    receiving the image recognition result.
  2. The method according to claim 1, further comprising:
    tracking the target object according to the image recognition result, and collecting second image information of a target area where the target object is located;
    sending the second image information to the cloud server to acquire an updated image recognition result.
  3. The method according to claim 2, wherein the tracking the target object according to the image recognition result and collecting second image information comprises:
    locking the target object in the first image information according to the image recognition result, and collecting third image information;
    acquiring motion data of the target object in the third image information according to the third image information, and predicting the position of the target object according to the motion data to obtain the target area, the target area being the area where the target object is located in the next group of image information collected by the terminal;
    collecting the second image information for the target area.
  4. The method according to claim 3, wherein the acquiring motion data of the target object in the third image information according to the third image information comprises:
    selecting a sampling point (x, y) of the target object;
    recording motion data of the sampling point in the third image information.
  5. The method according to claim 4, wherein the motion data comprises a displacement gradient and a time gradient of the target object, and the predicting the position of the target object according to the motion data to obtain the target area comprises:
    substituting the displacement gradient and the time gradient into the equation:
    Figure PCTCN2022136329-appb-100001
    to obtain the motion vector of the sampling point:
    Figure PCTCN2022136329-appb-100002
    wherein f_x and f_y are the displacement gradients of the sampling point, and f_t is the time gradient of the sampling point;
    predicting the position of the target object according to the motion vector to obtain the target area.
  6. The method according to any one of claims 3 to 5, wherein the motion data comprises a motion direction and a motion speed of the target object, and the acquiring motion data of the target object in the third image information according to the third image information further comprises:
    acquiring, from the third image information, two consecutive frames with sequence numbers N-1 and N;
    acquiring, from the two consecutive frames, the pose matrices Twc(N-1) and Twc(N) and the center point position of the target object;
    calculating the motion direction and the motion speed T_velocity of the target object according to the center point position and the pose matrices Twc(N-1) and Twc(N), the motion speed T_velocity = Twc(N) * (Twc(N-1))^-1.
  7. The method according to claim 6, wherein the predicting the position of the target object according to the motion data to obtain the target area comprises:
    calculating, according to the motion direction and the motion speed T_velocity, the pose matrix Twc(N+1) of the target object in the (N+1)-th frame, wherein Twc(N+1) = T_velocity * Twc(N), and the (N+1)-th frame is the next consecutive frame after the N-th frame;
    predicting, according to Twc(N+1), the target area where the target object will appear in the (N+1)-th frame.
  8. A target recognition method, applied to a cloud server, comprising:
    receiving first image information from a terminal;
    recognizing a target object in the first image information to obtain an image recognition result;
    sending the image recognition result to the terminal.
  9. The method according to claim 8, wherein the recognizing a target object in the first image information through a target detection model to obtain an image recognition result comprises:
    processing the first image information through the target detection model to acquire a target feature map of the target object;
    performing a convolution operation on the target feature map to obtain the image recognition result.
  10. The method according to claim 9, wherein the target detection model comprises an object recognition algorithm with a residual network as a backbone feature network, and the processing the first image information through the target detection model to acquire the target feature map of the target object comprises:
    processing the first image information through the object recognition algorithm with the residual network as the backbone feature network, to acquire the last feature layer of the first image information;
    upsampling the last feature layer through a deconvolution operation to acquire the high-resolution target feature map.
  11. The method according to claim 9, wherein the performing a convolution operation on the target feature map to obtain the image recognition result comprises:
    performing convolution operations on the target feature map to generate a heat map prediction, a center point prediction and a three-dimensional attribute prediction, and obtaining the image recognition result.
  12. The method according to any one of claims 8 to 11, wherein the image recognition result comprises: the class of the target object, the size of the target object, the spatial position of the target object, and the orientation of the target object.
  13. The method according to any one of claims 8 to 11, wherein before the image recognition result is sent to the terminal, the method further comprises:
    performing non-maximum suppression processing on the image recognition result.
  14. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the target recognition method according to any one of claims 1 to 7 or the target recognition method according to any one of claims 8 to 13 is implemented.
  15. A computer-readable storage medium storing a program which, when executed by a processor, implements the target recognition method according to any one of claims 1 to 7 or the target recognition method according to any one of claims 8 to 13.
PCT/CN2022/136329 2021-12-08 2022-12-02 目标识别方法、电子设备及存储介质 WO2023103915A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111512596.X 2021-12-08
CN202111512596.XA CN116310737A (zh) 2021-12-08 2021-12-08 目标识别方法、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023103915A1 true WO2023103915A1 (zh) 2023-06-15

Family

ID=86729644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136329 WO2023103915A1 (zh) 2021-12-08 2022-12-02 目标识别方法、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN116310737A (zh)
WO (1) WO2023103915A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875431A (zh) * 2017-02-10 2017-06-20 深圳前海大造科技有限公司 具有移动预测的图像追踪方法及扩增实境实现方法
CN106920310A (zh) * 2017-03-06 2017-07-04 珠海习悦信息技术有限公司 门禁控制方法、装置及系统
CN109862263A (zh) * 2019-01-25 2019-06-07 桂林长海发展有限责任公司 一种基于图像多维特征识别的移动目标自动跟踪方法
WO2020259481A1 (zh) * 2019-06-27 2020-12-30 Oppo广东移动通信有限公司 定位方法及装置、电子设备、可读存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875431A (zh) * 2017-02-10 2017-06-20 深圳前海大造科技有限公司 具有移动预测的图像追踪方法及扩增实境实现方法
CN106920310A (zh) * 2017-03-06 2017-07-04 珠海习悦信息技术有限公司 门禁控制方法、装置及系统
CN109862263A (zh) * 2019-01-25 2019-06-07 桂林长海发展有限责任公司 一种基于图像多维特征识别的移动目标自动跟踪方法
WO2020259481A1 (zh) * 2019-06-27 2020-12-30 Oppo广东移动通信有限公司 定位方法及装置、电子设备、可读存储介质

Also Published As

Publication number Publication date
CN116310737A (zh) 2023-06-23

Similar Documents

Publication Publication Date Title
US11145083B2 (en) Image-based localization
US20200042776A1 (en) Method and apparatus for recognizing body movement
CN113811920A (zh) 分布式姿势估计
CN111649724B (zh) 基于移动边缘计算的视觉定位方法和装置
Gargees et al. Incident-supporting visual cloud computing utilizing software-defined networking
US10977525B2 (en) Indoor localization using real-time context fusion of visual information from static and dynamic cameras
WO2020238284A1 (zh) 停车位的检测方法、装置与电子设备
Li et al. Camera localization for augmented reality and indoor positioning: a vision-based 3D feature database approach
US11915439B2 (en) Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image
EP4050305A1 (en) Visual positioning method and device
US11783588B2 (en) Method for acquiring traffic state, relevant apparatus, roadside device and cloud control platform
WO2021249114A1 (zh) 目标跟踪方法和目标跟踪装置
WO2022247414A1 (zh) 空间几何信息估计模型的生成方法和装置
WO2023083256A1 (zh) 位姿显示方法、装置及系统、服务器以及存储介质
CN111292420A (zh) 用于构建地图的方法和装置
CN114419519B (zh) 目标对象检测方法、装置、电子设备和存储介质
CN116194951A (zh) 用于基于立体视觉的3d对象检测与分割的方法和装置
CN114758068A (zh) 空间几何信息估计模型的训练方法及装置
Zhu et al. PairCon-SLAM: Distributed, online, and real-time RGBD-SLAM in large scenarios
WO2021088497A1 (zh) 虚拟物体显示方法、全局地图更新方法以及设备
WO2023103915A1 (zh) 目标识别方法、电子设备及存储介质
US20220375134A1 (en) Method, device and system of point cloud compression for intelligent cooperative perception system
CN114596475A (zh) 单应性流估计模型的训练方法、单应性流估计方法和装置
CN115937383B (zh) 渲染图像的方法、装置、电子设备及存储介质
CN116105720B (zh) 低照度场景机器人主动视觉slam方法、装置和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22903337

Country of ref document: EP

Kind code of ref document: A1