WO2021174513A1 - Image processing system and method, and autonomous driving vehicle including the system - Google Patents

Image processing system and method, and autonomous driving vehicle including the system

Info

Publication number
WO2021174513A1
WO2021174513A1 (PCT/CN2020/078093, CN2020078093W)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
image
convolutional neural
channel
image processing
Prior art date
Application number
PCT/CN2020/078093
Other languages
English (en)
French (fr)
Inventor
晋周南
王旭东
曹结松
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202080004424.9A (CN112805723B)
Priority to PCT/CN2020/078093
Publication of WO2021174513A1

Classifications

    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F 18/23213: Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/2431: Classification techniques relating to the number of classes; multiple classes
    • G06N 20/00: Machine learning
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/40: Extraction of image or video features
    • G06V 20/63: Scene text, e.g. street names

Definitions

  • This application relates to the field of artificial intelligence, and in particular, to an image processing system and method, and an autonomous vehicle including the system.
  • With the rapid development of 5G communication and Internet of Vehicles technology, autonomous driving technology has become a research hotspot.
  • The core technologies in the field of autonomous driving include intelligent environment perception, automatic navigation and positioning, driving behavior decision-making, and intelligent path planning and control.
  • In autonomous driving technology, object detection on road information (for example, detecting road signs, pedestrians, and the like) is a prerequisite for driving behavior decision-making; from the perspective of image processing, object detection needs to analyze and determine the category information and position information of each object in the image.
  • Machine learning methods based on neural networks are generally used for object detection, and the neural networks used for object detection need to be trained before use; currently, training is generally performed on images (image sets) acquired by the camera devices of autonomous vehicles.
  • Among these images there may be images shot at the same place multiple times, or images shot continuously (that is, images related in time or space), so there is great similarity between the images, which constitutes redundant data in the training samples. The redundant data increases the time required for training, and too many identical/similar samples affect how the training handles other samples and may cause over-fitting.
  • In fact, only a small number of key frames are needed to train an object detection neural network well; generally speaking, a key frame describes the turning point of an object's action or the moment at which the scene switches, and therefore contains richer information. How to obtain key frames is therefore an urgent problem for automatic driving.
  • The prior art generally needs to use the correlation between continuous frame images (that is, images related in time and space) to acquire key frames, which not only has high processing redundancy but also narrows the range of image sets from which key frames can be selected; on the other hand, the prior art does not consider the position information of the object when acquiring key frames, and therefore cannot select key frames suitable for predicting the position information of the object to be detected.
  • To solve the related technical problems, embodiments of the present application provide an image processing system, a method, and an automatic driving vehicle including the system.
  • In a first aspect, an image processing system is provided, including a Triplet-architecture convolutional neural network (comprising a first convolutional neural network, a second convolutional neural network, and a third convolutional neural network) and a channel splicing section.
  • For one frame of image, the Triplet-architecture convolutional neural networks are configured to obtain three kinds of information (the image, the object in the image, and the position of the object in the image) and to perform feature extraction on these three kinds of information.
  • The extracted features are passed through the channel splicing section to generate an image matrix, which includes the image, the object in the image, and the position information of the object.
  • A feature vector can be obtained by feature extraction on the image matrix, and key frames can be obtained by clustering and analyzing the feature vectors.
  • The image processing system of the present application can process disordered images (that is, images that are not related in time and/or space) and obtain key frames, thereby solving the problem of excessive redundant information in the key frame acquisition process of the prior art and improving the efficiency and generality of key frame acquisition.
  • On the other hand, the present application fully considers the position information of the object in the image during feature extraction, thus improving the accuracy of key frame acquisition.
  • In a possible implementation of the first aspect, a hidden layer can be arranged after the channel splicing section to perform feature extraction on the image matrix and obtain the feature vector.
  • The hidden layer can be implemented using neuron layers.
  • The input layer of the hidden layer is logically connected to the channel splicing section.
  • In another possible implementation of the first aspect, a fourth convolutional neural network may be arranged after the channel splicing section to perform feature extraction on the image matrix and obtain the feature vector; the input layer of this convolutional neural network is logically connected to the channel splicing section.
  • The image processing system of the first aspect needs to be trained before it is used.
  • In a possible implementation, an architecture similar to an autoencoder is used for training.
  • An autoencoder is an artificial neural network that can learn an efficient representation of input data through unsupervised learning.
  • The autoencoder in this application further includes a channel separation section logically connected to the output layer of the hidden layer or of the fourth convolutional neural network; the channel separation section is configured to perform channel separation on the output of the hidden layer or the convolutional neural network, and the separated channels include: an image channel, an object channel, and an object position information channel.
  • The above image channel, object channel, and object position information channel are respectively logically connected to the inputs of a fifth, a sixth, and a seventh convolutional neural network; the fifth, sixth, and seventh convolutional neural networks are respectively used to extract image features, object features, and object position information features, and to use these features to reconstruct the image, the object in the image, and the position information of the object.
  • The above first, second, and third convolutional neural networks belong to the encoding end of the autoencoder, and the above fifth, sixth, and seventh convolutional neural networks belong to the decoding end of the autoencoder.
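  • As an illustration only, the channel splicing section and the channel separation section described above can be realized as concatenation and splitting along the channel dimension; the tensor shapes and channel counts below are hypothetical, and PyTorch is used purely for the sketch.

```python
import torch

# Hypothetical per-branch feature maps from the first, second and third convolutional
# neural networks, each of shape (batch, channels, height, width).
f_image  = torch.randn(1, 8, 32, 32)   # image features
f_object = torch.randn(1, 8, 32, 32)   # object features
f_pos    = torch.randn(1, 8, 32, 32)   # object position information features

# Channel splicing section: concatenate along the channel axis to form the image matrix.
image_matrix = torch.cat([f_image, f_object, f_pos], dim=1)    # (1, 24, 32, 32)

# (the hidden layer or fourth convolutional neural network would process image_matrix here)

# Channel separation section: split back into image / object / position information channels.
img_ch, obj_ch, pos_ch = torch.split(image_matrix, [8, 8, 8], dim=1)
```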
  • In a possible implementation of the first aspect, the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network may include downsampling layers.
  • A downsampling layer can reduce the amount of computation required for data processing and prevent over-fitting; downsampling can be achieved by, for example, a pooling layer (including max pooling, min pooling, and average pooling).
  • When downsampling is used on the encoding end, the fifth convolutional neural network, the sixth convolutional neural network, and the seventh convolutional neural network on the decoding end can include upsampling layers; upsampling restores the data dimensions so that the input information can be reconstructed.
  • In a possible implementation of the first aspect, the hidden layer may include an even number of neuron layers. Since the encoding end and the decoding end are symmetric, using a hidden layer with an even number of layers can be more conducive to keeping the (neuron) weights on the encoding end and the decoding end consistent. Therefore, a hidden layer with, for example, two or four neuron layers can be used.
  • In a possible implementation of the first aspect, a convolutional neural network can be used to replace the hidden layer.
  • The convolutional neural network can adopt a general architecture. For reasons similar to those for choosing an even number of hidden layers above, the convolutional neural network may include an even number of convolutional layers; for example, a convolutional neural network with two or four convolutional layers may be used.
  • In a second aspect, this application also provides an image processing method, which can be executed by, for example but not limited to, the trained image processing system of the first aspect.
  • For an image to be processed, image features, object features in the image, and position information features of the object are obtained.
  • The image features, the object features, and the position information features of the object are fused to obtain an image matrix.
  • A feature vector including the image features, the object features, and the position information features of the object is obtained from the image matrix.
  • The feature vectors are clustered to obtain a clustering result; for example, K-means clustering or a clustering method that minimizes the distance from the centroid to the points in the cluster can be used.
  • Multiple cluster categories are obtained from the clustering result, each of the multiple cluster categories including at least one image; the multiple cluster categories are sorted according to a set rule, and for each of the multiple cluster categories the first image after sorting is selected as a key frame; the key frames are used as training material for the object recognition algorithm.
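  • A minimal sketch of the clustering step, assuming scikit-learn's K-means and a hypothetical array of per-image feature vectors (the names and sizes are illustrative, not mandated by this application):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature vectors produced by the encoder, one row per image.
feature_vectors = np.random.rand(500, 128)

# Cluster the feature vectors; the number of clusters is a design choice.
kmeans = KMeans(n_clusters=50, random_state=0)
labels = kmeans.fit_predict(feature_vectors)    # cluster category of each image

# Group image indices by cluster category for the later sorting / key frame selection.
clusters = {c: np.where(labels == c)[0] for c in range(kmeans.n_clusters)}
```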
  • In a third aspect, an autonomous driving vehicle is provided, which includes the image processing system of the aforementioned first aspect.
  • In a fourth aspect, an autonomous driving vehicle is provided, which is configured to communicate with the cloud.
  • The image processing system of the first aspect is deployed in the cloud.
  • The images acquired by the autonomous driving vehicle are transmitted to the image processing system in the cloud.
  • The image processing system processes the images to obtain the key frames.
  • In a fifth aspect, an automatic driving assistance system is provided, which includes the image processing system of the aforementioned first aspect.
  • In a sixth aspect, an automatic driving assistance system is provided, which is configured to communicate with the cloud.
  • The image processing system of the first aspect is deployed in the cloud, and the images obtained by the automatic driving assistance system are transmitted to the cloud image processing system.
  • The image processing system processes the images to obtain the key frames.
  • In a seventh aspect, a neural network processor is provided, and the neural network processor is configured to perform the image processing method of the foregoing second aspect.
  • In an eighth aspect, an autoencoder is provided, including an encoding end, a decoding end, and a hidden layer arranged between the encoding end and the decoding end. The encoding end includes: a first neural network, which includes at least one neuron layer and is configured to perform feature extraction on an image; a second neural network, which includes at least one neuron layer and is configured to perform feature extraction on an object in the image; a third neural network, which includes at least one neuron layer and is configured to perform feature extraction on the position information of the object in the image; and a channel splicing section, which is logically connected to the output layers of the first neural network, the second neural network, and the third neural network and is configured to receive the outputs of these three networks and generate an image matrix based on the received outputs. The hidden layer includes at least one neuron layer, its input layer is logically connected to the channel splicing section, and it is configured to perform feature extraction on the image matrix.
  • The decoding end includes a channel separation section, which is logically connected to the output layer of the hidden layer and is configured to perform channel separation on the output of the hidden layer; the separated channels include an image channel, a channel of the object to be detected, and a position information channel of the object to be detected.
  • The decoding end further includes a fourth neural network, which includes at least one neuron layer and is configured to be logically connected to the image channel and to obtain image features;
  • a fifth neural network, which includes at least one neuron layer and is configured to be logically connected to the channel of the object to be detected and to obtain features of the object to be detected;
  • and a sixth neural network, which includes at least one neuron layer and is configured to be logically connected to the position information channel of the object to be detected and to obtain position information features of the object to be detected.
  • The autoencoder of the eighth aspect provides a general image processing system architecture; appropriate modifications can be made to it to obtain different image processing systems suitable for different scenarios. For example, the first to sixth neural networks of this autoencoder can be replaced with first to sixth convolutional neural networks, or the hidden layer can further be replaced with a convolutional neural network, thereby obtaining the various technical solutions of the first aspect.
  • Various embodiments of the present application provide an image processing system, a method, and an autonomous vehicle including the system; the image processing system of the present application adopts a Triplet architecture.
  • For one frame of image, the image processing system/method of the present application can simultaneously obtain features including image features, object features in the image, and position information features of the object, obtain feature vectors based on this feature information, and obtain key frame images based on clustering and analysis of the feature vectors.
  • The system/method of the present application does not require temporal continuity or spatial correlation of the processed images; that is, it can process any image (set) and obtain the key frames therein. The system/method of the present application therefore reduces redundant information processing and improves the efficiency of key frame acquisition.
  • On the other hand, the present application fully considers the position information of the object in the image during feature extraction, and prediction based on the object position information improves the accuracy of key frame acquisition.
  • In addition, this application also provides an image processing method, a neural network processor, and an autoencoder architecture.
  • FIG. 1 is a schematic diagram of an image, an object in the image, and position information of the object provided by an embodiment of the present application;
  • FIG. 2-1 is a schematic diagram of an image processing system provided by an embodiment of the present application;
  • FIG. 2-2 is a schematic diagram of an image processing system provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the architecture of a convolutional neural network provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of weight sharing between the encoding end and the decoding end of an image processing system provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of training of an image processing system provided by an embodiment of the present application;
  • FIG. 6-1 is a schematic diagram of feature extraction performed by a trained image processing system provided by an embodiment of the present application;
  • FIG. 6-2 is a schematic diagram of feature extraction performed by a trained image processing system provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of an autoencoder provided by an embodiment of the present application;
  • FIG. 8 is a schematic flowchart of an image processing method provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of obtaining key frames from an image set provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of an autonomous driving vehicle provided by an embodiment of the present application;
  • FIG. 11 is a schematic diagram of the architecture of an image processing system provided by an embodiment of the present application;
  • FIG. 12 is a schematic diagram of the architecture of a neural network processor provided by an embodiment of the present application.
  • Various embodiments of the present application provide an image processing system and method, and an automatic driving vehicle adopting the system.
  • When extracting features from an image, the image processing system of the embodiments of the present application considers three different kinds of information (ternary information): the image, the object in the image, and the position information of the object, and an encoder-decoder neural network structure with a Triplet architecture is designed based on these three kinds of information, so that the object information and the position information of the object are obtained at the same time as the image information; key frames can therefore be obtained more accurately based on prediction of the object's position information.
  • On the other hand, the solution of the embodiments of the present application can not only be used to obtain key frames from a traditional continuous-frame image set, but can also directly perform key frame acquisition on a disordered image set without using temporally and/or spatially correlated image sets, which reduces the degree of redundancy during processing, improves the efficiency of key frame acquisition, and also expands the range of image sets selectable for key frame acquisition.
  • FIG. 1 shows a schematic diagram of the ternary information 100 according to an embodiment of the present application, where 11 is a frame of image that includes an object 111, the object 111 being a sign ("School ahead, drive slowly"); 12 is the separately segmented object (that is, the sign in 11); and 13 is the position information of the object 111 in 11.
  • FIG. 1 shows one frame of image and one object in it; it should be understood that this is only a schematic illustration.
  • A frame of image can also include multiple objects, and an object can be a thing, an animal, or a person.
  • The determination of the object can be performed manually, for example by crowdsourcing; it can also be implemented automatically using general object segmentation and semantic segmentation machine learning methods, which is not limited in this application.
  • In some embodiments, the position information of the object to be detected is represented by the X and Y channel values of the positions of the pixels of the object to be detected in the image.
  • The data in 13 indicate the X and Y channel values of the sign 111 in 11.
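  • As an illustration only (a sketch under assumed data layouts, not the claimed implementation), the ternary information of one frame can be assembled from the image and an object mask as follows; the function name and the (H, W, 2) position layout are hypothetical.

```python
import numpy as np

def ternary_information(image, object_mask):
    # image:       (H, W, 3) array, the frame itself
    # object_mask: (H, W) binary array marking the pixels of the detected object
    h, w = object_mask.shape

    # Second element of the triplet: the object separated out of the frame.
    object_only = image * object_mask[..., None]

    # Third element: position information as the X and Y channel values of the object's pixels.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    position = np.stack([xs * object_mask, ys * object_mask], axis=-1)  # (H, W, 2)

    return image, object_only, position
```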
  • FIG. 2-1 shows a schematic diagram of an image processing system 210 based on some embodiments.
  • the image processing system 210 mainly includes an encoding end 211 and a decoding end 212.
  • A hidden layer 24 is provided between the encoding end 211 and the decoding end 212.
  • The encoding end 211, the decoding end 212, and the hidden layer 24 as a whole constitute an autoencoder architecture.
  • The encoding end 211 includes three convolutional neural networks 21, 22, and 23 and a channel splicing section 28; the input of the channel splicing section 28 is logically connected to the outputs of the convolutional neural networks 21, 22, and 23.
  • The decoding end 212 includes three convolutional neural networks 25, 26, and 27 and a channel separation section 29; the output of the channel separation section 29 is logically connected to the inputs of the convolutional neural networks 25, 26, and 27.
  • the input of the hidden layer 24 is logically connected to the output of the channel splicing part 28, and the output of the hidden layer 24 is logically connected to the input of the channel separating part 29.
  • The hidden layer 24 may be, for example, fully connected neuron layers with an even number of layers; since the encoding end and the decoding end are symmetric, using a hidden layer with an even number of layers may be more conducive to keeping the weights of the neurons on the encoding end and the decoding end the same.
  • the hidden layer includes two neuron layers. In other embodiments, the hidden layer includes four neuron layers, and the neuron layers can be fully connected.
  • In other embodiments, a convolutional neural network may be used to replace the hidden layer 24, yielding the image processing system 220 (see FIG. 2-2).
  • The convolutional neural network can use a general architecture, such as (but not limited to) a convolution-pooling-convolution-pooling-fully-connected architecture. For reasons similar to those for choosing an even number of hidden layers above, the convolutional neural network may include an even number of convolutional layers.
  • The convolutional neural networks of the encoding end 211 and the decoding end 212 may adopt a general architecture. Referring to FIG. 3, it shows a schematic diagram of the architecture of a convolutional neural network 300 in the image processing system: the convolutional neural network 300 includes three modules, each module including a convolutional layer 31 and a pooling layer 33, with an activation function (layer) 32 between the convolutional layer and the pooling layer; finally, a fully connected layer 34 is provided as the output layer.
  • The convolutional layer performs a convolution operation on the input (image) data. The convolution operation is equivalent to a filtering operation in image processing, that is, a filter of a set size is slid over the image in steps while multiplying and accumulating. Through the convolution operation, characteristic parts of the image can be extracted.
  • The pooling layer is used to reduce the size in the height and width directions. Pooling generally includes max pooling, min pooling, and average pooling. Pooling can reduce the data size and provides robustness/invariance to small changes in the input data.
  • The activation function may use functions well known in the field of machine learning, such as ReLU, Sigmoid, Tanh, and Maxout.
  • The architecture example of the convolutional neural network shown in FIG. 3 is only one possible arrangement; those skilled in the art can change the number of convolutional layers and/or pooling layers according to actual needs without departing from the spirit of this application.
  • In this application, in order to extract features of an image in depth, generally three or more convolutional layers are used.
  • When the number of convolutional layers is large (for example, more than 5 layers), it is preferable to use the ReLU function as the activation function.
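  • As an illustration only, a network in the style of FIG. 3 (three convolution-activation-pooling modules followed by a fully connected output layer) could be sketched as follows; all channel counts and the assumed 3 x 64 x 64 input size are hypothetical.

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    # Three conv -> ReLU -> max-pool modules, then a fully connected output layer.
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1),          nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),          nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, feat_dim)   # assumes a 64 x 64 input

    def forward(self, x):
        x = self.features(x)
        return self.fc(torch.flatten(x, start_dim=1))
```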
  • In some embodiments, the neuron weights can be shared between the convolutional neural networks on the encoding side and the decoding side (indicated by the dotted lines in FIG. 4). By sharing weights, the number of parameters of the convolutional neural network can be reduced and the computational efficiency improved.
  • In FIG. 4, the encoding end includes three identical convolutional neural networks 41, 42, and 43; the decoding end also includes three identical convolutional neural networks 45, 46, and 47.
  • the output of the channel splicing unit 48 on the encoding end is logically connected to the input of the hidden layer 44, and the output of the hidden layer 44 is logically connected to the input of the channel separating unit 49 on the decoding end.
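  • One possible way to realize the weight sharing indicated by the dotted lines in FIG. 4 (an assumption for illustration, not the patent's mandated mechanism) is to let a decoder-side transposed convolution reuse the parameter tensor of the mirrored encoder-side convolution, since the two weight tensors have the same shape:

```python
import torch.nn as nn

enc_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)            # encoder-side layer
dec_conv = nn.ConvTranspose2d(16, 3, kernel_size=3, padding=1)   # mirrored decoder-side layer

# Tie the parameters: both layers now use (and train) the same weight tensor,
# which reduces the number of parameters of the network.
dec_conv.weight = enc_conv.weight
```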
  • the image processing system needs to be trained before feature extraction is performed on the image to obtain key frames.
  • the training process is introduced as follows:
  • FIG. 5 shows a network architecture basically consistent with the example of FIG. 2-2.
  • The three convolutional neural networks 51, 52, and 53 on the encoding end in FIG. 5 are configured to respectively process the ternary information (the image, the object in the image, and the position information of the object) of a specific image frame to extract feature information.
  • In order to reduce the amount of computation required for data processing and prevent over-fitting, downsampling can be used in the convolutional neural networks 51, 52, and 53 at the encoding end.
  • In some embodiments, a pooling layer implements the downsampling, and the pooling layer can use max pooling, min pooling, or average pooling. Downsampling can also be achieved by adjusting the convolution stride so that the stride is greater than one.
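  • The two downsampling options mentioned above can be sketched as follows (illustrative shapes only):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)   # a hypothetical feature map

# Downsampling with a pooling layer (max pooling shown; min/average pooling are analogous).
pooled = nn.MaxPool2d(kernel_size=2)(x)                              # -> (1, 16, 16, 16)

# Downsampling by making the convolution stride greater than one.
strided = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)(x)   # -> (1, 16, 16, 16)
```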
  • The image feature information, the object feature information in the image, and the object position feature information in the image can then be obtained at the output layers of the three neural networks, respectively.
  • The above three kinds of information are channel-spliced by the channel splicing section 58 to obtain an image matrix, which includes the above three kinds of feature information, namely the image feature information, the object feature information in the image, and the object position feature information in the image.
  • The image matrix is input to the convolutional neural network 54 located between the encoding end and the decoding end for feature extraction; the acquired features are then channel-separated by the channel separation section 59 and input to the three convolutional neural networks 55, 56, and 57 on the decoding end, which reconstruct the image, the object in the image, and the position information of the object in the image. Since downsampling is performed on the encoding side in this embodiment, the data are reduced in dimension, and an upsampling process is performed in the convolutional neural networks 55, 56, and 57 on the decoding side to restore the data dimensions; in some embodiments, upsampling can be implemented using bilinear interpolation.
  • The image, the object in the image, and the position information of the object are reconstructed from the features acquired by the three convolutional neural networks 55, 56, and 57 on the decoding side; the reconstructions are compared with the image, the object in the image, and the position information of the object at the input (encoding) end, and the error back propagation (BP) method is used to train the weights of the neurons on the decoding side and the encoding side.
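  • A minimal sketch of such a training step, assuming hypothetical encoder/decoder modules that follow the triplet layout described above (the module names, the MSE reconstruction loss, and the optimizer are illustrative choices, not requirements of this application):

```python
import torch.nn.functional as F

def train_step(encoder, decoder, optimizer, image, obj, pos):
    # encoder(...) produces the spliced/compressed representation; decoder(...) returns the
    # three reconstructions (image, object in the image, position information of the object).
    optimizer.zero_grad()
    rec_image, rec_obj, rec_pos = decoder(encoder(image, obj, pos))
    loss = (F.mse_loss(rec_image, image)
            + F.mse_loss(rec_obj, obj)
            + F.mse_loss(rec_pos, pos))
    loss.backward()    # error back propagation (BP) through decoder and encoder
    optimizer.step()
    return loss.item()

# Inside the decoder branches, bilinear interpolation can restore the spatial dimensions, e.g.:
# up = F.interpolate(feature_map, scale_factor=2, mode="bilinear", align_corners=False)
```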
  • After training is completed, the encoding end (that is, the feature extraction end) can be used to perform feature extraction on the images to be processed.
  • It should be emphasized that the image processing system of the present application can perform feature extraction on disordered images, that is, images that are unrelated in time and/or space, and use the acquired features to perform key frame selection.
  • The above training process is also applicable to, for example but not limited to, the network architecture shown in FIG. 2-1: the image matrix undergoes feature extraction in the hidden layer between the encoding end and the decoding end, and after channel separation the acquired features are respectively input to the three convolutional neural networks at the decoding end to reconstruct the image, the object in the image, and the position information of the object in the image.
  • After the image processing system is trained, image feature extraction can be performed and key frames can be determined based on the extracted features.
  • FIG. 6-1 shows a schematic diagram of feature extraction on an image by an image processing system provided by an embodiment after training is completed.
  • The image, the object in the image, and the position information of the object are respectively input into the three convolutional neural networks 611, 612, and 613, and feature extraction is performed on each.
  • In some embodiments, downsampling can also be used in the process of feature extraction; downsampling can be achieved by using, for example, pooling or by adjusting the convolution stride so that the stride is greater than one.
  • In other embodiments, convolutional neural networks without downsampling can also be used directly for image feature extraction without departing from the spirit of this application.
  • After feature extraction, the channel splicing section 614 performs channel splicing to obtain an image matrix including the image feature information, the object feature information in the image, and the object position feature information in the image; the hidden layer 615 is then used to perform feature extraction on the image matrix, and finally a feature vector expressed as a one-dimensional vector is obtained. Each feature vector includes three kinds of feature information: the image information, the object information in the image, and the position information of the object.
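  • As an illustration only, a hidden layer consisting of two fully connected neuron layers could turn the spliced image matrix into such a one-dimensional feature vector (all sizes below are assumptions):

```python
import torch
import torch.nn as nn

hidden = nn.Sequential(
    nn.Flatten(),
    nn.Linear(24 * 32 * 32, 512), nn.ReLU(),
    nn.Linear(512, 128),
)
image_matrix = torch.randn(1, 24, 32, 32)   # output of the channel splicing section
feature_vector = hidden(image_matrix)       # (1, 128): image + object + position features
```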
  • FIG. 6-2 shows a schematic diagram of feature extraction on an image by an image processing system provided by an embodiment after training is completed.
  • The image, the object in the image, and the position information of the object are respectively input into the three convolutional neural networks 621, 622, and 623, and feature extraction is performed on each.
  • In some embodiments, downsampling is also used in the feature extraction process; this can be achieved using, for example, pooling or by adjusting the convolution stride so that the stride is greater than one.
  • In other embodiments, convolutional neural networks without downsampling can also be used directly for image feature extraction without departing from the spirit of this application.
  • After feature extraction, the channel splicing section 624 performs channel splicing to obtain an image matrix including the image feature information, the object feature information in the image, and the object position feature information in the image; the convolutional neural network 625 is then used to perform feature extraction on the image matrix, and finally a feature vector in the form of a one-dimensional vector is obtained. Each feature vector includes three kinds of feature information: the image information, the object information in the image, and the position information of the object.
  • After the feature vectors are obtained, they are clustered.
  • Clustering methods known in the field of machine learning, such as the K-means clustering method or a clustering method that minimizes the distance from the centroid to the points in the cluster, can be used to cluster the feature vectors.
  • After clustering, the object categories contained in the different images and the number of objects in each category are counted, and a table with the structure of Table 1 below is generated:
  • Table 1:

    Cluster category     1   2   3   4   ...   Number of categories
    Image 1              1   1   1   0   ...   3
    Image 2              2   0   1   4   ...   3
    Image 3              0   1   2   0   ...   2
    Image 4              0   0   0   0   ...   0
    Number of objects    3   2   4   4   ...
  • Table 2 (Table 1 after the first sorting):

    Cluster category     2   1   3   4   ...   Number of categories
    Image 1              1   1   1   0   ...   3
    Image 3              1   0   2   0   ...   2
    Image 2              0   2   1   4   ...   3
    Image 4              0   0   0   0   ...   0
    Number of objects    2   3   4   4   ...
  • The number of cluster categories can be set based on actual needs.
  • The above tables show four cluster categories and four images for illustrative purposes only.
  • Each object's position information may give rise to multiple classification categories, and the total number of cluster categories may range from hundreds to thousands.
  • The number of images can likewise range from hundreds to thousands.
  • Based on the clustering results, the key frames can be selected according to the following steps.
  • The specific steps are as follows:
  • The sorting rule is: sort in descending order based on the number of objects in the primary key category; when the number of objects in the primary key category is the same, sort in descending order by the number of objects in the secondary key category. See Table 2, which shows the result of the first sorting of Table 1.
  • The key frame can be determined based on the clustering result.
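  • The sketch below reproduces Tables 1 and 2 and one reading of the sorting rule that is consistent with them (the ascending ordering of the categories is inferred from the tables; the full steps (1)-(5) of the original are not reproduced here):

```python
import numpy as np

# Table 1: rows are Images 1-4, columns are cluster categories 1-4,
# entries count the objects of each category contained in each image.
counts = np.array([[1, 1, 1, 0],
                   [2, 0, 1, 4],
                   [0, 1, 2, 0],
                   [0, 0, 0, 0]])

# Order the categories by their total object count (Table 2 places the category
# with the fewest objects first, making it the primary key category).
category_order = np.argsort(counts.sum(axis=0), kind="stable")   # -> [1, 0, 2, 3]
sorted_cols = counts[:, category_order]

# Sort the images in descending order of objects in the primary key category,
# breaking ties with the secondary key category, and take the first image as a key frame.
image_order = sorted(range(counts.shape[0]),
                     key=lambda i: (-sorted_cols[i, 0], -sorted_cols[i, 1]))
key_frame = image_order[0] + 1   # -> Image 1, matching the first row of Table 2
```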
  • Referring to FIG. 7, it shows an autoencoder, which includes an encoding end 701, a decoding end 702, and a hidden layer 74 arranged between the encoding end 701 and the decoding end 702.
  • The encoding end 701 includes: a neural network 71, a neural network 72, a neural network 73, and a channel splicing section 78.
  • The decoding end 702 includes: a neural network 75, a neural network 76, a neural network 77, and a channel separation section 79.
  • Each of the neural networks 71, 72, 73, 75, 76, and 77 includes at least one neuron layer.
  • The hidden layer 74 includes at least one neuron layer; in some embodiments, the hidden layer 74 may include an even number of neuron layers.
  • The neural networks 71, 72, and 73 can be configured to obtain features of the image, the object in the image, and the position information of the object, respectively.
  • The channel splicing section 78 is configured to be logically connected to the output layers of the neural networks 71, 72, and 73.
  • The channel splicing section 78 receives the outputs of the neural networks 71, 72, and 73 and generates an image matrix based on the received outputs.
  • The input layer of the hidden layer 74 is logically connected to the channel splicing section, and the hidden layer 74 is configured to perform feature extraction on the image matrix.
  • The channel separation section 79 is logically connected to the output layer of the hidden layer.
  • The channel separation section 79 is configured to perform channel separation on the output of the hidden layer 74.
  • The separated channels include: an image channel, an object channel, and an object position information channel.
  • The neural networks 75, 76, and 77 can be configured to be logically connected to the image channel, the object channel, and the object position information channel, respectively, and to obtain the image features, the object features, and the object position information features.
  • FIG. 8 shows the flow of an image processing method based on some embodiments of the present application, including:
  • Feature extraction may include first acquiring the image features, the object features in the image, and the position information features of the object, and then obtaining feature vectors based on the image features, the object features, and the object position information features;
  • the feature vectors can be clustered using, for example, K-means clustering or a clustering method that minimizes the distance from the centroid to the points in the cluster;
  • the set rules may include, for example, the above-mentioned steps (1)-(5);
  • In some embodiments, an autonomous driving vehicle 1000 may include a sensor system 101, a control system 102, a driving system 103, and the like.
  • The sensor system 101 may include, for example but not limited to, a positioning system (GPS), inertial navigation, a laser radar (lidar), a millimeter-wave radar, a camera, and the like.
  • The control system 102 may include, for example but not limited to, systems/devices such as an autonomous driving vehicle computing platform, and the control system may include an automatic driving system (Autonomous Driving System, ADS for short) 104.
  • the driving system 103 may include, for example, but not limited to, an engine, a transmission device, an electric energy source, a wire control system, and the like.
  • the sensor system 101, the control system 102, and the drive system 103 can be communicatively linked.
  • The image processing system described in the above embodiments can be deployed in the automatic driving assistance system of the control system, and can process the various image frames/streams acquired by the camera of the sensor system during the driving of the vehicle to obtain the key frames.
  • Under normal circumstances, when the automatic driving vehicle 1000 runs for a day, the image frames/streams collected by the camera often reach a scale of several gigabytes or even dozens of gigabytes; the key frame set that the image processing system can select from these image frames/streams is generally only tens of megabytes in size.
  • Therefore, using the technical solution of this application can significantly eliminate redundant data, and the acquired key frames can be used for subsequent training of the neural network of the target detection algorithm. Refer to FIG. 9, which shows a schematic diagram of the automatic driving vehicle 1000 of the embodiment acquiring key frames and thereby eliminating redundant data.
  • The images acquired by the autonomous vehicle 91 during driving include the three frames shown in FIG. 9: 901, 902, and 903; all three frames include the road and the vehicle 92 on the road. What distinguishes image 903 from images 901 and 902 is that a pedestrian 93 appears in image 903.
  • Based on the technical solution of the present application, 903 can be selected and marked as a key frame; accordingly, images 901 and 902 are redundant and can be deleted.
  • It should be noted that the three frames of images exemplarily given in FIG. 9 have a certain relevance in time and space, but the image processing system of the technical solution of the present application can also process disordered image sets that are not related in time and space and obtain the key frames.
  • the technical solution of the present application can also be configured in the cloud.
  • the image frames/streams acquired by the vehicle can be transmitted to the cloud through the communication network.
  • the image frames/streams are processed in the cloud to obtain key frames.
  • the key frames can be used to train the neural network of the target detection algorithm.
  • an automatic driving system (Autonomous Driving System: ADS) for autonomous driving vehicles is provided, which may include the image processing system of the present application.
  • the acquired various image frames/streams are processed to acquire the key frames.
  • In other embodiments, the image processing system of the present application can also be deployed in the cloud; the images acquired by the automatic driving assistance system during the driving of the vehicle are transmitted to the image processing system in the cloud, and the above image frames/streams are processed in the cloud to obtain the key frames.
  • The obtained key frames can be used for subsequent training of the neural network of the object detection algorithm.
  • a Neural-Network Processing Unit (NPU) is provided.
  • the neural network processor may be set in, for example, but not limited to, the control system 102 as shown in FIG. 10, and the algorithms of various image processing systems provided by the embodiments can all be implemented in the neural network processor.
  • FIG. 11 shows an image processing system architecture 1100 provided by an embodiment of the present application.
  • a data collection device 116 is used to collect image data.
  • The data collection device 116 stores the training data in the database 113, and the training device 112 performs training based on the training data maintained in the database 113 to obtain the target model/rule 1171 (that is, the autoencoder model in various embodiments of the present application).
  • The target model/rule 1171 is obtained by training the autoencoder model. It should be noted that, in practical applications, the training data maintained in the database 113 does not necessarily all come from the collection of the data collection device 116, and may also be received from other devices.
  • It should also be noted that the training device 112 does not necessarily perform the training of the target model/rule 1171 entirely based on the training data maintained by the database 113; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of this application. It should further be noted that at least part of the training data maintained in the database 113 may also be used in the process by which the execution device 111 processes the data to be processed.
  • the target model/rule 1171 trained according to the training device 112 can be applied to different systems or devices, such as the execution device 111 shown in FIG. 11.
  • The execution device 111 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, an AR/VR device, or a vehicle-mounted terminal, and can also be a server or the cloud.
  • the execution device 111 is configured with an input/output (I/O) interface 1110 for data interaction with external devices.
  • The preprocessing module 118 and the preprocessing module 119 are used to perform preprocessing on the input data (such as the image to be processed) received by the I/O interface 1110.
  • In some cases, the preprocessing module 118 and the preprocessing module 119 may not be provided, or there may be only one preprocessing module, and the calculation module 117 is used directly to process the input data.
  • When the execution device 111 preprocesses the input data, or when the calculation module 117 of the execution device 111 performs calculation and other related processing, the execution device 111 can call data, code, and the like in the data storage system 115 for the corresponding processing.
  • The data, instructions, and the like obtained by the corresponding processing can also be stored in the data storage system 115.
  • Finally, the I/O interface 1110 returns the processing result, such as the output obtained as described above, to the client device 114 so as to provide it to the user.
  • It is worth noting that the training device 112 can generate corresponding target models/rules 1171 based on different training data for different goals or different tasks, and the corresponding target models/rules 1171 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • It is worth noting that FIG. 11 is only a schematic diagram of an image processing system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation.
  • In FIG. 11, the data storage system 115 is an external memory relative to the execution device 111; in other cases, the data storage system 115 may also be placed in the execution device 111.
  • FIG. 12 is a hardware structure of a chip provided by an embodiment of the present application.
  • the chip includes a neural network processor 120 (neural-network processing unit, NPU).
  • the chip can be set in the execution device 111 as shown in FIG. 11 to complete the calculation work of the calculation module 117.
  • the chip can also be set in the training device 112 shown in FIG. 11 to complete the training work of the training device 112 and output the target model/rule 1171.
  • The NPU 120 is mounted as a coprocessor on a main central processing unit (CPU), and the main CPU allocates tasks.
  • The core part of the NPU 120 is the arithmetic circuit 123; the controller 126 controls the arithmetic circuit 123 to extract data from the memory (weight memory or input memory) and perform calculations.
  • In some implementations, the arithmetic circuit 123 includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 123 is a two-dimensional systolic array; the arithmetic circuit 123 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 123 is a general-purpose matrix processor.
  • In some implementations, the arithmetic circuit 123 fetches the data corresponding to a weight matrix B from the weight memory 122 and caches it on each PE in the arithmetic circuit 123.
  • The arithmetic circuit 123 then fetches the data of an input matrix A from the input memory 1210 and performs matrix operations on A and B, and the partial or final results of the obtained matrix are stored in the accumulator 124.
  • the vector calculation unit 129 can perform further processing on the output of the arithmetic circuit 123, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • In some implementations, the vector calculation unit 129 can be used for network calculations in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, and local response normalization.
  • the vector calculation unit 129 can store the processed output vector to the unified memory 127.
  • the vector calculation unit 129 may apply a nonlinear function to the output of the arithmetic circuit 123, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 129 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 123, for example for use in a subsequent layer in a neural network.
  • the unified memory 127 is used to store input data and output data.
  • The direct memory access controller (DMAC) 128 is used to transfer the input data in the external memory into the input memory 1210 and/or the unified memory 127, to transfer the weight data in the external memory into the weight memory 122, and to transfer the data in the unified memory 127 into the external memory.
  • The bus interface unit (BIU) 121 is used to implement interaction among the main CPU, the DMAC, and the instruction fetch memory 125 through the bus.
  • the instruction fetch buffer 125 connected to the controller 126 is used to store instructions used by the controller 126.
  • the controller 126 is used to call the instructions cached in the instruction fetch memory 125 to control the working process of the computing accelerator.
  • the unified memory 127, the input memory 1210, the weight memory 122, and the instruction fetch memory 125 are all on-chip (On-Chip) memories.
  • the external memory is a memory external to the NPU.
  • The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • Various embodiments of the present application provide an image processing system, method, and an autonomous vehicle including the system.
  • the image processing system of the present application adopts the Triplet architecture.
  • For one frame of image, the image processing system/method of the present application can simultaneously obtain features including image features, object features in the image, and object position information features, obtain feature vectors based on this feature information, and obtain key frame images based on clustering and analysis of the feature vectors.
  • The system/method of the present application does not require the processed images to be continuous frames; that is, the system/method of the present application can process arbitrary, disordered images and obtain the key frames therein.
  • The system/method of the present application therefore solves the problem of redundant information processing caused by the need to use continuous frames when acquiring key frames in the prior art, and improves the efficiency of key frame acquisition.
  • On the other hand, the present application fully considers the position information of the object in the image during feature extraction, thus improving the accuracy of key frame acquisition.
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • business units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be realized in the form of hardware or software business unit.
  • If the integrated unit is realized in the form of a software unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • the services described in the present invention can be implemented by hardware, software, firmware, or any combination thereof.
  • these services can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium.
  • the computer-readable medium includes a computer storage medium and a communication medium, where the communication medium includes any medium that facilitates the transfer of a computer program from one place to another.
  • the storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

This application relates to the field of artificial intelligence and discloses an image processing system, a method, and an autonomous driving vehicle including the system. The image processing system of this application adopts a Triplet architecture. For one frame of image, the image processing system/method of this application can simultaneously extract image features, object features in the image, and position information features of the object, obtain a feature vector based on this feature information, and obtain key frame images by clustering and analyzing the feature vectors. The system/method of this application imposes no requirement of temporal continuity or spatial correlation on the processed image set; that is, the system/method of this application can process an arbitrary image set and obtain the key frames therein, which reduces processing redundancy and improves the efficiency of key frame acquisition. On the other hand, this application fully considers the position information of the object in the image during feature extraction, thus improving the accuracy of key frame acquisition.

Description

Image Processing System and Method, and Autonomous Driving Vehicle Including the System
Technical Field
This application relates to the field of artificial intelligence, and in particular to an image processing system and method, and an autonomous driving vehicle including the system.
Background Art
With the rapid development of 5G communication and Internet of Vehicles technology, autonomous driving technology has become a research hotspot. The core technologies in the field of autonomous driving include intelligent environment perception, automatic navigation and positioning, driving behavior decision-making, and intelligent path planning and control. In autonomous driving technology, object detection on road information (for example, detecting signs, pedestrians, and the like on the road) is a prerequisite for driving behavior decision-making; from the perspective of image processing, object detection needs to analyze and determine the category information and position information of each object in the image.
At present, machine learning methods based on neural networks are generally used for object detection, and the neural network used for object detection needs to be trained before use; training is currently performed mostly on images (image sets) acquired by the camera devices of autonomous vehicles. Among these images there may be images shot at the same place multiple times, or images shot continuously (that is, images related in time or space), so there is great similarity between the images, which constitutes redundant data in the training samples. The redundant data increases the time required for training, and too many identical/similar samples affect how the training handles other samples and may cause over-fitting. In fact, in the process of training a neural network for object detection, only a small number of key frames are needed to achieve a good training effect; generally speaking, a key frame describes the turning point of an object's action or the moment at which the scene switches, and therefore contains richer information. How to obtain key frames is therefore an urgent problem for autonomous driving.
The prior art generally needs to use the correlation between continuous frame images (that is, images related in time and space) to acquire key frames, which not only has high processing redundancy but also narrows the range of image sets from which key frames can be selected; on the other hand, the prior art does not consider the position information of the object when acquiring key frames, and therefore cannot select key frames suitable for predicting the position information of the object to be detected.
Summary of the Invention
To solve the related technical problems, embodiments of the present application provide an image processing system, a method, and an autonomous driving vehicle including the system.
As one aspect of the present application, an image processing system is provided, including a Triplet-architecture convolutional neural network (comprising a first convolutional neural network, a second convolutional neural network, and a third convolutional neural network) and a channel splicing section. For one frame of image, the Triplet-architecture convolutional neural networks are configured to obtain three kinds of information (the image, the object in the image, and the position of the object in the image) and to perform feature extraction on them; the extracted features pass through the channel splicing section to generate an image matrix, which includes the image, the object in the image, and the position information of the object; a feature vector can be obtained by feature extraction on the image matrix, and key frames can then be obtained by clustering and analyzing the feature vectors. The image processing system of the present application can process disordered images (that is, images that are not related in time and/or space) and obtain key frames, thereby solving the problem of excessive redundant information in the key frame acquisition process of the prior art and improving the efficiency and generality of key frame acquisition. On the other hand, the present application fully considers the position information of the object in the image during feature extraction, thus improving the accuracy of key frame acquisition.
In a possible implementation of the first aspect, a hidden layer may be arranged after the channel splicing section to perform feature extraction on the image matrix and obtain the feature vector; the hidden layer can be implemented using neuron layers, and the input layer of the hidden layer is logically connected to the channel splicing section.
In a possible implementation of the first aspect, a fourth convolutional neural network may be arranged after the channel splicing section to perform feature extraction on the image matrix and obtain the feature vector; the input layer of the convolutional neural network is logically connected to the channel splicing section.
The image processing system of the first aspect needs to be trained before use. In a possible implementation, an architecture similar to an autoencoder is used for training; an autoencoder is an artificial neural network that can learn an efficient representation of input data through unsupervised learning. The autoencoder in this application further includes a channel separation section logically connected to the output layer of the hidden layer or of the fourth convolutional neural network; the channel separation section is configured to perform channel separation on the output of the hidden layer or of the convolutional neural network, the separated channels including: an image channel, an object channel, and an object position information channel. The image channel, the object channel, and the object position information channel are respectively logically connected to the inputs of a fifth convolutional neural network, a sixth convolutional neural network, and a seventh convolutional neural network; the fifth, sixth, and seventh convolutional neural networks are respectively used to extract image features, object features, and object position information features, and to use these features to reconstruct the image, the object in the image, and the position information of the object. The first, second, and third convolutional neural networks belong to the encoding end of the autoencoder, and the fifth, sixth, and seventh convolutional neural networks belong to the decoding end of the autoencoder.
结合第一方面的一种可能实施方式,第一卷积神经网络、第二卷积神经网络、第三卷积神经网络可以包括降采样层。降采样层可以减少数据处理所需的计算量,并且防止过拟合现象。可以通过例如池化层(包括最大值采样、最小值采样、平均值采样)来实现降采样。在编码端使用了降采样层的情况下,解码端的第五卷积神经网络、第六卷积神经网络、第七卷积神经网络可以包括升采样层,升采样可以恢复数据维度以实现对输入信息的重建。
结合第一方面的一种可能实施方式,隐层可以包括偶数层神经元层,由于编码端和解码端是对称的结构,使用偶数层的隐层可以更有利于实现在编码端和解码端的(神经元)权重一致。因此,可以使用例如两层或者四层神经元层的隐层。
结合第一方面的一种可能实施方式,可以使用卷积神经网络来替换隐层,卷积神经网络可以采用通用的架构,基于和上述偶数层隐层的选择相类似的理由,卷积神经网络可以包括偶数层的卷积层,例如可以使用两层或者四层卷积层的卷积神经网络。
第二方面,本申请还提供一种图像处理方法,可以由例如但不限于训练好的第一方面的图像处理系统来执行,对待处理的图像,获取图像特征、图像中的对象特征、对象的位置信息特征;并融合所述图像特征、所述对象特征和所述对象的位置信息特征以得到图像矩阵。从图像矩阵中获取包括所述图像特征、所述对象特征和所述对象的位置信息特征的特征向量。
结合第二方面的一种可能实施方式,对特征向量进行聚类以得到聚类结果。聚类可以使用例如K均值聚类(K-means)或质心最小化簇中点距离聚类等方法。依据所述聚类结果得到多个聚类类别,多个聚类类别中的每一个包括至少一个图像,对多个聚类类别按照设定规则进行排序,对多个聚类类别中的每一个选取排序完成后的第一个图像作为关键帧,关键帧作为对象识别算法的训练材料。
第三方面,提供一种自动驾驶车辆,其包括前述第一方面的图像处理系统。
第四方面,提供一种自动驾驶车辆,其配置为与云端通信连接,在云端设置有前述第一方面的图像处理系统,自动驾驶车辆获取的图像被传输至云端的图像处理系统,图像处理系统对图像进行处理以获取其中的关键帧。
第五方面,提供一种自动驾驶辅助系统,其包括前述第一方面的图像处理系统。
第六方面,提供一种自动驾驶辅助系统,其配置为与云端通信连接,在云端设置有前述第一方面的图像处理系统,自动驾驶辅助系统获取的图像被传输至云端图像处理系统,图像处理系统对图像进行处理以获取其中的关键帧。
第七方面,提供一种神经网络处理器,神经网络处理器配置为执行前述第二方面的图像处理方法。
第八方面,提供一种自编码器,包括:编码端、解码端、设置在编码端和解码端之间的隐层,编码端包括:第一神经网络,第一神经网络包括至少一个神经元层,第一神经网络配置为对图像进行特征提取;第二神经网络,第二神经网络包括至少一个神经元层,第二神经网络配置为对所述图像中的对象进行特征提取;第三神经网络,第三神经网络包括至少一个神经元层,第三神经网络配置为对所述图像中的对象的位置信息进行特征提取;通道拼接部,通道拼接部与所述第一神经网络、所述第二神经网络、所述第三神经网络的输出层逻辑连接,所述通道拼接部配置为接收第一神经网络、所述第二神经网络、所述第三神经网络的输出并基于接收的输出生成图像矩阵;隐层,所述隐层包括至少一个神经元层,所述隐层的输入层和所述通道拼接部逻辑连接,所述隐层配置为对所述图像矩阵进行特征提取;解码端包括:通道分离部,通道分离部与隐层的输出层逻辑连接,通道分离部配置为将隐层的输出进行通道分离,通道分离包括:图像通道、待检测对象通道和待检测对象位置信息通道;第四神经网络,第四神经网络包括至少一个神经元层,第四神经网络配置为与所述图像通道逻辑连接并获取图像特征;第五神经网络,第五神经网络包括至少一个神经元层,所述第五神经网络配置为与所述待检测对象通道逻辑连接并获取待检测对象特征;第六神经网络,第六神经网络包括至少一个神经元层,第六神经网络配置为与所述待检测对象位置信息通道逻辑连接并获取待检测对象位置信息特征。第八方面的自编码器给出了一种通用的图像处理系统的架构,对第八方面的自编码器进行合适的改动,即可得到适合不同场景的图像处理系统,例如可以将第八方面的自编码器的第一至第六神经网络更换为第一至第六卷积神经网络,或者进一步地将第八方面的自编码器的隐层更换为卷积神经网络,即可得到第一方面的各个技术方案。
本申请的各种实施例提供了一种图像处理系统、方法以及包括该系统的自动驾驶车辆,本申请的图像处理系统采用Triplet架构。对一帧图像而言,本申请的图像处理系统/方法可以同时获取包括图像特征、图像中的对象特征以及对象的位置信息特征,并基于这些特征信息获取特征向量,基于对特征向量的聚类和分析即可获得关键帧图像。本申请的系统/方法,对于所处理的图像没有时间上连续或者空间上关联的要求,即本申请的系统/方法可以对任意的图像(集)进行处理并获取其中的关键帧,因此本申请的系统/方法减少了冗余信息的处理,提升了关键帧获取的效率。另一方面,本申请在特征提取的过程中充分考虑了对象在图像中的位置信息,因此提升了关键帧获取的准确度。另外,本申请还提供了一种图像处理方法,一种神经网络处理器,以及一种自编码器架构。
附图说明
图1是本申请实施例提供的一种图像、图像中的对象、对象的位置信息示意图;
图2-1是本申请实施例提供的图像处理系统的示意图;
图2-2是本申请实施例提供的图像处理系统的示意图;
图3是本申请实施例提供的卷积神经网络的架构示意图;
图4是本申请实施例提供的图像处理系统的编码端和解码端共享权重的示意图;
图5是本申请实施例提供的图像处理系统进行训练的示意图;
图6-1是本申请实施例提供的训练完成的图像处理系统进行特征提取的示意图;
图6-2是本申请实施例提供的训练完成的图像处理系统进行特征提取的示意图;
图7是本申请实施例提供的自编码器的示意图;
图8是本申请实施例提供的图像处理方法的流程示意图;
图9是本申请实施例提供的从图像集合中获取关键帧的示意图;
图10是本申请实施例提供的自动驾驶车辆的示意图;
图11是本申请实施例提供的一种图像处理系统的架构示意图;
图12是本申请实施例提供的一种神经网络处理器的架构示意图。
具体实施方式
本申请各种实施例提供了一种图像处理系统、方法以及采用该系统的自动驾驶车辆。本申请实施例的图像处理系统,在对图像进行特征提取时,考虑了图像、图像中的对象、对象的位置信息这三种不同的信息(三元信息),并基于上述三种不同的信息(三元信息)设计了Triplet(三重)型架构的编码器-解码器神经网络结构,从而在对图像进行信息获取的时候同时获取了对象信息以及对象的位置信息,因此可以更加准确地基于对象的位置信息预测获取关键帧。另一方面,本申请实施例的方案,不仅可以用于传统的连续帧图像集中的关键帧获取,也可以直接对无序图像集进行关键帧获取,而无须使用时间和/或空间上关联的图像集,即降低了处理时的冗余程度,提升了关键帧获取的效率,也扩展了关键帧获取的可选择图像集范围。
参见图1,其示出了本申请实施例的三元信息100的示意,其中11为一帧图像,在该帧图像中包括有对象111,对象111为一个指示牌(“前方学校,车辆慢行”),12为单独分离出的对象(即11中的指示牌),13为对象111在11中的位置信息。
在图1中示出了一帧图像和其中的一个对象,应当理解的是,这仅是示意说明,在一帧图像中也可以包括多个对象,而对象可以是物体,也可以是动物,也可以是人。
在一些实施例中,对于对象的确定可以采用人工的方式,例如众包;也可以采用通用的对象分割、语义分割的机器学习方法来自动实现,本申请对此不做限定。
在一些实施例中,待检测对象的位置信息由待检测对象的像素在图像中的位置的X、Y通道值所确定,以图1为例,13中的数据指示了11中的指示牌111的X、Y通道值。
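作为示意,下面给出一个构造对象位置信息X、Y通道的简化Python示例(其中假设对象区域由一个边界框box=(x0, y0, x1, y1)给出,坐标的归一化方式也仅为一种示例,并非对本申请方案的限定):

```python
import numpy as np

def position_channels(h, w, box):
    """根据对象在图像中的像素位置构造X、Y两个通道:
    对象区域内填入归一化的X、Y坐标值,其余位置为0(示意实现)。"""
    x0, y0, x1, y1 = box
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    mask = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
    x_channel = np.where(mask, xs / w, 0.0)
    y_channel = np.where(mask, ys / h, 0.0)
    return np.stack([x_channel, y_channel], axis=0)  # 形状: (2, h, w)

pos = position_channels(128, 128, (40, 32, 88, 96))  # 示例:128×128图像中某对象的位置信息通道
```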
参见图2-1,其示出了基于一些实施例的图像处理系统210的示意图,图像处理系统210主要包括编码端211和解码端212,在编码端211和解码端212之间设置有隐层24,编码端211,解码端212和隐层24整体上构成自编码器架构。
编码端211包括三个卷积神经网络21、22、23和通道拼接部28,通道拼接部28的输入分别和卷积神经网络21、22、23的输出相逻辑连接,解码端212包括三个卷积神经网络25、26、27和通道分离部29,通道分离部29的输出和卷积神经网络25、26、27的输入相逻辑连接。
隐层24的输入和通道拼接部28的输出相逻辑连接,隐层24的输出和通道分离部29的输入相逻辑连接。
在一些实施例中,隐层24例如可以包括偶数层全连接的神经元层,由于编码端和解码端是对称的结构,使用偶数层的隐层可以更有利于实现在编码端和解码端的神经元的权重一致。在一些实施例中,隐层包括两层神经元层,在另外一些实施例中,隐层包括四层神经元层,神经元层之间可以采用全连接。
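作为示意,下面给出隐层的一个简化Python(PyTorch)示例:两层全连接的神经元层将通道拼接得到的图像矩阵展平后映射为一维特征向量(其中的通道数、空间尺寸和向量维度均为假设的示例参数):

```python
import torch
import torch.nn as nn

# 隐层的示意实现:两层全连接的神经元层,输入为通道拼接后的图像矩阵
hidden = nn.Sequential(
    nn.Flatten(),                                 # 将图像矩阵展平
    nn.Linear(48 * 64 * 64, 512), nn.ReLU(),      # 第一层神经元层
    nn.Linear(512, 128),                          # 第二层神经元层,输出一维特征向量
)
feature_vector = hidden(torch.randn(1, 48, 64, 64))  # 输出形状: (1, 128)
```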
在另一些实施例中,参见图2-2,可以使用卷积神经网络来替换隐层24以得到图像处理系统220。卷积神经网络可以采用通用的架构,例如(但不限于)卷积-池化-卷积-池化-全连接的架构,基于和上述隐层选用偶数层相类似的理由,卷积神经网络可以包括偶数层的卷积层。
在一些实施例中,编码端211和解码端212的卷积神经网络可以采用通用的架构设置,参见图3,其示出了图像处理系统中的一个卷积神经网络300的架构示意:图3中示意的卷积神经网络300包括三个模块,每个模块均包括卷积层31和池化层33,在卷积层和池化层之间有激活函数(层)32,在三个模块的最后,设置有全连接层34作为输出层。
卷积层(Convolution Layer)对输入的(图像)数据进行卷积运算,卷积运算相当于图像处理中的滤波器运算,即以设定大小的滤波器按步长对图像进行乘积累加运算,通过卷积运算,可以提取出图像中的特征部分。
池化层(Pooling Layer)用于缩小数据在高、长方向上的空间尺寸,池化一般包括最大值池化、最小值池化和平均值池化。池化可以减少数据规模,并使网络对输入数据的微小变化具有鲁棒性/不变性。
在一些实施例中,激活函数可以采用机器学习领域习知的ReLU、Sigmoid、Tanh、Maxout等函数。
应当理解的是,图3所示出的卷积神经网络的架构示例仅仅是一种可能的设置方式,本领域技术人员可以依据实际需要改变卷积层和/或池化层的数目而不会背离本申请的精神。在本申请中,为了深度地提取图像中的特征,一般采用三层以上的卷积层。而当卷积层数目较多时(例如:大于5层),优选使用ReLU函数作为激活函数。
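作为示意,下面给出一个与图3所示架构相对应的简化Python(PyTorch)实现,包含三个"卷积-激活-池化"模块和一个全连接输出层(其中的通道数、卷积核大小、输入尺寸等均为假设的示例参数):

```python
import torch
import torch.nn as nn

# 与图3架构对应的示意实现:三个"卷积层31-激活函数32-池化层33"模块,最后为全连接层34
cnn300 = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 模块1
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 模块2
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 模块3
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 128),  # 全连接输出层,假设输入图像为128×128
)
features = cnn300(torch.randn(1, 3, 128, 128))  # 输出形状: (1, 128)
```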
应当理解的是,可以在编码端和解码端使用架构完全相同的卷积神经网络(例如图3所示出的架构),也可以使用不同的卷积神经网络的架构,本申请对此不做限定。
在一些实施例中,参见图4,当在编码端和解码端使用完全相同的卷积神经网络的架构时,在编码端和解码端的卷积神经网络之间可以共享神经元权重(图4中的虚线示意),通过共享权重,可以降低卷积神经网络的参数数量,提升运算效率。在图4所示实施例,编码端包括三个完全相同的卷积神经网络41、42和43;解码端也包括三个完全相同的卷积神经网络45、46和47。编码端的通道拼接部48的输出与隐层44的输入相逻辑连接,隐层44的输出和解码端的通道分离部49的输入相逻辑连接。
在实施例中,图像处理系统在对图像进行特征提取以获取关键帧之前需要进行训练,训练的过程介绍如下:
参见图5,其示出了和图2-2示例基本一致的网络架构。图5编码端的三个卷积神经网络51、52、53被配置为分别对一个具体图像帧的三元信息(图像,图像中的对象,对象的位置信息)进行处理以提取特征信息。
在一些实施例中,为了减少数据处理所需的计算量,并且防止过拟合现象,可以在编码端的卷积神经网络51、52、53中采用降采样(Subsampling),具体而言,可以通过例如池化层来实现降采样,池化层可以使用最大值池化、最小值池化或者平均值池化。也可以通过调节卷积步幅(Stride)使其卷积步幅大于一来实现降采样。
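下面用一个简短的Python(PyTorch)示例说明上述两种降采样方式,即池化和步幅大于一的卷积(其中的通道数和特征图尺寸仅为假设的示例参数):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)
# 方式一:通过池化层实现降采样(此处以最大值池化为例)
down_pool = nn.MaxPool2d(kernel_size=2)(x)                  # 形状: (1, 16, 32, 32)
# 方式二:通过卷积步幅(Stride)大于一实现降采样
down_stride = nn.Conv2d(16, 16, 3, stride=2, padding=1)(x)  # 形状: (1, 16, 32, 32)
```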
经过上述编码端的三个卷积神经网络51、52、53的处理,可以分别在上述三个神经网络的输出层获取到图像特征信息、图像中的对象特征信息以及图像中的对象位置特征信息。然后将上述三种信息经由通道拼接部58进行通道拼接以获取一个图像矩阵,该图像矩阵中包括了上述三种特征信息,即图像特征信息、图像中的对象特征信息以及图像中的对象位置特征信息。
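作为示意,下面给出编码端三个卷积神经网络及通道拼接部的一个简化Python(PyTorch)示例(每个分支仅含一个"卷积-激活-池化"模块,各通道数均为假设的示例参数,实际实现可按图3采用更深的网络):

```python
import torch
import torch.nn as nn

class TripletEncoder(nn.Module):
    """示意性的三重(Triplet)编码端:三个卷积分支分别对图像、图像中的对象、
    对象的位置信息进行特征提取,输出在通道维度上拼接得到图像矩阵。"""
    def __init__(self, out_channels=16):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_channels, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),  # 降采样
            )
        self.image_branch = branch(3)     # 图像(RGB三通道)
        self.object_branch = branch(3)    # 图像中的对象
        self.position_branch = branch(2)  # 对象位置信息(X、Y两个通道)

    def forward(self, image, obj, pos):
        f_img = self.image_branch(image)
        f_obj = self.object_branch(obj)
        f_pos = self.position_branch(pos)
        # 通道拼接部:在通道维度(dim=1)上拼接,得到包含三元信息的图像矩阵
        return torch.cat([f_img, f_obj, f_pos], dim=1)

encoder = TripletEncoder()
matrix = encoder(torch.randn(1, 3, 128, 128),   # 图像
                 torch.randn(1, 3, 128, 128),   # 图像中的对象
                 torch.randn(1, 2, 128, 128))   # 对象位置信息
# matrix形状: (1, 48, 64, 64)
```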
将图像矩阵输入到位于编码端和解码端之间的卷积神经网络54,进行特征提取,然后将获取的特征经由通道分离部59进行通道分离后分别输入到解码端的三个卷积神经网络55、56、57,对图像、图像中的对象、图像中的对象位置信息进行重建。由于在实施例中,在编码端进行了降采样,数据被降维,而在解码端(Decoder)的卷积神经网络55、56、57中进行了升采样(Upsampling)过程以恢复数据维度,在一些实施例中,升采样可以使用双线性插值来实现。
利用经由解码端的三个卷积神经网络55、56、57所获取的特征重建图像、图像中的对象、对象的位置信息,并将重建的图像、图像中的对象、对象的位置信息和输入端(编码端)的图像、图像中的对象、对象的位置信息进行比较(学习),使用误差反向传播(BP)法来训练解码端和编码端的神经元的权重。应当理解的是,应当使用足够数量的、不同的图像帧来对图像处理系统进行训练,以使得图像处理系统的编码端和解码端的神经元的权重被训练到收敛。基于编码-解码的过程,编码端和解码端可以学习到对三元图像信息进行特征提取和表达。
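作为示意,下面给出基于重建误差和误差反向传播对上述自编码器架构进行训练的一个简化Python(PyTorch)训练循环(其中model和dataloader均为假设的对象:model表示包含编码端、中间的卷积神经网络54与解码端的整体网络,前向返回三元信息的重建结果;dataloader逐批产出图像、对象、位置信息三元组):

```python
import torch
import torch.nn as nn

def train(model, dataloader, epochs=10, lr=1e-3):
    """以三元信息的重建误差为损失,使用误差反向传播(BP)更新编码端与解码端的权重。"""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, obj, pos in dataloader:
            rec_img, rec_obj, rec_pos = model(image, obj, pos)   # 重建三元信息
            loss = (criterion(rec_img, image)
                    + criterion(rec_obj, obj)
                    + criterion(rec_pos, pos))
            optimizer.zero_grad()
            loss.backward()    # 误差反向传播
            optimizer.step()
    return model
```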
当对图像处理系统训练完成后,即可使用编码端(即特征提取端)对待处理的图像进行特征提取。应当理解的是,本申请的图像处理系统可以对无序图像,即时间和/或空间上无关联的图像进行特征提取并利用上述获取的特征进行关键帧选取。
应当理解的是,上述训练过程同样适用于例如但不限于图2-1所示的网络架构,区别仅在于图像矩阵在位于编码端和解码端之间的隐层进行特征提取,然后将获取的特征进行通道分离后分别输入到解码端的三个卷积神经网络,对图像、图像中的对象、图像中的对象位置信息进行重建。
在图像处理系统训练完成后,即可进行图像的特征提取并基于特征提取来确定关键帧,下面描述使用训练完的图像处理系统进行特征提取的过程:
参见图6-1,其示出了实施例提供的一种图像处理系统在训练完成后,对图像进行特征提取的示意图。
将图像、图像中的对象以及对象的位置信息分别输入到三个卷积神经网络611、612、613中,分别进行特征提取。在实施例中,特征提取的过程中同样可以使用降采样,降采样可以使用例如池化或者调节卷积步幅(Stride)使其大于一来实现。应当理解的是,也可以直接使用不带降采样的卷积神经网络来进行图像特征提取而不违背本申请的精神。
图像特征提取完成后,经过通道拼接部614进行通道拼接获得包括图像特征信息、图像中的对象特征信息以及图像中的对象位置特征信息的图像矩阵,然后使用隐层615对图像矩阵进行特征提取,最终获得以一维向量形式表示的特征向量,对于每个特征向量而言,其中均包括了三种特征信息:即图像的信息、图像中的对象信息、对象的位置信息。
参见图6-2,其示出了实施例提供的一种图像处理系统训练完成后,对图像进行特征提取的示意图。
将图像、图像中的对象以及对象的位置信息分别输入到三个卷积神经网络621、622、623中,分别进行特征提取,在实施例中,特征提取的过程中同样使用降采样,降采样可以使用例如池化或者调节卷积步幅(Stride)使其大于一来实现。应当理解的是,也可以直接使用不带降采样的卷积神经网络来进行图像特征提取而不违背本申请的精神。
图像特征提取完成后,经过通道拼接部624进行通道拼接获得包括图像特征信息、图像中的对象特征信息以及图像中的对象位置特征信息的图像矩阵,然后使用卷积神经网络625对图像矩阵进行特征提取,最终获得以一维向量形式表示的特征向量,对于每个特征向量而言,其中均包括了三种特征信息:即图像的信息、图像中的对象信息、对象的位置信息。
在一些实施例中,对所获取的特征向量进行聚类,具体地,可以使用机器学习领域习知的聚类方法,例如K均值聚类法或质心最小化簇中点距离聚类法对特征向量进行聚类,统计不同图像中包含的目标类别以及各个类别的对象数量,生成下表1的结构:
表1:特征聚类结果与图像对应关系对应表
聚类类别 | 1 | 2 | 3 | 4 | ........ | 类别数量
图像1   | 1 | 1 | 1 | 0 | ........ | 3
图像2   | 2 | 0 | 1 | 4 | ........ | 3
图像3   | 0 | 1 | 2 | 0 | ........ | 2
图像4   | 0 | 0 | 0 | 0 | ........ | 0
对象数量 | 3 | 2 | 4 | 4 | ........ |
表2:第一次排序后结果
聚类类别 | 2 | 1 | 3 | 4 | ........ | 类别数量
图像1   | 1 | 1 | 1 | 0 | ........ | 3
图像3   | 1 | 0 | 2 | 0 | ........ | 2
图像2   | 0 | 2 | 1 | 4 | ........ | 3
图像4   | 0 | 0 | 0 | 0 | ........ | 0
对象数量 | 2 | 3 | 4 | 4 | ........ |
应当理解的是,聚类的类别可以基于实际需求而设置,上表中仅仅出于示例性说明而示出了四个聚类的类别和四个图像,实际上,对于图像、图像中的对象、对象的位置信息中的每一个都可能有多个分类类别,总的聚类类别数可能在几百至上千个,同样地,图像数也可以有几百至上千个。
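作为示意,下面给出利用K均值聚类对特征向量进行聚类并统计"图像×聚类类别"对象数量(即表1形式)的一个简化Python示例(其中features假设为各对象特征向量组成的数组,image_ids给出每个特征向量所属的图像编号,聚类类别数n_clusters仅为示例参数):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_table(features, image_ids, n_clusters=4):
    """对特征向量做K均值聚类,并统计每幅图像中各聚类类别的对象数量。"""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    images = sorted(set(image_ids))
    table = np.zeros((len(images), n_clusters), dtype=int)
    for img, lab in zip(image_ids, labels):
        table[images.index(img), lab] += 1
    return images, table  # table[i, j]: 图像i中属于聚类类别j的对象数量
```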
在聚类完成后,即可基于聚类结果进行关键帧选取,具体步骤为:
(1)设图像集合为U,关键帧集合为V
(2)按照对象数量对类别进行升序排序,得到排序后的类别集合S;
(3)以集合S中排序后的类别顺序作为"主键-次键"划分的依据,对图像进行排序,排序规则为:基于主键类别的对象数量进行降序排序,主键类别的对象数量相同时,以次键类别的对象数量降序排序;参见表2,其示出了对表1进行了第一次排序后的结果;
(4)选择步骤(3)所得排序表(表2)中第一项对应的图像作为关键帧,将其从集合U移入集合V中,同时将该项对应的图像所包含的类别从集合S中剔除;以表2为例,选取图像1代表聚类类别2,这意味着图像1整体上可以"代表"聚类类别2这一簇图像族,作为这一簇图像族的关键帧;
(5)重复步骤(2)-(4),直到集合S或集合U为空。
通过上述关键帧选取流程,即可基于聚类结果确定关键帧。
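作为示意,下面给出上述步骤(1)-(5)的一个简化Python实现(其中假设images为图像编号列表,table[i, j]为图像i中聚类类别j的对象数量,即表1形式的计数矩阵;各类别的对象数量在每一轮中按剩余图像重新统计,这一点属于实现上的一种假设):

```python
import numpy as np

def select_keyframes(images, table):
    """步骤(1)-(5)的示意实现:按类别对象数量升序确定主键-次键,
    对图像按对象数量降序排序,逐轮选出关键帧并剔除其包含的类别。"""
    U = list(range(len(images)))       # (1) 图像集合U(以行索引表示)
    V = []                             # (1) 关键帧集合V
    S = list(range(table.shape[1]))    # 聚类类别集合
    while U and S:
        # (2) 按对象数量对类别进行升序排序
        S.sort(key=lambda c: sum(table[i, c] for i in U))
        # (3) 以排序后的类别作为主键-次键,对图像按对象数量降序排序
        U.sort(key=lambda i: tuple(-table[i, c] for c in S))
        # (4) 选取排序后第一项对应的图像作为关键帧,移入V,并剔除其包含的类别
        first = U.pop(0)
        V.append(images[first])
        S = [c for c in S if table[first, c] == 0]
    # (5) 重复步骤(2)-(4),直到集合S或集合U为空
    return V

images = ["图像1", "图像2", "图像3", "图像4"]
table = np.array([[1, 1, 1, 0],
                  [2, 0, 1, 4],
                  [0, 1, 2, 0],
                  [0, 0, 0, 0]])
print(select_keyframes(images, table))  # 按表1数据,输出: ['图像1', '图像2']
```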
参见图7,其示出了一种自编码器,包括编码端701和解码端702,以及设置在编码端701和解码端702之间的隐层74。
编码端701包括:神经网络71,神经网络72,神经网络73,通道拼接部78。
解码端702包括:神经网络75,神经网络76,神经网络77,通道分离部79。
神经网络71、72、73、75、76、77包括至少一个神经元层。
隐层74包括至少一个神经元层,在一些实施例中,隐层74可以包括偶数个神经元层。
神经网络71、72、73可以被配置为分别获取图像、图像中的对象、对象的位置信息特征,通道拼接部78配置为分别与神经网络71、72、73的输出层逻辑连接,通道拼接部78接收神经网络71、72、73的输出并基于接收的输出生成图像矩阵。
隐层74的输入层和所述通道拼接部逻辑连接,隐层74配置为对所述图像矩阵进行特征提取。
通道分离部79与所述隐层的输出层逻辑连接,通道分离部79配置为将隐层74的输出进行通道分离,通道分离包括:图像通道、对象通道和对象位置信息通道。
神经网络75、76、77可以被配置为分别与图像通道,对象通道,对象位置信息通道逻辑连接并获取图像特征、对象特征和对象位置信息特征。
可以对图7所示出的自编码器做适应性变化,例如可以将神经网络71-73、75-77替换为卷积神经网络,即可得到如图2-1所示的图像处理系统,而如果进一步地将隐层74替换为卷积神经网络,即可得到如图2-2所示的图像处理系统。本领域技术人员可以依据实际的情况来对图7所示出的自编码器进行适应性地调整而不背离本申请的精神。
参见图8,其示出了基于本申请一些实施例的图像处理方法流程,包括:
81,开始;
82,对图像进行特征提取以得到特征向量,基于本申请的实施例,对于一帧图像而言,特征提取可以包括首先进行图像特征、图像中的对象特征、对象的位置信息特征的获取,然后基于图像特征、对象特征和对象位置信息特征得到特征向量;
83,对特征向量进行聚类,基于本申请的实施例,可以使用例如K均值聚类法或者质心最小化簇中点距离聚类法对特征向量进行聚类;
84,依据聚类结果得到关键帧,对聚类结果进行分析,并按照设定规则处理即可得到关键帧;这里的设定规则可以包括例如上述的步骤(1)-(5);
85,结束。
在一些实施例中,参见图10,提供一种自动驾驶车辆1000,其可以包括传感器系统101,控制系统102,驱动系统103等。传感器系统101可以包括例如但不限于定位系统(GPS)、惯导(IMU)、激光雷达(Lidar)、毫米波雷达、相机等。控制系统102可以包括例如但不限于自动驾驶车辆计算平台等系统/装置,控制系统可以包括自动驾驶系统(Autonomous Driving System:简称ADS)104。驱动系统103可以包括例如但不限于引擎、传动装置、电动能量源、线控系统等。传感器系统101、控制系统102、驱动系统103之间可以通信地链接。在一些实施例中,上述各种实施例所描述的图像处理系统可以配置在控制系统的自动驾驶系统104上,其可以在车辆行驶过程中对传感器系统的相机所获取的各种图像帧/流进行处理以获取其中的关键帧;一般情况下,自动驾驶车辆1000行驶一天基于相机收集到的图像帧/流往往要达到几个G甚至几十G的规模,而经过图像处理系统处理后可以从这些图像帧/流中选取到的关键帧集合一般只有几十M的大小,因此使用本申请的技术方案可以显著地消除冗余数据,而这些获取的关键帧可以用于后续的对于目标检测算法的神经网络的训练。参见图9,其给出了实施例提供的自动驾驶车辆1000获取关键帧,从而消除了冗余数据的示意。自动驾驶车辆91在行驶过程中获取的图像包括图9所示的三帧图像:901,902和903;在这三帧图像中均包括了道路和道路上的车辆92,与图像901和902不同的是:在图像903中出现了一个行人93。经过本申请实施例的图像处理系统处理后,可以确定图像903为关键帧,因此可以将903选择出并标记为关键帧,相应地,图像901和902即为冗余的,可以将它们删除。应当理解的是:图9中示例性给出的三帧图像在时间和空间上有一定的关联性,但是本申请技术方案的图像处理系统同样可以对时空上没有关联的无序图像集合进行处理并获取关键帧。
在另外一些实施例中,也可以将本申请技术方案配置在云端,车辆所获取的图像帧/流可以通过通信网络传输到云端,在云端对图像帧/流进行处理以获取关键帧,所获取的关键帧可以用于对于目标检测算法的神经网络的训练。
在另外一些实施例中,提供一种用于自动驾驶车辆的自动驾驶系统(Autonomous Driving System:简称ADS),其可以包括本申请的图像处理系统,图像处理系统对车辆在行驶过程中基于相机所获取的各种图像帧/流进行处理以获取其中的关键帧。在另外一些实施例中,也可以将本申请的图像处理系统配置在云端,自动驾驶辅助系统在车辆行驶过程中获取的图像被传输至云端的图像处理系统,在云端对上述图像帧/流进行处理以获取关键帧,所获取的关键帧可以用于后续对于对象检测算法的神经网络的训练。
在另外一些实施例中,提供一种神经网络处理器(Neural-Network Processing Unit,NPU)。该神经网络处理器可以被设置在例如但不限于如图10所示的控制系统102中,实施例提供的各种图像处理系统的算法均可在该神经网络处理器中得以实现。
图11示出了本申请实施例提供的一种图像处理系统架构1100。
在图11中,数据采集设备116用于采集图像数据。
在采集到图像数据之后,数据采集设备116将这些训练数据存入数据库113,训练设备112基于数据库113中维护的训练数据训练得到目标模型/规则1171(即本申请各种实施例中的自编码器模型)。
在本申请提供的实施例中,该目标模型/规则1171是通过训练自编码器模型得到的。需要说明的是,在实际的应用中,所述数据库113中维护的训练数据不一定都来自于数据采集设备116的采集,也有可能是从其他设备接收得到的。
另外需要说明的是,训练设备112也不一定完全基于数据库113维护的训练数据进行目标模型/规则1171的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。还需要说明的是,数据库113中维护的训练数据中的至少部分数据也可以用于执行设备111对待处理图像进行处理的过程。
根据训练设备112训练得到的目标模型/规则1171可以应用于不同的系统或设备中,如应用于图11所示的执行设备111,所述执行设备111可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。
在图11中,执行设备111配置输入/输出(input/output,I/O)接口1110,用于与外部设备进行数据交互。
预处理模块118和预处理模块119用于对I/O接口1110接收到的输入数据(如待处理图像)进行预处理,在本申请实施例中,也可以没有预处理模块118和预处理模块119(也可以只有其中的一个预处理模块),而直接采用计算模块117对输入数据进行处理。
在执行设备111对输入数据进行预处理,或者在执行设备111的计算模块117执行计算等相关的处理过程中,执行设备111可以调用数据存储系统115中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统115中。
最后,I/O接口1110将处理结果,即上述处理得到的输出图像,返回给客户设备114,从而提供给用户。
值得说明的是,训练设备112可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则1171,该相应的目标模型/规则1171即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
值得注意的是,图11仅是本申请实施例提供的一种图像处理系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图11中,数据存储系统115相对执行设备111是外部存储器,在其它情况下,也可以将数据存储系统115置于执行设备111中。
图12是本申请实施例提供的一种芯片的硬件结构,该芯片包括神经网络处理器120(neural-network processing unit,NPU)。该芯片可以被设置在如图11所示的执行设备111中,用以完成计算模块117的计算工作。该芯片也可以被设置在如图11所示的训练设备112中,用以完成训练设备112的训练工作并输出目标模型/规则1171。
NPU 120作为协处理器挂载到主中央处理器(central processing unit,CPU)上,由主CPU分配任务。NPU 120的核心部分为运算电路123,控制器126控制运算电路123提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路123内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路123是二维脉动阵列。运算电路123还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路123是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路123从权重存储器122中取矩阵B相应的数据,并缓存在运算电路123中每一个PE上。运算电路123从输入存储器1210中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器124(accumulator)中。
向量计算单元129可以对运算电路123的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元129可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元129能将经处理的输出的向量存储到统一存储器127。例如,向量计算单元129可以将非线性函数应用到运算电路123的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元129生成归一化的值、合并值,或二者均有。
在一些实现中,处理过的输出的向量能够用作到运算电路123的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器127用于存放输入数据以及输出数据。通过存储单元访问控制器128(direct memory access controller,DMAC)将外部存储器中的输入数据存入到输入存储器1210和/或统一存储器127,将外部存储器中的权重数据存入权重存储器122,以及将统一存储器127中的数据存入外部存储器。
总线接口单元121(bus interface unit,BIU),用于通过总线实现主CPU、DMAC和取指存储器125之间的交互。
与控制器126连接的取指存储器125(instruction fetch buffer),用于存储控制器126使用的指令。控制器126用于调用取指存储器125中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器127,输入存储器1210,权重存储器122以及取指存储器125均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
本申请的各种实施例提供了一种图像处理系统、方法以及包括该系统的自动驾驶车辆,本申请的图像处理系统采用Triplet架构,对一帧图像而言,本申请的图像处理系统/方法可以同时获取包括图像特征、图像中的对象特征以及对象的位置信息特征,并基于这些特征信息获取特征向量,基于对特征向量的聚类和分析即可获得关键帧图像。本申请的系统/方法,对于所处理的图像没有连续帧的要求,即本申请的系统/方法可以对任意的、无序的图像进行处理并获取其中的关键帧,因此本申请的系统/方法解决了现有技术中关键帧获取过程中需要使用连续帧所导致的冗余信息处理问题,提升了关键帧获取的效率。另一方面,本申请在特征提取的过程中充分考虑了对象在图像中的位置信息,因此提升了关键帧获取的准确度。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方 法、产品或设备固有的其它步骤或单元。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑业务划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各业务单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件业务单元的形式实现。
集成的单元如果以软件业务单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本发明所描述的业务可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些业务存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
以上的具体实施方式,对本发明的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上仅为本发明的具体实施方式而已。
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (26)

  1. 一种图像处理系统,包括:第一卷积神经网络,第二卷积神经网络,第三卷积神经网络和通道拼接部,所述通道拼接部与所述第一卷积神经网络、所述第二卷积神经网络、所述第三卷积神经网络的输出层逻辑连接;
    所述第一卷积神经网络配置为:获取图像并对所述图像进行特征提取;
    所述第二卷积神经网络配置为:获取所述图像中的对象并对所述图像中的对象进行特征提取;
    所述第三卷积神经网络配置为:获取所述图像中的对象的位置信息并对所述图像中的对象的位置信息进行特征提取;
    通道拼接部,所述通道拼接部与所述第一卷积神经网络、所述第二卷积神经网络、所述第三卷积神经网络的输出层逻辑连接,所述通道拼接部配置为接收所述第一卷积神经网络、所述第二卷积神经网络、所述第三卷积神经网络的输出并基于接收的输出生成图像矩阵。
  2. 根据权利要求1所述的图像处理系统,还包括:
    隐层,所述隐层包括至少一个神经元层,所述隐层的输入层和所述通道拼接部逻辑连接,所述隐层配置为对所述图像矩阵进行特征提取。
  3. 根据权利要求2所述的图像处理系统,其中:
    所述隐层包括偶数个全连接的神经元层。
  4. 根据权利要求1所述的图像处理系统,还包括:
    第四卷积神经网络,所述第四卷积神经网络的输入层和所述通道拼接部逻辑连接,所述第四卷积神经网络配置为对所述图像矩阵进行特征提取。
  5. 根据权利要求4所述的图像处理系统,其中:
    所述第四卷积神经网络包括偶数个卷积层。
  6. 根据权利要求3所述的图像处理系统,还包括:
    通道分离部,所述通道分离部与所述隐层的输出层逻辑连接,所述通道分离部配置为将所述隐层的输出特征进行通道分离,所述通道分离包括:图像通道、对象通道和对象位置信息通道。
  7. 根据权利要求5所述的图像处理系统,还包括:
    通道分离部,所述通道分离部与所述第四卷积神经网络的输出层逻辑连接,所述通道分离部配置为将所述第四神经网络的输出进行通道分离,所述通道分离包括:图像通道、对象通道和对象位置信息通道。
  8. 根据权利要求6-7任一所述的图像处理系统,还包括:
    第五卷积神经网络,所述第五卷积神经网络配置为与所述图像通道逻辑连接并提取图像特征;
    第六卷积神经网络,所述第六卷积神经网络配置为与所述对象通道逻辑连接并提取待检测对象特征;
    第七卷积神经网络,所述第七卷积神经网络配置为与所述对象位置信息通道逻辑连接并提取待检测对象位置信息特征。
  9. 根据权利要求8所述的图像处理系统,其中:
    所述第一卷积神经网络、第二卷积神经网络、第三卷积神经网络包括降采样层。
  10. 根据权利要求9所述的图像处理系统,其中:
    所述降采样层为池化层,所述池化层包括最大值池化、最小值池化或者平均值池化中的至少一种。
  11. 根据权利要求10所述的图像处理系统,其中:
    所述降采样层配置为以大于1的步幅执行卷积操作以实现降采样。
  12. 根据权利要求11所述的图像处理系统,其中:
    所述第五卷积神经网络、第六卷积神经网络、第七卷积神经网络均包括升采样层。
  13. 根据权利要求12所述的图像处理系统,其中:
    所述升采样层配置为执行双线性插值以实现升采样。
  14. 根据权利要求13所述的图像处理系统,其中:
    所述第一卷积神经网络、第二卷积神经网络和第三卷积神经网络之间共享权值。
  15. 根据权利要求14所述的图像处理系统,其中:
    所述第五卷积神经网络、第六卷积神经网络和第七卷积神经网络之间共享权值。
  16. 一种图像处理方法,包括:
    提取图像特征;
    提取所述图像中的对象特征;
    提取所述图像中的对象的位置信息特征;
    融合所述图像特征、所述对象特征和所述对象的位置信息特征以得到图像矩阵。
  17. 根据权利要求16所述的图像处理方法,还包括:
    从图像矩阵中提取包括所述图像特征、所述对象特征和所述对象的位置信息特征的特征向量。
  18. 根据权利要求17所述的图像处理方法,其中:
    对所述特征向量进行聚类以得到聚类结果。
  19. 根据权利要求18所述的图像处理方法,其中:
    所述聚类包括K均值聚类(K-means)和质心最小化簇中点聚类。
  20. 根据权利要求19所述的图像处理方法,还包括:
    依据所述聚类结果得到多个聚类类别,所述多个聚类类别中的每一个包括至少一个图像,对所述多个聚类类别按照设定规则进行排序,对所述多个聚类类别中的每一个选取排序完成后的第一个图像作为关键帧,所述关键帧作为对象识别算法的训练材料。
  21. 一种自动驾驶车辆,其包括如权利要求1-15任一所述的图像处理系统。
  22. 一种自动驾驶车辆,其配置为与云端通信连接,在所述云端设置有如权利要求1-15任一所述的图像处理系统,所述自动驾驶车辆获取的图像被传输至所述图像处理系统,所述图像处理系统对所述获取的图像进行处理以获取关键帧。
  23. 一种自动驾驶辅助系统,其包括如权利要求1-15任一所述的图像处理系统。
  24. 一种自动驾驶辅助系统,其配置为与云端通信连接,在所述云端设置有如权利要求1-15任一所述的图像处理系统,所述自动驾驶辅助系统获取的图像被传输至所述图像处理系统,所述图像处理系统对所述自动驾驶辅助系统获取的图像进行处理以获取关键帧。
  25. 一种神经网络处理器,所述神经网络处理器配置为可执行如权利要求16-20任一所述的图像处理方法。
  26. 一种自编码器,包括:
    编码端,所述编码端包括:
    第一神经网络,所述第一神经网络包括至少一个神经元层,所述第一神经网络配置为获取图像并对图像进行特征提取;
    第二神经网络,所述第二神经网络包括至少一个神经元层,所述第二神经网络配置为获取所述图像中的对象并对所述图像中的对象进行特征提取;
    第三神经网络,所述第三神经网络包括至少一个神经元层,所述第三神经网络配置为获取所述图像中的对象的位置信息并对所述图像中的对象的位置信息进行特征提取;
    通道拼接部,所述通道拼接部与所述第一神经网络、所述第二神经网络、所述第三神经网络的输出层逻辑连接,所述通道拼接部配置为接收所述第一神经网络、所述第二神经网络、所述第三神经网络的输出并基于接收的输出生成图像矩阵;
    隐层,所述隐层包括至少一个神经元层,所述隐层的输入层和所述通道拼接部逻辑连接,所述隐层配置为对所述图像矩阵进行特征提取;
    解码端,所述解码端包括:
    通道分离部,所述通道分离部与所述隐层的输出层逻辑连接,所述通道分离部配置为将所述隐层的输出进行通道分离,所述通道分离部包括:图像通道、对象通道和对象位置信息通道;
    第四神经网络,所述第四神经网络包括至少一个神经元层,所述第四神经网络配置为与所述图像通道逻辑连接并提取图像特征;
    第五神经网络,所述第五神经网络包括至少一个神经元层,所述第五神经网络配置为与所述对象通道逻辑连接并提取待检测对象特征;
    第六神经网络,所述第六神经网络包括至少一个神经元层,所述第六神经网络配置为与所述对象位置信息通道逻辑连接并提取对象位置信息特征。
PCT/CN2020/078093 2020-03-06 2020-03-06 一种图像处理系统、方法以及包括该系统的自动驾驶车辆 WO2021174513A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080004424.9A CN112805723B (zh) 2020-03-06 2020-03-06 一种图像处理系统、方法以及包括该系统的自动驾驶车辆
PCT/CN2020/078093 WO2021174513A1 (zh) 2020-03-06 2020-03-06 一种图像处理系统、方法以及包括该系统的自动驾驶车辆

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/078093 WO2021174513A1 (zh) 2020-03-06 2020-03-06 一种图像处理系统、方法以及包括该系统的自动驾驶车辆

Publications (1)

Publication Number Publication Date
WO2021174513A1 true WO2021174513A1 (zh) 2021-09-10

Family

ID=75809241

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078093 WO2021174513A1 (zh) 2020-03-06 2020-03-06 一种图像处理系统、方法以及包括该系统的自动驾驶车辆

Country Status (2)

Country Link
CN (1) CN112805723B (zh)
WO (1) WO2021174513A1 (zh)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034267A (zh) * 2010-11-30 2011-04-27 中国科学院自动化研究所 基于关注度的目标物三维重建方法
CN106529419B (zh) * 2016-10-20 2019-07-26 北京航空航天大学 视频显著性堆栈式聚合的对象自动检测方法
CN109359048A (zh) * 2018-11-02 2019-02-19 北京奇虎科技有限公司 一种生成测试报告的方法、装置及电子设备
CN110096950B (zh) * 2019-03-20 2023-04-07 西北大学 一种基于关键帧的多特征融合行为识别方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038713A (zh) * 2017-04-12 2017-08-11 南京航空航天大学 一种融合光流法和神经网络的运动目标捕捉方法
US20190034762A1 (en) * 2017-07-27 2019-01-31 Toyota Jidosha Kabushiki Kaisha Perception device
CN107609635A (zh) * 2017-08-28 2018-01-19 哈尔滨工业大学深圳研究生院 一种基于物体检测与光流计算的物体物理速度估计方法
CN107610113A (zh) * 2017-09-13 2018-01-19 北京邮电大学 一种图像中基于深度学习的小目标的检测方法及装置
US20190304105A1 (en) * 2018-04-03 2019-10-03 Altumview Systems Inc. High-performance visual object tracking for embedded vision systems
CN109902806A (zh) * 2019-02-26 2019-06-18 清华大学 基于卷积神经网络的噪声图像目标边界框确定方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419471A (zh) * 2022-03-29 2022-04-29 北京云迹科技股份有限公司 一种楼层识别方法、装置、电子设备及存储介质
CN114419471B (zh) * 2022-03-29 2022-08-30 北京云迹科技股份有限公司 一种楼层识别方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN112805723B (zh) 2022-08-09
CN112805723A (zh) 2021-05-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20923324

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20923324

Country of ref document: EP

Kind code of ref document: A1