WO2020244653A1 - Object recognition method and device
- Publication number
- WO2020244653A1 (PCT/CN2020/094803)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- task
- frame
- network
- candidate
- header
- Prior art date
Classifications
- G06V20/647—Three-dimensional objects by matching two-dimensional images to three-dimensional objects
- G06V10/235—Image preprocessing by selection of a specific region containing or referencing a pattern, based on user input or interaction
- G06F18/2413—Classification techniques relating to the classification model, based on distances to training or reference patterns
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, using electronic means
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/758—Image or video pattern matching involving statistics of pixels or of feature values, e.g. histogram matching
- G06V10/764—Image or video recognition or understanding using classification, e.g. of video objects
- G06V10/7715—Feature extraction, e.g. by transforming the feature space; Mappings, e.g. subspace methods
- G06V10/82—Image or video recognition or understanding using neural networks
- G06V30/2504—Character recognition using coarse or fine approaches, e.g. resolution of ambiguities or multiscale approaches
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Definitions
- the backbone network is configured to receive input pictures, perform convolution processing on the input pictures, and output feature maps with different resolutions corresponding to the pictures;
- the ROI-ALIGN module is configured to extract the features of the region where the candidate 2D frame is located from a feature map provided by the backbone network according to the region predicted by the RPN module;
- a task determination module is configured to determine, for each picture, the task to which the picture belongs; each picture is labeled with one or more data types, the labeled data types are a subset of all data types, and one data type corresponds to one task;
- the Header decision module is used to determine the Header to be trained for each picture according to the task to which each picture belongs;
- the device further includes: a data equalization module, configured to perform data equalization on pictures belonging to different tasks.
- Communication interface used to communicate with other devices or communication networks
- the memory is used to store application program codes for executing the above solutions, and the processor controls the execution.
- the processor is configured to execute application program codes stored in the memory.
- each perception task shares the same backbone network, which saves the amount of computation severalfold and improves the computation speed of the perception network model; the network structure is easy to expand, and the 2D detection types can be extended by adding only one or several Headers.
- each parallel Header has independent RPN and RCNN modules and only needs to detect the objects of the task to which it belongs, so that during the training process it can avoid accidentally suppressing unlabeled objects belonging to other tasks.
- Figure 2 is a schematic diagram of a CNN feature extraction model provided by an embodiment of the application.
- FIG. 4 is a schematic diagram of a framework of a perception network application system based on multiple parallel headers provided by an embodiment of the application;
- FIG. 6 is a schematic structural diagram of an ADAS/AD-based perception system based on multiple parallel headers provided by an embodiment of the application;
- FIG. 7 is a schematic diagram of a flow of basic feature generation provided by an embodiment of this application.
- FIG. 9 is a schematic diagram of an Anchor corresponding to another RPN layer object provided by an embodiment of the application.
- FIG. 10 is a schematic diagram of another ROI-ALIGN process provided by an embodiment of this application.
- FIG. 11 is a schematic diagram of the implementation and structure of another RCNN provided by an embodiment of the application.
- FIG. 13 is a schematic diagram of the implementation and structure of another serial header provided by an embodiment of the application.
- FIG. 15 is a schematic diagram of a training method for partially labelled data provided by an embodiment of the application.
- FIG. 16 is a schematic diagram of another training method for partially labelled data provided by an embodiment of this application.
- FIG. 18 is a schematic diagram of another training method for partially labeling data provided by an embodiment of the application.
- FIG. 19 is a schematic diagram of the application of a perception network based on multiple parallel headers provided by an embodiment of this application.
- FIG. 20 is a schematic diagram of the application of a perception network based on multiple parallel headers provided by an embodiment of the application;
- FIG. 21 is a schematic flowchart of a sensing method provided by an embodiment of this application.
- FIG. 22 is a schematic diagram of a 2D detection process provided by an embodiment of this application.
- FIG. 23 is a schematic diagram of a 3D detection process of a terminal device according to an embodiment of the application.
- FIG. 24 is a schematic diagram of a mask prediction process provided by an embodiment of this application.
- FIG. 25 is a schematic diagram of a key point coordinate prediction process provided by an embodiment of the application.
- FIG. 26 is a schematic diagram of a training process of a perception network provided by an embodiment of this application.
- FIG. 27 is a schematic diagram of a realization structure of a sensing network based on multiple parallel Headers provided by an embodiment of the application;
- FIG. 28 is a schematic diagram of a realization structure of a sensing network based on multiple parallel Headers provided by an embodiment of the application;
- FIG. 29 is a diagram of an apparatus for training a multi-task perception network based on partially labeled data according to an embodiment of this application;
- FIG. 30 is a schematic flowchart of an object detection method provided by an embodiment of this application.
- FIG. 31 is a flowchart of training a multi-task perception network based on partial annotation data according to an embodiment of the application.
- the embodiments of the present application are mainly applied in fields such as driving assistance, automatic driving, and mobile phone terminals that need to complete various perception tasks.
- the application system framework of the present invention is shown in Figure 4: a single picture is obtained by extracting frames from a video, and the picture is sent to the Multi-Header perception network of the present invention to obtain the 2D, 3D, Mask, key point and other information of the objects of interest in the picture.
- These detection results are output to a post-processing module for processing; for example, they are sent to the planning control unit in an automatic driving system for decision-making, or they are sent to the beautification algorithm in a mobile phone terminal to obtain beautified pictures.
- the following is a brief introduction to the two application scenarios of ADAS/ADS visual perception system and mobile phone beauty
- Application scenario 1 ADAS/ADS visual perception system
- Application scenario 2 mobile phone beauty function
- the mask and key points of the human body are detected through the perception network provided by the embodiments of the present application, and the corresponding parts of the human body can be zoomed in and out, for example slimming the waist and enhancing the hips, so as to output a beautified picture.
- Application scenario 3 Image classification scenario:
- After obtaining the image to be classified, the object recognition device adopts the object recognition method of the present application to obtain the category of the object in the image to be classified, and can then classify the image to be classified according to that category.
- For photographers, many photos are taken every day, including photos of animals, people, and plants. Using the method of this application, photos can be quickly classified according to their content, for example into photos containing animals, photos containing people, and photos containing plants.
- After the object recognition device acquires the image of a product, it uses the object recognition method of the present application to obtain the category of the product in the image, and then classifies the product according to its category. For the wide variety of commodities in large shopping malls or supermarkets, the object recognition method of the present application can quickly complete the classification of commodities, reducing time and labor costs.
- the method and device provided by the embodiments of the present application can also be used to expand the training database.
- the I/O interface 112 of the execution device 120 can send images processed by the execution device (such as image blocks or images containing objects), together with the object category input by the user, to the database 130 as a pair of training data, so that the training data maintained by the database 130 is richer, thereby providing richer training data for the training work of the training device 130.
- the method for training a CNN feature extraction model involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning.
- Symbolic and formalized intelligent information modeling, extraction, preprocessing, and training are performed on training data (such as the image or image block of the object and the object category in this application) to finally obtain a trained CNN feature extraction model; and in the embodiment of this application, input data (such as the image of the object in this application) is input into the trained CNN feature extraction model to obtain output data (such as the 2D, 3D, Mask, key point and other information of the object of interest in the image obtained in this application).
- a neural network can be composed of neural units.
- a neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs.
- the output of the arithmetic unit can be: h = f(W_1·x_1 + W_2·x_2 + … + W_n·x_n + b), where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
- the output signal of the activation function can be used as the input of the next convolutional layer.
- the activation function can be a sigmoid function.
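- As an illustration of the neural unit described above, the following minimal NumPy sketch (the variable names are illustrative, not from the patent) computes f(Σ W_s·x_s + b) with a sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # x: inputs x_1..x_n, w: weights W_1..W_n, b: bias
    # Output h = f(sum_s W_s * x_s + b)
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neural_unit(x, w, b=0.2))
```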
- a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
- the local receptive field can be a region composed of several neural units.
- Deep Neural Network (DNN), also known as multi-layer neural network.
- the neural network layers inside a DNN can be divided into three categories: input layer, hidden layer, and output layer.
- the first layer is the input layer
- the last layer is the output layer
- the number of layers in the middle are all hidden layers.
- the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1th layer.
- Although a DNN looks complicated, the work of each layer is not complicated.
- Training a deep neural network is also a process of learning a weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by vectors W of many layers).
- A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure.
- the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
- the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
- the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
- a neuron can be connected to only part of the neighboring neurons.
- a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
- Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part; therefore, the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a matrix of random size. During the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
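- To make the sliding-window behaviour of a convolution kernel concrete, here is a minimal NumPy sketch (illustrative only) that slides a single weight matrix over a single-channel image with a configurable stride; the same kernel weights are reused at every position, which is the weight sharing described above:

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Valid convolution of one 2D image with one kernel (no padding)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every location
    return out

image = np.random.rand(8, 8)
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])  # e.g. a vertical-edge extractor
print(conv2d_single(image, edge_kernel, stride=1).shape)  # (6, 6)
```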
- Recurrent Neural Networks (RNN).
- In an ordinary neural network, the layers are fully connected, while the nodes within each layer are not connected to each other.
- Although this ordinary neural network has solved many problems, it is still powerless for many others. For example, to predict the next word of a sentence, you generally need to use the previous words, because the preceding and following words in a sentence are not independent. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
- the specific form is that the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes between the hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
- RNN can process sequence data of any length.
- the training of RNN is the same as the training of traditional CNN or DNN.
- the error back propagation algorithm is also used, but there is a difference: if the RNN is unrolled into a network, the parameters, such as W, are shared, whereas this is not the case with the traditional neural network in the example above.
- the output of each step depends not only on the network of the current step, but also on the state of the network at the previous steps. This learning algorithm is called Backpropagation Through Time (BPTT).
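- The recurrence described above can be sketched as follows (a minimal NumPy example, not the patent's implementation): the hidden state at each step is computed from the current input and the previous hidden state, and the same weight matrices are shared across all time steps.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """xs: sequence of input vectors; returns the hidden state at each step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        # current hidden state depends on the current input AND the previous hidden state
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

xs = [np.random.rand(4) for _ in range(5)]   # a sequence of length 5
W_xh = np.random.rand(3, 4) * 0.1            # weights shared across time steps
W_hh = np.random.rand(3, 3) * 0.1
states = rnn_forward(xs, W_xh, W_hh, b_h=np.zeros(3))
print(len(states), states[-1].shape)         # 5 (3,)
```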
- Convolutional neural networks can use the backpropagation (BP) algorithm to adjust the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagating the input signal to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
- the backpropagation algorithm is a back-propagation process dominated by the error loss, aiming to obtain optimal parameters of the super-resolution model, such as the weight matrices.
- an embodiment of the present application provides a system architecture 110.
- the data collection device 170 is used to collect training data.
- the training data includes the image or image block of the object and the category of the object; the training data is stored in the database 130, and the training device 130 trains based on the training data maintained in the database 130 to obtain the CNN feature extraction model 101 (note: 101 here is the model obtained in the training phase described above, which may be a perception network used for feature extraction, etc.).
- Embodiment 1 will be used to describe in more detail how the training device 130 obtains the CNN feature extraction model 101 based on the training data.
- the CNN feature extraction model 101 can be used to implement the perception network provided by the embodiment of the application; that is, the image or image block to be recognized is input into the CNN feature extraction model 101 after relevant preprocessing, and the 2D, 3D, Mask, key point and other information of the object of interest in the image or image block to be recognized can then be obtained.
- the CNN feature extraction model 101 in the embodiment of the present application may specifically be a CNN convolutional neural network.
- the training data maintained in the database 130 may not all come from the collection of the data collection device 170, and may also be received from other devices.
- the training device 130 does not necessarily perform the training of the CNN feature extraction model 101 entirely based on the training data maintained by the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of this application.
- the CNN feature extraction model 101 obtained by training according to the training device 130 can be applied to different systems or devices, such as the execution device 120 shown in FIG. 1.
- the execution device 120 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, an AR/VR device, or a vehicle-mounted terminal, and may also be a server or a cloud.
- the execution device 120 is configured with an I/O interface 112 for data interaction with external devices.
- the user can input data to the I/O interface 112 through the client device 150.
- the input data described in the embodiment of the present application may include the image, image block, or picture to be recognized.
- the execution device 120 may call data, codes, etc. in the data storage system 160 for corresponding processing, and may also store the data, instructions, etc. obtained by the corresponding processing into the data storage system 160.
- the I/O interface 112 returns the processing result, such as the 2D, 3D, Mask, key point and other information of the object of interest in the image or image block obtained above, to the client device 150, so as to provide it to the user.
- the client device 150 may be a planning control unit in an automatic driving system or a beauty algorithm module in a mobile phone terminal.
- the training device 130 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired result.
- the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112.
- the client device 150 can automatically send input data to the I/O interface 112. If the client device 150 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 150.
- the user can view the result output by the execution device 120 on the client device 150, and the specific presentation form may be a specific manner such as display, sound, and action.
- the client device 150 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data and store it in the database 130 as shown in the figure.
- the I/O interface 112 may also directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112, as shown in the figure, as new sample data in the database 130.
- FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 160 is an external memory relative to the execution device 120. In other cases, the data storage system 160 may also be placed in the execution device 120.
- the CNN feature extraction model 101 is obtained by training with the training device 130.
- the CNN feature extraction model 101 in this embodiment of the application may be a CNN convolutional neural network, or it may be the perception network based on multiple Headers described in the following embodiments.
- a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
- the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms.
- CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the input image.
- a convolutional neural network (CNN) 210 may include an input layer 220, a convolutional layer/pooling layer 230 (the pooling layer is optional), and a neural network layer 230.
- the convolutional layer/pooling layer 230 as shown in FIG. 2 may include layers 221-226. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
- the convolution layer 221 can include many convolution operators.
- the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
- the convolution operator is essentially a weight matrix, and this weight matrix is usually predefined. In the process of performing convolution on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
- the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
- during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (row × column), that is, multiple homogeneous matrices, are applied.
- the output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" mentioned above.
- Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
- the multiple weight matrices have the same size (row × column), the feature maps extracted by the multiple weight matrices of the same size also have the same size, and the multiple extracted feature maps of the same size are then combined to form the output of the convolution operation.
- weight values in these weight matrices need to be obtained through a lot of training in practical applications.
- Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 210 can make correct predictions.
- the initial convolutional layer (such as 221) often extracts more general features, which can also be called low-level features;
- the features extracted by the subsequent convolutional layers (for example, 226) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
- one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
- the only purpose of the pooling layer is to reduce the size of the image space.
- the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
- the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
- the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
- the operators in the pooling layer should also be related to the image size.
- the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
- After processing by the convolutional layer/pooling layer 230, the convolutional neural network 210 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 230 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 210 needs to use the neural network layer 230 to generate one or a group of outputs of the required number of classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers can be obtained based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
- the output layer 240 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
- the convolutional neural network 210 shown in FIG. 2 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
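- A minimal PyTorch-style sketch of the structure described for the convolutional neural network 210 (input, alternating convolutional/pooling layers, then fully connected hidden layers and an output layer). The layer sizes and class count below are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # convolutional layer / pooling layer part (analogous to layers 221-226)
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # neural network layer part: hidden layer + output layer
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),   # hidden layer
            nn.Linear(128, num_classes),               # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = SimpleCNN()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 10])
```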
- FIG. 3 is a hardware structure of a chip provided by an embodiment of the present invention.
- the chip includes a neural network processor 30.
- the chip can be set in the execution device 120 as shown in FIG. 1 to complete the calculation work of the calculation module 111.
- the chip can also be set in the training device 130 as shown in FIG. 1 to complete the training work of the training device 130 and output the target model/rule 101.
- the algorithms of each layer in the convolutional neural network as shown in Figure 2 can be implemented in the chip as shown in Figure 3.
- The neural network processor (NPU) 30 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks to it.
- the core part of the NPU is the arithmetic circuit 303.
- the controller 304 controls the arithmetic circuit 303 to extract data from the memory (weight memory or input memory) and perform calculations.
- the arithmetic circuit 303 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
- the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit.
- the arithmetic circuit fetches the matrix A data and matrix B from the input memory 301 to perform matrix operations, and the partial or final result of the obtained matrix is stored in an accumulator 308.
- the vector calculation unit 307 can perform further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
- the vector calculation unit 307 can be used for network calculations in the non-convolutional/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
- the vector calculation unit 307 can store the processed output vector to the unified buffer 306.
- the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value.
- the vector calculation unit 307 generates a normalized value, a combined value, or both.
- the processed output vector can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer in a neural network.
- the calculation of the perception network provided by the embodiment of the present application may be performed by 303 or 307.
- the unified memory 306 is used to store input data and output data.
- the direct memory access controller (DMAC) 305 transfers the input data in the external memory to the input memory 301 and/or the unified memory 306, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 306 into the external memory.
- the Bus Interface Unit (BIU) 310 is used to implement interaction between the main CPU, the DMAC, and the fetch memory 309 through the bus.
- An instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304;
- the controller 304 is used to call the instructions cached in the memory 309 to control the working process of the computing accelerator.
- the input data here in this application is a picture
- the output data is information such as 2D, 3D, Mask, and key points of the object of interest in the picture.
- the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all on-chip (On-Chip) memories.
- the external memory is a memory external to the NPU.
- the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
- FIG. 5 is a schematic structural diagram of a multi-header perception network provided by an embodiment of this application. As shown in FIG. 5, the perception network includes:
- the backbone network 401 is used to receive input pictures, perform convolution processing on the input pictures, and output feature maps with different resolutions corresponding to the pictures; that is, to output feature maps with different sizes corresponding to the pictures;
- any parallel Header is used to detect the task objects of one task according to the feature maps output by the backbone network, and to output the 2D frames of the areas where the task objects are located and the confidence corresponding to each 2D frame;
- each parallel Header completes the detection of different task objects; the task object is the object that needs to be detected in the task; the higher the confidence, the greater the probability that the object of the corresponding task exists in the 2D box corresponding to that confidence.
- different parallel Headers complete different 2D detection tasks; for example, parallel Header0 completes the detection of vehicles and outputs the 2D box and confidence of Car/Truck/Bus; parallel Header1 completes the detection of people and outputs the 2D box and confidence of Pedestrian/Cyclist/Tricycle; parallel Header2 completes the detection of traffic lights and outputs the 2D box and confidence of Red_TrafficLight/Green_TrafficLight/Yellow_TrafficLight/Black_TrafficLight.
- ROI-ALIGN module used to extract the features of the region where the candidate 2D frame is located from a feature map provided by the backbone network according to the region predicted by the RPN module;
- The serial Keypoint Header, based on the 2D frame provided by the preceding parallel Header (at this point an accurate, compact 2D frame), extracts the features of the region where the 2D frame is located from a feature map of the Backbone through the ROI-ALIGN module, and then regresses the key point coordinates of the object inside the 2D frame through a small network (Keypoint_Header in FIG. 5).
- the data equalization module 2904 is used to perform data equalization on pictures belonging to different tasks.
- the perception network mainly includes 3 parts: Backbone, Parallel Header and Serial Header.
- the serial header is not necessary, and the reason has been described in the foregoing embodiment, and will not be repeated here.
- the 8 parallel Headers complete the 2D detection of the 8 categories in Table 1 at the same time, and a number of serial Headers are connected in series after Header0 to Header2 to further complete 3D/Mask/Keypoint detection. It can be seen from FIG. 6 that the present invention can flexibly add and delete Headers according to the needs of the business, thereby realizing different functional configurations, as illustrated by the sketch below.
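- The following PyTorch-style sketch illustrates only the overall structure: a shared backbone and several parallel Headers, one per 2D detection task, with headers that can be added or removed freely. All module names, layer shapes and header internals are placeholders; the real RPN/ROI-ALIGN/RCNN components are described in the surrounding text.

```python
import torch
import torch.nn as nn

class ParallelHeader(nn.Module):
    """One independent 2D-detection head for a single task (placeholder internals)."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.rpn_like = nn.Conv2d(in_ch, num_classes + 4, kernel_size=1)
    def forward(self, feat):
        # per-location class scores and box offsets for this task only
        return self.rpn_like(feat)

class MultiHeaderNet(nn.Module):
    def __init__(self, task_classes):          # e.g. {"car": 3, "person": 3, "light": 4}
        super().__init__()
        self.backbone = nn.Sequential(          # shared by all tasks
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.headers = nn.ModuleDict(
            {task: ParallelHeader(128, n) for task, n in task_classes.items()}
        )
    def forward(self, images):
        feat = self.backbone(images)
        # each parallel Header detects only the objects of its own task
        return {task: head(feat) for task, head in self.headers.items()}

net = MultiHeaderNet({"car": 3, "person": 3, "light": 4})
outs = net(torch.randn(1, 3, 128, 128))
print({k: v.shape for k, v in outs.items()})
```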
- Backbone can use a variety of existing convolutional network frameworks, such as VGG16, Resnet50, Inception-Net, etc.
- The following uses Resnet18 as the Backbone to illustrate the basic feature generation process, which is shown in FIG. 7.
- C2 has the same resolution as C1; C2 is further processed by the third convolution module (Res18-Conv3) of Resnet18 to generate the feature map C3, which is further downsampled compared to C2 and has twice the number of channels, with a resolution of H/8*W/8*128; finally, C3 is processed by Res18-Conv4 to generate the feature map C4, with a resolution of H/16*W/16*256.
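- A minimal sketch of the multi-resolution feature generation described above, using torchvision's resnet18 purely as an illustration (the patent names Resnet18 only as one possible Backbone); it returns feature maps whose spatial resolution decreases and whose channel count increases, analogous to C2/C3/C4:

```python
import torch
import torchvision

class Resnet18Backbone(torch.nn.Module):
    """Returns multi-resolution feature maps, analogous to C2/C3/C4 in the text."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet18()
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.conv2, self.conv3, self.conv4 = r.layer1, r.layer2, r.layer3

    def forward(self, x):                 # x: N x 3 x H x W
        c2 = self.conv2(self.stem(x))     # ~ H/4  x W/4,  64 channels
        c3 = self.conv3(c2)               # ~ H/8  x W/8,  128 channels
        c4 = self.conv4(c3)               # ~ H/16 x W/16, 256 channels
        return c2, c3, c4

c2, c3, c4 = Resnet18Backbone()(torch.randn(1, 3, 224, 224))
print(c2.shape, c3.shape, c4.shape)
```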
- For parallel Header1, which is responsible for detecting people, the main shape of a person is elongated, so its Anchor can be designed to be elongated; for parallel Header4, which is responsible for detecting traffic signs, the main shape of a traffic sign is square, so its Anchor can be designed to be square.
- the features are extracted from the Backbone's C4 feature map.
- the area of each Proposal on C4 is the dark area indicated by the arrow in the figure. In this area, interpolation and sampling are used to extract features at a fixed resolution. Assuming that the number of Proposals is N and the width and height of the features extracted by ROI-ALIGN are both 14, the size of the features output by ROI-ALIGN is N*14*14*256 (the number of channels of the features extracted by ROI-ALIGN is the same as that of C4, i.e. 256 channels). These features are then sent to the subsequent RCNN module for refinement; an illustrative call is sketched below.
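- The fixed-resolution feature extraction step can be illustrated with torchvision's generic roi_align operator (a library call used here only for illustration, not the patent's own module): given proposal boxes in image coordinates, it crops and resamples the corresponding region of the C4 feature map to 14x14.

```python
import torch
from torchvision.ops import roi_align

c4 = torch.randn(1, 256, 40, 60)          # e.g. an H/16 x W/16 feature map (H=640, W=960)
# N proposals given as (batch_index, x1, y1, x2, y2) in input-image coordinates
proposals = torch.tensor([[0.,  50.,  80., 210., 320.],
                          [0., 400., 100., 520., 260.]])
feats = roi_align(c4, proposals, output_size=(14, 14), spatial_scale=1.0 / 16)
print(feats.shape)   # torch.Size([2, 256, 14, 14]) -> N x 14 x 14 x 256 features
```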
- the ROI-ALIGN module extracts the features of the area where each 2D box is located on C4 according to the accurate 2D boxes provided by the parallel Header. If the number of 2D boxes is M, the feature size output by the ROI-ALIGN module is M*14*14*256. These features are first processed by the fifth convolution module (Res18-Conv5) of Resnet18, giving an output feature size of M*7*7*512, and are then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel of the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the feature of one 2D box.
- through three fully connected layers (FC), the orientation angle of the object in the frame (orientation, an M*1 vector in the figure), the centroid point coordinates (centroid, an M*2 vector in the figure, the two values representing the x/y coordinates of the centroid), and the length, width and height (dimension in the figure) are regressed, as sketched below.
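- A sketch of the serial 3D header pattern described above (the internals are placeholders chosen to mirror the text, not the patent's exact layers): the per-box feature maps are further convolved, globally average-pooled to M x 512 vectors, and three fully connected layers regress orientation, centroid and dimensions.

```python
import torch
import torch.nn as nn

class Serial3DHeader(nn.Module):
    def __init__(self, in_ch=256, feat_dim=512):
        super().__init__()
        self.conv5 = nn.Sequential(                   # stand-in for Res18-Conv5
            nn.Conv2d(in_ch, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)           # Global Avg Pool -> M x 512
        self.fc_orientation = nn.Linear(feat_dim, 1)  # orientation angle, M x 1
        self.fc_centroid = nn.Linear(feat_dim, 2)     # centroid x/y, M x 2
        self.fc_dimension = nn.Linear(feat_dim, 3)    # length/width/height, M x 3

    def forward(self, box_feats):                     # box_feats: M x 256 x 14 x 14
        f = self.pool(self.conv5(box_feats)).flatten(1)
        return self.fc_orientation(f), self.fc_centroid(f), self.fc_dimension(f)

ori, ctr, dim = Serial3DHeader()(torch.randn(4, 256, 14, 14))
print(ori.shape, ctr.shape, dim.shape)   # [4, 1] [4, 2] [4, 3]
```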
- the embodiment of the present application also describes the training process of the perception network in detail.
- Each type of data only needs to label a specific type of object, so that targeted collection can be carried out, instead of marking all objects of interest in each picture, thereby reducing the cost of data collection and labeling.
- using this method to prepare data has very flexible scalability.
- When the detection types need to be extended with a new object, only one or more Headers need to be added, and labeled data of the new object type needs to be provided; only the newly added objects need to be marked on that data.
- In order to train the 3D detection function in Header0, it is necessary to provide independent 3D annotation data and mark the 3D information (centroid point coordinates, orientation angle, length, width, height) of each car on that data set; in order to train the Mask detection function in Header0, it is necessary to provide independent Mask annotation data and mark the mask of each car on that data set. In particular, the Parkingslot detection in Header2 needs to detect key points; this task requires the data set to also mark the 2D boxes and key points of the parking spaces (in fact, only the key points need to be marked, since the 2D box of a parking space can be automatically generated from the coordinates of its key points).
- This label determines which headers in the network can be used to train the picture, which will be described in detail in the subsequent training process.
- the expansion methods include but are not limited to copy expansion. The balanced data is randomly shuffled and then sent to the network for training, as shown in FIG. 15.
- the embodiment of the present invention proposes a high-performance, extensible perception network based on multiple Headers. Each perception task shares the same backbone network, which saves the amount of computation and the number of network parameters severalfold. Table 3 shows the computation and parameter statistics for a Single-Header network implementing a single function.
- Table 4 shows the calculation amount and parameter amount for implementing all the functions of this embodiment using a Multi-Header network.
- the region where the task object exists can be predicted on one or more feature maps provided by the backbone network to obtain a candidate region, and a candidate 2D frame matching the candidate region is output; the template frame (Anchor) is obtained based on the statistical features of the task object to which it belongs, and the statistical features include the shape and size of the object.
- the 2D candidate frame serves as the 2D frame of the region.
- the aforementioned 2D frame may be a rectangular frame.
- S3004: Based on the 2D frame of the object of the task, extract the features of the area where the 2D frame is located from one or more feature maps on the backbone network, and, based on the features of the area where the 2D frame is located, predict the 3D information, Mask information, or Keypoint information of the object of the task to which the 2D frame belongs.
- the RPN module completes the detection of the area where a large object is located on a low-resolution feature map, and completes the detection of the area where a small object is located on a high-resolution feature map.
- an embodiment of the present invention also provides a method for training a multi-task perception network based on partial annotation data, and the method includes:
- S3101: Determine the task to which each picture belongs according to the labeled data type of each picture; each picture is labeled with one or more data types, the labeled data types are a subset of all data types, and one data type corresponds to one task;
- S3102 Determine the Header to be trained for each picture according to the task to which each picture belongs;
- S3104: For each picture, perform gradient backpropagation through the Header to be trained, and adjust the parameters of the Header to be trained and of the backbone network based on the loss value.
- S31020 Perform data equalization on pictures belonging to different tasks.
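- The training steps S3101-S3104 can be sketched as the loop below (pseudocode-level PyTorch; the dataset format, the loss functions, and the `MultiHeaderNet` from the earlier sketch are illustrative assumptions, not the patent's implementation): each picture's labeled data types determine which Headers are trained, only the losses of those Headers are computed, and the gradient flows back through those Headers and the shared backbone.

```python
import torch

def train_partially_labeled(net, loader, optimizer, loss_fns):
    """loader yields (image, targets) where targets maps task name -> labels
    for the tasks actually annotated on that picture (a subset of all tasks)."""
    for image, targets in loader:
        tasks = list(targets.keys())          # S3101: tasks this picture belongs to
        outputs = net(image)                  # shared backbone + all parallel Headers
        # S3102: only the Headers of the tasks this picture belongs to are trained,
        # so only their losses are computed
        loss = sum(loss_fns[t](outputs[t], targets[t]) for t in tasks)
        optimizer.zero_grad()
        loss.backward()                       # S3104: gradients flow only through the
        optimizer.step()                      # trained Headers and the backbone
```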
- in step S210, the picture is input to the network;
- in step S220, the "basic feature generation" process is entered.
- each task has an independent "2D detection” process and optional "3D detection” process, "Mask detection” process and "Keypoint detection” process.
- the core process is described below.
- the "2D detection” process predicts the 2D box and confidence level of each task based on the feature map generated by the "basic feature generation” process. Specifically, the "2D detection” process can be further subdivided into the “2D candidate region prediction” process, the “2D candidate region feature extraction” process, and the “2D candidate region fine classification” process, as shown in FIG. 22.
- the "2D candidate region fine classification" process is implemented by the RCNN module in FIG. 5, which uses a neural network to further predict, from the features of each Proposal, the confidence that the Proposal belongs to each category, and at the same time adjusts the coordinates of the Proposal's 2D box to output a more compact 2D frame.
- the "3D detection” process predicts the 3D information such as the coordinates of the centroid point, the orientation angle, and the length, width, and height of the object inside each 2D frame based on the 2D frame provided by the "2D detection” process and the feature map generated by the "basic feature generation” process.
- “3D detection” consists of two sub-processes, as shown in Figure 23.
- the "2D candidate region feature extraction" process is implemented by the ROI-ALIGN module in FIG. 5. According to the coordinates of the 2D frames, the features of the region where each 2D frame is located are extracted from a feature map provided by the "basic feature generation" process and resized to a fixed size, giving the features of each 2D box.
- the "3D centroid/orientation/length-width-height prediction" process is implemented by the 3D_Header in FIG. 5. It mainly regresses 3D information such as the centroid point coordinates, orientation angle, length, width, and height of the object inside each 2D box, based on the features of that 2D box.
- the "Mask Detection” process predicts the detailed mask of the objects inside each 2D frame based on the 2D frame provided by the "2D Detection” process and the feature map generated by the "Basic Feature Generation” process. Specifically, “Mask detection” consists of two sub-processes, as shown in Figure 24.
- the "2D candidate region feature extraction" process is implemented by the ROI-ALIGN module in FIG. 5. According to the coordinates of the 2D frames, the features of the region where each 2D frame is located are extracted from a feature map provided by the "basic feature generation" process and resized to a fixed size, giving the features of each 2D box.
- the "mask prediction" process is implemented by the Mask_Header in FIG. 5, which mainly regresses the mask of the object inside each 2D frame based on the features of that 2D frame.
- the "Keypoint prediction" process predicts the key point coordinates of the objects inside each 2D frame based on the 2D frame provided by the "2D detection" process and the feature map generated by the "basic feature generation" process. Specifically, "Keypoint prediction" consists of two sub-processes, as shown in FIG. 25.
- the "2D candidate region feature extraction" process is implemented by the ROI-ALIGN module in FIG. 5. According to the coordinates of the 2D frames, the features of the region where each 2D frame is located are extracted from a feature map provided by the "basic feature generation" process and resized to a fixed size, giving the features of each 2D box.
- the "key point coordinate prediction" process is implemented by the Keypoint_Header in FIG. 5, which mainly regresses the coordinates of the key points of the object inside each 2D box based on the features of that 2D box.
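- The keypoint branch follows the same pattern as the other serial headers; a minimal sketch (placeholder layer sizes, assuming K key points per object) that takes the per-box features extracted by ROI-ALIGN and regresses K (x, y) coordinates:

```python
import torch
import torch.nn as nn

class KeypointHeader(nn.Module):
    def __init__(self, in_ch=256, num_keypoints=4):
        super().__init__()
        self.k = num_keypoints
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, num_keypoints * 2),   # (x, y) for each key point
        )
    def forward(self, box_feats):                # box_feats: M x 256 x 14 x 14
        return self.net(box_feats).view(-1, self.k, 2)

kpts = KeypointHeader(num_keypoints=4)(torch.randn(3, 256, 14, 14))
print(kpts.shape)   # torch.Size([3, 4, 2])
```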
- The training process of the embodiment of the present invention is shown in FIG. 26.
- the red box part is the core training process.
- the core training process is introduced below.
- the amount of data for each task is extremely unbalanced; for example, the number of pictures containing people will be much larger than the number of pictures containing traffic signs.
- therefore, the data between tasks must be balanced; specifically, the data of tasks with small amounts is expanded.
- the expansion methods include but are not limited to copy expansion.
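- Copy expansion can be sketched as follows (an illustrative helper, not the patent's exact procedure): the picture lists of under-represented tasks are duplicated until every task has roughly the same number of pictures, and the balanced set is then shuffled before training.

```python
import random

def balance_by_copy_expansion(pictures_by_task):
    """pictures_by_task: dict mapping task name -> list of picture ids."""
    target = max(len(p) for p in pictures_by_task.values())
    balanced = []
    for task, pics in pictures_by_task.items():
        copies = (target + len(pics) - 1) // len(pics)   # repeat small datasets
        balanced.extend((pics * copies)[:target])
    random.shuffle(balanced)                             # shuffle before training
    return balanced

data = {"person": list(range(1000)), "traffic_sign": list(range(50))}
print(len(balance_by_copy_expansion(data)))              # 2000
```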
- Each picture can be assigned to one or more tasks according to the data types with which it is labeled. For example, if a picture is only marked with traffic signs, then it belongs only to the traffic sign task; if a picture is marked with both people and cars, then it belongs to the person and car tasks at the same time.
- When calculating the Loss, only the Loss of the Headers corresponding to the tasks of the current picture is calculated, and the Losses of other tasks are not calculated. For example, if the currently input training picture belongs to the person and car tasks, then only the Losses of the Headers for people and cars are calculated, and the Losses of the remaining Headers (such as traffic lights and traffic signs) are not calculated.
- the embodiments of this application comprehensively consider the shortcomings of the existing methods and propose a high-performance, scalable perception network based on multiple Headers, which simultaneously implements different perception tasks (2D/3D/key points/semantic segmentation, etc.) on the same network; each perception task in the network shares the same backbone network, which significantly saves computation.
- its network structure is easy to expand: a function can be added simply by adding one or several Headers.
- the embodiment of the present application also proposes a method for training a multi-task perception network based on partially labeled data.
- Each task uses an independent data set, and there is no need to perform full-task annotation on the same image.
- the training data of different tasks is convenient for balancing. The data of different tasks will not inhibit each other.
- FIG. 27 is a schematic diagram of an application system of the perception network.
- the perception network 2000 includes at least one processor 2001, at least one memory 2002, at least one communication interface 2003, and at least one display device 2004.
- the processor 2001, the memory 2002, the display device 2004, and the communication interface 2003 are connected through a communication bus and complete mutual communication.
- the communication interface 2003 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area networks (WLAN), etc.
- the memory 2002 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
- the memory can exist independently and is connected to the processor through a bus.
- the memory can also be integrated with the processor.
- the memory 2002 is used to store application program codes for executing the above solutions, and the processor 2001 controls the execution.
- the processor 2001 is configured to execute application program codes stored in the memory 2002.
- the code stored in the memory 2002 can execute the multi-header-based object perception method provided above.
- the processor 2001 may also be one or more integrated circuits configured to execute related programs to implement the object perception method based on multiple Headers or the model training method in the embodiments of the present application.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
- the storage medium is located in the memory 2002, and the processor 2001 reads the information in the memory 2002 and completes the object perception method or model training method in the embodiment of the present application in combination with its hardware.
- the communication interface 2003 uses a transceiving device such as but not limited to a transceiver to implement communication between the recommendation device or the training device and other devices or communication networks. For example, the image to be recognized or training data can be obtained through the communication interface 2003.
- For different tasks, the object corresponding to each task is detected independently based on the feature maps provided by the backbone network, and the 2D frame of the candidate region of the object corresponding to each task is output.
- The processor 2001 specifically performs the following steps: predicting, on one or more feature maps, the region in which the task object exists to obtain a candidate region, and outputting a candidate 2D frame matching the candidate region; extracting, based on the candidate region obtained by the RPN module, the feature of the region in which the candidate region is located from a feature map; refining the feature of the candidate region to obtain the confidence of each object category corresponding to the candidate region, where each object is an object in the corresponding task; adjusting the coordinates of the candidate region to obtain a second candidate 2D frame, where the second candidate 2D frame matches the actual object better than the candidate 2D frame; and selecting a 2D candidate frame whose confidence is greater than a preset threshold as the 2D frame of the candidate region.
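- Purely as a hedged illustration of the sequence just described (candidate 2D frames for candidate regions, extraction of the features of those regions, refinement into per-category confidences and adjusted coordinates, and selection by a confidence threshold), a per-Header detection step could be sketched as follows. It assumes PyTorch and torchvision; the proposal stage is stubbed out with fixed boxes, and the coordinate adjustment is simplified to an additive offset.

```python
# Illustrative sketch only, assuming PyTorch + torchvision; the proposal stage is stubbed
# with fixed candidate frames and the coordinate adjustment is simplified to an offset.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RCNNHead(nn.Module):
    """Refines candidate regions into per-category confidences and coordinate adjustments."""
    def __init__(self, in_ch=64, pool=7, num_classes=3):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(in_ch * pool * pool, 256), nn.ReLU())
        self.cls = nn.Linear(256, num_classes + 1)   # +1 background category
        self.reg = nn.Linear(256, 4)                 # coordinate adjustment per candidate frame

    def forward(self, roi_feats):
        h = self.fc(roi_feats)
        return self.cls(h).softmax(-1), self.reg(h)

def detect_one_task(feature_map, proposals, head, score_thresh=0.5):
    """feature_map: (1, C, H, W); proposals: (N, 4) candidate 2D frames in image coordinates."""
    # Extract the feature of the region where each candidate frame is located (ROI-ALIGN).
    rois = roi_align(feature_map, [proposals], output_size=7, spatial_scale=1.0 / 4)
    # Refine: confidence of each object category plus an adjusted ("second") candidate 2D frame.
    scores, deltas = head(rois)
    conf, labels = scores[:, 1:].max(dim=1)          # best non-background category
    refined = proposals + deltas                     # simplified coordinate adjustment
    # Keep only candidate frames whose confidence exceeds the preset threshold.
    keep = conf > score_thresh
    return refined[keep], labels[keep], conf[keep]

if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)                                   # stride-4 feature map
    props = torch.tensor([[8., 8., 40., 40.], [60., 20., 100., 90.]])   # from a proposal stage
    boxes, labels, conf = detect_one_task(feat, props, RCNNHead())
    print(boxes.shape, labels, conf)
```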
- When predicting, on one or more feature maps, the region in which the task object exists to obtain the candidate region and outputting the candidate 2D frame matching the candidate region, the processor 2001 specifically performs the following step: predicting, based on the template frame (Anchor) of the object of the task to which it belongs, the region in which the task object exists on the one or more feature maps to obtain the candidate region, and outputting the candidate 2D frame matching the candidate region; where the template frame is obtained based on the statistical characteristics of the task object to which it belongs, and the statistical characteristics include the shape and size of the object.
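- As a hedged example of deriving a task-specific template frame (Anchor) from the statistical characteristics (shape and size) of that task's labeled objects, one possible approach, assumed here for illustration only, is to cluster the annotated box dimensions:

```python
# Illustrative assumption: derive per-task Anchor sizes by clustering labeled box dimensions.
import numpy as np

def anchors_from_statistics(box_whs, num_anchors=3, iters=50, seed=0):
    """box_whs: (N, 2) widths/heights of the labeled objects of one task.
    Returns num_anchors (width, height) template frames via a tiny k-means in log-space."""
    data = np.log(np.asarray(box_whs, dtype=np.float64))
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), num_anchors, replace=False)]
    for _ in range(iters):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K) distances
        assign = dists.argmin(axis=1)
        for k in range(num_anchors):
            if np.any(assign == k):
                centers[k] = data[assign == k].mean(axis=0)
    return np.exp(centers)    # back to pixel widths/heights

if __name__ == "__main__":
    # Hypothetical traffic-light boxes, small and tall; the numbers are made up for the example.
    whs = [[12, 30], [14, 34], [10, 26], [18, 44], [16, 40], [11, 28]]
    print(anchors_from_statistics(whs, num_anchors=2))
```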
- the processor 2001 further executes the following steps:
- The feature of the object is extracted from one or more feature maps of the backbone network, and the 3D information, Mask information, or Keypoint information of the object is predicted.
- the detection of candidate regions of large objects is completed on a low-resolution feature map, and the detection of candidate regions of small objects is completed on a high-resolution feature map.
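- The size-to-resolution mapping can be illustrated with an FPN-style level-assignment rule; the constants below are the common defaults of FPN-type detectors and are used here only as an assumed example, not as the specific rule of this application:

```python
# Illustrative FPN-style assignment: larger objects are detected on coarser feature maps.
import math

def assign_pyramid_level(box_w, box_h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Return the pyramid level (P2..P5 here) on which a box of this size would be detected."""
    scale = math.sqrt(box_w * box_h)
    k = k0 + math.log2(scale / canonical_size)   # common FPN heuristic
    return int(max(k_min, min(k_max, round(k))))

if __name__ == "__main__":
    print(assign_pyramid_level(32, 32))      # small object -> high-resolution map (P2)
    print(assign_pyramid_level(400, 300))    # large object -> low-resolution map (P5)
```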
- the 2D frame is a rectangular frame.
- the structure of the perception network may be implemented by a server, and the server may be implemented with the structure in FIG. 28.
- the server 2110 includes at least one processor 2101, at least one memory 2102, and at least one communication interface 2103.
- the processor 2101, the memory 2102, and the communication interface 2103 are connected through a communication bus and complete mutual communication.
- the communication interface 2103 is used to communicate with other devices or communication networks, such as Ethernet, RAN, and WLAN.
- The memory 2102 may be a ROM or another type of static storage device that can store static information and instructions, or a RAM or another type of dynamic storage device that can store information and instructions; it may also be an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
- the memory can exist independently and is connected to the processor through a bus.
- the memory can also be integrated with the processor.
- the memory 2102 is used to store application program codes for executing the above solutions, and the processor 2101 controls the execution.
- the processor 2101 is configured to execute application program codes stored in the memory 2102.
- the code stored in the memory 2102 can execute the multi-header-based object perception method provided above.
- The processor 2101 may alternatively use one or more integrated circuits to execute related programs, so as to implement the multi-Header-based object perception method or the model training method in the embodiments of the present application.
- the processor 2101 may also be an integrated circuit chip with signal processing capability.
- each step of the recommended method of the present application can be completed by an integrated logic circuit of hardware in the processor 2101 or instructions in the form of software.
- each step of the training method in the embodiment of the present application can be completed by an integrated logic circuit of hardware in the processor 2101 or instructions in the form of software.
- The aforementioned processor 2101 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the methods, steps, and module block diagrams disclosed in the embodiments of the present application can be implemented or executed.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
- the storage medium is located in the memory 2102, and the processor 2101 reads the information in the memory 2102, and completes the object perception method or model training method in the embodiment of the present application in combination with its hardware.
- the communication interface 2103 uses a transceiving device such as but not limited to a transceiver to implement communication between the recommending device or training device and other devices or communication networks.
- the image to be recognized or training data can be obtained through the communication interface 2103.
- the bus may include a path for transferring information between various components of the device (for example, the memory 2102, the processor 2101, and the communication interface 2103).
- The processor 2101 specifically performs the following steps: predicting, on one or more feature maps, the region in which the task object exists to obtain a candidate region, and outputting a candidate 2D frame matching the candidate region; extracting, based on the candidate region obtained by the RPN module, the feature of the region in which the candidate region is located from a feature map; refining the feature of the candidate region to obtain the confidence of each object category corresponding to the candidate region, where each object is an object in the corresponding task; adjusting the coordinates of the candidate region to obtain a second candidate 2D frame, where the second candidate 2D frame matches the actual object better than the candidate 2D frame; and selecting a 2D candidate frame whose confidence is greater than a preset threshold as the 2D frame of the candidate region.
- This application provides a computer-readable medium that stores program code for execution by a device, where the program code includes related content for executing the object sensing method of the embodiment shown in FIG. 21, 22, 23, 24, or 25.
- The present application provides a computer-readable medium that stores program code for execution by a device, where the program code includes related content for executing the training method of the embodiment shown in FIG. 26.
- This application provides a computer program product containing instructions.
- When the computer program product runs on a computer, the computer executes the related content of the sensing method of the embodiment shown in FIG. 21, 22, 23, 24, or 25.
- This application provides a computer program product containing instructions.
- When the computer program product runs on a computer, the computer executes the related content of the training method of the embodiment shown in FIG. 26.
- The present application provides a chip. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to execute the related content of the sensing method of the embodiment shown in FIG. 21, 22, 23, 24, 25, or 26.
- the present application provides a chip that includes a processor and a data interface, and the processor reads instructions stored on a memory through the data interface, and executes related content of the training method of the embodiment shown in FIG. 26.
- the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
- The processor is configured to execute the related content of the sensing method of the embodiment shown in FIG. 21, 22, 23, 24, or 25, or to execute the related content of the training method of the embodiment shown in FIG. 26.
- Each sensing task shares the same backbone network, which saves computation severalfold; the network structure is easy to expand, and new 2D detection types can be supported by adding one or more Headers.
- Each parallel Header has independent RPN and RCNN modules and only needs to detect the objects of its own task, so that unlabeled objects belonging to other tasks are not mistakenly penalized during training.
- A dedicated Anchor can be customized for the scale and aspect ratio of each task's objects, which increases the overlap between the Anchors and the objects and thereby increases the recall rate of the RPN layer for those objects.
- Each task uses an independent data set, and there is no need to mark all tasks on the same picture, saving annotation costs.
- Task expansion is flexible and simple: when adding a new task, only the data of the new task needs to be provided, and the new objects do not need to be annotated in the original data.
- The training data of different tasks can be easily balanced, so that each task gets equal training opportunities and tasks with large amounts of data do not overwhelm tasks with small amounts of data.
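- As a hedged sketch of such balancing (an assumed, simplified scheme rather than the specific balancing used in the embodiments), tasks can be sampled uniformly so that a small data set is drawn from as often, on average, as a large one:

```python
# Illustrative sketch: sample tasks uniformly so a scarce task is visited as often as a large one.
import random

def balanced_batches(datasets, batch_size, num_batches, seed=0):
    """datasets: dict of task name -> list of training samples (each task has its own data set)."""
    rng = random.Random(seed)
    tasks = list(datasets)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            task = rng.choice(tasks)                       # every task is equally likely
            batch.append((task, rng.choice(datasets[task])))
        yield batch

if __name__ == "__main__":
    data = {"vehicle": [f"v{i}" for i in range(100000)],        # task with plentiful data
            "parking_slot": [f"p{i}" for i in range(500)]}      # task with scarce data
    for batch in balanced_batches(data, batch_size=4, num_batches=2):
        print(batch)
```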
- the disclosed device may be implemented in other ways.
- the device embodiments described above are only illustrative.
- The division into units is only a logical function division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
- If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory.
- The technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present invention.
- The aforementioned memory includes media that can store program code, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disc.
- The program may be stored in a computer-readable memory, and the memory may include a flash disk, a ROM, a RAM, a magnetic disk, an optical disc, or the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Human Computer Interaction (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Description
Single-Header-Model@720p | GFlops | Parameters(M) |
Vehicle(Car/Truck/Tram) | 235.5 | 17.76 |
Vehicle+Mask+3D | 235.6 | 32.49 |
Person(Pedestrian/Cyclist/Tricycle) | 235.5 | 17.76 |
Person+Mask | 235.6 | 23.0 |
Motorcycle/Bicycle | 235.5 | 17.76 |
TrafficLight(Red/Green/Yellow/Black) | 235.6 | 17.76 |
TrafficSign(Trafficsign/Guideside/Billboard) | 235.5 | 17.75 |
TrafficCone/TrafficStick/FireHydrant | 235.5 | 17.75 |
Parkingslot(with keypoint) | 235.6 | 18.98 |
Full-function network (multiple single-Header networks) | 1648.9 | 145.49 |
Multi-Header-Model@720p | GFlops | Parameters(M) |
Full-function network (single Multi-Header network) | 236.6 | 42.16 |
Category | Single-Header | Multi-Header |
Car | 91.7 | 91.6 |
Tram | 81.8 | 80.1 |
Pedestrian | 73.6 | 75.2 |
Cyclist | 81.8 | 83.3 |
TrafficLight | 98.3 | 97.5 |
TrafficSign | 95.1 | 94.5 |
Parkingslot(point precision/recall) | 94.01/80.61 | 95.17/78.89 |
3D(mean_orien_err/mean_centroid_dist_err) | 2.95/6.78 | 2.88/6.34 |
Claims (17)
- A perception network based on multiple headers (Headers), wherein the perception network comprises a backbone network and a plurality of parallel Headers, and the plurality of parallel Headers are connected to the backbone network; the backbone network is configured to receive an input picture, perform convolution processing on the input picture, and output feature maps with different resolutions corresponding to the picture; each of the plurality of parallel Headers is configured to detect the task object of one task based on the feature maps output by the backbone network, and to output the 2D frames of the regions in which the task object is located and the confidence corresponding to each 2D frame; each parallel Header completes detection of task objects of a different task; the task object is an object that needs to be detected in the task; and a higher confidence indicates a higher probability that the task object corresponding to the task exists in the 2D frame corresponding to that confidence.
- The perception network according to claim 1, wherein each parallel header comprises a region proposal network (RPN) module, a region of interest extraction (ROI-ALIGN) module, and a region convolutional neural network (RCNN) module; the RPN module of each parallel header is independent of the RPN modules of the other parallel headers, the ROI-ALIGN module of each parallel header is independent of the ROI-ALIGN modules of the other parallel headers, and the RCNN module of each parallel header is independent of the RCNN modules of the other parallel headers; and for each parallel header: the RPN module is configured to predict, on one or more feature maps provided by the backbone network, the region in which the task object is located, and to output a candidate 2D frame matching the region; the ROI-ALIGN module is configured to extract, based on the region predicted by the RPN module, the feature of the region in which the candidate 2D frame is located from a feature map provided by the backbone network; and the RCNN module is configured to: perform, through a neural network, convolution processing on the feature of the region in which the candidate 2D frame is located to obtain the confidences that the candidate 2D frame belongs to each object category, where the object categories are the object categories of the task corresponding to the parallel header; adjust, through a neural network, the coordinates of the candidate 2D frame so that the adjusted 2D candidate frame matches the shape of the actual object better than the candidate 2D frame; and select an adjusted 2D candidate frame whose confidence is greater than a preset threshold as the 2D frame of the region.
- The perception network according to claim 1 or 2, wherein the 2D frame is a rectangular frame.
- The perception network according to claim 2, wherein the RPN module is configured to: predict, based on a template frame (Anchor) of the object corresponding to the task to which it belongs, on one or more feature maps provided by the backbone network, the region in which the task object exists to obtain a candidate region, and output a candidate 2D frame matching the candidate region; wherein the template frame is obtained based on the statistical characteristics of the task object to which it belongs, and the statistical characteristics include the shape and size of the object.
- The perception network according to any one of claims 1 to 4, wherein the perception network further comprises one or more serial Headers, and a serial Header is connected to one of the parallel Headers; the serial Header is configured to: using the 2D frame, provided by the parallel Header to which it is connected, of the task object of the task to which it belongs, extract the feature of the region in which the 2D frame is located from one or more feature maps on the backbone network, and predict the 3D information, Mask information, or Keypoint information of the task object of the task based on the feature of the region in which the 2D frame is located.
- The perception network according to any one of claims 1 to 4, wherein the RPN module is configured to predict the regions in which objects of different sizes are located on feature maps of different resolutions.
- The perception network according to claim 6, wherein the RPN module is configured to complete detection of the regions in which large objects are located on a low-resolution feature map, and to complete detection of the regions in which small objects are located on a high-resolution feature map.
- An object detection method, wherein the method comprises: receiving an input picture; performing convolution processing on the input picture, and outputting feature maps with different resolutions corresponding to the picture; and, based on the feature maps, independently detecting the task object of each task for the different tasks, and outputting the 2D frame of the region in which each task object is located and the confidence corresponding to each 2D frame; wherein the task object is an object that needs to be detected in the task, and a higher confidence indicates a higher probability that the task object corresponding to the task exists in the 2D frame corresponding to that confidence.
- The object detection method according to claim 8, wherein the detecting, based on the feature maps, the task object of each task independently for the different tasks and outputting the 2D frame of the region in which each task object is located and the confidence corresponding to each 2D frame comprises: predicting, on one or more feature maps, the region in which the task object is located, and outputting a candidate 2D frame matching the region; extracting, based on the region in which the task object is located, the feature of the region in which the candidate 2D frame is located from a feature map; performing convolution processing on the feature of the region in which the candidate 2D frame is located to obtain the confidences that the candidate 2D frame belongs to each object category, where the object categories are the object categories of the one task; adjusting, through a neural network, the coordinates of the candidate 2D frame so that the adjusted 2D candidate frame matches the shape of the actual object better than the candidate 2D frame; and selecting an adjusted 2D candidate frame whose confidence is greater than a preset threshold as the 2D frame of the region.
- The object detection method according to claim 9, wherein the 2D frame is a rectangular frame.
- The object detection method according to claim 9, wherein the predicting, on one or more feature maps, the region in which the task object is located and outputting a candidate 2D frame matching the region is: predicting, based on a template frame (Anchor) of the object corresponding to the task to which it belongs, on one or more feature maps provided by the backbone network, the region in which the task object exists to obtain a candidate region, and outputting a candidate 2D frame matching the candidate region; wherein the template frame is obtained based on the statistical characteristics of the task object to which it belongs, and the statistical characteristics include the shape and size of the object.
- The object detection method according to any one of claims 8 to 11, wherein the method further comprises: based on the 2D frame of the task object of the task to which it belongs, extracting the feature of the region in which the 2D frame is located from one or more feature maps on the backbone network, and predicting the 3D information, Mask information, or Keypoint information of the task object of the task based on the feature of the region in which the 2D frame is located.
- The object detection method according to any one of claims 8 to 12, wherein detection of the regions in which large objects are located is completed on a low-resolution feature map, and detection of the regions in which small objects are located is completed on a high-resolution feature map.
- A method for training a multi-task perception network based on partially annotated data, wherein the perception network comprises a backbone network and a plurality of parallel headers (Headers), and the method comprises: determining, based on the type of annotated data of each picture, the task to which each picture belongs, wherein each picture is annotated with one or more data types, the one or more data types are a subset of all data types, and each of the all data types corresponds to one task; determining, based on the task to which each picture belongs, the Header that needs to be trained for each picture; calculating the loss value of the Header that needs to be trained for each picture; and, for each picture, performing gradient back-propagation through the Header that needs to be trained, and adjusting the parameters of the Header that needs to be trained and of the backbone network based on the loss value.
- The method for training a multi-task perception network according to claim 14, wherein before the calculating the loss value of the Header that needs to be trained for each picture, the method further comprises: performing data balancing on the pictures belonging to different tasks.
- An apparatus for training a multi-task perception network based on partially annotated data, wherein the perception network comprises a backbone network and a plurality of parallel headers (Headers), and the apparatus comprises: a task determining module, configured to determine, based on the type of annotated data of each picture, the task to which each picture belongs, wherein each picture is annotated with one or more data types, the one or more data types are a subset of all data types, and each of the all data types corresponds to one task; a Header determining module, configured to determine, based on the task to which each picture belongs, the Header that needs to be trained for each picture; a loss value calculation module, configured to calculate, for each picture, the loss value of the Header determined by the Header determining module; and an adjustment module, configured to, for each picture, perform gradient back-propagation through the Header determined by the Header determining module, and adjust the parameters of the Header that needs to be trained and of the backbone network based on the loss value obtained by the loss value calculation module.
- The apparatus for training a multi-task perception network according to claim 16, wherein the apparatus further comprises: a data balancing module, configured to perform data balancing on the pictures belonging to different tasks.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20817904.4A EP3916628A4 (en) | 2019-06-06 | 2020-06-08 | OBJECT IDENTIFICATION METHOD AND DEVICE |
JP2021538658A JP7289918B2 (ja) | 2019-06-06 | 2020-06-08 | 物体認識方法及び装置 |
US17/542,497 US20220165045A1 (en) | 2019-06-06 | 2021-12-06 | Object recognition method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910493331.6 | 2019-06-06 | ||
CN201910493331.6A CN110298262B (zh) | 2019-06-06 | 2019-06-06 | 物体识别方法及装置 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/542,497 Continuation US20220165045A1 (en) | 2019-06-06 | 2021-12-06 | Object recognition method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020244653A1 true WO2020244653A1 (zh) | 2020-12-10 |
Family
ID=68027699
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/094803 WO2020244653A1 (zh) | 2019-06-06 | 2020-06-08 | 物体识别方法及装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20220165045A1 (zh) |
EP (1) | EP3916628A4 (zh) |
JP (1) | JP7289918B2 (zh) |
CN (1) | CN110298262B (zh) |
WO (1) | WO2020244653A1 (zh) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112869829A (zh) * | 2021-02-25 | 2021-06-01 | 北京积水潭医院 | 一种智能镜下腕管切割器 |
CN113657486A (zh) * | 2021-08-16 | 2021-11-16 | 浙江新再灵科技股份有限公司 | 基于电梯图片数据的多标签多属性分类模型建立方法 |
CN114596624A (zh) * | 2022-04-20 | 2022-06-07 | 深圳市海清视讯科技有限公司 | 人眼状态检测方法、装置、电子设备及存储介质 |
FR3121110A1 (fr) * | 2021-03-24 | 2022-09-30 | Psa Automobiles Sa | Procédé et système de contrôle d’une pluralité de systèmes d’aide à la conduite embarqués dans un véhicule |
WO2022246989A1 (zh) * | 2021-05-26 | 2022-12-01 | 腾讯云计算(北京)有限责任公司 | 一种数据识别方法、装置、设备及可读存储介质 |
CN115661784A (zh) * | 2022-10-12 | 2023-01-31 | 北京惠朗时代科技有限公司 | 一种面向智慧交通的交通标志图像大数据识别方法与系统 |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11922314B1 (en) * | 2018-11-30 | 2024-03-05 | Ansys, Inc. | Systems and methods for building dynamic reduced order physical models |
US11462112B2 (en) * | 2019-03-07 | 2022-10-04 | Nec Corporation | Multi-task perception network with applications to scene understanding and advanced driver-assistance system |
CN110298262B (zh) * | 2019-06-06 | 2024-01-02 | 华为技术有限公司 | 物体识别方法及装置 |
CN110675635B (zh) * | 2019-10-09 | 2021-08-03 | 北京百度网讯科技有限公司 | 相机外参的获取方法、装置、电子设备及存储介质 |
WO2021114031A1 (zh) * | 2019-12-09 | 2021-06-17 | 深圳市大疆创新科技有限公司 | 一种目标检测方法和装置 |
CN112989900A (zh) * | 2019-12-13 | 2021-06-18 | 深动科技(北京)有限公司 | 一种精确检测交通标志或标线的方法 |
CN111291809B (zh) * | 2020-02-03 | 2024-04-12 | 华为技术有限公司 | 一种处理装置、方法及存储介质 |
CN111598000A (zh) * | 2020-05-18 | 2020-08-28 | 中移(杭州)信息技术有限公司 | 基于多任务的人脸识别方法、装置、服务器和可读存储介质 |
CN112434552A (zh) * | 2020-10-13 | 2021-03-02 | 广州视源电子科技股份有限公司 | 神经网络模型调整方法、装置、设备及存储介质 |
WO2022126523A1 (zh) * | 2020-12-17 | 2022-06-23 | 深圳市大疆创新科技有限公司 | 物体检测方法、设备、可移动平台及计算机可读存储介质 |
CN112614105B (zh) * | 2020-12-23 | 2022-08-23 | 东华大学 | 一种基于深度网络的3d点云焊点缺陷检测方法 |
CN113065637B (zh) * | 2021-02-27 | 2023-09-01 | 华为技术有限公司 | 一种感知网络及数据处理方法 |
CN117157679A (zh) * | 2021-04-12 | 2023-12-01 | 华为技术有限公司 | 感知网络、感知网络的训练方法、物体识别方法及装置 |
CN113191401A (zh) * | 2021-04-14 | 2021-07-30 | 中国海洋大学 | 基于视觉显著性共享的用于三维模型识别的方法及装置 |
CN113255445A (zh) * | 2021-04-20 | 2021-08-13 | 杭州飞步科技有限公司 | 多任务模型训练及图像处理方法、装置、设备及存储介质 |
CN114723966B (zh) * | 2022-03-30 | 2023-04-07 | 北京百度网讯科技有限公司 | 多任务识别方法、训练方法、装置、电子设备及存储介质 |
CN114821269A (zh) * | 2022-05-10 | 2022-07-29 | 安徽蔚来智驾科技有限公司 | 多任务目标检测方法、设备、自动驾驶系统和存储介质 |
CN116385949B (zh) * | 2023-03-23 | 2023-09-08 | 广州里工实业有限公司 | 一种移动机器人的区域检测方法、系统、装置及介质 |
CN116543163B (zh) * | 2023-05-15 | 2024-01-26 | 哈尔滨市科佳通用机电股份有限公司 | 一种制动连接管折断故障检测方法 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124409A1 (en) * | 2015-11-04 | 2017-05-04 | Nec Laboratories America, Inc. | Cascaded neural network with scale dependent pooling for object detection |
CN108520229A (zh) * | 2018-04-04 | 2018-09-11 | 北京旷视科技有限公司 | 图像检测方法、装置、电子设备和计算机可读介质 |
US10223610B1 (en) * | 2017-10-15 | 2019-03-05 | International Business Machines Corporation | System and method for detection and classification of findings in images |
CN109712118A (zh) * | 2018-12-11 | 2019-05-03 | 武汉三江中电科技有限责任公司 | 一种基于Mask RCNN的变电站隔离开关检测识别方法 |
CN109784194A (zh) * | 2018-12-20 | 2019-05-21 | 上海图森未来人工智能科技有限公司 | 目标检测网络构建方法和训练方法、目标检测方法 |
CN109815922A (zh) * | 2019-01-29 | 2019-05-28 | 卡斯柯信号有限公司 | 基于人工智能神经网络的轨道交通地面目标视频识别方法 |
CN110298262A (zh) * | 2019-06-06 | 2019-10-01 | 华为技术有限公司 | 物体识别方法及装置 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019028725A1 (en) * | 2017-08-10 | 2019-02-14 | Intel Corporation | CONVOLUTIVE NEURAL NETWORK STRUCTURE USING INVERTED CONNECTIONS AND OBJECTIVITY ANTERIORITIES TO DETECT AN OBJECT |
US10679351B2 (en) * | 2017-08-18 | 2020-06-09 | Samsung Electronics Co., Ltd. | System and method for semantic segmentation of images |
CN109598186A (zh) * | 2018-10-12 | 2019-04-09 | 高新兴科技集团股份有限公司 | 一种基于多任务深度学习的行人属性识别方法 |
-
2019
- 2019-06-06 CN CN201910493331.6A patent/CN110298262B/zh active Active
-
2020
- 2020-06-08 WO PCT/CN2020/094803 patent/WO2020244653A1/zh active Application Filing
- 2020-06-08 EP EP20817904.4A patent/EP3916628A4/en active Pending
- 2020-06-08 JP JP2021538658A patent/JP7289918B2/ja active Active
-
2021
- 2021-12-06 US US17/542,497 patent/US20220165045A1/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124409A1 (en) * | 2015-11-04 | 2017-05-04 | Nec Laboratories America, Inc. | Cascaded neural network with scale dependent pooling for object detection |
US10223610B1 (en) * | 2017-10-15 | 2019-03-05 | International Business Machines Corporation | System and method for detection and classification of findings in images |
CN108520229A (zh) * | 2018-04-04 | 2018-09-11 | 北京旷视科技有限公司 | 图像检测方法、装置、电子设备和计算机可读介质 |
CN109712118A (zh) * | 2018-12-11 | 2019-05-03 | 武汉三江中电科技有限责任公司 | 一种基于Mask RCNN的变电站隔离开关检测识别方法 |
CN109784194A (zh) * | 2018-12-20 | 2019-05-21 | 上海图森未来人工智能科技有限公司 | 目标检测网络构建方法和训练方法、目标检测方法 |
CN109815922A (zh) * | 2019-01-29 | 2019-05-28 | 卡斯柯信号有限公司 | 基于人工智能神经网络的轨道交通地面目标视频识别方法 |
CN110298262A (zh) * | 2019-06-06 | 2019-10-01 | 华为技术有限公司 | 物体识别方法及装置 |
Non-Patent Citations (1)
Title |
---|
See also references of EP3916628A4 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112869829A (zh) * | 2021-02-25 | 2021-06-01 | 北京积水潭医院 | 一种智能镜下腕管切割器 |
CN112869829B (zh) * | 2021-02-25 | 2022-10-21 | 北京积水潭医院 | 一种智能镜下腕管切割器 |
FR3121110A1 (fr) * | 2021-03-24 | 2022-09-30 | Psa Automobiles Sa | Procédé et système de contrôle d’une pluralité de systèmes d’aide à la conduite embarqués dans un véhicule |
WO2022246989A1 (zh) * | 2021-05-26 | 2022-12-01 | 腾讯云计算(北京)有限责任公司 | 一种数据识别方法、装置、设备及可读存储介质 |
CN113657486A (zh) * | 2021-08-16 | 2021-11-16 | 浙江新再灵科技股份有限公司 | 基于电梯图片数据的多标签多属性分类模型建立方法 |
CN113657486B (zh) * | 2021-08-16 | 2023-11-07 | 浙江新再灵科技股份有限公司 | 基于电梯图片数据的多标签多属性分类模型建立方法 |
CN114596624A (zh) * | 2022-04-20 | 2022-06-07 | 深圳市海清视讯科技有限公司 | 人眼状态检测方法、装置、电子设备及存储介质 |
CN115661784A (zh) * | 2022-10-12 | 2023-01-31 | 北京惠朗时代科技有限公司 | 一种面向智慧交通的交通标志图像大数据识别方法与系统 |
CN115661784B (zh) * | 2022-10-12 | 2023-08-22 | 北京惠朗时代科技有限公司 | 一种面向智慧交通的交通标志图像大数据识别方法与系统 |
Also Published As
Publication number | Publication date |
---|---|
CN110298262B (zh) | 2024-01-02 |
JP7289918B2 (ja) | 2023-06-12 |
JP2022515895A (ja) | 2022-02-22 |
US20220165045A1 (en) | 2022-05-26 |
CN110298262A (zh) | 2019-10-01 |
EP3916628A4 (en) | 2022-07-13 |
EP3916628A1 (en) | 2021-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020244653A1 (zh) | 物体识别方法及装置 | |
WO2020253416A1 (zh) | 物体检测方法、装置和计算机存储介质 | |
Mendes et al. | Exploiting fully convolutional neural networks for fast road detection | |
Sakaridis et al. | Semantic foggy scene understanding with synthetic data | |
CN110070107B (zh) | 物体识别方法及装置 | |
WO2021164751A1 (zh) | 一种感知网络结构搜索方法及其装置 | |
WO2021147325A1 (zh) | 一种物体检测方法、装置以及存储介质 | |
WO2021155792A1 (zh) | 一种处理装置、方法及存储介质 | |
WO2021218786A1 (zh) | 一种数据处理系统、物体检测方法及其装置 | |
Liu et al. | Segmentation of drivable road using deep fully convolutional residual network with pyramid pooling | |
WO2021164750A1 (zh) | 一种卷积层量化方法及其装置 | |
Li et al. | Implementation of deep-learning algorithm for obstacle detection and collision avoidance for robotic harvester | |
WO2022217434A1 (zh) | 感知网络、感知网络的训练方法、物体识别方法及装置 | |
Grigorev et al. | Depth estimation from single monocular images using deep hybrid network | |
Xing et al. | Traffic sign recognition using guided image filtering | |
Yang et al. | A fusion network for road detection via spatial propagation and spatial transformation | |
CN114764856A (zh) | 图像语义分割方法和图像语义分割装置 | |
US20230401826A1 (en) | Perception network and data processing method | |
CN114972182A (zh) | 一种物体检测方法及其装置 | |
Sun et al. | Semantic-aware 3D-voxel CenterNet for point cloud object detection | |
CN113534189A (zh) | 体重检测方法、人体特征参数检测方法及装置 | |
WO2023029704A1 (zh) | 数据处理方法、装置和系统 | |
CN115272992B (zh) | 一种车辆姿态估计方法 | |
Fan et al. | Pose recognition for dense vehicles under complex street scenario | |
Dube | To See Is to Believe |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20817904 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2021538658 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 20817904 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2020817904 Country of ref document: EP Effective date: 20210825 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |