CN116721151A - Data processing method and related device - Google Patents


Info

Publication number
CN116721151A
CN116721151A
Authority
CN
China
Prior art keywords
layer
ith
image
encoder
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210192082.9A
Other languages
Chinese (zh)
Inventor
严欣
王君乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210192082.9A priority Critical patent/CN116721151A/en
Publication of CN116721151A publication Critical patent/CN116721151A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery
    • G06T 7/55: Depth or shape recovery from multiple images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30248: Vehicle exterior or interior
    • G06T 2207/30252: Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a data processing method and a related device, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method comprises: acquiring an image to be detected; inputting the image to be detected into a neural network model comprising I encoders and I decoders; determining the depth of each pixel point in the image to be detected from the semantic information output by the I-th decoder of the neural network model; determining the depth of an object included in the image to be detected according to the depth of each pixel point; and further determining the distance between the camera and the object included in the image to be detected. The I groups of encoders and decoders increase the number of paths inside the neural network and thereby improve the information flow in the network, so that semantic information and spatial information in the image to be detected are fully fused, the accuracy of the depth estimated by the neural network model is improved, and the accuracy of identifying the depth of objects in the image to be detected is further improved.

Description

Data processing method and related device
Technical Field
The application relates to the technical field of automatic driving, in particular to a data processing method and a related device.
Background
With the development of science and technology, unmanned driving technology has received extensive attention from researchers in various fields. A vehicle equipped with unmanned driving technology is called an autonomous vehicle. An autonomous vehicle can update its map information by sensing the surrounding environment, so that it can continuously track its own position.
In the related art, an autonomous vehicle photographs the surrounding environment with a camera to obtain an image to be detected, and recognizes the depth of objects of the surrounding environment in the image to be detected, thereby obtaining information such as the distance between each object and the autonomous vehicle. For example, the current vehicle photographs the scene in front of it to obtain an image to be detected, and the depth of the preceding vehicle in the image to be detected is identified, thereby obtaining the distance between the current vehicle and the preceding vehicle.
However, in the related art, the accuracy of object depth recognition in the image to be detected is not high.
Disclosure of Invention
In order to solve the above technical problem, the application provides a data processing method and a related device, which are used to improve the accuracy of object depth identification in an image to be detected.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a data processing method, including:
Acquiring an image to be detected;
inputting the image to be detected into a neural network model comprising I encoders and I decoders, and determining the depth of each pixel point in the image to be detected through semantic information output by the I-th decoder of the neural network model; each encoder comprises N encoding units and each decoder comprises N decoding units; the input of the n-th encoding unit of the i-th encoder is the semantic feature output by the n-th encoding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th encoding unit of the i-th encoder; the semantic feature output by the n-th encoding unit of the i-th encoder is input to the (n+1)-th encoding unit of the i-th encoder and the n-th decoding unit of the i-th decoder; I and N are integers greater than 1, i is a positive integer less than or equal to I, and n is a positive integer less than or equal to N;
and determining the depth of an object included in the image to be detected according to the depth of each pixel point in the image to be detected.
In another aspect, an embodiment of the present application provides a data processing apparatus, including: the device comprises an acquisition unit, a neural network model unit and a determination unit;
The acquisition unit is used for acquiring an image to be detected;
the neural network model unit is used for inputting the image to be detected into a neural network model comprising I encoders and I decoders, and determining the depth of each pixel point in the image to be detected through semantic information output by the I-th decoder of the neural network model; each encoder comprises N encoding units and each decoder comprises N decoding units; the input of the n-th encoding unit of the i-th encoder is the semantic feature output by the n-th encoding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th encoding unit of the i-th encoder; the semantic feature output by the n-th encoding unit of the i-th encoder is input to the (n+1)-th encoding unit of the i-th encoder and the n-th decoding unit of the i-th decoder; I and N are integers greater than 1, i is a positive integer less than or equal to I, and n is a positive integer less than or equal to N;
the determining unit is used for determining the depth of the object included in the image to be detected according to the depth of each pixel point in the image to be detected.
In another aspect, an embodiment of the present application provides a computer device, the device including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program for executing the method described in the above aspect.
In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method described in the above aspect.
According to the above technical scheme, the acquired image to be detected is input into a neural network model comprising I encoders and I decoders, where each encoder comprises N encoding units and each decoder comprises N decoding units. Taking the n-th encoding unit of the i-th encoder as an example, its input is the semantic feature output by the n-th encoding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th encoding unit of the i-th encoder; the semantic feature it outputs serves as the input of the (n+1)-th encoding unit of the i-th encoder and the n-th decoding unit of the i-th decoder, until the I-th decoder outputs semantic information. The depth of each pixel point in the image to be detected is determined according to the semantic information output by the I-th decoder, and the depth of an object included in the image to be detected is further determined. In this way, the I groups of encoders and decoders increase the number of paths inside the neural network and thereby improve the information flow in the network, so that semantic information and spatial information in the image to be detected are fully fused, the accuracy of the depth estimated by the neural network model is improved, and the accuracy of identifying the depth of objects in the image to be detected is further improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of a data processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a neural network model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a feature fusion module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a basic module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a basic module according to an embodiment of the present application;
fig. 7 is a schematic diagram of a visualization result of a data processing method on the KITTI2015 test set according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In the related art, after an image to be detected is obtained, it is input into a deep neural network model that adopts a single encoder-decoder structure (such as U-Net: Convolutional Networks for Biomedical Image Segmentation), where the encoder is a pre-trained deep residual network (Deep Residual Network, ResNet) model. The encoder extracts semantic features of the image to be detected, the semantic features are input into the decoder to obtain semantic information, and the depth of each pixel point in the image to be detected is determined according to the semantic information, so that the depth of objects in the image to be detected is obtained from the pixel depths and, in turn, the distance between each object and the autonomous vehicle. However, the number of paths inside a neural network that contains only a single encoder-decoder structure is small, the information flow in the network is poor, and semantic information and spatial information in the image to be detected cannot be fully fused, so the accuracy of object depth identification in the image to be detected is not high.
Based on the above, the embodiment of the application provides a data processing method, which increases the number of paths in a neural network through a plurality of groups of encoders and decoders so as to improve the information flow in the neural network, fully fuses semantic information and spatial information in an image to be detected, improves the accuracy of estimating depth of a neural network model, and further improves the accuracy of identifying object depth in the image to be detected.
The data processing method provided by the embodiment of the application is implemented based on artificial intelligence. Artificial Intelligence (AI) is the theory, methods, technologies and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the application, the artificial intelligence technology involved is mainly the artificial neural network in machine learning/deep learning: multiple groups of encoders and decoders in the artificial neural network increase the number of paths inside the network so as to improve the information flow in the network, fully fuse semantic information and spatial information in the image to be detected, improve the accuracy of the model's depth estimation, and further improve the accuracy of object depth identification in the image to be detected. Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills, and how to reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
In the data processing method or apparatus according to the present application, a plurality of servers may be organized into a blockchain, with each server being a node on the blockchain.
The data processing method provided by the embodiment of the application can be applied to data processing devices with data processing capability, such as terminal devices and servers. The terminal device may be, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like; the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The embodiment of the application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like.
An Intelligent Traffic System (ITS), also called an Intelligent Transportation System, is an integrated transportation system that effectively and comprehensively applies advanced science and technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operations research, artificial intelligence, etc.) to transportation, service control and vehicle manufacturing, and strengthens the connection among vehicles, roads and users, thereby guaranteeing safety, improving efficiency, improving the environment and saving energy. Alternatively:
An Intelligent Vehicle Infrastructure Cooperative System (IVICS), referred to simply as a vehicle-road cooperative system, is one development direction of the Intelligent Transportation System (ITS). The vehicle-road cooperative system adopts advanced wireless communication, new-generation Internet and other technologies to carry out all-round dynamic real-time information interaction between vehicles and between vehicles and roads, and carries out active vehicle safety control and cooperative road management on the basis of full-time-space dynamic traffic information collection and fusion, thereby fully realizing effective cooperation among people, vehicles and roads, ensuring traffic safety, improving traffic efficiency, and forming a safe, efficient and environment-friendly road traffic system.
In the embodiment of the application, the automatic driving technology mainly involves behavior decision-making and the like: the autonomous vehicle photographs an image to be detected, and the distance between an object in the image to be detected and the vehicle is determined, thereby improving driving safety.
In order to facilitate understanding of the technical scheme of the present application, the data processing method provided by the embodiment of the present application is introduced below by using the vehicle-mounted terminal as the data processing device in combination with an actual application scenario.
Referring to fig. 1, the application scenario of a data processing method according to an embodiment of the present application is shown. In the application scenario shown in fig. 1, an autonomous vehicle (host vehicle) travels on a road following a preceding vehicle, the host vehicle being mounted with an in-vehicle terminal 100 and a camera 200.
The camera 200 photographs a road image in front of the host vehicle, and transmits the road image in front to the in-vehicle terminal 100 as an image to be detected so as to determine a distance between the host vehicle and the preceding vehicle through the in-vehicle terminal 100.
The in-vehicle terminal 100 includes a neural network model for determining the depth of the preceding vehicle. The neural network model comprises 2 encoders and 2 decoders, each encoder comprising 5 coding units (e.g. F00, F10, F20, F30 and F40) and each decoder comprising 5 decoding units (e.g. F03, F13, F23, F33 and F43). The 3rd coding unit of the 2nd encoder (i.e. F22) is taken as an example to illustrate the internal structure of the neural network.
The semantic feature output by the 3rd coding unit of the 1st encoder (i.e. F20), the semantic information output by the 3rd decoding unit of the 1st decoder (i.e. F21) and the semantic feature output by the 2nd coding unit of the 2nd encoder (i.e. F12) are input to the 3rd coding unit of the 2nd encoder, which outputs a semantic feature serving as the input of the 4th coding unit of the 2nd encoder (i.e. F32) and the 3rd decoding unit of the 2nd decoder (i.e. F23). It can be seen that the inputs and outputs of the 3rd coding unit of the 2nd encoder are diversified, which increases the number of paths inside the neural network.
The depth of each pixel point in the image to be detected is determined through the semantic information output by the 2nd decoder of the neural network model, and then the depth of the object included in the image to be detected, i.e. the depth of the preceding vehicle, is determined, so that the distance between the host vehicle and the preceding vehicle can be obtained from the depth of the preceding vehicle.
Therefore, after the image to be detected is input into the neural network model, the two groups of encoder-decoder structures sufficiently increase the number of paths inside the neural network so as to improve the information flow in the network; semantic information and spatial information in the image to be detected are fully fused, the accuracy of the depth estimated by the neural network model is improved, and the accuracy of identifying the depth of objects in the image to be detected is further improved.
The following describes a data processing method provided by an embodiment of the present application with reference to the accompanying drawings, wherein a server is used as a data processing device.
Referring to fig. 2, a flowchart of a data processing method according to an embodiment of the present application is shown. As shown in fig. 2, the data processing method includes the steps of:
s201: and acquiring an image to be detected.
After the automatic driving vehicle collects the image to be detected, the image to be detected can be uploaded to a server (such as a cloud server) through the vehicle-mounted terminal, and the distance between an object in the image to be detected and the vehicle is judged through the server.
S202: inputting the image to be detected into a neural network model comprising I encoders and I decoders, and determining the depth of each pixel point in the image to be detected through semantic information output by the I decoder of the neural network model.
The neural network model is described first. The neural network includes multiple groups of encoder-decoder structures. In an encoder-decoder structure, the encoder itself is a series of convolutional networks, mainly composed of convolution layers, pooling layers and batch normalization (BN) layers. The convolution layer is responsible for acquiring local features of the image, the pooling layer downsamples the image and passes scale-invariant features to the next layer, and BN mainly normalizes the distribution of the training images to accelerate learning. In general terms, the encoder classifies and analyzes low-level local pixel values of the image to obtain high-level semantic information. The decoder upsamples the reduced feature maps and then applies convolution to the upsampled images, refining the geometric shape of objects and compensating for the loss of detail caused by the pooling layers in the encoder.
The neural network model comprises I encoders and I decoders, where I is an integer greater than 1, i.e. the neural network model comprises multiple groups of encoder-decoder structures. The encoder is used to extract semantic features of the image to be detected; the input of the decoder is the semantic features of the image to be detected, and the output of the decoder is semantic information of the image to be detected, such as a character sequence of variable length.
In the grid-like network structure included in the neural network model, from left to right, the 1st column is the 1st encoder, the 2nd column is the 1st decoder, the 3rd column is the 2nd encoder, the 4th column is the 2nd decoder, and so on; the second-to-last column is the I-th encoder and the last column is the I-th decoder. Each encoder includes N encoding units and each decoder includes N decoding units, where N is an integer greater than 1. From top to bottom, the 1st row contains the 1st coding unit of the 1st encoder, the 2nd row contains the 2nd coding unit of the 1st encoder, and so on; the decoding units of each decoder are arranged in the same way.
Taking the n-th coding unit of the i-th encoder as an example, its input is the semantic feature output by the n-th coding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th coding unit of the i-th encoder; the semantic feature output by the n-th coding unit of the i-th encoder is input to the (n+1)-th coding unit of the i-th encoder and the n-th decoding unit of the i-th decoder, where i is a positive integer less than or equal to I and n is a positive integer less than or equal to N. Thus, the image to be detected is input into the neural network comprising multiple groups of encoder-decoder structures, and the depth of each pixel point in the image to be detected can be determined from the semantic information output by the I-th decoder (i.e. the last column) of the neural network.
The 2nd to I-th encoders take as input not only the semantic information output by the decoder in the previous column but also the semantic features output by the encoder in the previous column. Compared with a neural network adopting only a single encoder-decoder structure, a neural network with cascaded multiple groups of encoder-decoder structures has more internal paths, which improves the information flow in the network, fully fuses the semantic information and spatial information in the image to be detected, improves the accuracy of the depth estimated by the neural network model, and further improves the accuracy of identifying the depth of objects in the image to be detected.
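To make this connectivity concrete, the following is a minimal sketch under assumed names (PyTorch, toy channel handling; the patent itself specifies no code) of how the inputs of the n-th coding unit of the i-th encoder could be gathered and fused:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Unit(nn.Module):
    """Toy coding/decoding unit: concatenate its inputs, then 3x3 conv + BN + ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, inputs):
        return F.relu(self.bn(self.conv(torch.cat(inputs, dim=1))))

def gather_encoder_inputs(i, n, enc_feat, dec_feat):
    """Inputs of the n-th coding unit of the i-th encoder (1-based indices, as in the text).

    enc_feat[(i, n)] holds the semantic feature already produced by that coding unit;
    dec_feat[(i, n)] holds the semantic information produced by that decoding unit.
    """
    inputs = []
    if i > 1:
        inputs.append(enc_feat[(i - 1, n)])   # n-th coding unit of the (i-1)-th encoder
        inputs.append(dec_feat[(i - 1, n)])   # n-th decoding unit of the (i-1)-th decoder
    if n > 1:
        # (n-1)-th coding unit of the i-th encoder, downsampled to the current resolution
        inputs.append(F.avg_pool2d(enc_feat[(i, n - 1)], kernel_size=2))
    return inputs
```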
S203: and determining the depth of the object included in the image to be detected according to the depth of each pixel point in the image to be detected.
Because an object generally occupies a contiguous region of the image to be detected, the depths of the pixel points corresponding to the object differ little. Therefore, whether the image to be detected includes objects, how many objects it includes, and the depth corresponding to each object can be determined from the depth of each pixel point in the image to be detected, and the distance between the camera and the photographed object can be further determined. Thus, the depth of an object included in the image to be detected is determined according to the depth of each pixel point in the image to be detected, and the distance between the camera that captured the image and the object included in the image can be determined according to the depth of the object.
According to the above technical scheme, the acquired image to be detected is input into a neural network model comprising I encoders and I decoders, where each encoder comprises N encoding units and each decoder comprises N decoding units. Taking the n-th encoding unit of the i-th encoder as an example, its input is the semantic feature output by the n-th encoding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th encoding unit of the i-th encoder; the semantic feature it outputs serves as the input of the (n+1)-th encoding unit of the i-th encoder and the n-th decoding unit of the i-th decoder, until the I-th decoder outputs semantic information. The depth of each pixel point in the image to be detected is determined according to the semantic information output by the I-th decoder, and the depth of an object included in the image to be detected is further determined. In this way, the I groups of encoders and decoders increase the number of paths inside the neural network and thereby improve the information flow in the network, so that semantic information and spatial information in the image to be detected are fully fused, the accuracy of the depth estimated by the neural network model is improved, and the accuracy of identifying the depth of objects in the image to be detected is further improved.
As a possible implementation, the encoder may rely on a base network to extract multi-resolution semantic features, i.e. the resolutions of the semantic features output by different coding units are different. Extracting multi-scale features through different coding units and using them as input can further improve the accuracy of the depth estimated by the neural network model. As a possible implementation, N may be 5, i.e. each encoder includes 5 coding units, and the resolutions of the semantic features output by the 5 coding units are 1/1, 1/2, 1/4, 1/8 and 1/16 of the original image respectively. For example, the 3rd coding unit of the 1st encoder outputs semantic features with a resolution of 1/4 of the original image.
The embodiment of the application does not specifically limit the base network. For example, the base network may include one or more convolutional layers for depth-aware information flow. Furthermore, the base network may adopt different convolutional neural network architectures, such as an 18-layer residual network (ResNet-18). As another example, the base network may be a larger model with better performance, so as to further improve the feature extraction capability and thus the performance of the neural network model. As yet another example, the base network may use a lightweight neural network structure (such as MobileNet) together with parameter sharing to obtain a lightweight model, so that the neural network model provided by the embodiment of the application can be more easily deployed and its application range extended.
As a possible implementation, the resolutions of the semantic features output by different encoding units are different. The following description takes as an example the case where the resolution of the semantic feature output by the j-th encoding unit of the I-th encoder is consistent with the resolution of the image to be detected, where j is a positive integer less than or equal to I.
The neural network model also includes a depth convolution layer for extracting the depth of each pixel point. The input of the depth convolution layer is the semantic information output by the j-th decoding unit of the I-th decoder, and its output is the depth of each pixel point in the image to be detected. The input of the j-th decoding unit of the I-th decoder is the semantic information output by the (j+1)-th decoding unit of the I-th decoder and the semantic feature output by the j-th encoding unit of the I-th encoder, i.e. the j-th decoding unit of the I-th decoder can learn features of the image to be detected at different resolutions, so that the depth of each pixel point determined by the depth convolution layer is more accurate, and the robustness of the neural network model in processing images of different resolutions is improved.
As a possible implementation, the resolution of the semantic features output by the 1st coding unit of the I-th encoder is consistent with the resolution of the image to be detected, and in the grid-like network structure included in the neural network model, the resolutions of the semantic features output by the coding units decrease from top to bottom, i.e. the resolution of the semantic features output by the (n-1)-th coding unit is greater than that of the semantic features output by the n-th coding unit. Compared with enlarging the image to be detected, downscaling it is more convenient and takes less time, so the robustness of the neural network model is improved while the processing remains simple and fast.
It should be noted that there may be 0 to 3 depth convolution layers, which output the depth of each pixel point in the image to be detected at the corresponding resolutions.
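As an illustration, one such depth convolution layer could look like the following sketch (the sigmoid-to-depth conversion and the depth range are assumptions borrowed from common self-supervised practice, not specified by the patent):

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    """Assumed depth convolution layer: 3x3 conv + sigmoid giving a normalized disparity,
    which is then converted to a per-pixel depth inside [min_depth, max_depth]."""
    def __init__(self, in_ch, min_depth=0.1, max_depth=100.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.min_depth, self.max_depth = min_depth, max_depth

    def forward(self, semantic_info):
        disp = torch.sigmoid(self.conv(semantic_info))
        # map disparity to inverse depth, then invert to obtain the depth of each pixel point
        inv_depth = 1.0 / self.max_depth + (1.0 / self.min_depth - 1.0 / self.max_depth) * disp
        return 1.0 / inv_depth
```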
Further, when the neural network model is trained, the neural network model to be trained further comprises N-1 depth convolution layers, and the first N-1 decoding units included in the I-th decoder each correspond to one depth convolution layer. This is described below with reference to fig. 3.
Referring to fig. 3, a schematic diagram of a neural network model according to an embodiment of the present application is shown. The neural network comprises two sets of encoder-decoder structures, each encoder comprising 5 coding units and each decoder comprising 5 decoding units, wherein the first 4 decoding units of the 2 nd decoder correspond to one depth convolutional layer respectively.
For example, taking the 1st encoder as an example, the image to be detected is input into the neural network model; the resolution of the semantic features output by the 1st coding unit of the 1st encoder (i.e. F00) is 1/1 of the original image resolution, the resolution of the semantic features output by the 2nd coding unit (i.e. F10) is 1/2 of the original image resolution, the resolution of the semantic features output by the 3rd coding unit (i.e. F20) is 1/4, the resolution of the semantic features output by the 4th coding unit (i.e. F30) is 1/8, and the resolution of the semantic features output by the 5th coding unit (i.e. F40) is 1/16 of the original image resolution. The first 4 decoding units of the 2nd decoder are each connected to a depth convolution layer.
In fig. 3, outputs of different resolutions are connected between columns using residual connections (dashed arrows in the figure); solid downward arrows represent downsampling and solid upward arrows represent upsampling.
Therefore, after the image to be detected is input into the neural network model, the semantic information output by the I-th decoder is input into the corresponding depth convolution layers. By setting different loss functions and gradient backpropagation for the different depth convolution layers, the depth of the pixel points at each resolution is determined, so that the trained neural network model can recognize images of different resolutions and meet practical requirements.
It should be noted that, when training the neural network model, a large-resolution image may also be used as an input, so as to improve the capability of the model to extract structural information.
As a possible implementation manner, the neural network model further includes a feature fusion module, through which features of different resolutions input to the encoding unit are fused. The feature fusion module comprises a splicing layer, a weight layer, an adjusting layer and a first fusion layer. See in particular S401-S404:
S401: for the n-th coding unit of the i-th encoder, input the semantic feature output by the n-th coding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th coding unit of the i-th encoder into the splicing layer, and obtain, through the splicing layer, a spliced feature comprising a plurality of parts.
By means of the splicing layer, the low-resolution features and the upper-layer high-resolution features are spliced, and the spliced features comprising a plurality of parts are obtained.
S402: and inputting the splicing characteristics into a weight layer, and determining the weight corresponding to each part.
Through the weight layer, a weight corresponding to each of the plurality of portions included in the splice feature may be determined.
S403: and multiplying each feature in the spliced features by the corresponding weight through the adjustment layer to obtain the adjustment features.
S404: and inputting the adjustment features into the first fusion layer to obtain first fusion features.
Therefore, the multi-scale features can be fused better through the feature fusion module so as to combine semantic information and space information in the features, and the accuracy of determining the depth of each pixel point in the image to be detected by the neural network model is further improved.
Taking the 3rd coding unit of the 2nd encoder in fig. 3 (i.e. F22) as an example: the semantic feature output by F20, the semantic information output by F21 and the semantic feature output by F12 are input to F22, and F22 outputs a semantic feature. This is described in detail below with reference to fig. 4.
Referring to fig. 4, a schematic diagram of a feature fusion module according to an embodiment of the present application is shown. When the neural network model further comprises a feature fusion module, the semantic feature output by F20, the semantic information output by F21 and the semantic feature output by F12 are input into the splicing layer, and a spliced feature comprising three parts is obtained through the splicing layer. The spliced feature is input into the weight layer, which determines the weight corresponding to each of the three parts; each feature in the spliced feature is multiplied by its corresponding weight through the adjustment layer to obtain the adjustment feature, and the adjustment feature is input into the first fusion layer to obtain the first fusion feature.
The first fusion layer is a 1×1 convolution. The obtained first fusion feature is input into the corresponding coding unit, i.e. F22, and F22 outputs a semantic feature.
It should be noted that the weight layer may further include a convolution layer, a fully connected layer and a normalized exponential function (softmax) layer to compute the weight of each part.
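Under the assumption that the three inputs have already been resampled to a common resolution and channel width, the feature fusion module could be sketched as follows (illustrative only; layer sizes and the exact weight-layer design are not fixed by the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Splicing layer -> weight layer (conv + FC + softmax) -> adjustment layer -> 1x1 fusion."""
    def __init__(self, ch_per_part, num_parts=3):
        super().__init__()
        total = ch_per_part * num_parts
        self.weight_conv = nn.Conv2d(total, total, kernel_size=3, padding=1)
        self.fc = nn.Linear(total, num_parts)
        self.first_fusion = nn.Conv2d(total, ch_per_part, kernel_size=1)  # 1x1 convolution

    def forward(self, parts):                      # parts: list of tensors with identical shape
        spliced = torch.cat(parts, dim=1)          # splicing layer: spliced feature of several parts
        pooled = F.adaptive_avg_pool2d(self.weight_conv(spliced), 1).flatten(1)
        weights = torch.softmax(self.fc(pooled), dim=1)   # weight layer: one weight per part
        adjusted = torch.cat(
            [p * weights[:, k].view(-1, 1, 1, 1) for k, p in enumerate(parts)], dim=1
        )                                          # adjustment layer: rescale each part
        return self.first_fusion(adjusted)         # first fusion feature
```

The first fusion feature produced in this way would then be passed to the corresponding coding unit, e.g. F22 in the example above.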
As a possible implementation, the neural network may further comprise a base module by which the features of different resolutions input to the decoding unit are fused. Specifically, the semantic features output by the nth encoding unit of the ith encoder and the semantic information output by the (n+1) th decoding unit of the ith decoder are input into a base module, and a second fusion feature is obtained through the base module so as to be input into the nth decoding unit of the ith decoder.
The embodiment of the application is not particularly limited to the basic module, and will be described by taking two modes as examples.
Mode one: the basic module comprises a second fusion layer, a first convolution layer, a first batch processing layer, a third fusion layer and a first activation layer.
It should be noted that, the second fusion layer is used for fusing the features with different resolutions, the first convolution layer is used for extracting the features, the first batch processing layer is used for normalizing the features, the third fusion layer is used for fusing the different features, and the first activation layer is used for improving the difference between the features so as to further improve the generalization. The following is a description with reference to fig. 5.
Referring to fig. 5, a schematic diagram of a base module according to an embodiment of the present application is shown. The semantic features (horizontal input as shown in fig. 5) output by the nth coding unit of the ith encoder and the semantic information (vertical input as shown in fig. 5) output by the (n+1) th decoding unit of the ith decoder are input to a second fusion layer, the features with different resolutions are fused through the second fusion layer to obtain second fusion sub-features, the second fusion sub-features are sequentially input to the first convolution layer and the first batch processing layer to obtain first extraction features, and the first extraction features and the second fusion sub-features are sequentially input to the third fusion layer and the first activation layer to obtain second fusion features, so that the second fusion features are input to the nth decoding unit of the ith decoder.
Mode two: the basic module comprises a second fusion layer, a first convolution layer, a first batch processing layer, a second activation layer, a second convolution layer, a second batch processing layer, a third fusion layer and a first activation layer.
It should be noted that the weights of the first convolution layer are different from those of the second convolution layer, so that the learned characteristics are further adjusted by the different weights. The second activation layer is used for improving the difference between the features so as to further improve generalization, the second convolution layer is used for extracting the features, and the second batch processing layer is used for normalizing the features. The following is a description with reference to fig. 6.
Referring to fig. 6, a schematic diagram of a base module according to an embodiment of the present application is shown. The semantic features output by the n-th coding unit of the i-th encoder (horizontal input in fig. 6) and the semantic information output by the (n+1)-th decoding unit of the i-th decoder (vertical input in fig. 6) are input to the second fusion layer, which fuses the features of different resolutions to obtain the second fusion sub-feature. The second fusion sub-feature is sequentially input to the first convolution layer and the first batch processing layer to obtain the first extraction feature; the first extraction feature is sequentially input to the second activation layer, the second convolution layer and the second batch processing layer to obtain the second extraction feature; and the second extraction feature and the second fusion sub-feature are sequentially input to the third fusion layer and the first activation layer to obtain the second fusion feature, which is input to the n-th decoding unit of the i-th decoder.
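A sketch of the two modes, under the assumption that the second fusion layer adds the (upsampled) semantic information to the semantic feature; the convolution sizes and the choice of ReLU as activation are illustrative, not taken from the patent:

```python
import torch.nn as nn
import torch.nn.functional as F

class BaseModuleModeOne(nn.Module):
    """Mode one: fuse -> conv -> BN, residual add of the fused feature, then activation."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)

    def fuse(self, enc_feat, dec_info):
        # second fusion layer: bring the semantic information to the feature's resolution and add
        dec_info = F.interpolate(dec_info, size=enc_feat.shape[-2:], mode="nearest")
        return enc_feat + dec_info

    def forward(self, enc_feat, dec_info):
        fused = self.fuse(enc_feat, dec_info)        # second fusion sub-feature
        extracted = self.bn1(self.conv1(fused))      # first extraction feature
        return F.relu(extracted + fused)             # third fusion layer + first activation layer


class BaseModuleModeTwo(BaseModuleModeOne):
    """Mode two: a second conv/BN pair with its own weights is applied before the residual add."""
    def __init__(self, ch):
        super().__init__(ch)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, enc_feat, dec_info):
        fused = self.fuse(enc_feat, dec_info)
        first = self.bn1(self.conv1(fused))              # first extraction feature
        second = self.bn2(self.conv2(F.relu(first)))     # second activation + second extraction
        return F.relu(second + fused)                    # third fusion layer + first activation layer
```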
As a possible implementation, in the process of training the neural network, a pose estimation model whose first-stage convolution has 6 input channels may be used for training. This is described in detail below.
Two adjacent frames are input into the pose estimation model, and the pose estimation network predicts a single 6-DoF relative pose, i.e. rotation and translation parameters (DoF stands for degrees of freedom: in addition to the 3 rotation angles, there are 3 translational degrees of freedom: up/down, forward/backward and left/right), in other words the pose difference between the two adjacent frames. Then, the two adjacent frames are respectively input into the neural network model to be trained to obtain the depth of each pixel point of each frame, so that the depth difference of each pixel point between the two adjacent frames is obtained, and the neural network model to be trained is optimized according to the depth difference and the pose difference to obtain the neural network model. For example, if the depth difference and the pose difference do not match (e.g. the depth difference is small while the pose difference is large), the neural network to be trained needs to be adjusted.
It should be noted that the pose estimation model may be built on top of ResNet-18, with the number of input channels of the first-stage convolution increased from 3 to 6 so that a pair of adjacent frames can be fed to the network. Correspondingly, the output of the pose estimation model is the relative pose parameterized as a 6-DoF vector, where the first three dimensions represent the translation vector and the last three dimensions represent the Euler angles.
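A sketch of that input-channel change, assuming torchvision's ResNet-18 and a plain fully connected head (the pose decoder in such methods is often convolutional; this is a simplification, not the patent's exact design):

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet(nn.Module):
    """ResNet-18 backbone whose first convolution takes 6 channels (two stacked RGB frames);
    the head regresses a 6-DoF relative pose: [tx, ty, tz, rx, ry, rz]."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, 6)
        self.backbone = backbone

    def forward(self, frame_t, frame_t_adjacent):
        # stack the target frame and its adjacent frame along the channel axis
        return self.backbone(torch.cat([frame_t, frame_t_adjacent], dim=1))
```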
The following describes training the neural network model using the KITTI2015 dataset. The KITTI dataset is currently the largest international benchmark dataset for evaluating computer vision algorithms in autonomous driving scenarios. It is used to evaluate the performance of computer vision technologies such as stereo matching (stereo), optical flow, visual odometry and 3D object detection in a vehicle-mounted environment. KITTI contains real image data collected from urban, rural and highway scenes, with up to 15 vehicles and 30 pedestrians per image as well as various degrees of occlusion and truncation.
The scheme may train the model on video sequences from the KITTI2015 dataset by building the architecture on a self-supervised loss function over sequences of moving images (monocular or binocular). The model architecture includes two models: a neural network model that determines the depth of each pixel point of a monocular image, and a pose estimation model that determines the pose between adjacent frames. This approach does not require labels for the training dataset; instead, it trains using the reprojection relationship between successive time frames and poses in the image sequence.
Training is performed using the reprojection relationship between successive time frames and poses in the image sequence. The remapping loss function is obtained as follows: a video clip is input, the target image is assumed to be frame t, and frame t is taken as the input of the neural network model to obtain the depth D_t of each pixel point in the target image. The adjacent frame t′ (frame t+1 or frame t-1) is input into the pose estimation model to obtain the camera motion T_{t→t′}. Given the camera intrinsics K, the reconstruction process starts from computing the transformation matrix from the adjacent frame (frame t+1 or frame t-1). The rotation and translation information is then used to compute a mapping from the adjacent frame to the target frame. Finally, the depth map of the target image predicted by the neural network model and the transformation matrix obtained from the pose estimation model are projected through the camera with intrinsic matrix K to obtain a reconstructed target image. This process converts the depth map into a 3D point cloud, transforms the 3D point cloud into the other camera's coordinate system, and then converts the 3D points into 2D points using the camera intrinsics; the resulting 2D points are used as a sampling grid for bilinear interpolation from the source image. The loss function L_p is a similarity loss between the target image and the reconstructed target image, expressed as follows:
$$L_p = \sum_{t'} pe\left(I_t, I_{t' \to t}\right), \qquad I_{t' \to t} = I_{t'} \left\langle \operatorname{proj}(D_t, T_{t \to t'}, K) \right\rangle$$
where pe is the photometric reconstruction error, such as the L1 distance in pixel space; I_t is the target image and I_{t′} is the source image; proj(·) returns the 2D coordinates obtained by mapping the depth D_t into I_{t′}; and ⟨·⟩ denotes the sampling operation. For simplicity of notation, the intrinsics of all images are assumed to be equal. The model samples the source image using bilinear interpolation. Meanwhile, the model combines an L1-norm loss (which minimizes the sum of absolute differences between target and estimated values) and an SSIM loss (which compares the similarity of two images in terms of brightness, contrast and structure) to form the final remapping loss, specifically expressed as follows:

$$pe(I_a, I_b) = \frac{\alpha}{2}\left(1 - \operatorname{SSIM}(I_a, I_b)\right) + (1 - \alpha)\left\lVert I_a - I_b \right\rVert_1$$

where α = 0.85. The purpose of the remapping loss function is to reduce the difference between the target image and the reconstructed target image; it is used in estimating both the pose estimation model and the neural network model.
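A minimal sketch of the reconstruction I_{t′→t} described above, assuming PyTorch tensors, a 4×4 relative-pose matrix and batched intrinsics (all names and shapes are illustrative; the patent gives no code):

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(src, depth_t, T_t_to_src, K):
    """Back-project target pixels with depth D_t, move them with the relative pose T_{t->t'},
    project them with intrinsics K, and bilinearly sample the source image.
    Assumed shapes: src (B,3,H,W), depth_t (B,1,H,W), T_t_to_src (B,4,4), K (B,3,3)."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=src.dtype, device=src.device),
        torch.arange(W, dtype=src.dtype, device=src.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)   # (B,3,HW)

    cam = torch.inverse(K) @ pix * depth_t.view(B, 1, -1)          # 3D point cloud in target frame
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, dtype=src.dtype, device=src.device)], dim=1)
    proj = K @ (T_t_to_src @ cam_h)[:, :3]                          # project into the source camera
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # normalize pixel coordinates to [-1, 1] for bilinear sampling
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear", padding_mode="border", align_corners=True)
```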
Because the model assumes a moving camera in a static scene, the resulting reprojection error is multiplied by a mask that handles violations of this assumption. In particular, an object may move at a speed similar to the camera, or the camera may be stationary while other objects move, i.e. there are objects that are stationary in the camera coordinate system. Such relatively stationary objects would, in theory, have infinite depth. The scheme solves this problem with an automatic masking method that filters out pixels that move synchronously with the camera. The mask is binary: it is 1 if the minimum photometric error between the target image and the reconstructed target image is smaller than the minimum photometric error between the target image and the source image, and 0 otherwise. When the camera is stationary, this approach can mask all pixels in the image (a low-probability situation in practice). When an object moves at the same speed as the camera, it causes the pixels of that object, which appears stationary in the image, to be masked.
The model thus uses μ to retain only the pixels whose remapping loss between the target image and its reconstructed image is smaller than the remapping loss between the target image and the source image:

$$\mu = \left[\, \min_{t'} pe\left(I_t, I_{t' \to t}\right) < \min_{t'} pe\left(I_t, I_{t'}\right) \,\right]$$

where [·] is the Iverson bracket, which prevents problems when both the camera and another object move at similar speeds. The scheme also merges the losses of the individual scales: the lower-resolution depth maps are upsampled to the input image resolution, then reprojected and resampled at that higher resolution, and the photometric error is computed there. This makes the depth maps at all scales work toward the same target, namely an accurate high-resolution reconstruction of the target image. An edge-aware smoothness loss between the input/target images is then added, which encourages the model to learn sharp edges and eliminate noise, as follows:
$$L_S = \left\lvert \partial_x d_t \right\rvert e^{-\left\lvert \partial_x I_t \right\rvert} + \left\lvert \partial_y d_t \right\rvert e^{-\left\lvert \partial_y I_t \right\rvert}, \qquad L = \mu L_p + \gamma L_S$$

where L_S is the smoothness loss, L_p is the remapping loss, L is the final loss function, ∂ denotes the partial derivative of the depth map and of the RGB (red, green, blue) image, d_t denotes the depth of the image at time t, the exponential is applied element-wise, and γ is the weight of the smoothness loss, a hyperparameter set to, for example, 1×10⁻³.
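A sketch of these loss terms, following the Monodepth2-style formulation that the text describes (the SSIM window, depth normalization and reduction choices are assumptions, not taken verbatim from the patent):

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2)
    return (num / den).clamp(0, 1)

def pe(target, recon, alpha=0.85):
    """Photometric error: alpha/2 * (1 - SSIM) + (1 - alpha) * L1, averaged over channels."""
    l1 = (target - recon).abs().mean(1, keepdim=True)
    return alpha / 2 * (1 - ssim(target, recon)).mean(1, keepdim=True) + (1 - alpha) * l1

def total_loss(target, recons, sources, depth, gamma=1e-3):
    """Sketch of L = mu * L_p + gamma * L_S at one scale.
    recons / sources are lists over the adjacent frames t' (t-1 and t+1)."""
    loss_recon = torch.cat([pe(target, r) for r in recons], dim=1).min(1, keepdim=True)[0]
    loss_ident = torch.cat([pe(target, s) for s in sources], dim=1).min(1, keepdim=True)[0]
    mu = (loss_recon < loss_ident).float()            # automatic mask (Iverson bracket)
    L_p = (mu * loss_recon).mean()

    d = depth / depth.mean(dim=[2, 3], keepdim=True)  # mean-normalized depth (assumed)
    grad_dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    grad_dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    grad_ix = (target[:, :, :, :-1] - target[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_iy = (target[:, :, :-1, :] - target[:, :, 1:, :]).abs().mean(1, keepdim=True)
    L_s = (grad_dx * torch.exp(-grad_ix)).mean() + (grad_dy * torch.exp(-grad_iy)).mean()
    return L_p + gamma * L_s
```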
Next, taking the neural network shown in fig. 3 as an example, training and use of the neural network on images from the KITTI2015 dataset are described.
The training process is described first.
The KITTI2015 dataset was divided into a training set and a test set using the Eigen split. The same camera intrinsics K are used for every image of the training set: the principal point is set to the image center and the focal length to the average focal length over all images. Each image is resized to a fixed size; in the embodiment of the present application the resolution is adjusted to 196 x 640.
When constructing the training set, not only the current image is saved for each sample, but the previous and next frames are also stored as input for computing the reprojection loss function; in addition, scaling the current image to 1/2, 1/4, 1/8 and 1/16 of its size strengthens the robustness of the model through multi-resolution losses.
The multi-scale features of the input image are extracted with a ResNet-18 network pre-trained on the ImageNet dataset (ImageNet: A large-scale hierarchical image database), which serves as the encoder of the neural network model. The pose estimation model and the neural network model are trained for 20 epochs with a batch size of 12; the initial learning rate is 1e-3 and decays to 1e-4 after 15 epochs. The trained neural network model is obtained through this training procedure.
After the training process is described, a test process is described.
In the test part, the similarity between the predicted depth and the ground-truth depth labels is compared, and the results are computed under several metrics, for example one or more of absolute relative error (abs_rel), squared relative error (sq_rel), root mean square error (rmse), logarithmic root mean square error (rmse_log) and accuracy (a1, a2, a3). The accuracy is defined as follows:

$$a_k = \frac{1}{N} \sum_{i=1}^{N} \left[ \max\!\left(\frac{D_i}{D_i^{*}}, \frac{D_i^{*}}{D_i}\right) < T \right]$$

where N is the total number of pixels, D_i is the depth of the i-th pixel, D_i^{*} is the ground-truth depth corresponding to the i-th pixel, and T is a threshold; three thresholds are used in this embodiment: 1.25 (a1), 1.25² (a2) and 1.25³ (a3).
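For reference, these metrics could be computed as in the following sketch (NumPy; masking of invalid ground-truth pixels is an assumption):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics over valid ground-truth pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(pred / gt, gt / pred)
    a1, a2, a3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, a1=a1, a2=a2, a3=a3)
```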
The comparison of the test results of this example with the test results of the Monodepth2 method is shown in Table 1.
TABLE 1

model            abs_rel  sq_rel  rmse   rmse_log  a1     a2     a3
MonoDepth2       0.115    0.903   4.863  0.193     0.877  0.959  0.981
This embodiment  0.108    0.829   4.747  0.188     0.884  0.960  0.983
It should be noted that smaller values of abs_rel, sq_rel, rmse and rmse_log indicate a better model, while larger values of a1, a2 and a3 indicate a better model. It can be seen that, on the KITTI2015 test set, the test results of this embodiment are a clear improvement over the Monodepth2 model.
Therefore, the embodiment of the application adopts a cascaded structure of multiple groups of encoders and decoders together with a feature fusion module, and the neural network model learns automatically from unlabeled video sequence data through self-supervised learning, thereby improving performance. Because the semantic information of the image is better extracted and the multi-resolution features are better fused, the contours of objects are clearer in the visualization of the test set, as shown in fig. 7.
The embodiment of the application also provides a data processing device aiming at the data processing method provided by the embodiment.
Referring to fig. 8, a schematic diagram of a data processing apparatus according to an embodiment of the present application is shown. The data processing apparatus 800 includes: an acquisition unit 801, a neural network model unit 802, and a determination unit 803;
the acquiring unit 801 is configured to acquire an image to be detected;
the neural network model unit 802 is configured to input the image to be detected into a neural network model including I encoders and I decoders, and determine a depth of each pixel point in the image to be detected according to semantic information output by an I decoder of the neural network model; each encoder comprises N encoding units, each decoder comprises N decoding units, the input of the nth encoding unit of the ith encoder is the semantic feature output by the nth encoding unit of the ith-1 encoder, the semantic information output by the nth decoding unit of the ith-1 decoder and the semantic feature output by the nth-1 encoding unit of the ith encoder, the semantic feature output by the nth encoding unit of the ith encoder is input to the (n+1) th encoding unit of the ith encoder and the nth decoding unit of the ith decoder, I and N are integers larger than 1, I is a positive integer smaller than or equal to I, and N is a positive integer smaller than or equal to N;
The determining unit 803 is configured to determine a depth of an object included in the image to be detected according to a depth of each pixel point in the image to be detected.
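For illustration, the feature routing performed by the neural network model unit 802 can be sketched as follows (the unit modules, the fusion helper, and the handling of the first encoder and of the first unit of each encoder are placeholders and assumptions; the actual encoding units, decoding units and fusion modules are those described in this application):

```python
import torch.nn as nn

class MultiEncoderDecoder(nn.Module):
    """I encoder-decoder groups, each with N units (0-based indices below).
    For i > 0, the n-th encoding unit of encoder i receives (a) the semantic
    feature from the n-th encoding unit of encoder i-1, (b) the semantic
    information from the n-th decoding unit of decoder i-1 and (c) the
    semantic feature from the previous unit of encoder i (or the input image
    for the first unit), fused by a fusion module."""

    def __init__(self, enc_units, dec_units, fuse_modules):
        super().__init__()
        self.enc_units = enc_units      # nn.ModuleList of nn.ModuleList, [I][N]
        self.dec_units = dec_units      # nn.ModuleList of nn.ModuleList, [I][N]
        self.fuse = fuse_modules        # nn.ModuleList of nn.ModuleList, [I][N]
        self.I, self.N = len(enc_units), len(enc_units[0])

    def forward(self, x):
        enc, dec = {}, {}
        for i in range(self.I):
            for n in range(self.N):                      # encoder i, shallow -> deep
                prev = x if n == 0 else enc[(i, n - 1)]
                if i == 0:
                    inp = prev                           # first encoder: plain chain
                else:
                    inp = self.fuse[i][n](enc[(i - 1, n)], dec[(i - 1, n)], prev)
                enc[(i, n)] = self.enc_units[i][n](inp)
            for n in reversed(range(self.N)):            # decoder i, deep -> shallow
                upper = None if n == self.N - 1 else dec[(i, n + 1)]
                dec[(i, n)] = self.dec_units[i][n](enc[(i, n)], upper)
        return dec[(self.I - 1, 0)]      # semantic information of the I-th decoder
```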
As a possible implementation manner, the neural network model further includes a depth convolution layer, and the neural network model unit 802 is configured to:
inputting the semantic information output by the j-th decoding unit of the I-th decoder of the neural network model to the depth convolution layer, and determining the depth of each pixel point in the image to be detected through the depth convolution layer; the input of the j-th decoding unit of the I-th decoder is semantic information output by the j-th decoding unit of the I-th decoder and semantic features output by the j-th encoding unit of the I-th encoder, the resolution of the semantic features output by the j-th encoding unit of the I-th encoder is consistent with the resolution of the image to be detected, and j is a positive integer less than or equal to I.
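A minimal sketch of such a depth convolution layer is given below (the sigmoid activation and the disparity-to-depth conversion range are common choices and are assumptions, not details stated above):

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    """Depth convolution layer: maps the full-resolution semantic information
    from the last decoder to a per-pixel depth map."""
    def __init__(self, in_channels, min_depth=0.1, max_depth=100.0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.min_depth, self.max_depth = min_depth, max_depth

    def forward(self, feat):                 # feat: (B, C, H, W), image resolution
        disp = torch.sigmoid(self.conv(feat))
        # convert normalised disparity to depth within [min_depth, max_depth]
        min_disp, max_disp = 1.0 / self.max_depth, 1.0 / self.min_depth
        depth = 1.0 / (min_disp + (max_disp - min_disp) * disp)
        return depth                         # (B, 1, H, W): depth of each pixel
```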
As a possible implementation manner, the neural network model further includes a fusion feature module, where the fusion feature module includes a splicing layer, a weight layer, an adjusting layer, and a first fusion layer, and the apparatus further includes a first fusion unit configured to:
Inputting semantic features output by an nth coding unit of an ith-1 encoder, semantic information output by an nth decoding unit of an ith-1 decoder and semantic features output by an nth-1 coding unit of the ith encoder into the splicing layer, and obtaining splicing features comprising a plurality of parts through the splicing layer;
inputting the splicing characteristics into the weight layer, and determining the weight corresponding to each part;
multiplying each feature in the spliced features by a corresponding weight through the adjustment layer to obtain adjustment features;
and inputting the adjustment characteristic into the first fusion layer to obtain a first fusion characteristic so as to be input into an nth coding unit of the ith encoder.
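For illustration, the fusion feature module described above can be sketched as follows (assuming each input part has the same number of channels; the use of global average pooling, a 1x1 convolution and a sigmoid in the weight layer, and of a 1x1 convolution as the first fusion layer, are assumptions made for illustration):

```python
import torch
import torch.nn as nn

class FusionFeatureModule(nn.Module):
    """Splicing layer -> weight layer (one weight per part) -> adjustment
    layer (rescale each part by its weight) -> first fusion layer."""
    def __init__(self, channels, parts=3):
        super().__init__()
        self.weight_layer = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(parts * channels, parts, kernel_size=1),
            nn.Sigmoid())
        self.first_fusion = nn.Conv2d(parts * channels, channels, kernel_size=1)

    def forward(self, *features):            # three inputs of shape (B, C, H, W)
        spliced = torch.cat(features, dim=1)              # splicing layer
        w = self.weight_layer(spliced)                    # (B, parts, 1, 1)
        adjusted = [f * w[:, k:k + 1] for k, f in enumerate(features)]
        adjusted = torch.cat(adjusted, dim=1)             # adjustment layer output
        return self.first_fusion(adjusted)                # first fusion feature
```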
As a possible implementation manner, the neural network further includes a base module, and the apparatus further includes a second fusing unit configured to:
inputting semantic features output by an nth encoding unit of an ith encoder and semantic information output by an (n+1) th decoding unit of an ith decoder into the base module, and obtaining a second fusion feature through the base module so as to be input into the nth decoding unit of the ith decoder.
As a possible implementation manner, the base module includes a second fusion layer, a first convolution layer, a first batch processing layer, a third fusion layer, and a first activation layer, where the second fusion unit is configured to:
inputting semantic features output by an nth coding unit of an ith encoder and semantic information output by an (n+1) th decoding unit of an ith decoder into the second fusion layer, and obtaining second fusion sub-features through the second fusion layer;
inputting the second fusion sub-feature into the first convolution layer, and obtaining a first extracted feature through the first convolution layer and the first batch processing layer;
and inputting the first extracted feature and the second fusion sub-feature into the third fusion layer, and obtaining the second fusion feature through the third fusion layer and the first activation layer.
As a possible implementation manner, the base module includes a second fusion layer, a first convolution layer, a first batch processing layer, a second activation layer, a second convolution layer, a second batch processing layer, a third fusion layer, and a first activation layer, where a weight of the first convolution layer is different from a weight of the second convolution layer, and the second fusion unit is configured to:
Inputting semantic features output by an nth coding unit of an ith encoder and semantic information output by an (n+1) th decoding unit of an ith decoder into the second fusion layer, and obtaining second fusion sub-features through the second fusion layer;
inputting the second fusion sub-feature into the first convolution layer, and obtaining a first extracted feature through the first convolution layer and the first batch processing layer;
inputting the first extracted features to the second activation layer, and obtaining second extracted features through the second activation layer, the second convolution layer and the second batch processing layer;
and inputting the second extracted feature and the second fusion sub-feature into the third fusion layer, and obtaining the second fusion feature through the third fusion layer and the first activation layer.
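The two base-module variants described above can be sketched together as follows (concatenation in the second fusion layer, element-wise addition in the third fusion layer and ReLU activations are assumptions made for illustration; `double_conv=False` corresponds to the variant with a single convolution and batch-processing stage):

```python
import torch
import torch.nn as nn

class BaseModule(nn.Module):
    """Fuse the encoder feature and decoder information (second fusion layer),
    extract features through one or two convolution + batch-normalisation
    stages with different weights, add the result back to the fused input
    (third fusion layer) and apply the first activation layer."""
    def __init__(self, enc_channels, dec_channels, out_channels, double_conv=True):
        super().__init__()
        self.second_fusion = nn.Conv2d(enc_channels + dec_channels, out_channels, 1)
        self.conv1 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.double_conv = double_conv
        if double_conv:                       # second stage with its own weights
            self.act2 = nn.ReLU(inplace=True)
            self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
            self.bn2 = nn.BatchNorm2d(out_channels)
        self.act1 = nn.ReLU(inplace=True)     # first activation layer

    def forward(self, enc_feat, dec_info):
        fused_sub = self.second_fusion(torch.cat([enc_feat, dec_info], dim=1))
        x = self.bn1(self.conv1(fused_sub))               # first extracted feature
        if self.double_conv:
            x = self.bn2(self.conv2(self.act2(x)))        # second extracted feature
        return self.act1(x + fused_sub)                   # third fusion + activation
```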
As a possible implementation manner, the apparatus further includes a training unit, configured to:
the neural network model is trained by a pose estimation model with a number of first-stage convolution channels of 6.
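A minimal sketch of a pose-estimation encoder whose first-stage convolution has 6 channels, i.e. two RGB frames concatenated along the channel dimension, is given below (re-using a torchvision ResNet-18 backbone, and torchvision >= 0.13 for the `weights` argument, are assumptions for illustration):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_pose_encoder():
    """Pose-estimation encoder with a 6-channel first-stage convolution."""
    net = resnet18(weights=None)
    # replace the default 3-channel stem with a 6-channel one (two RGB frames)
    net.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return net

# Usage: concatenate two frames of shape (B, 3, H, W) along the channel axis,
# e.g. pose_input = torch.cat([frame_t, frame_t_plus_1], dim=1)  # (B, 6, H, W)
```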
According to the technical scheme, the acquired image to be detected is input into a neural network model comprising I encoders and I decoders, wherein each encoder comprises N encoding units and each decoder comprises N decoding units. Taking the n-th encoding unit of the i-th encoder as an example, its input is the semantic feature output by the n-th encoding unit of the (i-1)-th encoder, the semantic information output by the n-th decoding unit of the (i-1)-th decoder, and the semantic feature output by the (n-1)-th encoding unit of the i-th encoder; the semantic feature it outputs serves as the input of the (n+1)-th encoding unit of the i-th encoder and of the n-th decoding unit of the i-th decoder, until the I-th decoder outputs semantic information. The depth of each pixel point in the image to be detected is determined according to the semantic information output by the I-th decoder, and the depth of an object included in the image to be detected is further determined. In this way, the I groups of encoders and decoders increase the number of paths in the neural network so as to improve the information flow in the neural network, the semantic information and spatial information in the image to be detected are fully fused, the accuracy with which the neural network model estimates depth is improved, and the accuracy of identifying the depth of objects in the image to be detected is further improved.
The embodiment of the application further provides a computer device, which is the computer device introduced above. The computer device may be a server or a terminal device, and the data processing apparatus described above may be built into the server or the terminal device. The computer device provided by the embodiment of the application is described below from the perspective of hardware implementation. Fig. 9 is a schematic structural diagram of a server, and fig. 10 is a schematic structural diagram of a terminal device.
Referring to fig. 9, which is a schematic diagram of a server structure according to an embodiment of the present application, the server 1400 may vary considerably in configuration or performance and may include one or more central processing units (Central Processing Units, CPU) 1422, a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing application programs 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transitory or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the CPU 1422 may be configured to communicate with the storage medium 1430 and execute, on the server 1400, the series of instruction operations in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
Wherein, the CPU 1422 is configured to perform the following steps:
acquiring an image to be detected;
inputting the image to be detected into a neural network model comprising I encoders and I decoders, and determining the depth of each pixel point in the image to be detected through semantic information output by the I decoder of the neural network model; each encoder comprises N encoding units, each decoder comprises N decoding units, the input of the nth encoding unit of the ith encoder is the semantic feature output by the nth encoding unit of the ith-1 encoder, the semantic information output by the nth decoding unit of the ith-1 decoder and the semantic feature output by the nth-1 encoding unit of the ith encoder, the semantic feature output by the nth encoding unit of the ith encoder is input to the (n+1) th encoding unit of the ith encoder and the nth decoding unit of the ith decoder, I and N are integers larger than 1, I is a positive integer smaller than or equal to I, and N is a positive integer smaller than or equal to N;
And determining the depth of an object included in the image to be detected according to the depth of each pixel point in the image to be detected.
Optionally, the CPU 1422 may also perform method steps of any specific implementation of the data processing method in the embodiment of the present application.
Referring to fig. 10, the structure of a terminal device according to an embodiment of the present application is shown. Fig. 10 is a block diagram illustrating a part of a structure of a smart phone related to a terminal device provided by an embodiment of the present application, where the smart phone includes: radio Frequency (RF) circuitry 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuitry 1560, wireless fidelity (WiFi) module 1570, processor 1580, power supply 1590, and the like. Those skilled in the art will appreciate that the smartphone structure shown in fig. 10 is not limiting of the smartphone and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The following describes each component of the smart phone in detail with reference to fig. 10:
the RF circuit 1510 may be used for receiving and transmitting signals during a message or a call, and particularly, after receiving downlink information of a base station, the signal is processed by the processor 1580; in addition, the data of the design uplink is sent to the base station.
The memory 1520 may be used to store software programs and modules, and the processor 1580 implements various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 1520.
The input unit 1530 may be used to receive input numerical or character information and to generate key signal inputs related to user settings and function control of the smart phone. In particular, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also referred to as a touch screen, may collect touch operations by the user on or near it and drive the corresponding connection device according to a predetermined program. In addition to the touch panel 1531, the other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by a user or information provided to the user and various menus of the smart phone. The display unit 1540 may include a display panel 1541, and optionally, the display panel 1541 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like.
The smartphone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the smart phone are not described in detail herein.
The audio circuit 1560, the speaker 1561, and the microphone 1562 may provide an audio interface between the user and the smart phone. On one hand, the audio circuit 1560 may transmit an electrical signal converted from received audio data to the speaker 1561, which converts it into a sound signal for output; on the other hand, the microphone 1562 converts collected sound signals into electrical signals, which are received by the audio circuit 1560 and converted into audio data. The audio data is then processed by the processor 1580 and sent, for example, to another smart phone via the RF circuit 1510, or output to the memory 1520 for further processing.
Processor 1580 is a control center of the smartphone, connects various parts of the entire smartphone with various interfaces and lines, performs various functions of the smartphone and processes data by running or executing software programs and/or modules stored in memory 1520, and invoking data stored in memory 1520. In the alternative, processor 1580 may include one or more processing units.
The smart phone also includes a power source 1590 (e.g., a battery) for powering the various components. The power source may be logically connected to the processor 1580 through a power management system, so as to manage charging, discharging, and power consumption.
Although not shown, the smart phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In an embodiment of the present application, the memory 1520 included in the smart phone may store program codes and transmit the program codes to the processor.
The processor 1580 included in the smart phone may execute the data processing method provided in the foregoing embodiment according to the instructions in the program code.
The embodiment of the application also provides a computer readable storage medium for storing a computer program for executing the data processing method provided in the above embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data processing methods provided in the various alternative implementations of the above aspects.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, where the above program may be stored in a computer readable storage medium, and when the program is executed, the program performs steps including the above method embodiments; and the aforementioned storage medium may be at least one of the following media: read-Only Memory (ROM), RAM, magnetic disk or optical disk, etc.
It should be noted that, in the present specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments for the relevant parts. The apparatus and system embodiments described above are merely illustrative: components described as separate may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is only one specific embodiment of the present application, but the protection scope of the present application is not limited thereto; any changes or substitutions easily conceivable by those skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Based on the implementations provided in the above aspects, further combinations may be made to provide further implementations. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. A method of data processing, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a neural network model comprising I encoders and I decoders, and determining the depth of each pixel point in the image to be detected through semantic information output by the I decoder of the neural network model; each encoder comprises N encoding units, each decoder comprises N decoding units, the input of the nth encoding unit of the ith encoder is the semantic feature output by the nth encoding unit of the ith-1 encoder, the semantic information output by the nth decoding unit of the ith-1 decoder and the semantic feature output by the nth-1 encoding unit of the ith encoder, the semantic feature output by the nth encoding unit of the ith encoder is input to the (n+1) th encoding unit of the ith encoder and the nth decoding unit of the ith decoder, I and N are integers larger than 1, I is a positive integer smaller than or equal to I, and N is a positive integer smaller than or equal to N;
And determining the depth of an object included in the image to be detected according to the depth of each pixel point in the image to be detected.
2. The method of claim 1, wherein the neural network model further comprises a depth convolution layer, wherein the determining the depth of each pixel in the image to be detected by semantic information output by an I-th decoder of the neural network model comprises:
inputting semantic information output by a j decoding unit of an I decoder of the neural network model to the depth convolution layer, and determining the depth of each pixel point in the image to be detected by the depth convolution layer; the input of the jth decoding unit of the ith decoder is semantic information output by the jth decoding unit of the ith decoder and semantic features output by the jth encoding unit of the ith encoder, the resolution of the semantic features output by the jth encoding unit of the ith encoder is consistent with the resolution of the image to be detected, and j is a positive integer less than or equal to I.
3. The method of claim 1, wherein the neural network model further comprises a fused feature module comprising a stitching layer, a weighting layer, an adjustment layer, and a first fused layer, the method further comprising:
Inputting semantic features output by an nth coding unit of an ith-1 encoder, semantic information output by an nth decoding unit of an ith-1 decoder and semantic features output by an nth-1 coding unit of the ith encoder into the splicing layer, and obtaining splicing features comprising a plurality of parts through the splicing layer;
inputting the splicing characteristics into the weight layer, and determining the weight corresponding to each part;
multiplying each feature in the spliced features by a corresponding weight through the adjustment layer to obtain adjustment features;
and inputting the adjustment characteristic into the first fusion layer to obtain a first fusion characteristic so as to be input into an nth coding unit of the ith encoder.
4. The method of claim 1, wherein the neural network further comprises a base module, the method further comprising:
inputting semantic features output by an nth encoding unit of an ith encoder and semantic information output by an (n+1) th decoding unit of an ith decoder into the base module, and obtaining a second fusion feature through the base module so as to be input into the nth decoding unit of the ith decoder.
5. The method of claim 4, wherein the base module includes a second fusion layer, a first convolution layer, a first batch layer, a third fusion layer, and a first activation layer, the inputting semantic features output by an nth encoding unit of an ith encoder and semantic information output by an n+1th decoding unit of an ith decoder into the base module, the obtaining, by the base module, the second fusion features includes:
Inputting semantic features output by an nth coding unit of an ith encoder and semantic information output by an (n+1) th decoding unit of an ith decoder into the second fusion layer, and obtaining second fusion sub-features through the second fusion layer;
inputting the second fusion sub-feature into the first convolution layer, and obtaining a first extracted feature through the first convolution layer and the first batch processing layer;
and inputting the first extracted feature and the second fusion sub-feature into the third fusion layer, and obtaining the second fusion feature through the third fusion layer and the first activation layer.
6. The method of claim 4, wherein the base module includes a second fusion layer, a first convolution layer, a first batch layer, a second activation layer, a second convolution layer, a second batch layer, a third fusion layer, and a first activation layer, the weight of the first convolution layer is different from the weight of the second convolution layer, the semantic features output by an nth coding unit of an ith encoder and the semantic information output by an n+1th decoding unit of an ith decoder are input into the base module, and the second fusion features are obtained by the base module, including:
Inputting semantic features output by an nth coding unit of an ith encoder and semantic information output by an (n+1) th decoding unit of an ith decoder into the second fusion layer, and obtaining second fusion sub-features through the second fusion layer;
inputting the second fusion sub-feature into the first convolution layer, and obtaining a first extracted feature through the first convolution layer and the first batch processing layer;
inputting the first extracted features to the second activation layer, and obtaining second extracted features through the second activation layer, the second convolution layer and the second batch processing layer;
and inputting the second extracted feature and the second fusion sub-feature into the third fusion layer, and obtaining the second fusion feature through the third fusion layer and the first activation layer.
7. The method according to any one of claims 1-6, further comprising:
the neural network model is trained by a pose estimation model with a number of first-stage convolution channels of 6.
8. A data processing apparatus, the apparatus comprising: the device comprises an acquisition unit, a neural network model unit and a determination unit;
the acquisition unit is used for acquiring an image to be detected;
The neural network model unit is used for inputting the image to be detected into a neural network model comprising I encoders and I decoders, and determining the depth of each pixel point in the image to be detected through semantic information output by the I decoder of the neural network model; each encoder comprises N encoding units, each decoder comprises N decoding units, the input of the nth encoding unit of the ith encoder is the semantic feature output by the nth encoding unit of the ith-1 encoder, the semantic information output by the nth decoding unit of the ith-1 decoder and the semantic feature output by the nth-1 encoding unit of the ith encoder, the semantic feature output by the nth encoding unit of the ith encoder is input to the (n+1) th encoding unit of the ith encoder and the nth decoding unit of the ith decoder, I and N are integers larger than 1, I is a positive integer smaller than or equal to I, and N is a positive integer smaller than or equal to N;
the determining unit is used for determining the depth of the object included in the image to be detected according to the depth of each pixel point in the image to be detected.
9. A computer device, the device comprising a processor and a memory:
The memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-7 according to instructions in the program code.
10. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a computer program for executing the method of any one of claims 1-7.