WO2023165361A1 - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number: WO2023165361A1
Authority: WIPO (PCT)
Prior art keywords: feature, data, feature set, target, score
Application number: PCT/CN2023/077191
Other languages: English (en), French (fr)
Inventors: 陈醒濠, 王一凯, 王秀东, 王云鹤
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023165361A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to a data processing method and related equipment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • for modal data such as text, pictures, video and audio, the technology of multimodal feature fusion has become increasingly important; for example, it has greatly improved the perception systems of self-driving vehicles.
  • a vehicle with assisted driving or automatic driving functions usually needs to be equipped with different sensors to complement each other under different working conditions.
  • Typical sensor modalities include: camera, radar, lidar, high-precision map, etc.
  • at present, a common multi-modal fusion strategy is to combine the inputs of different modalities and feed them into the same transformer structure to obtain the final output.
  • Embodiments of the present application provide a data processing method and related equipment. By exchanging features between data of different modalities, the information of the different modalities can be fused efficiently, so that the acquired data features carry multi-modal characteristics and the expressive ability of the data features is improved.
  • the first aspect of the embodiment of the present application provides a data processing method.
  • the method is applied to a multi-modal fusion scene.
  • the method includes: acquiring first data and second data, where the modalities of the first data and the second data are different; acquiring a first feature set of the first data and a second feature set of the second data; replacing a first target feature in the first feature set with a second target feature from the second feature set to obtain a third feature set, where the second target feature corresponds to the first target feature; and obtaining data features based on the third feature set and the second feature set, where the data features are used to perform a computer vision task.
  • the corresponding relationship between the second target feature and the first target feature can be determined according to the spatial and semantic relationships between the first data and the second data, or according to the positions of the features in their feature sets, etc.; how the correspondence between features in different feature sets is specifically determined is not limited here.
  • in this way, the information of data of different modalities can be fused efficiently, so that the acquired data features carry multi-modal characteristics and the expressive ability of the data features is improved.
  • the above step of acquiring data features based on the third feature set and the second feature set includes: replacing a fourth target feature in the second feature set with a third target feature from the first feature set to obtain a fourth feature set, where the third target feature corresponds to the fourth target feature; and acquiring the data features based on the third feature set and the fourth feature set.
  • in other words, the third target feature can be used to replace the fourth target feature, thereby realizing a feature interchange between the first feature set and the second feature set.
  • in this way, the third feature set takes on characteristics of the modal data corresponding to the second feature set, and the fourth feature set likewise takes on characteristics of the modal data corresponding to the first feature set; this improves the expressiveness of the data features subsequently generated from the third and fourth feature sets, which in turn improves the accuracy and/or precision of the subsequent computer vision task results. A sketch of this exchange follows.
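  • as a minimal sketch of this feature exchange (a PyTorch illustration added for clarity; the tensor shapes, the index-based correspondence and all names are assumptions, since the text leaves the correspondence rule open), corresponding features can be swapped between the two modality feature sets as follows:

      import torch

      def exchange_features(feat_a: torch.Tensor, feat_b: torch.Tensor,
                            idx_a: torch.Tensor, idx_b: torch.Tensor):
          # feat_a, feat_b: (num_tokens, dim) feature sets of two modalities.
          # idx_a: positions of the first target features in feat_a;
          # idx_b: positions of the corresponding second target features in feat_b.
          third_set = feat_a.clone()
          fourth_set = feat_b.clone()
          third_set[idx_a] = feat_b[idx_b]   # second target features injected into A
          fourth_set[idx_b] = feat_a[idx_a]  # third target features injected into B
          return third_set, fourth_set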
  • before replacing the first target feature in the first feature set with the second target feature in the second feature set, the method further includes: obtaining a first score set of the first feature set, where the first features in the first feature set correspond one-to-one to the first scores in the first score set; obtaining a second score set of the second feature set, where the second features in the second feature set correspond one-to-one to the second scores in the second score set; and determining the second target feature based on the first score set and/or the second score set.
  • in this possible implementation, the second target feature or the first target feature is determined by introducing scores for the features.
  • the score can be an indicator of the importance of a feature (for example, the larger the better), or an indicator of the ineffectiveness of a feature (for example, the smaller the better), and so on.
  • obtaining the first score set of the first feature set includes: evaluating each feature in the first feature set with a scoring network to obtain the first score set, where the scoring network is used to evaluate the importance of features; obtaining the second score set of the second feature set includes: evaluating each feature in the second feature set with the scoring network to obtain the second score set.
  • the importance of the features is evaluated by introducing a scoring network, so that the subsequently determined second target feature and first target feature are more reasonable.
  • the output values of the scoring network follow a sparse distribution; that is, the outputs are sparse, so that the scores of some features differ markedly from those of the others, which makes it possible to determine which features are useful and which are not.
  • for example, the scoring network can be trained with the L1 norm during the training process, so that the scores of certain features differ greatly from those of other features, making it easy to determine which features are useful or useless. A sketch follows.
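  • a minimal sketch of such a scoring network (the single linear layer and the L1 weight are assumptions; the text only states that each feature receives a score and that the L1 norm can be used during training):

      import torch
      import torch.nn as nn

      class ScoringNetwork(nn.Module):
          """Maps each feature vector to a scalar importance score."""
          def __init__(self, dim: int):
              super().__init__()
              self.proj = nn.Linear(dim, 1)

          def forward(self, feats: torch.Tensor) -> torch.Tensor:
              # feats: (num_tokens, dim) -> scores: (num_tokens,)
              return self.proj(feats).squeeze(-1)

      def loss_with_l1(task_loss: torch.Tensor, scores: torch.Tensor,
                       l1_weight: float = 1e-3) -> torch.Tensor:
          # The L1 penalty pushes the score distribution toward sparsity,
          # widening the gap between useful and useless features.
          return task_loss + l1_weight * scores.abs().mean()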
  • obtaining the first score set of the first feature set includes: performing a mathematical operation on each first feature in the first feature set to obtain the first score set, where the mathematical operation operates on each first feature itself and includes a rank operation or a modulo operation; obtaining the second score set of the second feature set includes: performing the mathematical operation on each second feature in the second feature set to obtain the second score set.
  • the above step of obtaining the first feature set of the first data and the second feature set of the second data includes: obtaining the first feature set and the second feature set based on a neural network, where the neural network includes an attention network, a multi-layer perceptron, a pooling layer or a convolutional layer.
  • in this possible implementation, the first feature set and the second feature set are obtained based on a neural network, so the method can be applied to scenarios built on attention networks, multi-layer perceptrons, pooling layers or convolutional layers.
  • the above step of obtaining the first feature set and the second feature set based on the neural network includes: splitting the first data to obtain a plurality of first sub-data; splitting the second data to obtain a plurality of second sub-data; and inputting the plurality of first sub-data and second sub-data into the neural network to obtain the first feature set and the second feature set.
  • the input of the neural network is obtained by splitting the modal data, so that the number of features in the resulting feature sets is tied to the number of splits, which keeps the subsequent amount of computation under control; a splitting sketch follows.
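  • for image data, the splitting can be sketched as patch extraction (the patch size, shapes and function name are assumptions; the text does not fix a splitting rule):

      import torch

      def split_image(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
          # image: (C, H, W) with H and W assumed divisible by the patch size.
          # Each patch becomes one sub-datum, so the feature set later holds
          # one feature per patch.
          c, h, w = image.shape
          patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
          # (C, H/patch, W/patch, patch, patch) -> (num_patches, C*patch*patch)
          return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)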
  • replacing the first target feature in the first feature set with the second target feature in the second feature set includes: replacing the first target feature with the second target feature based on residual position coding, where the residual position coding is used to determine the position of each feature in the first feature set and the second feature set.
  • the position of the replaced feature is determined by residual position coding, thereby ensuring that the positions of the features in the original feature set are not disturbed when a feature is replaced.
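  • the text does not spell out the exact form of the residual position coding; one plausible reading (an assumption, not the patent's definition) is that each slot keeps its own position code, which is added back after replacement so that an injected feature inherits the position of the slot it fills:

      import torch

      def replace_with_position(feat_a: torch.Tensor, feat_b: torch.Tensor,
                                pos_a: torch.Tensor, idx_a: torch.Tensor,
                                idx_b: torch.Tensor) -> torch.Tensor:
          # pos_a: (num_tokens, dim) position codes of the first feature set.
          # The injected feature is combined with the position code of the slot
          # it lands in, so the ordering of the first feature set is preserved.
          out = feat_a.clone()
          out[idx_a] = feat_b[idx_b] + pos_a[idx_a]
          return out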
  • the foregoing neural network further includes a first network layer, and a structure of the first network layer is related to the neural network.
  • in this possible implementation, the first feature set and the second feature set can be the output of the first network layer; that is, no matter where in the neural network the first feature set and the second feature set are taken from, replacement between the features of different modal data can be performed, improving the expressive ability of the subsequent data features.
  • the above steps further include: inputting the data features into a second network layer to obtain the result of the computer vision task, where the second network layer is related to the computer vision task.
  • the data features can produce the result of the computer vision task through the second network layer, and since the data features incorporate features exchanged between different modal data, the result is more accurate.
  • the above-mentioned computer vision task is a classification task, and the second network layer is a fully connected layer; or the computer vision task is a segmentation task or a detection task, and the second network layer is a convolutional neural network layer or an upsampling layer.
  • the method can be applied to computer vision tasks in different scenarios, and can accurately complete detection tasks, segmentation tasks, classification tasks, and the like.
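  • as an illustration of the task-dependent second network layer (the module choices are assumptions consistent with the text: a fully connected layer for classification, a convolution plus upsampling for segmentation or detection):

      import torch.nn as nn

      def build_second_layer(task: str, dim: int, num_classes: int) -> nn.Module:
          if task == "classification":
              # fully connected layer applied to a pooled (dim,) data feature
              return nn.Linear(dim, num_classes)
          if task in ("segmentation", "detection"):
              # convolutional layer followed by an upsampling layer
              return nn.Sequential(
                  nn.Conv2d(dim, num_classes, kernel_size=1),
                  nn.Upsample(scale_factor=4, mode="bilinear"),
              )
          raise ValueError(f"unknown task: {task}")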
  • the second aspect of the embodiment of the present application provides a data processing device applied to a multi-modal fusion scene. The data processing device includes: an acquisition unit configured to acquire first data and second data, where the modalities of the two data are different; the acquisition unit is further configured to acquire a first feature set of the first data and a second feature set of the second data; a replacement unit configured to replace a first target feature in the first feature set with a second target feature from the second feature set to obtain a third feature set, where the second target feature corresponds to the first target feature; the acquisition unit is further configured to acquire data features based on the third feature set and the second feature set, where the data features are used to implement a computer vision task.
  • the above acquisition unit is specifically configured to replace a fourth target feature in the second feature set with a third target feature from the first feature set to obtain a fourth feature set, where the third target feature corresponds to the fourth target feature; the acquisition unit is specifically configured to acquire the data features based on the third feature set and the fourth feature set.
  • the above-mentioned acquisition unit is further configured to acquire a first score set of the first feature set, where the first features in the first feature set correspond one-to-one to the first scores in the first score set; the acquisition unit is further configured to acquire a second score set of the second feature set, where the second features in the second feature set correspond one-to-one to the second scores in the second score set; the data processing device further includes a determining unit configured to determine the second target feature based on the first score set and/or the second score set.
  • the above acquisition unit is specifically configured to evaluate each feature in the first feature set based on the scoring network to obtain the first score set, where the scoring network is used to evaluate the importance of features; the acquisition unit is specifically configured to evaluate each feature in the second feature set based on the scoring network to obtain the second score set.
  • the output value of the scoring network follows a sparse distribution.
  • the above-mentioned acquisition unit is specifically configured to perform a mathematical operation on each first feature in the first feature set to obtain the first score set, where the mathematical operation operates on each first feature itself and includes a rank operation or a modulo operation; the acquisition unit is specifically configured to perform the mathematical operation on each second feature in the second feature set to obtain the second score set.
  • the above acquisition unit is specifically configured to acquire the first feature set and the second feature set based on a neural network, where the neural network includes an attention network, a multi-layer perceptron, a pooling layer or a convolutional layer.
  • the above-mentioned acquisition unit is specifically configured to split the first data to obtain a plurality of first sub-data; to split the second data to obtain a plurality of second sub-data; and to input the plurality of first sub-data and second sub-data into the neural network to obtain the first feature set and the second feature set.
  • the above replacement unit is specifically configured to replace the first target feature with the second target feature based on the residual position code, where the residual position code is used to determine the position of each feature in the first feature set and the second feature set.
  • the above neural network further includes a first network layer, and the structure of the first network layer is related to the neural network.
  • the above-mentioned acquisition unit is further configured to input data features into a second network layer to obtain results of computer vision tasks, and the second network layer is related to computer vision tasks.
  • the above-mentioned computer vision task is a classification task, and the second network layer is a fully connected layer; or the computer vision task is a segmentation task or a detection task, and the second network layer is a convolutional neural network layer or an upsampling layer.
  • the third aspect of the embodiment of the present application provides a data processing device, including a processor coupled with a memory, where the memory is used to store programs or instructions; when the programs or instructions are executed by the processor, the data processing device implements the method in the first aspect or any possible implementation of the first aspect.
  • the fourth aspect of the embodiment of the present application provides a computer-readable medium on which computer programs or instructions are stored; when the computer programs or instructions are run on a computer, the computer is caused to perform the method in the first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the embodiments of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • for the technical effects brought by the second, third, fourth and fifth aspects or any of their possible implementations, refer to the technical effects of the first aspect or of the corresponding possible implementations of the first aspect; they are not repeated here.
  • the embodiments of the present application have the following advantages: by exchanging features between data of different modalities, the information of different modalities can be efficiently fused, so that the acquired data features carry multi-modal characteristics and the expressive ability of the data features is improved.
  • FIG. 1 is a schematic structural diagram of a system architecture provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 3A is a schematic structural diagram of a data processing system provided by an embodiment of the present application.
  • FIG. 3B is another schematic structural diagram of the data processing system provided by the embodiment of the present application.
  • FIG. 4 is a schematic flow chart of a data processing method provided in an embodiment of the present application.
  • FIG. 5A is an example diagram of the first data provided by the embodiment of the present application.
  • FIG. 5B is an example diagram of the second data provided by the embodiment of the present application.
  • FIG. 6A is another example diagram of the first data provided by the embodiment of the present application.
  • FIG. 6B is another example diagram of the second data provided by the embodiment of the present application.
  • FIG. 7A is a further example diagram of the first data provided by the embodiment of the present application.
  • FIG. 7B is a further example diagram of the second data provided by the embodiment of the present application.
  • FIG. 8A is yet another example diagram of the first data provided by the embodiment of the present application.
  • FIG. 8B is yet another example diagram of the second data provided by the embodiment of the present application.
  • FIG. 9 shows several example diagrams of the neural network provided by the embodiment of the present application.
  • FIG. 10A is an example diagram of the position of the feature set in the neural network provided by the embodiment of the present application.
  • FIG. 10B is another example diagram of the position of the feature set in the neural network provided by the embodiment of the present application.
  • FIG. 11 is an exemplary flow chart of a data processing method provided in an embodiment of the present application.
  • FIG. 12 is another exemplary flow chart of the data processing method provided by the embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 14 is another schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Embodiments of the present application provide a data processing method and related equipment. By exchanging features between data of different modalities, the information of the different modalities can be fused efficiently, so that the acquired data features carry multi-modal characteristics and the expressive ability of the data features is improved.
  • Multimodal fusion is responsible for combining information from multiple modalities for target prediction (classification or regression). It is one of the earliest research directions of multimodal machine learning (MMML) and is currently the most widely applied direction; it is also known by other common names such as multi-source information fusion and multi-sensor fusion. Since the advent of deep learning, multimodal feature fusion technology has become even more important; for example, it has greatly improved the perception systems of self-driving vehicles. In order to obtain more robust and accurate perception results, a vehicle with assisted or automatic driving functions usually needs to be equipped with different sensors that complement each other under different working conditions. Typical sensor modalities include cameras, radar, lidar and high-precision maps. At present, a common multi-modal fusion strategy is to combine the inputs of different modalities and feed them into the same transformer structure to obtain the final output.
  • the embodiment of the present application provides a data processing method.
  • by applying the transformer structure to the lane line detection task, the long-range relationships between lane lines can be effectively modeled.
  • in this way, the ability to perceive the scene can be improved, reducing misjudgments in scenarios where lane lines are occluded by vehicles.
  • a neural network may be composed of neural units, where a neural unit may refer to an operation unit that takes the inputs X_s and an intercept 1 as input, and the output of the operation unit may be: h = f(Σ_s W_s·X_s + b), where W_s is the weight of X_s and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a Relu function.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in this layer of neural network.
  • the vector W determines the space transformation from the input space to the output space above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as convolving the same trainable filter with an input image or convolutional feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
  • shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of one part of an image is the same as that of other parts, meaning that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the transformer structure is a feature extraction network (of a different category from the convolutional neural network) that includes an encoder and a decoder.
  • the encoder learns features, such as pixel features, under a global receptive field through self-attention.
  • the decoder learns the features of the required modules, such as the features of the output boxes, through self-attention and cross-attention.
  • attention is also known as the attention mechanism.
  • the attention mechanism can quickly extract important features of sparse data.
  • the attention mechanism occurs between the encoder and the decoder, or between the input sentence and the generated sentence.
  • the self-attention mechanism in the self-attention model occurs within the input sequence or within the output sequence, and can extract the connection between words that are far apart in the same sentence, such as syntactic features (phrase structure).
  • the self-attention mechanism provides an effective way to capture global context information through QKV: assuming the input is a query Q, and the context is stored in the form of key-value pairs (K, V), then the attention mechanism is essentially a mapping function from a query to a series of key-value pairs.
  • Attention essentially assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing. If each element in the sequence is stored in the form of (K, V), then attention completes the addressing by calculating the similarity between Q and K. The similarity calculated by Q and K reflects the importance of the V value taken out, that is, the weight, and then the weighted summation obtains the final feature value.
  • the calculation of attention is mainly divided into three steps.
  • the first step is to calculate the similarity between the query and each key to obtain the weights; commonly used similarity functions include the dot product, concatenation, perceptron, etc.
  • the second step is generally to use a softmax function to normalize these weights (on the one hand, normalization yields a probability distribution in which all weight coefficients sum to 1; on the other hand, the characteristics of the softmax function can be used to highlight the weights of important elements).
  • finally, the weights and the corresponding values are weighted and summed to obtain the final feature value.
  • the specific calculation formula can be as follows: Attention(Q, K, V) = softmax(QK^T / √d)·V, where d is the dimension of the Q and K matrices.
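  • a minimal sketch of this computation (a PyTorch illustration; tensor shapes are assumptions):

      import torch
      import torch.nn.functional as F

      def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor,
                                       v: torch.Tensor) -> torch.Tensor:
          # q, k, v: (num_tokens, d). The Q-K similarities are normalized by
          # softmax into weights, and the output is the weighted sum of V.
          d = q.size(-1)
          weights = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
          return weights @ v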
  • attention includes self-attention and cross-attention.
  • self-attention can be understood as a special form of attention in which the Q, K and V inputs are identical, whereas in cross-attention the Q, K and V inputs are not identical. Attention uses the similarity between features (such as the inner product) as a weight to aggregate the queried features as the update value of the current feature. Self-attention is attention computed from the feature map itself.
  • the setting of the convolution kernel limits the size of the receptive field, resulting in the network often requiring multiple layers of stacking to focus on the entire feature map.
  • the advantage of self-attention is that its attention is global, and it can obtain the global spatial information of the feature map through simple query and assignment.
  • the special point of self-attention in the query key value (QKV) model is that the corresponding input of QKV is consistent. The QKV model will be described later.
  • the feedforward neural network was the earliest and simplest type of artificial neural network to be devised.
  • in a feedforward neural network, the neurons are arranged in layers; the neurons in each layer can receive signals from the neurons in the previous layer and generate signals to output to the next layer.
  • Layer 0 is called the input layer
  • the last layer is called the output layer
  • other intermediate layers are called hidden layers. There is no feedback in the entire network, and the signal propagates unidirectionally from the input layer to the output layer.
  • Multilayer perceptron (MLP)
  • a multilayer perceptron is a feed-forward artificial neural network model that maps a set of input vectors to a set of output vectors.
  • common upsampling methods include bilinear interpolation, deconvolution (transposed convolution) and unpooling.
  • modality refers to the way things happen or exist
  • multimodal refers to the combination of two or more modalities in various forms.
  • Each source or form of information can be called a modality, and the current research field mainly focuses on the processing of modalities such as images, texts, and voices.
  • the modality mentioned above can also be understood as "sense", that is, the channel through which organisms receive information through sensory organs and experiences.
  • human beings have modalities such as vision, hearing, touch, taste and smell.
  • Multimodality can be understood as the fusion of multiple senses.
  • humans can communicate with smart devices through multiple channels such as voice, body language, information carriers (such as text, pictures, audio and video) and the environment. After fusing the multi-modal information, a smart device can make judgments about human intentions and feed results back to humans through text, sound, light strips and other means.
  • multimodal data refers to data with multiple different modalities, where the modalities can include text, images, audio, video, and so on. It can be understood that in some scenarios, images with different structures may also be called different modalities; for example, RGB images and depth images are data of different modalities. Texts with different structures can also be called different modalities; for example, Chinese and English are data of different modalities. Audio in different formats can also be regarded as different modalities; for example, waveform audio files (WAV) and audio video interleaved (AVI) files are data of different modalities, and so on.
  • Multimodal fusion in deep learning refers to the technology in which machines obtain information from multiple fields such as text, images, voice, and video, and realize information conversion and fusion, thereby improving model performance.
  • the reason for fusing modalities is that different modalities express things differently and look at things from different perspectives, so there are intersections (hence information redundancy) and complementarities (hence they are better than single features), and there may even be a variety of different information interactions between the modalities. If multi-modal information can be processed reasonably, rich feature information can be obtained.
  • an embodiment of the present invention provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • the training data includes: data of multiple different modalities. Wherein, modality may refer to text, image, video and audio.
  • training data can include RGB image + depth image, RGB image and point cloud data, etc.
  • the training data is stored in the database 130 , and the training device 120 obtains the target model/rule 101 based on training data maintained in the database 130 .
  • the training device 120 obtains the target model/rule 101 based on the training data, and the target model/rule 101 can be used to implement the computer vision task applied by the data processing method provided by the embodiment of the present application.
  • the computer vision task may include: classification task, segmentation task, detection task or image generation task, etc.
  • the target model/rule 101 in the embodiment of the present application may specifically include a self-attention network, a multi-layer perceptron, a pooling layer, and the like. It should be noted that, in practical applications, the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
  • the training device 120 does not necessarily perform the training of the target model/rule 101 entirely on the basis of the training data maintained in the database 130; it is also possible to obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1, which may be a terminal device such as a mobile phone, a tablet computer, augmented reality (AR) equipment/virtual reality (VR) equipment, a vehicle-mounted terminal, etc.
  • the execution device 110 may also be a server or a cloud.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data may include: images to be detected.
  • the input data may be input by the user, or uploaded by the user through the shooting device, and of course, may also come from a database, which is not limited here.
  • the preprocessing module 113 is used to perform preprocessing according to the input data received by the I/O interface 112.
  • the preprocessing module 113 may be used to split the input data to obtain sub-data sets; for example, if the input data is an image, the preprocessing module 113 is used to split the image to obtain multiple image blocks.
  • when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and the correspondingly processed data and instructions may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the obtained result corresponding to the above target task, to the client device 140, so as to provide it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete above tasks, thereby providing the desired result to the user.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 140 .
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal, collecting the input data input to the I/O interface 112 as shown in the figure and the output results of the output I/O interface 112 as new sample data, and storing them in the database 130 .
  • alternatively, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output results of the I/O interface 112, as shown in the figure, as new sample data in the database 130.
  • it is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation; for example, in FIG. 1 the data storage system 150 is an external memory relative to the execution device 110, while in other cases the data storage system 150 may also be placed in the execution device 110.
  • a target model/rule 101 is obtained through training by a training device 120 , and the target model/rule 101 in the embodiment of the present application may specifically be a target neural network.
  • a chip hardware structure provided by the embodiment of the present application is introduced below.
  • FIG. 2 is a hardware structure of a chip provided by an embodiment of the present invention, and the chip includes a neural network processor 20 .
  • the chip can be set in the execution device 110 shown in FIG. 1 to complete the computing work of the computing module 111 .
  • the chip can also be set in the training device 120 shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the neural network processor 20 can be a neural network processor (neural-network processing unit, NPU), a tensor processor (tensor processing unit, TPU), or a graphics processor (graphics processing unit, GPU) and other processors suitable for large-scale XOR operation processing.
  • the neural network processor 20 is mounted on a main central processing unit (central processing unit, CPU) (host CPU) as a coprocessor, and the main CPU assigns tasks.
  • the core part of the NPU is the operation circuit 203, and the controller 204 controls the operation circuit 203 to extract data in the memory (weight memory or input memory) and perform operations.
  • the operation circuit 203 includes multiple processing units (process engine, PE).
  • arithmetic circuit 203 is a two-dimensional systolic array.
  • the arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 203 is a general-purpose matrix processor.
  • the operation circuit 203 fetches the data corresponding to the matrix B from the weight memory 202, and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 201 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator 208 .
  • the vector computing unit 207 can perform further processing on the output of the computing circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 207 can be used for network calculations of non-convolution/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 207 can store the processed output vector to the unified buffer 206.
  • the vector computing unit 207 may apply a non-linear function to the output of the computing circuit 203, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 207 generates normalized values, binned values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 203, for example for use in a subsequent layer in a neural network.
  • the unified memory 206 is used to store input data and output data.
  • the direct memory access controller (DMAC) 205 transfers the input data in the external memory to the input memory 201 and/or the unified memory 206, stores the weight data in the external memory into the weight memory 202, and stores the data in the unified memory 206 into the external memory.
  • a bus interface unit (bus interface unit, BIU) 210 is configured to implement interaction between the main CPU, DMAC and instruction fetch memory 209 through the bus.
  • An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204.
  • the controller 204 is configured to invoke the instructions cached in the instruction fetch memory 209 to control the working process of the computing accelerator.
  • the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch memory 209 are all on-chip (On-Chip) memories
  • the external memory is a memory outside the NPU
  • the external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM) or other readable and writable memory.
  • FIG. 3A is a schematic structural diagram of a data processing system provided by an embodiment of the present application.
  • the data processing system includes a terminal device (in FIG. 3A, a mobile phone is taken as an example of the terminal device) and a data processing device. Understandably, besides a mobile phone, the terminal device can also be a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a notebook computer, an ultra mobile personal computer (UMPC), a handheld computer, a netbook, a vehicle-mounted media player, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a vehicle, a vehicle-mounted terminal, an aircraft terminal, an intelligent robot, or other terminal equipment.
  • the terminal device is the initiator of the data processing; as the initiator of a data processing request, the user usually initiates the request through the terminal device.
  • the above-mentioned data processing device may be a device or server having a data processing function such as a cloud server, a network server, an application server, and a management server.
  • the data processing device receives the data processing request from the terminal device through an interactive interface, and then performs data processing such as machine learning, deep learning, search, reasoning and decision-making by means of a memory for storing data and a processor for data processing.
  • the storage in the data processing device may be a general term, including local storage and a database storing historical data, and the database may be on the data processing device or on other network servers.
  • the terminal device can receive user instructions; for example, the terminal device can obtain multiple pieces of data input/selected by the user (for example, images, text, audio, etc. collected by the terminal device), and then initiate a request to the data processing device, so that the data processing device executes a data processing application (for example, computer vision tasks such as classification, segmentation, detection and image generation) on the data and obtains the corresponding processing results.
  • for example, the terminal device may acquire multiple images input by the user and then initiate an image detection request to the data processing device, so that the data processing device detects the images, obtains the detection results, and displays them for the user to view and use.
  • the data processing device may execute the data processing method of the embodiment of the present application.
  • Fig. 3B is another schematic structural diagram of the data processing system provided by the embodiment of the present application.
  • in FIG. 3B, the terminal device (again taking a mobile phone as an example) can directly acquire multiple pieces of data (for example, images, text, audio, etc.) and process them locally; the specific process is similar to that in FIG. 3A, and reference may be made to the above description, which is not repeated here.
  • the terminal device may receive instructions from the user; for example, the terminal device may acquire multiple images selected by the user on the terminal device, and then the terminal device itself executes a data processing application (for example, computer vision tasks such as classification, segmentation, detection and image generation) on the images, so as to obtain the corresponding processing results and display them for the user to view and use.
  • for another example, the terminal device can collect images in real time or periodically, and then the terminal device itself executes the data processing application (for example, computer vision tasks such as classification, segmentation, detection and image generation) on the images, so as to obtain the corresponding processing results and implement the corresponding functions (classification, segmentation, detection, image generation, etc.) according to the processing results.
  • the terminal device itself can execute the data processing method of the embodiment of the present application.
  • the above-mentioned terminal device in FIG. 3A and FIG. 3B may specifically be the client device 140 or the execution device 110 in FIG. 1, and the data processing device in FIG. 3A may specifically be the execution device 110 in FIG. 1.
  • the data storage system 150 may be integrated on the execution device 110, or set on the cloud or other network servers.
  • the processors in FIG. 3A and FIG. 3B can perform data training/machine learning/deep learning through neural network models or other models (such as attention models, MLPs, etc.), and use the finally trained or learned model to execute the data processing application on multiple pieces of data, so as to obtain the corresponding processing results.
  • the data processing method may be executed by the data processing device, or may be executed by components of the data processing device (such as a processor, a chip, or a chip system, etc.).
  • the data processing device may be a cloud device (as shown in FIG. 3A ), or a terminal device (such as a mobile phone as shown in FIG. 3B ).
  • this method can also be executed by a system composed of cloud devices and terminal devices (as shown in the aforementioned FIG. 3A ).
  • the method may be processed by the CPU in the data processing device, or jointly processed by the CPU and the GPU, or other processors suitable for neural network calculations may be used instead of the GPU, which is not limited in this application.
  • the above-mentioned terminal equipment can be a mobile phone, a tablet computer (pad), a portable game console, a handheld computer (personal digital assistant, PDA), a notebook computer, an ultra mobile personal computer (ultra mobile personal computer, UMPC), a handheld computer, a netbook, a vehicle-mounted computer, etc.
  • the application scenarios of the method provided in the embodiments of the present application are mainly multi-modal fusion scenarios, which can be specifically applied to computer vision tasks such as classification, segmentation, detection and image generation, or to semantic segmentation, indoor perception, outdoor driving, etc.
  • the data involved in the embodiments of the present application may refer to text, images, audio and video, etc. For the convenience of description, this article only uses images as an example for illustration.
  • FIG. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the method may include steps 401 to 404 . Steps 401 to 404 will be described in detail below.
  • Step 401 acquiring first data and second data.
  • there are many ways for the data processing device to obtain the first data and the second data: they can be collected/photographed by the device itself, received from other devices, or obtained from a database; the acquisition method is not specifically limited here.
  • the data processing device may be a vehicle, and the first data and the second data may be data collected by the vehicle in real time or periodically, which is not limited here.
  • the following description takes the case where the first data and the second data are image-related data as an example; in practice, the first data and the second data may also be data related to text, audio, video, etc., which is not limited here.
  • the first data and the second data belong to isomorphic multimodal data.
  • isomorphic multimodal data means that the presentation mode of the modality to which the first data belongs is the same as the presentation mode of the modality to which the second data belongs.
  • for example, the first data is an RGB image and the second data is a depth image; both the first data and the second data are presented as images.
  • the first data and the second data are image data, or the first data and the second data are text data, or the first data and the second data are audio data, etc., which are not limited here.
  • in Example 1, the first data is an RGB image as shown in FIG. 5A, and the second data is a depth image as shown in FIG. 5B.
  • Example 1 can be applied to a cloud service scenario (such as a semantic segmentation scenario), and the data processing device can be a smart camera, a smart robot, and the like.
  • the first data and the second data belong to heterogeneous multimodal data.
  • heterogeneous multimodal data means that the presentation mode of the modality to which the first data belongs differs from the presentation mode of the modality to which the second data belongs.
  • for example, the first data is image data and the second data is point cloud data; or the first data is text data and the second data is audio data, etc., which is not specifically limited here.
  • in Example 2, the first data is an RGB image as shown in FIG. 6A, and the second data is point cloud data as shown in FIG. 6B.
  • Example 2 can be applied to an automatic driving scenario (for example, an intelligent perception scenario), and the data processing device can be a smart car or the like.
  • Step 402 acquiring a first feature set of the first data and a second feature set of the second data.
  • after the data processing device acquires the first data and the second data, it may acquire the first feature set of the first data and the second feature set of the second data.
  • the first data is split to obtain a plurality of first sub-data.
  • a first feature set is obtained based on a plurality of first sub-data
  • a second feature set is obtained based on a plurality of second sub-data.
  • the number of splits of the first data is related to the number of features in the first feature set.
  • the number of split first data is the same as the number of features in the first feature set.
  • the number of split second data is related to the number of features in the second feature set.
  • the number of split second data is the same as the number of features in the second feature set.
  • the rules for splitting data can be set according to actual needs.
  • for example, the splitting rule may be to divide all or part of the data evenly or unevenly, etc.; it is not specifically limited here.
  • the splitting of the first data in FIG. 5A into multiple first sub-data may be as shown in FIG. 7A.
  • the splitting of the second data in FIG. 5B into multiple second sub-data may be as shown in FIG. 7B.
  • the splitting of the first data in FIG. 6A into multiple first sub-data may be as shown in FIG. 8A.
  • the splitting of the second data in FIG. 6B into multiple second sub-data may be as shown in FIG. 8B.
  • when the second data is point cloud data, the second data may be sampled to obtain sampling points, and the sampling points may then be used as the second sub-data; a sampling sketch follows.
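  • a minimal sketch of such sampling (uniform random sampling is an assumption; the text does not fix a sampling strategy):

      import torch

      def sample_point_cloud(points: torch.Tensor, num_samples: int) -> torch.Tensor:
          # points: (N, 3) point cloud with N assumed >= num_samples.
          # Each returned sampling point serves as one second sub-datum.
          idx = torch.randperm(points.size(0))[:num_samples]
          return points[idx]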
  • the first feature set may be acquired based on the plurality of first sub-data
  • the second feature set may be acquired based on the plurality of second sub-data.
  • the first feature set of the plurality of first sub-data is acquired based on a neural network.
  • the second feature set of the plurality of second sub-data is acquired based on the neural network.
  • one feature can also be set to correspond to multiple sub-data
  • multiple features can also be set to correspond to one sub-data, which is not specifically limited here.
  • the neural network mentioned above may include an attention network, a multi-layer perceptron (MLP), a pooling layer, etc., which are not limited here.
  • the neural network can also only include an attention network, a multi-layer perceptron, a pooling layer or a convolutional layer, and the like.
  • the first feature set and the second feature set can be the output of the attention network (as shown in FIG. 10A), or the output of the first network layer (as shown in FIG. 10B), and so on (for example, the first feature set and the second feature set can also be the features input to the attention network), which is not limited here.
  • the attention network may include L sub-modules, or it may be understood that the attention network is a network with an L-layer structure, wherein each layer has the same structure.
  • Step 403 replacing the first target feature in the first feature set with the second target feature in the second feature set.
  • the second target feature in the second feature set may be used to replace the first target feature in the first feature set to obtain a third feature set.
  • the second target feature corresponds to the first target feature.
  • The correspondence between the second target feature and the first target feature can be determined according to the spatial relationship and semantic relationship between the first data and the second data, or according to the positions of the features in their feature sets; how the correspondence between features in different feature sets is specifically determined is not limited here.
  • Optionally, the data processing device may first acquire the first score set of the first feature set and the second score set of the second feature set, use the first score set and/or the second score set to determine the first target feature and the second target feature, and then replace the first target feature in the first feature set with the second target feature to obtain the third feature set.
  • The acquisition of a feature set's score set is described first. A score set includes multiple scores; a score can be used to evaluate the importance of a feature (the larger the better) or the invalidity of a feature (the smaller the better), and so on.
  • Optionally, the scores in a score set may correspond one-to-one with the features in the feature set. Features can also be scored along different dimensions, in which case one feature may correspond to multiple scores. There is no limit on the number of scores corresponding to a feature, that is, there may be one or more; for ease of description, the embodiment of the present application uses one score per feature as an example.
  • 1) A score set can be obtained from a scoring network. Under this approach, a scoring network is introduced that evaluates the importance of features.
  • each feature in the first feature set is evaluated based on the scoring network to obtain the first score set.
  • Each feature in the second feature set is evaluated based on the scoring network to obtain a second score set.
  • each feature in the first feature set is input into the scoring network to obtain the first score set.
  • Each feature in the second feature set is input into the scoring network to obtain a second score set.
  • In addition, so that the output values of the scoring network obey a sparse distribution, which makes the scores of some features clearly different from those of others and thus identifies which features are useful or useless, the scoring network can be trained using the L1 norm. A sketch of such a scoring head follows.
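A minimal sketch of such a scoring head is shown below; the layer sizes, the sigmoid squashing, and the penalty weight are illustrative assumptions. The L1 term on the scores is what pushes the output distribution toward sparsity:

```python
import torch
import torch.nn as nn

# A small scoring head: one scalar per feature, squashed into (0, 1).
score_net = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

features = torch.rand(6, 256)             # one feature set of 6 features
scores = score_net(features).squeeze(-1)  # its score set, shape (6,)

# During training, an L1 penalty on the scores drives many of them toward 0,
# so useful and useless features become clearly separated.
task_loss = (features - features.mean(0)).pow(2).mean()  # stand-in for the real task loss
loss = task_loss + 1e-3 * scores.abs().sum()
loss.backward()
```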
  • 2) A score set can also be obtained through a mathematical operation on each feature in the feature set. The mathematical operation here is an operation on the feature itself, and may include a rank operation (for example, when the feature is in matrix form) or a modulo operation (for example, when the feature is in vector form), etc., which is not limited here.
  • Optionally, when the features in the first and second feature sets are represented as matrices, a score set may be obtained by performing a rank operation on each feature matrix: a rank operation on each matrix in the first feature set yields the first score set, and on each matrix in the second feature set yields the second score set.
  • Optionally, when the features are represented as vectors, a score set may be obtained by performing a modulo operation on each feature vector: a modulo operation on each vector in the first feature set yields the first score set, and on each vector in the second feature set yields the second score set. Both parameter-free variants are sketched below.
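Both parameter-free variants could look like the sketch below; the feature shapes are illustrative assumptions:

```python
import torch

def rank_scores(feature_mats: torch.Tensor) -> torch.Tensor:
    """Score set via a rank operation; features given as a batch of (r, c) matrices."""
    return torch.linalg.matrix_rank(feature_mats).float()

def modulo_scores(feature_vecs: torch.Tensor) -> torch.Tensor:
    """Score set via a modulo (norm) operation; features given as (N, d) vectors."""
    return torch.linalg.vector_norm(feature_vecs, dim=-1)

mats = torch.rand(6, 16, 16)         # six matrix-form features
vecs = torch.rand(6, 256)            # six vector-form features
first_scores = rank_scores(mats)     # e.g., a first score set
second_scores = modulo_scores(vecs)  # e.g., a second score set
```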
  • After the data processing device obtains the score sets corresponding to the feature sets, it can determine the first target feature and the second target feature based on the first score set and/or the second score set, and then replace the first target feature with the second target feature.
  • the corresponding relationship between the second target feature and the first target feature may be determined according to a first preset rule, etc., which is not specifically limited here.
  • determining the second target feature is equivalent to determining the first target feature, or determining the first target feature is equivalent to determining the second target feature. Therefore, the first target feature and the second target feature may be determined based on the first set of scores and/or the second set of scores.
  • As mentioned above, the correspondence between the second target feature and the first target feature can be determined according to the spatial relationship and semantic relationship between the first data and the second data, or according to the positions of the features in their feature sets, which is not limited here. In other words, the first preset rule may be related to the spatial relationship, semantic relationship, etc. between the multimodal data.
  • Optionally, the data processing device may determine the first target feature in the first feature set based on the first score set and the second preset rule; after the first target feature is determined, the second target feature corresponding to it can then be determined according to the first preset rule. Symmetrically, the device may determine the second target feature in the second feature set based on the second score set and the second preset rule, and then determine the corresponding first target feature according to the first preset rule.
  • the aforementioned first preset rule is specifically used to determine the correspondence between the first feature in the first feature set and the second feature in the second feature set.
  • the relationship may be one-to-one, one-to-many or many-to-one, which is not specifically limited here.
  • the first preset rule can be set according to actual needs.
  • For example, the first preset rule may state that the feature at the first position in the first feature set corresponds to the feature at the second position in the second feature set, or that the feature at the first position in the first feature set corresponds to the feature at the first position in the second feature set, and so on. In practical applications, the first preset rule may also take other forms, which is not limited here.
  • the positions of the features in the first feature set and the positions of the features in the second feature set may be determined by means of residual position coding and the like.
  • Exemplarily, suppose the first feature set includes features A1, A2, A3, A4, A5, and A6 in sequence, and the second feature set includes features B1, B2, B3, B4, B5, and B6 in sequence.
  • The first preset rule may then be that A1 corresponds to B1, A2 to B2, A3 to B3, A4 to B4, A5 to B5, and A6 to B6. It may also be that A1 corresponds to B2, A2 to B3, A3 to B4, A4 to B5, A5 to B6, and A6 to B1. It may also be that A1 corresponds to B5, A2 to B3, A3 to B1, A4 to B2, A5 to B6, and A6 to B4, and so on.
  • the first target feature is a feature in the first feature set involved in the above correspondence
  • the second target feature is a feature in the second feature set corresponding to the first target feature.
  • the aforementioned second preset rule is specifically used to determine the first target feature and/or the second target feature.
  • the second preset rule can be set according to actual needs.
  • the second preset rule may specifically be related to the size of the score, the preset score, and the like.
  • For example, the second preset rule may include any of the following: determining the feature with the smallest score in the first score set as the first target feature; determining the feature with the largest score in the second score set as the second target feature; determining the feature with the largest score in the first score set as the first target feature; determining the feature with the smallest score in the second score set as the second target feature; determining a feature in the second score set whose score equals a preset score as the second target feature; or determining a feature in the first score set whose score equals a preset score as the first target feature. In practical applications, the second preset rule may also take other forms, which is not limited here.
  • Exemplarily, suppose the neural network includes an attention network with L sub-modules, i.e., an L-layer network in which each layer has the same structure.
  • Let the scoring network of layer $l$ be denoted $s^l$, and let the first score of the first feature $x_i^{l,A}$ in the first feature set of layer $l$ be denoted $s^l(x_i^{l,A})$. The process of replacing the first target feature in the first feature set with the second target feature in the second feature set can then be expressed by the following formula:

  $$\hat{x}_i^{l,A} = x_i^{l,A} \odot \mathbb{1}_{\left[s^l(x_i^{l,A}) \ge \theta\right]} + \operatorname{proj}_{A \to B}\!\left(x_i^{l}\right) \odot \mathbb{1}_{\left[s^l(x_i^{l,A}) < \theta\right]}$$

  where $x_i^{l,A}$ is the feature to be replaced in the first feature set (for example, the first target feature); $\odot$ represents element-wise multiplication; $\mathbb{1}_{[\cdot]}$ is the indicator function, which outputs 1 if the condition in its subscript is satisfied and 0 otherwise (1 can be understood as replacement, 0 as no replacement); $\theta$ is the preset score of the aforementioned second preset rule, whose value can be set according to actual needs, for example $\theta = 0.01$; $s^l(x_i^{l,A})$ is the score of the feature to be replaced (for example, the score of the first target feature); and $\operatorname{proj}_{A \to B}(\cdot)$ projects the first target feature in the first feature set onto the second target feature in the second feature set, with $A$ and $B$ denoting the correspondence between features in the two feature sets (for example, the first target feature corresponds to the second target feature). The formula can be understood as follows: a feature in the first feature set whose score is smaller than $\theta$ (for example, the first target feature) is replaced by the feature corresponding to it in the second feature set (for example, the second target feature).
  • When the first data and the second data are isomorphic multimodal data presented in the same manner, $\operatorname{proj}$ represents the identity map.
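The formula above can be implemented with a boolean mask. The sketch below assumes vector-form features, a precomputed index mapping standing in for proj, and a threshold theta; all names are illustrative:

```python
import torch

def replace_low_score(first_set: torch.Tensor, second_set: torch.Tensor,
                      first_scores: torch.Tensor, mapping: torch.Tensor,
                      theta: float = 0.01) -> torch.Tensor:
    """Return the third feature set.

    A feature of the first set whose score falls below theta is replaced by
    its counterpart in the second set; mapping[i] is the index in the second
    set corresponding to feature i (the identity map for isomorphic data).
    """
    keep = (first_scores >= theta).unsqueeze(-1)  # the indicator function
    projected = second_set[mapping]               # proj_{A->B}(x_i)
    return torch.where(keep, first_set, projected)

first_set = torch.rand(6, 256)
second_set = torch.rand(6, 256)
scores = torch.tensor([0.9, 0.001, 0.5, 0.7, 0.3, 0.002])  # 2nd and 6th fall below theta
third_set = replace_low_score(first_set, second_set, scores,
                              mapping=torch.arange(6))      # isomorphic: identity map
```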
  • When the first data and the second data are heterogeneous multimodal data, take as an example that the first data is an RGB image, the second data is a point cloud, and the target task is a detection task. The spatial relationship between the point cloud and the image (as noted above, the first preset rule can be related to the spatial relationship between the multimodal data) can be used to perform a projection that finds the correspondence between image patches and points.
  • Specifically, assume there are $N_{img}$ image patches and $N_{point}$ 3D sampling points as the input of the neural network. The $N_{point} \to N_{img}$ mapping that projects the $n_{point}$-th point onto its corresponding $n_{img}$-th image patch can be expressed as:

  $$d\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,R_t \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad n_{img} = \left\lfloor \frac{v}{P} \right\rfloor \cdot \frac{W}{P} + \left\lfloor \frac{u}{P} \right\rfloor$$

  where $K$ and $R_t$ are the intrinsic and extrinsic parameters of the camera, $(x, y, z)$ are the 3D coordinates of the point, $(u, v)$ is the 2D pixel of the image (with $d$ the projective depth), and $W$ and $P$ are the width of the original image and the width of an image patch, respectively.
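A sketch of this projection is given below. The decomposition into a pinhole projection followed by a patch-index computation, and the row-major patch ordering, are assumptions consistent with the definitions above:

```python
import torch

def point_to_patch(points_xyz: torch.Tensor, K: torch.Tensor, Rt: torch.Tensor,
                   W: int, P: int) -> torch.Tensor:
    """Map each 3D sampling point to the index of the image patch it lands in.

    points_xyz: (N, 3); K: (3, 3) intrinsics; Rt: (3, 4) extrinsics.
    Points are assumed to project inside the image; real code would clamp
    or filter out-of-view points.
    """
    ones = torch.ones(points_xyz.shape[0], 1)
    homo = torch.cat([points_xyz, ones], dim=1)  # (N, 4) homogeneous coordinates
    cam = (K @ (Rt @ homo.T)).T                  # (N, 3) camera-frame projection
    uv = cam[:, :2] / cam[:, 2:3]                # 2D pixels after division by depth
    col = (uv[:, 0] // P).long()                 # patch column index
    row = (uv[:, 1] // P).long()                 # patch row index
    return row * (W // P) + col                  # n_img in row-major order

K = torch.eye(3)
Rt = torch.eye(3, 4)                             # placeholder camera parameters
points = torch.rand(100, 3) * 100 + torch.tensor([10.0, 10.0, 1.0])
patch_index = point_to_patch(points, K, Rt, W=224, P=16)
```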
  • In addition, when the multimodal data comprises more than two modalities (for example, third data has also been acquired), a distribution scheme can be fixed in advance to keep the correspondences between the feature sets of different data from becoming confused: $a_B(A) \in \{0, 1\}^N$. In this case, the feature replacement between multimodal data can be expressed as:

  $$\hat{x}_i^{l,A} = x_i^{l,A} \odot \mathbb{1}_{\left[s^l(x_i^{l,A}) \ge \theta\right]} + \sum_{B \ne A}^{M} a_B(A)_i \, \operatorname{proj}_{A \to B}\!\left(x_i^{l}\right) \odot \mathbb{1}_{\left[s^l(x_i^{l,A}) < \theta\right]}$$

  where $M$ is the number of different modalities in the multimodal data and $a_B(A)_i$ indicates whether the $i$-th feature of modality $A$ takes its replacement from modality $B$; for the remaining notation, refer to the description of the preceding formula, which is not repeated here.
  • Step 404: acquire data features based on the third feature set and the second feature set.
  • After the data processing device acquires the third feature set, it can acquire data features based on the third feature set and the second feature set. The data features are used to implement a computer vision task, such as a classification task, a segmentation task, a detection task, or an image generation task.
  • The position of the data features in the neural network depends on the positions of the first feature set and the second feature set. For example, the data features may be produced by fusing the outputs at the position of the first and second feature sets; for another example, they may be located one or more network layers after that position. The position of the data features in the neural network is not limited here.
  • Exemplarily, suppose the neural network includes an attention network, a first network layer, and a second network layer. If the first feature set and the second feature set are the output of the attention network, then acquiring data features based on the third feature set and the second feature set may include: inputting the third feature set and the second feature set into the first network layer to obtain the data features. If the first feature set and the second feature set are the output of the first network layer, it may include: inputting the third feature set and the second feature set into the second network layer to obtain the data features.
  • Exemplarily, suppose the neural network includes a multi-layer perceptron, a first network layer, and a second network layer. If the first feature set and the second feature set are the output of the multi-layer perceptron, then acquiring data features may include: inputting the third feature set and the second feature set into the first network layer to obtain the data features; the data features can then be input into the second network layer to obtain the result of the target task. If the first feature set and the second feature set are the output of the first network layer, it may include: inputting the third feature set and the second feature set into the second network layer to obtain the data features.
  • Exemplarily, the same holds when the neural network includes a pooling layer, a first network layer, and a second network layer: depending on whether the first and second feature sets are the output of the pooling layer or of the first network layer, the third feature set and the second feature set are input into the first network layer or the second network layer, respectively, to obtain the data features.
  • the above-mentioned second network layer is related to the target task, and can be set according to actual needs, and is not specifically limited here.
  • For example, when the target task is a classification task, the second network layer can be a fully connected layer; when the target task is a segmentation task or a detection task, the second network layer may be a convolutional neural network layer or an upsampling layer, as sketched below.
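A sketch of choosing the second network layer by task; the class count, channel width, and upsampling factor are illustrative assumptions:

```python
import torch
import torch.nn as nn

def make_second_layer(task: str, d_model: int = 256, num_classes: int = 10) -> nn.Module:
    """Pick a second network layer that matches the computer vision task."""
    if task == "classification":
        return nn.Linear(d_model, num_classes)  # fully connected layer
    if task in ("segmentation", "detection"):
        return nn.Sequential(                   # convolution then upsampling
            nn.Conv2d(d_model, num_classes, kernel_size=1),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
        )
    raise ValueError(f"unsupported task: {task}")

data_features = torch.rand(1, 256)              # pooled data features
logits = make_second_layer("classification")(data_features)

feature_map = torch.rand(1, 256, 14, 14)        # spatial data features
masks = make_second_layer("segmentation")(feature_map)
```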
  • In addition, it should be noted that step 403 above only describes replacing the first target feature in the first feature set with the second target feature in the second feature set to obtain the third feature set. In practical applications, the third target feature in the first feature set may also be used to replace the fourth target feature in the second feature set to obtain a fourth feature set. In other words, the features of the two feature sets may be replaced one way only (for example, the acquisition of the third feature set) or interchanged in both directions (for example, the acquisition of both the third and fourth feature sets), which is not limited here. For the process of obtaining the fourth feature set, refer to the description of step 403, which is not repeated here.
  • the third target feature corresponds to the fourth target feature
  • the specific corresponding relationship can refer to the above-mentioned corresponding relationship between the first target feature and the second target feature, which will not be repeated here.
  • Furthermore, when the neural network has a multi-layer structure, the above feature replacement can be performed at at least one layer: for example, only at a certain layer, at several layers, or at every layer, which is not specifically limited here.
  • Optionally, if the features of the two feature sets are interchanged, step 404 may include: replacing the fourth target feature in the second feature set with the third target feature in the first feature set to obtain the fourth feature set, and then acquiring the data features based on the third feature set and the fourth feature set.
  • The position of the third target feature in the first feature set corresponds to the position of the fourth target feature in the second feature set; a sketch of this mutual interchange follows.
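The mutual interchange could be sketched as below, again with vector-form features, per-set score sets, and a shared threshold as illustrative assumptions; corresponding positions are used as the mapping:

```python
import torch

def interchange(first_set: torch.Tensor, second_set: torch.Tensor,
                scores_a: torch.Tensor, scores_b: torch.Tensor,
                theta: float = 0.01):
    """Swap features in both directions at corresponding positions.

    Low-scoring features of each set are overwritten by the counterpart
    features of the other set, yielding the third and fourth feature sets.
    """
    low_a = scores_a < theta
    low_b = scores_b < theta
    third = first_set.clone()
    fourth = second_set.clone()
    third[low_a] = second_set[low_a]   # second target features replace first
    fourth[low_b] = first_set[low_b]   # third target features replace fourth
    return third, fourth

a = torch.rand(6, 256)
b = torch.rand(6, 256)
sa = torch.tensor([0.9, 0.0, 0.5, 0.6, 0.8, 0.0])  # 2nd and 6th of set A replaced
sb = torch.tensor([0.7, 0.9, 0.0, 0.0, 0.6, 0.9])  # 3rd and 4th of set B replaced
third_set, fourth_set = interchange(a, b, sa, sb)  # cf. the FIG. 11 example below
```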
  • To make the process of the data processing method more intuitive, the following example, described with reference to FIG. 11, takes the feature sets at the position shown in FIG. 10A, the first data as shown in FIG. 5A, the second data as shown in FIG. 5B, mutual interchange between the first feature set and the second feature set, and a scoring network used to determine the features to be replaced.
  • Referring to FIG. 11, the neural network includes an attention network, a first network layer, and a second network layer. The first data is an RGB image, the second data is a depth image, and the two are isomorphic multimodal data. The first data and the second data are each split into 6 image patches and input into the attention network and the first network layer to obtain the first feature set (A1, A2, A3, A4, A5, A6) and the second feature set (B1, B2, B3, B4, B5, B6).
  • A scoring network trained with the L1 norm scores the first feature set and the second feature set, and the features to be replaced in each feature set are determined from the scores. The correspondence between the two feature sets is: A1-B1, A2-B2, A3-B3, A4-B4, A5-B5, A6-B6. Based on the scoring network, the first target features in the first feature set are determined to be A2 and A6, and the second target features in the second feature set to be B2 and B6; replacing the first target features with the second target features yields the third feature set (A1, B2, A3, A4, A5, B6). Likewise, the third target features in the first feature set are determined to be A3 and A4, and the fourth target features in the second feature set to be B3 and B4; replacing the fourth target features with the third target features yields the fourth feature set (B1, B2, A3, A4, B5, B6). After the replacement, the third and fourth feature sets can enter the next layer of the neural network and eventually the second network layer, whose outputs are fused to obtain the semantic segmentation result (for example, a semantic class for each pixel). To keep the replacement positions accurate, residual position coding can be used for alignment. The correspondence above has already been described and is not repeated here.
  • Continuing example 2, the following example, described with reference to FIG. 12, takes the feature sets at the position shown in FIG. 10A, the first data as shown in FIG. 6A, the second data as shown in FIG. 6B, mutual interchange between the first feature set and the second feature set, and a scoring network used to determine the features to be replaced.
  • Referring to FIG. 12, the neural network includes a multi-layer perceptron, a first network layer, and a second network layer. The first data is an RGB image, the second data is point cloud data, and the two are heterogeneous multimodal data.
  • The first data is split into 5 image patches and the second data is sampled to obtain sampling points (in FIG. 12, the sampling points are divided into 6 groups); these are input into the multi-layer perceptron and the first network layer to obtain the first feature set (A1, A2, A3, A4, A5) and the second feature set (B1, B2, B3, B4, B5, B6). A scoring network trained with the L1 norm scores both feature sets, and the features to be replaced are determined from the scores. The correspondence between the two feature sets includes: A1-B2, A2-B6, A4-B5, A5-B3. Based on the scoring network, the first target feature in the first feature set is determined to be A1 and the second target feature in the second feature set to be B2; replacing A1 with B2 yields the third feature set (B2, A2, A3, A4, A5). Likewise, the third target features in the first feature set are determined to be A5, A4, and A2, and the fourth target features in the second feature set to be B3, B5, and B6; replacing the fourth target features with the third target features yields the fourth feature set (B1, B2, A5, B4, A4, A2). After the replacement, the third and fourth feature sets can enter the next layer of the neural network and eventually the second network layer, which outputs the image with detection boxes and the point cloud with detection boxes. To keep the replacement positions accurate, residual position coding can be used for alignment.
  • In the embodiment of the present application, on the one hand, replacing features between data of different modalities efficiently fuses the information of the different modal data, so that the acquired data features have the characteristics of multimodal data and their expressive ability is improved, which in turn makes the results of target tasks obtained from the data features more accurate.
  • On another hand, by sparsifying the scoring network, the scores of some features differ markedly from the scores of other features, which makes it possible to determine which features are useful or useless.
  • On yet another hand, the position of a replaced feature is determined by residual position coding, which ensures that replacing a feature does not change its position in the original feature set.
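The embodiment does not spell out the structure of the residual position coding; one plausible reading, sketched below under that assumption, is a learned per-position embedding that is re-added residually after each replacement, so that a swapped-in feature always carries the position signal of the slot it now occupies:

```python
import torch
import torch.nn as nn

class ResidualPositionCoding(nn.Module):
    """Learned per-position embedding, re-added after every replacement step."""
    def __init__(self, num_positions: int, d_model: int = 256):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(num_positions, d_model))

    def forward(self, feature_set: torch.Tensor) -> torch.Tensor:
        # Residual addition: the feature content may have been swapped in from
        # another modality, but the position signal of the slot is restored.
        return feature_set + self.pos

rpc = ResidualPositionCoding(num_positions=6)
aligned_third_set = rpc(torch.rand(6, 256))
```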
  • To show the benefit of the data processing method provided by the embodiments of the present application (hereafter "MIX") more intuitively, the performance of different methods on dataset 1 and dataset 2 is compared, with the test results given in Table 1. The compared methods include: fully convolutional networks (FCN), RefineNet, FuseNet, self-supervised model adaptation (SSMA), a conditional normalization method (cross-iteration batch normalization, CBN), the RGB-D fusion network (RDFNet), the channel exchanging network (CEN), a fusion method, an input concatenation (concat) method, and the MIX method provided by the embodiments of the present application. The 32S in FCN-32S means that the feature map of the convolutional layer is 1/32 of the original image; w/o denotes a model of the same structure without multimodal fusion; [Ti] denotes a tiny model and [S] a small model, the small model having more layers and channels than the tiny one. The data in Table 1 show that MIX improves pixel accuracy, mean accuracy, and mean intersection-over-union over the other methods, indicating that the method provided by the embodiments improves the expressive ability of the data features obtained after feature replacement and thus makes the computer vision results more accurate.
  • The data processing method in the embodiments of the present application has been described above; the data processing device in the embodiments is described below with reference to FIG. 13. An embodiment of the data processing device includes:
  • an acquisition unit 1301, configured to acquire first data and second data, where the first data and the second data are of different modalities;
  • the acquisition unit 1301 is further configured to acquire a first feature set of the first data and a second feature set of the second data;
  • a replacement unit 1302, configured to replace the first target feature in the first feature set with the second target feature in the second feature set to obtain a third feature set, the second target feature corresponding to the first target feature;
  • the acquisition unit 1301 is further configured to acquire data features based on the third feature set and the second feature set, the data features being used to implement a computer vision task.
  • the data processing device may further include: a determining unit 1303, configured to determine the second target feature based on the first score set and/or the second score set.
  • In this embodiment, the operations performed by each unit in the data processing device are similar to those described in the foregoing embodiments shown in FIG. 1 to FIG. 12 and are not repeated here.
  • In this embodiment, the replacement unit 1302 performs replacement using features from data of different modalities, which can efficiently fuse the information of the different modal data, so that the acquired data features have the characteristics of multimodal data and their expressive ability is improved.
  • Referring to FIG. 14, another data processing device provided by the present application may include a processor 1401, a memory 1402, and a communication port 1403.
  • the processor 1401, memory 1402 and communication port 1403 are interconnected by wires.
  • program instructions and data are stored in the memory 1402 .
  • the memory 1402 stores program instructions and data corresponding to the steps executed by the data processing device in the corresponding implementations shown in FIGS. 1 to 12 .
  • the processor 1401 is configured to execute the steps executed by the data processing device shown in any one of the embodiments shown in FIGS. 1 to 12 .
  • the communication port 1403 may be used for receiving and sending data, and for performing steps related to acquiring, sending, and receiving in any of the above-mentioned embodiments shown in FIG. 1 to FIG. 12 .
  • the data processing device may include more or fewer components than those shown in FIG. 14 , which is only an example in the present application and not limited thereto.
  • In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways.
  • For example, the device embodiments described above are only illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be fully or partially realized by software, hardware, firmware or any combination thereof.
  • the integrated units When the integrated units are implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, all or part of the processes or functions according to the embodiments of the present invention will be generated.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from a website, computer, server or data center Transmission to another website site, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.


Abstract

The embodiments of the present application disclose a data processing method applied to multimodal fusion scenarios. The method includes: acquiring first data and second data, the first data and the second data being of different modalities; acquiring a first feature set of the first data and a second feature set of the second data; replacing a first target feature in the first feature set with a second target feature in the second feature set to obtain a third feature set, the second target feature corresponding to the first target feature; and acquiring data features based on the third feature set and the second feature set, the data features being used to implement a computer vision task. By performing replacement using features from data of different modalities, the information of the different modal data can be fused efficiently, so that the acquired data features have the characteristics of multimodal data and their expressive ability is improved.


Claims (27)

  1. A data processing method, wherein the method is applied to a multimodal fusion scenario, and the method comprises:
    acquiring first data and second data, wherein the first data and the second data have different modalities;
    acquiring a first feature set of the first data and a second feature set of the second data;
    replacing a first target feature in the first feature set with a second target feature in the second feature set to obtain a third feature set, wherein the second target feature corresponds to the first target feature; and
    acquiring a data feature based on the third feature set and the second feature set, wherein the data feature is used to implement a computer-vision task.
  2. The method according to claim 1, wherein the acquiring a data feature based on the third feature set and the second feature set comprises:
    replacing a fourth target feature in the second feature set with a third target feature in the first feature set to obtain a fourth feature set, wherein the third target feature corresponds to the fourth target feature; and
    acquiring the data feature based on the third feature set and the fourth feature set.
  3. The method according to claim 1 or 2, wherein before the replacing a first target feature in the first feature set with a second target feature in the second feature set, the method further comprises:
    acquiring a first score set of the first feature set, wherein first features in the first feature set correspond one-to-one to first scores in the first score set;
    acquiring a second score set of the second feature set, wherein second features in the second feature set correspond one-to-one to second scores in the second score set; and
    determining the second target feature based on the first score set and/or the second score set.
  4. The method according to claim 3, wherein the acquiring a first score set of the first feature set comprises:
    evaluating each feature in the first feature set based on a scoring network to obtain the first score set, wherein the scoring network is used to evaluate the importance of features; and
    the acquiring a second score set of the second feature set comprises:
    evaluating each feature in the second feature set based on the scoring network to obtain the second score set.
  5. The method according to claim 4, wherein output values of the scoring network follow a sparse distribution.
  6. The method according to claim 3, wherein the acquiring a first score set of the first feature set comprises:
    performing a mathematical operation on each first feature in the first feature set to obtain the first score set, wherein the mathematical operation is performed on each first feature itself and comprises a rank operation or a norm operation; and
    the acquiring a second score set of the second feature set comprises:
    performing the mathematical operation on each second feature in the second feature set to obtain the second score set.
  7. The method according to any one of claims 1 to 6, wherein the acquiring a first feature set of the first data and a second feature set of the second data comprises:
    acquiring the first feature set and the second feature set based on a neural network, wherein the neural network comprises an attention network, a multilayer perceptron, a pooling layer or a convolutional layer.
  8. The method according to claim 7, wherein the acquiring the first feature set and the second feature set based on a neural network comprises:
    splitting the first data to obtain a plurality of first sub-data;
    splitting the second data to obtain a plurality of second sub-data; and
    inputting the plurality of first sub-data and the second sub-data into the neural network to obtain the first feature set and the second feature set.
  9. The method according to any one of claims 1 to 8, wherein the replacing a first target feature in the first feature set with a second target feature in the second feature set comprises:
    replacing the first target feature with the second target feature based on residual position encoding, wherein the residual position encoding is used to determine the position of each feature in the first feature set and the second feature set.
  10. The method according to any one of claims 1 to 9, wherein the neural network further comprises a first network layer, and the structure of the first network layer is related to the neural network.
  11. The method according to any one of claims 1 to 10, wherein the method further comprises:
    inputting the data feature into a second network layer to obtain a result of the computer-vision task, wherein the second network layer is related to the computer-vision task.
  12. The method according to claim 11, wherein the computer-vision task is a classification task and the second network layer is a fully connected layer; or the computer-vision task is a segmentation task or a detection task and the second network layer is a convolutional neural network layer or an upsampling layer.
  13. A data processing device, wherein the data processing device is applied to a multimodal fusion scenario, and the data processing device comprises:
    an acquisition unit, configured to acquire first data and second data, wherein the first data and the second data have different modalities;
    the acquisition unit being further configured to acquire a first feature set of the first data and a second feature set of the second data;
    a replacement unit, configured to replace a first target feature in the first feature set with a second target feature in the second feature set to obtain a third feature set, wherein the second target feature corresponds to the first target feature; and
    the acquisition unit being configured to acquire a data feature based on the third feature set and the second feature set, wherein the data feature is used to implement a computer-vision task.
  14. The data processing device according to claim 13, wherein the acquisition unit is specifically configured to replace a fourth target feature in the second feature set with a third target feature in the first feature set to obtain a fourth feature set, wherein the third target feature corresponds to the fourth target feature; and
    the acquisition unit is specifically configured to acquire the data feature based on the third feature set and the fourth feature set.
  15. The data processing device according to claim 13 or 14, wherein the acquisition unit is further configured to acquire a first score set of the first feature set, wherein first features in the first feature set correspond one-to-one to first scores in the first score set;
    the acquisition unit is further configured to acquire a second score set of the second feature set, wherein second features in the second feature set correspond one-to-one to second scores in the second score set; and
    the data processing device further comprises:
    a determination unit, configured to determine the second target feature based on the first score set and/or the second score set.
  16. The data processing device according to claim 15, wherein the acquisition unit is specifically configured to evaluate each feature in the first feature set based on a scoring network to obtain the first score set, wherein the scoring network is used to evaluate the importance of features; and
    the acquisition unit is specifically configured to evaluate each feature in the second feature set based on the scoring network to obtain the second score set.
  17. The data processing device according to claim 16, wherein output values of the scoring network follow a sparse distribution.
  18. The data processing device according to claim 15, wherein the acquisition unit is specifically configured to perform a mathematical operation on each first feature in the first feature set to obtain the first score set, wherein the mathematical operation is performed on each first feature itself and comprises a rank operation or a norm operation; and
    the acquisition unit is specifically configured to perform the mathematical operation on each second feature in the second feature set to obtain the second score set.
  19. The data processing device according to any one of claims 13 to 18, wherein the acquisition unit is specifically configured to acquire the first feature set and the second feature set based on a neural network, wherein the neural network comprises an attention network, a multilayer perceptron, a pooling layer or a convolutional layer.
  20. The data processing device according to claim 19, wherein the acquisition unit is specifically configured to split the first data to obtain a plurality of first sub-data;
    the acquisition unit is specifically configured to split the second data to obtain a plurality of second sub-data; and
    the acquisition unit is specifically configured to input the plurality of first sub-data and the second sub-data into the neural network to obtain the first feature set and the second feature set.
  21. The data processing device according to any one of claims 13 to 20, wherein the replacement unit is specifically configured to replace the first target feature with the second target feature based on residual position encoding, wherein the residual position encoding is used to determine the position of each feature in the first feature set and the second feature set.
  22. The data processing device according to any one of claims 13 to 21, wherein the neural network further comprises a first network layer, and the structure of the first network layer is related to the neural network.
  23. The data processing device according to any one of claims 13 to 22, wherein the acquisition unit is further configured to input the data feature into a second network layer to obtain a result of the computer-vision task, wherein the second network layer is related to the computer-vision task.
  24. The data processing device according to claim 23, wherein the computer-vision task is a classification task and the second network layer is a fully connected layer; or the computer-vision task is a segmentation task or a detection task and the second network layer is a convolutional neural network layer or an upsampling layer.
  25. A data processing device, comprising: a processor coupled to a memory, wherein the memory is configured to store a program or instructions, and when the program or instructions are executed by the processor, the data processing device is caused to perform the method according to any one of claims 1 to 12.
  26. A computer storage medium, comprising computer instructions that, when run on a terminal device, cause the terminal device to perform the method according to any one of claims 1 to 12.
  27. A computer program product, wherein when the computer program product is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 12.
PCT/CN2023/077191 2022-03-02 2023-02-20 A data processing method and related device WO2023165361A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210203516.0A CN114897039A (zh) 2022-03-02 2022-03-02 A data processing method and related device
CN202210203516.0 2022-03-02

Publications (1)

Publication Number Publication Date
WO2023165361A1 (zh)

Family

ID=82715020

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/077191 WO2023165361A1 (zh) 2022-03-02 2023-02-20 一种数据处理方法及相关设备

Country Status (2)

Country Link
CN (1) CN114897039A (zh)
WO (1) WO2023165361A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897039A (zh) * 2022-03-02 2022-08-12 华为技术有限公司 A data processing method and related device


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200357143A1 (en) * 2019-05-09 2020-11-12 Sri International Semantically-aware image-based visual localization
CN112906797A (zh) * 2021-02-25 2021-06-04 华北电力大学 一种基于计算机视觉和深度学习的平面抓取检测方法
CN113673613A (zh) * 2021-08-25 2021-11-19 平安科技(深圳)有限公司 基于对比学习的多模态数据特征表达方法、装置及介质
CN113707309A (zh) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 基于机器学习的疾病预测方法及装置
CN114897039A (zh) * 2022-03-02 2022-08-12 华为技术有限公司 一种数据处理方法及相关设备

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117288094A (zh) * 2023-11-24 2023-12-26 太原理工大学 基于激光传感器的掘进机实时定位系统
CN117288094B (zh) * 2023-11-24 2024-01-26 太原理工大学 基于激光传感器的掘进机实时定位系统
CN117807434A (zh) * 2023-12-06 2024-04-02 中国信息通信研究院 一种通信数据集处理方法和装置

Also Published As

Publication number Publication date
CN114897039A (zh) 2022-08-12


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23762762

Country of ref document: EP

Kind code of ref document: A1