WO2023087659A1 - Multimodal data processing method, apparatus, device, and storage medium - Google Patents

Multimodal data processing method, apparatus, device, and storage medium

Info

Publication number
WO2023087659A1
Authority
WO
WIPO (PCT)
Prior art keywords
modal
information
multimodal
polarizer
data processing
Prior art date
Application number
PCT/CN2022/095363
Other languages
English (en)
French (fr)
Inventor
晁银银
王斌强
董刚
胡克坤
赵雅倩
李仁刚
Original Assignee
Inspur (Beijing) Electronic Information Industry Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur (Beijing) Electronic Information Industry Co., Ltd.
Publication of WO2023087659A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Investigating Or Analysing Materials By Optical Means (AREA)

Abstract

The present application discloses a multimodal data processing method, apparatus, device, and storage medium. The method includes: acquiring different optical modality information of a target object and building a multimodal dataset; constructing a multimodal fusion network model, the multimodal fusion network model including a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features; training the multimodal fusion network model with the multimodal dataset; and acquiring different optical modality information of an object under test, feeding it into the trained multimodal fusion network model, and outputting a classification result or a regression result.

Description

Multimodal data processing method, apparatus, device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111400866.8, filed with the China National Intellectual Property Administration on November 19, 2021 and entitled "Multimodal data processing method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of multimodal information processing, and in particular to a multimodal data processing method, apparatus, device, and storage medium.
Background
Human experience of the world is multimodal. For artificial intelligence to understand the world around us better, it needs to be able to interpret and reason about multimodal information. In multimodal machine learning, because complementary information may exist between different modalities, using data from multiple modalities allows a model to make more robust predictions. In addition, a multimodal system can keep working when data from one of the modalities is missing. In recent years multimodal machine learning has developed rapidly, covering fields such as audio-visual speech recognition, multimodal emotion recognition, medical image analysis, and multimedia event detection.
The inventors have realized that, although academia and industry have made considerable progress in multimodal fusion, research at this stage mainly targets three modalities: images, speech, and text. For optical modalities such as polarization and frequency, the corresponding multimodal datasets and multimodal data fusion have received little study, even though the rich target and environment features in optical modalities are of great significance to fields such as target recognition, security, and biomedicine.
Summary
A multimodal data processing method includes:
acquiring different optical modality information of a target object and building a multimodal dataset;
constructing a multimodal fusion network model, the multimodal fusion network model including a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features;
training the multimodal fusion network model with the multimodal dataset; and
acquiring different optical modality information of an object under test, feeding it into the trained multimodal fusion network model, and outputting a classification result or a regression result.
In one embodiment of the above multimodal data processing method provided in the embodiments of the present application, acquiring different optical modality information of the target object includes:
acquiring information of at least two of the three different modalities of the target object: intensity, polarization, and frequency.
In one embodiment, acquiring the information of the three different modalities of intensity, polarization, and frequency of the target object includes:
splitting the light reflected from the target object into a first beam and a second beam by a beam-splitting system, the first beam being transmitted to an optical micro-polarizer system and the second beam to a Fourier 4f system;
acquiring intensity information and polarization information through the optical micro-polarizer system; and
simultaneously acquiring frequency information through the Fourier 4f system.
In one embodiment, the optical micro-polarizer system includes a first convex lens, a micro-polarizer, and a first detector, where:
the first convex lens is configured to converge the first beam onto the micro-polarizer;
the micro-polarizer is configured to collect intensity information and polarization information simultaneously; and
the first detector is configured to convert the intensity information and polarization information collected by the micro-polarizer into two-dimensional matrix data.
In one embodiment, each pixel unit of the micro-polarizer includes two anti-reflection subunits for collecting intensity information and two linear polarization subunits for collecting polarization information;
the two linear polarization subunits are distributed diagonally, and the two anti-reflection subunits are distributed diagonally.
In one embodiment, the Fourier 4f system includes a second convex lens, a third convex lens, a diffraction screen, and a second detector, where:
the second convex lens is located between the target object and the beam-splitting system and is configured to converge the light reflected from the target object into parallel light and transmit it to the beam-splitting system;
the diffraction screen is located between the beam-splitting system and the third convex lens and is configured to diffract the second beam to obtain diffracted light;
the third convex lens is configured to converge the diffracted light onto the second detector; and
the second detector is configured to collect the frequency-spectrum signal.
In one embodiment, the modal feature extraction network includes a plurality of modal feature extraction sub-networks, the modal feature extraction sub-networks corresponding to the modalities one to one.
In one embodiment, the input of each modal feature extraction sub-network is the multimodal dataset in the form of two-dimensional matrices and the output is a modal embedding vector;
the input of the modal feature fusion network is the modal embedding vectors and the output is the fused modality obtained by computing a triple Cartesian product; and
the input of the decision network is the fused modality and the output is the result of the completed classification task or regression task.
An embodiment of the present application further provides a multimodal data processing apparatus, including:
a multimodal information acquisition module, configured to acquire different optical modality information of a target object and further configured to acquire different optical modality information of an object under test;
a dataset building module, configured to build a multimodal dataset from the different optical modality information of the target object;
a model construction module, configured to construct a multimodal fusion network model, the multimodal fusion network model including a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features;
a model training module, configured to train the multimodal fusion network model with the multimodal dataset; and
a model inference module, configured to feed the different optical modality information of the object under test into the trained multimodal fusion network model and output a classification result or a regression result.
In one embodiment of the above multimodal data processing apparatus, the multimodal information acquisition module is specifically configured to acquire information of at least two of the three different modalities of the target object: intensity, polarization, and frequency.
In one embodiment, the multimodal information acquisition module includes a beam-splitting system, an optical micro-polarizer system, and a Fourier 4f system;
the beam-splitting system is configured to split the light reflected from the target object into a first beam and a second beam, the first beam being transmitted to the optical micro-polarizer system and the second beam to the Fourier 4f system;
the optical micro-polarizer system is configured to acquire intensity information and polarization information; and
the Fourier 4f system is configured to acquire frequency information.
In one embodiment, the optical micro-polarizer system includes a first convex lens, a micro-polarizer, and a first detector, where:
the first convex lens is configured to converge the first beam onto the micro-polarizer;
the micro-polarizer is configured to collect intensity information and polarization information simultaneously; and
the first detector is configured to convert the intensity information and polarization information collected by the micro-polarizer into two-dimensional matrix data.
In one embodiment, each pixel unit of the micro-polarizer includes two anti-reflection subunits for collecting intensity information and two linear polarization subunits for collecting polarization information;
the two linear polarization subunits are distributed diagonally, and the two anti-reflection subunits are distributed diagonally.
In one embodiment, the Fourier 4f system includes a second convex lens, a third convex lens, a diffraction screen, and a second detector, where:
the second convex lens is located between the target object and the beam-splitting system and is configured to converge the light reflected from the target object into parallel light and transmit it to the beam-splitting system;
the diffraction screen is located between the beam-splitting system and the third convex lens and is configured to diffract the second beam to obtain diffracted light;
the third convex lens is configured to converge the diffracted light onto the second detector; and
the second detector is configured to collect the frequency-spectrum signal.
An embodiment of the present application further provides a multimodal data processing device, including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of any one of the above multimodal data processing methods.
An embodiment of the present application further provides one or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of any one of the above multimodal data processing methods.
The details of one or more embodiments of the present application are set forth in the drawings and the description below. Other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below show merely embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a multimodal data processing method provided by the present application according to one or more embodiments;
FIG. 2 is a schematic structural diagram of a multimodal information acquisition module provided by the present application according to one or more embodiments;
FIG. 3 is a schematic structural diagram of each pixel unit in a micro-polarizer provided by the present application according to one or more embodiments;
FIG. 4 is a schematic structural diagram of an existing Fourier 4f system;
FIG. 5 is a schematic structural diagram of a Fourier 4f system provided by the present application according to one or more embodiments;
FIG. 6 is a schematic structural diagram of a multimodal fusion network model provided by the present application according to one or more embodiments;
FIG. 7 is a schematic diagram of multimodal tensor fusion provided by the present application according to one or more embodiments;
FIG. 8 is a schematic structural diagram of a multimodal data processing apparatus provided by the present application according to one or more embodiments;
FIG. 9 is a schematic diagram of the internal structure of a computer device provided by the present application according to one or more embodiments;
FIG. 10 is a schematic diagram of the internal structure of a computer device provided by the present application according to one or more embodiments.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The present application provides a multimodal data processing method. As shown in FIG. 1, the method is described by taking its application to a computer device as an example, and includes the following steps:
S101: acquire different optical modality information of a target object and build a multimodal dataset;
S102: construct a multimodal fusion network model, the multimodal fusion network model including a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features;
specifically, an attention-based modal feature extraction network is built to extract the features of each modality, a Cartesian-product-based modal feature fusion network merges the multimodal information, and finally a decision network completes the classification and regression tasks;
S103: train the multimodal fusion network model with the multimodal dataset;
S104: acquire different optical modality information of the object under test, feed it into the trained multimodal fusion network model, and output a classification result or a regression result.
The above multimodal data processing method provided in the embodiments of the present application mainly comprises two parts: acquiring different optical modality information of an object, and neural-network-based multimodal information fusion. In this way, the rich features of each optical modality of the object and the intrinsic relationships between different optical modalities can be captured, multimodal information fusion can be realized, and the enriched target features can be used to complete classification or regression tasks efficiently, which improves the discrimination accuracy and robustness of the network, promotes the development of multimodal artificial intelligence information extraction and fusion, and enhances competitiveness in application fields that combine optical information with multimodal artificial intelligence.
In a specific implementation of the above multimodal data processing method, step S101 of acquiring different optical modality information of the target object may specifically include: acquiring information of at least two of the three different modalities of the target object: intensity, polarization, and frequency. In practice, step S101 may acquire information of only at least two of the intensity, polarization, and frequency modalities of the target object, or may acquire information of at least two other modalities of the target object besides intensity, polarization, and frequency, which is not elaborated here.
A multimodal dataset can be built from the information of at least two of the three different modalities of intensity, polarization, and frequency of the target object. The rich target and environment features in these modalities are of great significance to fields such as target recognition, security, and biomedicine.
Specifically, the intensity modality is a measurement of spectral radiant intensity. It mainly captures the distribution of the different materials and objects in the scene, yielding an optical image in the traditional sense.
The polarization modality measures the vector information of the light field and is largely uncorrelated with spectral radiant intensity images. It can capture target surface characteristics, shape, shadow, and roughness in complex environments such as hazy weather, and has wide applications in fields such as atmospheric environment monitoring, biomedical diagnosis, and autonomous driving. Adding the polarization modality not only raises the probability of identifying the target but also increases detection accuracy.
The frequency modality captures the frequency distribution and variation of the image. The low-frequency components of the spectrum represent the slowly varying parts and coarse outline structure of the distribution function in the spatial domain; the high-frequency components represent the sharply varying parts and details of the image. By acquiring the frequency modality, more detailed features of the target object can be extracted.
In a specific implementation, acquiring the information of the three different modalities of intensity, polarization, and frequency of the target object in the above step may specifically include: first, splitting the light reflected from the target object into a first beam and a second beam by a beam-splitting system, the first beam being transmitted to an optical micro-polarizer system and the second beam to a Fourier 4f system; then acquiring intensity information and polarization information through the optical micro-polarizer system while acquiring frequency information through the Fourier 4f system.
It should be noted that the above step can be performed by a multimodal information acquisition module comprising the beam-splitting system, the optical micro-polarizer system, and the Fourier 4f system. This module can not only extract the three kinds of optical modality information simultaneously and build an optical multimodal dataset, but also solve the alignment problem between different modalities.
In practice, as shown in FIG. 2, the beam-splitting system may use a beam-splitting prism 1: the prism 1 splits the luminous flux reflected from the target object in two, one part entering the micro-polarizer system to acquire intensity information a and polarization information b, the other part entering the Fourier 4f system to acquire frequency information c. The specific type of beam-splitting system can be chosen according to the actual situation and is not limited here.
In a specific implementation, as shown in FIG. 2, the optical micro-polarizer system may include a first convex lens 2, a micro-polarizer 3, and a first detector 4, where:
the first convex lens 2 is configured to converge the first beam onto the micro-polarizer 3;
the micro-polarizer 3 is configured to collect intensity information a and polarization information b simultaneously; as shown in FIG. 3, each pixel unit of the micro-polarizer 3 includes four subunits in a 2×2 arrangement, specifically two anti-reflection subunits 31 for collecting intensity information a and two linear polarization subunits 32 for collecting polarization information b; the two linear polarization subunits 32 are distributed diagonally, and the two anti-reflection subunits 31 are distributed diagonally;
the first detector 4 is configured to convert the intensity information a and polarization information b collected by the micro-polarizer 3 into two-dimensional matrix data.
Specifically, as shown in FIG. 2, the first convex lens 2 converges the light exiting the beam-splitting prism 1 onto the micro-polarizer 3; the beam passes through the micro-polarizer 3 and is collected by the first detector 4, simultaneously generating the intensity information a and the polarization signal b. Each pixel unit of the micro-polarizer 3 includes four subunits corresponding to four pixels of the first detector 4, namely two anti-reflection subunits 31 and two linear polarization subunits 32 in a diagonal arrangement. This diagonal layout lets the entire first detector 4 sample polarized light and natural light uniformly. Although the micro-polarizer 3 halves the pixel resolution, the target features contained in the different modality information it captures can greatly improve the accuracy of the subsequent model. The linear polarization subunits 32 produce linearly polarized light via the polarizing principle of sub-wavelength metal wire grids; the anti-reflection subunits 31 improve light transmittance through an anti-reflection coating for a specific waveband evaporated onto the substrate, which to some extent compensates for the reduced pixel resolution of the polarization and intensity images. The micro-polarizer 3 can be fabricated by processes such as nanoimprinting or electron-beam lithography. Since each unit of the micro-polarizer 3 must be aligned with the pixels of the first detector 4, the micro-polarizer 3 and the first detector 4 can be integrated on the same substrate. Finally, the two-dimensional matrix data produced by the first detector 4 is transmitted to a computer, which splits it into two-dimensional matrix data corresponding to the intensity modality and the polarization modality.
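For illustration, the splitting of the detector readout into the two modal matrices can be sketched as follows. This is a minimal NumPy sketch assuming an even-sized frame and one particular diagonal assignment of the 2×2 subunits; the exact assignment within each unit is an assumption made here, not specified above.

```python
import numpy as np

def split_modalities(raw: np.ndarray):
    """Split a raw first-detector frame into intensity and polarization data.

    Assumed (hypothetical) 2x2 superpixel layout:
        [AR  LP]
        [LP  AR]
    with anti-reflection (AR) pixels on one diagonal and linear-polarization
    (LP) pixels on the other. Frame height and width must be even.
    """
    intensity = 0.5 * (raw[0::2, 0::2] + raw[1::2, 1::2])     # AR samples
    polarization = 0.5 * (raw[0::2, 1::2] + raw[1::2, 0::2])  # LP samples
    return intensity, polarization

frame = np.random.rand(128, 128)   # stand-in for a 128x128 detector readout
intensity, polarization = split_modalities(frame)
print(intensity.shape, polarization.shape)  # (64, 64) (64, 64)
```

The halving from 128×128 to 64×64 matches the pixel-resolution trade-off described above.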
In a specific implementation, as shown in FIG. 2, the Fourier 4f system may include a second convex lens 5, a third convex lens 6, a diffraction screen 7, and a second detector 8, where:
the second convex lens 5 is located between the target object and the beam-splitting system 1 and is configured to converge the light reflected from the target object into parallel light and transmit it to the beam-splitting system 1; this guarantees that the images of the different modalities share the same source: the luminous flux from the target object is first converged by the second convex lens 5 into outgoing parallel light, and the beam-splitting prism 1 then splits the incoming flux in two, generating transmitted light and reflected light that are passed to the optical micro-polarizer system and the Fourier 4f system, respectively;
the diffraction screen 7 is located between the beam-splitting system and the third convex lens 6 and is configured to diffract the second beam to obtain diffracted light;
the third convex lens 6 is configured to converge the diffracted light onto the second detector 8;
the second detector 8 is configured to collect the frequency-spectrum signal.
In practice, as shown in FIG. 4, a Fourier 4f system is a "4f system" composed of two convex lenses with the same focal length f, which realizes two Fourier transforms in cascade. The distribution, on the back focal plane of the first lens, of the plane wave carrying the information of the target object is proportional to the Fourier transform of the sample distribution; an inverse Fourier transform on the back focal plane of the second lens restores a clear image of the original sample. The Fourier 4f system has several derived diffraction systems and application scenarios, so in the specific embodiments of the present application a derivative of the Fourier 4f system, the Fourier spectrum analyzer, i.e. the Fraunhofer diffraction system, is used to collect the frequency information of the target object. As shown in FIG. 5, the Fourier 4f system (i.e. the Fraunhofer diffraction system) provided by the present application includes the second convex lens 5, the third convex lens 6, the diffraction screen 7, and the second detector 8. The diffraction screen 7 may be a slit or a window. The second convex lens 5 converges the light reflected from the target object, the resulting parallel light is incident on the diffraction screen 7, and the third convex lens 6 converges the diffracted light to obtain a spectrum image. Placing the second detector 8 on the spectrum plane collects the spectrum modality signal. This system physically realizes the Fourier transform, making it possible to examine, in the frequency domain, the response of the optical system to the image spectrum and thereby process the information contained in the image.
It should be noted that the second convex lens 5 and the third convex lens 6 are both confocal convex lenses. According to Fourier optics, a suitably arranged optical lens applies a forward or inverse Fourier transform to the wave field, and the Fourier transform extracts global features of the imaged object. Since the light-field distribution on the confocal plane of the two lenses equals the Fourier transform of the intensity distribution of the target object, various operations can be performed on this plane; by placing various modulators or filters there, many functions can be realized, such as Abbe-Porter spatial filtering.
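Numerically, the signal on the spectrum plane of this Fraunhofer arrangement corresponds, up to scale, to the squared magnitude of the two-dimensional Fourier transform of the field at the diffraction screen. The following is a minimal, idealized NumPy sketch of that correspondence, offered as a simulation aid rather than part of the application itself:

```python
import numpy as np

def fraunhofer_spectrum(screen: np.ndarray) -> np.ndarray:
    """Idealized spectrum-plane image of the Fraunhofer diffraction system.

    `screen` is the transmission function of the diffraction screen under
    uniform parallel illumination; the lens optically performs the Fourier
    transform, and the detector on the spectrum plane records intensity.
    """
    field = np.fft.fftshift(np.fft.fft2(screen))  # lens as Fourier transformer
    return np.abs(field) ** 2                     # detector measures power

# A narrow slit: low frequencies near the center carry the coarse outline,
# high frequencies toward the edges carry the sharp-edge detail.
screen = np.zeros((256, 256))
screen[:, 124:132] = 1.0
spectrum = fraunhofer_spectrum(screen)
```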
In a specific implementation, as shown in FIG. 6, the modal feature extraction network includes a plurality of modal feature extraction sub-networks, one per modality. The input of each sub-network is the multimodal dataset in the form of two-dimensional matrices, i.e. the intensity, polarization, and frequency modality signals collected by the first and second detectors, and the output is a modal embedding vector. The sub-networks share the same structure, consisting of an input layer, a flatten layer, a fully connected layer, and an attention layer, but the inputs and weight parameters of the sub-networks are not shared; the attention layer consists of a linear map, a ReLU activation, and a normalization layer.
It should be noted that the flatten layer "flattens" the input, i.e. turns a multi-dimensional input into a one-dimensional one. ReLU (Rectified Linear Unit) is an activation function commonly used in artificial neural networks.
Taking the intensity modality as an example, as shown in FIG. 6, the input layer of the network is the two-dimensional matrix of the intensity modality, assumed to be I_{64×64} (the input size depends on the number of detector pixels). The flatten layer converts it into a one-dimensional vector I_{4096} that is fed into the fully connected layer. The output I_{128} of the fully connected layer is then fed into the attention layer. The attention layer consists of a linear map, a ReLU activation, and a normalization layer; the ReLU layer contains 128 units, which guarantees that the output of the attention layer has the same dimension as its input, and the corresponding normalization layer outputs 128 weights W_I. Finally, the output weight vector W_I is multiplied element-wise with I_{128} to obtain the output of the modal feature extraction network:
z_I = W_I ⊙ I_{128}, with z_I ∈ R^128.
Similarly, the outputs for the polarization and frequency modalities are z_P, z_f ∈ R^128.
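A minimal PyTorch sketch of one such sub-network, following the layer sequence above (flatten, fully connected layer, then an attention layer of linear map, ReLU, and normalization, followed by the element-wise re-weighting); using softmax as the normalization layer is an assumption made here for concreteness:

```python
import torch
import torch.nn as nn

class ModalFeatureExtractor(nn.Module):
    """One modal feature extraction sub-network (weights not shared across
    modalities): 64x64 matrix -> flatten -> FC -> attention -> z in R^128."""

    def __init__(self, side: int = 64, embed_dim: int = 128):
        super().__init__()
        self.flatten = nn.Flatten()                      # 64x64 -> 4096
        self.fc = nn.Linear(side * side, embed_dim)      # I_4096 -> I_128
        self.attn = nn.Sequential(                       # attention layer
            nn.Linear(embed_dim, embed_dim),             # linear map
            nn.ReLU(),                                   # 128 ReLU units
            nn.Softmax(dim=-1),                          # assumed normalization
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc(self.flatten(x))   # I_128
        w = self.attn(h)               # 128 weights W_I
        return w * h                   # element-wise product -> z_I

z_I = ModalFeatureExtractor()(torch.randn(8, 64, 64))    # -> shape (8, 128)
```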
In a specific implementation, the input of the modal feature fusion network is the modal embedding vectors and the output is the fused modality obtained by computing a triple Cartesian product; that is, the modal feature fusion network extracts the intrinsic relationships between the different modalities and converts the multiple modal inputs into a single tensor (i.e. a three-dimensional matrix) output. In computing the triple Cartesian product, the bimodal and trimodal terms are computed from the unimodal ones.
To improve the generality and flexibility of the network, if the numbers of detector pixels or of neurons in the feature extraction networks differ, so that the output vectors have different sizes, a constant C can be appended to each modal vector to pad its length; C can be, for example, 0 or 1.
The coordinates (z_I, z_P, z_f) of each neuron can be viewed as a point in the triple Cartesian space defined by the unimodal output vectors of intensity, polarization, and frequency. This definition is mathematically equivalent to a differentiable outer product between the intensity embedding vector z_I, the polarization embedding vector z_P, and the frequency embedding vector z_f:
z_m = [z_I; 1] ⊗ [z_P; 1] ⊗ [z_f; 1],
where ⊗ denotes the outer product between vectors, and z_I, z_P, and z_f are the unimodal output vectors from the modal feature extraction network. Specifically, the three vectors z_I, z_P, z_f ∈ R^128 represent the unimodal terms; the three outer products z_I ⊗ z_P, z_I ⊗ z_f, and z_P ⊗ z_f represent the acquired bimodal interactions; and the single term z_I ⊗ z_P ⊗ z_f yields the trimodal interaction. Finally, as shown in FIG. 7, stitching together the three-dimensional cubes of the seven different semantic sub-regions gives z_m ∈ R^{129×129×129}.
It should be noted that, although the modality fusion computes a Cartesian product and has no learnable parameters, its chance of overfitting is low, because the output neurons of tensor fusion are easy to interpret and semantically very meaningful. Subsequent layers of the network can therefore easily decode meaningful information.
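A sketch of this fusion step for a batch of embeddings, appending the constant 1 to each 128-dimensional vector and taking the triple outer product, which yields the 129×129×129 tensor containing the seven semantic sub-regions:

```python
import torch

def tensor_fusion(z_I: torch.Tensor, z_P: torch.Tensor, z_f: torch.Tensor):
    """z_m = [z_I; 1] (outer) [z_P; 1] (outer) [z_f; 1], batched.

    Inputs: (B, 128) embeddings. Output: (B, 129, 129, 129). The appended 1s
    make the unimodal and bimodal terms appear as sub-regions inside z_m;
    the fusion itself has no learnable parameters.
    """
    one = z_I.new_ones(z_I.shape[0], 1)
    zi, zp, zf = (torch.cat([z, one], dim=1) for z in (z_I, z_P, z_f))
    return torch.einsum('bi,bj,bk->bijk', zi, zp, zf)

z_m = tensor_fusion(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
print(z_m.shape)  # torch.Size([4, 129, 129, 129])
```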
In a specific implementation, the input of the decision network is the fused modality and the output is the result of the completed classification task or regression task. The decision network consists of a flatten layer, two ReLU layers, and an output layer; different output layers and loss functions are set according to the task. After the modal feature fusion network, the feature data of each target can be expressed as a multimodal tensor z_m.
Specifically, z_m is fed into the flatten layer to obtain a one-dimensional vector, which is then fed into the ReLU layers. Each ReLU layer consists of a linear mapping operation followed by the ReLU nonlinear activation. The output layer of the network, a softmax layer or a sigmoid layer, completes the classification or regression task, respectively. It should be noted that softmax is the normalized exponential function, while sigmoid, used as a neural-network activation function, maps a variable to between 0 and 1. When the output layer is a softmax layer, the loss function of the decision network can be the categorical cross-entropy loss for image classification; when the output layer is a sigmoid layer, the loss function can be a mean-error loss to complete the regression task.
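A corresponding sketch of the decision network (flatten, two ReLU blocks, then a task-dependent output layer); the hidden width of 128 is an assumed value, and note that flattening the full 129³ tensor makes the first linear layer very large, so a practical variant might first pool or project z_m:

```python
import torch.nn as nn

class DecisionNetwork(nn.Module):
    """Decision network over the fused tensor z_m.

    task='classify': softmax head, paired with categorical cross-entropy.
    task='regress':  sigmoid head, paired with a mean-error loss.
    hidden=128 is an assumed width for illustration.
    """

    def __init__(self, num_classes: int = 10, hidden: int = 128,
                 task: str = 'classify', fused_dim: int = 129 ** 3):
        super().__init__()
        self.flatten = nn.Flatten()          # (B,129,129,129) -> (B, 129^3)
        self.relu1 = nn.Sequential(nn.Linear(fused_dim, hidden), nn.ReLU())
        self.relu2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        out_dim = num_classes if task == 'classify' else 1
        head = nn.Softmax(dim=-1) if task == 'classify' else nn.Sigmoid()
        self.out = nn.Sequential(nn.Linear(hidden, out_dim), head)

    def forward(self, z_m):
        return self.out(self.relu2(self.relu1(self.flatten(z_m))))
```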
It should be noted that the present application uses a simple and compact optical system that can not only extract the three kinds of optical modality information simultaneously and build an optical multimodal dataset, but also solve the alignment problem between different modalities. The multimodal data fusion network based on the attention mechanism and the Cartesian product then extracts the distinct features of each modality and learns the intrinsic relationships between modalities, which can greatly improve the discrimination accuracy and robustness of the network. In addition, the inference network used supports different output layers and can flexibly implement classification, regression, and other tasks, opening up multiple possibilities for subsequent applications.
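Putting the pieces together, an end-to-end sketch of the fusion model as described, reusing the ModalFeatureExtractor, tensor_fusion, and DecisionNetwork sketches above (module names are illustrative):

```python
import torch
import torch.nn as nn

class MultimodalFusionModel(nn.Module):
    """Extraction -> Cartesian-product fusion -> decision, per the text above."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # One sub-network per modality; inputs and weights are not shared.
        self.extract_I = ModalFeatureExtractor()
        self.extract_P = ModalFeatureExtractor()
        self.extract_f = ModalFeatureExtractor()
        self.decide = DecisionNetwork(num_classes=num_classes)

    def forward(self, x_I, x_P, x_f):
        z_m = tensor_fusion(self.extract_I(x_I),
                            self.extract_P(x_P),
                            self.extract_f(x_f))
        return self.decide(z_m)

# Example: a batch of 2 aligned intensity/polarization/frequency matrices.
model = MultimodalFusionModel(num_classes=5)
probs = model(torch.randn(2, 64, 64), torch.randn(2, 64, 64), torch.randn(2, 64, 64))
print(probs.shape)  # torch.Size([2, 5])
```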
Based on the same inventive concept, an embodiment of the present application further provides a multimodal data processing apparatus. Since the principle by which the apparatus solves the problem is similar to that of the foregoing multimodal data processing method, reference may be made to the implementation of the method for the implementation of the apparatus, and repeated details are omitted.
In a specific implementation, the multimodal data processing apparatus provided in the embodiments of the present application, as shown in FIG. 8, specifically includes:
a multimodal information acquisition module 11, configured to acquire different optical modality information of a target object and further configured to acquire different optical modality information of an object under test;
a dataset building module 12, configured to build a multimodal dataset from the different optical modality information of the target object;
a model construction module 13, configured to construct a multimodal fusion network model, the multimodal fusion network model including a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features;
a model training module 14, configured to train the multimodal fusion network model with the multimodal dataset;
a model inference module 15, configured to feed the different optical modality information of the object under test into the trained multimodal fusion network model and output a classification result or a regression result.
In the above multimodal data processing apparatus, the interaction of the five modules captures the rich features of each optical modality of the object and the intrinsic relationships between different optical modalities, realizes multimodal information fusion, and uses the enriched target features to complete classification or regression tasks efficiently, thereby promoting the development of multimodal artificial intelligence information extraction and fusion and enhancing competitiveness in application fields that combine optical information with multimodal artificial intelligence.
In a specific implementation, the multimodal information acquisition module 11 may specifically be configured to acquire information of at least two of the three different modalities of the target object: intensity, polarization, and frequency.
In a specific implementation, to keep the structure simple and compact, the multimodal information acquisition module may include a beam-splitting system (e.g. a beam-splitting prism), an optical micro-polarizer system, and a Fourier 4f system;
the beam-splitting system is configured to split the light reflected from the target object into a first beam and a second beam, the first beam being transmitted to the optical micro-polarizer system and the second beam to the Fourier 4f system;
the optical micro-polarizer system is configured to acquire intensity information and polarization information;
the Fourier 4f system is configured to acquire frequency information.
In a specific implementation, as shown in FIG. 2, the optical micro-polarizer system may include a first convex lens 2, a micro-polarizer 3, and a first detector 4, where:
the first convex lens 2 is configured to converge the first beam onto the micro-polarizer 3;
the micro-polarizer 3 is configured to collect intensity information a and polarization information b simultaneously; as shown in FIG. 3, each pixel unit of the micro-polarizer 3 includes four subunits in a 2×2 arrangement, specifically two anti-reflection subunits 31 for collecting intensity information a and two linear polarization subunits 32 for collecting polarization information b; the two linear polarization subunits 32 are distributed diagonally, and the two anti-reflection subunits 31 are distributed diagonally;
the first detector 4 is configured to convert the intensity information a and polarization information b collected by the micro-polarizer 3 into two-dimensional matrix data.
In a specific implementation, as shown in FIG. 2, the Fourier 4f system may include a second convex lens 5, a third convex lens 6, a diffraction screen 7, and a second detector 8, where:
the second convex lens 5 is located between the target object and the beam-splitting system 1 and is configured to converge the light reflected from the target object into parallel light and transmit it to the beam-splitting system 1; this guarantees that the images of the different modalities share the same source: the luminous flux from the target object is first converged by the second convex lens 5 into outgoing parallel light, and the beam-splitting prism 1 then splits the incoming flux in two, generating transmitted light and reflected light that are passed to the optical micro-polarizer system and the Fourier 4f system, respectively;
the diffraction screen 7 is located between the beam-splitting system and the third convex lens 6 and is configured to diffract the second beam to obtain diffracted light;
the third convex lens 6 is configured to converge the diffracted light onto the second detector 8;
the second detector 8 is configured to collect the frequency-spectrum signal.
For a more specific working process of each of the above components, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
Correspondingly, an embodiment of the present application further discloses a multimodal data processing device. The multimodal data processing device may be a computer device, the computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores the multimodal fusion network model. The network interface of the computer device communicates with external terminals via a network connection. The computer-readable instructions, when executed by the processor, implement a multimodal data processing method.
In one embodiment, the multimodal data processing device disclosed in the embodiments of the present application may be a computer device, the computer device may be a terminal, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, a display screen, and an input apparatus connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for running them. The network interface communicates with external terminals via a network connection. The computer-readable instructions, when executed by the processor, implement a multimodal data processing method. The display screen of the computer device may be a liquid-crystal display or an electronic-ink display; the input apparatus may be a touch layer covering the display screen, a key, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Further, the present application also discloses a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by one or more processors, can implement the steps of the multimodal data processing method of any one of the foregoing embodiments. The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. As the apparatus, device, and storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, their description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different approaches to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In summary, the multimodal data processing method provided in the embodiments of the present application includes: acquiring different optical modality information of a target object and building a multimodal dataset; constructing a multimodal fusion network model that includes a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features; training the multimodal fusion network model with the multimodal dataset; and acquiring different optical modality information of an object under test, feeding it into the trained multimodal fusion network model, and outputting a classification result or a regression result. The method mainly comprises two parts, acquiring different optical modality information of an object and neural-network-based multimodal information fusion, so that the rich features of each optical modality and the intrinsic relationships between different optical modalities can be captured, multimodal information fusion can be realized, and the enriched target features can complete classification or regression tasks efficiently, promoting the development of multimodal artificial intelligence information extraction and fusion and enhancing competitiveness in application fields that combine optical information with multimodal artificial intelligence. In addition, the present application provides a corresponding apparatus, device, and computer-readable storage medium for the multimodal data processing method, further making the above method more practical, and the apparatus, device, and computer-readable storage medium have corresponding advantages.
It should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed may include the flows of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (16)

  1. A multimodal data processing method, comprising:
    acquiring different optical modality information of a target object and building a multimodal dataset;
    constructing a multimodal fusion network model, the multimodal fusion network model comprising a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features;
    training the multimodal fusion network model with the multimodal dataset; and
    acquiring different optical modality information of an object under test, feeding it into the trained multimodal fusion network model, and outputting a classification result or a regression result.
  2. The multimodal data processing method according to claim 1, wherein acquiring different optical modality information of the target object comprises:
    acquiring information of at least two of the three different modalities of the target object: intensity, polarization, and frequency.
  3. The multimodal data processing method according to claim 2, wherein acquiring the information of the three different modalities of intensity, polarization, and frequency of the target object comprises:
    splitting the light reflected from the target object into a first beam and a second beam by a beam-splitting system, the first beam being transmitted to an optical micro-polarizer system and the second beam to a Fourier 4f system;
    acquiring intensity information and polarization information through the optical micro-polarizer system; and
    simultaneously acquiring frequency information through the Fourier 4f system.
  4. The multimodal data processing method according to claim 3, wherein the optical micro-polarizer system comprises a first convex lens, a micro-polarizer, and a first detector, wherein:
    the first convex lens is configured to converge the first beam onto the micro-polarizer;
    the micro-polarizer is configured to collect intensity information and polarization information simultaneously; and
    the first detector is configured to convert the intensity information and polarization information collected by the micro-polarizer into two-dimensional matrix data.
  5. The multimodal data processing method according to claim 4, wherein each pixel unit of the micro-polarizer comprises two anti-reflection subunits for collecting intensity information and two linear polarization subunits for collecting polarization information;
    the two linear polarization subunits are distributed diagonally; the two anti-reflection subunits are distributed diagonally.
  6. The multimodal data processing method according to claim 3, wherein the Fourier 4f system comprises a second convex lens, a third convex lens, a diffraction screen, and a second detector, wherein:
    the second convex lens is located between the target object and the beam-splitting system and is configured to converge the light reflected from the target object into parallel light and transmit it to the beam-splitting system;
    the diffraction screen is located between the beam-splitting system and the third convex lens and is configured to diffract the second beam to obtain diffracted light;
    the third convex lens is configured to converge the diffracted light onto the second detector; and
    the second detector is configured to collect the frequency-spectrum signal.
  7. The multimodal data processing method according to claim 1, wherein the modal feature extraction network comprises a plurality of modal feature extraction sub-networks, the modal feature extraction sub-networks corresponding to the modalities one to one.
  8. The multimodal data processing method according to claim 7, wherein the input of each modal feature extraction sub-network is the multimodal dataset in the form of two-dimensional matrices and the output is a modal embedding vector;
    the input of the modal feature fusion network is the modal embedding vectors and the output is the fused modality obtained by computing a triple Cartesian product; and
    the input of the decision network is the fused modality and the output is the result of the completed classification task or regression task.
  9. A multimodal data processing apparatus, comprising:
    a multimodal information acquisition module, configured to acquire different optical modality information of a target object and further configured to acquire different optical modality information of an object under test;
    a dataset building module, configured to build a multimodal dataset from the different optical modality information of the target object;
    a model construction module, configured to construct a multimodal fusion network model, the multimodal fusion network model comprising a modal feature extraction network for extracting the features of each modality, a modal feature fusion network for merging the modal features, and a decision network for performing a classification task or a regression task on the merged target features;
    a model training module, configured to train the multimodal fusion network model with the multimodal dataset; and
    a model inference module, configured to feed the different optical modality information of the object under test into the trained multimodal fusion network model and output a classification result or a regression result.
  10. The multimodal data processing apparatus according to claim 9, wherein the multimodal information acquisition module is specifically configured to acquire information of at least two of the three different modalities of the target object: intensity, polarization, and frequency.
  11. The multimodal data processing apparatus according to claim 10, wherein the multimodal information acquisition module comprises a beam-splitting system, an optical micro-polarizer system, and a Fourier 4f system;
    the beam-splitting system is configured to split the light reflected from the target object into a first beam and a second beam, the first beam being transmitted to the optical micro-polarizer system and the second beam to the Fourier 4f system;
    the optical micro-polarizer system is configured to acquire intensity information and polarization information; and
    the Fourier 4f system is configured to acquire frequency information.
  12. The multimodal data processing apparatus according to claim 11, wherein the optical micro-polarizer system comprises a first convex lens, a micro-polarizer, and a first detector, wherein:
    the first convex lens is configured to converge the first beam onto the micro-polarizer;
    the micro-polarizer is configured to collect intensity information and polarization information simultaneously; and
    the first detector is configured to convert the intensity information and polarization information collected by the micro-polarizer into two-dimensional matrix data.
  13. The multimodal data processing apparatus according to claim 12, wherein each pixel unit of the micro-polarizer comprises two anti-reflection subunits for collecting intensity information and two linear polarization subunits for collecting polarization information;
    the two linear polarization subunits are distributed diagonally; the two anti-reflection subunits are distributed diagonally.
  14. The multimodal data processing apparatus according to claim 11, wherein the Fourier 4f system comprises a second convex lens, a third convex lens, a diffraction screen, and a second detector, wherein:
    the second convex lens is located between the target object and the beam-splitting system and is configured to converge the light reflected from the target object into parallel light and transmit it to the beam-splitting system;
    the diffraction screen is located between the beam-splitting system and the third convex lens and is configured to diffract the second beam to obtain diffracted light;
    the third convex lens is configured to converge the diffracted light onto the second detector; and
    the second detector is configured to collect the frequency-spectrum signal.
  15. A multimodal data processing device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-8.
  16. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-8.
PCT/CN2022/095363 2021-11-19 2022-05-26 Multimodal data processing method, apparatus, device, and storage medium WO2023087659A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111400866.8 2021-11-19
CN202111400866.8A CN114330488A (zh) 2021-11-19 Multimodal data processing method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023087659A1 (zh)

Family

ID=81046073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095363 WO2023087659A1 (zh) 2021-11-19 2022-05-26 Multimodal data processing method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114330488A (zh)
WO (1) WO2023087659A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330488A (zh) 2021-11-19 2022-04-12 Inspur (Beijing) Electronic Information Industry Co., Ltd. Multimodal data processing method, apparatus, device, and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657965A (zh) * 2015-03-12 2015-05-27 Changchun University of Science and Technology Polarization image fusion method based on discrete continuous curvelets
CN105139367A (zh) * 2015-07-27 2015-12-09 Institute of Optics and Electronics, Chinese Academy of Sciences Visible-light polarization image fusion method based on non-subsampled shearlets
US20200295519A1 (en) * 2017-09-30 2020-09-17 Femtosecond Research Center Co., Ltd. Femtosecond laser multimodality molecular imaging system
US20200191706A1 (en) * 2018-12-13 2020-06-18 Imec Vzw Multimodal Imaging System
CN111462128A (zh) * 2020-05-28 2020-07-28 Nanjing University Pixel-level image segmentation system and method based on multimodal spectral images
CN111738314A (zh) * 2020-06-09 2020-10-02 Nantong University Deep learning method for a multimodal image visibility detection model based on shallow fusion
CN112129702A (zh) * 2020-09-16 2020-12-25 Femtosecond Laser Research Center (Guangzhou) Co., Ltd. Multimodal signal acquisition apparatus and method, and laser imaging system
CN113040722A (zh) * 2021-04-30 2021-06-29 University of Electronic Science and Technology of China Method for increasing the imaging depth of frequency-domain coherence tomography
CN114330488A (zh) 2021-11-19 2022-04-12 Inspur (Beijing) Electronic Information Industry Co., Ltd. Multimodal data processing method, apparatus, device, and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561542A (zh) * 2023-07-04 2023-08-08 Beijing Lingxin Intelligent Technology Co., Ltd. Model optimization training system and method, and related apparatus
CN116561542B (zh) * 2023-07-04 2023-10-20 Beijing Lingxin Intelligent Technology Co., Ltd. Model optimization training system and method, and related apparatus
CN117226608A (zh) * 2023-09-19 2023-12-15 Zhongshan Guangda Optical Instrument Co., Ltd. Polishing control method and system for beam-splitting prism coating
CN117226608B (zh) * 2023-09-19 2024-04-02 Zhongshan Guangda Optical Instrument Co., Ltd. Polishing control method and system for beam-splitting prism coating

Also Published As

Publication number Publication date
CN114330488A (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2023087659A1 (zh) Multimodal data processing method, apparatus, device, and storage medium
Driss et al. A comparison study between MLP and convolutional neural network models for character recognition
Wang et al. Image sensing with multilayer nonlinear optical neural networks
Möckl et al. Deep learning in single-molecule microscopy: fundamentals, caveats, and recent developments
Yang et al. Multi-view CNN feature aggregation with ELM auto-encoder for 3D shape recognition
Zeng et al. RedCap: residual encoder-decoder capsule network for holographic image reconstruction
Cheng et al. Rotdcf: Decomposition of convolutional filters for rotation-equivariant deep networks
Liu et al. Change detection in multitemporal synthetic aperture radar images using dual-channel convolutional neural network
Yang et al. A novel vision transformer model for skin cancer classification
Kulikajevas et al. 3D object reconstruction from imperfect depth data using extended YOLOv3 network
JP7188856B2 (ja) Dynamic image resolution evaluation
Melzer et al. Exploring characteristics of neural network architecture computation for enabling SAR ATR
Wang et al. Semantic segmentation of large-scale point clouds based on dilated nearest neighbors graph
Ataky et al. Multiscale analysis for improving texture classification
Leonov et al. Analysis of the convolutional neural network architectures in image classification problems
Dad et al. Quaternion Harmonic moments and extreme learning machine for color object recognition
Goyal et al. Morphological classification of galaxies using Conv-nets
Ashraf et al. Attention 3D central difference convolutional dense network for hyperspectral image classification
Shang et al. Approximating the uncertainty of deep learning reconstruction predictions in single-pixel imaging
Fu et al. Unleashing the potential: AI empowered advanced metasurface research
JP2023529843A (ja) 画像およびデータの分析モデルの適合性を調整する方法
Ji et al. Reducing weight precision of convolutional neural networks towards large-scale on-chip image recognition
DE102021106254A1 (de) Neural control variate networks
Liu et al. Fabric defect detection based on visual saliency using deep feature and low-rank recovery
Huang et al. Full-scaled deep metric learning for pedestrian re-identification