CN111652349A - Neural network processing method and related equipment - Google Patents

Neural network processing method and related equipment

Info

Publication number
CN111652349A
Authority
CN
China
Prior art keywords
tensor
predicted
tensors
trained
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010321526.5A
Other languages
Chinese (zh)
Inventor
段艳杰
刘裕良
田光见
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010321526.5A
Publication of CN111652349A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the application provide a neural network processing method and related equipment in the field of artificial intelligence. In the method, a server performs a tensor convolution operation on an initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained; updates the training data according to the target multi-modal tensor to be trained to obtain updated training data; and then inputs the updated training data into a preset training network to train a prediction model. Because the server convolves the initial multi-modal tensor to be trained with the R orthogonal tensors to obtain the converted target multi-modal tensor to be trained, the tensor dimensionality of the target multi-modal tensor to be trained is lower than that of the initial tensor, the number of parameters input into the prediction model is greatly reduced, the complexity of the neural network prediction model constructed by the server is reduced, and the processing efficiency of the neural network is improved.

Description

Neural network processing method and related equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a processing method for a neural network and related devices.
Background
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.
In the AI field, a modality refers to a specific way in which a person receives information; each source or form of information may be referred to as a modality. Information in the real world generally appears in different modalities, and presenting information in multiple modalities can express more content than presenting it in a single modality.
In the prior art, information is generally transmitted between different devices in tensor form, including multi-modal tensors and single-modal tensors, and multi-modal information can be expressed by means of tensor fusion. In a commonly used tensor fusion method, such as the Tensor Fusion Network (TFN), a multi-modal fusion tensor is obtained by computing the outer product of a plurality of single-modal tensors; the multi-modal fusion tensor is then input into a prediction model as a feature to obtain the final prediction result.
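As a rough illustration of the outer-product fusion used by TFN-style methods (the modality names and feature dimensions below are illustrative assumptions, not values from the patent), the fusion step can be sketched in NumPy:

```python
import numpy as np

# Hypothetical feature vectors for three modalities; dimensions are illustrative.
acoustic = np.random.rand(8)   # acoustic-modality features
language = np.random.rand(16)  # language-modality features
visual = np.random.rand(4)     # visual-modality features

# TFN-style fusion: the multi-modal fusion tensor is the outer product of the
# single-modal tensors, one element per combination of features.
fused = np.einsum('a,l,v->alv', acoustic, language, visual)

# The parameter count multiplies with every added modality (8*16*4 = 512 here),
# which is the growth in parameters that the application sets out to avoid.
print(fused.shape)  # (8, 16, 4)
```

Adding a fourth modality of dimension d would grow the fused tensor to 512·d elements, which illustrates the exponential blow-up described in the next paragraph.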
However, in the TFN, the number of parameters in the multi-modal fusion tensor tends to grow exponentially as the number of single-modal tensors increases, so that too many parameters are input into the prediction model, the complexity of the prediction model becomes too high, and the processing efficiency of the neural network suffers.
Disclosure of Invention
The embodiment of the application provides a processing method of a neural network and related equipment, which are used for reducing the complexity of a neural network prediction model constructed by a server and improving the processing efficiency of the neural network.
In the method, a server acquires training data when constructing a prediction model of the neural network, wherein the training data comprises a plurality of single-modal tensors to be trained. Then, the server performs tensor fusion according to the plurality of single-modal tensors to be trained in the training data to obtain an initial multi-modal tensor to be trained, and performs a tensor convolution operation on the initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained, wherein R is an integer greater than 1, and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors. The server then updates the training data according to the target multi-modal tensor to be trained to obtain updated training data, and inputs the updated training data into a preset training network to train a prediction model. In other words, the server convolves the initial multi-modal tensor to be trained with the R orthogonal tensors to obtain a converted target multi-modal tensor to be trained.
In a possible implementation manner of the first aspect, the server performing the tensor convolution operation on the initial multi-modal tensor to be trained according to the R orthogonal tensors to obtain the target multi-modal tensor to be trained includes: performing the tensor convolution operation on the initial multi-modal tensor to be trained according to the R orthogonal tensors and a preset sliding step to obtain the target multi-modal tensor to be trained, wherein the order of the initial multi-modal tensor to be trained and the order of any one of the R orthogonal tensors are both M, M is a positive integer, and the number of elements contained in the preset sliding step is M.
In this embodiment, by requiring that the order of the initial multi-modal tensor to be trained, the order of any one of the R orthogonal tensors, and the number of elements in the preset sliding step are all equal, a failure of the convolution process caused by mismatched dimensions of the R orthogonal tensors and the preset sliding step is avoided when the initial multi-modal tensor is convolved using the R orthogonal tensors and the preset sliding step.
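A minimal sketch of such an M-order tensor convolution with a per-order sliding step follows; the shapes, stride values, and the number R of kernels are my assumptions, not values fixed by the patent:

```python
import numpy as np

def tensor_conv(x, kernel, stride):
    """Valid convolution of an M-order tensor x with an M-order kernel,
    sliding by stride[m] along the mth order."""
    out_shape = tuple((xd - kd) // s + 1
                      for xd, kd, s in zip(x.shape, kernel.shape, stride))
    out = np.empty(out_shape)
    for idx in np.ndindex(*out_shape):
        start = tuple(i * s for i, s in zip(idx, stride))
        window = x[tuple(slice(b, b + kd)
                         for b, kd in zip(start, kernel.shape))]
        out[idx] = np.sum(window * kernel)  # elementwise product, then sum
    return out

# A 3-order (M = 3) fused tensor convolved with R = 2 kernels; the order of x,
# the order of each kernel, and len(stride) are all M = 3, as the text requires.
x = np.random.rand(8, 16, 4)
stride = (2, 2, 1)
kernels = [np.random.rand(4, 4, 2) for _ in range(2)]
features = [tensor_conv(x, k, stride) for k in kernels]
print(features[0].shape)  # (3, 7, 3): smaller than (8, 16, 4) along every order
```

In the patent's scheme the R kernels would additionally be trained to be mutually orthogonal; this sketch only shows the sliding-window mechanics.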
In a possible implementation manner of the first aspect, the dimension of the mth order of the initial multi-modal tensor to be trained is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where x, y, and z are integers greater than 0, m is an integer not greater than M, x is greater than y, and y is greater than or equal to z.
In this embodiment, it is further defined that the dimension of the mth order of the initial multi-modal tensor to be trained is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where m is an integer not greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z. Thus, for any one of the M orders (the mth order), R orthogonal tensors whose dimensions are smaller than those of the initial multi-modal tensor to be trained, together with a small preset sliding step, participate in the convolution; as a result, the target multi-modal tensor to be trained obtained subsequently further reduces the tensor dimensionality, and further reduces the parameters subsequently input into the prediction model.
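Concretely, with a valid convolution the output size along the mth order is ⌊(x − y)/z⌋ + 1, which under the constraints x > y and y ≥ z never exceeds x, so each order shrinks (or, only when y = z = 1, stays the same). The numbers below are illustrative assumptions:

```python
# Output size along one order of a valid tensor convolution:
# x = input dimension, y = kernel dimension, z = sliding step, with x > y >= z.
def out_dim(x: int, y: int, z: int) -> int:
    return (x - y) // z + 1

# Illustrative values: a 16-wide order, a 4-wide kernel, sliding step 2.
print(out_dim(16, 4, 2))  # 7, versus 16 in the input tensor
```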
In a possible implementation manner of the first aspect, after inputting a preset training network with the updated training data and training to obtain a prediction model, the method further includes: the method comprises the steps that a server obtains data to be predicted, wherein the data to be predicted comprises a plurality of monomodal tensors to be predicted; then, carrying out tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted, and carrying out tensor convolution operation on the initial multimodal tensor to be predicted according to the R orthogonal tensors to obtain a target multimodal tensor to be predicted; the server updates the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted; and further inputting the updated data to be predicted into the prediction model, and processing to obtain a prediction result.
In this embodiment, after the prediction model is obtained, the prediction model may be used to predict data to be predicted, where the server performs tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain a target multi-modal tensor to be predicted, and compared with the initial multi-modal tensor to be predicted, the target multi-modal tensor to be predicted reduces tensor dimensions, reduces parameters input into the prediction model, reduces complexity of a prediction process performed by the server using the prediction model, further improves processing efficiency of a neural network, and provides an abstract multi-modal fusion feature for a subsequent prediction process, thereby improving a prediction effect of the prediction model.
In a possible implementation manner of the first aspect, after inputting the updated training data into the preset training network and training to obtain the prediction model, the method may further include: the server trains the prediction model using an orthogonal constraint loss function and updates the prediction model.
In this embodiment, in the process of training the prediction model, a loss function may be generally used for training to reduce the loss and error of the convolutional neural network, and specifically, the loss function includes an orthogonal constraint loss function, so that the orthogonal properties of the R orthogonal tensors may be trained, and the model training effect is improved.
In one possible implementation of the first aspect, the orthogonal constraint loss function comprises:

L = L_regression + λ · L_O

where L is the loss function of the model as a whole; L_regression denotes the regression error, i.e.

L_regression = Σ_n |ŷ_n − y_n|,

the accumulated sum of the absolute errors between the predicted values ŷ_n and the true values y_n; L_O, namely

L_O = Σ_{i≠j} |⟨t_i, t_j⟩| / (‖t_i‖_F · ‖t_j‖_F),

is the objective function after the orthogonal constraint transformation, i.e. the accumulated sum of the absolute values of the cosine distances between every two of the R orthogonal tensors; λ is the coefficient controlling the weight of L_O; t_i and t_j are orthogonal tensors in the orthogonal tensor network module; |⟨t_i, t_j⟩| / (‖t_i‖_F · ‖t_j‖_F) represents the absolute value of the cosine distance between two different orthogonal tensors; in computing the cosine distance, ⟨t_i, t_j⟩ denotes the inner product of the two tensors, and ‖t_i‖_F and ‖t_j‖_F denote the Frobenius norms of t_i and t_j, respectively.
In this embodiment, one of the formulas for implementing the orthogonal constraint loss function is provided, and the server may implement a training process of the prediction model using the orthogonal constraint loss function, thereby improving the implementability of the scheme.
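A minimal NumPy sketch of this orthogonal constraint loss follows; the function and variable names are mine, the patent does not prescribe an implementation, and the pair summation here counts each unordered pair once:

```python
import numpy as np

def orthogonal_constraint_loss(y_pred, y_true, tensors, lam=0.1):
    """L = L_regression + lam * L_O: the summed absolute regression error plus
    the summed absolute cosine similarity over all pairs of tensors."""
    l_regression = np.sum(np.abs(y_pred - y_true))
    l_o = 0.0
    for i in range(len(tensors)):
        for j in range(i + 1, len(tensors)):
            ti, tj = tensors[i].ravel(), tensors[j].ravel()
            inner = np.dot(ti, tj)                         # <t_i, t_j>
            fro = np.linalg.norm(ti) * np.linalg.norm(tj)  # ||t_i||_F * ||t_j||_F
            l_o += np.abs(inner) / fro
    return l_regression + lam * l_o

# For exactly orthogonal tensors L_O vanishes, leaving only the regression term.
t1 = np.array([[1.0, 0.0], [0.0, 0.0]])
t2 = np.array([[0.0, 1.0], [0.0, 0.0]])
loss = orthogonal_constraint_loss(np.array([1.5]), np.array([1.0]), [t1, t2])
print(loss)  # 0.5
```

Minimizing the L_O term during training pushes the R tensors toward mutual orthogonality, which is the training effect the paragraph above describes.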
In one possible implementation manner of the first aspect, the updated training data includes the target multimodal tensor to be trained and the plurality of monomodal tensors to be trained.
In this embodiment, the server may specifically input the preset training network by using a target multi-modal tensor to be trained and a plurality of single-modal tensors to be trained, and process the preset training network to obtain the prediction model, so that local original information of the plurality of single-modal tensors to be trained and fusion information of the target multi-modal tensor to be trained may be retained in the prediction model, and accuracy of prediction by using the prediction model subsequently is improved.
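One possible layout of the updated training data is to flatten and concatenate the fused target tensor with the original single-modal tensors; the flatten-and-concatenate scheme and all shapes below are my assumptions, since the text only says both kinds of tensors are retained:

```python
import numpy as np

# Hypothetical single-modal tensors and the target multi-modal tensor
# obtained after the orthogonal-tensor convolution.
acoustic = np.random.rand(8)
language = np.random.rand(16)
visual = np.random.rand(4)
target = np.random.rand(3, 7, 3)  # target multi-modal tensor to be trained

# Keep both the local single-modal information and the fused information:
updated = np.concatenate([target.ravel(), acoustic, language, visual])
print(updated.shape)  # (91,) = 3*7*3 + 8 + 16 + 4
```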
In one possible implementation manner of the first aspect, the plurality of monomodal tensors to be trained includes at least two of an acoustic modality tensor, a language modality tensor, and a visual modality tensor.
In this embodiment, the processing method of the neural network may be specifically applied to an application scenario of multi-modal emotion analysis, at this time, a plurality of to-be-trained single-modal tensors include at least two of an acoustic modal tensor, a language modal tensor, and a visual modal tensor, and the trained prediction model includes an emotion analysis prediction model, which is applied to the scenario, so that parameters subsequently input to the emotion analysis prediction model are greatly reduced, complexity of an emotion analysis model constructed by a server is reduced, and processing efficiency of the neural network is improved.
In a possible implementation manner of the first aspect, tensor fusion is performed on a plurality of monomodal tensors to be trained, and obtaining an initial multimodal tensor to be trained may include: the server calculates the outer product of the plurality of monomodal tensors to be trained, and then the server further performs tensor fusion according to the outer product of the plurality of monomodal tensors to be trained to obtain the initial multimodal tensor to be trained.
In this embodiment, the operation process of the server in performing tensor fusion on the plurality of monomodal tensors to be trained may specifically include that the server calculates outer products of the plurality of monomodal tensors to be trained, and then the server further performs tensor fusion according to the outer products of the plurality of monomodal tensors to be trained to obtain an initial multimodal tensor to be trained, so that a specific implementation manner of tensor fusion implementation is provided, and the realizability of the scheme is improved.
In a possible implementation manner of the first aspect, inputting the updated training data into the preset training network and training to obtain the prediction model may include: the server performs pooling processing on the updated training data to obtain pooled training data, and then inputs the pooled training data into the preset training network to obtain the prediction model.
In this embodiment, after the server obtains the target multi-modal tensor, the server may specifically perform further dimension reduction processing on the updated training data through a pooling processing operation, further reduce parameters in the prediction model, reduce the complexity of the neural network prediction model constructed by the server, and improve the processing efficiency of the neural network.
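The pooling step can be sketched as non-overlapping max pooling over the fused tensor; the window sizes and the choice of max pooling are assumptions of mine, as the patent does not specify the pooling type:

```python
import numpy as np

def max_pool(x, window):
    """Non-overlapping max pooling of an M-order tensor; each dimension of x
    must be divisible by the corresponding window size."""
    shape = []
    for d, w in zip(x.shape, window):
        shape += [d // w, w]  # split each order into (blocks, window)
    # Take the max inside every window axis (the odd-numbered axes).
    return x.reshape(shape).max(axis=tuple(range(1, 2 * x.ndim, 2)))

x = np.arange(24, dtype=float).reshape(4, 6)
pooled = max_pool(x, (2, 3))
print(pooled)        # [[ 8. 11.]
                     #  [20. 23.]]
print(pooled.shape)  # (2, 2): further dimension reduction before the model
```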
A second aspect of the embodiments of the present application provides a neural network processing method, which may be applied to the prediction process of a neural network prediction model. In the method, a server obtains data to be predicted, where the data to be predicted includes a plurality of single-modal tensors to be predicted. Then, the server performs tensor fusion according to the plurality of single-modal tensors to be predicted to obtain an initial multi-modal tensor to be predicted, and performs a tensor convolution operation on the initial multi-modal tensor to be predicted according to R orthogonal tensors to obtain a target multi-modal tensor to be predicted, wherein R is an integer greater than 1, and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors. The server then updates the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted, and inputs the updated data to be predicted into a prediction model to obtain a prediction result. Compared with the initial multi-modal tensor to be predicted, the target multi-modal tensor to be predicted has a reduced tensor dimensionality, which reduces the parameters input into the prediction model, reduces the complexity of the server's prediction process using the prediction model, improves the processing efficiency of the neural network, and provides abstract multi-modal fusion features for the subsequent prediction process, thereby improving the prediction effect of the prediction model.
In a possible implementation manner of the second aspect, performing the tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain the target multi-modal tensor to be predicted includes: performing the tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors and a preset sliding step to obtain the target multi-modal tensor to be predicted, wherein the order of the initial multi-modal tensor to be predicted and the order of any one of the R orthogonal tensors are both M, M is a positive integer, and the number of elements contained in the preset sliding step is M.
In this embodiment, by requiring that the order of the initial multi-modal tensor to be predicted, the order of any one of the R orthogonal tensors, and the number of elements in the preset sliding step are all equal, a failure of the convolution process caused by mismatched dimensions of the R orthogonal tensors and the preset sliding step is avoided when the initial multi-modal tensor to be predicted is convolved using the R orthogonal tensors and the preset sliding step.
In a possible implementation manner of the second aspect, the dimension of the mth order of the initial multi-modal tensor to be predicted is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where m is an integer not greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
In this embodiment, it is further defined that the dimension of the mth order of the initial multi-modal tensor to be predicted is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where m is an integer not greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z. Thus, for any one of the M orders (the mth order), R orthogonal tensors whose dimensions are smaller than those of the initial multi-modal tensor to be predicted, together with a small preset sliding step, participate in the convolution; as a result, the target multi-modal tensor to be predicted obtained subsequently further reduces the tensor dimensionality, and further reduces the parameters subsequently input into the prediction model.
In one possible implementation manner of the second aspect, the plurality of monomodal tensors to be predicted includes at least two of an acoustic modality tensor, a language modality tensor, and a visual modality tensor.
In this embodiment, the processing method of the neural network may be specifically applied to an application scenario of multi-modal emotion analysis, where a plurality of to-be-predicted single-modal tensors include at least two of a sound modal tensor, a language modal tensor and a visual modal tensor, and the prediction model includes an emotion analysis prediction model, and is applied to the scenario, so that parameters subsequently input to the emotion analysis prediction model are greatly reduced, complexity of an emotion analysis model constructed by a server is reduced, and processing efficiency of the neural network is improved.
In a possible implementation manner of the second aspect, tensor fusion is performed on a plurality of monomodal tensors to be predicted, and obtaining an initial multimodal tensor to be predicted may include: the server calculates the outer product of the plurality of monomodal tensors to be predicted, and then the server further performs tensor fusion according to the outer product of the plurality of monomodal tensors to be predicted to obtain the initial multimodal tensor to be predicted.
In this embodiment, the operation process of the server in performing tensor fusion on the plurality of monomodal tensors to be predicted may specifically include that the server calculates an outer product of the plurality of monomodal tensors to be predicted, and then the server further performs tensor fusion according to the outer product of the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted, so as to provide a specific implementation manner for implementing tensor fusion, and improve the implementability of the scheme.
In a possible implementation manner of the second aspect, inputting the updated data to be predicted into the prediction model and processing to obtain the prediction result may include: the server performs pooling processing on the updated data to be predicted to obtain pooled data to be predicted, and then inputs the pooled data to be predicted into the prediction model to obtain the prediction result.
In this embodiment, after the server obtains the updated data to be predicted, the server may specifically perform further dimension reduction processing on the updated data to be predicted through pooling processing operation, further reduce parameters input into the prediction model, reduce the complexity of the prediction process performed by the server using the prediction model, and improve the processing efficiency of the neural network.
A third aspect of the embodiments of the present application provides a neural network processing apparatus, including: an acquisition unit, configured to acquire training data, where the training data includes a plurality of single-modal tensors to be trained; a fusion unit, configured to perform tensor fusion according to the plurality of single-modal tensors to be trained to obtain an initial multi-modal tensor to be trained; a convolution unit, configured to perform a tensor convolution operation on the initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained, where R is an integer greater than 1 and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors; an updating unit, configured to update the training data according to the target multi-modal tensor to be trained to obtain updated training data; and a training unit, configured to input the updated training data into a preset training network and train to obtain a prediction model. In other words, the convolution unit convolves the initial multi-modal tensor to be trained with the R orthogonal tensors to obtain a converted target multi-modal tensor to be trained.
In the third aspect of the present application, the constituent modules of the processing apparatus of the neural network may also be configured to execute the steps executed in each possible implementation manner of the first aspect, which may specifically refer to the first aspect, and are not described here again.
A fourth aspect of the embodiments of the present application provides a neural network processing apparatus, including: an acquisition unit, configured to acquire data to be predicted, where the data to be predicted includes a plurality of single-modal tensors to be predicted; a fusion unit, configured to perform tensor fusion according to the plurality of single-modal tensors to be predicted to obtain an initial multi-modal tensor to be predicted; a convolution unit, configured to perform a tensor convolution operation on the initial multi-modal tensor to be predicted according to R orthogonal tensors to obtain a target multi-modal tensor to be predicted, where R is an integer greater than 1 and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors; an updating unit, configured to update the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted; and a processing unit, configured to input the updated data to be predicted into a preset prediction model and process it to obtain a prediction result. Compared with the initial multi-modal tensor to be predicted, the target multi-modal tensor to be predicted has a reduced tensor dimensionality, which reduces the parameters input into the prediction model, reduces the complexity of the server's prediction process using the prediction model, improves the processing efficiency of the neural network, and provides abstract multi-modal fusion features for the subsequent prediction process, thereby improving the prediction effect of the prediction model.
In the fourth aspect of the present application, the constituent modules of the processing apparatus of the neural network may also be configured to execute the steps executed in each possible implementation manner of the second aspect, which may specifically refer to the second aspect, and are not described herein again.
In a fifth aspect, an embodiment of the present application provides a server, which includes a processor coupled to a memory, where the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the neural network processing method according to the first aspect or the second aspect is performed.
In a sixth aspect, the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the neural network processing method according to the first aspect or the second aspect.
In a seventh aspect, an embodiment of the present application provides a circuit system, where the circuit system includes a processing circuit configured to execute the neural network processing method according to the first aspect or the second aspect.
In an eighth aspect, the present application provides a computer program which, when run on a computer, causes the computer to execute the neural network processing method according to the first aspect or the second aspect.
In a ninth aspect, the present application provides a chip system comprising a processor for enabling a server to implement the functions referred to in the first or second aspect, e.g. to send or process data and/or information referred to in the method. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the server or the communication device. The chip system may be formed by a chip, or may include a chip and other discrete devices.
For the technical effects brought by any of the fifth to ninth aspects or any of their possible implementation manners, reference may be made to the technical effects brought by the first aspect or the second aspect and their different possible implementation manners; details are not described here again.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework provided by an embodiment of the present application;
fig. 2-1 is a schematic diagram of a network structure of a neural network processing system according to an embodiment of the present application;
fig. 2-2 is a schematic diagram of another network structure of a neural network processing system according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of a processing method of a neural network according to an embodiment of the present disclosure;
fig. 6 is another schematic flow chart of a processing method of a neural network according to an embodiment of the present disclosure;
fig. 7 is another schematic flow chart of a processing method of a neural network according to an embodiment of the present disclosure;
fig. 8 is another schematic flow chart of a processing method of a neural network according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a processing device of a neural network according to an embodiment of the present disclosure;
fig. 10 is another schematic diagram of a processing apparatus of a neural network according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiment of the application provides a processing method of a neural network and related equipment, which are used for reducing the complexity of a neural network prediction model constructed by a server and improving the processing efficiency of the neural network.
Some terms used in the embodiments of the present application will be exemplarily described below:
Modality: a modality refers to a particular manner in which a person receives information; the source or form of each type of information may be referred to as a modality.
Multimodal machine learning: aims to achieve the ability to process and understand multi-source modal information through machine learning methods.
Multimodal fusion: refers to the process of synthesizing information from two or more modalities to make a prediction.
Tensor: a tensor can be seen as a multidimensional array; a 0th-order tensor is a scalar, a 1st-order tensor is a vector, a 2nd-order tensor is a matrix, and tensors of 3rd order and above are usually referred to as N-order tensors.
Tensor fusion: multimodal fusion is performed using tensor calculation.
Orthogonal tensor: two or more tensors are said to be orthogonal if the cosine distance between them is 0.
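The terms above can be illustrated with a short NumPy sketch; the shapes and values below are illustrative only and are not taken from the application:

```python
import numpy as np

# Tensor orders: 0th order is a scalar, 1st a vector, 2nd a matrix, 3rd+ an N-order tensor.
scalar = np.float64(3.0)       # order 0
vector = np.ones(4)            # order 1
matrix = np.ones((4, 5))       # order 2
tensor3 = np.ones((4, 5, 6))   # order 3

# Orthogonality: two tensors are orthogonal when the cosine distance between them is 0,
# i.e. their inner product (after flattening) is 0.
t1 = np.array([[1.0, 0.0], [0.0, 1.0]])
t2 = np.array([[0.0, 1.0], [-1.0, 0.0]])
cosine = np.vdot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2))
print(scalar.ndim, vector.ndim, matrix.ndim, tensor3.ndim)  # 0 1 2 3
print(cosine)  # 0.0
```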
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field.
The artificial intelligence body framework described above is set forth below in terms of two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects a series of processes starting from the acquisition of data, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology) up to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a base platform. Communication with the outside world is achieved through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the base platform includes related platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside world to acquire data, and the data is provided to intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure represents the data sources of the artificial intelligence field. The data relates to graphs, images, voice, and text, as well as Internet-of-Things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating the intelligent inference mode of humans in a computer or intelligent system, using formalized information to think about and solve problems by machine according to an inference control strategy; typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is reasoned about, and generally provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the above-mentioned data processing, some general capabilities may further be formed based on the results of the data processing, such as algorithms or a general system, e.g., translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, commercialize intelligent information decisions, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminals, and the like.
Based on the embodiment shown in fig. 1, the artificial intelligence framework of fig. 1 can be applied to the multi-modal tensor fusion prediction problem. A modality is a specific way in which a person receives information, and each source or form of information can be referred to as a modality. Information in the real world usually appears in different modalities, and in presenting information, a multi-modal presentation can express more content than a single-modal one. Multi-modal data is widely available in the real world: for example, an image together with its tags and text explanation forms multi-modal data; an article (text) together with the images it embeds to express its main idea more clearly forms multi-modal data; and a complete movie is multi-modal, containing audio, video, and subtitles, each of which is a single modality.
The process by which the processing system of the neural network implements multi-modal fusion prediction is shown in fig. 2-1: first, step 201 is executed to obtain multi-modal data; in step 202, a multi-modal fusion prediction model is generated from the multi-modal data; and in step 203, the input data to be predicted is predicted using the multi-modal fusion prediction model obtained in step 202 to obtain a prediction result. However, if an existing, commonly used tensor fusion method is applied in step 202 to generate the multi-modal fusion prediction model, a series of problems arise, as follows:
1) When the Tensor Fusion Network (TFN) is used in step 202: TFN obtains a multi-modal fusion tensor by calculating the outer product of a plurality of single-modal vectors, and then inputs the multi-modal fusion tensor into a prediction model as a feature to obtain the final prediction result. As the number of modalities increases, the dimensionality of the multi-modal fusion tensor grows exponentially, and the excessive number of parameters in a prediction model based on such tensor features makes the model overly complex.
2) When the low-rank multi-modal fusion model (LMF) is used in step 202: LMF is also a tensor-fusion multi-modal prediction model. To address TFN's problems of high dimensionality and many parameters, LMF learns a low-rank tensor factor from each single-modal representation, then fuses the single-modal low-rank tensor factors and performs the final multi-modal fusion prediction. Although LMF reduces the parameters of multi-modal fusion prediction to a certain extent, the number of parameters is still strongly related to the dimensionality of each modality, and there is still room to reduce the model parameters.
3) When the Hierarchical Polynomial Fusion Network (HPFN) is used in step 202: HPFN reduces the dimensionality of the high-dimensional multi-modal fusion tensor by stacking multiple polynomial tensor pooling (PTP) aggregation modules, thereby reducing the model parameters. However, although HPFN applies convolution operations to multi-modal tensor dimensionality reduction, no constraint is imposed among the multiple convolution kernels, the convolution kernels are not extended to 3rd order, and the prediction effect of the model remains to be further improved.
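The dimensionality problem described in 1) above can be illustrated with a minimal NumPy sketch; the unimodal feature dimension d = 32 is a hypothetical value chosen for illustration, not a value from the application:

```python
import numpy as np

# Hypothetical unimodal feature dimension d = 32 for every modality.
d = 32
for n_modalities in range(1, 5):
    # A TFN-style fused tensor is the outer product of n unimodal vectors,
    # so it has d**n elements, and a linear predictor over the flattened
    # tensor needs d**n weights per output unit: exponential growth in n.
    print(n_modalities, d ** n_modalities)
```

Running this prints 32, 1024, 32768, and 1048576 for one to four modalities, which is the parameter blow-up that motivates the orthogonal tensor network of the present application.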
In summary, for the multi-modal fusion prediction problem, the prior art has the following shortcomings: the excessive number of parameters in a prediction model based on TFN tensor features causes overly high model complexity; although LMF reduces the parameters of multi-modal fusion prediction to a certain extent, the number of parameters is still strongly related to the dimensionality of each modality; and the parameter quantity of HPFN is comparable to that of LMF, so there is still room to reduce the model parameters. The present application mainly solves the problems of high dimensionality and many parameters in tensor fusion methods, and improves the prediction effect. Please refer to fig. 2-2, which is another schematic diagram of the processing system of the neural network of the present application implementing multi-modal fusion prediction: in step 204, initial multi-modal data is obtained and preprocessed to obtain preprocessed multi-modal data; in step 205, a multi-modal data prediction model is built from the preprocessed multi-modal data; and in step 206, the input data to be predicted is predicted using the multi-modal data prediction model to obtain a prediction result in step 207. The multi-modal data prediction model created in step 205 is a neural network model including an orthogonal tensor network module. As a preferred embodiment, the features of the training network are first divided into two channels: a single-modal channel and a multi-modal fusion channel. The single-modal channel outputs a plurality of single-modal vectors; the multi-modal fusion channel takes the outer-product tensor of the single-modal vectors as input and obtains a dimension-reduced multi-modal fusion tensor through the orthogonal tensor network module and further pooling. The input features of the training network are finally output through a feedforward network.
The embodiment of the present application may implement a processing method of a neural network based on the processing system of fig. 2-2. Since the core steps of the method involve the processing of a convolutional neural network, the convolutional neural network is first described below. A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure and a deep learning architecture, where a deep learning architecture refers to learning at multiple levels through a machine learning algorithm. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in the image input to it.
As shown in fig. 3, Convolutional Neural Network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (the pooling layer is optional), and a neural network layer 130. The input layer 110 is used to input data such as a multimodal fusion tensor.
Convolutional layer/pooling layer 120:
Convolutional layer:
The convolutional layer/pooling layer 120 shown in fig. 3 may include, for example, convolutional layer 121 and pooling layer 122; in one implementation, the output of the convolutional layer may be used as the input to a subsequent pooling layer. The convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is that of a filter extracting specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually slid over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a plurality of weight matrices of the same dimensions are applied rather than a single one. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image.
Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another to extract a specific color of the image, and yet another to blur unwanted noise in the image. The plurality of weight matrices have the same dimensions, so the feature maps extracted by them also have the same dimensions; the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer is often introduced after a convolutional layer; that is, in layer 120 illustrated in fig. 3, one convolutional layer may be followed by one pooling layer. During image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image down to a smaller-sized image. The average pooling operator computes the average of the pixel values in the image over a particular range; the max pooling operator takes the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
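As a rough illustration of the pooling operators described above, here is a minimal NumPy sketch of non-overlapping max and average pooling; the helper name pool2d and the 4×4 input are illustrative assumptions, not part of the application:

```python
import numpy as np

def pool2d(img, k, mode="max"):
    """Non-overlapping k x k pooling over a 2-D array (sizes assumed divisible by k)."""
    h, w = img.shape
    # Group the image into (h//k) x (w//k) blocks of shape k x k.
    blocks = img.reshape(h // k, k, w // k, k)
    # Reduce each block to its max (max pooling) or mean (average pooling).
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

img = np.array([[1., 2., 5., 6.],
                [3., 4., 7., 8.],
                [0., 0., 1., 1.],
                [0., 4., 1., 1.]])
print(pool2d(img, 2, "max"))   # [[4. 8.] [4. 1.]]
print(pool2d(img, 2, "mean"))  # [[2.5 6.5] [1.  1. ]]
```

Each output pixel summarizes one 2×2 sub-region of the input, which is exactly the spatial-size reduction described above.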
The neural network layer 130:
After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. To generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 uses the neural network layer 130 to generate one output or a set of outputs equal in number to the required number of classes. Thus, the neural network layer 130 may include a plurality of hidden layers (131, 132, through 13n shown in fig. 3) and an output layer 140, and the parameters contained in the hidden layers may be pre-trained according to relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the neural network layer 130, the last layer of the whole convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (propagation from 110 to 140 in fig. 3), back propagation (propagation from 140 to 110 in fig. 3) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the network through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
Referring to fig. 5, a processing method of a neural network in an embodiment of the present application will be described in detail below, where an embodiment of the processing method of the neural network in the embodiment of the present application includes:
501. acquiring training data;
In this embodiment, the method may be applied to the process of constructing a prediction model of a neural network. When the server constructs the prediction model of the neural network, the server performs step 501 to obtain training data, where the training data includes a plurality of monomodal tensors to be trained.
During the execution of step 501, the server may obtain the training data including the plurality of monomodal tensors through an acquisition device, through communication with another device, or through another method, which is not limited here.
A modality refers to a specific way in which a person receives information, and each source or form of information can be called a modality. A tensor can be regarded as a multi-dimensional array: a 0th-order tensor is a scalar, a 1st-order tensor is a vector, a 2nd-order tensor is a matrix, and tensors of 3rd order and above are generally called N-order tensors. That is, any one of the plurality of monomodal tensors includes data of one specific modality, the data of that modality is expressed by a tensor, and the order of the tensor is not limited here.
502. Carrying out tensor fusion according to the plurality of monomodal tensors to be trained to obtain an initial multimodal tensor to be trained;
in this embodiment, the server performs tensor fusion according to the plurality of monomodal tensors to be trained obtained in step 501, so as to obtain an initial multimodal tensor to be trained.
Specifically, the operation of performing tensor fusion on the plurality of monomodal tensors to be trained may include: the server calculates the outer product of the plurality of monomodal tensors to be trained, and then performs tensor fusion according to that outer product to obtain the initial multimodal tensor to be trained. In addition, the aforementioned TFN, LMF, or another similar tensor fusion process may also be used in the process of performing tensor fusion on the plurality of monomodal tensors to be trained, which is not limited here.
The description is given by taking the case where the plurality of monomodal tensors to be trained are three monomodal tensors a_1, a_2, and a_3, whose corresponding dimensions are a, b, and c respectively, with a, b, and c positive integers. The initial multi-modal tensor a_m obtained in step 502 then has a dimension equal to the product of the dimensions of the three monomodal tensors, i.e., a × b × c.
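A minimal NumPy sketch of the outer-product fusion of step 502; the concrete dimensions a = 3, b = 4, c = 5 are illustrative values standing in for the symbolic a, b, c above:

```python
import numpy as np

# Illustrative dimensions for the three monomodal tensors (hypothetical values).
a, b, c = 3, 4, 5
a1, a2, a3 = np.random.randn(a), np.random.randn(b), np.random.randn(c)

# Tensor fusion by outer product: the initial multimodal tensor a_m
# has dimension a x b x c, the product of the monomodal dimensions.
a_m = np.einsum('i,j,k->ijk', a1, a2, a3)
print(a_m.shape)  # (3, 4, 5)
```

Each element a_m[i, j, k] equals a1[i] · a2[j] · a3[k], so the fused tensor's size is the product of the three monomodal dimensions.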
503. Carrying out tensor convolution operation on the initial multi-modal tensor to be trained according to the R orthogonal tensors to obtain a target multi-modal tensor to be trained;
in this embodiment, the server performs tensor convolution operation on the initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained, where R is an integer greater than 1, and any one of the R orthogonal tensors is orthogonal to the other R-1 tensors.
Specifically, if the cosine distance between two or more tensors is 0, the two or more tensors are said to be orthogonal, i.e., orthogonal tensors. In step 503, R is an integer greater than 1, and any one of the R orthogonal tensors is orthogonal to the other R-1 tensors; that is, the cosine distance between any two of the R orthogonal tensors is 0. Take the initial multi-modal tensor a_i and R orthogonal tensors t_1, t_2, ..., t_R as an example to describe the convolution process implemented in step 503. A specific implementation is shown in fig. 6: the initial multi-modal tensor a_i 601 passes through the convolution operation 602 with the R orthogonal tensors (t_1, t_2, ..., t_R) to obtain the target multimodal tensor 603, where the target multimodal tensor b_i can be represented by a_i × t_1, a_i × t_2, ..., a_i × t_R. The orthogonal tensor network operations on monomodal tensors of different orders in the convolution operation implemented in step 503 can be represented by Table 1:
order of the scale Monomodal tensor R orthogonal tensors Operation of Results
1 Vector quantity Vector quantity Inner product Scalar quantity
2 Matrix array Two-dimensional convolution kernel Convolution with a bit line Matrix array
3 Tensor Three-dimensional convolution kernel Convolution with a bit line Tensor
>3 Tensor High dimensional convolution kernel Convolution with a bit line Tensor
TABLE 1
As a preferred implementation, the convolution operation 602 may further use a preset sliding step (stride). In the implementation of step 503, the tensor convolution operation may specifically be performed on the initial multi-modal tensor a_i according to the R orthogonal tensors and a preset sliding step (stride_1, ..., stride_M) to obtain the target multi-modal tensor b_i, where the order of the initial multi-modal tensor and the order of any one of the R orthogonal tensors are both M, M is a positive integer, and the preset sliding step contains M elements. By requiring that the order of the initial multi-modal tensor, the order of any one of the R orthogonal tensors, and the number of elements in the preset sliding step all be equal, faults during convolution caused by misalignment between the R orthogonal tensors and the preset sliding step are avoided.
Further, the dimension of the initial multi-modal tensor in the mth order is x, the dimension of any one of the R orthogonal tensors in the mth order is y, and the value of the mth element of the preset sliding step is z, where x, y, and z are integers greater than or equal to 0, m is an integer not greater than M, x is greater than y, and y is greater than or equal to z. Thus, for any order m of the M orders, R orthogonal tensors whose dimensions are smaller than those of the initial multi-modal tensor, together with the small preset sliding step, participate in the convolution, so the tensor dimension of the subsequently obtained target multi-modal tensor is further reduced, and the parameters subsequently input into the prediction model are further reduced.
Illustratively, the order of the initial multi-modal tensor, the order of any one of the R orthogonal tensors, and the number of elements contained in the preset sliding step may all be represented by M, where M is a positive integer. Further, it may be required that, for any order m of the M orders (m not greater than M), the dimension of the initial multi-modal tensor in the mth order is greater than the dimension of any one of the R orthogonal tensors in the mth order, which in turn is greater than or equal to the value of the mth element of the preset sliding step. Writing x_m, y_m, and z_m for these three quantities in the mth order, whereas a_i has dimension

∏_{m=1}^{M} x_m,

the dimension of b_i is reduced to

∏_{m=1}^{M} ( ⌊(x_m − y_m)/z_m⌋ + 1 ).
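A minimal NumPy sketch of the strided "valid" convolution described for step 503; the helper conv_valid and all concrete sizes are illustrative assumptions rather than the application's implementation, and each per-order output size follows ⌊(x − y)/z⌋ + 1:

```python
import numpy as np

def conv_valid(a, t, stride):
    """'Valid' convolution of an M-order tensor a with an M-order kernel t,
    using a per-order sliding step (stride)."""
    out_shape = tuple((x - y) // z + 1 for x, y, z in zip(a.shape, t.shape, stride))
    out = np.empty(out_shape)
    for idx in np.ndindex(*out_shape):
        start = tuple(i * z for i, z in zip(idx, stride))
        window = a[tuple(slice(s, s + y) for s, y in zip(start, t.shape))]
        out[idx] = np.sum(window * t)  # elementwise product + sum = one conv step
    return out

# Hypothetical sizes: an order-3 initial multimodal tensor a_i and R = 4 kernels.
a_i = np.random.randn(8, 10, 6)
kernels = [np.random.randn(3, 4, 2) for _ in range(4)]
stride = (1, 2, 2)

# Stack the R per-kernel outputs to form the target multimodal tensor b_i.
b_i = np.stack([conv_valid(a_i, t, stride) for t in kernels])
# Per order m: floor((x_m - y_m) / z_m) + 1 gives (6, 4, 3), stacked over R = 4.
print(b_i.shape)  # (4, 6, 4, 3)
```

The 8·10·6 = 480 entries of a_i are thus reduced to 6·4·3 = 72 per kernel, illustrating the dimension reduction achieved by the strided convolution.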
504. Updating the training data according to the multi-modal tensor of the target to be trained to obtain updated training data;
in this embodiment, the server updates the training data according to the target multi-modal tensor to be trained obtained in step 503, so as to obtain updated training data.
In the implementation of step 504, the update process can be implemented in a variety of ways:
1) The plurality of monomodal tensors to be trained in the training data obtained in step 501 may be directly replaced with the target multimodal tensor to be trained obtained in step 503 to obtain the updated training data. In this way, a prediction model subsequently trained with the updated training data can include the fusion information of the target multi-modal tensor to be trained, improving the accuracy of subsequent prediction using the prediction model.
2) Alternatively, the target multimodal tensor to be trained obtained in step 503 may be added alongside the plurality of monomodal tensors to be trained in the training data obtained in step 501, so that the updated training data includes both the target multimodal tensor to be trained and the plurality of monomodal tensors to be trained. In this way, a prediction model subsequently trained with the updated training data simultaneously retains the local original information of the plurality of monomodal tensors to be trained and the fusion information of the target multimodal tensor to be trained, further improving the accuracy of subsequent prediction using the prediction model.
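The two update strategies can be sketched as follows; all shapes are hypothetical, since the application does not specify the training pipeline at this level of detail:

```python
import numpy as np

# Hypothetical shapes: three monomodal tensors and one fused target multimodal tensor.
monomodal = [np.random.randn(3), np.random.randn(4), np.random.randn(5)]
target_multimodal = np.random.randn(4, 6, 4)

# Strategy 1): replace the monomodal tensors with the target multimodal tensor.
updated_1 = [target_multimodal]

# Strategy 2): keep the monomodal tensors and add the target multimodal tensor,
# so both local original information and fusion information are retained.
updated_2 = monomodal + [target_multimodal]

print(len(updated_1), len(updated_2))  # 1 4
```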
505. Inputting the updated training data into a preset training network, and training to obtain a prediction model.
In this embodiment, the server inputs the updated training data in step 504 into a preset training network, and trains to obtain a prediction model.
As a preferred implementation, the embodiment shown in fig. 5 may be applied to the application scenario of multi-modal emotion analysis. In this case, the plurality of single-modal tensors included in the training data acquired in step 501 may include at least two of an acoustic modal tensor, a language modal tensor, and a visual modal tensor, and the prediction model trained in step 505 includes an emotion analysis prediction model. Applied to this scenario, the parameters subsequently input to the emotion analysis prediction model in step 505 are greatly reduced, which reduces the complexity of the emotion analysis model constructed by the server and improves the processing efficiency of the neural network.
In an optional implementation of step 505, in the process of inputting the updated training data into the preset training network and training to obtain the prediction model, the server may further perform pooling on the updated training data to obtain pooled training data, then input the pooled training data into the preset training network, and finally obtain the prediction model. This further reduces the parameters in the prediction model, reduces the complexity of the neural network prediction model constructed by the server, and improves the processing efficiency of the neural network.
In an alternative implementation, after obtaining the prediction model in step 505, the server may train the prediction model using an orthogonal constraint loss function to obtain a trained prediction model.
Specifically, in the process of training the prediction model, a loss function can be generally used for training to reduce the loss and the error of the convolutional neural network, and specifically, the loss function includes an orthogonal constraint loss function, so that the orthogonal properties of R orthogonal tensors can be trained, and the model training effect is improved.
As a preferred embodiment, the orthogonal constraint loss function comprises:

L = L_regression + λ·L_O

where L is the loss function of the model as a whole, and L_regression indicates the regression error, namely

L_regression = Σ_n |ŷ_n − y_n|,

the accumulated sum of absolute errors between the predicted values ŷ_n and the true values y_n; L_O, namely

L_O = Σ_{i=1}^{R} Σ_{j=i+1}^{R} |⟨t_i, t_j⟩| / (‖t_i‖_F · ‖t_j‖_F),

is the objective function after the orthogonal constraint transformation; λ is the coefficient controlling the weight of L_O; and t_i, t_j are orthogonal tensors in the orthogonal tensor network module. Each term |⟨t_i, t_j⟩| / (‖t_i‖_F · ‖t_j‖_F) is the absolute value of the cosine distance between two different orthogonal tensors, so L_O accumulates these absolute values over every pair of the R orthogonal tensors. In the calculation of the cosine distance, ⟨t_i, t_j⟩ represents the inner product between the two tensors, and ‖t_i‖_F and ‖t_j‖_F represent the F norms (Frobenius norms) of the tensors t_i and t_j, respectively. This provides one formula for implementing the orthogonal constraint loss function; the server can use this loss function to implement the training process of the prediction model, which improves the realizability of the scheme.
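One way to realize this orthogonal constraint loss is sketched below (a minimal numpy illustration; the function name and the value of λ are illustrative, and plain lists stand in for batches of predictions):

```python
import numpy as np

def orthogonal_constraint_loss(y_pred, y_true, tensors, lam=0.1):
    """L = L_regression + lam * L_O, following the formula in the text.
    `tensors` is the list of R orthogonal tensors t_1..t_R;
    lam is the weighting coefficient lambda (0.1 is an illustrative value)."""
    # L_regression: accumulated sum of absolute errors
    l_reg = np.abs(np.asarray(y_pred) - np.asarray(y_true)).sum()
    # L_O: sum over all pairs of |<t_i, t_j>| / (||t_i||_F * ||t_j||_F)
    l_o = 0.0
    for i in range(len(tensors)):
        for j in range(i + 1, len(tensors)):
            ti, tj = tensors[i], tensors[j]
            inner = (ti * tj).sum()  # inner product <t_i, t_j>
            l_o += abs(inner) / (np.linalg.norm(ti) * np.linalg.norm(tj))
    return l_reg + lam * l_o
```

When the tensors are exactly orthogonal, every pairwise term vanishes and the loss reduces to the regression error alone, which is precisely the property the constraint drives the training toward.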
In this embodiment, the server obtains training data when a prediction model of a neural network is constructed, where the training data includes a plurality of monomodal tensors to be trained. Then, the server performs tensor fusion according to the plurality of monomodal tensors to be trained in the training data to obtain an initial multimodal tensor to be trained, and performs a tensor convolution operation on the initial multimodal tensor to be trained according to R orthogonal tensors to obtain a target multimodal tensor to be trained, where R is an integer greater than 1 and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors. The server then updates the training data according to the target multimodal tensor to be trained to obtain updated training data, and inputs the updated training data into a preset training network to train and obtain the prediction model. Because convolution with the R orthogonal tensors converts the initial multimodal tensor to be trained into a target multimodal tensor of reduced dimensionality, the parameters input into the training network are reduced, which lowers the complexity of the neural network prediction model constructed by the server and improves the processing efficiency of the neural network.
In the embodiment of the present application, the embodiment described in fig. 5 specifically introduces how the parameters input into the prediction model can be greatly reduced through the orthogonal tensors, so as to reduce the complexity of the neural network prediction model constructed by the server; the following describes, by way of a specific example, how the optimized prediction model can be used to optimize the prediction process.
Referring to fig. 7, another embodiment of a processing method of a neural network provided in the embodiment of the present application includes:
701. acquiring training data;
in this embodiment, the method may be applied to the process of constructing a prediction model of a neural network; when the server constructs the prediction model of the neural network, the server performs step 701 to obtain training data, where the training data includes a plurality of monomodal tensors to be trained.
702. Carrying out tensor fusion according to the plurality of monomodal tensors to be trained to obtain an initial multimodal tensor to be trained;
in this embodiment, the server performs tensor fusion according to the plurality of monomodal tensors to be trained obtained in step 701 to obtain an initial multimodal tensor to be trained.
703. Carrying out tensor convolution operation on the initial multi-modal tensor to be trained according to the R orthogonal tensors to obtain a target multi-modal tensor to be trained;
in this embodiment, the server performs tensor convolution operation on the initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained, where R is an integer greater than 1, and any one of the R orthogonal tensors is orthogonal to the other R-1 tensors.
704. Updating the training data according to the multi-modal tensor of the target to be trained to obtain updated training data;
in this embodiment, the server updates the training data according to the target multi-modal tensor to be trained obtained in step 703, so as to obtain updated training data.
705. And inputting the updated training data into a preset training network, and training to obtain a prediction model.
In this embodiment, the server inputs the updated training data in step 704 into a preset training network, and trains to obtain a prediction model.
In this embodiment, the implementation process of steps 701 to 705 and the corresponding beneficial effects are similar to the implementation process of steps 501 to 505 in fig. 5, and are not described herein again.
706. Acquiring data to be predicted;
in this embodiment, the method may be applied to a prediction process using a neural network prediction model, and in the prediction process using the neural network prediction model, the server performs step 706 to obtain data to be predicted, where the data to be predicted includes a plurality of monomodal tensors to be predicted.
In the execution process of step 706, the server may obtain the data to be predicted, which includes a plurality of monomodal tensors, through an acquisition device, through communication with another device, or through another method, which is not limited herein.
A modality refers to a specific way in which a person receives information; each source or form of information can be called a modality. A tensor can be regarded as a multi-dimensional array: an order-0 tensor is a scalar, an order-1 tensor is a vector, an order-2 tensor is a matrix, and tensors of order 3 and above are generally called N-order tensors. That is, any one of the plurality of monomodal tensors includes data of a specific single modality, the data of that modality is expressed by a tensor, and the order of the tensor is not limited herein.
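The tensor orders described above can be illustrated directly with numpy arrays, where the order corresponds to `ndim`:

```python
import numpy as np

scalar = np.array(3.0)           # order 0: a scalar
vector = np.zeros(5)             # order 1: a vector
matrix = np.zeros((5, 4))        # order 2: a matrix
tensor3 = np.zeros((5, 4, 3))    # order 3: an "N-order" tensor
print(scalar.ndim, vector.ndim, matrix.ndim, tensor3.ndim)  # 0 1 2 3
```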
In a preferred implementation manner, the method may be applied to an application scenario of multi-modal emotion analysis, in this case, the plurality of single-modal tensors included in the data to be predicted acquired in step 706 may specifically include at least two of an acoustic modal tensor, a language modal tensor, and a visual modal tensor.
707. Carrying out tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted;
in this embodiment, the server performs tensor fusion according to the data to be predicted obtained in step 706 to obtain an initial multi-modal tensor to be predicted.
Specifically, the operation process of the server in tensor fusion of the plurality of monomodal tensors to be predicted may specifically include that the server calculates an outer product of the plurality of monomodal tensors to be predicted, and then the server further performs tensor fusion according to the outer product of the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted. In addition, in the process of performing tensor fusion according to a plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted, the foregoing TFN, LMF, or other similar tensor fusion processes may also be used, which is not limited herein.
The following description takes the plurality of monomodal tensors to be predicted as three monomodal tensors a_1, a_2 and a_3 as an example, where the dimensions corresponding to a_1, a_2 and a_3 are a, b and c respectively, and a, b and c are positive integers. The dimension of the initial multi-modal tensor to be predicted a_m obtained in step 707 is then the product of the dimensions of the three monomodal tensors, i.e., a × b × c.
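The outer-product fusion of three single-modal tensors with dimensions a, b and c can be sketched as follows (the modality assignments and sizes are illustrative, not taken from the embodiment):

```python
import numpy as np

# Hypothetical per-modality feature vectors
a1 = np.random.rand(4)   # e.g. acoustic,  dimension a = 4
a2 = np.random.rand(3)   # e.g. language,  dimension b = 3
a3 = np.random.rand(2)   # e.g. visual,    dimension c = 2

# Outer-product fusion: a_m[i, j, k] = a1[i] * a2[j] * a3[k]
a_m = np.einsum('i,j,k->ijk', a1, a2, a3)
print(a_m.shape)   # (4, 3, 2): the product a x b x c
```

Every element of the fused tensor combines one feature from each modality, which is why its dimensionality is the product a × b × c and why the subsequent orthogonal tensor convolution is used to shrink it.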
708. Carrying out tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain a target multi-modal tensor to be predicted;
in this embodiment, the server performs tensor convolution operation on the initial multi-modal tensor to be predicted according to R orthogonal tensors to obtain a target multi-modal tensor to be predicted, where R is an integer greater than 1, and any one of the R orthogonal tensors is orthogonal to the other R-1 tensors.
Specifically, performing the tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain the target multi-modal tensor to be predicted includes: performing the tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors and a preset sliding step, where the order of the initial multi-modal tensor to be predicted and the order of any one of the R orthogonal tensors are both M, M is a positive integer, and the number of elements contained in the preset sliding step is M. By requiring that the order of the initial multi-modal tensor to be predicted, the order of any one of the R orthogonal tensors, and the number of elements in the preset sliding step are all equal, faults during the convolution processing caused by misalignment between the R orthogonal tensors and the preset sliding step are avoided.
In a possible implementation manner, the dimension of the m-th order of the initial multi-modal tensor to be predicted is x, the dimension of the m-th order of any one of the R orthogonal tensors is y, and the value of the m-th element of the preset sliding step is z, where m is an integer not greater than M, x, y and z are integers greater than 0, x is greater than y, and y is greater than or equal to z. These constraints ensure that each orthogonal tensor fits within the initial multi-modal tensor along every order, and that each sliding step does not exceed the corresponding dimension of the orthogonal tensor, so no element of the initial tensor is skipped.
The R orthogonal tensors used in step 708 may be the same R orthogonal tensors used in step 703. In addition, for the tensor convolution operation on the initial multi-modal tensor to be predicted in step 708, the server may refer to the implementation process of the tensor convolution operation on the initial multi-modal tensor to be trained in step 503, which is not described herein again.
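Under the constraints above (equal order M for the initial tensor, each orthogonal tensor, and the sliding step), the tensor convolution can be sketched as follows (a simplified numpy illustration; the embodiment's exact operation may differ, and the example kernels below are not actually orthogonal):

```python
import numpy as np
from itertools import product

def tensor_convolution(t, kernels, strides):
    """Slide each of the R kernel tensors over `t` with per-order strides
    and sum the elementwise products at every position. `t`, every kernel,
    and `strides` all have the same order M, as the text requires."""
    M = t.ndim
    assert all(k.ndim == M for k in kernels) and len(strides) == M
    out_shape = tuple((t.shape[d] - kernels[0].shape[d]) // strides[d] + 1
                      for d in range(M))
    out = np.zeros((len(kernels),) + out_shape)
    for r, k in enumerate(kernels):
        for pos in product(*(range(s) for s in out_shape)):
            sl = tuple(slice(p * strides[d], p * strides[d] + k.shape[d])
                       for d, p in enumerate(pos))
            out[(r,) + pos] = (t[sl] * k).sum()
    return out

t = np.ones((4, 4, 4))              # order M = 3 initial tensor (x = 4)
kernels = [np.ones((2, 2, 2))] * 3  # R = 3 toy kernels (y = 2)
out = tensor_convolution(t, kernels, strides=(1, 1, 1))   # z = 1
print(out.shape)   # (3, 3, 3, 3): one (3, 3, 3) feature map per kernel
```

Each output feature map is smaller than the input along every order, which is how the convolution with the R orthogonal tensors reduces the dimensionality of the fused tensor.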
709. Updating the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted;
in this embodiment, the server updates the data to be predicted according to the target multi-modal tensor to be predicted obtained in step 708, so as to obtain updated data to be predicted.
In step 709, the process of updating the data to be predicted can be implemented in various ways:
1) Directly replace the plurality of monomodal tensors to be predicted in the data to be predicted obtained in step 706 with the target multimodal tensor to be predicted obtained in step 708, to obtain the updated data to be predicted. In this way, when the updated data is subsequently input into the prediction model, the input parameters of the prediction model include the fusion information of the target multimodal tensor to be predicted, which improves the accuracy of subsequent prediction using the prediction model.
2) On the basis of the plurality of monomodal tensors to be predicted in the data to be predicted obtained in step 706, add the target multimodal tensor to be predicted obtained in step 708 to the data to be predicted, to obtain the updated data to be predicted; that is, the updated data to be predicted includes both the target multimodal tensor to be predicted and the plurality of monomodal tensors to be predicted. In this way, when the updated data is subsequently input into the prediction model, the input parameters of the prediction model retain both the local original information of the plurality of monomodal tensors to be predicted and the fusion information of the target multimodal tensor to be predicted, which further improves the accuracy of subsequent prediction using the prediction model.
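The two update options can be sketched as follows (a minimal illustration; the function name and flag are hypothetical):

```python
import numpy as np

def update_data(monomodal_tensors, target_tensor, keep_monomodal=True):
    """Option 2 (keep_monomodal=True) retains the original single-modal
    tensors alongside the fused target tensor; option 1 replaces them."""
    if keep_monomodal:
        return list(monomodal_tensors) + [target_tensor]  # option 2): append
    return [target_tensor]                                # option 1): replace

mono = [np.zeros(4), np.zeros(3)]
fused = np.zeros((4, 3))
print(len(update_data(mono, fused, keep_monomodal=False)))  # 1
print(len(update_data(mono, fused, keep_monomodal=True)))   # 3
```

Option 2 trades slightly larger model input for the retained local information of each modality, matching the accuracy argument in the text.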
710. And inputting the updated data to be predicted into a prediction model, and processing to obtain a prediction result.
In this embodiment, the server may input the updated data to be predicted obtained in step 709 into the prediction model, and process the data to be predicted to obtain the prediction result. The prediction model used in step 710 may be the prediction model trained in step 705.
In an alternative embodiment, an application scenario applicable to fig. 7 may include: the training process of the training network (steps 701 to 705) and the prediction process of the data to be predicted in the prediction model (steps 706 to 710) are implemented by the same server. In this case, the same server successively trains the prediction model according to the training data and then inputs the data to be predicted into the prediction model to obtain a prediction result.
In an alternative embodiment, an application scenario applicable to fig. 7 may further include: the training process of the training network (steps 701 to 705) and the prediction process of the data to be predicted in the prediction model (steps 706 to 710) are implemented by a plurality of different servers. In this case, any one of the plurality of different servers may train the prediction model according to the training data (illustratively, since the training process places higher requirements on computing capacity, a server with stronger processing capacity may be used). The plurality of different servers then communicate with one another so that the other servers can also obtain the prediction model, and each server can input different parameters to be predicted. In this way, the prediction process for a large number of parameters to be predicted can be realized, improving the processing efficiency of the neural network.
In addition, in order to verify the effectiveness of this embodiment, the embodiment was tested on the Carnegie Mellon University Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) open dataset, which contains three modalities. The test results are shown in Table 2, from which it can be seen that the method provided by the present invention (i.e., the process from step 706 to step 710) is significantly better than the other methods in prediction accuracy. It can thus be seen that processing the initial multi-modal tensor with the orthogonal tensor network in this embodiment helps to learn a better multi-modal tensor fusion representation, thereby achieving a better prediction effect.
Method        | The invention | TFN  | LMF  | HPFN
Accuracy (%)  | 78.1          | 73.9 | 76.4 | 77.5

TABLE 2
In the specific implementation of the embodiment of fig. 7, reference may also be made to the flow shown in fig. 8: in step 706, data to be predicted is obtained (801, 802 and 803 are taken as examples in the figure; obviously, the number of pieces of monomodal data may be different, which is not limited herein), and the plurality of monomodal tensors to be predicted (804) may be obtained by performing feature extraction through sub-networks (subnets); then, the initial multi-modal tensor to be predicted (805) is obtained through the tensor fusion of step 707; then, the tensor convolution operation (806) of step 708 is performed to obtain the target multi-modal tensor to be predicted (807), where the tensor convolution operation (806) is similar to the foregoing implementation process of fig. 6 and is not described herein again; finally, in steps 709 and 710, the obtained target multi-modal tensor to be predicted (807) (optionally together with the plurality of single-modal tensors in 804) is used as the input of the prediction model (808), so that the process of predicting the data to be predicted is implemented.
In this embodiment, the server obtains data to be predicted, where the data to be predicted includes a plurality of monomodal tensors to be predicted. Then, the server performs tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted, and performs a tensor convolution operation on the initial multimodal tensor to be predicted according to R orthogonal tensors to obtain a target multimodal tensor to be predicted, where R is an integer greater than 1 and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors. The server then updates the data to be predicted according to the target multimodal tensor to be predicted to obtain updated data to be predicted, inputs the updated data into the prediction model, and processes it to obtain a prediction result. Compared with the initial multimodal tensor to be predicted, the tensor dimensionality of the target multimodal tensor to be predicted is reduced, so the parameters input into the prediction model are reduced, the complexity of the server's prediction process using the prediction model is lowered, and the processing efficiency of the neural network is improved; moreover, the more abstract multimodal fusion features provide a better basis for the subsequent prediction process, thereby improving the prediction effect of the prediction model.
An embodiment of the present invention further provides a processing apparatus of a neural network, specifically referring to fig. 9, fig. 9 is a schematic structural diagram of the processing apparatus of the neural network provided in the embodiment of the present application, and the processing apparatus 900 of the neural network includes:
an obtaining unit 901, configured to obtain training data, where the training data includes a plurality of monomodal tensors to be trained;
a fusion unit 902, configured to perform tensor fusion according to the plurality of monomodal tensors to be trained, to obtain an initial multimodal tensor to be trained;
a convolution unit 903, configured to perform tensor convolution operation on the initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained, where R is an integer greater than 1, and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors;
an updating unit 904, configured to update the training data according to the multi-modal tensor of the target to be trained, so as to obtain updated training data;
and a training unit 905, configured to input the updated training data into a preset training network, and train to obtain a prediction model.
In one possible design, the convolution unit 903 is specifically configured to:
and carrying out tensor convolution operation on the initial multi-mode tensor to be trained according to the R orthogonal tensors and a preset sliding step length to obtain the target multi-mode tensor to be trained, wherein the order of the initial multi-mode tensor to be trained and the order of any one of the R orthogonal tensors are M orders, M is a positive integer, and the number of elements contained in the preset sliding step length is M.
In one possible design, the dimension of the mth order of the initial multi-modal tensor to be trained is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where M is an integer not greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
In one possible design, the apparatus further includes a processing unit 906;
the obtaining unit 901 is further configured to obtain data to be predicted, where the data to be predicted includes a plurality of monomodal tensors to be predicted;
the fusion unit 902 is further configured to perform tensor fusion according to the plurality of monomodal tensors to be predicted, so as to obtain an initial multimodal tensor to be predicted;
the convolution unit 903 is further configured to perform tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors, so as to obtain a target multi-modal tensor to be predicted;
the updating unit 904 is further configured to update the data to be predicted according to the target multi-modal tensor to be predicted, so as to obtain updated data to be predicted;
the processing unit 906 is configured to input the updated data to be predicted into the prediction model, and process the data to obtain a prediction result.
In one possible design, the updating unit 904 is further configured to train the prediction model by using an orthogonal constraint loss function, so as to update the prediction model.
In one possible design, the orthogonal constraint loss function includes:

L = L_regression + λ·L_O

where L is the loss function of the model as a whole, and L_regression indicates the regression error, namely

L_regression = Σ_n |ŷ_n − y_n|,

the accumulated sum of absolute errors between the predicted values ŷ_n and the true values y_n; L_O, namely

L_O = Σ_{i=1}^{R} Σ_{j=i+1}^{R} |⟨t_i, t_j⟩| / (‖t_i‖_F · ‖t_j‖_F),

is the objective function after the orthogonal constraint transformation; λ is the coefficient controlling the weight of L_O; and t_i, t_j are orthogonal tensors in the orthogonal tensor network module. Each term |⟨t_i, t_j⟩| / (‖t_i‖_F · ‖t_j‖_F) is the absolute value of the cosine distance between two different orthogonal tensors, so L_O accumulates these absolute values over every pair of the R orthogonal tensors. In the calculation of the cosine distance, ⟨t_i, t_j⟩ represents the inner product between the two tensors, and ‖t_i‖_F and ‖t_j‖_F represent the F norms (Frobenius norms) of the tensors t_i and t_j, respectively.
In one possible design, the fusion unit 902 is specifically configured to:
calculating an outer product of a plurality of monomodal tensors to be trained;
and carrying out tensor fusion according to the outer product of the plurality of monomodal tensors to be trained to obtain the initial multimodal tensor to be trained.
In one possible design, the training unit 905 is specifically configured to:
performing pooling treatment on the updated training data to obtain pooled training data;
and inputting the training data after the pooling into the preset training network to obtain the prediction model.
In one possible design, the updated training data includes the target multimodal tensor to be trained and the plurality of monomodal tensors to be trained.
In one possible design, the plurality of monomodal tensors to be trained includes at least two of an acoustic modality tensor, a language modality tensor, and a visual modality tensor.
It should be noted that, the information interaction, the execution process, and the like between the modules/units in the processing apparatus 900 of the neural network are based on the same concept as the method embodiment corresponding to fig. 5 in the present application, and specific contents may refer to the description in the foregoing method embodiment in the present application, and are not described herein again.
An embodiment of the present application further provides another processing apparatus for a neural network, specifically referring to fig. 10, fig. 10 is a schematic structural diagram of the processing apparatus for a neural network provided in the embodiment of the present application, where the processing apparatus 1000 for a neural network includes:
an obtaining unit 1001 configured to obtain data to be predicted, where the data to be predicted includes a plurality of monomodal tensors to be predicted;
the fusion unit 1002 is configured to perform tensor fusion according to the plurality of monomodal tensors to be predicted, so as to obtain an initial multimodal tensor to be predicted;
a convolution unit 1003, configured to perform tensor convolution operation on the initial multi-modal tensor to be predicted according to R orthogonal tensors to obtain a target multi-modal tensor to be predicted, where R is an integer greater than 1, and any tensor in the R orthogonal tensors is orthogonal to the other R-1 tensors;
the updating unit 1004 is configured to update the data to be predicted according to the target multi-modal tensor to be predicted, so as to obtain updated data to be predicted;
the processing unit 1005 is configured to input a preset prediction model by using the updated data to be predicted, and process the data to obtain a prediction result.
In one possible design, the convolution unit 1003 is specifically configured to:
and carrying out tensor convolution operation on the initial multi-mode tensor to be predicted according to the R orthogonal tensors and a preset sliding step length to obtain the target multi-mode tensor to be predicted, wherein the order of the initial multi-mode tensor to be predicted and the order of any one of the R orthogonal tensors are M orders, M is a positive integer, and the number of elements contained in the preset sliding step length is M.
In a possible design, the dimension of the mth order of the initial multi-modal tensor to be predicted is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where M is an integer not greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
In one possible design, the fusion unit 1002 is specifically configured to:
calculating an outer product of a plurality of monomodal tensors to be predicted;
and carrying out tensor fusion according to the outer product of the plurality of monomodal tensors to be predicted to obtain the initial multimodal tensor to be predicted.
In one possible design, the processing unit 1005 is specifically configured to:
pool the updated data to be predicted to obtain pooled data to be predicted;
and input the pooled data to be predicted into the prediction model to obtain the prediction result.
In one possible design, the updated data to be predicted includes the target multimodal tensor and the plurality of monomodal tensors.
In one possible design, the plurality of monomodal tensors to be predicted includes at least two of an acoustic modality tensor, a language modality tensor, and a visual modality tensor.
It should be noted that, the information interaction, the execution process, and the like between the modules/units in the processing apparatus 1000 of the neural network are based on the same concept as the method embodiment corresponding to fig. 7 in the present application, and specific contents may refer to the description in the foregoing method embodiment in the present application, and are not described herein again.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a server provided in the embodiment of the present application. The processing device 900 of a neural network described in the embodiment corresponding to fig. 9 may be deployed on the server 1100 to implement the functions of the server in the embodiment corresponding to fig. 5, or the processing device 1000 of a neural network described in the embodiment corresponding to fig. 10 may be deployed on the server 1100 to implement the functions of the server in the embodiment corresponding to fig. 7. Specifically, the server 1100 is implemented by one or more servers and may vary greatly according to configuration or performance; it may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing applications 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transient storage or persistent storage. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 to execute, on the server 1100, the series of instruction operations in the storage medium 1130. It should be understood that the server shown in fig. 11 is only an example; the server 1100 may not include the memory 1132 and the storage medium 1130 internally, and an external memory may instead be configured outside the server 1100, that is, the memory 1132, the storage medium 1130 and the central processing unit 1122 may be devices independent of each other, as in, for example, an in-vehicle server.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input-output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.
In this embodiment of the application, the central processing unit 1122 is configured to execute the processing method of the neural network executed by the server in the embodiment corresponding to fig. 5, or is configured to execute the processing method of the neural network executed by the server in the embodiment corresponding to fig. 7. It should be noted that, for a specific implementation manner of the central processing unit 1122 executing the processing method of the neural network, reference may be made to descriptions in each method embodiment corresponding to fig. 5 and fig. 7, and details are not repeated here.
Also provided in embodiments of the present application is a computer program product, which when run on a computer, causes the computer to perform the steps performed by the server in the method described in the foregoing embodiment shown in fig. 5, or causes the computer to perform the steps performed by the server in the method described in the foregoing embodiment shown in fig. 7.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is executed on a computer, the program causes the computer to execute the steps executed by the server in the method described in the foregoing embodiment shown in fig. 5, or causes the computer to execute the steps executed by the server in the method described in the foregoing embodiment shown in fig. 7.
The execution device, the training device, the terminal device or the communication device provided by the embodiment of the application may specifically be a chip, and the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored in the storage unit to make the chip in the server execute the processing method of the neural network described in the embodiment shown in fig. 5 or make the chip in the server execute the processing method of the neural network described in the embodiment shown in fig. 7. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 12, fig. 12 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be implemented as a neural network processor NPU 120, which is mounted on a host CPU (Host CPU) as a coprocessor, with the host CPU allocating tasks. The core of the NPU is the arithmetic circuit 1203; the controller 1204 controls the arithmetic circuit 1203 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 1203 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 1203 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1203 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1202 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 1201, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in the accumulator (accumulator) 1208.
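The dataflow described above can be sketched in Python/NumPy. This is purely an illustrative software model, not the claimed hardware: the weight-stationary caching of matrix B, the row-by-row streaming of matrix A, and the accumulator behavior are assumptions made for the example.

```python
import numpy as np

def matmul_accumulate(a, b):
    """Illustrative model of the arithmetic circuit's dataflow:
    matrix B is 'cached' per processing element (here, per output
    column), rows of A are streamed in, and partial sums accumulate
    into C, playing the role of the accumulator 1208."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n))          # the accumulator's partial/final results
    for i in range(m):            # stream one row of A at a time
        for t in range(k):        # each step adds one partial product
            c[i, :] += a[i, t] * b[t, :]
    return c

a = np.arange(6, dtype=float).reshape(2, 3)   # input matrix A
b = np.ones((3, 2))                           # weight matrix B
assert np.allclose(matmul_accumulate(a, b), a @ b)
```

The nested loop makes the accumulation of partial results explicit; a real systolic array performs these multiply-accumulate steps in parallel across the PEs.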
The unified memory 1206 is used for storing input data and output data. The weight data is transferred to the weight memory 1202 directly through the Direct Memory Access Controller (DMAC) 1205. The input data is also carried into the unified memory 1206 by the DMAC.
The Bus Interface Unit (BIU) 1210 is used for the interaction of the AXI bus with the DMAC 1205 and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1209. Specifically, the BIU 1210 is used by the instruction fetch memory 1209 to fetch instructions from the external memory, and by the storage unit access controller 1205 to fetch the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1206 or to transfer weight data into the weight memory 1202 or to transfer input data into the input memory 1201.
The vector calculation unit 1207 includes a plurality of operation processing units and, if necessary, performs further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolutional/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 1207 can store the processed output vector to the unified memory 1206. For example, the vector calculation unit 1207 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 1203, for example performing linear interpolation on the feature planes extracted by the convolutional layer, or generating an activation value from a vector of accumulated values. In some implementations, the vector calculation unit 1207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1203, for example for use in subsequent layers of the neural network.
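As an illustration of the vector calculation unit's post-processing role, the following sketch applies a batch-norm-like normalization followed by a ReLU activation to a raw accumulator output. The choice of these two particular operations is an assumption made for the example; the unit supports many other element-wise operations.

```python
import numpy as np

def vector_postprocess(acc_output):
    """Sketch of the vector calculation unit's role: take the raw
    accumulator output of the matrix circuit and apply element-wise
    post-processing (here, a batch-norm-like normalization and a ReLU
    nonlinearity) before the result is written back to unified memory."""
    normalized = (acc_output - acc_output.mean()) / (acc_output.std() + 1e-5)
    activated = np.maximum(normalized, 0.0)   # ReLU as the nonlinear function
    return activated

out = vector_postprocess(np.array([[-2.0, 0.0], [1.0, 3.0]]))
assert out.shape == (2, 2)
assert (out >= 0.0).all()   # ReLU output is non-negative
```

The processed output could then serve as the activation input to the next layer, as described above.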
An instruction fetch buffer (Instruction Fetch Buffer) 1209, connected to the controller 1204, is configured to store instructions used by the controller 1204;
the unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch memory 1209 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
When the first neural network, the skill, the new skill, or the skill selected by the second neural network is embodied as a neural network, the operations of the layers in the neural network may be performed by the operation circuit 1203 or the vector calculation unit 1207.
Wherein any of the aforementioned processors may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control the execution of the programs of the method of the first aspect.
It should be noted that the above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in the present application, the connection relationship between modules indicates that there is a communication connection between them, which may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly also by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can be implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, such as analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is preferable in most cases. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product. The software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Claims (27)

1. A method of processing a neural network, comprising:
acquiring training data, wherein the training data comprises a plurality of monomodal tensors to be trained;
carrying out tensor fusion according to the plurality of monomodal tensors to be trained to obtain an initial multimodal tensor to be trained;
carrying out tensor convolution operation on the initial multi-modal tensor to be trained according to R orthogonal tensors to obtain a target multi-modal tensor to be trained, wherein R is an integer greater than 1, and any tensor in the R orthogonal tensors is orthogonal to other R-1 tensors;
updating the training data according to the multi-modal tensor of the target to be trained to obtain updated training data;
and inputting the updated training data into a preset training network, and training to obtain a prediction model.
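Claim 1 does not fix the tensor fusion operator. One common choice in multi-modal learning, shown here as a hedged sketch, is the outer product of the single-modal feature vectors, which yields an order-3 initial multi-modal tensor; the three modality names and the feature dimensions 4, 5, and 3 are arbitrary assumptions for the example.

```python
import numpy as np

# Hypothetical single-modal feature vectors (acoustic, language, visual).
acoustic = np.random.rand(4)
language = np.random.rand(5)
visual   = np.random.rand(3)

# Tensor fusion via outer product: every element of the result combines
# one feature from each modality, giving an order-3 multi-modal tensor.
initial_multimodal = np.einsum('i,j,k->ijk', acoustic, language, visual)
assert initial_multimodal.shape == (4, 5, 3)
```

The resulting tensor would then be passed to the tensor convolution with the R orthogonal tensors described in the claim.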
2. The method of claim 1, wherein performing a tensor convolution operation on the initial multi-modal tensor to be trained according to the R orthogonal tensors to obtain a target multi-modal tensor to be trained comprises:
and carrying out tensor convolution operation on the initial multi-mode tensor to be trained according to the R orthogonal tensors and a preset sliding step length to obtain the target multi-mode tensor to be trained, wherein the order of the initial multi-mode tensor to be trained and the order of any one of the R orthogonal tensors are M orders, M is a positive integer, and the number of elements contained in the preset sliding step length is M.
3. The method according to claim 2, wherein the dimension of the mth order of the initial multi-modal tensor to be trained is x, the dimension of the mth order of any one of the R orthogonal tensors is y, and the value of the mth element of the preset sliding step is z, where M is an integer no greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
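Claims 2 and 3 can be illustrated with a sketch of an order-M tensor convolution using a per-order sliding step: each output dimension is floor((x - y)/z) + 1, consistent with the constraints x > y and y >= z. The NumPy implementation and the example shapes below are illustrative assumptions, not the claimed method itself.

```python
import numpy as np
from itertools import product

def tensor_conv(t, kernel, strides):
    """Sketch of an order-M tensor convolution: slide `kernel` over `t`
    with a per-order stride, taking the sum of the elementwise product
    at each position. For each order, input dimension x exceeds kernel
    dimension y, and y >= stride z, per the claim's constraints."""
    out_shape = tuple((x - y) // z + 1
                      for x, y, z in zip(t.shape, kernel.shape, strides))
    out = np.empty(out_shape)
    for idx in product(*(range(d) for d in out_shape)):
        start = tuple(i * z for i, z in zip(idx, strides))
        window = t[tuple(slice(s, s + y) for s, y in zip(start, kernel.shape))]
        out[idx] = np.sum(window * kernel)
    return out

t = np.random.rand(6, 5, 4)   # initial multi-modal tensor, M = 3
k = np.random.rand(3, 2, 2)   # one of the R orthogonal tensors
y = tensor_conv(t, k, strides=(1, 1, 2))
assert y.shape == (4, 4, 2)   # ((6-3)//1+1, (5-2)//1+1, (4-2)//2+1)
```

In the claimed method this convolution would be applied once per orthogonal tensor, giving R such output tensors.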
4. The method according to any one of claims 1 to 3, wherein after the predictive model is trained by inputting the updated training data into a preset training network, the method further comprises:
acquiring data to be predicted, wherein the data to be predicted comprises a plurality of monomodal tensors to be predicted;
carrying out tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted;
carrying out tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain a target multi-modal tensor to be predicted;
updating the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted;
and inputting the updated data to be predicted into the prediction model, and processing to obtain a prediction result.
5. The method according to any one of claims 1 to 3, wherein after the training with the updated training data input into a preset training network to obtain a prediction model, the method further comprises:
and training the prediction model by using an orthogonal constraint loss function, and updating the prediction model.
6. The method of any of claims 1 to 3, wherein the updated training data comprises the target multi-modal tensor to be trained and the plurality of monomodal tensors to be trained.
7. The method according to any one of claims 1 to 3, wherein the plurality of monomodal tensors to be trained comprises at least two of an acoustic modality tensor, a linguistic modality tensor, and a visual modality tensor.
8. A method of processing a neural network, comprising:
acquiring data to be predicted, wherein the data to be predicted comprises a plurality of monomodal tensors to be predicted;
carrying out tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted;
carrying out tensor convolution operation on the initial multi-modal tensor to be predicted according to R orthogonal tensors to obtain a target multi-modal tensor to be predicted, wherein R is an integer larger than 1, and any tensor in the R orthogonal tensors is orthogonal to other R-1 tensors;
updating the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted;
and inputting the updated data to be predicted into a prediction model, and processing to obtain a prediction result.
9. The method of claim 8, wherein performing a tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain a target multi-modal tensor to be predicted comprises:
and carrying out tensor convolution operation on the initial multi-mode tensor to be predicted according to the R orthogonal tensors and a preset sliding step length to obtain the target multi-mode tensor to be predicted, wherein the order of the initial multi-mode tensor to be predicted and the order of any one of the R orthogonal tensors are M orders, M is a positive integer, and the number of elements contained in the preset sliding step length is M.
10. The method according to claim 9, wherein the dimension of mth order of the initial multi-modal tensor to be predicted is x, the dimension of mth order of any one of the R orthogonal tensors is y, and the value of mth element of the preset sliding step is z, where M is an integer no greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
11. The method according to any one of claims 8 to 10, wherein the updated data to be predicted comprises the target multimodal tensor to be predicted and the plurality of monomodal tensors to be predicted.
12. The method according to any one of claims 8 to 10, wherein the plurality of monomodal tensors to be predicted includes at least two of an acoustic modality tensor, a language modality tensor, and a visual modality tensor.
13. A processing apparatus of a neural network, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring training data, and the training data comprises a plurality of monomodal tensors to be trained;
the fusion unit is used for carrying out tensor fusion according to the plurality of monomodal tensors to be trained to obtain an initial multimodal tensor to be trained;
the convolution unit is used for carrying out tensor convolution operation on the initial multi-mode tensor to be trained according to R orthogonal tensors to obtain a target multi-mode tensor to be trained, wherein R is an integer larger than 1, and any tensor in the R orthogonal tensors is orthogonal to other R-1 tensors;
the updating unit is used for updating the training data according to the target multi-modal tensor to be trained to obtain updated training data;
and the training unit is used for inputting the updated training data into a preset training network and training to obtain a prediction model.
14. The apparatus of claim 13, wherein the convolution unit is specifically configured to:
and carrying out tensor convolution operation on the initial multi-mode tensor to be trained according to the R orthogonal tensors and a preset sliding step length to obtain the target multi-mode tensor to be trained, wherein the order of the initial multi-mode tensor to be trained and the order of any one of the R orthogonal tensors are M orders, M is a positive integer, and the number of elements contained in the preset sliding step length is M.
15. The apparatus of claim 14, wherein a dimension of an mth order of the initial multi-modal tensor to be trained is x, a dimension of an mth order of any one of the R orthogonal tensors is y, and a value of an mth element of the preset sliding step is z, where M is an integer no greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
16. The apparatus according to any one of claims 13 to 15, wherein the apparatus further comprises a processing unit;
the acquiring unit is further configured to acquire data to be predicted, where the data to be predicted includes a plurality of monomodal tensors to be predicted;
the fusion unit is further configured to perform tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted;
the convolution unit is further configured to perform tensor convolution operation on the initial multi-modal tensor to be predicted according to the R orthogonal tensors to obtain a target multi-modal tensor to be predicted;
the updating unit is further configured to update the data to be predicted according to the target multi-modal tensor to be predicted, so as to obtain updated data to be predicted;
and the processing unit is used for inputting the updated data to be predicted into the prediction model and processing the data to obtain a prediction result.
17. The apparatus according to any one of claims 13 to 15, wherein the updating unit is further configured to train the prediction model using an orthogonal constraint loss function to update the prediction model.
18. The apparatus of any one of claims 13 to 15, wherein the updated training data comprises the target multi-modal tensor to be trained and the plurality of monomodal tensors to be trained.
19. The apparatus of any one of claims 13 to 15, wherein the plurality of monomodal tensors to be trained comprises at least two of an acoustic modality tensor, a linguistic modality tensor, and a visual modality tensor.
20. A processing apparatus of a neural network, comprising:
the device comprises an acquisition unit, a prediction unit and a prediction unit, wherein the acquisition unit is used for acquiring data to be predicted, and the data to be predicted comprises a plurality of monomodal tensors to be predicted;
the fusion unit is used for carrying out tensor fusion according to the plurality of monomodal tensors to be predicted to obtain an initial multimodal tensor to be predicted;
a convolution unit, configured to perform tensor convolution operation on the initial multi-modal tensor to be predicted according to R orthogonal tensors to obtain a target multi-modal tensor to be predicted, where R is an integer greater than 1, and any one of the R orthogonal tensors is orthogonal to the other R-1 tensors;
the updating unit is used for updating the data to be predicted according to the target multi-modal tensor to be predicted to obtain updated data to be predicted;
and the processing unit is used for inputting a preset prediction model by using the updated data to be predicted and processing to obtain a prediction result.
21. The apparatus according to claim 20, wherein the convolution unit is specifically configured to:
and carrying out tensor convolution operation on the initial multi-mode tensor to be predicted according to the R orthogonal tensors and a preset sliding step length to obtain the target multi-mode tensor to be predicted, wherein the order of the initial multi-mode tensor to be predicted and the order of any one of the R orthogonal tensors are M orders, M is a positive integer, and the number of elements contained in the preset sliding step length is M.
22. The apparatus of claim 21, wherein a dimension of an mth order of the initial multi-modal tensor to be predicted is x, a dimension of an mth order of any one of the R orthogonal tensors is y, and a value of an mth element of the preset sliding step is z, where M is an integer no greater than M, x, y, and z are integers greater than 0, x is greater than y, and y is greater than or equal to z.
23. The apparatus of any one of claims 20 to 22, wherein the updated data to be predicted comprises the target multi-modal tensor to be predicted and the plurality of monomodal tensors to be predicted.
24. The apparatus according to any one of claims 20 to 22, wherein the plurality of monomodal tensors to be predicted comprises at least two of an acoustic modality tensor, a linguistic modality tensor, and a visual modality tensor.
25. A server, comprising a processor coupled to a memory, the memory storing program instructions that, when executed by the processor, implement the method of any of claims 1 to 7 or cause a computer to perform the method of any of claims 8 to 12.
26. A computer-readable storage medium comprising a program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 7 or causes the computer to perform the method of any one of claims 8 to 12.
27. Circuitry comprising processing circuitry configured to perform a method as claimed in any of claims 1 to 7 or to cause a computer to perform a method as claimed in any of claims 8 to 12.
CN202010321526.5A 2020-04-22 2020-04-22 Neural network processing method and related equipment Withdrawn CN111652349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010321526.5A CN111652349A (en) 2020-04-22 2020-04-22 Neural network processing method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010321526.5A CN111652349A (en) 2020-04-22 2020-04-22 Neural network processing method and related equipment

Publications (1)

Publication Number Publication Date
CN111652349A true CN111652349A (en) 2020-09-11

Family

ID=72346518

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010321526.5A Withdrawn CN111652349A (en) 2020-04-22 2020-04-22 Neural network processing method and related equipment

Country Status (1)

Country Link
CN (1) CN111652349A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537492A (en) * 2021-07-19 2021-10-22 第六镜科技(成都)有限公司 Model training and data processing method, device, equipment, medium and product
CN113537492B (en) * 2021-07-19 2024-04-26 第六镜科技(成都)有限公司 Model training and data processing method, device, equipment, medium and product
CN114239885A (en) * 2022-01-11 2022-03-25 中国科学院深圳先进技术研究院 Operation fault prediction method and device
CN116187401A (en) * 2023-04-26 2023-05-30 首都师范大学 Compression method and device for neural network, electronic equipment and storage medium
CN116187401B (en) * 2023-04-26 2023-07-14 首都师范大学 Compression method and device for neural network, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112183718B (en) Deep learning training method and device for computing equipment
CN111401406B (en) Neural network training method, video frame processing method and related equipment
CN112418392A (en) Neural network construction method and device
WO2022068623A1 (en) Model training method and related device
CN112883149B (en) Natural language processing method and device
CN115456161A (en) Data processing method and data processing system
WO2022001805A1 (en) Neural network distillation method and device
WO2022228425A1 (en) Model training method and apparatus
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
CN111652349A (en) Neural network processing method and related equipment
CN114925320B (en) Data processing method and related device
CN113627163A (en) Attention model, feature extraction method and related device
CN113065997B (en) Image processing method, neural network training method and related equipment
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
CN111738403A (en) Neural network optimization method and related equipment
CN113536970A (en) Training method of video classification model and related device
CN112529149A (en) Data processing method and related device
WO2022227024A1 (en) Operational method and apparatus for neural network model and training method and apparatus for neural network model
CN114169393A (en) Image classification method and related equipment thereof
CN114298289A (en) Data processing method, data processing equipment and storage medium
CN116739154A (en) Fault prediction method and related equipment thereof
CN116362301A (en) Model quantization method and related equipment
CN118095368A (en) Model generation training method, data conversion method and device
WO2023122854A1 (en) Data processing method and apparatus
CN115795025A (en) Abstract generation method and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20200911)