CN117494812A - Model reasoning method, device, electronic equipment and storage medium - Google Patents

Model reasoning method, device, electronic equipment and storage medium

Info

Publication number
CN117494812A
Authority
CN
China
Prior art keywords
mode
data
model
reasoning
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311281058.3A
Other languages
Chinese (zh)
Inventor
乔飞
任二祥
曲成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202311281058.3A
Publication of CN117494812A
Legal status: Pending

Classifications

    • G06N 5/041: Abduction (computing arrangements using knowledge-based models; inference or reasoning models)
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods (pattern recognition)
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks (neural network architectures)
    • G06N 3/0464: Convolutional networks [CNN, ConvNet] (neural network architectures)
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (climate change mitigation in ICT)

Abstract

The invention provides a model reasoning method, an apparatus, an electronic device and a storage medium. The method comprises: acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred; generating a multi-modal text feature sequence from the multi-modal features; and inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task. Because both the acquisition and the feature extraction of the multi-modal data are completed at the near-sensor end, the method accelerates front-end neural network computation, reduces device energy consumption, lowers latency and communication overhead, relieves the energy pressure of multi-modal data acquisition, and shares the computational load of the high-compute end. Converting the multi-modal features into a multi-modal text feature sequence that the large language model can process removes the limitation on data modality types, enables the reasoning of the target task, and effectively improves both the inference speed and the inference accuracy of the model.

Description

Model reasoning method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a model reasoning method, an apparatus, an electronic device and a storage medium.
Background
With the rapid development of Internet-of-Things technology, how to deploy traditional deep learning networks on Internet-of-Things devices has become a popular research topic; meanwhile, multi-modal feature fusion and task reasoning have been among the research hotspots in deep learning in recent years. Designing a more efficient and energy-saving deployment architecture for multi-modal feature fusion networks is therefore a topic of considerable research value.
In the prior art, multi-modal data acquisition is often performed by edge-side Internet-of-Things devices. However, dedicated edge devices deployed for specific tasks are expensive, and because acquiring multi-modal data requires a large number of edge devices to collect data in different data modalities, the energy consumption of the edge devices is high, which strains the energy resources of the edge network.
Meanwhile, the models used by traditional multi-modal feature fusion methods are often large in size and computational cost, and existing task reasoning methods likewise suffer from huge models, extremely large computation in the training and inference processes, and excessive demands on device computing capability.
Furthermore, even where existing algorithms reduce the training computation, they are limited in the types of data modalities they support, for example realizing only the fusion and reasoning of bimodal features such as image-text or speech-text, which results in low reasoning accuracy.
Therefore, how to overcome the low inference speed and low inference accuracy caused by large computation and the limitation on data modality types in existing model reasoning methods is an important open problem in the artificial intelligence field.
Disclosure of Invention
The invention provides a model reasoning method, an apparatus, an electronic device and a storage medium, which overcome the defects of low inference speed and low inference accuracy caused by large computation and the limitation on data modality types in existing model reasoning methods, and effectively improve the inference speed and the inference accuracy.
In a first aspect, the present invention provides a model reasoning method, comprising: acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities; generating a multi-modal text feature sequence from the multi-modal features; and inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
Further, acquiring the multi-modal features extracted by the near-sensor end according to the target task to be inferred comprises: collecting unimodal data of a plurality of different data modalities in a target scene according to the target task; and performing feature extraction on the unimodal data of the plurality of different data modalities respectively to obtain the unimodal features of the plurality of different data modalities.
Further, the unimodal data of each modality corresponds to a pre-trained feature extraction network model. Correspondingly, performing feature extraction on the unimodal data of the plurality of different data modalities respectively comprises: inputting the unimodal data of each modality into the corresponding feature extraction network model to obtain the unimodal features of that modality; and obtaining the unimodal features of the plurality of different data modalities from the unimodal features of the individual modalities. Each feature extraction network model is trained and optimized on a training sample data set formed from the corresponding modality data and its feature extraction results.
Further, the feature extraction network model includes, but is not limited to, a convolutional network layer and a flattening layer; alternatively, it includes, but is not limited to, a convolutional network layer, a flattening layer and a multi-head attention layer.
Further, the unimodal features of each modality correspond to a pre-trained conversion network model. Correspondingly, generating the multi-modal text feature sequence from the multi-modal features comprises: inputting the unimodal features of each data modality into the corresponding conversion network model to obtain a unimodal text feature sequence; and splicing the plurality of unimodal text feature sequences to obtain the multi-modal text feature sequence. The conversion network model is constructed based on a Transformer model.
Further, the unimodal data comprises any one of image data, audio data, video data and text data.
In a second aspect, the present invention also provides a model reasoning apparatus, comprising: a multi-modal feature acquisition module, configured to acquire multi-modal features extracted by the near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities; a text feature sequence generation module, configured to generate a multi-modal text feature sequence from the multi-modal features; and a target task reasoning module, configured to input the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the model reasoning method as described in any of the above.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the model reasoning method as described in any of the above.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the model reasoning method as described in any of the above.
According to the model reasoning method provided by the invention, the multi-modal features extracted by the near-sensor end are acquired according to the target task to be inferred, the multi-modal features comprising unimodal features of a plurality of different data modalities; a multi-modal text feature sequence is generated from the multi-modal features; and the multi-modal text feature sequence is input into a pre-trained large language model to obtain the task reasoning result for the target task. Because both the acquisition and the feature extraction of the multi-modal data are completed at the near-sensor end, the method accelerates front-end neural network computation and reduces device energy consumption, while also lowering latency and communication overhead, relieving the energy pressure of multi-modal data acquisition, and sharing the computational load of the high-compute end. On this basis, converting the multi-modal features into a multi-modal text feature sequence that the large language model can process removes the limitation on data modality types, enables the reasoning of the target task, and effectively improves both the inference speed and the inference accuracy of the model.
Drawings
In order to illustrate the invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of the model reasoning method provided by the invention;
FIG. 2 is a schematic diagram of the reasoning architecture of the model reasoning method provided by the invention;
FIG. 3 is the first reasoning schematic of the model reasoning method provided by the invention;
FIG. 4 is the second reasoning schematic of the model reasoning method provided by the invention;
FIG. 5 is a schematic structural diagram of the model reasoning apparatus provided by the invention;
FIG. 6 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Fig. 1 shows a schematic flow chart of the model reasoning method provided by the invention. As shown in Fig. 1, the method comprises:
S110: acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities.
It can be understood that one or more sensors for collecting data of different data modalities are installed in the target scene. Each sensor not only collects the data of its data modality but also performs feature extraction on the collected unimodal data, obtaining the unimodal features corresponding to that data modality.
That is, the unimodal data of each data modality has a corresponding acquisition device and feature extraction device, and the two are one and the same device, namely the sensor.
The sensors corresponding to the different data modalities together constitute the near-sensor end.
Specifically, once the target task to be inferred is determined, it is known which modalities of data must be acquired to perform the target task inference, so the corresponding unimodal data are collected with the sensors installed in the target scene for the corresponding data modalities.
After collecting the unimodal data of the different data modalities, the sensors also perform feature extraction on the collected unimodal data, obtaining a plurality of unimodal features. The unimodal features of the plurality of different data modalities constitute the multi-modal features.
The foregoing feature extraction may be implemented by a neural network model or by other feature extraction methods, which is not specifically limited here.
The target tasks include, but are not limited to, object detection, automatic question answering, image captioning, machine translation, text classification, sentiment analysis, information retrieval, and the like. The target scene corresponds to the target task and can be any scene related to the target task.
The unimodal data are determined by the target task and may be image data, audio data, video data or text data, which is not specifically limited here.
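For illustration, the multi-modal features can be pictured on the high-compute side as a mapping from modality name to feature tensor. The following Python sketch is purely illustrative and not part of the patent: the modality names, the tensor shapes and the collect_multimodal_features helper are all assumptions, with random tensors standing in for the features that would arrive from the sensors.

```python
# A minimal sketch of one way to organize the received multi-modal features.
# Each sensor is assumed to ship a (sequence_length, feature_dim) tensor.
from typing import Dict
import torch

def collect_multimodal_features(target_task: str) -> Dict[str, torch.Tensor]:
    """Gather the unimodal features the near-sensor end extracted for the task."""
    # Hypothetical modalities and shapes; real tensors would arrive over the
    # network from the sensors rather than being sampled here.
    required = {"image": (49, 512), "audio": (100, 512)}
    return {modality: torch.randn(*shape) for modality, shape in required.items()}

features = collect_multimodal_features("sentiment_analysis")
print({modality: tuple(t.shape) for modality, t in features.items()})
```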
S120: generating a multi-modal text feature sequence from the multi-modal features.
It can be understood that, on the basis of the multi-modal features extracted by the near-sensor end and acquired in step S110, a multi-modal text feature sequence is then generated from the multi-modal features.
Specifically, for the obtained unimodal features of the plurality of different data modalities, a corresponding unimodal text feature sequence is generated from the unimodal features of each data modality, yielding a plurality of unimodal text feature sequences.
It should be noted that each unimodal text feature sequence retains the unimodal features of its corresponding data modality.
The plurality of unimodal text feature sequences are then fused to obtain the multi-modal text feature sequence.
The generation of the unimodal text feature sequences may be implemented by a neural network model or in other ways, which is not specifically limited here.
The fusion of the plurality of unimodal text feature sequences may use splicing or other fusion approaches, which is not specifically limited here.
S130: inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
It can be understood that, on the basis of the multi-modal text feature sequence generated from the multi-modal features in step S120, a large language model (LLM) can then be used to perform the reasoning of the target task and obtain the task reasoning result.
Specifically, the multi-modal text feature sequence is input into the pre-trained large language model, which outputs the task reasoning result for the target task.
Large language models, such as ChatGPT, LLaMA and OPT, are a recent research hotspot in the field of artificial intelligence. They possess strong language understanding capability and knowledge reserves, and show great potential in natural language processing.
Meanwhile, thanks to their huge reserves of general knowledge and strong logical reasoning capability, large language models also have promising applications in multi-modal fusion and reasoning. However, a large language model can only understand text content.
In view of this, this embodiment uses step S120 to generate the multi-modal text feature sequence from the multi-modal features, building a "bridge" between the large language model and each data modality, which effectively improves the capability to fuse features of multiple different data modalities and realizes multi-modal feature fusion.
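Because the multi-modal text feature sequence already lives in the embedding space of the large language model, step S130 amounts to handing pre-computed embeddings to a frozen LLM. The sketch below is a minimal illustration, assuming a HuggingFace-style causal language model (such as OPT) whose generate method accepts inputs_embeds; the model name and the sequence shape are assumptions, not choices made by the patent.

```python
# A minimal sketch of step S130: feed the multi-modal text feature sequence
# to a frozen, pre-trained causal LM as pre-computed input embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # hypothetical choice of pre-trained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(model_name)
llm.eval()  # the LLM is used as-is: no training or fine-tuning

# (batch, seq_len, hidden): the multi-modal text feature sequence from S120,
# here a random stand-in with the LLM's hidden width.
multimodal_seq = torch.randn(1, 64, llm.config.hidden_size)

with torch.no_grad():
    output_ids = llm.generate(inputs_embeds=multimodal_seq, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```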
In this embodiment, the multi-modal features extracted by the near-sensor end are acquired according to the target task to be inferred, the multi-modal features comprising unimodal features of a plurality of different data modalities; a multi-modal text feature sequence is generated from the multi-modal features; and the multi-modal text feature sequence is input into a pre-trained large language model to obtain the task reasoning result for the target task. Because both the acquisition and the feature extraction of the multi-modal data are completed at the near-sensor end, the method accelerates front-end neural network computation and reduces device energy consumption, while also lowering latency and communication overhead, relieving the energy pressure of multi-modal data acquisition, and sharing the computational load of the high-compute end. On this basis, converting the multi-modal features into a multi-modal text feature sequence that the large language model can process removes the limitation on data modality types, enables the reasoning of the target task, and effectively improves both the inference speed and the inference accuracy of the model.
Based on the above description, the reasoning architecture of the model reasoning method provided in this embodiment differs considerably from existing model reasoning architectures.
Specifically, in one embodiment, Fig. 2 shows a schematic diagram of the reasoning architecture of the model reasoning method provided by the invention.
As shown in Fig. 2, the overall reasoning architecture consists of a near-sensor end and a high-compute end. The near-sensor end collects the multi-modal data (unimodal data of a plurality of different data modalities) and performs the feature extraction on the collected multi-modal data to obtain the multi-modal features (unimodal features of the plurality of different data modalities), and then transmits the extracted multi-modal features to the high-compute end.
After receiving the multi-modal features transmitted by the near-sensor end, the high-compute end performs the text feature sequence generation and the task reasoning on the unimodal features of the plurality of different data modalities, thereby obtaining the task reasoning result.
It should be noted that the largest difference between this reasoning architecture and existing ones is, first, that the model reasoning method provided in this embodiment not only collects the multi-modal data at the near-sensor end but also extracts features from it to obtain the multi-modal features, whereas existing model reasoning architectures only collect multi-modal data at the near-sensor end, and the data modality types of that multi-modal data are limited.
Collecting the multi-modal data and extracting the multi-modal features at the near-sensor end accelerates front-end neural network computation, reduces device energy consumption, lowers latency and communication overhead, relieves the energy pressure of multi-modal data acquisition, and shares the computational load of the high-compute end.
Second, the general knowledge reserves and logical understanding capability of the pre-trained large language model at the high-compute end are exploited to improve the feature fusion capability for the multi-modal data and the model performance, realizing multi-modal feature fusion. Meanwhile, the conversion network models chain the near-sensor feature extraction networks to the pre-trained large language model, so the large language model itself requires no training or fine-tuning, which reduces the training computation required.
It should be noted that the above embodiments of the invention take the high-compute end as the execution subject.
On the basis of the above embodiment, further, acquiring the multi-modal features extracted by the near-sensor end according to the target task to be inferred comprises: collecting unimodal data of a plurality of different data modalities in the target scene according to the target task; and performing feature extraction on the unimodal data of the plurality of different data modalities respectively to obtain the unimodal features of the plurality of different data modalities.
It can be understood that, once the target task to be inferred is determined, it is known which modalities of data must be acquired to perform the target task inference, so the corresponding unimodal data are collected with the sensors installed in the target scene for the corresponding data modalities.
Feature extraction is then performed on the unimodal data of the different data modalities separately, yielding the unimodal features of the plurality of different data modalities.
For the feature extraction, this embodiment constructs an independent, dedicated feature extraction network model for the unimodal data of each data modality, and before feature extraction, each feature extraction network model can be trained independently.
That is, the unimodal data of each data modality corresponds to a pre-trained feature extraction network model.
Correspondingly, performing feature extraction on the unimodal data of the plurality of different data modalities respectively comprises: inputting the unimodal data of each modality into the corresponding feature extraction network model to obtain the unimodal features of that modality; and obtaining the unimodal features of the plurality of different data modalities from the unimodal features of the individual modalities.
It should be noted that the feature extraction network model corresponding to the unimodal data of each data modality is independent of the feature extraction network models of the other data modalities, in both model training and feature extraction.
The feature extraction network model corresponding to each data modality is trained and optimized on a training sample data set formed from that modality's data and the corresponding feature extraction results.
The feature extraction network model includes, but is not limited to, a convolutional network layer and a flattening layer; alternatively, it includes, but is not limited to, a convolutional network layer, a flattening layer and a multi-head attention layer.
Specifically, this embodiment designs a lightweight feature extraction network model based on a convolutional neural network, so that its output suits the cross-attention computation of the conversion network model while meeting the hardware constraints of the near-sensor end.
To elaborate, when the feature extraction network model processes the unimodal data, the data is input into the convolutional network layer of the model; the multi-channel features produced by the convolutional neural network are flattened, and position embeddings are added in the flattening layer, so that the output suits the cross-attention computation in the conversion network model.
Further, a multi-head attention layer can be added after the flattening layer to help the model attend to the key information in the features.
In a specific embodiment, taking a ResNet image feature extraction network model as an example, the final pooling layer and linear layer of the network can be replaced with a flattening layer: with the image channels serving as the embedding dimension, the multi-channel feature map output by the residual blocks is flattened along the image height and width, and position embeddings are added, producing image features suited to the cross-attention computation of the conversion network model.
Optionally, a multi-head attention layer is added after the flattening layer to help the model attend to the key information in the image.
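The modified ResNet extractor described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the patented implementation: the choice of resnet18, the feature dimensions and the attention hyperparameters are assumptions.

```python
# A sketch of the lightweight image feature extractor: a ResNet backbone whose
# final pooling and linear layers are replaced by flattening plus position
# embeddings, with an optional multi-head attention layer on top.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ImageFeatureExtractor(nn.Module):
    def __init__(self, max_positions: int = 49, use_attention: bool = True):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to and including the residual blocks; drop the
        # average pooling layer and the classification head.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.embed_dim = 512  # channel count of resnet18's last stage
        self.pos_embed = nn.Parameter(torch.zeros(1, max_positions, self.embed_dim))
        self.attn = (nn.MultiheadAttention(self.embed_dim, num_heads=8,
                                           batch_first=True)
                     if use_attention else None)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(images)               # (B, C, H, W)
        feat = feat.flatten(2).transpose(1, 2)     # (B, H*W, C): channels as embedding
        feat = feat + self.pos_embed[:, : feat.size(1)]
        if self.attn is not None:
            feat, _ = self.attn(feat, feat, feat)  # attend to key image information
        return feat

extractor = ImageFeatureExtractor()
print(extractor(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 49, 512])
```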
In this embodiment, unimodal data of a plurality of different data modalities in the target scene are collected according to the target task, and a lightweight feature extraction network model then performs feature extraction on the unimodal data of each data modality, obtaining the unimodal features of the plurality of different data modalities, i.e., the multi-modal features. The multi-modal features are further converted into a multi-modal text feature sequence, which is input into the pre-trained large language model to obtain the task reasoning result for the target task. Because both the acquisition and the feature extraction of the multi-modal data are completed at the near-sensor end, the method accelerates front-end neural network computation, reduces device energy consumption, lowers latency and communication overhead, relieves the energy pressure of multi-modal data acquisition, and shares the computational load of the high-compute end. On this basis, converting the multi-modal features into a multi-modal text feature sequence that the large language model can process removes the limitation on data modality types, enables the reasoning of the target task, and effectively improves both the inference speed and the inference accuracy of the model.
On the basis of the foregoing embodiments, further, generating the multi-modal text feature sequence from the multi-modal features comprises: inputting the unimodal features of each data modality into the corresponding conversion network model to obtain a unimodal text feature sequence; and splicing the plurality of unimodal text feature sequences to obtain the multi-modal text feature sequence.
It will be appreciated that, likewise, for generating the text feature sequences this embodiment constructs an independent, dedicated conversion network model for the unimodal features of each data modality, and before the text feature sequences are generated, each conversion network model can be trained independently.
The conversion network model can be constructed based on a Transformer model.
Regarding the network structure of the conversion network model, in one specific embodiment the conversion network model comprises a feed-forward network layer, a cross-attention network layer and a self-attention network layer.
To elaborate, after the high-compute end receives the multi-modal features (the unimodal features of the plurality of different data modalities) transmitted by the near-sensor end, since the unimodal features of each data modality correspond to a pre-trained conversion network model, the corresponding unimodal text feature sequence is obtained simply by inputting the unimodal features of each data modality into the corresponding conversion network model and performing the cross-attention computation.
It should be noted that although the unimodal features of the different data modalities are here converted into unimodal text feature sequences that the large language model can process, each unimodal text feature sequence still retains the essential characteristics of its corresponding unimodal features.
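One way to realize a conversion network with these three layers is sketched below. This is an interpretation of the text under stated assumptions, not the patented design: the learned query tokens, the query count and the projection into the LLM embedding width are all assumptions introduced for illustration.

```python
# A sketch of one possible conversion network: learned query tokens
# cross-attend to the unimodal features, then pass through self-attention and
# a feed-forward layer, yielding a fixed-length unimodal text feature
# sequence in the LLM's embedding space.
import torch
import torch.nn as nn

class ConversionNetwork(nn.Module):
    def __init__(self, feat_dim: int, llm_dim: int, num_queries: int = 32,
                 num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, llm_dim))
        self.feat_proj = nn.Linear(feat_dim, llm_dim)  # map features to LLM width
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(llm_dim, 4 * llm_dim), nn.GELU(),
                                 nn.Linear(4 * llm_dim, llm_dim))
        self.norm1 = nn.LayerNorm(llm_dim)
        self.norm2 = nn.LayerNorm(llm_dim)
        self.norm3 = nn.LayerNorm(llm_dim)

    def forward(self, unimodal_feat: torch.Tensor) -> torch.Tensor:
        kv = self.feat_proj(unimodal_feat)            # (B, N, llm_dim)
        q = self.queries.expand(kv.size(0), -1, -1)   # (B, num_queries, llm_dim)
        x, _ = self.cross_attn(q, kv, kv)             # queries attend to features
        x = self.norm1(x + q)
        y, _ = self.self_attn(x, x, x)
        x = self.norm2(x + y)
        x = self.norm3(x + self.ffn(x))
        return x                                      # unimodal text feature sequence

net = ConversionNetwork(feat_dim=512, llm_dim=2048)
print(net(torch.randn(1, 49, 512)).shape)  # torch.Size([1, 32, 2048])
```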
On the basis of the unimodal text feature sequences of the different data modalities output by the plurality of conversion network models, the plurality of unimodal text feature sequences are spliced to obtain the multi-modal text feature sequence.
In a specific embodiment, the unimodal text feature sequences of the different data modalities output by the plurality of conversion network models are input into a synthesis network layer, which outputs the multi-modal text feature sequence.
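In the simplest reading, the splicing can be plain concatenation along the sequence dimension, since all unimodal text feature sequences share the LLM embedding width. A brief sketch follows; note that the synthesis network layer could equally apply a learned transformation, which is left out here.

```python
# A minimal sketch of the splicing step: concatenate the unimodal text
# feature sequences along the sequence dimension.
import torch

image_seq = torch.randn(1, 32, 2048)   # from the image conversion network
audio_seq = torch.randn(1, 32, 2048)   # from the audio conversion network
multimodal_seq = torch.cat([image_seq, audio_seq], dim=1)  # (1, 64, 2048)
print(multimodal_seq.shape)
```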
In this embodiment, the unimodal features of each data modality are input into the corresponding conversion network model to obtain a unimodal text feature sequence, the plurality of unimodal text feature sequences are spliced into the multi-modal text feature sequence, and the multi-modal text feature sequence then serves as the input of the large language model to obtain the task reasoning result for the target task. Because both the acquisition and the feature extraction of the multi-modal data are completed at the near-sensor end, the method accelerates front-end neural network computation, reduces device energy consumption, lowers latency and communication overhead, relieves the energy pressure of multi-modal data acquisition, and lets the distributed near-sensor devices share the computational load of the high-compute-end device. On this basis, converting the multi-modal features into a multi-modal text feature sequence that the large language model can process removes the limitation on data modality types, enables the reasoning of the target task, and effectively improves both the inference speed and the inference accuracy of the model.
In some embodiments, Fig. 3 shows the first reasoning schematic of the model reasoning method provided by the invention.
As shown in Fig. 3, first, at the near-sensor end, the unimodal data of the data modalities required by the target task, such as image data, audio data and data of other modalities, are collected using one or more sensors installed in the target scene.
Then, feature extraction is performed on each set of unimodal data at the near-sensor end (the sensor side) to obtain the unimodal features, which are transmitted to the conversion network models at the high-compute end.
Next, after receiving the unimodal features of its data modality, each Transformer-based conversion network model converts the unimodal features of the corresponding data modality into the corresponding unimodal text feature sequence.
The unimodal text feature sequences output by the plurality of conversion network models are then input into the synthesis network layer for fusion, yielding the multi-modal text feature sequence.
Finally, the large language model takes the multi-modal text feature sequence as input and produces the task reasoning result for the target task.
When reasoning about the target task, any combination of the unimodal text feature sequences of the several data modalities may be fused according to the actual situation, which is not specifically limited here.
In other embodiments, Fig. 4 shows the second reasoning schematic of the model reasoning method provided by the invention.
As shown in Fig. 4, compared with Fig. 3, apart from the division into a near-sensor end and a high-compute end, each chain of unimodal data acquisition, feature extraction and text feature sequence conversion in Fig. 4 can exist as a standalone unimodal module, such as an image modality module, an audio modality module or a module for another data modality.
During training, the network training can be performed per standalone unimodal module, i.e., on the feature extraction network model and the conversion network model within each module.
In actual reasoning, this realizes an end-to-end, one-stop "data acquisition - feature extraction - text feature sequence conversion" service.
In other words, in Fig. 4, before the text feature sequences are synthesized, the processing of the unimodal data of each data modality is encapsulated in a unimodal module. According to the modality fusion requirements, each unimodal module can be combined with or split from the other unimodal modules to complete the data acquisition, feature extraction and text feature sequence conversion of its data modality, and the unimodal text feature sequences generated by the unimodal modules are synthesized so that the large language model can reason out the final result.
The different unimodal modules can be trained separately, which reduces the need for a specific training set and simplifies the training process, as illustrated by the sketch below.
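A sketch of training one unimodal module in isolation follows. The training objective is an assumption made for illustration (teacher forcing of the correct answer tokens through the frozen LLM); the patent itself does not prescribe the loss, only that the LLM needs no training or fine-tuning. Only the feature extraction network and the conversion network receive gradients.

```python
# A sketch of per-module training: gradients flow through the extractor and
# converter while the pre-trained LLM stays frozen.
import torch

def train_unimodal_module(extractor, converter, llm, loader, epochs=1):
    for p in llm.parameters():
        p.requires_grad_(False)  # the LLM is frozen: no training or fine-tuning
    params = list(extractor.parameters()) + list(converter.parameters())
    optimizer = torch.optim.AdamW(params, lr=1e-4)
    for _ in range(epochs):
        for raw_data, answer_ids in loader:  # hypothetical (input, token id) pairs
            seq = converter(extractor(raw_data))  # unimodal text feature sequence
            answer_embeds = llm.get_input_embeddings()(answer_ids)
            inputs = torch.cat([seq, answer_embeds], dim=1)
            # Ignore the loss over the feature positions; score only the answer.
            ignore = torch.full(seq.shape[:2], -100, dtype=torch.long)
            labels = torch.cat([ignore, answer_ids], dim=1)
            loss = llm(inputs_embeds=inputs, labels=labels).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```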
For different target tasks, each unimodal module in this embodiment can be independently pre-trained to extract task-specific features within the same data modality, so as to improve its feature extraction capability for the specific task in a targeted way.
Meanwhile, the pre-trained large language model has broad general knowledge and strong logical reasoning capability. Provided the length of the synthesized multi-modal text feature sequence does not exceed the input token length limit of the large language model, the model can combine the feature information of the unimodal text sequences corresponding to the multiple data modalities and reason out a task result with higher accuracy.
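The token length constraint above can be enforced with a simple guard before invoking the LLM. The truncation fallback in this sketch is an assumption; the patent does not prescribe how an over-long sequence should be handled.

```python
# A small sketch of the context-length guard on the synthesized sequence.
import torch

def fit_to_context(seq: torch.Tensor, max_tokens: int) -> torch.Tensor:
    """Ensure the (batch, seq_len, dim) sequence fits the LLM context window."""
    if seq.size(1) > max_tokens:
        seq = seq[:, :max_tokens]  # assumed fallback: simple truncation
    return seq

print(fit_to_context(torch.randn(1, 96, 2048), max_tokens=64).shape)
```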
Moreover, by retraining on the specific features of different tasks within the same data modality, each unimodal module can acquire the capability of extracting those specific features, thereby improving the reasoning accuracy for the specific task.
In still other embodiments, taking image acquisition as an example, an image is captured by the image sensing device at the near-sensor end, and the convolution operations of the residual blocks of a residual network model (ResNet) are performed at the near-sensor end to obtain a feature map.
After the near-sensor feature extraction is completed, the feature map is transmitted to the conversion network model at the back end, where the network model, designed on the basis of the Transformer decoder structure, computes a specific unimodal text feature sequence.
The unimodal text feature sequences of the speech and electrocardiogram (ECG) modalities are generated similarly to that of the image modality: a text feature sequence containing speech features and a unimodal text feature sequence containing heartbeat features are obtained through the in-memory computation of the sensors and the computation of the respective conversion network models.
After the unimodal text feature sequences of the unimodal modules have been generated, the synthesis network layer produces a multi-modal text feature sequence containing descriptions of the image, speech and ECG features, which is adjusted and converted into an input format suitable for the large language model.
Finally, using the general knowledge and logical reasoning capability obtained from large-scale pre-training, the large language model interprets the feature information contained in the multi-modal text feature sequence, judges the emotional state of the target person according to the requirements of the target task, and provides the required outputs, such as a description of the state and a proposed solution.
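Putting the pieces together, this embodiment's flow can be sketched end to end. The sketch reuses the hypothetical building blocks above (extractors, conversion networks, a HuggingFace-style LLM); the modality names, the sensor callables and every dimension are illustrative assumptions, and sensor-side in-memory computation is approximated by ordinary tensor operations.

```python
# A compact sketch of the full flow of this embodiment: acquire per modality,
# extract features, convert to text feature sequences, splice, and reason.
import torch

def infer_emotional_state(sensors, extractors, converters, llm, tokenizer,
                          modalities=("image", "speech", "ecg")):
    unimodal_seqs = []
    for modality in modalities:
        raw = sensors[modality]()                         # near-sensor acquisition
        feat = extractors[modality](raw)                  # near-sensor feature map
        unimodal_seqs.append(converters[modality](feat))  # text feature sequence
    # Synthesis network layer, reduced here to concatenation.
    multimodal_seq = torch.cat(unimodal_seqs, dim=1)
    with torch.no_grad():
        out = llm.generate(inputs_embeds=multimodal_seq, max_new_tokens=100)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```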
Fig. 5 shows a schematic structural diagram of the model reasoning apparatus provided by the invention.
As shown in Fig. 5, the apparatus comprises: a multi-modal feature acquisition module 510, configured to acquire multi-modal features extracted by the near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities; a text feature sequence generation module 520, configured to generate a multi-modal text feature sequence from the multi-modal features; and a target task reasoning module 530, configured to input the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
In this embodiment, the multi-modal feature acquisition module 510 acquires the multi-modal features extracted by the near-sensor end according to the target task to be inferred, the multi-modal features comprising unimodal features of a plurality of different data modalities; the text feature sequence generation module 520 generates a multi-modal text feature sequence from the multi-modal features; and the target task reasoning module 530 inputs the multi-modal text feature sequence into a pre-trained large language model to obtain the task reasoning result for the target task. Because both the acquisition and the feature extraction of the multi-modal data are completed at the near-sensor end, the apparatus accelerates front-end neural network computation, reduces device energy consumption, lowers latency and communication overhead, relieves the energy pressure of multi-modal data acquisition, and lets the distributed near-sensor devices share the computational load of the high-compute-end device. On this basis, converting the multi-modal features into a multi-modal text feature sequence that the large language model can process removes the limitation on data modality types, enables the reasoning of the target task, and effectively improves both the inference speed and the inference accuracy of the model.
It should be noted that the model reasoning apparatus provided in this embodiment corresponds to the model reasoning method described above, and the two may be referred to against each other; the details are not repeated here.
Fig. 6 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include: a processor 610, a communication interface 620, a memory 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with one another via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the model reasoning method, which comprises: acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities; generating a multi-modal text feature sequence from the multi-modal features; and inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to perform the model reasoning method provided by the methods described above, the method comprising: acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities; generating a multi-modal text feature sequence from the multi-modal features; and inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the model reasoning method provided by the above methods, the method comprising: acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities; generating a multi-modal text feature sequence from the multi-modal features; and inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims (10)

1. A model reasoning method, comprising:
acquiring multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities;
generating a multi-modal text feature sequence from the multi-modal features;
and inputting the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
2. The model reasoning method of claim 1, wherein the acquiring the multi-modal features extracted by the near-sensor end according to the target task to be inferred comprises:
collecting unimodal data of a plurality of different data modalities in a target scene according to the target task;
and performing feature extraction on the unimodal data of the plurality of different data modalities respectively to obtain the unimodal features of the plurality of different data modalities.
3. The model reasoning method of claim 2, wherein the unimodal data of each modality corresponds to a pre-trained feature extraction network model;
correspondingly, the performing feature extraction on the unimodal data of the plurality of different data modalities respectively comprises:
inputting the unimodal data of each modality into the corresponding feature extraction network model respectively to obtain the unimodal features of that modality;
obtaining the unimodal features of the plurality of different data modalities based on the unimodal features of the individual modalities;
wherein each feature extraction network model is trained and optimized on a training sample data set formed from the corresponding modality data and its feature extraction results.
4. The model reasoning method of claim 3, wherein the feature extraction network model includes, but is not limited to, a convolutional network layer and a flattening layer, or includes, but is not limited to, a convolutional network layer, a flattening layer and a multi-head attention layer.
5. The model reasoning method of claim 1, wherein the unimodal features of each modality correspond to a pre-trained conversion network model;
correspondingly, the generating the multi-modal text feature sequence from the multi-modal features comprises:
inputting the unimodal features of each data modality into the corresponding conversion network model respectively to obtain a unimodal text feature sequence;
splicing the plurality of unimodal text feature sequences to obtain the multi-modal text feature sequence;
wherein the conversion network model is constructed based on a Transformer model.
6. The model reasoning method of claim 2, wherein the unimodal data comprises any one of image data, audio data, video data and text data.
7. A model reasoning apparatus, comprising:
a multi-modal feature acquisition module, configured to acquire multi-modal features extracted by a near-sensor end according to a target task to be inferred, wherein the multi-modal features comprise unimodal features of a plurality of different data modalities;
a text feature sequence generation module, configured to generate a multi-modal text feature sequence from the multi-modal features;
and a target task reasoning module, configured to input the multi-modal text feature sequence into a pre-trained large language model to obtain a task reasoning result for the target task.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the model reasoning method as claimed in any of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the model reasoning method of any of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the model reasoning method of any of claims 1 to 6.
CN202311281058.3A 2023-09-28 2023-09-28 Model reasoning method, device, electronic equipment and storage medium Pending CN117494812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311281058.3A CN117494812A (en) 2023-09-28 2023-09-28 Model reasoning method, device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117494812A 2024-02-02

Family

ID=89677190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311281058.3A Pending CN117494812A (en) 2023-09-28 2023-09-28 Model reasoning method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117494812A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117933401A (en) * 2024-03-22 2024-04-26 北京大学 Accelerator hardware and acceleration method based on large language model speculative sampling reasoning
CN117933401B (en) * 2024-03-22 2024-06-07 北京大学 Accelerator hardware and acceleration method based on large language model speculative sampling reasoning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination