CN117874488A - Zero sample learning model optimization method, device, equipment and storage medium

Zero sample learning model optimization method, device, equipment and storage medium

Info

Publication number
CN117874488A
Authority
CN
China
Prior art keywords
learning model
visual
encoder
zero
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311750518.2A
Other languages
Chinese (zh)
Inventor
曹伟朋
姚旭洋
明仲
许智武
顾炯炯
郑亮
曹雏清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hart Robotics Industry Technology Research Institute In Yangtze River Delta
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen
Original Assignee
Hart Robotics Industry Technology Research Institute In Yangtze River Delta
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hart Robotics Industry Technology Research Institute In Yangtze River Delta and Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen
Priority to CN202311750518.2A
Publication of CN117874488A


Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a zero sample learning model optimization method, device, equipment and storage medium. The method comprises the following steps: acquiring visual features learned by a visual encoder from a sample image; acquiring text data, inputting the text data into a text encoder, and acquiring semantic attribute features learned from the text data by the text encoder; querying the semantic attribute features with the visual features through a multi-modal interaction module to obtain a first interaction result, querying the visual features with the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result; when the zero sample learning model executes an optimization task, generating a comprehensive loss function of the optimization task with a preset comprehensive loss function generation model; and optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain an optimized zero sample learning model. The method and the device help improve the optimization efficiency of the zero sample learning model.

Description

Zero sample learning model optimization method, device, equipment and storage medium
Technical Field
The application relates to the technical field of internet, in particular to a zero sample learning model optimization method, device, equipment and storage medium.
Background
Zero sample learning (also known as zero-shot learning) is an attractive approach whose core idea is to learn known categories from limited annotated data and then transfer this learned knowledge to unknown categories in order to identify them. A model trained by zero sample learning is referred to simply as a zero sample learning model.
However, existing server-side equipment cannot optimize the zero sample learning model according to a comprehensive loss function, which hinders improving the optimization efficiency of the zero sample learning model. Specifically, existing server equipment can only optimize the zero sample learning model with a single loss function, while the loss of a zero sample learning model comes from multiple sources. When the model is optimized with a single loss function, optimization takes a long time, the optimization effect is unsatisfactory, and a large amount of time and equipment resources are consumed. Existing server equipment therefore cannot optimize the zero sample learning model according to a comprehensive loss function, which is detrimental to the optimization efficiency of the zero sample learning model.
Disclosure of Invention
The embodiments of the application provide a zero sample learning model optimization method, device, equipment and storage medium, which are used to solve the technical problem that existing server equipment cannot optimize a zero sample learning model according to a comprehensive loss function, which is unfavorable to improving the optimization efficiency of the zero sample learning model.
In a first aspect, an embodiment of the present application provides a zero sample learning model optimization method, applied to a server device, where the server device stores a zero sample learning model, the zero sample learning model includes a visual encoder, a text encoder, and a multi-modal interaction module, and the zero sample learning model optimization method includes:
acquiring a sample image, inputting the sample image into the visual encoder, and acquiring visual features learned from the sample image by the visual encoder;
acquiring text data, inputting the text data into the text encoder, and acquiring semantic attribute features learned from the text data by the text encoder;
querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task;
and optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model.
Illustratively, the comprehensive loss function generation model includes:
L = L_cls + λ1·L_deb + λ2·L_regT
wherein L is the comprehensive loss function, λ1 and λ2 are preset constants, L_cls is the cross-entropy loss, L_deb is the debiasing loss, and L_regT is the sum of the attribute regression losses computed by each multi-modal interaction module.
Illustratively, the querying, by the multimodal interaction module, the semantic attribute feature by using the visual feature to obtain a first interaction result, querying, by using the semantic attribute feature, the visual feature to obtain a second interaction result, and aligning, according to the first interaction result and the second interaction result, the visual feature and the semantic attribute feature includes:
establishing a first interaction channel with the visual encoder through the multi-modal interaction module, acquiring the visual features of the visual encoder through the first interaction channel, and querying the semantic attribute features by using the visual features to obtain a first interaction result;
inputting the first interaction result to the visual encoder through the multi-modal interaction module, so that the visual encoder obtains first semantic information corresponding to the semantic attribute features based on the first interaction result;
establishing a second interaction channel with the text encoder through the multi-modal interaction module, acquiring the semantic attribute features of the text encoder through the second interaction channel, and querying the visual features by using the semantic attribute features to obtain a second interaction result;
and inputting the second interaction result to the text encoder through the multi-modal interaction module, so that the text encoder obtains second semantic information corresponding to the visual features based on the second interaction result.
Illustratively, the visual encoder comprises one of a ViT model, a DeiT model, a Swin Transformer model, or a combination thereof, and the text encoder comprises one of a GloVe model, a Word2Vec model, a BERT model, or a combination thereof.
In a possible implementation manner of the first aspect, there is one multi-modal interaction module between the Nth network layer of the visual encoder and the next network layer after the Nth network layer;
The acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task, comprises the following steps:
acquiring the optimization task, and inputting a data set of the optimization task into an Nth network layer of the visual encoder;
performing matrix processing on the Nth data feature output by the Nth network layer by using the multi-modal interaction module corresponding to the Nth network layer, outputting the matrix-processed Nth data feature to the next network layer after the Nth network layer, and judging whether that next network layer is the last network layer;
and if the next network layer is the last network layer, adopting a preset comprehensive loss function generation model to generate the comprehensive loss function of the optimization task.
In a possible implementation manner of the first aspect, before the acquiring a sample image, inputting the sample image into the visual encoder, and acquiring a visual feature learned by the visual encoder from the sample image, the zero sample learning model optimization method includes:
Reading in a visual encoder and acquiring a network architecture of the visual encoder;
and removing the designated network layer in the network architecture, and acquiring the visual encoder from which the designated network layer has been removed.
In a possible implementation manner of the first aspect, before the acquiring a sample image, inputting the sample image into the visual encoder, and acquiring a visual feature learned by the visual encoder from the sample image, the zero sample learning model optimization method includes:
and acquiring a preset training mode, training the visual encoder by adopting the training mode, and acquiring the trained visual encoder.
In a possible implementation manner of the first aspect, before the acquiring a sample image, inputting the sample image into the visual encoder, and acquiring a visual feature learned by the visual encoder from the sample image, the zero sample learning model optimization method includes:
and updating the weight parameters of the visual encoder by adopting a preset optimizer.
In a possible implementation manner of the first aspect, after the optimizing the zero sample learning model based on the comprehensive loss function of the optimizing task, the method further includes:
providing a calling interface of the zero sample learning model.
In a possible implementation manner of the first aspect, the multi-modal interaction module is composed of two Transformer encoders.
In a second aspect, an embodiment of the present application provides a zero sample learning model optimization apparatus, which is applied to a server device, where the server device stores a zero sample learning model, and the zero sample learning model includes a visual encoder, a text encoder, and a multi-modal interaction module, and includes:
the first acquisition module is used for acquiring a sample image, inputting the sample image into the visual encoder and acquiring visual features learned from the sample image by the visual encoder;
the second acquisition module is used for acquiring text data, inputting the text data into the text encoder and acquiring semantic attribute features learned from the text data by the text encoder;
the interaction module is used for querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
The generating module is used for acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generating model when the zero sample learning model executes the optimization task;
and the optimization module is used for optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model.
In a third aspect, an embodiment of the present application provides a server device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the zero sample learning model optimization method of any one of the first aspects when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the zero sample learning model optimization method of any one of the first aspects above.
In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a server device, causes the server device to perform the zero-sample learning model optimization method of any one of the first aspects above.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Compared with the prior art, the embodiment of the application has the beneficial effects that:
on one hand, the zero sample learning model is optimized based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model, so that the server device can optimize the zero sample learning model according to a comprehensive loss function, which helps improve the optimization efficiency of the zero sample learning model; on the other hand, because the server device optimizes the zero sample learning model based on the comprehensive loss function of the optimization task, the optimization effect is improved, and the adaptability of the optimized zero sample learning model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application scenario diagram of a zero sample learning model optimization method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a zero-sample learning model optimization method provided in an embodiment of the present application;
FIG. 3 is a flowchart for generating a comprehensive loss function according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a zero sample learning model optimization device provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a server device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In addition, the technical solutions of the embodiments may be combined with each other, provided that the combined solution can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, the combination should be regarded as nonexistent and outside the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
The zero sample learning model optimization method provided by the embodiments of the present application can be applied to server-side devices such as a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a personal digital assistant (PDA); the embodiments of the present application do not limit the specific type of the server-side device.
For example, the server device may be a station (ST) in a WLAN, a cellular telephone, a cordless telephone, a Session Initiation Protocol (SIP) telephone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite radio, a wireless modem card, a television set-top box (STB), customer premises equipment (CPE), and/or another device for communicating over a wireless system, as well as a device in a next-generation communication system, such as a mobile terminal in a 5G network or a mobile terminal in a future evolved Public Land Mobile Network (PLMN).
By way of example and not limitation, when the server device is a wearable device, the wearable device may be any item of daily wear designed and developed with wearable technology, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. Wearable devices are not merely hardware: they achieve powerful functions through software support, data interaction, and cloud interaction. Broadly, wearable intelligent devices include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on a specific application function and must be used together with other devices such as a smartphone, for example various smart bracelets and smart jewelry for physical sign monitoring.
Referring to fig. 1, fig. 1 is an application scenario diagram of a zero sample learning model optimization method provided in an embodiment of the present application, and is described in detail as follows:
the server-side equipment stores a zero sample learning model, wherein the zero sample learning model comprises a visual encoder, a text encoder and a multi-mode interaction module.
A sample image is acquired and input into the visual encoder, and the visual encoder outputs the visual features learned from the sample image to the multi-modal interaction module.
Text data is acquired and input into the text encoder, and the text encoder outputs the semantic attribute features learned from the text data to the multi-modal interaction module.
The sample image and text data may be obtained locally or provided by client devices. For convenience of explanation, an example follows:
the server device is connected with the plurality of client devices through a network, wherein the network comprises one or a combination of a 2G network, a 3G network, a 4G network, a 5G network and a WIFI network.
The plurality of client devices comprise a client device A, a client device B, a client device C and a client device D.
The client device A, the client device B, the client device C and the client device D respectively submit a sample image A, a sample image B, a sample image C and a sample image D to the server device. The server device may obtain a sample image a, a sample image B, a sample image C, and a sample image D.
The client device A, the client device B, the client device C and the client device D respectively submit text data A, text data B, text data C and text data D to the server device. The server device may obtain text data a, text data B, text data C, and text data D.
In the embodiment of the application, the server device can be connected with different client devices at the same time to acquire sample images and text data submitted by the different client devices.
Referring to fig. 2, fig. 2 is a flow chart of a zero sample learning model optimization method provided in an embodiment of the present application, where the method may be applied to a server device, where the server device stores a zero sample learning model, and the zero sample learning model includes a visual encoder, a text encoder, and a multi-modal interaction module.
The server device may be any one of a server, a mobile phone, a camera, a tablet computer, a wearable device, a vehicle-mounted server device, an augmented reality (augmented reality, AR)/Virtual Reality (VR) server device, a notebook computer, a personal computer (personal computer, PC), a netbook, and a personal digital assistant (personal digital assistant, PDA), which is not limited in the embodiments of the present application.
As shown in fig. 2, the zero sample learning model optimization method provided in the embodiment of the present application includes the following steps, which are described in detail as follows:
S201, acquiring a sample image, inputting the sample image into the visual encoder, and acquiring visual features learned from the sample image by the visual encoder;
illustratively, the visual encoder comprises one of a ViT model, a DeiT model, a Swin Transformer model, or a combination thereof, and the text encoder comprises one of a GloVe model, a Word2Vec model, a BERT model, or a combination thereof.
The ViT model is an image classification model based on self-attention; ViT stands for Vision Transformer.
The DeiT model is a high-performance image classification model that requires less data and fewer computing resources; DeiT stands for Data-efficient image Transformers.
The Swin Transformer model is a Transformer-based deep learning model that achieves advanced performance on visual tasks.
The GloVe model (Global Vectors) is a word representation tool based on global word-frequency statistics; it expresses words as vectors that capture semantic characteristics.
The Word2Vec model is a language model that learns semantic knowledge from large amounts of text in an unsupervised manner and is widely used in natural language processing.
The BERT model (Bidirectional Encoder Representations from Transformers) is a pre-trained language model.
The multi-modal interaction module consists of two Transformer encoders.
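For ease of illustration, a minimal Python sketch of these components follows; the backbone names, the feature dimension, and the wiring are assumptions for one possible embodiment rather than the exact configuration disclosed herein:

```python
# Illustrative assembly of the model's three components.
# Backbone choices and d_model=768 are assumptions, not the disclosed setup.
import torch
import torch.nn as nn
import timm
from transformers import BertModel

class MultiModalInteractionModule(nn.Module):
    """Two stacked Transformer encoder layers, per the description above."""
    def __init__(self, d_model: int = 768, nhead: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

# Visual encoder: a ViT backbone (a DeiT or Swin Transformer would also fit).
visual_encoder = timm.create_model("vit_base_patch16_224", pretrained=True)
# Text encoder: BERT (GloVe or Word2Vec embeddings would also fit).
text_encoder = BertModel.from_pretrained("bert-base-uncased")
interaction = MultiModalInteractionModule(d_model=768)
```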
Wherein, before acquiring the sample image, inputting the sample image into the visual encoder, and acquiring the visual features learned by the visual encoder from the sample image, the zero sample learning model optimization method comprises the following steps:
Step A, reading in a visual encoder and acquiring a network architecture of the visual encoder;
and removing the designated network layer in the network architecture, and acquiring the visual encoder from which the designated network layer has been removed.
Wherein, before acquiring the sample image, inputting the sample image into the visual encoder, and acquiring the visual features learned by the visual encoder from the sample image, the zero sample learning model optimization method comprises the following steps:
Step B, acquiring a preset training mode, training the visual encoder by adopting the training mode, and acquiring the trained visual encoder.
Wherein, before acquiring the sample image, inputting the sample image into the visual encoder, and acquiring the visual features learned by the visual encoder from the sample image, the zero sample learning model optimization method comprises the following steps:
Step C, updating the weight parameters of the visual encoder by adopting a preset optimizer.
Step A may be performed before or after step B; step C may be performed before or after step A or step B. Steps A, B, and C may be performed simultaneously or at different times; the specific order of execution is not limited herein.
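For ease of illustration, a brief Python sketch of steps A to C follows; which layer counts as the designated network layer, the training mode, and the optimizer settings are assumptions rather than the disclosed configuration:

```python
# Steps A-C in sketch form; the dropped block and AdamW settings are assumed.
import torch
import timm

# Step A: read in the visual encoder and inspect its network architecture.
vit = timm.create_model("vit_base_patch16_224", pretrained=True)
print(len(vit.blocks))  # 12 Transformer blocks in this backbone

# Remove the designated network layer (here, the last block, for illustration).
vit.blocks = vit.blocks[:-1]

# Step B: train the encoder in a preset training mode (loop omitted here).
vit.train()

# Step C: update the weight parameters with a preset optimizer (AdamW assumed).
optimizer = torch.optim.AdamW(vit.parameters(), lr=1e-4, weight_decay=0.05)
```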
S202, acquiring text data, inputting the text data into the text encoder, and acquiring semantic attribute features learned from the text data by the text encoder;
S203, querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
Illustratively, the querying, by the multimodal interaction module, the semantic attribute feature by using the visual feature to obtain a first interaction result, querying, by using the semantic attribute feature, the visual feature to obtain a second interaction result, and aligning, according to the first interaction result and the second interaction result, the visual feature and the semantic attribute feature includes:
establishing a first interaction channel with the visual encoder through the multi-modal interaction module, acquiring the visual features of the visual encoder through the first interaction channel, and querying the semantic attribute features by using the visual features to obtain a first interaction result;
inputting the first interaction result to the visual encoder through the multi-modal interaction module, so that the visual encoder obtains first semantic information corresponding to the semantic attribute features based on the first interaction result;
establishing a second interaction channel with the text encoder through the multi-modal interaction module, acquiring the semantic attribute features of the text encoder through the second interaction channel, and querying the visual features by using the semantic attribute features to obtain a second interaction result;
and inputting the second interaction result to the text encoder through the multi-modal interaction module, so that the text encoder obtains second semantic information corresponding to the visual features based on the second interaction result.
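For ease of illustration, a minimal sketch of the bidirectional querying above follows, realized here with standard cross-attention; the dimensions and the choice of nn.MultiheadAttention are assumptions about one possible implementation:

```python
# Bidirectional interaction: each modality queries the other, as in S203.
import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    def __init__(self, d_model: int = 768, nhead: int = 8):
        super().__init__()
        self.visual_to_semantic = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.semantic_to_visual = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor):
        # First interaction result: visual features query the semantic attributes.
        first, _ = self.visual_to_semantic(query=visual, key=semantic, value=semantic)
        # Second interaction result: semantic attributes query the visual features.
        second, _ = self.semantic_to_visual(query=semantic, key=visual, value=visual)
        return first, second

# Usage: visual is (batch, num_patches, 768); semantic is (batch, num_attributes, 768).
```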
S204, acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task;
illustratively, the comprehensive loss function generation model includes:
L = L_cls + λ1·L_deb + λ2·L_regT
wherein L is the comprehensive loss function, λ1 and λ2 are preset constants, L_cls is the cross-entropy loss, L_deb is the debiasing loss, and L_regT is the sum of the attribute regression losses computed by each multi-modal interaction module.
The debiasing loss is obtained through a preset debiasing loss function.
Wherein λ1 and λ2 default to values preset by the user or the system. For ease of illustration, examples are as follows:
When λ1 is 0.3 and λ2 is 0.7, L = L_cls + 0.3·L_deb + 0.7·L_regT.
When λ1 is 0.3 and λ2 is 0.5, L = L_cls + 0.3·L_deb + 0.5·L_regT.
When λ1 is 0.6 and λ2 is 0.3, L = L_cls + 0.6·L_deb + 0.3·L_regT.
Wherein the user can modify λ1 and λ2 according to different classification requirements; with the modified λ1 and λ2, the comprehensive loss function generated by the comprehensive loss function generation model can better meet the user's classification requirements, which helps improve the generalization capability of the zero sample learning model.
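For ease of illustration, a brief sketch of the comprehensive loss computation follows; only the weighted combination mirrors the formula above, while the debiasing and attribute regression terms are placeholders supplied by the caller:

```python
# L = L_cls + λ1·L_deb + λ2·L_regT, with user-configurable λ1 and λ2.
import torch
import torch.nn.functional as F

def comprehensive_loss(logits, labels, debias_loss, module_reg_losses,
                       lam1: float = 0.3, lam2: float = 0.7):
    l_cls = F.cross_entropy(logits, labels)         # cross-entropy loss
    l_reg_t = torch.stack(module_reg_losses).sum()  # summed over interaction modules
    return l_cls + lam1 * debias_loss + lam2 * l_reg_t
```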
S205, optimizing the zero sample learning model based on the comprehensive loss function of the optimization task, and obtaining the optimized zero sample learning model.
Wherein after the zero sample learning model is optimized based on the comprehensive loss function of the optimization task and the optimized zero sample learning model is obtained, the method further comprises:
and providing a calling interface of the zero sample learning model.
A calling interface of the zero sample learning model is provided, and the zero sample learning model can be called through the calling interface to execute the tasks of an application scenario.
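For ease of illustration, a hypothetical calling interface is sketched below; the function name, signature, and output shape are illustrative assumptions rather than part of the disclosure:

```python
# A sketch of a calling interface for the optimized zero sample learning model.
import torch

@torch.no_grad()
def classify_zero_shot(model, image: torch.Tensor, attribute_texts: list):
    """Scores an image against text-described (possibly unseen) categories."""
    scores = model(image, attribute_texts)  # assumed to return (1, num_classes)
    return scores.argmax(dim=-1).item()
```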
The present application has the following application scenarios, described in detail below:
1. Video surveillance: because the annotated categories of a dataset are limited, many objects are difficult to detect properly. With the method and device, descriptive information can be provided for categories that lack a category label.
2. Target detection and semantic segmentation: by changing the loss function and some of the modules, the method and device can perform target detection or semantic segmentation tasks, detecting or segmenting categories that are not annotated in the dataset.
3. Safety monitoring: annotated datasets can generally label only some of the common dangerous objects, while a large number of unknown objects may also pose safety hazards. Using the present application, a textual description can be provided for such objects, and post-processing can determine whether they are dangerous objects and raise an alarm.
4. Automatic driving: identifying new traffic signs, vehicle types, and pedestrian behaviour to enhance the safety of an autonomous car.
The present application may be combined with other methods to adapt to more tasks. For convenience of illustration, examples are as follows:
1. Combine the zero sample learning model with DETR and use the method for target detection.
2. Use image and video compression technology to transfer resource-intensive processing of the zero sample learning model, such as image classification and image detection, to the cloud, realizing cloud-edge coordination.
3. Use model compression technology to reduce the parameter count and computation of the zero sample learning model, achieving a trade-off between performance and device requirements.
The method has the following advantages: on one hand, the zero sample learning model is optimized based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model, so that the server device can optimize the zero sample learning model according to a comprehensive loss function, which helps improve the optimization efficiency of the zero sample learning model; on the other hand, because the server device optimizes the zero sample learning model based on the comprehensive loss function of the optimization task, the optimization effect is improved, and the adaptability of the optimized zero sample learning model is improved.
Referring to fig. 3, fig. 3 is a flowchart of generating a comprehensive loss function according to an embodiment of the present application, which is described in detail below:
S301, acquiring the optimization task, and inputting a data set of the optimization task into the Nth network layer of the visual encoder;
S302, performing matrix processing on the Nth data feature output by the Nth network layer by using the multi-modal interaction module corresponding to the Nth network layer, outputting the matrix-processed Nth data feature to the next network layer after the Nth network layer, and judging whether that next network layer is the last network layer;
S303, if the next network layer is the last network layer, adopting a preset comprehensive loss function generation model to generate the comprehensive loss function of the optimization task.
Wherein, there is one multi-modal interaction module between the Nth network layer of the visual encoder and the next network layer after the Nth network layer. Different insertion positions of the multi-modal interaction modules yield different performance improvements, with considerable fluctuation between positions.
The user can select the insertion positions of the multi-modal interaction modules according to the service requirements or the number of layers of the visual encoder.
Wherein, every two multi-modal interaction modules are separated by one or two network layers.
For ease of illustration, a visual encoder with 12 network layers is illustrated as follows:
a multi-modal interaction module is inserted between the 1st network layer and the 2nd network layer of the visual encoder, between the 3rd and 4th network layers, between the 5th and 6th network layers, between the 7th and 8th network layers, between the 9th and 10th network layers, and between the 11th and 12th network layers.
For ease of illustration, a visual encoder with 6 network layers is described as follows:
a multi-modal interaction module is inserted between the 1st network layer and the 2nd network layer of the visual encoder, between the 2nd and 3rd network layers, between the 3rd and 4th network layers, between the 4th and 5th network layers, and between the 5th and 6th network layers.
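For ease of illustration, a sketch of interleaving interaction modules with encoder layers follows; the insertion positions are passed in to match the examples above, and the two-layer Transformer encoder used as the module is an assumption:

```python
# Encoder layers with interaction modules inserted at chosen positions.
import torch
import torch.nn as nn

class InterleavedEncoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, insert_after: set, d_model: int = 768):
        super().__init__()
        self.layers = layers
        # One interaction module after each listed layer index (1-based).
        self.interactions = nn.ModuleDict({
            str(i): nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, 8, batch_first=True),
                num_layers=2)
            for i in insert_after
        })

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if str(i) in self.interactions:
                x = self.interactions[str(i)](x)  # matrix processing of layer output
        return x

# For the 12-layer example: modules after layers 1, 3, 5, 7, 9, and 11.
# encoder = InterleavedEncoder(backbone_layers, insert_after={1, 3, 5, 7, 9, 11})
```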
In the embodiment of the application, the zero sample learning model is optimized based on the comprehensive loss function by generating the comprehensive loss function, so that the adaptability of the optimized zero sample learning model is improved.
Referring to fig. 4, fig. 4 is a schematic block diagram of the zero sample learning model optimization apparatus provided in the embodiment of the present application. The zero sample learning model optimization apparatus 400 shown in fig. 4 may be applied to the server device in the application scenario shown in fig. 1. Taking the server device as an example, the apparatus 400 is described in detail below; it may include a first acquisition module 401, a second acquisition module 402, an interaction module 403, a generation module 404, and an optimization module 405.
A first obtaining module 401, configured to obtain a sample image, input the sample image into the visual encoder, and obtain a visual feature learned by the visual encoder from the sample image;
a second obtaining module 402, configured to obtain text data, input the text data into the text encoder, and obtain semantic attribute features learned from the text data by the text encoder;
The interaction module 403 is configured to query the semantic attribute feature by using the visual feature through the multi-modal interaction module to obtain a first interaction result, query the visual feature by using the semantic attribute feature to obtain a second interaction result, and align the visual feature with the semantic attribute feature according to the first interaction result and the second interaction result;
the generating module 404 is configured to obtain an optimization task, and when the zero sample learning model executes the optimization task, generate a model by adopting a preset comprehensive loss function, and generate a comprehensive loss function of the optimization task;
and the optimizing module 405 is configured to optimize the zero sample learning model based on the comprehensive loss function of the optimizing task, and obtain the optimized zero sample learning model.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another.
The method has the following advantages: on one hand, the zero sample learning model is optimized based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model, so that the server device can optimize the zero sample learning model according to a comprehensive loss function, which helps improve the optimization efficiency of the zero sample learning model; on the other hand, because the server device optimizes the zero sample learning model based on the comprehensive loss function of the optimization task, the optimization effect is improved, and the adaptability of the optimized zero sample learning model is improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a server device according to an embodiment of the present application.
As shown in fig. 5, the server device 2 of fig. 5 includes: at least one processor 20, a memory 21 and a computer program 22 stored in the memory 21 and executable on the at least one processor 20, the processor 20 implementing the steps of any of the various method embodiments described above when executing the computer program 22.
The server device 2 may include, but is not limited to, a processor 20, a memory 21. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the server device 2 and is not meant to be limiting of the server device 2, and may include more or fewer components than shown, or may combine certain components, or may include different components, such as input-output devices, network access devices, etc.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), and the processor 20 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The storage 21 may in some embodiments be an internal storage unit of the server device 2, such as a hard disk or a memory of the server device 2. The memory 21 may also be an external storage device of the server device 2 in other embodiments, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the server device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the server device 2. The memory 21 is used for storing an operating system, application programs, boot loader (BootLoader), data, other programs, etc., such as program codes of the computer program. The memory 21 may also be used for temporarily storing data that has been output or is to be output.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein again.
Embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps that may implement the various method embodiments described above.
The computer readable storage medium has stored therein program code that is executable by the processor to perform the zero sample learning model optimization method described in the method embodiments above.
The computer readable storage medium has a storage space for program code.
The program code includes code for any of the steps in the zero sample learning model optimization method described in the method embodiments above.
For example, the program code is invoked by the processor, and may perform the steps of:
acquiring a sample image, inputting the sample image into the visual encoder, and acquiring visual features learned from the sample image by the visual encoder;
acquiring text data, inputting the text data into the text encoder, and acquiring semantic attribute features learned from the text data by the text encoder;
querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
Acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task;
and optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Embodiments of the present application provide a computer program product that, when run on a server device, causes the server device to perform steps that may be performed in the method embodiments described above.
The computer program product is loaded by the server device and may perform the steps of:
acquiring a sample image, inputting the sample image into the visual encoder, and acquiring visual features learned from the sample image by the visual encoder;
acquiring text data, inputting the text data into the text encoder, and acquiring semantic attribute features learned from the text data by the text encoder;
querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
Acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task;
and optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Based on such understanding, the present application implements all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, and the computer program may be stored in a computer readable storage medium; when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a camera device/server apparatus, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A zero sample learning model optimization method, characterized by being applied to a server device, wherein the server device stores a zero sample learning model, the zero sample learning model comprises a visual encoder, a text encoder, and a multi-modal interaction module, and the zero sample learning model optimization method comprises:
acquiring a sample image, inputting the sample image into the visual encoder, and acquiring visual features learned from the sample image by the visual encoder;
acquiring text data, inputting the text data into the text encoder, and acquiring semantic attribute features learned from the text data by the text encoder;
querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task;
and optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model.
2. The zero-sample learning model optimization method of claim 1, wherein there is one multi-modal interaction module between the Nth network layer of the visual encoder and the next network layer after the Nth network layer;
The acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generation model when the zero sample learning model executes the optimization task, comprises the following steps:
acquiring the optimization task, and inputting a data set of the optimization task into an Nth network layer of the visual encoder;
performing matrix processing on the Nth data feature output by the Nth network layer by using the multi-modal interaction module corresponding to the Nth network layer, outputting the matrix-processed Nth data feature to the next network layer after the Nth network layer, and judging whether that next network layer is the last network layer;
and if the next network layer is the last network layer, adopting a preset comprehensive loss function generation model to generate the comprehensive loss function of the optimization task.
3. The zero-sample learning model optimization method according to claim 1, characterized in that before the acquiring a sample image, inputting the sample image into the visual encoder, acquiring a visual feature learned by the visual encoder from the sample image, the zero-sample learning model optimization method comprises:
reading in a visual encoder and acquiring a network architecture of the visual encoder;
and removing the designated network layer in the network architecture, and acquiring the visual encoder from which the designated network layer has been removed.
4. The zero-sample learning model optimization method according to claim 1, characterized in that before the acquiring a sample image, inputting the sample image into the visual encoder, acquiring a visual feature learned by the visual encoder from the sample image, the zero-sample learning model optimization method comprises:
and acquiring a preset training mode, training the visual encoder by adopting the training mode, and acquiring the trained visual encoder.
5. The zero-sample learning model optimization method according to claim 1, characterized in that before the acquiring a sample image, inputting the sample image into the visual encoder, acquiring a visual feature learned by the visual encoder from the sample image, the zero-sample learning model optimization method comprises:
and updating the weight parameters of the visual encoder by adopting a preset optimizer.
6. The zero-sample learning model optimization method according to claim 1, wherein after the optimizing the zero-sample learning model based on the comprehensive loss function of the optimizing task, the method further comprises:
providing a calling interface of the zero sample learning model.
7. The zero-sample learning model optimization method of claim 1, wherein the multi-modal interaction module consists of two Transformer encoders.
8. A zero sample learning model optimization apparatus, characterized by being applied to a server device, wherein the server device stores a zero sample learning model, and the zero sample learning model comprises a visual encoder, a text encoder, and a multi-modal interaction module, the apparatus comprising:
the first acquisition module is used for acquiring a sample image, inputting the sample image into the visual encoder and acquiring visual features learned from the sample image by the visual encoder;
the second acquisition module is used for acquiring text data, inputting the text data into the text encoder and acquiring semantic attribute features learned from the text data by the text encoder;
the interaction module is used for querying the semantic attribute features by using the visual features through the multi-modal interaction module to obtain a first interaction result, querying the visual features by using the semantic attribute features to obtain a second interaction result, and aligning the visual features and the semantic attribute features according to the first interaction result and the second interaction result;
The generating module is used for acquiring an optimization task, and generating a comprehensive loss function of the optimization task by adopting a preset comprehensive loss function generating model when the zero sample learning model executes the optimization task;
and the optimization module is used for optimizing the zero sample learning model based on the comprehensive loss function of the optimization task to obtain the optimized zero sample learning model.
9. A server device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the zero sample learning model optimization method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the zero sample learning model optimization method according to any one of claims 1 to 7.
CN202311750518.2A 2023-12-18 2023-12-18 Zero sample learning model optimization method, device, equipment and storage medium Pending CN117874488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311750518.2A CN117874488A (en) 2023-12-18 2023-12-18 Zero sample learning model optimization method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311750518.2A CN117874488A (en) 2023-12-18 2023-12-18 Zero sample learning model optimization method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117874488A 2024-04-12

Family

ID=90593802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311750518.2A Pending CN117874488A (en) 2023-12-18 2023-12-18 Zero sample learning model optimization method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117874488A (en)

Similar Documents

Publication Publication Date Title
CN109614517B (en) Video classification method, device, equipment and storage medium
CN111476309A (en) Image processing method, model training method, device, equipment and readable medium
EP3893125A1 (en) Method and apparatus for searching video segment, device, medium and computer program product
CN116415594A (en) Question-answer pair generation method and electronic equipment
CN110826567B (en) Optical character recognition method, device, equipment and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113254684B (en) Content aging determination method, related device, equipment and storage medium
CN112801719A (en) User behavior prediction method, user behavior prediction device, storage medium, and apparatus
CN112364829A (en) Face recognition method, device, equipment and storage medium
CN110232131A (en) Intention material searching method and device based on intention label
CN113140012B (en) Image processing method, device, medium and electronic equipment
CN115344757A (en) Label prediction method, electronic equipment and storage medium
CN112200623A (en) Product recommendation method, device, equipment and storage medium
CN116738057A (en) Information recommendation method, device, computer equipment and storage medium
CN111797266A (en) Image processing method and apparatus, storage medium, and electronic device
CN114627353B (en) Image description generation method, device, equipment, medium and product
CN112148962A (en) Method and device for pushing information
CN117874488A (en) Zero sample learning model optimization method, device, equipment and storage medium
CN113222050B (en) Image classification method and device, readable medium and electronic equipment
CN113033682B (en) Video classification method, device, readable medium and electronic equipment
CN111353536B (en) Image labeling method and device, readable medium and electronic equipment
CN115269978A (en) Video tag generation method, device, equipment and medium
CN111222011B (en) Video vector determining method and device
CN114495081A (en) Text recognition method and device, readable medium and electronic equipment
CN114511744A (en) Image classification method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination